NERD
Text Annotation

You've obtained a gleaming new token through Okta or RSA, you've grabbed some of your favorite documents, and you're ready to start extracting entities with Kensho NERD! Follow along to make your first requests through the NERD API. Or, jump straight to the full API Reference.

You can also check out a short video guide to the NERD API.

Annotation Quickstart

The NERD API supports two workflows for obtaining annotations for a text document:

  • A synchronous workflow through the /annotations-sync endpoint that is useful for real-time annotations
  • An asynchronous workflow through the /annotations-async endpoint that is useful for bulk processing of documents

The two endpoints expect input of the same shape.

To keep things simple, let's make our first request to the synchronous (real-time) endpoint. We're going to extract Wikimedia entities from the following text:

"The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."

This example is trickier than it seems! Both "Supreme Court" and "AU" can only be linked properly if the model takes into account the surrounding entity "Nairobi" and the joint context of the three entities together. Standard entity-extraction solutions might simply return the most "salient" Supreme Court (i.e., the Supreme Court of the United States) and any of the many meanings of the acronym "AU", for example, Australia or American University. NERD's context awareness enables it to link even the toughest of entities consistently and accurately.

Let's hit the /annotations-sync endpoint and get our results. In Python, this could look like:

import json
import requests
 
NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-sync"
my_access_token = ""  # paste your Access Token string (obtained at login) between the quotation marks
 
data = {
    "knowledge_bases": [
        "wikimedia"
    ],
    "text": "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."
}
# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + my_access_token}  # "Bearer " must be included
)
annotations_results = response.json()

The results represent a list of entity annotations. Each annotation includes a start and end location in the text and the ID, name (label), and type of the entity in the selected knowledge base. For this example, we'd see results like:

{
    "results": [
        {
            "annotations": [
                {
                    "start_index": 4,
                    "end_index": 17,
                    "text": "Supreme Court",
                    "entity_kb_id": "2368297",
                    "entity_label": "Supreme Court of Kenya",
                    "entity_type": "GOVERNMENT",
                    "ned_score": 0.1703,
                    "ner_score": 1.0
                },
                {
                    "start_index": 21,
                    "end_index": 28,
                    "text": "Nairobi",
                    "entity_kb_id": "3870",
                    "entity_label": "Nairobi",
                    "entity_type": "CITY",
                    "ned_score": 0.2128,
                    "ner_score": 1.0
                },
                {
                    "start_index": 58,
                    "end_index": 60,
                    "text": "AU",
                    "entity_kb_id": "7159",
                    "entity_label": "African Union",
                    "entity_type": "NGO",
                    "ned_score": 0.0919,
                    "ner_score": 1.0
                }
            ],
            "knowledge_base": "wikimedia"
        }
    ]
}

NERD has successfully disambiguated all three of our entities: the Supreme Court of Kenya, the city of Nairobi, and the African Union!
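To sanity-check the annotations, you can slice the original text with each annotation's start_index and end_index. Here is a quick sketch using a trimmed copy of the response shown above (scores omitted for brevity):

```python
text = "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."

# Trimmed copy of the annotations from the response above
annotations = [
    {"start_index": 4, "end_index": 17, "entity_label": "Supreme Court of Kenya"},
    {"start_index": 21, "end_index": 28, "entity_label": "Nairobi"},
    {"start_index": 58, "end_index": 60, "entity_label": "African Union"},
]

for ann in annotations:
    # end_index is exclusive, so a plain slice recovers the annotated span
    mention = text[ann["start_index"]:ann["end_index"]]
    print(f"{mention!r} -> {ann['entity_label']}")
```

Each slice recovers exactly the mention the model annotated, which is a handy check when wiring NERD output into downstream highlighting or linking code.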

Capital IQ Organization Entities

To extract Capital IQ organization entities, replace "wikimedia" with "capiq" and add an extra parameter, "originating_entity_id". Optimized for the financial domain, the Capital IQ variant of NERD lets you specify an "originating entity": the entity that issued the document in question. For example, the company whose earnings call transcript or 10-K filing is passed through NERD would be the originating entity for that document. The value of this parameter should be the originating entity's Capital IQ ID. For example:

import json
import requests
 
NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-sync"
my_access_token = ""  # paste your Access Token string (obtained at login) between the quotation marks
 
data = {
    "knowledge_bases": [
        "capiq"
    ],
    "text": "The LEGO Group today reported first half earnings for the six months ending June 30, 2020.",
    # Capital IQ ID of LEGO A/S. Profile page: https://www.capitaliq.com/CIQDotNet/company.aspx?companyid=701221
    "originating_entity_id": "701221"
}
# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + my_access_token}  # "Bearer " must be included
)
annotations_results = response.json()

with the results:

{
    "results": [
        {
            "annotations": [
                {
                    "end_index": 14,
                    "entity_kb_id": "701221",
                    "entity_label": "LEGO A/S",
                    "entity_type": "ORG",
                    "ned_score": 0.9993,
                    "ner_score": 0.9973,
                    "start_index": 0,
                    "text": "The LEGO Group"
                }
            ],
            "knowledge_base": "capiq"
        }
    ]
}

Providing an originating entity ID allows NERD to take even more context into account and therefore produce more precise annotations. If there isn't an appropriate originating entity for a document, such as in the case of a news article, simply enter "0" or omit the field.
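For instance, a news-article payload without a natural originating entity could take either of these forms (a sketch; the field names match the examples above, and the text is illustrative):

```python
# Option 1: explicitly pass "0" as the originating entity ID
payload_with_zero = {
    "knowledge_bases": ["capiq"],
    "text": "Markets rallied on Tuesday after the announcement.",
    "originating_entity_id": "0",
}

# Option 2: omit the field entirely
payload_without_field = {
    "knowledge_bases": ["capiq"],
    "text": "Markets rallied on Tuesday after the announcement.",
}
```

Either payload can then be POSTed exactly as in the earlier examples.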

For more information on the NERD API's output, check out our video guide.

Capital IQ Person Entities

In addition to organizations, authorized users can also request person entities by adding another parameter to the request: "tag_people": true. Possible use cases include:

  • Analyzing speaking time in an earnings call transcript
  • Finding relevant contacts for business development from a corpus of documents
  • Redacting personally identifiable information before publishing

A call to the People API looks very similar to a Capital IQ call, but person-specific results will include "ner_type": "PERSON":

import json
import requests
 
NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-sync"
my_access_token = ""  # paste your Access Token string (obtained at login) between the quotation marks
 
data = {
    "knowledge_bases": [
        "capiq"
    ],
    "text": "Tim Cook has stated that a portion of Apple's R&D this year has been allocated towards generative AI.",
    "originating_entity_id": "0",
    "tag_people": True
}
# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + my_access_token}  # "Bearer " must be included
)
annotations_results = response.json()

with the results:

{
    "results": [
        {
            "annotations": [
                {
                    "end_index": 8,
                    "entity_kb_id": "169600",
                    "entity_label": "Timothy Cook",
                    "entity_type": "ORG",
                    "ned_score": 1.0,
                    "ner_score": 1.0,
                    "ner_type": "PERSON",
                    "start_index": 0,
                    "text": "Tim Cook"
                },
                {
                    "end_index": 43,
                    "entity_kb_id": "24937",
                    "entity_label": "Apple Inc.",
                    "entity_type": "ORG",
                    "ned_score": 0.9978336170154303,
                    "ner_score": 0.9997112154960632,
                    "ner_type": "ORG",
                    "start_index": 38,
                    "text": "Apple"
                }
            ],
            "knowledge_base": "capiq"
        }
    ]
}
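As a sketch of the redaction use case above, the person annotations can be used to mask mentions in the source text. This assumes the response shape shown here; replacing spans from right to left keeps the earlier indices valid:

```python
def redact_people(text, annotations, mask="[REDACTED]"):
    """Replace every PERSON span in text with a mask string."""
    person_spans = [
        (ann["start_index"], ann["end_index"])
        for ann in annotations
        if ann.get("ner_type") == "PERSON"
    ]
    # Work from the end of the text so earlier indices stay valid
    for start, end in sorted(person_spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text

text = "Tim Cook has stated that a portion of Apple's R&D this year has been allocated towards generative AI."
# Trimmed copy of the annotations from the response above
annotations = [
    {"start_index": 0, "end_index": 8, "ner_type": "PERSON"},
    {"start_index": 38, "end_index": 43, "ner_type": "ORG"},
]
print(redact_people(text, annotations))
```

Only the PERSON span is masked; the organization mention "Apple" is left untouched.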

Asynchronous Workflows

The asynchronous endpoint allows for bulk processing. Each document uploaded to this endpoint is immediately assigned a job_id, which can then be used to request annotations at a later time. We recommend using this endpoint when processing a large batch of documents, e.g., in a backfill. Here is an example of using the async endpoint to upload a document, receive a job_id, then poll the server until the results are ready, or until 5 minutes have elapsed:

import json
import requests
import time
 
NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-async"
my_access_token = ""  # paste your Access Token string (obtained at login) between the quotation marks
TIMEOUT = 300  # 5 minute timeout in seconds
 
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + my_access_token,  # "Bearer " must be included
}
 
data = ...  # your document, choice of knowledge base, and possible originating entity, as per above
 
response = requests.post(NERD_API_URL, data=json.dumps(data), headers=headers)
# The POST method, if successful, returns a response including a `job_id` key and value.
if response.status_code != 202:
    raise RuntimeError("Error submitting document to the NERD API")
 
job_id = response.json()["job_id"]
 
start_time = time.time()
result = None
# Poll until the results are ready or the timeout is reached
while time.time() <= start_time + TIMEOUT:
    response = requests.get(NERD_API_URL, params={"job_id": job_id}, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Error retrieving job results for {job_id}")
    result = response.json()
    if result.get("status") == "success":
        break
    time.sleep(1)
else:
    # The loop finished without a successful result
    raise TimeoutError(f"Job {job_id} timed out")
 

Since the Access Token expires one hour after it is provisioned, users who expect their code to run for an hour or longer should use their Refresh Token, which expires after one week, to generate new Access Tokens as needed. (Production services that run continually should use keypair authentication instead.) Here's an example:

import json
import requests
import time
import os
 
 
NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-async"
my_refresh_token = ""  # paste your Refresh Token string (obtained at login) between the quotation marks
 
 
def get_access_token_from_refresh_token(refresh_token):
    """Get Access Token by Refresh Token."""
    response = requests.get(f"https://nerd.kensho.com/oauth2/refresh?refresh_token={refresh_token}")
    new_access_token = response.json()["access_token"]
    return new_access_token
 
 
class NerdClient:
    def __init__(self, refresh_token):
        self.refresh_token = refresh_token
 
    def update_access_token(self):
        self.access_token = get_access_token_from_refresh_token(self.refresh_token)
 
    def call_api(self, verb, *args, headers=None, **kwargs):
        """Call NERD API, refreshing the access token as needed."""
        if not hasattr(self, "access_token"):
            self.update_access_token()
        headers = dict(headers or {})  # copy to avoid mutating the caller's dict
        method = getattr(requests, verb)
 
        def call_with_updated_headers():
            headers["Authorization"] = f"Bearer {self.access_token}"
            return method(*args, headers=headers, **kwargs)
 
        response = call_with_updated_headers()
        if response.status_code == 401:  # Access Token expired; refresh and retry once
            self.update_access_token()
            response = call_with_updated_headers()
        return response
 
    def make_async_annotations_request(self, data):
        """Make a POST call to NERD Async Endpoint."""
        response = self.call_api(
            "post",
            NERD_API_URL,
            data=json.dumps(data),
            headers={"Content-Type": "application/json"}
        )
        return response.json()["job_id"]
 
    def get_async_annotations_results(self, job_id):
        """Get annotations results from NERD Async Endpoint."""
        while True:
            response = self.call_api(
                "get",
                NERD_API_URL + "?job_id=" + job_id
            )
            result = response.json()
            if result["status"] != "pending":
                break
            time.sleep(10)
        return result
 
# data preparation
file_dir = ""  # file path to directory containing documents you want NERD to process
files = os.listdir(file_dir)
job_dict = {}  # dict mapping file name to job_id
data = {"knowledge_bases": ["capiq"]}
 
nerd_client = NerdClient(my_refresh_token)
 
# submit requests to async endpoint
for file_name in files:
    file_name = os.path.join(file_dir, file_name)
    with open(file_name, "r") as f:
        data["text"] = f.read()
    job_id = nerd_client.make_async_annotations_request(data)
    job_dict.update({file_name: job_id})
    print(f'Submitted {file_name} as {job_id}')
    time.sleep(0.1)
 
# retrieve results from async endpoint
for file_name, job_id in job_dict.items():
    result = nerd_client.get_async_annotations_results(job_id)
    result_path = file_name + '.nerd.json'
    with open(result_path, 'w') as result_file:
        json.dump(result, result_file, indent=4)
    print(f'Wrote result for {job_id} to {result_path}')
    time.sleep(0.1)