
Extract API JSON Output Format

Overview

The JSON output of the Extract API consists of two main components: the content and the annotations.

Content is what's in the document: the items you can find there, such as paragraphs, headings, titles, figures, and miscellaneous text. Each content item records its type and its text.

Annotations are table structures that represent the relationships between table cells, including their positions, so that you can "build" the table from its smallest parts.

IDs link each annotation back to its associated content.

Let's dive into the structure of the JSON.

Overall Structure

The top-level dictionary in this JSON contains the two components mentioned above. It has two keys: content_tree and annotations.

Content Tree

The content tree contains the text content of the document, whether it appears as plain text, inside a table, or as a title. It has a tree structure: each item may have "children", which represent the hierarchical structure of the document. The highest level is a DOCUMENT, which contains everything else as its children. Under the DOCUMENT there can be further hierarchy, e.g. a level-2 heading (H2) can be a child of a level-1 heading (H1), and a paragraph and a miscellaneous text item can be children of that H2. Tables are another example: a table has all of its table cells as its children.

Note that text outside tables is nested hierarchically only when document_type is hierarchical. When document_type is general, the only nesting is that tables have their table cells as "children".

[Image: Extract Content Tree]

The types are as follows:

content_tree: Dict[str, Any]

Keys:

children: List[Dict[str, Any]] # recursively contains more dicts with children
content: str # text
type: str
uid: str # unique ID

When document_type is hierarchical, type can be one of:

    'DOCUMENT',
    'PARAGRAPH',
    'H1',
    'H2',
    'TABLE_CELL',
    'TABLE',
    'TABLE_TITLE',
    'FIGURE_TITLE',
    'TEXT',

When document_type is general, type can be one of:

    'DOCUMENT',
    'TABLE_CELL',
    'TABLE',
    'TEXT',
    'TITLE'
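
Because every node has the same four keys, the tree can be walked with a short recursive helper. The following is a minimal sketch, not part of the Extract API: iter_nodes and nodes_of_type are hypothetical helper names, and sample_tree is a hand-written stand-in for output['content_tree'].

```python
from typing import Any, Dict, Iterator, List


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a node and then all of its descendants, depth first."""
    yield node
    # Treat a missing or empty "children" value as "no children".
    for child in node.get("children") or []:
        yield from iter_nodes(child)


def nodes_of_type(tree: Dict[str, Any], node_type: str) -> List[Dict[str, Any]]:
    """Collect every node in the tree whose "type" equals node_type."""
    return [n for n in iter_nodes(tree) if n.get("type") == node_type]


# Hand-written sample that follows the documented shape (not real API output).
sample_tree = {
    "type": "DOCUMENT", "uid": "0", "content": "", "children": [
        {"type": "TEXT", "uid": "1", "content": "Intro", "children": []},
        {"type": "TABLE", "uid": "2", "content": "", "children": [
            {"type": "TABLE_CELL", "uid": "3", "content": "2048", "children": []},
            {"type": "TABLE_CELL", "uid": "4", "content": "4096", "children": []},
        ]},
    ],
}

cells = nodes_of_type(sample_tree, "TABLE_CELL")
cell_texts = [cell["content"] for cell in cells]
```

The same helper works for either document_type, since it only relies on the children key and does not assume any particular nesting depth.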

Annotations

Annotations are a list of dictionaries. Each one maps back to the content it relates to and carries additional information about that content. For example, annotations can describe the relationships between table cells, so that you know how to put them together into a larger table. Each table cell annotation tells you where the cell is (row and column) and how much it spans.

The types are as follows:

annotations: List[Dict[str, Any]]

Keys:

content_uids: List[str] # unique IDs that map to content uids
type: str # 'table_structure'
data: Dict[str, List[int]] # contains indices and span of the cell

Keys for "data":

index: [int, int] # row and column in the table
span: [int, int] # how much the cell spans
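
Since annotations refer to content only through content_uids, it is often convenient to index them by uid once instead of scanning the list each time. This is a minimal sketch under that assumption; index_annotations is a hypothetical helper name, and the second sample annotation is invented for illustration (the first matches the example shown later on this page).

```python
from typing import Any, Dict, List


def index_annotations(annotations: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each content uid to the table_structure annotation that references it."""
    by_uid: Dict[str, Dict[str, Any]] = {}
    for annotation in annotations:
        # Only table_structure annotations are described in this document.
        if annotation.get("type") != "table_structure":
            continue
        for uid in annotation.get("content_uids", []):
            by_uid[uid] = annotation
    return by_uid


sample_annotations = [
    {"content_uids": ["133"], "data": {"index": [0, 1], "span": [1, 1]}, "type": "table_structure"},
    # Hypothetical second entry, added only to make the sample less trivial.
    {"content_uids": ["134"], "data": {"index": [0, 2], "span": [1, 1]}, "type": "table_structure"},
]

lookup = index_annotations(sample_annotations)
row, col = lookup["133"]["data"]["index"]
```

With the real response, you would pass output['annotations'] in place of sample_annotations.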

Connecting The Dots

Let's see an example that will help us make sense of how the UI output, annotations, and content tree all work together.

Here is an example image in the UI:

[Image: Extract UI]

As you can see, there's a table. But what does that look like in the JSON output?

Let's start at the top level of the content tree:

>>> output['content_tree']['type']
'DOCUMENT'

This is the overall document. We must go down to its children to find the text, titles, and tables.

>>> output['content_tree']['children'][5]['type']
'TABLE'

Here we have a table. This table contains children - its cells:

>>> output['content_tree']['children'][5]['children']
[{'children': [], 'content': '2048', 'type': 'TABLE_CELL', 'uid': '133'},
 {'children': [], 'content': '4096', 'type': 'TABLE_CELL', 'uid': '134'},
 {'children': [], 'content': 'Methods', 'type': 'TABLE_CELL', 'uid': '135'},
 {'children': [], 'content': 'Extrapolation', 'type': 'TABLE_CELL', 'uid': '136'},
 {'children': [], 'content': 'ROPE', 'type': 'TABLE_CELL', 'uid': '137'},
 {'children': [], 'content': '73.6', 'type': 'TABLE_CELL', 'uid': '138'},
 {'children': [], 'content': '294.45', 'type': 'TABLE_CELL', 'uid': '139'},
 {'children': [], 'content': 'ROPE + BCA', 'type': 'TABLE_CELL', 'uid': '140'},
 {'children': [], 'content': '25.57', 'type': 'TABLE_CELL', 'uid': '141'},
 {'children': [], 'content': '25.65', 'type': 'TABLE_CELL', 'uid': '142'},
 {'children': [], 'content': 'Alibi', 'type': 'TABLE_CELL', 'uid': '143'},
 {'children': [], 'content': '23.14', 'type': 'TABLE_CELL', 'uid': '144'},
 {'children': [], 'content': '24.26', 'type': 'TABLE_CELL', 'uid': '145'},
 {'children': [], 'content': 'Alibi + BCA', 'type': 'TABLE_CELL', 'uid': '146'},
 {'children': [], 'content': '24.6', 'type': 'TABLE_CELL', 'uid': '147'},
 {'children': [], 'content': '25.37', 'type': 'TABLE_CELL', 'uid': '148'},
 {'children': [], 'content': 'XPOS (Ours)', 'type': 'TABLE_CELL', 'uid': '149'},
 {'children': [], 'content': '22.56', 'type': 'TABLE_CELL', 'uid': '150'},
 {'children': [], 'content': '28.43', 'type': 'TABLE_CELL', 'uid': '151'},
 {'children': [], 'content': 'XPOS + BCA (Ours)', 'type': 'TABLE_CELL', 'uid': '152'},
 {'children': [], 'content': '21.6', 'type': 'TABLE_CELL', 'uid': '153'},
 {'children': [], 'content': '20.73', 'type': 'TABLE_CELL', 'uid': '154'}]

Nice! We have all the cells that comprise the table. Let's examine an example cell more closely:

>>> output['content_tree']['children'][5]['children'][0]
{'children': [], 'content': '2048', 'type': 'TABLE_CELL', 'uid': '133'}

The content indicates that the text in this cell is "2048", and it has no children (i.e. this is the smallest unit - a single table cell). To check whether there are annotations associated with this cell - which would allow us to connect it to other cells - we must find an annotation whose content_uids contains the cell's uid. Here it is:

>>> output['annotations'][126]
{'content_uids': ['133'], 'data': {'index': [0, 1], 'span': [1, 1]}, 'type': 'table_structure'}

Here we have the annotation for that cell (row 0, column 1, spanning a single cell) along with its corresponding content ID.

If we go through all the annotations, we will find one for each of the content_uids, which will tell us which row and column the cell is in. That way we have the text in each cell - in the content - as well as their relationship to each other - in the annotations.
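
To make this concrete, here is a minimal sketch of assembling a grid of cell texts from a table's children and the annotations. It is not the Toolkit's implementation: build_grid is a hypothetical helper, the sample data is partly invented (only the annotation for uid '133' is taken from the example above), and it assumes span is ordered as [row span, column span], mirroring index. Text is written only to the top-left position of each spanned region.

```python
from typing import Any, Dict, List


def build_grid(cells: List[Dict[str, Any]],
               annotations: List[Dict[str, Any]]) -> List[List[str]]:
    """Place each cell's text at the row/column given by its annotation."""
    # Map content uid -> annotation "data" (index and span).
    data_by_uid = {
        uid: ann["data"]
        for ann in annotations
        if ann.get("type") == "table_structure"
        for uid in ann.get("content_uids", [])
    }
    placed = [(c, data_by_uid[c["uid"]]) for c in cells if c["uid"] in data_by_uid]
    if not placed:
        return []

    # Grid size: furthest row/column reached by any cell.
    # Assumption: span is [row span, column span].
    n_rows = max(d["index"][0] + d["span"][0] for _, d in placed)
    n_cols = max(d["index"][1] + d["span"][1] for _, d in placed)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

    for cell, d in placed:
        r, c = d["index"]
        grid[r][c] = cell["content"]
    return grid


def _ann(uid: str, row: int, col: int) -> Dict[str, Any]:
    """Build a single-cell sample annotation (illustration only)."""
    return {"content_uids": [uid], "type": "table_structure",
            "data": {"index": [row, col], "span": [1, 1]}}


sample_cells = [
    {"children": [], "content": "2048", "type": "TABLE_CELL", "uid": "133"},
    {"children": [], "content": "4096", "type": "TABLE_CELL", "uid": "134"},
    {"children": [], "content": "ROPE", "type": "TABLE_CELL", "uid": "137"},
    {"children": [], "content": "73.6", "type": "TABLE_CELL", "uid": "138"},
    {"children": [], "content": "294.45", "type": "TABLE_CELL", "uid": "139"},
]
# Positions other than uid '133' are assumed for the sake of the example.
sample_annotations = [_ann("133", 0, 1), _ann("134", 0, 2),
                      _ann("137", 1, 0), _ann("138", 1, 1), _ann("139", 1, 2)]

grid = build_grid(sample_cells, sample_annotations)
```

With the real response, the inputs would be output['content_tree']['children'][5]['children'] and output['annotations'].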

See the Toolkit page for further information on rebuilding tables.

Location Bounding Boxes

By default, the output JSON includes bounding box information for the content and annotations.

It appears under the "locations" key of each annotation and content item. "locations" is a list of dictionaries with the following keys:

page_number: int # Page number where the content or annotation is found (0-indexed)
height: float # Normalized height of the bounding box
width: float # Normalized width of the bounding box
x: float # Normalized x0 of the bounding box (left)
y: float # Normalized y0 of the bounding box (top)

To remove "locations" from the output JSON, use the optional output_format parameter as follows:

response = requests.get(response_url, params={"output_format": "structured_document"}, headers=headers)

"locations" is a list so that an annotation or content item can have multiple locations. "locations" will be None for a higher-level object such as a document page, but each of its children will contain a locations entry.

These are page-relative locations. To convert to bounding box coordinates on your own rendered page, use (page_width*x, page_height*y, page_width*(x+width), page_height*(y+height)).
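
The conversion above can be sketched as a small helper. The function names (to_page_box, item_boxes) and the sample values are hypothetical; only the keys and the formula come from this page. The second helper also covers the case where "locations" is None.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def to_page_box(location: Dict[str, float],
                page_width: float, page_height: float) -> Box:
    """Scale one normalized location to coordinates on a rendered page."""
    x0 = page_width * location["x"]
    y0 = page_height * location["y"]
    x1 = page_width * (location["x"] + location["width"])
    y1 = page_height * (location["y"] + location["height"])
    return (x0, y0, x1, y1)


def item_boxes(item: Dict[str, Any],
               page_width: float, page_height: float) -> List[Tuple[int, Box]]:
    """Return (page_number, box) pairs for an item; empty if locations is None."""
    return [
        (loc["page_number"], to_page_box(loc, page_width, page_height))
        for loc in item.get("locations") or []
    ]


# Made-up values for illustration; real values come from the API response.
sample_location = {"page_number": 0, "x": 0.25, "y": 0.5, "width": 0.5, "height": 0.25}
sample_item = {"uid": "133", "locations": [sample_location]}

box = to_page_box(sample_location, page_width=800, page_height=1000)
boxes = item_boxes(sample_item, page_width=800, page_height=1000)
no_boxes = item_boxes({"uid": "0", "locations": None}, 800, 1000)
```

This sketch uses one page size for every location; if your rendered pages differ in size, look up the width and height by page_number before converting.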