Kensho Extract

The Kensho Extract API allows users to transform PDF documents into structured JSON files. There are three model output types. Please read the following information on each model output to determine which is best suited to your use case:

hierarchical_v2: Hierarchical v2 is Extract's newest model. Users can expect the same benefits as Hierarchical v1, including the capture of specific document structure. Additionally, it includes more classes than the original hierarchical model for increased granularity of text types. Below are the classes offered:

Titles & Subtitles, up to a level of 5 subtitles
Paragraphs
Tables, Table Titles, Table Captions, Table Labels, & Table Footers
Figure Titles, Figure Captions, Figure Labels, & Figure Footers
Non-Figure Image Titles, Image Captions, Image Labels, & Image Footers
Page Headers, Page Footers, & Page Footnotes
Table of Contents & Table of Contents Titles
Miscellaneous Text

Note

hierarchical_v2 is in beta, and we will continue to iterate on it as needed.

hierarchical: The hierarchical model will provide the specific document structure, mimicking the intended hierarchy of a document.

Titles & Subtitles
Paragraphs
Tables & Table Titles
Figure Titles
Miscellaneous Text

general: Choosing the general model will provide a non-hierarchical structure of the document. We recommend trying the hierarchical model first; if you are not satisfied with the output or do not require a hierarchical structure, try the general model.

Text
Tables
Figures
Titles
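As a rough illustration of the guidance above, the choice between the three model output types can be written as a small helper. This is only a sketch: the function name, its arguments, and the decision rule are assumptions made for illustration, not part of the Kensho Extract API. Only the three model names come from the descriptions above.

```python
from enum import Enum


class ModelOutput(str, Enum):
    """The three model output types named in this document."""
    HIERARCHICAL_V2 = "hierarchical_v2"
    HIERARCHICAL = "hierarchical"
    GENERAL = "general"


def choose_model_output(needs_hierarchy: bool = True,
                        needs_fine_grained_classes: bool = False,
                        allow_beta: bool = False) -> ModelOutput:
    """Hypothetical helper: pick a model output type for a use case.

    Assumed rule of thumb (not an official recommendation engine):
    - no hierarchy needed           -> general
    - extra classes and beta is OK  -> hierarchical_v2
    - otherwise                     -> hierarchical
    """
    if not needs_hierarchy:
        return ModelOutput.GENERAL
    if needs_fine_grained_classes and allow_beta:
        return ModelOutput.HIERARCHICAL_V2
    return ModelOutput.HIERARCHICAL
```

For example, `choose_model_output(needs_fine_grained_classes=True, allow_beta=True)` returns `ModelOutput.HIERARCHICAL_V2`, while the defaults return `ModelOutput.HIERARCHICAL`, matching the advice to try the hierarchical model first.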

API V3 Feature Updates

Enhanced Table Extraction: This feature improves recognition of rows and columns and provides best-in-class support for challenging elements such as merged cells and column headers. Users should set “enhanced_table_extraction” to “true” to use our latest model to extract tables from their documents. Please note: this feature should not be set to “true” for scanned documents.
Optical Character Recognition (OCR): Kensho Extract now offers OCR on scanned documents. As a beta release, OCR is currently undergoing testing and refinement to ensure accuracy and reliability in various scenarios.
Hierarchical V2: This new document type allows for increased granularity in the types of text segments predicted and more levels of title hierarchy. It also has improved table detection and better overall performance on OCRed documents.
Figure Extraction: This new API parameter extracts data from charts and figures in PDF documents. We currently only support extracting data from bar charts, with new chart types coming soon.
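The constraint between enhanced table extraction and scanned documents can be checked before a request is sent. The sketch below builds a dictionary of request options. Only the `enhanced_table_extraction` name and its “true” value come from the text above. The `document_type` and `ocr` keys, the function name, and the idea of passing these as string-valued options are assumptions for illustration, so check the full API Documentation for the actual parameter names.

```python
def build_extract_options(model_output: str = "hierarchical_v2",
                          scanned: bool = False,
                          enhanced_tables: bool = True) -> dict:
    """Hypothetical helper: assemble request options for one document.

    Enforces the documented rule that enhanced table extraction
    should not be enabled for scanned documents.
    """
    if scanned and enhanced_tables:
        raise ValueError(
            "enhanced_table_extraction should not be 'true' for scanned documents"
        )
    return {
        # Assumed key name for the model output type.
        "document_type": model_output,
        # Name taken from this document.
        "enhanced_table_extraction": "true" if enhanced_tables else "false",
        # Assumed key name: request OCR only for scanned documents.
        "ocr": "true" if scanned else "false",
    }
```

Calling `build_extract_options(scanned=True, enhanced_tables=False)` yields options with OCR requested and enhanced table extraction off, while `build_extract_options(scanned=True)` raises a `ValueError` instead of producing an invalid combination.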

Get Started

You can begin using Kensho Extract in seconds via our REST API.

The API behaves in the following fashion:

After authentication, the user can submit PDF documents to the API along with a priority code. By default, the API processes documents first in, first out, with one exception: any document marked as low priority is handled only after high priority documents are completed, regardless of when it was submitted.
The low priority queue is intended for all bulk document processing to avoid delaying the processing of any high-urgency documents which may need a fast turnaround.
After document submission, the API will return a unique request_id, which can be used in a later query to retrieve the document output.
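The behavior described above can be modeled locally to make the ordering concrete. The following is a minimal in-memory sketch, not the service's implementation: the class and method names, the "high" and "low" priority labels, the default priority, and the shape of the result dictionary are all assumptions made for illustration. It makes no network calls.

```python
import uuid
from collections import deque


class ExtractQueueSketch:
    """Toy model of the documented queueing rules (illustrative only)."""

    def __init__(self):
        self._high = deque()   # FIFO queue for high priority documents
        self._low = deque()    # FIFO queue for bulk / low priority documents
        self._results = {}     # request_id -> finished output

    def submit(self, filename: str, priority: str = "high") -> str:
        """Queue a document and return a unique request_id."""
        if priority not in ("high", "low"):
            raise ValueError("priority must be 'high' or 'low'")
        request_id = uuid.uuid4().hex
        queue = self._low if priority == "low" else self._high
        queue.append((request_id, filename))
        return request_id

    def process_next(self):
        """Process one document: high priority first, FIFO within a queue."""
        queue = self._high or self._low  # an empty deque is falsy
        if not queue:
            return None
        request_id, filename = queue.popleft()
        self._results[request_id] = {"status": "done", "source": filename}
        return request_id

    def get_result(self, request_id: str) -> dict:
        """Look up output by request_id; pending until it is processed."""
        return self._results.get(request_id, {"status": "pending"})
```

In this model, a low priority document submitted first is still processed after any high priority documents submitted later, and `get_result` reports a pending status until the document's turn comes up.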

To sign up, please email support@kensho.com to set up your API profile.

Then, to start extracting documents with Kensho Extract, visit our authentication guide or reference the full API Documentation.