Kensho Extract

The Kensho Extract API allows users to transform PDF documents into structured JSON files.

Only one document_type is currently supported: hierarchical_v2, our latest. Set document_type=hierarchical_v2 to capture the structure of a document in a way that mirrors its intended hierarchy. It offers a rich set of classes for fine-grained control over the types of text segments predicted:

Titles & Subtitles (up to 5 subtitle levels)
Paragraphs
Tables, Table Titles, Table Captions, Table Labels, & Table Footers
Figure Titles, Figure Captions, Figure Labels, & Figure Footers
Non-Figure Image Titles, Image Captions, Image Labels, & Image Footers
Page Headers, Page Footers, & Page Footnotes
Table of Contents & Table of Contents Titles
Miscellaneous Text

The set of output types is open

The content types (nodes) listed here and on the Output Format page are not a closed set. As we release new features, we may introduce new content types. If the behavior of an existing type changes, we will let you know in advance, but new types can be added at any time. When consuming Extract output, treat the set of types as extensible and tolerate types you do not recognize.
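
One way to tolerate unrecognized types when consuming output is a handler table with a fallback. The sketch below is a minimal illustration, not Kensho's schema: the node keys type, content, and children, and the type names TITLE, PARAGRAPH, and FUTURE_TYPE, are assumptions made for the example. See the Output Format page for the actual field and type names.

```python
from typing import Any, Callable, Dict, List

# Hypothetical node shape: {"type": str, "content": str, "children": [...]}.
# These key names are assumptions for illustration only.
Node = Dict[str, Any]


def render_title(node: Node) -> str:
    return "# " + node.get("content", "")


def render_paragraph(node: Node) -> str:
    return node.get("content", "")


def render_unknown(node: Node) -> str:
    # Fallback: keep the text of an unrecognized type instead of failing.
    return node.get("content", "")


# Handlers for the types this consumer knows about (assumed names).
HANDLERS: Dict[str, Callable[[Node], str]] = {
    "TITLE": render_title,
    "PARAGRAPH": render_paragraph,
}


def render(node: Node, unknown_types: List[str]) -> List[str]:
    """Depth-first walk that records, rather than rejects, unknown types."""
    node_type = node.get("type", "")
    handler = HANDLERS.get(node_type)
    if handler is None:
        unknown_types.append(node_type)
        handler = render_unknown
    lines = [handler(node)]
    for child in node.get("children", []):
        lines.extend(render(child, unknown_types))
    return lines


sample: Node = {
    "type": "TITLE",
    "content": "Report",
    "children": [
        {"type": "PARAGRAPH", "content": "Intro."},
        {"type": "FUTURE_TYPE", "content": "New kind of text."},
    ],
}

unknown: List[str] = []
lines = render(sample, unknown)
print(lines)
print("Unrecognized types:", unknown)
```

Collecting the unrecognized type names, rather than raising on them, lets a consumer keep working when a new type appears while still making it easy to log what was skipped.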

API V3 Feature Updates

Enhanced Table Extraction: Improves recognition of rows and columns and provides best-in-class support for challenging elements such as merged cells and column headers. Set enhanced_table_extraction to true to use our latest model for extracting tables from your documents. This feature is safe to use on scanned documents: Extract automatically detects scanned pages and OCRs them, so tables are still extracted correctly.
Optical Character Recognition (OCR): Kensho Extract offers OCR on scanned documents. Extract also automatically detects when a document needs OCR, so scanned pages are handled even when ocr is not set to true. See the OCR page for details.
Hierarchical V2: The hierarchical_v2 document type offers fine-grained text segment types, multiple levels of title hierarchy, strong table detection, and robust performance on OCRed documents.
Figure Extraction: This API parameter extracts data from charts and figures in PDF documents. We support the four most common chart types: bar charts, line plots, scatter plots, and pie charts. See the Figure Extraction page for details.

Get Started

You can begin using Kensho Extract in seconds via our REST API.

The API works as follows:

After authentication, the user can submit PDF documents, each with a priority code, to the API. By default, documents are processed first in, first out, with one exception: any document marked as low priority is handled only after high-priority documents are completed, regardless of when it was submitted.
The low-priority queue is intended for bulk document processing, so that bulk jobs do not delay high-urgency documents that need a fast turnaround.
After document submission, the API returns a unique request_id, which can be used in a later query to retrieve the document output.
For large documents, using the upload/download URL flow with document_type=hierarchical_v2 lets Extract process pieces of the document in parallel for faster turnaround time. See the quickstart for details.
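
The ordering described above can be modeled locally as a queue sorted by priority first and arrival order second. This is a toy simulation for intuition only: it does not call the Extract API, and the labels "high" and "low", the default of "high", and the ProcessingQueue class are assumptions made for the example, not the API's actual priority codes or internals.

```python
import heapq
from itertools import count
from typing import List, Tuple

# Assumed labels for illustration; the real priority codes may differ.
PRIORITY_RANK = {"high": 0, "low": 1}


class ProcessingQueue:
    """Toy model: high priority first, then first in, first out."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = count()  # monotonically increasing submission index

    def submit(self, name: str, priority: str = "high") -> None:
        # Sort key: (priority rank, arrival index), so ties resolve as FIFO.
        entry = (PRIORITY_RANK[priority], next(self._arrival), name)
        heapq.heappush(self._heap, entry)

    def next_document(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = ProcessingQueue()
queue.submit("bulk-1.pdf", "low")
queue.submit("urgent-1.pdf", "high")
queue.submit("bulk-2.pdf", "low")
queue.submit("urgent-2.pdf", "high")

order = [queue.next_document() for _ in range(len(queue))]
print(order)
```

In this model the two high-priority documents come out first, in the order they were submitted, followed by the low-priority documents in their own submission order.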

To sign up, please email support@kensho.com to set up your API profile.

Then, to start extracting documents with Kensho Extract, visit our authentication guide or refer to the full API Documentation.

To turn Extract's JSON output into text, markdown, pandas DataFrames, and other downstream-ready formats, use Kenverters, our open-source conversion toolkit.