Kensho Extract
The Kensho Extract API transforms PDF documents into structured JSON files. There are three model output types. Read the descriptions below to determine which best suits your use case:
hierarchical_v2: Hierarchical v2 is Extract's newest model. It offers the same benefits as Hierarchical v1, including capturing a document's specific structure, and adds more classes than the original hierarchical model for finer granularity of text types. It offers the following classes:
- Titles & Subtitles, with up to 5 levels of subtitles
- Paragraphs
- Tables, Table Titles, Table Captions, Table Labels, & Table Footers
- Figure Titles, Figure Captions, Figure Labels, & Figure Footers
- Non-Figure Image Titles, Image Captions, Image Labels, & Image Footers
- Page Headers, Page Footers, & Page Footnotes
- Table of Contents & Table of Contents Titles
- Miscellaneous Text
Note
hierarchical_v2 is in beta, and we will continue to iterate on it as needed.
hierarchical: The hierarchical model provides the specific document structure, mimicking the intended hierarchy of a document. It offers the following classes:
- Titles & Subtitles
- Paragraphs
- Tables & Table Titles
- Figure Titles
- Miscellaneous Text
general: The general model provides a non-hierarchical structure of the document. We recommend trying the hierarchical model first; if you are not satisfied with the output or do not require a hierarchical structure, try the general model. It offers the following classes:
- Text
- Tables
- Figures
- Titles
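To compare the output types at a glance, the class lists above can be written down as a small lookup table. The sketch below is only a local illustration: the names `MODEL_CLASSES` and `models_offering` are hypothetical, and the labels are transcribed from this page rather than taken from the API's actual JSON output, which may spell them differently.

```python
from typing import List

# Class lists transcribed from the descriptions above. This is a local lookup
# table for illustration, not data returned by the Extract API.
MODEL_CLASSES = {
    "hierarchical_v2": [
        "Titles", "Subtitles", "Paragraphs",
        "Tables", "Table Titles", "Table Captions", "Table Labels", "Table Footers",
        "Figure Titles", "Figure Captions", "Figure Labels", "Figure Footers",
        "Non-Figure Image Titles", "Image Captions", "Image Labels", "Image Footers",
        "Page Headers", "Page Footers", "Page Footnotes",
        "Table of Contents", "Table of Contents Titles",
        "Miscellaneous Text",
    ],
    "hierarchical": [
        "Titles", "Subtitles", "Paragraphs", "Tables", "Table Titles",
        "Figure Titles", "Miscellaneous Text",
    ],
    "general": ["Text", "Tables", "Figures", "Titles"],
}


def models_offering(text_class: str) -> List[str]:
    """Return the output types whose class list includes text_class."""
    return [name for name, classes in MODEL_CLASSES.items() if text_class in classes]


# Example: find which output types distinguish page headers from body text.
print(models_offering("Page Headers"))
```

A lookup like this can help decide whether the extra granularity of hierarchical_v2 is needed for a given use case.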
API V3 Feature Updates
- Enhanced Table Extraction: This feature improves recognition of rows and columns and provides best-in-class support for challenging elements like merged cells and column headers. Set “enhanced_table_extraction” to “true” to use our latest model to extract tables from your documents. Please note: this feature should not be set to “true” for scanned documents.
- Optical Character Recognition (OCR): Kensho Extract now offers OCR on scanned documents! As a beta release, OCR is currently undergoing testing and refinement to ensure accuracy and reliability in various scenarios.
- Hierarchical V2: This new document type allows for increased granularity in the types of text segments predicted and more levels of title hierarchy. It also has improved table detection and overall performance with OCRed documents.
- Figure Extraction: This new API parameter extracts data from charts and figures in PDF documents. We currently only support extracting data from bar charts, with new chart types coming soon.
Note
Figure Extraction is currently available only through our API. We will be adding the capability to our User Interface in the coming weeks.
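The guidance above on combining these options can be sketched as a small validation helper. This is a minimal sketch under stated assumptions, not the Extract client: the function `build_extract_options`, the `is_scanned` flag, and the `document_type` field name are hypothetical. Only `enhanced_table_extraction` and the three output type names come from this page, so consult the full API Documentation for the real request format.

```python
import json

# Output types named in this document.
OUTPUT_TYPES = {"hierarchical_v2", "hierarchical", "general"}


def build_extract_options(output_type, enhanced_table_extraction=False,
                          is_scanned=False):
    """Assemble request options, rejecting the combination the notes warn against."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output type: {output_type!r}")
    if enhanced_table_extraction and is_scanned:
        # The feature notes say not to enable enhanced table extraction
        # for scanned documents.
        raise ValueError(
            "enhanced_table_extraction should not be true for scanned documents"
        )
    return {
        "document_type": output_type,  # hypothetical field name
        "enhanced_table_extraction": enhanced_table_extraction,
    }


options = build_extract_options("hierarchical_v2", enhanced_table_extraction=True)
print(json.dumps(options, sort_keys=True))
```

Checking the options before submission surfaces the scanned-document restriction locally, rather than after a document has been queued.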
Get Started
You can begin using Kensho Extract in seconds via our REST API.
The API behaves in the following fashion:
- After authentication, the user is able to submit PDF documents as well as a priority code to the API. By default, the API will treat all documents as first in, first out, with the exception that any document marked as low priority will be handled after high priority documents are completed, regardless of when they are submitted.
- The low priority queue is intended for all bulk document processing to avoid delaying the processing of any high-urgency documents which may need a fast turnaround.
- After document submission, the API will return a unique request_id which can be used for a subsequent query to retrieve the document output at a later time.
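The submit, queue, and retrieve behavior described above can be illustrated with a small in-memory simulation. This is a sketch for building intuition only: it makes no network calls, the class `ExtractQueueSimulation` and its method names are hypothetical, and treating "high" as the default priority is an assumption. The real endpoints and parameters are described in the full API Documentation.

```python
import uuid
from collections import deque


class ExtractQueueSimulation:
    """Local stand-in for the queueing behavior described above (not the real service)."""

    def __init__(self):
        self._queues = {"high": deque(), "low": deque()}
        self._results = {}

    def submit(self, pdf_name, priority="high"):
        """Queue a document and return a unique request_id."""
        if priority not in self._queues:
            raise ValueError(f"unknown priority: {priority!r}")
        request_id = str(uuid.uuid4())
        self._queues[priority].append((request_id, pdf_name))
        return request_id

    def process_next(self):
        """Process one document: high priority first, FIFO within each queue."""
        for priority in ("high", "low"):
            if self._queues[priority]:
                request_id, pdf_name = self._queues[priority].popleft()
                self._results[request_id] = {"source": pdf_name, "status": "done"}
                return request_id
        return None  # nothing left to process

    def get_result(self, request_id):
        """Return the output for request_id, or None if it is still pending."""
        return self._results.get(request_id)


service = ExtractQueueSimulation()
bulk = service.submit("archive.pdf", priority="low")  # submitted first
urgent = service.submit("filing.pdf")                 # submitted second
first_processed = service.process_next()
print(first_processed == urgent)  # the urgent document is handled first
```

Although the bulk document was submitted earlier, it waits until the high priority queue is empty, and its result becomes retrievable by `request_id` only after it has been processed.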
To sign up, please email support@kensho.com to set up your API profile.
Then, to start extracting documents with Kensho Extract, visit our authentication guide or reference the full API Documentation.