Toolkit

Kensho Extract is known for its innovative features, including first-class text ordering and recently released state-of-the-art table structure recognition. As Extract’s features expand to cover more sophisticated document segments, its output will only grow richer. The newly released Toolkit, also known as “Kenverters”, provides several key conversion tools for working with the output from Kensho Extract.

Please visit our open-source documentation for more information, or select any of the six items below.

  1. Conversion to Items: Convert Extract’s output to a list of paragraphs, titles, and tables represented as dictionaries.

  2. Full Text Extraction: Receive the full text output as a single string. The string contains each item (paragraph, title, or table) separated by \n.

  3. Markdown Conversion: Convert all text from a document into Markdown. The function returns a string with # before each title and a Markdown representation of each table, using | as the delimiter between cells.

  4. Table Extraction: Take a full document’s Extract output and return all the tables in natural reading order and in a variety of common formats.

  5. Organized Sections: Receive a list of the sections in a document, returned as a list of lists containing document segments (title, table, or text).

  6. Visually Formatted Text: Return visually formatted text for each page as a list of strings, one per page, in which spaces and line breaks simulate the original white space between segments.
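
To make the first three conversions concrete, here is a minimal Python sketch of the behavior described above. It does not use the Kenverters API: the item dictionaries (with `category`, `text`, and `table` keys) and the helper names `to_full_text` and `to_markdown` are hypothetical and chosen only for illustration. The way a table is flattened into plain text is likewise an assumption.

```python
from typing import Any, Dict, List

# Hypothetical item shape for illustration only; the real Kenverters
# output schema may use different keys and values.
items: List[Dict[str, Any]] = [
    {"category": "title", "text": "Quarterly Results"},
    {"category": "paragraph", "text": "Revenue grew in Q2."},
    {"category": "table",
     "table": [["Quarter", "Revenue"], ["Q1", "10"], ["Q2", "12"]]},
]


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_full_text(doc_items: List[Dict[str, Any]]) -> str:
    """Join every item into one string with "\n" as the delimiter."""
    parts = []
    for item in doc_items:
        if item["category"] == "table":
            # Assumption: table cells are joined by spaces, one row per line.
            parts.append("\n".join(" ".join(row) for row in item["table"]))
        else:
            parts.append(item["text"])
    return "\n".join(parts)


def to_markdown(doc_items: List[Dict[str, Any]]) -> str:
    """Prefix titles with "# " and render tables with "|" between cells."""
    parts = []
    for item in doc_items:
        if item["category"] == "title":
            parts.append("# " + item["text"])
        elif item["category"] == "table":
            parts.append(table_to_markdown(item["table"]))
        else:
            parts.append(item["text"])
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_full_text(items))
    print()
    print(to_markdown(items))
```

Representing every segment as a small dictionary keeps the other conversions simple: each one is a single pass over the same list of items.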

Kenverters is Kensho Extract’s open-source tooling for powering your downstream applications with ease. It makes Extract’s rich features more user-friendly and improves the developer experience. For more information, check out our documentation.