OCR

Welcome to the release of our Optical Character Recognition (OCR) feature! This functionality allows users to extract text and tables from scanned documents with ease. With the launch of our new Hierarchical v2 model, OCR is officially out of Beta.

Key Highlights

We recommend setting document_type=hierarchical_v2 when using OCR. This ensures that table extraction works properly with scanned documents.
Extract text from images and scanned pages within PDFs.
Support for multiple languages and fonts.

Getting Started

To OCR all pages of your document, add "ocr": "true" to the data dictionary in requests.post:

Python

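import requests

# api_url and files are assumed to be defined earlier in your script.
response = requests.post(
    api_url,
    files=files,
    data={
        "document_type": "hierarchical_v2",
        "ocr": "true",  # sent as the string "true", not a Python bool
    },
)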
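
For readers who want something closer to an end-to-end call, here is a minimal sketch of how the request above might be wrapped in a helper. Only the two form fields (document_type and ocr) come from this page. The helper names, the "file" upload field name, and the 300-second timeout are assumptions made for illustration, and authentication is not shown. The session argument is expected to be a requests.Session (or any object with a compatible post method).

```python
from typing import Any, Dict


def build_ocr_form_data(enable_ocr: bool = True) -> Dict[str, str]:
    """Build the form fields for an extraction request.

    The OCR flag is sent as the string "true", as in the snippet above.
    """
    data = {"document_type": "hierarchical_v2"}
    if enable_ocr:
        data["ocr"] = "true"
    return data


def submit_for_extraction(
    session: Any,
    api_url: str,
    pdf_path: str,
    enable_ocr: bool = True,
    timeout_seconds: float = 300.0,
) -> Any:
    """POST a document with the OCR form fields and return the response.

    Hypothetical helper: `session` is a requests.Session (or compatible
    object). The upload field name "file" is an assumption; use whatever
    your existing `files` dictionary uses.
    """
    with open(pdf_path, "rb") as pdf_file:
        response = session.post(
            api_url,
            files={"file": pdf_file},  # assumed field name
            data=build_ocr_form_data(enable_ocr),
            timeout=timeout_seconds,  # generous, since OCR requests take longer
        )
    # Raise an exception on 4xx/5xx instead of failing silently.
    response.raise_for_status()
    return response
```

With a requests.Session that already carries your credentials, a call might look like submit_for_extraction(session, api_url, "scan.pdf"). Keeping the form fields in their own function makes it easy to switch OCR off for documents that already contain a text layer.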

Important Notes

Expect higher latency and lower throughput when OCR is enabled.
Accuracy may vary depending on image quality and text complexity.
We welcome feedback that helps us improve accuracy and expand language support.

Feedback and Support

We value your feedback to enhance OCR's performance. If you encounter any issues or have suggestions, please reach out to extract@kensho.com.

Stay Updated

Keep an eye on our release notes for updates and improvements to OCR based on your feedback and usage.