OCR

Welcome to the beta release of our Optical Character Recognition (OCR) feature! This functionality allows you to extract text from images with ease. As a beta release, OCR is currently undergoing testing and refinement to ensure accuracy and reliability in various scenarios.

Key Highlights

  • Extract text from images embedded in PDFs and from scanned documents saved as PDFs.

  • Support for multiple languages and fonts.

Getting started

To run OCR on all pages of your document, add "ocr": "true" to the data dictionary passed to requests.post:

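import requests

# api_url and files are assumed to be defined as in your existing upload request.
response = requests.post(
    api_url,
    files=files,
    data={
        "document_type": "general",
        "ocr": "true",  # the string "true", not a Python boolean
    },
)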
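
For a fuller picture, the request above can be wrapped in a small helper. This is a minimal sketch, not an official client: the function names, the "file" form field name, and the default timeout are assumptions made for illustration, and any authentication headers your deployment requires are omitted. It also assumes that leaving out the "ocr" field submits the document without OCR.

```python
def build_ocr_data(document_type="general", ocr=True):
    """Build the form fields sent alongside the uploaded file.

    The "ocr" flag is sent as the string "true", matching the request
    shown above. Omitting it when ocr=False is an assumption.
    """
    data = {"document_type": document_type}
    if ocr:
        data["ocr"] = "true"
    return data


def upload_with_ocr(api_url, pdf_path, timeout_s=300):
    """Upload a PDF with OCR enabled and return the HTTP response.

    Hypothetical helper: the "file" field name and the 300-second
    timeout are illustrative assumptions, not documented values.
    """
    # Imported here so build_ocr_data stays usable without the dependency.
    import requests

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            api_url,
            files={"file": pdf},
            data=build_ocr_data(),
            timeout=timeout_s,
        )
    # Raise an exception on 4xx/5xx responses instead of failing silently.
    response.raise_for_status()
    return response
```

Keeping the form fields in their own function makes it easy to toggle OCR per document and to check the payload without sending a request.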

Important Notes

  • Expect higher latency and lower throughput when using OCR.

  • OCR does not currently support table extraction: table content is returned as text.

  • Accuracy may vary depending on image quality and text complexity.

  • Feedback is encouraged to improve accuracy and expand language support.
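
Because OCR changes how long a request takes, it can help to measure latency on your own documents before adjusting client timeouts. The sketch below is one way to do that; the `timed` helper is a hypothetical name introduced here and is not part of any API.

```python
import time


def timed(call, *args, **kwargs):
    """Run call(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example usage (assumes api_url, files, and data are already defined):
# response, seconds = timed(requests.post, api_url, files=files, data=data)
# print(f"Request took {seconds:.1f} s")
```

Running the same document with and without "ocr": "true" gives a rough sense of the added latency for your workload.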

Feedback and Support

Your feedback helps us improve OCR's performance. If you encounter any issues or have suggestions, please reach out to extract@kensho.com.

Stay Updated

Keep an eye on our release notes for updates and improvements to OCR based on your feedback and usage.