olmOCR

by Ai2
olmOCR is an open-source tool for converting PDFs to text with high accuracy, preserving reading order and supporting tables, equations, and handwriting.

What is olmOCR?

olmOCR is an open-source tool developed by Ai2 that converts PDF documents into clean, structured plain text. It combines document anchoring with a fine-tuned Qwen2-VL-7B-Instruct multimodal model to handle many kinds of PDFs, including academic papers, books, tables, and charts. For each page, olmOCR pulls the text and layout information embedded in the PDF and pairs it with the page image, which lets the model extract content accurately while retaining its structure. It supports large-scale batch processing at a cost of only $190 per million pages, significantly lower than commercial solutions.
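To put the quoted price in perspective, the arithmetic can be sketched as follows. This is a minimal illustration that assumes cost scales linearly with page count; the function name `estimate_cost_usd` is hypothetical and not part of olmOCR, and real costs will depend on hardware and document mix.

```python
# Figure quoted in the text above; everything else here is illustrative.
COST_PER_MILLION_PAGES_USD = 190.0


def estimate_cost_usd(num_pages: int) -> float:
    """Linear estimate: pages * (cost per million pages / 1,000,000)."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * COST_PER_MILLION_PAGES_USD / 1_000_000


if __name__ == "__main__":
    for pages in (10_000, 1_000_000, 25_000_000):
        print(f"{pages:>12,} pages: ${estimate_cost_usd(pages):,.2f}")
```

Under this assumption, a 10,000-page archive would cost on the order of two dollars to process.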

Main Features of olmOCR

  • Efficient Document Conversion: Converts PDF documents into clean, structured plain text while retaining structured content such as chapters, tables, lists, and formulas.
  • Supports Multiple Document Types: Handles PDF documents from various fields, including academic papers, legal documents, brochures, charts, and scanned documents.
  • Document Anchoring Technology: Extracts text blocks and image location information from documents, combining them with the original text to form prompts that improve content extraction accuracy.
  • Large-Scale Processing Capability: Optimizes the inference process, supporting batch processing from single documents to millions of pages at a very low cost ($190 per million pages).
  • Open Source and Extensible: All components, including model weights, data, and training code, are open source, supporting various inference frameworks (e.g., vLLM and SGLang) for easy user extension and customization.

Technical Principles of olmOCR

  • Document Anchoring: Extracts text blocks and image location information from PDF pages, combining them with the original text to form prompts. These prompts, along with the rasterized page images, are fed to a vision language model (VLM) so that it can better understand the document's structure and layout, reducing extraction errors caused by blurry images or complex layouts.
  • Fine-Tuned Vision Language Model (VLM): Built on Qwen2-VL-7B-Instruct, a 7B-parameter vision language model, and fine-tuned on a dataset of 260,000 PDF pages to adapt it to document processing tasks. The model outputs structured JSON data, including page metadata (e.g., language, orientation, presence of tables) and text content in natural reading order.
  • Efficient Inference and Cost Optimization: Uses efficient inference frameworks such as SGLang and vLLM to support large-scale parallel processing. With optimized hardware utilization and inference, olmOCR's processing cost is $190 per million pages, much lower than commercial solutions.
  • Robustness Enhancement: Automatically retries with adjusted prompt content when extraction fails or the model falls into repetitive generation. Detects page orientation and applies rotation correction to ensure accurate content extraction.
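The anchoring and retry ideas above can be sketched roughly as follows. This is a simplified illustration, not olmOCR's actual implementation: the `TextBlock` type, the prompt wording, the `natural_text` field name, the character budget, and the `call_model` callback are all assumptions made for the example, and a real pipeline would also send the rasterized page image to the model.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TextBlock:
    """A text fragment with its position on the page (hypothetical type)."""
    x: float
    y: float
    text: str


def build_anchored_prompt(blocks: List[TextBlock], max_chars: int = 4000) -> str:
    """Combine block coordinates and raw text into a prompt, within a size budget."""
    header = (
        "Below is the raw text of a PDF page with block coordinates.\n"
        "Return the page content as JSON in natural reading order.\n"
    )
    lines = []
    used = len(header)
    for block in blocks:
        line = f"[{block.x:.0f}x{block.y:.0f}] {block.text}"
        if used + len(line) + 1 > max_chars:
            break  # drop remaining anchors once the budget is exhausted
        lines.append(line)
        used += len(line) + 1
    return header + "\n".join(lines)


def extract_page(
    blocks: List[TextBlock],
    call_model: Callable[[str], str],  # stand-in for the VLM call (assumption)
    max_retries: int = 3,
) -> Optional[dict]:
    """Call the model, retrying with a smaller anchor budget on bad output."""
    budget = 4000
    for _ in range(max_retries):
        raw = call_model(build_anchored_prompt(blocks, max_chars=budget))
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            result = None
        # "natural_text" is an assumed field name for the extracted text.
        if isinstance(result, dict) and "natural_text" in result:
            return result
        budget //= 2  # adjust the prompt content before the next attempt
    return None
```

Passing the model as a callback keeps the sketch independent of any particular inference framework; in practice `call_model` would wrap a request to a server such as vLLM or SGLang.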

Application Scenarios of olmOCR

  • Language Model Training: Extracts high-quality text from PDF documents to build training corpora for language models.
  • Academic Research: Quickly converts academic papers into structured text, aiding literature review and knowledge mining.
  • Legal Document Processing: Accurately extracts content from legal documents and contracts, supporting legal text analysis and compliance checks.
  • Enterprise Document Management: Converts internal PDF documents into editable text for easy management and updates.
  • Digital Libraries and Archive Digitization: Converts PDF scans of printed books and historical documents into electronic documents for digital preservation and dissemination.

Features & Capabilities

What You Can Do
PDF-to-Text Conversion, Text Extraction, Document Structuring, Batch Processing
Categories
PDF Conversion, Document Anchoring, Structured Text, Open Source, OCR, Text Extraction, Batch Processing, Academic Research, Legal Document Processing, Digital Libraries
Example Uses
  • Language model training
  • Academic research
  • Legal document processing
  • Enterprise document management
  • Digital library digitization

Pricing
Free to use. Large-scale batch processing costs $190 per million pages.
