SmolDocling

SmolDocling

by ds4sd
SmolDocling is a compact, multimodal document processing model designed for efficient conversion of document images into structured text. It supports a variety of elements including text, formulas, charts, and tables, making it suitable for academic papers, technical reports, and other document types. With only 256M parameters, it ensures fast inference speeds, processing each page in just 0.35 seconds on an A100 GPU. The model is fully compatible with Docling and supports multiple export formats.

What is SmolDocling?

SmolDocling is a lightweight, multimodal document processing model designed for efficient conversion of document images into structured text. It supports various elements including text, formulas, charts, and tables, making it ideal for academic papers, technical reports, and other document types.

Key Features

  • Multimodal Document Conversion: Converts image documents into structured text, supporting both scientific and non-scientific documents.
  • Fast Inference: Processes a page in just 0.35 seconds on an A100 GPU.
  • OCR and Layout Recognition: Accurately extracts text while preserving document structure and element bounding boxes.
  • Complex Element Recognition: Recognizes and processes code blocks, mathematical formulas, charts, and tables.
  • Seamless Integration with Docling: Supports multiple export formats and is compatible with Docling.

Technical Details

  • Lightweight Design: With only 256M parameters, SmolDocling is optimized for fast processing on consumer-grade GPUs.
  • Visual Backbone Network: Uses SigLIP base patch-16/512 for efficient image processing.
  • Text Encoder: Employs SmolLM-2 for text processing and multimodal fusion.
  • Optimized Training: Trained on a diverse dataset with a higher pixel token rate for improved efficiency.

Getting Started

To use SmolDocling, install the necessary dependencies and follow the example code provided in the documentation. The model supports inference using Transformers, VLLM, or ONNX, and results can be exported in multiple formats using Docling.

Application Scenarios

  • Document Conversion and Digitization: Efficiently converts image-based documents into structured text formats.
  • Scientific and Non-Scientific Document Processing: Recognizes and extracts key information from various document types.
  • Mobile and Low-Resource Device Support: Runs on mobile devices or resource-constrained environments.

Model Capabilities

Model Type
multimodal
Supported Tasks
Ocr Text Extraction Formula Recognition Chart Recognition Table Recognition Document Conversion
Tags
Document Processing OCR Multimodal Lightweight Text Extraction Formula Recognition Chart Recognition Table Recognition Academic Papers Technical Reports

Usage & Integration

Pricing
free
API Access
Available
License
Open Source
Requirements
  • Python 3.8+
  • GPU

Screenshots & Images

Primary Screenshot
Additional Images

Stats

0 Views
0 Likes

Similar Models

WarriorCoder by Microsoft, South China University of Technology
0
CSM by Sesame Team
0
Light-R1 by 360 Smart Brain
0