Step-by-Step Guide to Local Deployment of olmOCR: Making PDF Processing a Breeze!

olmOCR Team
March 1, 2025
Attention all PDF wranglers! Today, I'm introducing a fantastic tool—olmOCR—that enables language models to easily understand PDFs with even the most unconventional layouts! From academic papers to complex tables, it handles them all. Best of all, it supports local deployment, ensuring your data stays secure! Below, I'll guide you step-by-step through installation and usage👇
🛠️ Preparation: Installing Dependencies
First, we need to install a few system-level dependencies (using Ubuntu as an example):
# Install all required system dependencies in one go
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
💡 Tip: If a font license agreement appears during installation, press the TAB key to select <Yes> and confirm!
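Optionally, you can confirm the key pieces are in place (a quick sanity check; fc-list comes from fontconfig, which Ubuntu usually ships):
# poppler-utils provides pdftoppm, which is used to render PDF pages to images
pdftoppm -v
# The Caladea/Carlito fonts should now be registered with fontconfig
fc-list | grep -iE "caladea|carlito"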
🌱 Creating a Python Environment
It's recommended to use conda for environment management:
conda create -n olmocr python=3.11
conda activate olmocr
# Clone the repository and install
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
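A quick, optional check that the editable install succeeded:
# The package should now be importable from the active conda environment
python -c "import olmocr; print('olmocr imported OK')"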
⚡ Installing Acceleration Components
Want to use GPU acceleration? These two commands are essential:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
🚀 Quick Start: PDF Conversion in Action
Single File Conversion
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
Batch Processing
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
The conversion results are saved as JSONL files in the ./localworkspace/results directory. View them with this command:
cat localworkspace/results/output_*.jsonl
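Each line of the output is one JSON document. Two quick ways to inspect it (assuming the Dolma-style text field that olmOCR writes; the path pattern matches the command above):
# Pretty-print the first record of the first output file to see its structure
head -n 1 "$(ls localworkspace/results/output_*.jsonl | head -n 1)" | python -m json.tool
# Dump only the extracted text of every record
python -c "import glob, json; [print(json.loads(l)['text']) for f in glob.glob('localworkspace/results/output_*.jsonl') for l in open(f)]"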
👀 Visualization and Comparison Tool
Want a visual comparison between the original PDF and the converted result? Try this:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open the HTML files in the generated dolma_previews directory, and you'll see a side-by-side comparison of the original PDF and the extracted text.
🧰 Advanced Usage
Million-Scale PDF Processing
For enterprise-scale jobs spanning millions of PDFs, you can spread the work across an AWS cluster with an S3 workspace:
# Initialize on the first node
python -m olmocr.pipeline s3://my_bucket/workspace --pdfs s3://my_bucket/pdfs/*.pdf
# Other nodes join the cluster
python -m olmocr.pipeline s3://my_bucket/workspace
Viewing the Full Parameter List
python -m olmocr.pipeline --help
💻 For Docker Enthusiasts
The official repository ships a ready-made inference Dockerfile, so you can pull the image or extend it as a base:
FROM allenai/olmocr-inference:latest
# See the project's Dockerfile for the full build details:
# https://github.com/allenai/olmocr/blob/main/scripts/beaker/Dockerfile-inference
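If you'd rather build and run locally, here is a minimal sketch, assuming an NVIDIA GPU with the NVIDIA Container Toolkit installed; the image tag, mounted paths, and sample.pdf filename are placeholders, and the build context and entrypoint may differ from the official setup:
# Build the inference image from the repository's Dockerfile (run from the repo root)
docker build -f scripts/beaker/Dockerfile-inference -t olmocr-inference:local .
# Run a conversion with GPU access, mounting local folders for input PDFs and results
docker run --rm --gpus all \
  -v "$PWD/pdfs:/data/pdfs" \
  -v "$PWD/workspace:/data/workspace" \
  olmocr-inference:local \
  python -m olmocr.pipeline /data/workspace --pdfs /data/pdfs/sample.pdf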
❓ FAQ
- GPU errors? Verify your graphics card driver and CUDA version (a quick check is sketched after this list). Newer cards such as the RTX 4090, L40S, A100, or H100 are recommended.
- Support for non-English PDFs? olmOCR is currently optimized for English documents, but you can try other languages via the --apply_filter parameter.
- Insufficient disk space? Make sure you have at least 30 GB of free space; for large files, mounting an SSD is recommended.
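A minimal way to check the two hardware-related items above (assumes the NVIDIA driver utilities are installed; the workspace path is just the example used earlier):
# Show the installed NVIDIA driver and the CUDA version it reports
nvidia-smi
# Check free disk space on the drive that holds your workspace
df -h ./localworkspace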
👏 Acknowledgements
olmOCR is developed by the Allen Institute for AI (AI2) and is open-sourced under the Apache 2.0 license. Special thanks to the development team for their contributions (see the full contributor list on GitHub).
Give it a try now! If you encounter any issues, feel free to ask in the Discord community~🎉