Step-by-Step Guide to Local Deployment of olmOCR: Making PDF Processing a Breeze!

olmOCR Team
March 1, 2025
Attention all PDF wranglers! Today, I'm introducing a fantastic tool—olmOCR—that enables language models to easily understand PDFs with even the most unconventional layouts! From academic papers to complex tables, it handles them all. Best of all, it supports local deployment, ensuring your data stays secure! Below, I'll guide you step-by-step through installation and usage👇
🛠️ Preparation: Installing Dependencies
First, we need to install a few system-level dependencies (using Ubuntu as an example):
# Install all required system dependencies in one go
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
💡 Tip: If a font license agreement appears during installation, press the TAB key to select <Yes> and confirm!
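Optionally, you can confirm the key pieces are in place (a quick sanity check; fc-list comes from fontconfig, which Ubuntu usually ships):
# poppler-utils provides pdftoppm, which is used to render PDF pages to images
pdftoppm -v
# The Caladea/Carlito fonts should now be registered with fontconfig
fc-list | grep -iE "caladea|carlito"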
🌱 Creating a Python Environment
It's recommended to use conda for environment management:
conda create -n olmocr python=3.11
conda activate olmocr
# Clone the repository and install
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
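A quick, optional check that the editable install succeeded:
# The package should now be importable from the active conda environment
python -c "import olmocr; print('olmocr imported OK')"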
⚡ Installing Acceleration Components
Want to use GPU acceleration? These two commands are essential:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
🚀 Quick Start: PDF Conversion in Action
Single File Conversion
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
Batch Processing
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
The conversion results are saved as JSONL files in the ./localworkspace/results directory. View them with this command:
cat localworkspace/results/output_*.jsonl
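Each line of the output is one JSON document. Two quick ways to inspect it (assuming the Dolma-style text field that olmOCR writes; the path pattern matches the command above):
# Pretty-print the first record of the first output file to see its structure
head -n 1 "$(ls localworkspace/results/output_*.jsonl | head -n 1)" | python -m json.tool
# Dump only the extracted text of every record
python -c "import glob, json; [print(json.loads(l)['text']) for f in glob.glob('localworkspace/results/output_*.jsonl') for l in open(f)]"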
👀 Visualization and Comparison Tool
Want a visual comparison between the original PDF and the converted result? Try this:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open the HTML files in the generated dolma_previews directory, and you'll see a side-by-side comparison of the original PDF and the extracted text.
🧰 Advanced Usage
Million-Scale PDF Processing
For enterprise-scale jobs spanning millions of PDFs, you can spread the work across an AWS cluster with an S3 workspace:
# Initialize on the first node
python -m olmocr.pipeline s3://my_bucket/workspace --pdfs s3://my_bucket/pdfs/*.pdf
# Other nodes join the cluster
python -m olmocr.pipeline s3://my_bucket/workspace
Viewing the Full Parameter List
python -m olmocr.pipeline --help
💻 For Docker Enthusiasts
The official repository ships a ready-made inference Dockerfile, so you can pull the image or extend it as a base:
FROM allenai/olmocr-inference:latest
# See the project's Dockerfile for the full build details:
# https://github.com/allenai/olmocr/blob/main/scripts/beaker/Dockerfile-inference
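If you'd rather build and run locally, here is a minimal sketch, assuming an NVIDIA GPU with the NVIDIA Container Toolkit installed; the image tag, mounted paths, and sample.pdf filename are placeholders, and the build context and entrypoint may differ from the official setup:
# Build the inference image from the repository's Dockerfile (run from the repo root)
docker build -f scripts/beaker/Dockerfile-inference -t olmocr-inference:local .
# Run a conversion with GPU access, mounting local folders for input PDFs and results
docker run --rm --gpus all \
  -v "$PWD/pdfs:/data/pdfs" \
  -v "$PWD/workspace:/data/workspace" \
  olmocr-inference:local \
  python -m olmocr.pipeline /data/workspace --pdfs /data/pdfs/sample.pdf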
❓ FAQ
- GPU errors? Verify your graphics card driver and CUDA version (a quick check is sketched after this list). Newer cards such as the RTX 4090, L40S, A100, or H100 are recommended.
- Support for non-English PDFs? olmOCR is currently optimized for English documents, but you can try other languages via the --apply_filter parameter.
- Insufficient disk space? Make sure you have at least 30 GB of free space; for large files, mounting an SSD is recommended.
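A minimal way to check the two hardware-related items above (assumes the NVIDIA driver utilities are installed; the workspace path is just the example used earlier):
# Show the installed NVIDIA driver and the CUDA version it reports
nvidia-smi
# Check free disk space on the drive that holds your workspace
df -h ./localworkspace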
👏 Acknowledgements
olmOCR is developed by the Allen Institute for AI (AI2) and is open-sourced under the Apache 2.0 license. Special thanks to the development team for their contributions (see the full contributor list on GitHub).
Give it a try now! If you encounter any issues, feel free to ask in the Discord community~🎉