Ollama-OCR: Advanced OCR with Vision Language Models via Ollama

Summary
Ollama-OCR is a robust Python package and Streamlit application for Optical Character Recognition. It leverages state-of-the-art vision language models, accessible through Ollama, to accurately extract text from both images and PDF documents. The tool offers extensive features including support for multiple models, various output formats, and batch processing capabilities.
Introduction
Ollama-OCR is a powerful and versatile Python package designed for Optical Character Recognition (OCR). It leverages state-of-the-art vision language models, made accessible through Ollama, to accurately extract text from both images and PDF documents. With over 2000 stars and 200 forks on GitHub, this project demonstrates significant community interest and utility. It is available as a Python library for programmatic use and also features a user-friendly Streamlit web application.
Installation
To get started with Ollama-OCR, you first need to install Ollama and pull the desired vision models.
Prerequisites
- Install Ollama: Follow the instructions on the Ollama website.
- Pull Required Models: Use the ollama pull command for models like llama3.2-vision:11b, granite3.2-vision, moondream, or minicpm-v.
ollama pull llama3.2-vision:11b
ollama pull granite3.2-vision
ollama pull moondream
ollama pull minicpm-v
Package Installation
Install the Ollama-OCR Python package using pip:
pip install ollama-ocr
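Before using the package, it can help to confirm that the local Ollama server is actually reachable. The snippet below is a minimal sketch, assuming Ollama's default endpoint at http://localhost:11434 and its /api/tags endpoint for listing local models; the helper name is ours, not part of the ollama-ocr package:

```python
import urllib.request
import urllib.error

def ollama_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama reachable:", ollama_reachable())
```

If this prints False, start the server (e.g. with ollama serve) before running any OCR calls.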
Examples
Ollama-OCR provides flexible options for processing single files or batches, and also includes a Streamlit application for a graphical interface.
Single File Processing
Process an individual image or PDF file with a specified model and output format:
from ollama_ocr import OCRProcessor

# Initialize the OCR processor; you can pass your custom Ollama API URL via base_url
ocr = OCRProcessor(model_name='llama3.2-vision:11b', base_url="http://host.docker.internal:11434/api/generate")

# Process an image or PDF
result = ocr.process_image(
    image_path="path/to/your/image.png",  # or "path/to/your/file.pdf"
    format_type="markdown",  # Options: markdown, text, json, structured, key_value, table
    custom_prompt="Extract all text, focusing on dates and names.",  # Optional custom prompt
    language="English"  # Specify the language of the text
)
print(result)
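When format_type="json" is used, the returned string can be parsed before downstream use. The sketch below uses a made-up sample result; real model output may not always be valid JSON, so a fallback is included:

```python
import json

def parse_ocr_json(result: str) -> dict:
    """Try to parse an OCR result string as JSON; fall back to the raw text."""
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        return {"raw_text": result}

# Hypothetical result string for illustration only
sample = '{"invoice_date": "2024-01-15", "customer": "Acme Corp"}'
parsed = parse_ocr_json(sample)
print(parsed["invoice_date"])  # → 2024-01-15
```

The fallback keeps the pipeline from crashing when the model emits prose instead of strict JSON.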
Batch File Processing
Efficiently process multiple images or PDFs in parallel from a directory:
from ollama_ocr import OCRProcessor
# Initialize OCR processor with parallel workers
ocr = OCRProcessor(model_name='llama3.2-vision:11b', max_workers=4)
# Process multiple files with progress tracking
batch_results = ocr.process_batch(
    input_path="path/to/images/folder",  # Directory or list of image paths
    format_type="markdown",
    recursive=True,  # Search subdirectories
    preprocess=True,  # Enable image preprocessing
    custom_prompt="Extract all text, focusing on dates and names.",  # Optional custom prompt
    language="English"  # Specify the language of the text
)

# Access results
for file_path, text in batch_results['results'].items():
    print(f"\nFile: {file_path}")
    print(f"Extracted Text: {text}")
# View statistics
print("\nProcessing Statistics:")
print(f"Total images: {batch_results['statistics']['total']}")
print(f"Successfully processed: {batch_results['statistics']['successful']}")
print(f"Failed: {batch_results['statistics']['failed']}")
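The batch results can also be persisted to disk alongside a success-rate summary. This is a sketch assuming the 'results' and 'statistics' keys shown above; the helper function and output directory are our own convention, not part of the package:

```python
from pathlib import Path

def save_batch_results(batch_results: dict, out_dir: str = "ocr_output") -> float:
    """Write each extracted text to a .md file and return the success rate."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for file_path, text in batch_results["results"].items():
        # Name each output file after the source image's stem
        (out / (Path(file_path).stem + ".md")).write_text(text, encoding="utf-8")
    stats = batch_results["statistics"]
    return stats["successful"] / stats["total"] if stats["total"] else 0.0

# Illustrative input mirroring the structure returned by process_batch
demo = {
    "results": {"scans/a.png": "# Page A", "scans/b.png": "# Page B"},
    "statistics": {"total": 2, "successful": 2, "failed": 0},
}
print(f"Success rate: {save_batch_results(demo):.0%}")  # → Success rate: 100%
```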
Streamlit Web Application
For a user-friendly experience, the project includes a Streamlit web application that supports batch processing, drag-and-drop uploads, and real-time results.
- Clone the repository:
  git clone https://github.com/imanoop7/Ollama-OCR.git
  cd Ollama-OCR
- Install dependencies:
  pip install -r requirements.txt
- Navigate to the src/ollama_ocr directory:
  cd src/ollama_ocr
- Run the Streamlit app:
  streamlit run app.py
Why Use Ollama-OCR?
Ollama-OCR stands out due to its comprehensive feature set and flexibility:
- Extensive Model Support: Integrate with various powerful vision language models like LLaVA, Llama 3.2 Vision, Granite3.2-vision, Moondream, and Minicpm-v.
- Versatile Output Formats: Obtain extracted text in Markdown, Plain Text, JSON, Structured, Key-Value Pairs, or Table formats, catering to diverse application needs.
- Batch Processing: Efficiently handle multiple documents with parallel processing and progress tracking.
- Custom Prompts: Tailor text extraction with custom instructions to focus on specific information.
- PDF and Image Support: Process a wide range of document types.
- User-Friendly Streamlit App: A responsive web interface simplifies usage for non-developers, offering drag-and-drop functionality and real-time results.
- Language Selection: Improve OCR accuracy by specifying the language of the text.
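The output-format choice above also suggests a sensible file extension when saving results to disk. The mapping below is a hypothetical convention of ours, not something defined by the package:

```python
# Hypothetical mapping from format_type to a file extension (our convention)
FORMAT_EXTENSIONS = {
    "markdown": ".md",
    "text": ".txt",
    "json": ".json",
    "structured": ".json",
    "key_value": ".txt",
    "table": ".csv",
}

def output_filename(source: str, format_type: str) -> str:
    """Derive an output filename for an OCR result from its source path."""
    ext = FORMAT_EXTENSIONS.get(format_type, ".txt")
    stem = source.rsplit(".", 1)[0]  # Drop the source extension
    return stem + ext

print(output_filename("invoice.png", "markdown"))  # → invoice.md
```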
Links
- GitHub Repository: imanoop7/Ollama-OCR
- Ollama Official Website: ollama.com
- LLaVA Model: ollama.com/library/llava
- Granite3.2-vision Model: ollama.com/library/granite3.2-vision
- Moondream Model: ollama.com/library/moondream
- Minicpm-v Model: ollama.com/library/minicpm-v
- Example Notebooks: Ollama-OCR Example Notebooks