Ollama-OCR: Advanced OCR with Vision Language Models via Ollama

Summary

Ollama-OCR is a robust Python package and Streamlit application for Optical Character Recognition. It leverages state-of-the-art vision language models, accessible through Ollama, to accurately extract text from both images and PDF documents. The tool offers extensive features including support for multiple models, various output formats, and batch processing capabilities.

Repository Info

Updated on October 12, 2025

Introduction

Ollama-OCR is a powerful and versatile Python package designed for Optical Character Recognition (OCR). It leverages state-of-the-art vision language models, made accessible through Ollama, to accurately extract text from both images and PDF documents. With over 2000 stars and 200 forks on GitHub, this project demonstrates significant community interest and utility. It is available as a Python library for programmatic use and also features a user-friendly Streamlit web application.

Installation

To get started with Ollama-OCR, you first need to install Ollama and pull the desired vision models.

Prerequisites

  1. Install Ollama: Follow the instructions on the Ollama website.
  2. Pull Required Models: Use the ollama pull command for models like llama3.2-vision:11b, granite3.2-vision, moondream, or minicpm-v.
ollama pull llama3.2-vision:11b
ollama pull granite3.2-vision
ollama pull moondream
ollama pull minicpm-v
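
After pulling, you can confirm the models are available. Ollama's REST API exposes a GET /api/tags endpoint that lists installed models; the helper below (an illustrative, stdlib-only sketch, not part of ollama-ocr) parses that response:

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Return model names from the JSON body of Ollama's GET /api/tags response.

    The response has the shape {"models": [{"name": "..."}, ...]}.
    """
    data = json.loads(tags_json)
    return [model["name"] for model in data.get("models", [])]
```

Fetch the JSON with any HTTP client, e.g. `urllib.request.urlopen("http://localhost:11434/api/tags")`, then check that `"llama3.2-vision:11b"` appears in the returned list before running OCR.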

Package Installation

Install the Ollama-OCR Python package using pip:

pip install ollama-ocr

Examples

Ollama-OCR provides flexible options for processing single files or batches, and also includes a Streamlit application for a graphical interface.

Single File Processing

Process an individual image or PDF file with a specified model and output format:

from ollama_ocr import OCRProcessor

# Initialize the OCR processor. base_url is optional and defaults to the
# local Ollama endpoint (http://localhost:11434/api/generate); the
# host.docker.internal form below is for reaching Ollama from inside Docker.
ocr = OCRProcessor(model_name='llama3.2-vision:11b', base_url="http://host.docker.internal:11434/api/generate")

# Process an image or PDF
result = ocr.process_image(
    image_path="path/to/your/image.png", # or "path/to/your/file.pdf"
    format_type="markdown",  # Options: markdown, text, json, structured, key_value, table
    custom_prompt="Extract all text, focusing on dates and names.", # Optional custom prompt
    language="English" # Specify the language of the text
)
print(result)
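
Since process_image returns the extracted text as a plain string, persisting it is straightforward. The hypothetical save_result helper below is a stdlib-only sketch that writes the result next to the source file; the extension mapping is an assumption based on the format options listed above:

```python
from pathlib import Path

# Assumed mapping from format_type values to file extensions.
EXTENSIONS = {
    "markdown": ".md",
    "text": ".txt",
    "json": ".json",
    "structured": ".json",
    "key_value": ".txt",
    "table": ".csv",
}

def save_result(source_path: str, text: str, format_type: str = "markdown") -> Path:
    """Write extracted text alongside the source file, with a matching extension."""
    out_path = Path(source_path).with_suffix(EXTENSIONS.get(format_type, ".txt"))
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

For example, `save_result("scans/invoice.png", result)` would produce `scans/invoice.md`.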

Batch File Processing

Efficiently process multiple images or PDFs in parallel from a directory:

from ollama_ocr import OCRProcessor

# Initialize OCR processor with parallel workers
ocr = OCRProcessor(model_name='llama3.2-vision:11b', max_workers=4)

# Process multiple files with progress tracking
batch_results = ocr.process_batch(
    input_path="path/to/images/folder",  # Directory or list of image paths
    format_type="markdown",
    recursive=True,  # Search subdirectories
    preprocess=True,  # Enable image preprocessing
    custom_prompt="Extract all text, focusing on dates and names.", # Optional custom prompt
    language="English" # Specify the language of the text
)
# Access results
for file_path, text in batch_results['results'].items():
    print(f"\nFile: {file_path}")
    print(f"Extracted Text: {text}")

# View statistics
print("\nProcessing Statistics:")
print(f"Total images: {batch_results['statistics']['total']}")
print(f"Successfully processed: {batch_results['statistics']['successful']}")
print(f"Failed: {batch_results['statistics']['failed']}")
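
The statistics above come from the package itself; if you only have the results mapping, equivalent numbers can be recomputed independently. The summarize_batch helper below is an illustrative sketch that treats empty or missing text as a failure (an assumption for illustration; the package's own failure reporting may differ):

```python
def summarize_batch(results: dict) -> dict:
    """Compute simple statistics from a {file_path: extracted_text} mapping.

    Empty or None text counts as a failure here; this is an assumed
    convention, independent of the package's 'statistics' block.
    """
    failed_files = [path for path, text in results.items() if not text]
    return {
        "total": len(results),
        "successful": len(results) - len(failed_files),
        "failed": len(failed_files),
        "failed_files": failed_files,
    }
```

Keeping the list of failed files makes it easy to retry just those inputs in a second pass.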

Streamlit Web Application

For a user-friendly experience, the project includes a Streamlit web application that supports batch processing, drag-and-drop uploads, and real-time results.

  1. Clone the repository:
    git clone https://github.com/imanoop7/Ollama-OCR.git
    cd Ollama-OCR
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Navigate to the src/ollama_ocr directory:
    cd src/ollama_ocr
    
  4. Run the Streamlit app:
    streamlit run app.py
    

Why Use Ollama-OCR?

Ollama-OCR stands out due to its comprehensive feature set and flexibility:

  • Extensive Model Support: Integrate with various powerful vision language models like LLaVA, Llama 3.2 Vision, Granite3.2-vision, Moondream, and Minicpm-v.
  • Versatile Output Formats: Obtain extracted text in Markdown, Plain Text, JSON, Structured, Key-Value Pairs, or Table formats, catering to diverse application needs.
  • Batch Processing: Efficiently handle multiple documents with parallel processing and progress tracking.
  • Custom Prompts: Tailor text extraction with custom instructions to focus on specific information.
  • PDF and Image Support: Process a wide range of document types.
  • User-Friendly Streamlit App: A responsive web interface simplifies usage for non-developers, offering drag-and-drop functionality and real-time results.
  • Language Selection: Improve OCR accuracy by specifying the language of the text.
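
As a concrete illustration of the key_value format: output in that mode typically consists of `Key: Value` lines, which are easy to post-process. The parse_key_value function below is a sketch (the exact line format produced by a given model may vary):

```python
def parse_key_value(text: str) -> dict:
    """Parse 'Key: Value' lines into a dict; lines without a colon are skipped."""
    pairs = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs
```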
