text-extract-api: Advanced Document Extraction, OCR, and PII Removal with LLMs

Introduction

The text-extract-api is a self-hosted, privacy-focused service for document processing. Built with FastAPI, Celery, and Redis, it combines OCR with Ollama-supported Large Language Models (LLMs) to extract, parse, and transform content from document types such as PDF, Word, and PPTX files. Because it is self-hosted, no data needs to leave your environment, which makes it well suited to processing sensitive information.

Installation

Getting text-extract-api up and running is straightforward, whether you prefer a native setup or Docker.

Prerequisites

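Before you begin, ensure you have Git to clone the repository, Python for the native setup and the CLI client, and Docker with Docker Compose if you prefer the containerized setup. The LLM features rely on models served through Ollama.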

Clone the Repository

Start by cloning the official repository:

git clone https://github.com/CatchTheTornado/text-extract-api.git
cd text-extract-api

Local Setup with Makefile

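For a quick local setup, you can use the provided Makefile. As its name suggests, the DISABLE_VENV=1 prefix tells the Makefile not to create a Python virtual environment; omit it if you want the default behavior: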

DISABLE_VENV=1 make install
DISABLE_VENV=1 make run

Docker Setup

For containerized deployment, use Docker Compose. First, copy the example environment file:

cp .env.example .env

Then, build and run the containers:

docker-compose up --build

For GPU support, use:

docker-compose -f docker-compose.gpu.yml -p text-extract-api-gpu up --build

Refer to the official README for detailed manual installation steps and specific dependencies for different operating systems.

Examples

The text-extract-api ships with a CLI client (client/cli.py) for working with the API. Here are a few examples to get you started:

Pull LLM Models

Before using LLM features, pull the necessary models:

python client/cli.py llm_pull --model llama3.1
python client/cli.py llm_pull --model llama3.2-vision
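
If you pull several models regularly, the two commands above can be wrapped in a small script. The following is a minimal sketch, not part of the project: build_pull_command and pull_models are hypothetical helper names, and the only CLI arguments used are the llm_pull subcommand and --model flag shown above. It assumes you run it from the repository root.

```python
import subprocess
import sys


def build_pull_command(model, cli_path="client/cli.py"):
    """Build the argument list for one llm_pull call (hypothetical helper)."""
    # Uses only the subcommand and flag shown in the article.
    return [sys.executable, cli_path, "llm_pull", "--model", model]


def pull_models(models, runner=subprocess.run):
    """Pull each model in turn, stopping at the first failing command."""
    for model in models:
        # check=True makes subprocess.run raise CalledProcessError
        # when the CLI exits with a non-zero status.
        runner(build_pull_command(model), check=True)
```

Calling pull_models(["llama3.1", "llama3.2-vision"]) runs the same two commands in sequence. The runner parameter exists only so the function can be exercised without launching a process.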

Convert Document to Markdown or JSON

Upload a PDF file for OCR processing and conversion. The --prompt_file option supplies a prompt for LLM post-processing, and --storage_filename saves the result to disk under a templated path.

python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en --storage_filename "reports/{Y}/{file_name}-{Y}-{mm}-{dd}.md"

This command processes an MRI report, removes PII, and saves the output as a Markdown file, demonstrating both OCR and LLM capabilities. Screenshots in the repository's README illustrate converting MRI reports to Markdown and JSON, and invoices with PII removal.
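
To make the --storage_filename template in the command above more concrete, here is a rough sketch of how such placeholders could expand. It is an illustration only, not the project's implementation: expand_storage_filename is a hypothetical function, it handles just the four placeholders that appear in the example ({Y}, {mm}, {dd}, {file_name}), and it assumes {file_name} means the uploaded file's name without its extension.

```python
from datetime import date
from pathlib import PurePosixPath


def expand_storage_filename(template, source_path, today):
    """Expand the placeholders used in the example (hypothetical helper)."""
    values = {
        # Assumption: {file_name} is the source file name without extension.
        "file_name": PurePosixPath(source_path).stem,
        "Y": f"{today.year:04d}",
        "mm": f"{today.month:02d}",
        "dd": f"{today.day:02d}",
    }
    result = template
    for name, value in values.items():
        # Plain replacement leaves any unknown {placeholder} untouched.
        result = result.replace("{" + name + "}", value)
    return result


print(expand_storage_filename(
    "reports/{Y}/{file_name}-{Y}-{mm}-{dd}.md",
    "examples/example-mri.pdf",
    date(2025, 1, 7),
))
# prints: reports/2025/example-mri-2025-01-07.md
```

Under these assumptions, results are grouped into one directory per year, with the processing date embedded in each file name.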

Get OCR Result by Task ID

After uploading a file, you receive a task ID. You can retrieve the processing result using this ID:

python client/cli.py result --task_id {your_task_id_from_upload_step}
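
Because processing runs asynchronously, a client usually has to ask for the result more than once. The sketch below shows one way to poll until a task finishes. It makes several assumptions: wait_for_result and fetch_status are hypothetical names, fetch_status is whatever function you write to query the API for a task, and the response is assumed to be a dictionary with "state", "result", and "error" keys. The "SUCCESS" and "FAILURE" values are Celery's standard task states; whether the API reports them in this shape is not confirmed here.

```python
import time


def wait_for_result(fetch_status, task_id, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(task_id) until the task succeeds, fails, or times out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(task_id)
        state = status.get("state")
        if state == "SUCCESS":
            return status.get("result")
        if state == "FAILURE":
            raise RuntimeError(f"task {task_id} failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        # Any other state (for example PENDING) means: wait and ask again.
        sleep(interval)
```

The sleep and clock parameters default to the standard library functions and are only there to make the loop easy to test.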

Online Demo

You can also try a hosted version of the application by pointing the CLI tool at the project's cloud instance. See the online demo for details.

Why Use It

Choosing text-extract-api offers several compelling advantages for document processing:

Data Privacy and Security: Processing runs entirely on-premises with no required external cloud services, so sensitive data remains within your control.
High Accuracy OCR: Integrates EasyOCR alongside vision models such as MiniCPM-V and Llama 3.2 Vision for OCR across various document types and languages.
LLM-Enhanced Processing: Leverage Ollama-supported LLMs to improve OCR results, fix spelling, extract structured JSON, and perform advanced tasks like PII removal.
Flexible Output Formats: Convert documents and images into highly accurate Markdown text or structured JSON, adapting to your application's needs.
Scalable and Robust Architecture: Built with FastAPI and Celery for asynchronous task processing and Redis for caching, supporting distributed workloads.
Versatile Storage Options: Supports various storage strategies, including local filesystem, Google Drive, and Amazon S3, for managing your extracted data.
Developer-Friendly: Provides a comprehensive CLI tool and API clients (e.g., TypeScript) for easy integration and automation.
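
The caching mentioned in the architecture point, and the --ocr_cache flag used in the upload example, can be pictured as a lookup keyed by the document's content. The following is a conceptual sketch only: ocr_cache_key and cached_ocr are hypothetical names, the key format is invented for illustration, and a plain dictionary stands in for Redis. It does not describe how the project actually builds its cache keys.

```python
import hashlib


def ocr_cache_key(file_bytes, engine, language):
    """Derive a cache key from file content and OCR settings (hypothetical scheme)."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{engine}:{language}:{digest}"


def cached_ocr(file_bytes, engine, language, run_ocr, cache):
    """Return cached text when available; otherwise run OCR and store the result."""
    key = ocr_cache_key(file_bytes, engine, language)
    if key in cache:
        return cache[key]
    text = run_ocr(file_bytes)  # the expensive step, skipped on a cache hit
    cache[key] = text
    return text
```

Hashing the content rather than the file name means that re-uploading the same document under a different name still hits the cache, while changing the engine or language produces a separate entry.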