DocExtract - An intelligent document processing application that extracts structured data from invoices, contracts, and receipts using local LLMs and OCR. 100% local processing, zero cloud dependencies.

Rayaanxrio/DocExtract

DocExtract - Intelligent Document Processing

AI-powered document extraction running 100% locally

DocExtract is a full-stack web application for business documents (invoices, contracts, receipts). Users upload files, the backend extracts structured fields with OCR and a local LLM, and users can then review and edit those fields and export the data as CSV or JSON.

🔒 Privacy First: All processing happens locally. No cloud APIs, no data leaves your infrastructure.


🏗️ Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                             DocExtract System                              │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌─────────────────┐         ┌─────────────────────────────────────────┐   │
│  │                 │         │            Backend (FastAPI)            │   │
│  │   Frontend      │  HTTP   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  │   │
│  │   (Next.js)     │◄───────►│  │ Upload  │  │Documents│  │ Export  │  │   │
│  │                 │   API   │  │ Router  │  │ Router  │  │ Router  │  │   │
│  │  ┌───────────┐  │         │  └────┬────┘  └────┬────┘  └────┬────┘  │   │
│  │  │ Dashboard │  │         │       │            │            │       │   │
│  │  ├───────────┤  │         │       ▼            ▼            ▼       │   │
│  │  │  Upload   │  │         │  ┌─────────────────────────────────┐    │   │
│  │  ├───────────┤  │         │  │        Services Layer           │    │   │
│  │  │ Documents │  │         │  │  ┌─────────┐  ┌─────────────┐   │    │   │
│  │  ├───────────┤  │         │  │  │   OCR   │  │     LLM     │   │    │   │
│  │  │  Groups   │  │         │  │  │ Service │  │   Service   │   │    │   │
│  │  ├───────────┤  │         │  │  │(Tesser.)│  │ (Ollama/HF) │   │    │   │
│  │  │  Export   │  │         │  │  └────┬────┘  └──────┬──────┘   │    │   │
│  │  └───────────┘  │         │  │       │              │          │    │   │
│  │                 │         │  │       ▼              ▼          │    │   │
│  │  Tailwind CSS   │         │  │  ┌─────────────────────────┐    │    │   │
│  │  React          │         │  │  │  Extraction Service     │    │    │   │
│  └─────────────────┘         │  │  │  (Pipeline Orchestrator)│    │    │   │
│         │                    │  │  └────────────┬────────────┘    │    │   │
│         │                    │  └───────────────┼─────────────────┘    │   │
│         │                    │                  │                      │   │
│         │                    │                  ▼                      │   │
│         │                    │  ┌─────────────────────────────────┐    │   │
│         │                    │  │    Database (SQLite/Postgres)   │    │   │
│         │                    │  │  ┌──────────┐  ┌─────────────┐  │    │   │
│         │                    │  │  │Documents │  │ProcessingLog│  │    │   │
│         │                    │  │  └──────────┘  └─────────────┘  │    │   │
│         │                    │  └─────────────────────────────────┘    │   │
│         │                    └─────────────────────────────────────────┘   │
│         │                                                                  │
│         │     ┌─────────────────────────────────────────────────────┐      │
│         └────►│                  File Storage                       │      │
│               │                  (./uploads/)                       │      │
│               └─────────────────────────────────────────────────────┘      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Processing Pipeline:
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Upload  │───►│   OCR    │───►│  Clean   │───►│   LLM    │───►│  Store   │
│   File   │    │ Extract  │    │   Text   │    │ Analyze  │    │  Result  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
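
The five pipeline stages can be sketched in a few lines of Python. This is only an illustration of the flow, not the code in backend/services/extraction_service.py: the `Pipeline` class, the `clean_text` rules, and the in-memory `store` dict are all assumptions made for the example, and the OCR and LLM steps are passed in as plain callables so the sketch runs without Tesseract or Ollama.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


def clean_text(raw: str) -> str:
    """Normalize OCR output (assumed rules, not the project's actual ones)."""
    text = re.sub(r"-\n(?=\w)", "", raw)          # rejoin words split by a hyphen at a line break
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = "\n".join(line.strip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line in a row
    return text.strip()


@dataclass
class Pipeline:
    ocr: Callable[[str], str]          # file path -> raw OCR text
    analyze: Callable[[str], dict]     # cleaned text -> extracted fields
    store: dict = field(default_factory=dict)  # stand-in for the database

    def run(self, path: str) -> dict:
        raw = self.ocr(path)                       # OCR Extract
        cleaned = clean_text(raw)                  # Clean Text
        fields = self.analyze(cleaned)             # LLM Analyze
        record = {"path": path, "text": cleaned, "fields": fields}
        self.store[path] = record                  # Store Result
        return record
```

Passing the OCR and LLM steps in as callables keeps each stage replaceable, for example swapping Ollama for HuggingFace, without touching the orchestration.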

✨ Features

  • 📤 Multi-file Upload: Drag & drop up to 20 documents at once
  • 🔍 OCR Processing: Extract text from PDF, JPG, PNG using Tesseract
  • 🤖 AI Analysis: Local LLM extracts structured fields (Ollama or HuggingFace)
  • 📊 Document Types: Automatically classify as Invoice, Contract, Receipt
  • ✏️ Review & Edit: View OCR text and edit extracted fields
  • 📁 Smart Grouping: Group documents by type, vendor, or date
  • 📥 Export: Download data as CSV or JSON
  • 📈 Dashboard: Overview with statistics and confidence metrics
  • 🔐 100% Local: No cloud APIs, all processing on your machine
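
As a rough illustration of the Smart Grouping feature, the sketch below groups already-extracted records by type, vendor, or date. The record keys (`document_type`, `vendor`, `date`) and the choice to bucket dates by month are assumptions made for this example, not the schema defined in backend/models/document.py.

```python
from collections import defaultdict
from datetime import date


def group_key(doc: dict, by: str) -> str:
    """Return the bucket label for one record; missing values go to 'unknown'."""
    if by == "type":
        return doc.get("document_type") or "unknown"
    if by == "vendor":
        return doc.get("vendor") or "unknown"
    if by == "date":
        d = doc.get("date")
        # Assumption: group by calendar month, e.g. "2024-03".
        return d.strftime("%Y-%m") if isinstance(d, date) else "unknown"
    raise ValueError(f"unsupported grouping: {by!r}")


def group_documents(documents: list[dict], by: str) -> dict[str, list[dict]]:
    """Bucket records by 'type', 'vendor', or 'date', preserving input order."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in documents:
        groups[group_key(doc, by)].append(doc)
    return dict(groups)
```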

📋 Requirements

System Requirements

  • Python 3.10+
  • Node.js 18+
  • Tesseract OCR installed
  • Poppler (for PDF processing)
  • 8GB+ RAM recommended

LLM Requirements (choose one)

  • Option A: Ollama running locally with a model (e.g., llama3.2, mistral)
  • Option B: HuggingFace Transformers (TinyLlama or Phi-2)

🚀 Installation

1. Clone the Repository

git clone https://github.com/Rayaanxrio/DocExtract.git docextract
cd docextract

2. Install System Dependencies

Windows:

# Install Tesseract
winget install UB-Mannheim.TesseractOCR

# Install Poppler (for PDF support)
# Download from: https://github.com/oschwartz10612/poppler-windows/releases
# Add bin folder to PATH

# Install Ollama (for LLM)
winget install Ollama.Ollama

macOS:

brew install tesseract poppler
brew install ollama

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install tesseract-ocr poppler-utils
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
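
To confirm the system dependencies are visible on PATH after installing them, a small check along these lines may help. It is a sketch rather than part of the project: `find_tools` and `missing` are hypothetical helper names, and `pdftoppm` is used as the marker for Poppler because it is one of the executables Poppler ships.

```python
import shutil

# Executables implied by the requirements. "pdftoppm" ships with Poppler;
# "ollama" is only needed when using the Ollama LLM option.
REQUIRED = {"tesseract": "Tesseract OCR", "pdftoppm": "Poppler"}
OPTIONAL = {"ollama": "Ollama"}


def find_tools(names, which=shutil.which):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


def missing(found):
    """Return the sorted names of executables that were not found."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools([*REQUIRED, *OPTIONAL])
    for name, path in found.items():
        print(f"{name:10} {path or 'NOT FOUND'}")
```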

3. Setup Backend

cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Copy environment file
# Windows:
copy .env.example .env
# macOS/Linux:
cp .env.example .env

# Edit .env as needed (especially LLM settings)

4. Setup Frontend

# From the repository root
cd frontend

# Install dependencies
npm install

5. Setup LLM (Ollama)

# Pull a model (run Ollama first)
ollama pull llama3.2

# Or, for a smaller model:
ollama pull tinyllama

▢️ Running the Application

1. Start the Backend

cd backend

# Activate the virtual environment first
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Then run:
python main.py
# Or with uvicorn:
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Backend will be available at: http://localhost:8000
API Documentation: http://localhost:8000/docs

2. Start the Frontend

# In a second terminal, from the repository root
cd frontend
npm run dev

Frontend will be available at: http://localhost:3000

3. Start Ollama (if using)

ollama serve
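
Before uploading documents, it can be useful to confirm that Ollama is reachable and that the configured model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. The function names are hypothetical, and the defaults simply mirror the `OLLAMA_BASE_URL` and `OLLAMA_MODEL` values shown in the configuration section.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(names: list[str], wanted: str) -> bool:
    """True if `wanted` is among `names`, treating a missing tag as ':latest'."""
    def with_tag(name: str) -> str:
        return name if ":" in name else f"{name}:latest"
    return with_tag(wanted) in {with_tag(n) for n in names}


def check_ollama(base_url: str = "http://localhost:11434",
                 model: str = "llama3.2",
                 timeout: float = 5.0) -> str:
    """Return a one-line, human-readable status message."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            names = model_names(json.load(resp))
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        return f"Ollama not reachable at {base_url}: {exc}"
    if not has_model(names, model):
        return f"Ollama is running, but '{model}' is not pulled (found: {names})"
    return f"Ollama is running and '{model}' is available"


if __name__ == "__main__":
    print(check_ollama())
```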

📁 Project Structure

docextract/
├── backend/
│   ├── main.py              # FastAPI application entry point
│   ├── config.py            # Configuration settings
│   ├── database.py          # Database connection
│   ├── schemas.py           # Pydantic schemas
│   ├── utils.py             # Utility functions
│   ├── models/
│   │   ├── __init__.py
│   │   └── document.py      # SQLAlchemy models
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── upload.py        # Upload endpoints
│   │   ├── documents.py     # Document CRUD endpoints
│   │   ├── groups.py        # Grouping endpoints
│   │   └── export.py        # Export endpoints
│   ├── services/
│   │   ├── __init__.py
│   │   ├── ocr_service.py   # Tesseract OCR
│   │   ├── llm_service.py   # Local LLM integration
│   │   └── extraction_service.py  # Pipeline orchestration
│   ├── requirements.txt
│   └── .env.example
├── frontend/
│   ├── src/
│   │   ├── app/
│   │   │   ├── layout.tsx
│   │   │   ├── page.tsx     # Landing page
│   │   │   ├── globals.css
│   │   │   └── dashboard/
│   │   │       ├── layout.tsx
│   │   │       ├── page.tsx        # Dashboard
│   │   │       ├── upload/
│   │   │       ├── documents/
│   │   │       ├── groups/
│   │   │       └── export/
│   │   ├── components/
│   │   │   ├── Sidebar.tsx
│   │   │   ├── FileUpload.tsx
│   │   │   ├── DocumentTable.tsx
│   │   │   └── StatsCards.tsx
│   │   └── lib/
│   │       ├── api.ts       # API client
│   │       └── utils.ts     # Utility functions
│   ├── package.json
│   ├── tailwind.config.js
│   └── next.config.js
├── samples/
│   ├── sample_invoice.txt
│   ├── sample_contract.txt
│   └── sample_receipt.txt
└── README.md

⚙️ Configuration

Edit backend/.env to configure:

# LLM Provider: "ollama" or "huggingface"
LLM_PROVIDER=ollama

# Ollama settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

# HuggingFace settings (if using)
HUGGINGFACE_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Processing
CONFIDENCE_THRESHOLD=70.0
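
For readers wondering how these variables might be consumed, here is a minimal sketch of a typed settings loader. It is not the contents of backend/config.py: the `Settings` class and `load_settings` function are hypothetical, the defaults are copied from the example above, and the code reads variables that are already in the process environment (loading the `.env` file itself is left to whatever mechanism the backend uses).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    ollama_base_url: str
    ollama_model: str
    huggingface_model: str
    confidence_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, validating the provider."""
    provider = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if provider not in {"ollama", "huggingface"}:
        raise ValueError(f"LLM_PROVIDER must be 'ollama' or 'huggingface', got {provider!r}")
    return Settings(
        llm_provider=provider,
        # Strip a trailing slash so paths can be appended consistently.
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/"),
        ollama_model=env.get("OLLAMA_MODEL", "llama3.2"),
        huggingface_model=env.get("HUGGINGFACE_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "70.0")),
    )
```

Failing fast on an unknown `LLM_PROVIDER` surfaces a typo at startup rather than at the first extraction request.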

🔌 API Endpoints

Method  Endpoint                        Description
POST    /api/upload/                    Upload multiple files
GET     /api/documents/                 List documents (with filters)
GET     /api/documents/{id}             Get document details
PUT     /api/documents/{id}             Update document fields
POST    /api/documents/{id}/verify      Mark as verified
POST    /api/documents/{id}/reprocess   Reprocess document
DELETE  /api/documents/{id}             Delete document
GET     /api/documents/stats            Get dashboard statistics
GET     /api/groups/                    Get grouped documents
GET     /api/export/json                Export as JSON
GET     /api/export/csv                 Export as CSV
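
The endpoints can be called from any HTTP client. Below is a small standard-library sketch for scripting against the API. The helper names are hypothetical, the base URL is the backend address from "Running the Application", and two things are assumed rather than documented here: that these endpoints return JSON (the CSV export does not), and the shape of the PUT body. The interactive docs at http://localhost:8000/docs are the place to check the real request and response schemas.

```python
from __future__ import annotations

import json
import urllib.request

BASE_URL = "http://localhost:8000"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def build_request(method: str, path: str, body: dict | None = None) -> urllib.request.Request:
    """Create a request; a dict body is sent as JSON."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(endpoint(path), data=data, headers=headers, method=method)


def call(method: str, path: str, body: dict | None = None, timeout: float = 30.0):
    """Send the request and decode a JSON response (assumed response format)."""
    with urllib.request.urlopen(build_request(method, path, body), timeout=timeout) as resp:
        return json.load(resp)


# Example usage against a running backend (field name "vendor" is an assumption):
#   stats = call("GET", "/api/documents/stats")
#   call("PUT", "/api/documents/7", {"vendor": "ACME Corporation"})
#   call("POST", "/api/documents/7/verify")
```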

🧪 Sample Documents

The samples/ folder contains example documents:

  • sample_invoice.txt - Invoice from ACME Corporation
  • sample_contract.txt - Service Agreement Contract
  • sample_receipt.txt - Store receipt

Convert them to PDF or image files for testing, or read them to see what kind of content the extraction is expected to handle.


🔧 Troubleshooting

OCR Not Working

  • Ensure Tesseract is installed and in PATH
  • On Windows, you may need to set TESSERACT_CMD in .env

PDF Processing Fails

  • Ensure Poppler is installed and in PATH
  • On Windows, add Poppler's bin folder to PATH

LLM Timeout

  • Ensure Ollama is running (ollama serve)
  • Try a smaller model (tinyllama)
  • Increase LLM_TIMEOUT in .env

Low Confidence Scores

  • Ensure document images are clear and high resolution
  • Try increasing PDF_DPI in config
  • Documents with poor formatting may need manual review
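
One way to find the documents that need manual review is to compare each record's confidence against `CONFIDENCE_THRESHOLD`. The sketch below assumes each record is a dict with a `confidence` value on the same 0 to 100 scale as the threshold. Both the key name and the scale are assumptions for illustration, and `needs_review` is a hypothetical helper.

```python
def needs_review(documents: list[dict], threshold: float = 70.0) -> list[dict]:
    """Return records with missing or below-threshold confidence, lowest first."""
    flagged = [
        doc for doc in documents
        if doc.get("confidence") is None or doc["confidence"] < threshold
    ]
    # Records without a score sort first, since nothing is known about them.
    return sorted(
        flagged,
        key=lambda d: -1.0 if d.get("confidence") is None else d["confidence"],
    )
```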

📄 License

MIT License - See LICENSE file for details.


🙏 Acknowledgments
