AI-powered document extraction running 100% locally

DocExtract is a full-stack web application that lets users upload business documents (invoices, contracts, receipts), automatically extract structured information using OCR and a local LLM, review and edit the extracted fields, and export the data as CSV or JSON.
Privacy First: All processing happens locally. No cloud APIs; no data leaves your infrastructure.
```
                            DocExtract System

+-------------------+          +------------------------------------------+
|  Frontend         |   HTTP   |  Backend (FastAPI)                       |
|  (Next.js)        | <------> |                                          |
|                   |   API    |  Routers: Upload / Documents / Export    |
|  - Dashboard      |          |                     |                    |
|  - Upload         |          |  Services: OCR Service (Tesseract)       |
|  - Documents      |          |            LLM Service (Ollama/HF)       |
|  - Groups         |          |            Extraction Service            |
|  - Export         |          |            (pipeline orchestrator)       |
|                   |          |                     |                    |
|  Tailwind CSS,    |          |  Database (SQLite/Postgres):             |
|  React            |          |    Documents, ProcessingLog              |
+-------------------+          +---------------------+--------------------+
                                                     |
                                                     v
                                        File Storage (./uploads/)
```
Processing Pipeline:

```
+----------+     +----------+     +----------+     +----------+     +----------+
|  Upload  |---->|   OCR    |---->|  Clean   |---->|   LLM    |---->|  Store   |
|   File   |     | Extract  |     |   Text   |     | Analyze  |     |  Result  |
+----------+     +----------+     +----------+     +----------+     +----------+
```
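The five pipeline stages can be sketched as plain functions chained by an orchestrator. This is a minimal illustration, not the real implementation: `ocr_extract` and `llm_analyze` here are hard-coded stubs standing in for the Tesseract and LLM services in `backend/services/`.

```python
import json

def ocr_extract(path: str) -> str:
    # Stub: the real service runs Tesseract on the uploaded file.
    return "INVOICE\nVendor:  ACME Corp\nTotal: $1,234.50\n\n"

def clean_text(raw: str) -> str:
    # Collapse repeated whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)

def llm_analyze(text: str) -> dict:
    # Stub: the real service prompts a local LLM to return JSON fields.
    return {"document_type": "invoice", "vendor": "ACME Corp", "total": "1234.50"}

def process_document(path: str, store: dict) -> dict:
    raw = ocr_extract(path)                         # OCR Extract
    text = clean_text(raw)                          # Clean Text
    fields = llm_analyze(text)                      # LLM Analyze
    store[path] = {"text": text, "fields": fields}  # Store Result
    return store[path]

store = {}
result = process_document("invoice_001.pdf", store)
print(json.dumps(result["fields"]))
```

The real orchestrator (`extraction_service.py`) additionally records status and errors per stage in the ProcessingLog table.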
Features:

- Multi-file Upload: Drag & drop up to 20 documents at once
- OCR Processing: Extract text from PDF, JPG, and PNG using Tesseract
- AI Analysis: Local LLM extracts structured fields (Ollama or HuggingFace)
- Document Types: Automatic classification as Invoice, Contract, or Receipt
- Review & Edit: View OCR text and edit extracted fields
- Smart Grouping: Group documents by type, vendor, or date
- Export: Download data as CSV or JSON
- Dashboard: Overview with statistics and confidence metrics
- 100% Local: No cloud APIs; all processing happens on your machine
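The Smart Grouping feature boils down to bucketing documents by a chosen field. A small sketch; the field names (`document_type`, `vendor`) are illustrative, not necessarily the stored schema:

```python
from collections import defaultdict

def group_documents(docs: list, key: str) -> dict:
    # Bucket documents by the value of `key`, preserving input order.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get(key, "unknown")].append(doc)
    return dict(groups)

docs = [
    {"document_type": "invoice", "vendor": "ACME Corp"},
    {"document_type": "receipt", "vendor": "ACME Corp"},
    {"document_type": "invoice", "vendor": "Globex"},
]
by_type = group_documents(docs, "document_type")
print(sorted(by_type))  # ['invoice', 'receipt']
```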
Prerequisites:

- Python 3.10+
- Node.js 18+
- Tesseract OCR installed
- Poppler (for PDF processing)
- 8GB+ RAM recommended
- An LLM backend, either:
  - Option A: Ollama running locally with a model (e.g., llama3.2, mistral)
  - Option B: HuggingFace Transformers (TinyLlama or Phi-2)
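The choice between Option A and Option B maps to the `LLM_PROVIDER` setting in `.env`. A sketch of how provider selection might work at startup; the two client classes are hypothetical stubs, only the environment variable names come from `.env.example`:

```python
import os

class OllamaClient:
    def __init__(self, base_url: str, model: str):
        self.base_url, self.model = base_url, model

class HuggingFaceClient:
    def __init__(self, model: str):
        self.model = model

def make_llm_client():
    # Pick a backend based on LLM_PROVIDER, falling back to Ollama defaults.
    provider = os.getenv("LLM_PROVIDER", "ollama").lower()
    if provider == "ollama":
        return OllamaClient(
            os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
            os.getenv("OLLAMA_MODEL", "llama3.2"),
        )
    if provider == "huggingface":
        return HuggingFaceClient(
            os.getenv("HUGGINGFACE_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        )
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
```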
Installation:

Clone the repository:

```bash
git clone https://github.com/yourusername/docextract.git
cd docextract
```

Windows:

```bash
# Install Tesseract
winget install UB-Mannheim.TesseractOCR

# Install Poppler (for PDF support):
# download it from https://github.com/oschwartz10612/poppler-windows/releases
# and add the bin folder to your PATH

# Install Ollama (for the LLM)
winget install Ollama.Ollama
```

macOS:
```bash
brew install tesseract poppler
brew install ollama
```

Linux (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install tesseract-ocr poppler-utils

# Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
```

Backend setup:

```bash
cd backend

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
venv\Scripts\activate       # Windows
source venv/bin/activate    # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Copy the environment file
copy .env.example .env      # Windows
cp .env.example .env        # macOS/Linux

# Edit .env as needed (especially the LLM settings)
```

Frontend setup:

```bash
cd frontend

# Install dependencies
npm install
```

Pull an Ollama model (with Ollama running):

```bash
ollama pull llama3.2

# Or, for a smaller model:
ollama pull tinyllama
```

Run the backend:

```bash
cd backend

# Activate the virtual environment first
venv\Scripts\activate       # Windows
source venv/bin/activate    # macOS/Linux

# Then run:
python main.py

# Or with uvicorn:
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The backend will be available at http://localhost:8000, with API documentation at http://localhost:8000/docs.

Run the frontend:

```bash
cd frontend
npm run dev
```

The frontend will be available at http://localhost:3000.
If using Ollama, make sure the server is running:

```bash
ollama serve
```

Project structure:

```
docextract/
├── backend/
│   ├── main.py                    # FastAPI application entry point
│   ├── config.py                  # Configuration settings
│   ├── database.py                # Database connection
│   ├── schemas.py                 # Pydantic schemas
│   ├── utils.py                   # Utility functions
│   ├── models/
│   │   ├── __init__.py
│   │   └── document.py            # SQLAlchemy models
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── upload.py              # Upload endpoints
│   │   ├── documents.py           # Document CRUD endpoints
│   │   ├── groups.py              # Grouping endpoints
│   │   └── export.py              # Export endpoints
│   ├── services/
│   │   ├── __init__.py
│   │   ├── ocr_service.py         # Tesseract OCR
│   │   ├── llm_service.py         # Local LLM integration
│   │   └── extraction_service.py  # Pipeline orchestration
│   ├── requirements.txt
│   └── .env.example
├── frontend/
│   ├── src/
│   │   ├── app/
│   │   │   ├── layout.tsx
│   │   │   ├── page.tsx           # Landing page
│   │   │   ├── globals.css
│   │   │   └── dashboard/
│   │   │       ├── layout.tsx
│   │   │       ├── page.tsx       # Dashboard
│   │   │       ├── upload/
│   │   │       ├── documents/
│   │   │       ├── groups/
│   │   │       └── export/
│   │   ├── components/
│   │   │   ├── Sidebar.tsx
│   │   │   ├── FileUpload.tsx
│   │   │   ├── DocumentTable.tsx
│   │   │   └── StatsCards.tsx
│   │   └── lib/
│   │       ├── api.ts             # API client
│   │       └── utils.ts           # Utility functions
│   ├── package.json
│   ├── tailwind.config.js
│   └── next.config.js
├── samples/
│   ├── sample_invoice.txt
│   ├── sample_contract.txt
│   └── sample_receipt.txt
└── README.md
```
Edit backend/.env to configure:

```
# LLM Provider: "ollama" or "huggingface"
LLM_PROVIDER=ollama

# Ollama settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

# HuggingFace settings (if using)
HUGGINGFACE_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Processing
CONFIDENCE_THRESHOLD=70.0
```

API endpoints:

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/upload/ | Upload multiple files |
| GET | /api/documents/ | List documents (with filters) |
| GET | /api/documents/{id} | Get document details |
| PUT | /api/documents/{id} | Update document fields |
| POST | /api/documents/{id}/verify | Mark as verified |
| POST | /api/documents/{id}/reprocess | Reprocess document |
| DELETE | /api/documents/{id} | Delete document |
| GET | /api/documents/stats | Get dashboard statistics |
| GET | /api/groups/ | Get grouped documents |
| GET | /api/export/json | Export as JSON |
| GET | /api/export/csv | Export as CSV |
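The CSV export (`GET /api/export/csv`) amounts to flattening each document's extracted fields into one row. A sketch using the standard library; the column names are illustrative, since the real export uses the stored schema:

```python
import csv
import io

def export_csv(docs: list) -> str:
    # Write one CSV row per document; unknown keys are ignored.
    columns = ["id", "document_type", "vendor", "total"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for doc in docs:
        writer.writerow(doc)
    return buf.getvalue()

rows = [{"id": 1, "document_type": "invoice", "vendor": "ACME Corp", "total": "1234.50"}]
print(export_csv(rows).splitlines()[0])  # id,document_type,vendor,total
```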
The samples/ folder contains example documents:

- sample_invoice.txt: invoice from ACME Corporation
- sample_contract.txt: service agreement contract
- sample_receipt.txt: store receipt

Convert them to PDF or images for testing, or use them to understand the expected extraction output.
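For a feel of what extraction on text like sample_invoice.txt produces, here is a small pattern-based sketch. The labels (`Invoice Number`, `Total`) are assumptions about the sample's layout, and the real pipeline uses the LLM rather than regexes:

```python
import re

def extract_invoice_fields(text: str) -> dict:
    # Hypothetical fallback: pull a couple of fields with regexes.
    patterns = {
        "invoice_number": r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*(\S+)",
        "total": r"Total\s*:?\s*\$?([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            fields[name] = m.group(1)
    return fields

sample = "INVOICE\nInvoice Number: INV-1001\nTotal: $1,234.50\n"
print(extract_invoice_fields(sample))
```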
Troubleshooting:

Tesseract not found:
- Ensure Tesseract is installed and on your PATH
- On Windows, you may need to set `TESSERACT_CMD` in `.env`

PDF processing fails:
- Ensure Poppler is installed and on your PATH
- On Windows, add Poppler's bin folder to PATH

LLM errors or timeouts:
- Ensure Ollama is running (`ollama serve`)
- Try a smaller model (`tinyllama`)
- Increase `LLM_TIMEOUT` in `.env`

Poor extraction quality:
- Ensure document images are clear and high resolution
- Try increasing `PDF_DPI` in the config
- Documents with poor formatting may need manual review
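Documents that score below `CONFIDENCE_THRESHOLD` are the natural candidates for manual review. A minimal sketch, assuming each stored document carries a `confidence` field (an assumption for illustration):

```python
def needs_review(doc: dict, threshold: float = 70.0) -> bool:
    # Flag documents whose extraction confidence is below the threshold.
    return doc.get("confidence", 0.0) < threshold

docs = [{"id": 1, "confidence": 92.5}, {"id": 2, "confidence": 41.0}]
flagged = [d["id"] for d in docs if needs_review(d)]
print(flagged)  # [2]
```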
MIT License - See LICENSE file for details.