DocExtract - Intelligent Document Processing

AI-powered document extraction running 100% locally

DocExtract is a full-stack web application for processing business documents such as invoices, contracts, and receipts. Users upload files, the backend extracts structured information using OCR and a local LLM, and the extracted fields can then be reviewed, edited, and exported as CSV or JSON.

🔒 Privacy First: All processing happens locally. No cloud APIs, no data leaves your infrastructure.

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              DocExtract System                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐         ┌─────────────────────────────────────────┐   │
│  │                 │         │              Backend (FastAPI)           │   │
│  │   Frontend      │  HTTP   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  │   │
│  │   (Next.js)     │◄───────►│  │ Upload  │  │Documents│  │ Export  │  │   │
│  │                 │   API   │  │ Router  │  │ Router  │  │ Router  │  │   │
│  │  ┌───────────┐  │         │  └────┬────┘  └────┬────┘  └────┬────┘  │   │
│  │  │ Dashboard │  │         │       │            │            │       │   │
│  │  ├───────────┤  │         │       ▼            ▼            ▼       │   │
│  │  │  Upload   │  │         │  ┌─────────────────────────────────┐    │   │
│  │  ├───────────┤  │         │  │        Services Layer           │    │   │
│  │  │ Documents │  │         │  │  ┌─────────┐  ┌─────────────┐   │    │   │
│  │  ├───────────┤  │         │  │  │   OCR   │  │     LLM     │   │    │   │
│  │  │  Groups   │  │         │  │  │ Service │  │   Service   │   │    │   │
│  │  ├───────────┤  │         │  │  │(Tesser.)│  │(Ollama/HF)  │   │    │   │
│  │  │  Export   │  │         │  │  └────┬────┘  └──────┬──────┘   │    │   │
│  │  └───────────┘  │         │  │       │              │          │    │   │
│  │                 │         │  │       ▼              ▼          │    │   │
│  │  Tailwind CSS   │         │  │  ┌─────────────────────────┐    │    │   │
│  │  React          │         │  │  │  Extraction Service     │    │    │   │
│  └─────────────────┘         │  │  │  (Pipeline Orchestrator)│    │    │   │
│         │                    │  │  └────────────┬────────────┘    │    │   │
│         │                    │  └───────────────┼────────────────┘    │   │
│         │                    │                  │                      │   │
│         │                    │                  ▼                      │   │
│         │                    │  ┌─────────────────────────────────┐   │   │
│         │                    │  │    Database (SQLite/Postgres)   │   │   │
│         │                    │  │  ┌──────────┐  ┌─────────────┐  │   │   │
│         │                    │  │  │Documents │  │ProcessingLog│  │   │   │
│         │                    │  │  └──────────┘  └─────────────┘  │   │   │
│         │                    │  └─────────────────────────────────┘   │   │
│         │                    └────────────────────────────────────────┘   │
│         │                                                                  │
│         │     ┌────────────────────────────────────────────────────┐      │
│         └────►│                  File Storage                       │      │
│               │                  (./uploads/)                       │      │
│               └────────────────────────────────────────────────────┘      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Processing Pipeline:
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Upload  │───►│   OCR    │───►│  Clean   │───►│   LLM    │───►│  Store   │
│   File   │    │ Extract  │    │   Text   │    │ Analyze  │    │  Result  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
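
The five stages above can be sketched as a small orchestrator function. This is a minimal illustration and not the project's actual extraction_service.py: the names `PipelineResult`, `normalize_whitespace`, and `run_pipeline` are hypothetical, and the OCR, LLM, and storage steps are passed in as plain callables so the example runs without Tesseract or Ollama installed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineResult:
    """Hypothetical container for what each stage produces."""
    filename: str
    raw_text: str = ""
    clean_text: str = ""
    fields: dict = field(default_factory=dict)


def normalize_whitespace(text: str) -> str:
    """Clean step: collapse the runs of spaces and newlines OCR tends to emit."""
    return " ".join(text.split())


def run_pipeline(
    filename: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], dict],
    store: Callable[[PipelineResult], None],
) -> PipelineResult:
    """Run Upload -> OCR -> Clean -> LLM -> Store for one uploaded file."""
    result = PipelineResult(filename=filename)
    result.raw_text = ocr(filename)                          # OCR Extract
    result.clean_text = normalize_whitespace(result.raw_text)  # Clean Text
    result.fields = llm(result.clean_text)                   # LLM Analyze
    store(result)                                            # Store Result
    return result


# Stub stages stand in for Tesseract, the local LLM, and the database.
saved = []
demo = run_pipeline(
    "invoice.pdf",
    ocr=lambda path: "Invoice   No. 42\n\nTotal:  10.00",
    llm=lambda text: {"document_type": "invoice"},
    store=saved.append,
)
print(demo.clean_text)
```

Passing the stages in as callables keeps each one replaceable, which may be convenient when switching between the Ollama and HuggingFace providers or when testing without a model.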

✨ Features

📤 Multi-file Upload: Drag & drop up to 20 documents at once
🔍 OCR Processing: Extract text from PDF, JPG, PNG using Tesseract
🤖 AI Analysis: Local LLM extracts structured fields (Ollama or HuggingFace)
📊 Document Types: Automatically classify as Invoice, Contract, Receipt
✏️ Review & Edit: View OCR text and edit extracted fields
📁 Smart Grouping: Group documents by type, vendor, or date
📥 Export: Download data as CSV or JSON
📈 Dashboard: Overview with statistics and confidence metrics
🔐 100% Local: No cloud APIs, all processing on your machine
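
As a rough illustration of the grouping and export features listed above, the sketch below groups document records by a chosen field and serializes them to CSV or JSON with the Python standard library. The record fields (`filename`, `document_type`, `vendor`) and the function names are assumptions made for the example; the real backend may use different names and store documents in the database instead of plain dictionaries.

```python
import csv
import io
import json
from collections import defaultdict


def group_documents(documents, key):
    """Group records by `key`; records missing the key land under 'unknown'."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc.get(key) or "unknown"].append(doc)
    return dict(groups)


def to_csv(documents, columns):
    """Serialize records to CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(documents)
    return buffer.getvalue()


def to_json(documents):
    """Serialize records to indented JSON text."""
    return json.dumps(documents, indent=2)


# Hypothetical records, shaped the way extracted documents might look.
docs = [
    {"filename": "a.pdf", "document_type": "invoice", "vendor": "ACME"},
    {"filename": "b.pdf", "document_type": "receipt", "vendor": None},
    {"filename": "c.pdf", "document_type": "invoice", "vendor": "ACME"},
]
by_type = group_documents(docs, "document_type")
print({name: len(items) for name, items in by_type.items()})
print(to_csv(docs, ["filename", "vendor"]))
```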

📋 Requirements

System Requirements

Python 3.10+
Node.js 18+
Tesseract OCR installed
Poppler (for PDF processing)
8GB+ RAM recommended

LLM Requirements (choose one)

Option A: Ollama running locally with a model (e.g., llama3.2, mistral)
Option B: HuggingFace Transformers (TinyLlama or Phi-2)

🚀 Installation

1. Clone the Repository

git clone https://github.com/yourusername/docextract.git
cd docextract

2. Install System Dependencies

Windows:

# Install Tesseract
winget install UB-Mannheim.TesseractOCR

# Install Poppler (for PDF support)
# Download from: https://github.com/oschwartz10612/poppler-windows/releases
# Add bin folder to PATH

# Install Ollama (for LLM)
winget install Ollama.Ollama

macOS:

brew install tesseract poppler
brew install ollama

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install tesseract-ocr poppler-utils
# Install Ollama (for LLM)
curl -fsSL https://ollama.com/install.sh | sh

3. Setup Backend

cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Copy environment file
copy .env.example .env  # Windows
cp .env.example .env    # macOS/Linux

# Edit .env as needed (especially LLM settings)

4. Setup Frontend

# From the repository root:
cd frontend

# Install dependencies
npm install

5. Setup LLM (Ollama)

# Pull a model (run Ollama first)
ollama pull llama3.2

# Or, for a smaller model:
ollama pull tinyllama

▶️ Running the Application

1. Start the Backend

cd backend
# Activate the virtual environment first
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Then run:
python main.py
# Or with uvicorn:
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Backend will be available at: http://localhost:8000
API Documentation: http://localhost:8000/docs

2. Start the Frontend

cd frontend
npm run dev

Frontend will be available at: http://localhost:3000

3. Start Ollama (if using)

ollama serve

📁 Project Structure

docextract/
├── backend/
│   ├── main.py              # FastAPI application entry point
│   ├── config.py            # Configuration settings
│   ├── database.py          # Database connection
│   ├── schemas.py           # Pydantic schemas
│   ├── utils.py             # Utility functions
│   ├── models/
│   │   ├── __init__.py
│   │   └── document.py      # SQLAlchemy models
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── upload.py        # Upload endpoints
│   │   ├── documents.py     # Document CRUD endpoints
│   │   ├── groups.py        # Grouping endpoints
│   │   └── export.py        # Export endpoints
│   ├── services/
│   │   ├── __init__.py
│   │   ├── ocr_service.py   # Tesseract OCR
│   │   ├── llm_service.py   # Local LLM integration
│   │   └── extraction_service.py  # Pipeline orchestration
│   ├── requirements.txt
│   └── .env.example
├── frontend/
│   ├── src/
│   │   ├── app/
│   │   │   ├── layout.tsx
│   │   │   ├── page.tsx     # Landing page
│   │   │   ├── globals.css
│   │   │   └── dashboard/
│   │   │       ├── layout.tsx
│   │   │       ├── page.tsx        # Dashboard
│   │   │       ├── upload/
│   │   │       ├── documents/
│   │   │       ├── groups/
│   │   │       └── export/
│   │   ├── components/
│   │   │   ├── Sidebar.tsx
│   │   │   ├── FileUpload.tsx
│   │   │   ├── DocumentTable.tsx
│   │   │   └── StatsCards.tsx
│   │   └── lib/
│   │       ├── api.ts       # API client
│   │       └── utils.ts     # Utility functions
│   ├── package.json
│   ├── tailwind.config.js
│   └── next.config.js
├── samples/
│   ├── sample_invoice.txt
│   ├── sample_contract.txt
│   └── sample_receipt.txt
└── README.md

⚙️ Configuration

Edit backend/.env to configure:

# LLM Provider: "ollama" or "huggingface"
LLM_PROVIDER=ollama

# Ollama settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

# HuggingFace settings (if using)
HUGGINGFACE_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Processing
CONFIDENCE_THRESHOLD=70.0
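
For illustration, the settings above could be read and validated roughly as follows. This is a sketch under stated assumptions, not the contents of the project's config.py: the `Settings` class and `load_settings` function are hypothetical, and the example reads from a mapping such as `os.environ`, so it assumes the values from .env have already been loaded into the environment by some other means.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view of the variables shown in backend/.env."""
    llm_provider: str
    ollama_base_url: str
    ollama_model: str
    huggingface_model: str
    confidence_threshold: float


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if provider not in {"ollama", "huggingface"}:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    return Settings(
        llm_provider=provider,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "llama3.2"),
        huggingface_model=env.get(
            "HUGGINGFACE_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        ),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "70.0")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = load_settings({"LLM_PROVIDER": "ollama", "CONFIDENCE_THRESHOLD": "55"})
print(settings.llm_provider, settings.confidence_threshold)
```

Validating `LLM_PROVIDER` at startup surfaces a typo immediately instead of at the first document upload.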

🔌 API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/upload/` | Upload multiple files |
| GET | `/api/documents/` | List documents (with filters) |
| GET | `/api/documents/{id}` | Get document details |
| PUT | `/api/documents/{id}` | Update document fields |
| POST | `/api/documents/{id}/verify` | Mark as verified |
| POST | `/api/documents/{id}/reprocess` | Reprocess document |
| DELETE | `/api/documents/{id}` | Delete document |
| GET | `/api/documents/stats` | Get dashboard statistics |
| GET | `/api/groups/` | Get grouped documents |
| GET | `/api/export/json` | Export as JSON |
| GET | `/api/export/csv` | Export as CSV |
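
The endpoints above can be called from any HTTP client. The sketch below builds requests with Python's standard `urllib` module, assuming the backend is running at http://localhost:8000 as described earlier. The helper names (`build_request`, `send`), the `document_type` query parameter, and the `vendor` body field are assumptions for illustration; check http://localhost:8000/docs for the actual parameters and request schemas.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # backend address from the run instructions


def build_request(method, path, params=None, body=None):
    """Build a urllib Request for one of the API endpoints (does not send it)."""
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=30):
    """Send a request and decode a JSON response (not suitable for the CSV export)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical filter and field names, shown only to illustrate the call shape.
list_invoices = build_request("GET", "/api/documents/", params={"document_type": "invoice"})
update_doc = build_request("PUT", "/api/documents/7", body={"vendor": "ACME Corporation"})
print(list_invoices.get_method(), list_invoices.full_url)
print(update_doc.get_method(), update_doc.full_url)
```

With the backend running, `send(list_invoices)` would perform the call; building and sending are kept separate so the request can be inspected without a live server.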

🧪 Sample Documents

The samples/ folder contains example documents:

sample_invoice.txt - Invoice from ACME Corporation
sample_contract.txt - Service Agreement Contract
sample_receipt.txt - Store receipt

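These are plain-text files. Convert them to PDF or an image format (JPG, PNG) to test the upload pipeline, or read them to see what kind of content the extraction is expected to handle.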

🔧 Troubleshooting

OCR Not Working

Ensure Tesseract is installed and in PATH
On Windows, you may need to set TESSERACT_CMD in .env

PDF Processing Fails

Ensure Poppler is installed and in PATH
On Windows, add Poppler's bin folder to PATH
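
A quick way to confirm that both tools are visible on PATH is to look up their executables from Python. The sketch below uses the standard `shutil.which`. It assumes Tesseract's executable is named `tesseract` and uses `pdftoppm`, one of the command-line tools shipped with Poppler, as the indicator for Poppler. The `find_missing_tools` helper is hypothetical and not part of the project.

```python
import shutil

# Executable name -> human-readable label used in the report.
REQUIRED_TOOLS = {
    "tesseract": "Tesseract OCR",
    "pdftoppm": "Poppler",
}


def find_missing_tools(which=shutil.which):
    """Return labels of required tools whose executables are not found on PATH."""
    return [label for binary, label in REQUIRED_TOOLS.items() if which(binary) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler were both found on PATH.")
```

If TESSERACT_CMD is set in .env to point at the executable directly, Tesseract may still work even when this check reports it as missing from PATH.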

LLM Timeout

Ensure Ollama is running (ollama serve)
Try a smaller model (tinyllama)
Increase LLM_TIMEOUT in .env

Low Confidence Scores

Ensure document images are clear and high resolution
Try increasing PDF_DPI in config
Documents with poor formatting may need manual review
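
Documents that fall below the configured threshold can be singled out for manual review. The sketch below assumes each document record carries a `confidence` value on the same 0 to 100 scale as `CONFIDENCE_THRESHOLD=70.0`; both the field name and the `needs_review` helper are assumptions for illustration.

```python
def needs_review(documents, threshold=70.0):
    """Return records whose confidence is missing or below the threshold."""
    flagged = []
    for doc in documents:
        confidence = doc.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(doc)
    return flagged


# Hypothetical records; real ones would come from the documents API or database.
records = [
    {"filename": "clear_scan.pdf", "confidence": 92.5},
    {"filename": "blurry_photo.jpg", "confidence": 41.0},
    {"filename": "unprocessed.png"},
]
for doc in needs_review(records):
    print("Review:", doc["filename"])
```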

📄 License

MIT License - See LICENSE file for details.
