A local-first Retrieval-Augmented Generation (RAG) system with a two-stage retrieval pipeline (vector search followed by re-ranking), OCR support for scanned documents, and observability through an Arize Phoenix dashboard.

- Two-Stage Retrieval: Vector search (Qdrant) + AI Re-ranking (Cross-Encoder).
- Intelligent OCR: Automatic fallback to Tesseract for scanned/image-based PDFs.
- Premium CLI: Highly verbose and organized output using rich.
- Pro Observability: Industry-standard dashboard with Arize Phoenix.
- Local-First: Complete privacy, running entirely on your machine via Docker and Pixi.
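
The two-stage retrieval idea from the feature list can be sketched in plain Python. This is only an illustration under stated assumptions: the in-memory `corpus`, the `cosine` helper, and the `word_overlap` scorer are hypothetical stand-ins for Qdrant vector search and a real cross-encoder model, and none of these names come from the project itself.

```python
from __future__ import annotations

import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def two_stage_search(
    query_text: str,
    query_vec: Vector,
    corpus: Sequence[tuple[str, Vector]],
    rerank_score: Callable[[str, str], float],
    candidates: int = 2,
    top_k: int = 2,
) -> list[str]:
    # Stage 1: cheap vector similarity narrows the corpus to a short candidate list.
    shortlist = sorted(
        corpus, key=lambda doc: cosine(query_vec, doc[1]), reverse=True
    )[:candidates]
    # Stage 2: a slower, more precise scorer reorders only the shortlist.
    reranked = sorted(
        shortlist, key=lambda doc: rerank_score(query_text, doc[0]), reverse=True
    )
    return [text for text, _ in reranked[:top_k]]


# Hypothetical toy corpus: (passage, embedding) pairs.
corpus = [
    ("Retrieval pipelines often use RAG", [0.9, 0.1]),
    ("RAG is short for retrieval-augmented generation", [0.8, 0.2]),
    ("Tesseract is an OCR engine", [0.1, 0.9]),
]
results = two_stage_search("what is rag", [1.0, 0.0], corpus, word_overlap)
```

In this toy run the vector stage drops the OCR passage, and the re-ranking stage then moves the second passage ahead of the first because it matches more of the query's words. The design point is that the expensive scorer only ever sees the shortlist, never the whole corpus.
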

Start the local Qdrant database and Phoenix dashboard:

    pixi run up

Place your PDFs in the data/ folder and run:

    pixi run ingest

Search through your documents with AI re-ranking:

    pixi run query "What is RAG?"

| Command | Description |
|---|---|
| pixi run up | Start Docker containers (Qdrant + Phoenix) |
| pixi run down | Stop all Docker containers |
| pixi run ingest | Process PDFs and store embeddings |
| pixi run query "..." | Search and re-rank results |
| pixi run test | Run unit tests |
| pixi run stats | Check collection statistics |
| pixi run dashboard | Open Arize Phoenix in browser |
| pixi run qdrant_ui | Open Qdrant Dashboard in browser |
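
The OCR fallback used during ingestion could work roughly as follows. This is a minimal sketch, not the project's actual logic: the `needs_ocr` heuristic, the `min_chars` threshold, and the `page_text` helper are all assumptions made for illustration, and the two callables stand in for a PDF text extractor and an OCR call (which might wrap Tesseract, for example via `pytesseract.image_to_string`).

```python
from __future__ import annotations

from typing import Callable


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Assumed heuristic: a page with almost no extractable text is likely a scan."""
    return len(extracted_text.strip()) < min_chars


def page_text(
    extract_text: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr"."""
    text = extract_text()
    if needs_ocr(text, min_chars):
        # Only pay the OCR cost when the embedded text layer is missing or near-empty.
        return run_ocr(), "ocr"
    return text, "text-layer"


# A digital page keeps its embedded text; a scanned page falls back to OCR.
digital = page_text(
    lambda: "Retrieval-Augmented Generation combines search with an LLM.",
    lambda: "unused",
)
scanned = page_text(lambda: "  \n", lambda: "Text recovered by OCR")
```

Passing the extractor and the OCR step as callables keeps the decision logic testable without installing Tesseract, and it means OCR never runs for pages that already carry a usable text layer.
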

    .
    ├── data/                   # Source PDF documents
    ├── src/
    │   └── prorag/             # Main package
    │       ├── core/           # Config, Database, Model managers
    │       ├── ingest/         # PDF processing & pipeline
    │       ├── retrieval/      # Vector search & re-ranking
    │       └── cli.py          # Unified CLI entry point
    ├── tests/                  # Unit tests
    ├── docker-compose.yml      # Infrastructure as Code
    └── pixi.toml               # Dependency & Task management
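
A unified entry point such as `cli.py` could dispatch subcommands along these lines. This is a hypothetical sketch using the standard library's `argparse`; the real `cli.py`, its argument names, and how the pixi tasks map onto it are not shown here, so every name below is an assumption.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per task (assumed layout)."""
    parser = argparse.ArgumentParser(prog="prorag", description="Local-first RAG CLI")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("ingest", help="Process PDFs and store embeddings")
    query = sub.add_parser("query", help="Search and re-rank results")
    query.add_argument("text", help="Question to search for")
    sub.add_parser("stats", help="Check collection statistics")
    return parser


def main(argv: list[str] | None = None) -> str:
    """Parse arguments and return a label for the action that would run."""
    args = build_parser().parse_args(argv)
    if args.command == "query":
        # A real implementation would call the retrieval pipeline here.
        return f"query: {args.text}"
    return args.command
```

With this layout, `main(["query", "What is RAG?"])` selects the query branch and `main(["stats"])` selects the stats branch; a pixi task would then only need to forward its arguments to the entry point.
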