# Scriptoria

Scriptoria is a Retrieval-Augmented Generation (RAG) engine purpose-built for historical document research. Upload scanned manuscripts, archival records, old letters, or any heritage PDF—Scriptoria extracts text via OCR (even from faded prints and handwritten notes), indexes it in a vector database, and lets you interrogate centuries of knowledge using natural language powered by local LLMs.
> "Like having a research assistant who has read every page in your archive."
## Features

- Historical PDF ingestion – upload scanned manuscripts, archival records, and heritage documents
- OCR optimized for aged documents – extracts text from faded prints, old typographies, and scanned pages
- Semantic vector storage – powered by ChromaDB for intelligent document retrieval
- Natural language queries – ask questions about your historical sources in plain language
- Fully local & private – runs entirely on your machine using Ollama (no data leaves your system)
- Modern web interface – clean, dark-themed UI built with Astro
- REST API – integrate with your own tools via FastAPI endpoints
- Docker support – one-command deployment for easy setup
## Prerequisites

- Python 3.12+
- Node.js and npm
- Ollama (for local LLM inference)
- Docker and Docker Compose (optional, for containerized setup)
## Quick Start with Docker

1. Clone the repository:

   ```bash
   git clone <repo-url>
   cd scriptoria
   ```
2. Start the application:

   ```bash
   docker-compose up --build
   ```

   This will:

   - Start the Ollama server
   - Pull the required models (nomic-embed-text, llama3.1)
   - Build and run the application on http://localhost:8000
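For reference, the stack boils down to two services: an Ollama server and the application itself. The sketch below shows that layout only; the service names and volume are assumptions, and the repository's `docker-compose.yml` is the authoritative version.

```yaml
services:
  ollama:
    image: ollama/ollama        # local LLM server
    volumes:
      - ollama:/root/.ollama    # persist pulled models (nomic-embed-text, llama3.1)
  app:
    build: .
    ports:
      - "8000:8000"             # web UI and REST API
    depends_on:
      - ollama
volumes:
  ollama:
```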
## Manual Setup

1. Clone the repository:

   ```bash
   git clone <repo-url>
   cd scriptoria
   ```
2. Install Ollama:

   Download from https://ollama.com/download
3. Set up the Python environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   uv sync                    # or: pip install -e .
   ```
4. Set up the frontend:

   ```bash
   cd frontend
   npm install
   npm run build
   cd ..
   ```
5. Run the application:

   ```bash
   ./start.sh
   ```

   Or manually:

   ```bash
   uv run python main.py
   ```
The application will be available at http://localhost:8000
## Usage

1. Open http://localhost:8000 in your browser
2. Upload scanned historical PDFs (manuscripts, archival records, old books, letters…)
3. Scriptoria processes them with OCR and indexes the content in the vector database
4. Ask questions in natural language: "What events are described in 1492?", "Summarize the correspondence between these two figures", etc.
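Under the hood, indexing means splitting each OCR'd document into overlapping chunks before embedding them into the vector store. The following is a minimal sketch of that chunking step; the function name and the size/overlap defaults are illustrative, not Scriptoria's actual parameters — see `rag/` for the real pipeline.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split OCR output into overlapping character chunks for embedding.

    `size` and `overlap` are illustrative defaults, not Scriptoria's settings.
    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighboring chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```

Each chunk is then embedded (Scriptoria uses nomic-embed-text via Ollama) and stored in ChromaDB, so a question retrieves only the most relevant passages.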
## API

The backend provides a REST API:

- `POST /upload` – Upload historical PDF documents for OCR processing and indexing
- `POST /ask` – Query your document archive using natural language
- `GET /files` – List all indexed documents
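As an illustration, the `/ask` endpoint can be called from Python with only the standard library. The `question` field name and the JSON response shape are assumptions here — check `app.py` for the actual request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the setup above


def build_ask_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    # NOTE: the {"question": ...} field name is an assumption; see app.py for the real schema.
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = BASE_URL) -> dict:
    """POST a natural-language question and return the decoded JSON response."""
    with urllib.request.urlopen(build_ask_request(question, base_url)) as resp:
        return json.load(resp)
```

With the server running, `ask("What events are described in 1492?")` returns the decoded JSON body of the answer.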
## Project Structure

- `app.py` – FastAPI application & API endpoints
- `main.py` – Entry point with automatic Ollama management & model provisioning
- `rag/` – RAG pipeline (OCR extraction, document ingestion, semantic querying, vector store)
- `frontend/` – Astro-based web interface
- `data/` – Uploaded documents and ChromaDB vector store
- `docker-compose.yml` – Full-stack Docker deployment
- `pyproject.toml` – Python project metadata & dependencies
## Contributing

Contributions are welcome! Please open issues or submit pull requests.
## License

GPL-3.0
