
Scriptoria – AI-Powered Historical Document Analysis

Scriptoria is a Retrieval-Augmented Generation (RAG) engine purpose-built for historical document research. Upload scanned manuscripts, archival records, old letters, or any heritage PDF: Scriptoria extracts the text via OCR (even from faded prints and handwritten notes), indexes it in a vector database, and lets you interrogate centuries of knowledge with natural-language questions answered by local LLMs.

"Like having a research assistant who has read every page in your archive."

Scriptoria Screenshot

Features

  • Historical PDF ingestion – upload scanned manuscripts, archival records, and heritage documents
  • OCR optimized for aged documents – extracts text from faded prints, old typographies, and scanned pages
  • Semantic vector storage – powered by ChromaDB for intelligent document retrieval
  • Natural language queries – ask questions about your historical sources in plain language
  • Fully local & private – runs entirely on your machine using Ollama (no data leaves your system)
  • Modern web interface – clean, dark-themed UI built with Astro
  • REST API – integrate with your own tools via FastAPI endpoints
  • Docker support – one-command deployment for easy setup

Prerequisites

  • Python 3.12+
  • Node.js and npm
  • Ollama (for local LLM inference)
  • Docker and Docker Compose (optional, for containerized setup)

Installation

Option 1: Docker (Recommended)

  1. Clone the repository:

    git clone <repo-url>
    cd scriptoria
  2. Start the application:

    docker-compose up --build

    This will:

    • Start the Ollama server
    • Pull the required models (nomic-embed-text, llama3.1)
    • Build and run the application on http://localhost:8000

Option 2: Local Development

  1. Clone the repository:

    git clone <repo-url>
    cd scriptoria
  2. Install Ollama (available from https://ollama.com) and make sure the `ollama` CLI is on your PATH.

  3. Set up Python environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    uv sync  # or pip install -e .
  4. Set up frontend:

    cd frontend
    npm install
    npm run build
    cd ..
  5. Run the application:

    ./start.sh

    Or manually:

    uv run python main.py

    The application will be available at http://localhost:8000

Usage

  1. Open http://localhost:8000 in your browser
  2. Upload scanned historical PDFs (manuscripts, archival records, old books, letters…)
  3. Scriptoria processes them with OCR and indexes content in the vector database
  4. Ask questions in natural language: "What events are described in 1492?", "Summarize the correspondence between these two figures", etc.
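The upload step above can also be driven over HTTP from a script. Below is a minimal sketch using only the Python standard library; it assumes the `/upload` endpoint reads a multipart form field named `file`, which is not confirmed by this README, so adjust the field name to match the actual FastAPI signature.

```python
import json
import uuid
from urllib import request


def encode_pdf_multipart(filename: str, pdf_bytes: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body for a single PDF file.

    Assumes the /upload endpoint reads a form field named "file"
    (an assumption about the API, not documented here).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a local PDF to a running Scriptoria instance and return its JSON reply."""
    with open(path, "rb") as fh:
        body, content_type = encode_pdf_multipart(path.rsplit("/", 1)[-1], fh.read())
    req = request.Request(
        f"{base_url}/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `upload_pdf("letters/1850_correspondence.pdf")` would send the file for OCR and indexing (the path here is purely illustrative).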

API

The backend provides a REST API:

  • POST /upload – Upload historical PDF documents for OCR processing and indexing
  • POST /ask – Query your document archive using natural language
  • GET /files – List all indexed documents
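Querying can be scripted the same way. The sketch below uses only the standard library; the JSON field name `question` and the shape of the response are assumptions, since this README does not document the request schema.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"


def build_ask_request(question: str, base_url: str = BASE_URL) -> request.Request:
    """Prepare a POST request for the /ask endpoint.

    The payload field name "question" is an assumption about the API.
    """
    payload = json.dumps({"question": question}).encode()
    return request.Request(
        f"{base_url}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send a question to a running Scriptoria instance and return the JSON reply."""
    with request.urlopen(build_ask_request(question)) as resp:
        return json.load(resp)
```

For example, `ask("What events are described in 1492?")` mirrors the browser workflow from the Usage section.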

Project Structure

  • app.py – FastAPI application & API endpoints
  • main.py – Entry point with automatic Ollama management & model provisioning
  • rag/ – RAG pipeline (OCR extraction, document ingestion, semantic querying, vector store)
  • frontend/ – Astro-based web interface
  • data/ – Uploaded documents and ChromaDB vector store
  • docker-compose.yml – Full-stack Docker deployment
  • pyproject.toml – Python project metadata & dependencies

Contributing

Contributions are welcome! Please open issues or submit pull requests.

License

GPL-3.0
