hinbox is a flexible, domain-configurable entity extraction system designed
for historians and researchers. It processes historical documents, academic
papers, news articles, and book chapters to extract structured information about
people, organizations, locations, and events. Originally designed for Guantánamo
Bay media coverage analysis, it now supports any historical or research domain
through a simple configuration system.
- Research-Focused: Designed for historians, academics, and researchers
- Flexible Sources: Process historical documents, academic papers, news articles, book chapters
- Domain-Agnostic: Configure for any historical period, region, or research topic
- Multiple AI Models: Support for both cloud models (Gemini by default, plus anything `litellm` supports) and local models (Ollama by default, also via `litellm`)
- Entity Extraction: Automatically extract people, organizations, locations, and events
- Smart Deduplication: Uses embeddings to merge similar entities across sources
- Profile Versioning: Track how entity profiles evolve as new sources are processed
- Modular Engine: `src/engine` coordinates article processing, extraction, merging, and profile versioning so new domains can reuse the same pipeline
- Web Interface: FastHTML-based UI for exploring research findings with version navigation
- Easy Setup: Simple configuration files, no Python coding required
Note: This project supports both `./run.py` commands and `just` commands. Use whichever you prefer!
```bash
./run.py domains
# OR: just domains

./run.py init palestine_food_history
# OR: just init palestine_food_history
```

Edit the generated files in `configs/palestine_food_history/`:

- `config.yaml` - Research domain settings and data paths
- `prompts/*.md` - Extraction instructions tailored to your sources
- `categories/*.yaml` - Entity type definitions relevant to your research

```bash
./run.py process --domain palestine_food_history --limit 5
# OR: just process-domain palestine_food_history --limit 5

./run.py frontend
# OR: just frontend
```

Requirements:

- Python 3.12+
- `uv` (for dependency management)
- Optional: Ollama (for local model support)
- Optional: `just` (for easier command running)
- Clone the repository:

  ```bash
  git clone https://github.com/strickvl/hinbox.git
  cd hinbox
  ```

- Install dependencies:

  ```bash
  uv sync
  ```

- Set up environment variables:

  ```bash
  export GEMINI_API_KEY="your-gemini-api-key"
  # Optional for local processing:
  export OLLAMA_API_URL="http://localhost:11434/v1"
  ```

- Verify installation:

  ```bash
  ./run.py domains
  ```
```bash
just init palestine_food_history
# Edit configs/palestine_food_history/ to focus on:
# - People: farmers, traders, cookbook authors, anthropologists
# - Organizations: agricultural cooperatives, food companies, research institutions
# - Events: harvests, famines, recipe documentation, cultural exchanges
# - Locations: villages, markets, agricultural regions, refugee camps
```

```bash
just init afghanistan_1980s
# Configure for:
# - People: military leaders, diplomats, journalists, mujahideen commanders
# - Organizations: military units, intelligence agencies, NGOs, tribal groups
# - Events: battles, negotiations, refugee movements, arms shipments
# - Locations: provinces, military bases, refugee camps, border crossings
```

```bash
just init medieval_trade
# Set up for:
# - People: merchants, rulers, scholars, travelers
# - Organizations: trading companies, guilds, monasteries, courts
# - Events: trade agreements, diplomatic missions, market fairs
# - Locations: trading posts, cities, trade routes, ports
```

```bash
# Process with different options
./run.py process --domain afghanistan_1980s -n 20 --verbose
just process-domain palestine_food_history --limit 10 --relevance

# Use local models (requires Ollama) - useful for sensitive historical research
./run.py process --domain medieval_trade --local

# Force reprocessing when you update your configuration
./run.py process --domain afghanistan_1980s --force
```

```bash
./run.py frontend
# OR: just frontend
```

Explore extracted entities at http://localhost:5001
```bash
# Check processing status
./run.py check

# Reset processing status
./run.py reset

# View available domains
./run.py domains
```

```
configs/
├── guantanamo/                          # Example domain shipped with the project
├── template/                            # Starter files copied by `run.py init`
└── README.md                            # Domain configuration walkthrough
src/
├── process_and_extract.py               # CLI entry point for the article pipeline
├── engine/                              # ArticleProcessor, EntityExtractor, mergers, profiles
├── frontend/                            # FastHTML UI (routes, components, static assets)
├── utils/                               # Embeddings, LLM wrappers, logging, file helpers
├── config_loader.py                     # Domain configuration loader helpers
├── dynamic_models.py                    # Domain-driven Pydantic model factories
├── constants.py                         # Model defaults, embedding settings, thresholds
└── exceptions.py                        # Custom exception types used across the pipeline
tests/
├── embeddings/                          # Embedding manager and similarity unit tests
├── test_domain_paths.py                 # Validates domain-specific path resolution
├── test_entity_merger_merge_smoke.py    # Embedding-based merge smoke tests
├── test_entity_merger_similarity.py     # Similarity scoring behaviour
├── test_profile_versioning.py           # Versioned profile regression tests
└── test_frontend_versioning.py          # UI behaviour for profile history
data/
├── guantanamo/                          # Default domain data directory (created locally)
└── {domain}/                            # Additional domains maintain their own raw/entity data
```
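The `dynamic_models.py` entry above builds entity schemas per domain. A minimal sketch of that idea with `pydantic.create_model` (assuming Pydantic v2; the field set and category names here are invented for illustration, not the project's actual schema):

```python
from typing import List

from pydantic import BaseModel, Field, create_model


def build_person_model(person_types: dict) -> type[BaseModel]:
    """Create a domain-specific Person model whose `type` field documents
    the categories loaded from a categories/*.yaml file."""
    allowed = ", ".join(person_types)
    return create_model(
        "Person",
        name=(str, Field(description="Canonical name of the person")),
        type=(str, Field(description=f"One of: {allowed}")),
        aliases=(List[str], Field(default_factory=list)),
    )


# Hypothetical categories, as they might come from categories/people.yaml
Person = build_person_model({"farmer": {}, "trader": {}, "cookbook_author": {}})
print(Person(name="Example Name", type="farmer").model_dump())
```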
Each domain has its own `configs/{domain}/` directory with:

`config.yaml` - Main settings:

```yaml
domain: "palestine_food_history"
description: "Historical analysis of Palestinian food culture and agriculture"
data_sources:
  default_path: "data/palestine_food_history/raw_sources/historical_sources.parquet"
output:
  directory: "data/palestine_food_history/entities"
```
`categories/*.yaml` - Entity type definitions:

```yaml
person_types:
  farmer:
    description: "Farmers and agricultural workers"
    examples: ["olive farmers", "citrus growers"]
```

`prompts/*.md` - Extraction instructions (plain English!):

```markdown
You are an expert at extracting people from historical documents about Palestinian food culture.
Focus on farmers, traders, cookbook authors, researchers, and community leaders...
```

Historical sources should be in Parquet format with columns:

- `title`: Document/article title
- `content`: Full text content
- `url`: Source URL (if applicable)
- `published_date`: Publication/creation date
- `source_type`: `"book_chapter"`, `"journal_article"`, `"news_article"`, `"archival_document"`, etc.
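To prepare sources in that shape, a short pandas sketch (paths and sample values are illustrative; writing Parquet requires `pyarrow` or `fastparquet`):

```python
import pandas as pd

# Illustrative rows matching the expected column schema
sources = pd.DataFrame(
    [
        {
            "title": "Olive Harvests in the Galilee",
            "content": "Full text of the document...",
            "url": "https://example.org/source",
            "published_date": "1987-11-02",
            "source_type": "journal_article",
        }
    ]
)

# Assumes the target directory already exists
sources.to_parquet(
    "data/palestine_food_history/raw_sources/historical_sources.parquet",
    index=False,
)
```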
- Configuration Loading: Read domain-specific settings
- Source Loading: Process historical documents in Parquet format
- Relevance Filtering: Domain-specific content filtering for research focus
- Entity Extraction: Extract people, organizations, locations, events from historical sources
- Smart Deduplication: Merge similar entities using embeddings (see the sketch after this list)
- Profile Generation: Create comprehensive entity profiles with automatic versioning
- Version Management: Track profile evolution as new sources are processed
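The Smart Deduplication stage compares embedding vectors and merges entities whose similarity clears a threshold. A self-contained sketch with toy vectors (the real merge logic lives in `src/engine/mergers.py`, and the real thresholds in `src/constants.py`; hinbox uses Jina embeddings rather than these hand-written vectors):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real entity embeddings
existing_entities = {"Mohammed al-Qahtani": np.array([0.91, 0.10, 0.23])}
candidate_name = "Mohammad al Qahtani"
candidate_vec = np.array([0.89, 0.12, 0.21])

MERGE_THRESHOLD = 0.9  # hypothetical value for illustration

for name, vec in existing_entities.items():
    if cosine_similarity(candidate_vec, vec) >= MERGE_THRESHOLD:
        print(f"Merging '{candidate_name}' into existing entity '{name}'")
```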
The engine is organised around four components:

- `ArticleProcessor` orchestrates relevance checks, extraction dispatch, and per-article metadata aggregation (`src/engine/article_processor.py`)
- `EntityExtractor` unifies cloud and local model calls using domain-specific Pydantic schemas (`src/engine/extractors.py`); a sketch of the idea follows below
- `EntityMerger` compares embeddings, calls match-checkers, and updates persisted Parquet rows (`src/engine/mergers.py`)
- `VersionedProfile` and helper functions maintain profile history for each entity (`src/engine/profiles.py`)
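As referenced above, `EntityExtractor` routes extraction through LiteLLM so cloud and local models share one call path. A hypothetical sketch of that call (model names and prompt are illustrative, and the JSON handling is simplified; the real extractor validates results against domain-specific Pydantic schemas):

```python
import json

from litellm import completion  # pip install litellm


def extract_people(text: str, model: str = "gemini/gemini-2.0-flash") -> list[dict]:
    """Ask the model for people mentioned in a source document."""
    prompt = (
        "Extract all people mentioned in the text below. Respond with JSON "
        'like {"people": [{"name": "...", "type": "..."}]}.\n\nTEXT:\n' + text
    )
    response = completion(
        model=model,  # any litellm-supported model; e.g. "ollama/llama3" for local runs
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # Assumes the model honored the requested JSON shape
    return json.loads(response.choices[0].message.content)["people"]
```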
- Domain-Agnostic: Easy to configure for any topic
- Multiple AI Models: Cloud (Gemini) and local (Ollama) support
- Smart Processing: Automatic relevance filtering and deduplication
- Profile Versioning: Track entity profile changes over time with full version history (see the sketch after this list)
- Modern Interface: FastHTML-based web UI with version navigation
- Robust Pipeline: Error handling and progress tracking
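The profile versioning idea, as a toy sketch (invented field names and sample data; the real `VersionedProfile` lives in `src/engine/profiles.py`):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProfileVersion:
    """One snapshot of an entity profile."""
    text: str
    source_url: str
    created_at: datetime


@dataclass
class VersionedProfile:
    """Keeps every historical version; the newest entry is 'current'."""
    versions: list[ProfileVersion] = field(default_factory=list)

    def update(self, text: str, source_url: str) -> None:
        self.versions.append(
            ProfileVersion(text, source_url, datetime.now(timezone.utc))
        )

    @property
    def current(self) -> str:
        return self.versions[-1].text if self.versions else ""


profile = VersionedProfile()
profile.update("Farmer active near Gaza in the 1930s.", "https://example.org/a")
profile.update("Farmer and olive trader, 1930s-1950s.", "https://example.org/b")
print(len(profile.versions), profile.current)
```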
```bash
# Run tests (pytest)
pytest tests/

# Run specific test files
pytest tests/test_profile_versioning.py
pytest tests/test_frontend_versioning.py
```

The project includes unit tests for profile versioning functionality and frontend components.
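For flavor, here is what a minimal unit test might look like, in pytest style, over a toy cosine-similarity helper like the one sketched earlier (hypothetical; not a test from the repo):

```python
import numpy as np
import pytest


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def test_identical_vectors_have_similarity_one():
    v = np.array([0.3, 0.5, 0.8])
    assert cosine_similarity(v, v) == pytest.approx(1.0)


def test_orthogonal_vectors_have_similarity_zero():
    a = np.array([1.0, 0.0])
    b = np.array([0.0, 1.0])
    assert cosine_similarity(a, b) == pytest.approx(0.0)
```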
```bash
# Format code
./scripts/format.sh

# Run linting
./scripts/lint.sh

# Both together
just check-code
```

Contributions welcome! Areas of interest:
- New domain templates and examples
- Additional language model integrations
- Enhanced web interface features
- Performance optimizations
MIT License - see LICENSE file for details.
For questions about:
- Configuration: See `configs/README.md`
- Setup: Check installation steps above
- Usage: Try `./run.py --help` or `just --list`
- Issues: Open a GitHub issue
Built for: Historians, researchers, and academics working with large document collections
Built with: Python, Pydantic, FastHTML, LiteLLM, Jina Embeddings



