Deriva

Automatically generate ArchiMate enterprise architecture models from software repositories.

Deriva analyzes code repositories and transforms them into ArchiMate models that can be opened in the Archi modeling tool.

How It Works

Clone a Git repository
Extract a graph representation into Neo4j:
- Structural nodes: directories, files (classified by type and subtype)
- Semantic nodes: TypeDefinitions, Methods, BusinessConcepts, Technologies, etc.
- Python files use fast AST extraction; other languages use LLM
Derive ArchiMate elements using a hybrid approach:
- Enrich phase: Graph enrichment (PageRank, Louvain communities, k-core)
- Generate phase: LLM-based element derivation with graph metrics
- Refine phase: Relationship derivation and quality assurance
Export to .archimate XML file

Quick Setup

Prerequisites

Python 3.14+
Docker (for Neo4j)
uv (Python package manager)

1. Install uv

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone and Configure

git clone https://github.com/StevenBtw/Deriva.git
cd Deriva

# Create environment configuration
cp .env.example .env
# Edit .env with your settings (Neo4j, LLM API keys, etc.)

3. Create Python Environment

uv venv --python 3.14

Activate the virtual environment:

# Windows PowerShell
.venv\Scripts\Activate.ps1

# Windows Command Prompt
.venv\Scripts\activate.bat

# macOS/Linux
source .venv/bin/activate

4. Install Dependencies

uv sync

5. Start Neo4j

cd deriva/adapters/neo4j
docker-compose up -d

Neo4j will be available at:

Browser UI: http://localhost:7474 (no authentication)
Bolt Protocol: bolt://localhost:7687

Verify Neo4j is running:

docker ps  # Should show deriva_neo4j container

6. Launch Deriva

cd ../../..  # Back to Deriva root
uv run marimo edit deriva/app/app.py

The marimo notebook opens in your browser at: http://127.0.0.1:2718

First Time Setup

When you first open Deriva, you need to seed the configuration database.

1. Seed File Type Registry

Navigate to Column 2: Manage Extraction → File Type Registry

Click "Seed from JSON"
This loads default file type mappings from extraction_config.json
Categories include: Source, Config, Docs, Test, Build, Asset, Data, Exclude

2. Enable Extraction Steps

Navigate to Column 2: Manage Extraction → Extraction Step Configuration

Enable the extraction steps you need:

Step	Purpose	Recommended
Repository	Creates root node for the repo	Always
Directory	Creates directory structure nodes	Always
File	Creates file nodes with classification	Always
TypeDefinition	Extracts classes, functions (AST for Python)	Yes
Method	Extracts methods from type definitions	Optional
Technology	Detects frameworks and libraries	Optional
ExternalDependency	Maps external dependencies	Optional
Test	Extracts test definitions	Optional

3. Configure LLM (Optional)

If using LLM-assisted extraction, configure your provider in .env:

LLM_PROVIDER=mistral  # or azure, anthropic
LLM_MISTRAL_API_KEY=your-key-here
LLM_MISTRAL_MODEL=devstral2
LLM_MISTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_STRUCTURED_OUTPUT=true

Using Deriva

Basic Workflow

1. Clone a Repository

Column 1: Configuration → Repositories

Enter repository URL (e.g., https://github.com/user/repo.git)
Optionally specify a target name
Click "Clone"

2. Run the Pipeline

Column 0: Run Deriva

Click "Run Deriva" to run the full pipeline (extraction → derivation)
Or use individual step buttons: Extraction, Derivation

Results display in a status callout showing nodes/elements created and any errors.

3. View Results

Column 1: Configuration

Graph Statistics: Node counts by type (Repository, Directory, File, etc.)
ArchiMate Model: Element and relationship counts by type

4. Export to Archi

Column 1: Configuration → ArchiMate Model

Set export path (default: workspace/output/model.archimate)
Click "Export Model"
Open the file with Archi

Via CLI:

deriva export -o workspace/output/model.archimate

Configuration

Environment Variables (.env)

All configuration lives in .env. Key settings:

# Neo4j (default docker-compose has auth disabled)
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=
NEO4J_PASSWORD=

# LLM Provider (mistral, openai, azure, anthropic, ollama, lmstudio)
LLM_MISTRAL_DEVSTRAL_PROVIDER=mistral
LLM_MISTRAL_DEVSTRAL_MODEL=devstral-2512
LLM_MISTRAL_DEVSTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_DEVSTRAL_KEY=your-mistral-api-key
LLM_MISTRAL_DEVSTRAL_STRUCTURED_OUTPUT=true

# Namespaces
NEO4J_GRAPH_NAMESPACE=Graph
ARCHIMATE_NAMESPACE=Model

See .env.example for all available options.

Rate Limiting

The LLM adapter includes built-in rate limiting to prevent API throttling:

# Requests per minute (0 = use provider default: 60 RPM for cloud, unlimited for local)
LLM_RATE_LIMIT_RPM=0

# Minimum delay between requests in seconds
LLM_RATE_LIMIT_DELAY=0.0

# Max retries on rate limit (429) errors
LLM_RATE_LIMIT_RETRIES=3

Default rate limits by provider:

Provider	Default RPM
OpenAI	30
Anthropic	30
Mistral	24
Ollama	Unlimited
LM Studio	Unlimited

The rate limiter automatically:

Throttles requests to stay within limits
Applies exponential backoff on rate limit errors (HTTP 429)
Handles timeout errors with backoff retries

Managing File Types

If you encounter undefined extensions during extraction:

Via UI (Marimo):

Navigate to Column 2 → Undefined Extensions
Add them to the registry:
- Extension (e.g., .tsx, Dockerfile)
- Type (source, config, docs, test, build, asset, data, exclude)
- Subtype (e.g., typescript, docker)

Via CLI:

# List all registered file types
deriva config filetype list

# Add a new file type
deriva config filetype add ".tsx" source typescript

# Delete a file type
deriva config filetype delete ".tsx"

# Show file type statistics by category
deriva config filetype stats

Note: Files with unrecognized extensions are automatically classified as file_type="unknown" with their extension as the subtype. This ensures all files get proper classification even without explicit registry entries.

Updating Configurations (Versioning)

Deriva uses a versioning system for configurations. When you update a config, a new version is created while preserving previous versions for rollback.

Correct ways to update configs:

Via UI (Marimo): Navigate to the config section, edit, and click "Save Config"
Via CLI: Use the config update command

# Update extraction config instruction
deriva config update extraction BusinessConcept \
  -i "New instruction text..."

# Update derivation config from file
deriva config update derivation ApplicationComponent \
  --instruction-file prompts/app_component.txt

# View all versions
deriva config versions

Do NOT use JSON import/export for config updates. The db_tool import command is only for backup restoration or migration - it overwrites version history. See BENCHMARKS.md for the optimization workflow.

Customizing Extraction Prompts

For LLM-assisted extraction steps:

Navigate to Column 2 → Extraction Step Configuration
Expand a node type (e.g., TypeDefinition)
Edit: Input File Types, Input Graph Elements, Instruction, Example
Click "Save Config" (this creates a new version)

All prompts follow the Input + Instruction + Example pattern.

UI Layout

Deriva uses a multi-column marimo notebook layout:

Column	Purpose
0	Run Deriva: Pipeline execution buttons, status display
1	Configuration: Runs, repositories, Neo4j, graph stats, ArchiMate, LLM
2	Extraction Settings: File type registry, extraction step configuration
3	Derivation Settings: Element type configuration (13 types across Business/Application/Technology layers), relationship derivation

The UI is powered by PipelineSession from the services layer, providing a clean separation between presentation and business logic.

Data Storage

Neo4j Graph Database:
- Graph namespace: Intermediate representation (Modules, Files, Dependencies)
- Model namespace: ArchiMate elements and relationships
DuckDB (deriva/adapters/database/sql.db): File type registry, extraction configs, settings

Clearing Data

Column 0: Run Overview

Clear Graph: Removes all nodes/edges from Graph namespace
Clear Model: Removes all ArchiMate elements and relationships

Querying Neo4j Directly

Access the Neo4j browser at http://localhost:7474 and run Cypher queries:

// See all repositories
MATCH (r:Graph:Repository) RETURN r

// See files in a repo
MATCH (repo:Graph:Repository)-[:Graph:CONTAINS*]->(f:Graph:File)
WHERE repo.name = 'my-repo'
RETURN f.name, f.file_type

// See type definitions
MATCH (td:Graph:TypeDefinition) RETURN td.name, td.type_category

CLI (Headless Mode)

Deriva includes a full CLI for headless operation and automation:

# Help
deriva --help

# View configuration
deriva config list extraction
deriva config show extraction BusinessConcept
deriva status

# Manage file types
deriva config filetype list
deriva config filetype add ".lock" dependency lock
deriva config filetype stats

# Run pipeline stages
deriva run extraction --repo flask_invoice_generator -v
deriva run derivation -v
deriva run derivation --phase generate -v  # Run specific phase (enrich, generate, refine)
deriva run all --repo myrepo

# Export ArchiMate model
deriva export -o workspace/output/model.archimate

CLI Options:

Option	Description
`--repo NAME`	Process specific repository (default: all)
`--phase PHASE`	Run specific derivation phase: enrich, generate, or refine
`-v, --verbose`	Print detailed progress
`--no-llm`	Skip LLM-based steps (structural extraction only)
`-o, --output PATH`	Output file path for export

Benchmarking

Deriva includes a multi-model benchmarking system for comparing LLM performance across different providers and models. See BENCHMARKS.md for the full guide and optimization_guide.md for detailed case studies.

Running Benchmarks

# List available benchmark models
deriva benchmark models

# Run a benchmark with specific models
deriva benchmark run \
  --repos flask_invoice_generator \
  --models openai-gptx,ollama-devstral \
  -n 3 \
  -d "Comparing gptx with devstral" \
  -v

# List benchmark sessions
deriva benchmark list

# Analyze a benchmark session
deriva benchmark analyze bench_20260101_150724

Configuring Benchmark Models

Add models to .env using the pattern:

# Azure GPT-4o-mini
LLM_AZURE_GPT4MINI_PROVIDER=azure
LLM_AZURE_GPT4MINI_MODEL=gpt-4
LLM_AZURE_GPT4MINI_URL=https://your-resource.openai.azure.com/...
LLM_AZURE_GPT4MINI_KEY=your-api-key

# Ollama local model
LLM_OLLAMA_LLAMA_PROVIDER=ollama
LLM_OLLAMA_LLAMA_MODEL=devstral
LLM_OLLAMA_LLAMA_URL=http://localhost:11434/api/chat

OCEL Event Logging

Benchmark runs are logged in OCEL 2.0 (Object-Centric Event Log) format for process mining analysis:

Events capture pipeline stages, LLM calls, and results
Object types: BenchmarkSession, BenchmarkRun, Repository, Model
Logs are saved to workspace/benchmarks/{session_id}/events.ocel.json

OCEL files can be analyzed with process mining tools like PM4Py, Celonis, or custom analysis scripts.

Troubleshooting

Neo4j Connection Issues

# Check if running
docker ps

# View logs
cd deriva/adapters/neo4j && docker-compose logs

# Restart
docker-compose restart

# Clear all data (destructive!)
docker-compose down -v

Port Conflicts

If ports 7687/7474 are in use, edit deriva/adapters/neo4j/docker-compose.yml:

ports:
  - "7688:7687"
  - "7475:7474"

Update .env accordingly:

NEO4J_URI=bolt://localhost:7688

Marimo Issues

# Check Python version
python --version  # Should be 3.14+

# Reinstall dependencies
uv sync --reinstall

# Run without watch mode
uv run marimo edit deriva/app/app.py

Contributing

For development setup, architecture details, and contribution guidelines, see CONTRIBUTING.md.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

This means you can freely use, modify, and distribute this software, but if you run a modified version as a network service, you must make the source code available to users of that service.

See LICENSE for the full license text.

Acknowledgments

Marimo - Reactive Python notebooks
Neo4j - Graph database
ArchiMate - Enterprise architecture standard
Archi - Open source ArchiMate modeling tool
Tree-sitter - Multi-language AST parsing

Status: Active Development

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
.github		.github
deriva		deriva
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
ARCHITECTURE.MD		ARCHITECTURE.MD
BENCHMARKS.md		BENCHMARKS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
OPTIMIZATION.md		OPTIMIZATION.md
README.md		README.md
conftest.py		conftest.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Uh oh!

License

StevenBtw/deriva

Folders and files

Latest commit

History

Repository files navigation

Deriva

How It Works

Quick Setup

Prerequisites

1. Install uv

2. Clone and Configure

3. Create Python Environment

4. Install Dependencies

5. Start Neo4j

6. Launch Deriva

First Time Setup

1. Seed File Type Registry

2. Enable Extraction Steps

3. Configure LLM (Optional)

Using Deriva

Basic Workflow

1. Clone a Repository

2. Run the Pipeline

3. View Results

4. Export to Archi

Configuration

Environment Variables (.env)

Rate Limiting

Managing File Types

Updating Configurations (Versioning)

Customizing Extraction Prompts

UI Layout

Data Storage

Clearing Data

Querying Neo4j Directly

CLI (Headless Mode)

Benchmarking

Running Benchmarks

Configuring Benchmark Models

OCEL Event Logging

Troubleshooting

Neo4j Connection Issues

Port Conflicts

Marimo Issues

Contributing

License

Acknowledgments

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 5

Sponsor this project

Uh oh!

Languages