Skip to content

Deriva is a research project aimed at enabling digital architects by automating the derivation of architecture models. Deriva uses a combination of knowledge graphs, heuristics and LLM's.

License

Notifications You must be signed in to change notification settings

StevenBtw/deriva

Repository files navigation

Deriva

Research Project Build Status License Python 3.14+ Docker Neo4j Marimo

Automatically generate ArchiMate enterprise architecture models from software repositories.

Deriva analyzes code repositories and transforms them into ArchiMate models that can be opened in the Archi modeling tool.

How It Works

  1. Clone a Git repository
  2. Extract a graph representation into Neo4j:
    • Structural nodes: directories, files (classified by type and subtype)
    • Semantic nodes: TypeDefinitions, Methods, BusinessConcepts, Technologies, etc.
    • Python files use fast AST extraction; other languages use LLM
  3. Derive ArchiMate elements using a hybrid approach:
    • Enrich phase: Graph enrichment (PageRank, Louvain communities, k-core)
    • Generate phase: LLM-based element derivation with graph metrics
    • Refine phase: Relationship derivation and quality assurance
  4. Export to .archimate XML file

Quick Setup

Prerequisites

  • Python 3.14+
  • Docker (for Neo4j)
  • uv (Python package manager)

1. Install uv

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone and Configure

git clone https://github.com/StevenBtw/Deriva.git
cd Deriva

# Create environment configuration
cp .env.example .env
# Edit .env with your settings (Neo4j, LLM API keys, etc.)

3. Create Python Environment

uv venv --python 3.14

Activate the virtual environment:

# Windows PowerShell
.venv\Scripts\Activate.ps1

# Windows Command Prompt
.venv\Scripts\activate.bat

# macOS/Linux
source .venv/bin/activate

4. Install Dependencies

uv sync

5. Start Neo4j

cd deriva/adapters/neo4j
docker-compose up -d

Neo4j will be available at:

Verify Neo4j is running:

docker ps  # Should show deriva_neo4j container

6. Launch Deriva

cd ../../..  # Back to Deriva root
uv run marimo edit deriva/app/app.py

The marimo notebook opens in your browser at: http://127.0.0.1:2718


First Time Setup

When you first open Deriva, you need to seed the configuration database.

1. Seed File Type Registry

Navigate to Column 2: Manage ExtractionFile Type Registry

  1. Click "Seed from JSON"
  2. This loads default file type mappings from extraction_config.json
  3. Categories include: Source, Config, Docs, Test, Build, Asset, Data, Exclude

2. Enable Extraction Steps

Navigate to Column 2: Manage ExtractionExtraction Step Configuration

Enable the extraction steps you need:

Step Purpose Recommended
Repository Creates root node for the repo Always
Directory Creates directory structure nodes Always
File Creates file nodes with classification Always
TypeDefinition Extracts classes, functions (AST for Python) Yes
Method Extracts methods from type definitions Optional
Technology Detects frameworks and libraries Optional
ExternalDependency Maps external dependencies Optional
Test Extracts test definitions Optional

3. Configure LLM (Optional)

If using LLM-assisted extraction, configure your provider in .env:

LLM_PROVIDER=mistral  # or azure, anthropic
LLM_MISTRAL_API_KEY=your-key-here
LLM_MISTRAL_MODEL=devstral2
LLM_MISTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_STRUCTURED_OUTPUT=true

Using Deriva

Basic Workflow

1. Clone a Repository

Column 1: Configuration → Repositories

  1. Enter repository URL (e.g., https://github.com/user/repo.git)
  2. Optionally specify a target name
  3. Click "Clone"

2. Run the Pipeline

Column 0: Run Deriva

  • Click "Run Deriva" to run the full pipeline (extraction → derivation)
  • Or use individual step buttons: Extraction, Derivation

Results display in a status callout showing nodes/elements created and any errors.

3. View Results

Column 1: Configuration

  • Graph Statistics: Node counts by type (Repository, Directory, File, etc.)
  • ArchiMate Model: Element and relationship counts by type

4. Export to Archi

Column 1: Configuration → ArchiMate Model

  1. Set export path (default: workspace/output/model.archimate)
  2. Click "Export Model"
  3. Open the file with Archi

Via CLI:

deriva export -o workspace/output/model.archimate

Configuration

Environment Variables (.env)

All configuration lives in .env. Key settings:

# Neo4j (default docker-compose has auth disabled)
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=
NEO4J_PASSWORD=

# LLM Provider (mistral, openai, azure, anthropic, ollama, lmstudio)
LLM_MISTRAL_DEVSTRAL_PROVIDER=mistral
LLM_MISTRAL_DEVSTRAL_MODEL=devstral-2512
LLM_MISTRAL_DEVSTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_DEVSTRAL_KEY=your-mistral-api-key
LLM_MISTRAL_DEVSTRAL_STRUCTURED_OUTPUT=true

# Namespaces
NEO4J_GRAPH_NAMESPACE=Graph
ARCHIMATE_NAMESPACE=Model

See .env.example for all available options.

Rate Limiting

The LLM adapter includes built-in rate limiting to prevent API throttling:

# Requests per minute (0 = use provider default: 60 RPM for cloud, unlimited for local)
LLM_RATE_LIMIT_RPM=0

# Minimum delay between requests in seconds
LLM_RATE_LIMIT_DELAY=0.0

# Max retries on rate limit (429) errors
LLM_RATE_LIMIT_RETRIES=3

Default rate limits by provider:

Provider Default RPM
OpenAI 30
Anthropic 30
Mistral 24
Ollama Unlimited
LM Studio Unlimited

The rate limiter automatically:

  • Throttles requests to stay within limits
  • Applies exponential backoff on rate limit errors (HTTP 429)
  • Handles timeout errors with backoff retries

Managing File Types

If you encounter undefined extensions during extraction:

Via UI (Marimo):

  1. Navigate to Column 2Undefined Extensions
  2. Add them to the registry:
    • Extension (e.g., .tsx, Dockerfile)
    • Type (source, config, docs, test, build, asset, data, exclude)
    • Subtype (e.g., typescript, docker)

Via CLI:

# List all registered file types
deriva config filetype list

# Add a new file type
deriva config filetype add ".tsx" source typescript

# Delete a file type
deriva config filetype delete ".tsx"

# Show file type statistics by category
deriva config filetype stats

Note: Files with unrecognized extensions are automatically classified as file_type="unknown" with their extension as the subtype. This ensures all files get proper classification even without explicit registry entries.

Updating Configurations (Versioning)

Deriva uses a versioning system for configurations. When you update a config, a new version is created while preserving previous versions for rollback.

Correct ways to update configs:

  1. Via UI (Marimo): Navigate to the config section, edit, and click "Save Config"
  2. Via CLI: Use the config update command
# Update extraction config instruction
deriva config update extraction BusinessConcept \
  -i "New instruction text..."

# Update derivation config from file
deriva config update derivation ApplicationComponent \
  --instruction-file prompts/app_component.txt

# View all versions
deriva config versions

Do NOT use JSON import/export for config updates. The db_tool import command is only for backup restoration or migration - it overwrites version history. See BENCHMARKS.md for the optimization workflow.

Customizing Extraction Prompts

For LLM-assisted extraction steps:

  1. Navigate to Column 2Extraction Step Configuration
  2. Expand a node type (e.g., TypeDefinition)
  3. Edit: Input File Types, Input Graph Elements, Instruction, Example
  4. Click "Save Config" (this creates a new version)

All prompts follow the Input + Instruction + Example pattern.


UI Layout

Deriva uses a multi-column marimo notebook layout:

Column Purpose
0 Run Deriva: Pipeline execution buttons, status display
1 Configuration: Runs, repositories, Neo4j, graph stats, ArchiMate, LLM
2 Extraction Settings: File type registry, extraction step configuration
3 Derivation Settings: Element type configuration (13 types across Business/Application/Technology layers), relationship derivation

The UI is powered by PipelineSession from the services layer, providing a clean separation between presentation and business logic.


Data Storage

  • Neo4j Graph Database:
    • Graph namespace: Intermediate representation (Modules, Files, Dependencies)
    • Model namespace: ArchiMate elements and relationships
  • DuckDB (deriva/adapters/database/sql.db): File type registry, extraction configs, settings

Clearing Data

Column 0: Run Overview

  • Clear Graph: Removes all nodes/edges from Graph namespace
  • Clear Model: Removes all ArchiMate elements and relationships

Querying Neo4j Directly

Access the Neo4j browser at http://localhost:7474 and run Cypher queries:

// See all repositories
MATCH (r:Graph:Repository) RETURN r

// See files in a repo
MATCH (repo:Graph:Repository)-[:Graph:CONTAINS*]->(f:Graph:File)
WHERE repo.name = 'my-repo'
RETURN f.name, f.file_type

// See type definitions
MATCH (td:Graph:TypeDefinition) RETURN td.name, td.type_category

CLI (Headless Mode)

Deriva includes a full CLI for headless operation and automation:

# Help
deriva --help

# View configuration
deriva config list extraction
deriva config show extraction BusinessConcept
deriva status

# Manage file types
deriva config filetype list
deriva config filetype add ".lock" dependency lock
deriva config filetype stats

# Run pipeline stages
deriva run extraction --repo flask_invoice_generator -v
deriva run derivation -v
deriva run derivation --phase generate -v  # Run specific phase (enrich, generate, refine)
deriva run all --repo myrepo

# Export ArchiMate model
deriva export -o workspace/output/model.archimate

CLI Options:

Option Description
--repo NAME Process specific repository (default: all)
--phase PHASE Run specific derivation phase: enrich, generate, or refine
-v, --verbose Print detailed progress
--no-llm Skip LLM-based steps (structural extraction only)
-o, --output PATH Output file path for export

Benchmarking

Deriva includes a multi-model benchmarking system for comparing LLM performance across different providers and models. See BENCHMARKS.md for the full guide and optimization_guide.md for detailed case studies.

Running Benchmarks

# List available benchmark models
deriva benchmark models

# Run a benchmark with specific models
deriva benchmark run \
  --repos flask_invoice_generator \
  --models openai-gptx,ollama-devstral \
  -n 3 \
  -d "Comparing gptx with devstral" \
  -v

# List benchmark sessions
deriva benchmark list

# Analyze a benchmark session
deriva benchmark analyze bench_20260101_150724

Configuring Benchmark Models

Add models to .env using the pattern:

# Azure GPT-4o-mini
LLM_AZURE_GPT4MINI_PROVIDER=azure
LLM_AZURE_GPT4MINI_MODEL=gpt-4
LLM_AZURE_GPT4MINI_URL=https://your-resource.openai.azure.com/...
LLM_AZURE_GPT4MINI_KEY=your-api-key

# Ollama local model
LLM_OLLAMA_LLAMA_PROVIDER=ollama
LLM_OLLAMA_LLAMA_MODEL=devstral
LLM_OLLAMA_LLAMA_URL=http://localhost:11434/api/chat

OCEL Event Logging

Benchmark runs are logged in OCEL 2.0 (Object-Centric Event Log) format for process mining analysis:

  • Events capture pipeline stages, LLM calls, and results
  • Object types: BenchmarkSession, BenchmarkRun, Repository, Model
  • Logs are saved to workspace/benchmarks/{session_id}/events.ocel.json

OCEL files can be analyzed with process mining tools like PM4Py, Celonis, or custom analysis scripts.


Troubleshooting

Neo4j Connection Issues

# Check if running
docker ps

# View logs
cd deriva/adapters/neo4j && docker-compose logs

# Restart
docker-compose restart

# Clear all data (destructive!)
docker-compose down -v

Port Conflicts

If ports 7687/7474 are in use, edit deriva/adapters/neo4j/docker-compose.yml:

ports:
  - "7688:7687"
  - "7475:7474"

Update .env accordingly:

NEO4J_URI=bolt://localhost:7688

Marimo Issues

# Check Python version
python --version  # Should be 3.14+

# Reinstall dependencies
uv sync --reinstall

# Run without watch mode
uv run marimo edit deriva/app/app.py

Contributing

For development setup, architecture details, and contribution guidelines, see CONTRIBUTING.md.


License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

This means you can freely use, modify, and distribute this software, but if you run a modified version as a network service, you must make the source code available to users of that service.

See LICENSE for the full license text.

Acknowledgments

  • Marimo - Reactive Python notebooks
  • Neo4j - Graph database
  • ArchiMate - Enterprise architecture standard
  • Archi - Open source ArchiMate modeling tool
  • Tree-sitter - Multi-language AST parsing

Status: Active Development

About

Deriva is a research project aimed at enabling digital architects by automating the derivation of architecture models. Deriva uses a combination of knowledge graphs, heuristics and LLM's.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Sponsor this project

Languages