
Bloodhound


A deterministic simulation testing platform for hunting bugs in distributed systems.

Bloodhound uses a modified QEMU hypervisor to make executions of containerized applications fully reproducible, enabling systematic exploration of failure scenarios that are difficult or impossible to reproduce with traditional testing.

Version 0.2.0 - OCI image integration, registry auth, container auto-translation, property checking. See Release Notes.

Research Note: This project is co-authored with Claude (Anthropic) as an experiment in AI-assisted systems programming. See CLAUDE.md for project philosophy and AI collaboration guidelines.

Platform Support: Linux x86_64 is fully tested with real VMs. macOS supports only harness/simulation mode. See Platform Support for details.

Features

  • Deterministic Execution: Same seed produces identical execution every time
  • Language Agnostic: Test any containerized application (Go, Rust, Java, Python, etc.)
  • Fault Injection: Network partitions, disk failures, process crashes, clock skew
  • Time-Travel Debugging: Full replay capability with GDB integration
  • Coverage-Guided Exploration: Prioritizes execution paths that reach new code coverage
  • Docker Compose Integration: Test multi-service stacks directly
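The principle behind "same seed, identical execution" can be shown in miniature. The Rust sketch below is not Bloodhound's implementation (which controls nondeterminism inside the hypervisor); it only illustrates the idea under that simplification: if every nondeterministic choice, here the order in which in-flight messages are delivered, is drawn from a PRNG seeded with the run's seed, then replaying the seed replays the execution. `SplitMix64` and `delivery_order` are illustrative names.

```rust
/// SplitMix64: a small, well-known PRNG. Same seed => same sequence.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n (modulo bias is ignored in this sketch).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Hypothetical helper: every scheduling choice (which in-flight message
/// arrives next) comes from the seeded PRNG, so the delivery order is a
/// pure function of the seed.
fn delivery_order(seed: u64, messages: &[&str]) -> Vec<String> {
    let mut rng = SplitMix64::new(seed);
    let mut in_flight: Vec<String> = messages.iter().map(|m| m.to_string()).collect();
    let mut delivered = Vec::with_capacity(in_flight.len());
    while !in_flight.is_empty() {
        let next = rng.below(in_flight.len());
        delivered.push(in_flight.remove(next));
    }
    delivered
}
```

Calling delivery_order with the same seed always yields the same interleaving; changing the seed selects a different one, which is what lets a failing seed be rerun exactly.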

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       BLOODHOUND                            │
│              (Modified QEMU Hypervisor)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Virtual Time │  │ Fault Inject │  │ State Snap   │       │
│  │ (TSC, HPET)  │  │ (Net, Disk)  │  │ (CoW, Tree)  │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
├─────────────────────────────────────────────────────────────┤
│                    GUEST VMs (Containers)                   │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐             │
│  │  web   │  │ redis  │  │postgres│  │  ...   │             │
│  └────────┘  └────────┘  └────────┘  └────────┘             │
└─────────────────────────────────────────────────────────────┘

Quick Start

Installation

Option 1: Download Release Binary (Recommended)

# Linux x86_64
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-linux-amd64.tar.gz
tar -xzf bloodhound-linux-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/

# macOS Intel
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-amd64.tar.gz
tar -xzf bloodhound-darwin-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/

# macOS Apple Silicon
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-arm64.tar.gz
tar -xzf bloodhound-darwin-arm64.tar.gz
sudo mv bloodhound /usr/local/bin/

# Verify installation
bloodhound --version

Option 2: Install from Source

# Using cargo
cargo install --git https://github.com/nerdsane/bloodhound

# Or build manually
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound
cargo build --release

Option 3: Full Setup with Real VMs (Linux Only)

# Clone the repository
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound

# Build the CLI
cargo build --release

# Build the deterministic kernel
./scripts/build-guest.sh

# Clone and build the patched QEMU hypervisor
git clone https://github.com/nerdsane/qemu.git
cd qemu && git checkout bloodhound
mkdir build && cd build
../configure --target-list=x86_64-softmmu --enable-bloodhound
make -j$(nproc)
cd ../..

# Verify installation
./target/release/bloodhound --version

Basic Usage

# Run with specific seed (deterministic)
bloodhound run --compose docker-compose.yml --seed 42

# Explore for bugs (coverage-guided)
bloodhound explore --compose docker-compose.yml --seeds 10000 --timeout 1h

# Debug a failure with time-travel debugging
bloodhound debug --seed 42 --gdb-port 1234

# Run as part of CI/CD
bloodhound test --compose docker-compose.yml --coverage-threshold 80%
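The coverage guidance used by explore can be pictured as a corpus loop similar to coverage-guided fuzzing: run a seed, record which code it reached, and keep the seed only if it reached something new. In the sketch below, `run_seed` is a hypothetical toy standing in for one simulated run (a real run would report coverage collected from the guest), and `build_corpus` is an illustrative name. It shows only the selection step; a real explorer would also branch from or mutate the kept seeds.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one simulated run: returns the IDs of the
/// branches this seed exercised.
fn run_seed(seed: u64) -> HashSet<u32> {
    let mut covered = HashSet::new();
    covered.insert(0); // entry point, always reached
    if seed % 2 == 0 {
        covered.insert(1);
    }
    if seed % 3 == 0 {
        covered.insert(2);
    }
    if seed % 7 == 0 {
        covered.insert(3);
    }
    covered
}

/// Greedy coverage-guided selection: keep a seed only if it reaches at
/// least one branch that no previously kept seed reached.
fn build_corpus(seeds: impl IntoIterator<Item = u64>) -> (Vec<u64>, HashSet<u32>) {
    let mut corpus = Vec::new();
    let mut global = HashSet::new();
    for seed in seeds {
        let cov = run_seed(seed);
        let new_branches = cov.difference(&global).count();
        if new_branches > 0 {
            global.extend(cov);
            corpus.push(seed);
        }
    }
    (corpus, global)
}
```

With seeds 1 through 14, only seeds 1, 2, 3 and 7 are kept: every other seed reaches branches that an earlier seed already covered.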

Configuration

Create a bloodhound.yaml file in your project directory:

# Docker Compose file to test
compose: docker-compose.yml

# Simulation settings
simulation:
  max_time: 5m        # Maximum simulation time
  time_step: 10ms     # Time step granularity
  seeds: 1000         # Number of seeds to explore

# Workload configuration
workload:
  driver: http
  config:
    target: http://lb:80
    qps: 100
    duration: 60s
    operations:
      - type: put
        weight: 40
        key_pattern: "key-{random:1-10000}"
      - type: get
        weight: 50
      - type: delete
        weight: 10

# Properties to verify
properties:
  - name: no-data-loss
    kind: safety
    description: "Acknowledged writes must not be lost"
    check:
      type: linearizability
      operations: [put, get]

  - name: partition-recovery
    kind: liveness
    description: "Cluster must recover within 30s after partition heals"
    timeout: 30s
    check:
      type: http
      endpoint: http://lb:80/health
      expect:
        status: 200

# Fault injection
faults:
  network:
    drop_rate: 0.01
    delay_ms: 50
    delay_rate: 0.05
    partition_probability: 0.001
    partition_duration_ms: [1000, 10000]

  disk:
    write_fail_rate: 0.001
    partial_write_rate: 0.0005
    fsync_fail_rate: 0.001

  process:
    crash_probability: 0.0001
    pause_probability: 0.0005
    oom_probability: 0.00001

# Exploration settings
exploration:
  strategy: coverage-guided  # bfs, dfs, random, coverage-guided
  max_depth: 1000
  max_states: 100000
  parallel_workers: 8
  prioritize_coverage: true
  prioritize_near_violation: true

# Output settings
output:
  dir: ./output
  save_violation_traces: true
  save_interesting_seeds: true
  coverage_format: html
  summary_report: true

# Debugging
debug:
  gdb_enabled: true
  gdb_port: 1234
  trace_level: normal
  trace_events:
    - network_send
    - network_recv
    - disk_write
    - process_crash
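To make the workload section concrete, here is one plausible reading of the `weight` and `key_pattern` fields, sketched in Rust: each operation is chosen with probability proportional to its weight (put 40%, get 50%, delete 10%), and keys are drawn from key-1 through key-10000. This interpretation is an assumption, as are the names `pick_weighted` and `generate`, the use of a seeded SplitMix64 PRNG, and applying the key pattern to every operation type.

```rust
/// SplitMix64 PRNG: same seed => same sequence.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Weights copied from the example bloodhound.yaml.
const OPS: [(&str, u32); 3] = [("put", 40), ("get", 50), ("delete", 10)];

/// Map a roll in 0..total_weight onto an operation via cumulative weights.
fn pick_weighted<'a>(ops: &[(&'a str, u32)], roll: u32) -> &'a str {
    let mut acc = 0;
    for &(name, weight) in ops {
        acc += weight;
        if roll < acc {
            return name;
        }
    }
    panic!("roll {roll} is outside the total weight");
}

/// Hypothetical generator: a deterministic (operation, key) stream per seed.
fn generate(seed: u64, n: usize) -> Vec<(String, String)> {
    let total: u32 = OPS.iter().map(|&(_, w)| w).sum();
    let mut rng = SplitMix64::new(seed);
    (0..n)
        .map(|_| {
            let op = pick_weighted(&OPS, (rng.next_u64() % total as u64) as u32);
            let key = 1 + rng.next_u64() % 10_000; // "key-{random:1-10000}"
            (op.to_string(), format!("key-{key}"))
        })
        .collect()
}
```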

Property Types

Safety Properties

Safety properties assert that "bad things never happen." They are checked continuously throughout execution.

- name: no-data-loss
  kind: safety
  check:
    type: linearizability
    operations: [put, get]
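A linearizability check asks whether the observed operations, each with an invocation and a response time, can be arranged in one sequential order that respects real-time ordering and is legal for the data type. The sketch below is a brute-force backtracking checker for a single key treated as a register (put writes a value; get returns the current value, or None if nothing was written). It is a generic illustration of the technique, not Bloodhound's checker; `Op`, `Event`, and `linearizable` are hypothetical names, and the search is exponential in the worst case, so practical checkers add pruning and memoization.

```rust
#[derive(Clone, Copy)]
enum Op {
    Put(u32),
    Get(Option<u32>), // value the client observed
}

#[derive(Clone, Copy)]
struct Event {
    invoke: u64,
    response: u64,
    op: Op,
}

/// True if some sequential order of `history` explains every result.
fn linearizable(history: &[Event]) -> bool {
    let mut done = vec![false; history.len()];
    search(history, &mut done, None)
}

fn search(h: &[Event], done: &mut [bool], value: Option<u32>) -> bool {
    if done.iter().all(|&d| d) {
        return true;
    }
    // An op may be linearized next only if no pending op finished before it began.
    let min_resp = (0..h.len())
        .filter(|&i| !done[i])
        .map(|i| h[i].response)
        .min()
        .unwrap();
    for i in 0..h.len() {
        if done[i] || h[i].invoke > min_resp {
            continue;
        }
        let next = match h[i].op {
            Op::Put(v) => Some(v),
            Op::Get(seen) if seen == value => value,
            Op::Get(_) => continue, // read disagrees with the register here
        };
        done[i] = true;
        if search(h, done, next) {
            return true;
        }
        done[i] = false; // backtrack and try another candidate
    }
    false
}
```

A get that starts after an acknowledged put has completed but returns the old value is rejected, which is exactly the "acknowledged writes must not be lost" violation.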

Liveness Properties

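Liveness properties assert that "good things eventually happen." Because a simulation runs for a bounded time, each liveness property includes a timeout: it is violated if the condition does not hold within that window.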

- name: leader-election
  kind: liveness
  timeout: 5s
  check:
    type: custom
    script: ./checks/has-leader.sh
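In a deterministic simulation, a liveness property with a timeout reduces to a bounded question over virtual time: after the triggering event, does the check pass before the deadline? The sketch below assumes the check (for example the has-leader script) has been sampled into (virtual time in ms, passed) pairs; the sample format and the `time_to_satisfy` name are assumptions for illustration.

```rust
/// Virtual-time latency from `start` until the check first passes, or None
/// if it never passes within `timeout_ms` (a liveness violation).
fn time_to_satisfy(samples: &[(u64, bool)], start: u64, timeout_ms: u64) -> Option<u64> {
    samples
        .iter()
        // `t >= start` is checked first, so `t - start` cannot underflow.
        .filter(|&&(t, ok)| ok && t >= start && t - start <= timeout_ms)
        .map(|&(t, _)| t - start)
        .min()
}
```

With `timeout: 5s`, a leader that reappears 3.2 s after the trigger satisfies the property, while a 3 s timeout would report a violation for the same samples.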

Invariants

Invariants are conditions that must hold in every state the simulation reaches; they are evaluated at every step.

- name: single-leader-per-term
  kind: invariant
  check:
    type: custom
    script: ./checks/single-leader.sh
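The logic a script like ./checks/single-leader.sh might implement is sketched below in Rust. Because an invariant must hold in every state, the checker is called once per step and remembers which node led each term, so a second leader for the same term is caught even if the first has already stepped down. `NodeView` and `SingleLeaderPerTerm` are hypothetical names, and extracting node state from the guests is not shown.

```rust
use std::collections::HashMap;

/// Hypothetical per-node view captured at one simulation step.
struct NodeView {
    id: u32,
    term: u64,
    is_leader: bool,
}

/// Remembers the first leader seen for each term across all steps.
#[derive(Default)]
struct SingleLeaderPerTerm {
    leader_of: HashMap<u64, u32>,
}

impl SingleLeaderPerTerm {
    /// Call once per step. Err(term) means that term has had two leaders.
    fn observe(&mut self, nodes: &[NodeView]) -> Result<(), u64> {
        for n in nodes.iter().filter(|n| n.is_leader) {
            let first = *self.leader_of.entry(n.term).or_insert(n.id);
            if first != n.id {
                return Err(n.term);
            }
        }
        Ok(())
    }
}
```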

Fault Injection

Bloodhound can inject the following types of faults deterministically, so a given seed always produces the same faults at the same points in the execution:

Network Faults

  • Packet drop: Randomly drop network packets
  • Packet delay: Add latency to network communication
  • Packet corruption: Corrupt packet data
  • Network partitions: Isolate nodes from each other

Disk Faults

  • Write failures: Fail disk writes
  • Partial writes: Simulate torn writes (power failure during write)
  • fsync failures: Fail fsync calls
  • Read corruption: Return corrupted data on reads

Process Faults

  • Crashes: Kill processes (SIGKILL)
  • Pauses: Pause processes (SIGSTOP)
  • OOM kills: Simulate out-of-memory conditions
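Deterministic fault injection follows from drawing every fault decision from the seeded PRNG. The sketch below applies the network settings from the example configuration (drop_rate 0.01, delay_rate 0.05, delay_ms 50) to each packet with a single uniform roll, and samples a partition length from the [1000, 10000] ms range. The mutually exclusive drop/delay rule, reading partition_duration_ms as an inclusive range, and all names here are assumptions for illustration, not Bloodhound's documented semantics.

```rust
/// SplitMix64 PRNG: same seed => same sequence of fault decisions.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

#[derive(Debug, PartialEq)]
enum PacketFate {
    Deliver,
    Drop,
    Delay(u64), // extra latency in ms
}

/// Hypothetical mirror of the `faults.network` settings.
struct NetFaults {
    drop_rate: f64,
    delay_rate: f64,
    delay_ms: u64,
}

/// One roll per packet: [0, drop) drops, [drop, drop + delay) delays.
fn packet_fate(rng: &mut SplitMix64, cfg: &NetFaults) -> PacketFate {
    let r = rng.unit();
    if r < cfg.drop_rate {
        PacketFate::Drop
    } else if r < cfg.drop_rate + cfg.delay_rate {
        PacketFate::Delay(cfg.delay_ms)
    } else {
        PacketFate::Deliver
    }
}

/// Partition length in the inclusive range [min_ms, max_ms].
fn partition_duration_ms(rng: &mut SplitMix64, min_ms: u64, max_ms: u64) -> u64 {
    min_ms + rng.next_u64() % (max_ms - min_ms + 1)
}

/// The fates of the first `n` packets for a given seed.
fn fates(seed: u64, cfg: &NetFaults, n: usize) -> Vec<PacketFate> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| packet_fate(&mut rng, cfg)).collect()
}
```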

Time-Travel Debugging

When a bug is found, Bloodhound can replay the exact execution:

# Start debugging session
bloodhound debug --seed 42 --gdb-port 1234

# In another terminal, connect with GDB
gdb -ex "target remote :1234"

The debugger supports:

  • Step forward/backward: Navigate through execution
  • Breakpoints: Set breakpoints that work across time
  • Watchpoints: Watch variables change over time
  • Reverse execution: Step backward to find bug origins
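Reverse execution on a deterministic machine is commonly built from periodic snapshots plus replay: to reach an earlier step, restore the closest snapshot before it and re-execute forward, which reproduces the exact state because execution is deterministic. The sketch below shows that technique on a toy state machine; whether Bloodhound's debugger works exactly this way is an assumption, and `step` and `Recorder` are hypothetical names.

```rust
/// Hypothetical deterministic step of a toy machine (stands in for the guest).
fn step(state: u64, i: u64) -> u64 {
    state.wrapping_mul(6_364_136_223_846_793_005).wrapping_add(i | 1)
}

/// snapshots[k] holds the state after k * interval steps.
struct Recorder {
    interval: u64, // must be > 0
    snapshots: Vec<u64>,
}

impl Recorder {
    /// Run forward once, snapshotting every `interval` steps.
    fn record(initial: u64, total_steps: u64, interval: u64) -> Self {
        let mut snapshots = vec![initial];
        let mut s = initial;
        for i in 0..total_steps {
            s = step(s, i);
            if (i + 1) % interval == 0 {
                snapshots.push(s);
            }
        }
        Self { interval, snapshots }
    }

    /// State after `n` steps: restore the nearest earlier snapshot and replay.
    /// "Step backward" from step n is simply state_at(n - 1).
    fn state_at(&self, n: u64) -> u64 {
        let k = ((n / self.interval) as usize).min(self.snapshots.len() - 1);
        let mut s = self.snapshots[k];
        for i in (k as u64 * self.interval)..n {
            s = step(s, i);
        }
        s
    }
}
```

The snapshot interval trades memory for replay time: a smaller interval makes each backward step cheaper but stores more snapshots.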

CI/CD Integration

GitHub Actions

name: Bloodhound Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: datadog/bloodhound-action@v1
        with:
          compose: docker-compose.yml
          seeds: 1000
          timeout: 30m
          coverage-threshold: 80%

Examples

See the examples/ directory for complete examples:

  • redis-rust-demo/: Bug reproduction demo that finds and reproduces a CRDT consistency bug using async-VM mode. See Demo Documentation.
  • redis-rust/: Simple redis-rust cluster example
  • distributed-kv/: A distributed key-value store with Raft consensus
  • message-queue/: A distributed message queue
  • cache-cluster/: A distributed cache with consistent hashing

How It Works

  1. Deterministic Hypervisor: Bloodhound uses a modified QEMU with TCG (Tiny Code Generator) mode to ensure deterministic execution. All sources of non-determinism (time, random numbers, I/O ordering) are controlled.

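  2. Virtual Time: Guest clocks are driven by virtual time rather than the host's wall clock, so periods in which the guest is only waiting can be skipped. Simulations therefore run faster than real time without changing the behavior the guest observes; a 5-minute simulation might complete in seconds.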

  3. Snapshot Tree: Bloodhound maintains a tree of VM snapshots using copy-on-write, enabling efficient exploration of different execution paths from the same starting point.

  4. Coverage-Guided Exploration: Like fuzzing, Bloodhound prioritizes execution paths that discover new code coverage, efficiently finding edge cases.

  5. Property Checking: User-defined properties are checked at each step, with violations triggering detailed trace capture.
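The virtual time idea (item 2) can be illustrated with a discrete-event clock: instead of sleeping, the simulator jumps virtual time directly to the next scheduled event, so a timer set five minutes ahead costs no real waiting. The sketch below is a generic illustration, not Bloodhound's virtual time implementation; `VirtualClock` is a hypothetical name, and ties are broken by scheduling order so that events with equal timestamps are also processed deterministically.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical discrete-event clock.
struct VirtualClock {
    now_ms: u64,
    seq: u64,
    // Min-heap via Reverse: (due time, scheduling order, label).
    queue: BinaryHeap<Reverse<(u64, u64, String)>>,
}

impl VirtualClock {
    fn new() -> Self {
        Self { now_ms: 0, seq: 0, queue: BinaryHeap::new() }
    }

    /// Schedule an event `delay_ms` after the current virtual time.
    fn schedule(&mut self, delay_ms: u64, label: &str) {
        self.seq += 1;
        self.queue
            .push(Reverse((self.now_ms + delay_ms, self.seq, label.to_string())));
    }

    /// Jump virtual time to the earliest pending event; nothing sleeps.
    fn next_event(&mut self) -> Option<(u64, String)> {
        let Reverse((due, _, label)) = self.queue.pop()?;
        self.now_ms = due;
        Some((due, label))
    }
}
```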

Execution Modes

Mode         Status    Description
Simulation   Working   In-memory deterministic simulation with fault injection
Docker       Working   Container-based testing with real Docker Compose stacks
Async-VM     Working   Real QEMU VMs with deterministic execution via patched QEMU

Async-VM Mode

Async-VM mode runs real QEMU virtual machines with deterministic time control and fault injection via the custom bloodhound-ctrl QMP command implemented in the QEMU fork.

Features:

  • Deterministic time control (set_time, advance_time, freeze_time)
  • Deterministic RNG seeding (set_seed)
  • Fault injection (inject_network_fault, inject_disk_fault)
  • VM snapshots and restore
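QMP messages are JSON objects: after the server's greeting, a client sends qmp_capabilities and then commands of the form {"execute": ..., "arguments": {...}}. The sketch below builds such request lines for the fork's bloodhound-ctrl command. The argument layout ("action" and "value"), the units of the advance_time value, and the helper names are hypothetical; consult the QEMU fork for the actual schema before relying on it.

```rust
/// Build one newline-terminated QMP request. `arguments` must already be a
/// JSON object; no escaping is done, so only pass trusted strings.
fn qmp_request(execute: &str, arguments: Option<&str>) -> String {
    match arguments {
        Some(args) => format!("{{\"execute\": \"{execute}\", \"arguments\": {args}}}\n"),
        None => format!("{{\"execute\": \"{execute}\"}}\n"),
    }
}

/// HYPOTHETICAL argument layout for the fork's `bloodhound-ctrl` command.
fn bloodhound_ctrl(action: &str, value: Option<u64>) -> String {
    let args = match value {
        Some(v) => format!("{{\"action\": \"{action}\", \"value\": {v}}}"),
        None => format!("{{\"action\": \"{action}\"}}"),
    };
    qmp_request("bloodhound-ctrl", Some(args.as_str()))
}

/// Hypothetical setup: negotiate capabilities, seed the RNG, freeze the
/// guest clock, then advance it by a fixed amount (units are an assumption).
fn setup_sequence(seed: u64, advance_by: u64) -> Vec<String> {
    vec![
        qmp_request("qmp_capabilities", None),
        bloodhound_ctrl("set_seed", Some(seed)),
        bloodhound_ctrl("freeze_time", None),
        bloodhound_ctrl("advance_time", Some(advance_by)),
    ]
}
```

Each line would be written to the VM's QMP socket in order, reading one reply per request.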

Example:

# Build the patched QEMU (required for async-VM mode)
git clone https://github.com/nerdsane/qemu.git
cd qemu && git checkout bloodhound
mkdir build && cd build
../configure --target-list=x86_64-softmmu --enable-bloodhound
make -j$(nproc)
cd ../..

# Run async-VM exploration
bloodhound explore \
  --compose docker-compose.yml \
  --async-vm \
  --base-image disk.qcow2 \
  --kernel guest/build/bzImage \
  --initrd guest/build/initramfs.cpio.gz \
  --qemu qemu/build/qemu-system-x86_64 \
  --seeds 10 \
  --max-depth 5

Other commands:

# Simulation mode (no QEMU required)
bloodhound explore --compose docker-compose.yml --seeds 1000

# Docker-based testing
bloodhound test --compose docker-compose.yml --config bloodhound.yaml

# Determinism verification
bloodhound verify --kernel guest/build/bzImage --initrd guest/build/initramfs.cpio.gz
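One way to check determinism is to run the same seed twice and compare the event traces. Comparing compact prefix digests rather than full traces keeps the comparison cheap and still pinpoints the first event where the runs diverge. The sketch below uses 64-bit FNV-1a; the string trace format and the names `prefix_digests` and `first_divergence` are assumptions, and this is not necessarily how bloodhound verify works.

```rust
/// digest[i] covers events 0..=i, so once two runs diverge every later
/// digest differs as well (barring a hash collision).
fn prefix_digests(trace: &[&str]) -> Vec<u64> {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    let mut out = Vec::with_capacity(trace.len());
    for event in trace {
        // A trailing 0 byte separates events so "ab","c" differs from "a","bc".
        for &b in event.as_bytes().iter().chain(std::iter::once(&0u8)) {
            h ^= b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a 64-bit prime
        }
        out.push(h);
    }
    out
}

/// Index of the first event where two runs differ, or None if identical.
fn first_divergence(run_a: &[&str], run_b: &[&str]) -> Option<usize> {
    let (da, db) = (prefix_digests(run_a), prefix_digests(run_b));
    (0..da.len().max(db.len())).find(|&i| da.get(i) != db.get(i))
}
```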

Platform Support

Platform        Harness Mode   Real VM Mode       Status
Linux x86_64    ✅ Full         ✅ Full             Fully Tested
macOS Intel     ✅ Full         ⚠️ Experimental     Harness mode tested
macOS ARM       ✅ Full         ⚠️ Experimental     Harness mode tested

Linux x86_64 (Recommended)

  • Full support for all features
  • Real VM mode with QEMU/KVM tested and working
  • 296 unit tests + 44 integration tests passing

macOS (Harness Mode Only)

  • Harness/simulation mode fully works (all unit tests pass)
  • Real VM mode (--async-vm) is untested and requires:
    • QEMU built with HVF (Hypervisor.framework) support
    • Different VM configuration than Linux
  • Use bloodhound test --actor-mode for simulation testing

Requirements

  • Linux: KVM support recommended (x86_64)
  • macOS: Harness mode works out of the box
  • Docker (for container-based tests)
  • 8GB+ RAM recommended
  • SSD recommended for snapshot storage

Optional: Patched QEMU (Linux Only)

For full deterministic async-VM execution, build the patched QEMU:

git clone https://github.com/nerdsane/qemu.git
cd qemu && git checkout bloodhound
mkdir build && cd build
../configure --target-list=x86_64-softmmu --enable-bloodhound
make -j$(nproc)

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License. See LICENSE for details.

Documentation

Acknowledgments

Bloodhound is inspired by:

Disclaimer

This is a research project exploring deterministic simulation testing techniques. It is:

  • Not production-ready: Core features work but edge cases exist
  • Under active development: APIs may change without notice
  • Co-authored with AI: Developed collaboratively with Claude (Anthropic)

The goal is to explore what's possible with human-AI collaboration on complex systems programming, not to provide a production-ready tool. Use at your own risk.
