
SyntheticData


A high-performance, configurable synthetic data generator for enterprise financial simulation. SyntheticData produces realistic, interconnected General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, ML-ready graph exports, and complete enterprise process chains (S2C sourcing, HR/payroll, manufacturing, financial reporting) at scale.

Developed by Ernst & Young Ltd., Zurich, Switzerland


Overview

SyntheticData generates coherent enterprise financial data that mirrors the characteristics of real corporate accounting systems. The generated data is suitable for:

  • Machine Learning Model Development: Training fraud detection, anomaly detection, and graph neural network models
  • Audit Analytics Testing: Validating audit procedures and analytical tools with realistic data patterns
  • SOX Compliance Testing: Testing internal controls and segregation of duties monitoring systems
  • System Integration Testing: Load and stress testing for ERP and accounting platforms
  • Process Mining: Generating realistic event logs for process discovery and conformance checking
  • Training and Education: Providing realistic accounting data for professional development

The generator produces statistically accurate data grounded in empirical research on real-world general ledger patterns, so synthetic datasets exhibit the same characteristics as production data, including Benford's Law compliance, temporal patterns, and document flow integrity.


Key Features

Core Data Generation

  • Statistical Distributions: Line item counts, amounts, and patterns based on empirical GL research
  • Mixture Models: Gaussian and log-normal mixture distributions with weighted components
  • Copula Correlations: Cross-field dependencies via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas
  • Benford's Law Compliance: First- and second-digit distributions following Benford's Law, with anomaly injection (see the sketch after this list)
  • Regime Changes: Economic cycles, acquisition effects, and structural breaks in time series
  • Industry Presets: Manufacturing, Retail, Financial Services, Healthcare, Technology, and more
  • Chart of Accounts: Small (~100), Medium (~400), and Large (~2,500) account structures
  • Temporal Patterns: Month-end, quarter-end, and year-end volume spikes with working-hour modeling
  • Regional Calendars: Holiday calendars for US, DE, GB, CN, JP, and IN, with lunar calendar support
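
The Benford target above can be checked with a few lines of Python. This is a generic illustration of the first-digit test (mean absolute deviation against the Benford expectation), not code from the toolkit; the threshold_mad value of 0.015 used in the validation config later in this README is a common cut-off for close conformity.

import math
from collections import Counter

def first_digit(x):
    # Leading significant digit of a non-zero amount.
    x = abs(x)
    while x >= 10:
        x /= 10
    while 0 < x < 1:
        x *= 10
    return int(x)

def benford_first_digit_mad(amounts):
    # Mean absolute deviation between observed first-digit frequencies
    # and the Benford expectation P(d) = log10(1 + 1/d).
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    observed = Counter(first_digit(a) for a in amounts if a)
    n = sum(observed.values())
    return sum(abs(observed.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9

amounts = [523.40, 1299.00, 87.25, 3410.00, 19.99, 742.10]
print(round(benford_first_digit_mad(amounts), 4))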

Enterprise Simulation

  • Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
  • Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes
  • Source-to-Contract (S2C): Spend analysis → sourcing projects → supplier qualification → RFx → bids → evaluation → contracts → catalogs → scorecards
  • Hire-to-Retire (H2R): Payroll runs with tax/deduction calculations, time & attendance tracking, expense report management
  • Manufacturing: Production orders with BOM explosion, routing operations, WIP costing, quality inspections, cycle counting
  • Financial Reporting: Balance sheet, income statement, cash flow statement, changes in equity with BS equation enforcement
  • Sales Quotes: Quote-to-order pipeline with win rate modeling and pricing negotiation
  • Management KPIs & Budgets: Financial ratio computation (liquidity, profitability, efficiency, leverage) and budget variance analysis
  • Revenue Recognition: ASC 606/IFRS 15 contract generation with performance obligations and standalone selling price allocation
  • Impairment Testing: Asset impairment workflow with fair value estimation and journal entry generation
  • Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
  • Balance Coherence: Opening balances, running balance tracking, trial balance generation
  • Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
  • Currency & FX: Realistic exchange rates, currency translation, CTA generation
  • Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
  • Bank Reconciliation: Automated statement matching, outstanding checks, deposits in transit, net difference validation
  • Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering)
  • Process Mining: OCEL 2.0 and XES 2.0 event logs with object-centric relationships across 8 process families
    • OCEL 2.0 JSON/XML export for object-centric process mining
    • XES 2.0 XML export for ProM, Celonis, Disco, pm4py compatibility
    • 88 activity types across 8 process families: P2P, O2C, R2R/A2R, S2C, H2R, MFG, BANK, AUDIT
    • 52 object types with lifecycle states and relationships
    • 6 new OCPM generators: S2C (sourcing), H2R (payroll/time/expense), MFG (production/quality), BANK (customer/transactions), AUDIT (engagement lifecycle), Bank Recon (statement matching)
    • Three variant types per generator: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%)
  • Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments
  • COSO 2013 Framework: Full internal control framework with 5 components, 17 principles, and maturity levels
  • Accounting Standards: US GAAP and IFRS support with ASC 606/IFRS 15 (revenue), ASC 842/IFRS 16 (leases with 5 bright-line tests), ASC 820/IFRS 13 (fair value), ASC 360/IAS 36 (impairment)
  • Audit Standards: ISA (34 standards), PCAOB (19+ standards), SOX 302/404 compliance with deficiency classification

Interconnectivity & Relationships

  • Multi-Tier Vendor Networks: Tier1/Tier2/Tier3 supply chain modeling with parent-child hierarchies
  • Vendor Clusters: ReliableStrategic, StandardOperational, Transactional, Problematic behavioral segmentation
  • Customer Value Segmentation: Enterprise/MidMarket/SMB/Consumer with Pareto-like revenue distribution
  • Customer Lifecycle: Prospect, New, Growth, Mature, AtRisk, Churned, WonBack stages
  • Relationship Strength: Composite scoring from volume, count, duration, recency, and mutual connections
  • Cross-Process Links: P2P↔O2C linkage via inventory (GoodsReceipt connects to Delivery)
  • Entity Graphs: 16 entity types, 26 relationship types with graph metrics (connectivity, clustering, power law)
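
The composite relationship-strength score listed above combines five weighted factors; the weights match the relationship_strength block in the configuration example below. A minimal sketch, assuming each factor has already been normalized to the [0, 1] range (the generator's actual normalization is not shown here):

def relationship_strength(volume, count, duration, recency, mutual,
                          weights=(0.30, 0.25, 0.20, 0.15, 0.10)):
    # Weighted sum of factors assumed to be pre-scaled to [0, 1].
    return sum(w * f for w, f in zip(weights, (volume, count, duration, recency, mutual)))

# A high-volume, long-standing counterparty with few mutual connections:
print(relationship_strength(volume=0.9, count=0.7, duration=0.8, recency=0.5, mutual=0.2))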

Pattern & Process Drift

  • Organizational Events: Acquisitions (volume multipliers, integration errors), divestitures, mergers, reorganizations
  • Process Evolution: S-curve automation rollout, workflow changes, policy updates, control enhancements
  • Technology Transitions: ERP migrations with phased rollout (parallel run, cutover, stabilization, hypercare)
  • Behavioral Drift: Vendor payment term extensions, customer payment delays, employee learning curves
  • Market Drift: Economic cycles (sinusoidal, asymmetric, mean-reverting), commodity price shocks, recession modeling
  • Regulatory Events: Accounting standard adoptions, tax rate changes, compliance requirement impacts
  • Drift Detection Ground Truth: Labeled drift events with magnitude and detection difficulty for ML training

Fraud Patterns & Industry-Specific Features

  • ACFE-Aligned Fraud Taxonomy: Fraud classification based on ACFE Report to the Nations statistics
    • Asset Misappropriation (86% of cases): Cash fraud, billing schemes, expense reimbursement, payroll fraud
    • Corruption (33% of cases): Conflicts of interest, bribery, kickbacks, bid rigging
    • Financial Statement Fraud (10% of cases): Revenue manipulation, expense timing, improper disclosures
  • Collusion & Conspiracy Modeling: Multi-party fraud networks with coordinated schemes
    • 9 ring types (EmployeePair, DepartmentRing, EmployeeVendor, VendorRing, etc.)
    • Role-based conspirators (Initiator, Executor, Approver, Concealer, Lookout, Beneficiary)
    • Defection and escalation modeling based on detection risk
  • Management Override Patterns: Senior-level fraud with override techniques and fraud triangle modeling
  • Red Flag Generation: 40+ probabilistic fraud indicators with calibrated Bayesian probabilities
  • Industry-Specific Transactions: Authentic transaction modeling per industry
    • Manufacturing: Work orders, BOM, routings, production variances, WIP tracking
    • Retail: POS sales, returns, inventory, promotions, shrinkage tracking
    • Healthcare: Revenue cycle, charge capture, claims, ICD-10/CPT/DRG coding
    • Technology: License revenue, subscription billing, R&D capitalization
    • Financial Services: Loan origination, trading, customer deposits
    • Professional Services: Time & billing, engagement management, trust accounts
  • Industry-Specific Anomalies: Authentic fraud patterns per industry
    • Manufacturing: Yield manipulation, phantom production, obsolete inventory concealment
    • Retail: Sweethearting, skimming, refund fraud, receiving fraud
    • Healthcare: Upcoding, unbundling, phantom billing, physician kickbacks
  • ACFE-Calibrated Benchmarks: ML evaluation benchmarks aligned with ACFE statistics

Machine Learning & Analytics

  • Graph Export: PyTorch Geometric, Neo4j, DGL, RustGraph, and RustGraph Hypergraph formats with train/val/test splits
  • Multi-Layer Hypergraph: 3-layer hypergraph (Governance, Process Events, Accounting Network) spanning all 8 process families with OCPM events as hyperedges, 24 entity type codes (100-400), and cross-process edge linking
  • Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
  • Data Quality Variations: Missing values, format variations, duplicates, typos
  • Relationship Generation: Configurable entity relationships with cardinality rules
  • Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry
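
The hypergraph export is written as JSON Lines files (see the hypergraph/ directory under Output Structure below). A minimal loader sketch that assumes only the default file layout and makes no assumptions about the per-record schema:

import json
from pathlib import Path

def read_jsonl(path):
    # One JSON object per line (nodes.jsonl, edges.jsonl, hyperedges.jsonl).
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

hyper_dir = Path("output/graphs/hypergraph")
nodes = list(read_jsonl(hyper_dir / "nodes.jsonl"))
hyperedges = list(read_jsonl(hyper_dir / "hyperedges.jsonl"))
print(len(nodes), "nodes,", len(hyperedges), "hyperedges")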

Privacy-Preserving Fingerprinting

  • Fingerprint Extraction: Extract statistical properties from real data into .dsf files
  • Differential Privacy: Laplace mechanism with configurable epsilon budget
  • Formal DP Composition: Rényi DP and zCDP accounting with tighter composition bounds
  • K-Anonymity: Suppression of rare categorical values
  • Custom Privacy Levels: Configurable (ε, δ) tuples with preset levels (minimal, standard, high, maximum)
  • Privacy Budget Management: Global budget tracking across multiple extraction runs
  • Privacy Audit Trail: Complete logging of all privacy decisions with composition metadata
  • Fidelity Evaluation: Wasserstein-1, Jensen-Shannon divergence, and KS statistics per column
  • Privacy Evaluation: Membership inference attack (MIA) testing, linkage attack assessment, NIST SP 800-226 alignment, SynQP matrix
  • Federated Fingerprinting: Extract partial fingerprints from distributed data sources and aggregate without centralizing raw data (weighted average, median, trimmed mean)
  • Synthetic Data Certificates: Cryptographic attestation of DP guarantees and quality metrics with HMAC-SHA256 signing and verification
  • Pareto Privacy-Utility Frontier: Explore and navigate the optimal tradeoff between privacy (epsilon) and data utility
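
For orientation, the Laplace mechanism behind the differential-privacy noise works as sketched below: noise is drawn with scale equal to the query's sensitivity divided by epsilon. This is a textbook sketch of a DP mean under an assumed value range, not the crate's implementation or its composition accounting.

import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values, lower, upper, epsilon):
    # Clamp each value to [lower, upper], then add Laplace noise scaled to the
    # sensitivity of the clamped mean, (upper - lower) / n, divided by epsilon.
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    return sum(clamped) / n + laplace_noise((upper - lower) / (n * epsilon))

print(dp_mean([120.0, 85.5, 410.0, 92.3], lower=0.0, upper=500.0, epsilon=1.0))

Smaller epsilon values (as in the high and maximum privacy levels) mean proportionally larger noise for the same clamping bounds.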

LLM-Augmented Generation

  • Provider Abstraction: Pluggable LlmProvider trait with mock (deterministic) and HTTP (OpenAI-compatible) backends
  • Metadata Enrichment: LLM-generated vendor names, transaction descriptions, memo fields, and anomaly explanations
  • Natural Language Configuration: Generate YAML configs from plain English (e.g., "1 year of retail data for a German company")
  • Response Caching: In-memory LRU cache keyed by prompt hash for deduplication
  • Graceful Fallback: All enrichment falls back to template-based generation when LLM is disabled or unavailable

Diffusion Model Integration

  • Backend Trait: Extensible DiffusionBackend with forward (noise) and reverse (denoise) processes
  • Noise Schedules: Linear, cosine, and sigmoid schedules with precomputed alpha/beta values
  • Statistical Diffusion: Pure-Rust Langevin-inspired reverse process guided by fingerprint statistics (no ML framework dependency)
  • Hybrid Generation: Blend rule-based and diffusion outputs via interpolation, selection, or per-column ensemble strategies
  • Training Pipeline: Fit diffusion models from column statistics, persist as JSON, evaluate with mean/std/correlation error metrics
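
As a reference point, the cosine schedule listed above is conventionally defined via a squared-cosine cumulative signal level, with per-step betas derived from consecutive ratios. The sketch below follows that convention; the crate's precomputed values and parameterization may differ.

import math

def alpha_bar(t, steps, s=0.008):
    # Cumulative signal level under the cosine schedule.
    return math.cos((t / steps + s) / (1 + s) * math.pi / 2) ** 2

def cosine_betas(steps, max_beta=0.999):
    # Per-step noise variances from consecutive alpha_bar ratios, clipped for stability.
    return [min(1 - alpha_bar(t, steps) / alpha_bar(t - 1, steps), max_beta)
            for t in range(1, steps + 1)]

betas = cosine_betas(100)          # e.g. n_steps: 100, as in the config example below
print(round(betas[0], 6), round(betas[-1], 4))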

Causal & Counterfactual Generation

  • Causal Graphs: Directed acyclic graphs with linear, threshold, polynomial, and logistic mechanisms
  • Structural Causal Models: Generate samples respecting causal structure via topological traversal
  • do-Calculus Interventions: Fix variables to specific values and measure average treatment effects with confidence intervals
  • Counterfactual Generation: Abduction-action-prediction framework for "what-if" scenario analysis
  • Causal Validation: Verify edge correlations, non-edge weakness, and topological consistency
  • Built-in Templates: Pre-configured fraud detection and revenue cycle causal models
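
To make the do-calculus workflow concrete, here is a toy structural causal model with a single chain (pressure → override_count → fraud_amount) and one intervention. It is illustrative only and unrelated to the built-in fraud_detection template.

import random

def sample_scm(n, do=None):
    # Linear SCM sampled in topological order; `do` fixes a variable to a
    # constant, severing its incoming edges (the do-operator).
    do = do or {}
    rows = []
    for _ in range(n):
        pressure = random.gauss(0.0, 1.0)
        override = do.get("override_count", 0.8 * pressure + random.gauss(0.0, 0.5))
        amount = 1000.0 * max(override, 0.0) + random.gauss(0.0, 100.0)
        rows.append({"pressure": pressure, "override_count": override, "fraud_amount": amount})
    return rows

baseline = sample_scm(5000)
intervened = sample_scm(5000, do={"override_count": 2.0})
mean = lambda rows: sum(r["fraud_amount"] for r in rows) / len(rows)
# Shift in mean fraud_amount under do(override_count = 2.0) vs. the observational run.
print(round(mean(intervened) - mean(baseline), 1))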

Ecosystem Integrations

  • Apache Airflow: DataSynthOperator, DataSynthSensor, and DataSynthValidateOperator for DAG-based orchestration
  • dbt: Source YAML generation, seed export, and project scaffolding from DataSynth output
  • MLflow: Track generation runs as experiments with parameters, metrics, and artifact logging
  • Apache Spark: Read DataSynth output as Spark DataFrames with schema inference and temp view registration

Production Features

  • REST & gRPC APIs: Streaming generation with Argon2id authentication and rate limiting
  • JWT/OIDC Authentication: RS256 JWT validation with Keycloak, Auth0, and Entra ID support (feature-gated)
  • Role-Based Access Control: Admin/Operator/Viewer roles with 7 permission types and structured JSON audit logging
  • gRPC Auth Interceptor: Bearer token validation for gRPC endpoints with API versioning headers
  • Quality Gates: Configurable pass/fail thresholds (strict/default/lenient) with 8 metrics and CLI enforcement
  • Plugin SDK: Extensible GeneratorPlugin, SinkPlugin, TransformPlugin traits with thread-safe registry
  • Webhook Notifications: Fire-and-forget event dispatch for RunStarted, RunCompleted, RunFailed, GateViolation
  • EU AI Act Compliance: Article 50 synthetic content marking and Article 10 data governance reports
  • Compliance Documentation: SOC 2 Type II readiness, ISO 27001 Annex A alignment, NIST AI RMF, GDPR templates
  • Async Job Queue: Submit/poll/cancel pattern for long-running generation jobs
  • Security Hardening: Security headers, request validation, request ID propagation, timing-safe auth
  • TLS Support: Native rustls TLS or reverse proxy (nginx/envoy) with documented configuration
  • OpenTelemetry: Feature-gated OTEL integration with OTLP traces and Prometheus metrics
  • Structured Logging: JSON-formatted logs with request IDs, method, path, status, and latency
  • Docker & Compose: Multi-stage distroless containers, local dev stack with Prometheus + Grafana
  • Kubernetes Helm Chart: Production-ready chart with HPA, PDB, optional Redis subchart, and Prometheus ServiceMonitor
  • CI/CD Pipeline: 7-job GitHub Actions (fmt, clippy, cross-platform test, MSRV, security, coverage, benchmarks)
  • Release Automation: Binary builds for 5 platforms, GHCR container publishing, Trivy scanning
  • Data Lineage & Provenance: Per-file checksums, lineage graph, W3C PROV-JSON export, CLI verify command
  • Distributed Rate Limiting: Redis-backed sliding window rate limiting for multi-instance deployments
  • Streaming Output API: Async generation with backpressure handling (Block, DropOldest, DropNewest, Buffer)
  • Rate Limiting: Token bucket rate limiter for controlled generation throughput
  • Load Testing: k6 scripts for health, bulk generation, WebSocket, job queue, and soak testing
  • Temporal Attributes: Bi-temporal data support (valid time + transaction time) with version chains
  • Desktop UI: Cross-platform Tauri/SvelteKit application
  • Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
  • Panic-Free Library Crates: #![deny(clippy::unwrap_used)] enforced across all library crates
  • Fuzzing: cargo-fuzz targets for config parsing, fingerprint loading, and validation
  • Evaluation Framework: Auto-tuning with quality gate enforcement and configuration recommendations
  • Deterministic Generation: Seeded RNG for reproducible output
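
The token bucket rate limiter listed under Production Features admits bursts up to a fixed size while enforcing an average rate. Below is a minimal single-threaded sketch (the server's implementation is concurrent and not shown here); the rate and burst values mirror the rate_limit example in the configuration section below.

import time

class TokenBucket:
    # Admits up to `burst` operations instantly and refills at `rate` tokens/second.
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=10_000, burst=100)
admitted = sum(bucket.try_acquire() for _ in range(500))
print(admitted, "of 500 requests admitted immediately")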

Architecture

SyntheticData is organized as a Rust workspace with 16 modular crates:

datasynth-cli          Command-line interface (binary: datasynth-data)
datasynth-server       REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui           Tauri/SvelteKit desktop application
    │
datasynth-runtime      Orchestration layer (parallel execution, resource guards)
    │
datasynth-generators   Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking      KYC/AML banking transaction generator
datasynth-ocpm         Object-Centric Process Mining (OCEL 2.0, XES 2.0, 8 process families)
datasynth-fingerprint  Privacy-preserving fingerprint extraction and synthesis
datasynth-standards    Accounting/audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
    │
datasynth-graph        Graph/network export (PyTorch Geometric, Neo4j, DGL, RustGraph Multi-Layer Hypergraph)
datasynth-eval         Evaluation framework with auto-tuning
    │
datasynth-config       Configuration schema, validation, industry presets
    │
datasynth-core         Domain models, traits, distributions, resource guards
    │
datasynth-output       Output sinks (CSV, JSON, NDJSON, Parquet/Zstd) with streaming support
datasynth-test-utils   Test utilities, fixtures, mocks

See individual crate READMEs for detailed documentation.


Installation

From crates.io

# Install the CLI tool
cargo install datasynth-cli

# Or add individual crates to your project
cargo add datasynth-core datasynth-generators datasynth-config

From Source

git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release

The binary is available at target/release/datasynth-data.

Available Crates

  • datasynth-core: Domain models, traits, distributions
  • datasynth-config: Configuration schema and validation
  • datasynth-generators: Data generators
  • datasynth-banking: KYC/AML banking transactions
  • datasynth-fingerprint: Privacy-preserving fingerprint extraction
  • datasynth-standards: Accounting/audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
  • datasynth-graph: Graph/network export
  • datasynth-eval: Evaluation framework
  • datasynth-runtime: Orchestration layer
  • datasynth-cli: Command-line interface
  • datasynth-server: REST/gRPC server

Requirements

  • Rust 1.88 or later
  • For the desktop UI: Node.js 18+ and platform-specific Tauri dependencies

Quick Start

# Generate a configuration file for a manufacturing company
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Validate the configuration
datasynth-data validate --config config.yaml

# Generate synthetic data
datasynth-data generate --config config.yaml --output ./output

# View available presets and options
datasynth-data info

Demo Mode

# Quick demo with default settings
datasynth-data generate --demo --output ./demo-output

# Generate with graph export for ML training
datasynth-data generate --demo --output ./demo-output --graph-export

Configuration

SyntheticData uses YAML configuration files with comprehensive options:

global:
  seed: 42                        # For reproducible generation
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 1.0            # Transaction volume weight

transactions:
  target_count: 100000
  benford:
    enabled: true

fraud:
  enabled: true
  fraud_rate: 0.005               # 0.5% fraud rate

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true           # For supervised learning

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j
    - rustgraph               # RustGraph/RustAssureTwin compatible JSON
    - rustgraph_hypergraph    # 3-layer hypergraph JSONL for RustGraph
  hypergraph:
    enabled: true
    max_nodes: 50000
    aggregation_strategy: pool_by_counterparty

streaming:
  enabled: true
  buffer_size: 1000
  backpressure: block         # block, drop_oldest, drop_newest, buffer

rate_limit:
  enabled: true
  entities_per_second: 10000
  burst_size: 100

distributions:
  enabled: true
  industry_profile: retail        # retail, manufacturing, financial_services
  amounts:
    enabled: true
    distribution_type: lognormal
    components:
      - { weight: 0.60, mu: 6.0, sigma: 1.5, label: "routine" }
      - { weight: 0.30, mu: 8.5, sigma: 1.0, label: "significant" }
      - { weight: 0.10, mu: 11.0, sigma: 0.8, label: "major" }
    benford_compliance: true
  correlations:
    enabled: true
    copula_type: gaussian         # gaussian, clayton, gumbel, frank, student_t
    fields: [amount, line_items, approval_level]
    matrix:
      - [1.00, 0.65, 0.72]
      - [0.65, 1.00, 0.55]
      - [0.72, 0.55, 1.00]
  regime_changes:
    enabled: true
    economic_cycle:
      enabled: true
      cycle_period_months: 48
      amplitude: 0.15
      recession_probability: 0.1
  validation:
    enabled: true
    tests:
      - { type: benford_first_digit, threshold_mad: 0.015 }
      - { type: distribution_fit, target: lognormal, significance: 0.05 }
      - { type: correlation_check, significance: 0.05 }

accounting_standards:
  enabled: true
  framework: us_gaap              # us_gaap, ifrs, dual_reporting
  revenue_recognition:
    enabled: true
    generate_contracts: true
  leases:
    enabled: true
    finance_lease_percent: 0.30

audit_standards:
  enabled: true
  isa_compliance:
    enabled: true
    compliance_level: comprehensive
    framework: dual               # isa, pcaob, dual
  sox:
    enabled: true
    materiality_threshold: 10000.0

# Enterprise Process Chain Extensions (v0.6.0)
source_to_pay:
  enabled: true
  sourcing:
    projects_per_year: 20
  qualification:
    pass_rate: 0.80
  rfx:
    invited_vendors_min: 3
    invited_vendors_max: 8
  contracts:
    duration_months_min: 12
    duration_months_max: 36
  scorecards:
    frequency: quarterly

financial_reporting:
  enabled: true
  generate_balance_sheet: true
  generate_income_statement: true
  generate_cash_flow: true
  management_kpis:
    enabled: true
    frequency: monthly
  budgets:
    enabled: true
    revenue_growth_rate: 0.05

hr:
  enabled: true
  payroll:
    enabled: true
    pay_frequency: monthly
  time_attendance:
    enabled: true
    overtime_rate: 0.10
  expenses:
    enabled: true
    submission_rate: 0.30

manufacturing:
  enabled: true
  production_orders:
    orders_per_month: 50
    yield_rate: 0.97
  costing:
    labor_rate_per_hour: 35.00
    overhead_rate: 1.50

sales_quotes:
  enabled: true
  quotes_per_month: 30
  win_rate: 0.35
  validity_days: 30

vendor_network:
  enabled: true
  depth: 3                          # Tier1/Tier2/Tier3
  clusters:
    reliable_strategic: 0.20
    standard_operational: 0.50
    transactional: 0.25
    problematic: 0.05
  dependencies:
    max_single_vendor_concentration: 0.15
    top_5_concentration: 0.45

customer_segmentation:
  enabled: true
  value_segments:
    enterprise: { revenue_share: 0.40, customer_share: 0.05 }
    mid_market: { revenue_share: 0.35, customer_share: 0.20 }
    smb: { revenue_share: 0.20, customer_share: 0.50 }
    consumer: { revenue_share: 0.05, customer_share: 0.25 }

relationship_strength:
  enabled: true
  calculation:
    transaction_volume_weight: 0.30
    transaction_count_weight: 0.25
    relationship_duration_weight: 0.20
    recency_weight: 0.15
    mutual_connections_weight: 0.10

ocpm:
  enabled: true
  generate_lifecycle_events: true
  compute_variants: true
  output:
    ocel_json: true               # OCEL 2.0 JSON format
    ocel_xml: false               # OCEL 2.0 XML format
    xes: true                     # XES 2.0 for ProM/Celonis/Disco
    xes_include_lifecycle: true   # Include start/complete transitions
    xes_include_resources: true   # Include resource attributes
    export_reference_models: true # Export P2P/O2C/R2R reference models

llm:
  enabled: true
  provider: mock                    # mock, openai, anthropic, custom
  enrichment:
    vendor_names: true
    transaction_descriptions: true
    anomaly_explanations: true

diffusion:
  enabled: true
  n_steps: 100
  schedule: cosine                  # linear, cosine, sigmoid
  sample_size: 100

causal:
  enabled: true
  template: fraud_detection         # fraud_detection, revenue_cycle
  sample_size: 500
  validate: true

output:
  format: csv                       # csv, json, parquet
  compression: none                 # none, gzip, zstd (parquet uses zstd by default)

See the Configuration Guide for complete documentation.
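
As a sanity check on the distributions.amounts block above, the weighted log-normal mixture can be simulated with a few lines of Python. The component values are copied from the example config; the generator's own sampling (including Benford adjustment and copula correlation) is more involved.

import random

COMPONENTS = [                                    # mirrors distributions.amounts.components
    {"weight": 0.60, "mu": 6.0,  "sigma": 1.5},   # routine
    {"weight": 0.30, "mu": 8.5,  "sigma": 1.0},   # significant
    {"weight": 0.10, "mu": 11.0, "sigma": 0.8},   # major
]

def sample_amount():
    # Pick a component by weight, then draw a log-normal amount from it.
    comp = random.choices(COMPONENTS, weights=[c["weight"] for c in COMPONENTS])[0]
    return random.lognormvariate(comp["mu"], comp["sigma"])

print([round(sample_amount(), 2) for _ in range(5)])
# Routine amounts cluster around e^6 ≈ 400, significant around e^8.5 ≈ 5,000,
# and major around e^11 ≈ 60,000 in the configured currency units.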


Output Structure

output/
├── master_data/          Vendors, customers, materials, assets, employees
├── transactions/         Journal entries, purchase orders, invoices, payments
├── sourcing/             S2C sourcing pipeline outputs
│   ├── sourcing_projects.csv
│   ├── supplier_qualifications.csv
│   ├── rfx_events.csv
│   ├── supplier_bids.csv
│   ├── bid_evaluations.csv
│   ├── procurement_contracts.csv
│   ├── catalog_items.csv
│   └── supplier_scorecards.csv
├── subledgers/           AR, AP, FA, inventory detail records
├── hr/                   HR & payroll outputs
│   ├── payroll_runs.csv
│   ├── payslips.csv
│   ├── time_entries.csv
│   └── expense_reports.csv
├── manufacturing/        Production & quality outputs
│   ├── production_orders.csv
│   ├── routing_operations.csv
│   ├── quality_inspection_lots.csv
│   └── cycle_count_records.csv
├── period_close/         Trial balances, accruals, closing entries
├── financial_reporting/  Financial statements & management reporting
│   ├── balance_sheet.csv
│   ├── income_statement.csv
│   ├── cash_flow_statement.csv
│   ├── changes_in_equity.csv
│   ├── financial_kpis.csv
│   └── budget_variance.csv
├── sales/                Sales pipeline outputs
│   ├── sales_quotes.csv
│   └── sales_quote_items.csv
├── consolidation/        Eliminations, currency translation
├── fx/                   Exchange rates, CTA adjustments
├── banking/              KYC profiles, bank transactions, AML typology labels
│   ├── bank_statement_lines.csv
│   ├── bank_reconciliations.csv
│   └── reconciling_items.csv
├── process_mining/       Event logs and process models
│   ├── event_log.json    OCEL 2.0 JSON format
│   ├── event_log.xes     XES 2.0 XML format (for ProM, Celonis, Disco)
│   ├── process_variants/ Discovered process variants
│   └── reference_models/ Canonical P2P, O2C, R2R process models
├── audit/                Engagements, workpapers, findings, risk assessments
├── graphs/               PyTorch Geometric, Neo4j, DGL, RustGraph exports
│   └── hypergraph/       Multi-layer hypergraph (nodes.jsonl, edges.jsonl, hyperedges.jsonl)
├── labels/               Anomaly, fraud, and data quality labels for ML
├── controls/             Internal controls, COSO mappings, SoD rules
└── standards/            Accounting & audit standards outputs
    ├── accounting/       Contracts, leases, fair value, impairment tests
    └── audit/            ISA mappings, confirmations, opinions, SOX assessments
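
The XES export is intended for standard process mining tools. For example, it can typically be loaded into pm4py for discovery and replay; the calls below use pm4py's simplified interface, which may vary between versions.

import pm4py

# Load the exported XES 2.0 event log (path per the output tree above).
log = pm4py.read_xes("output/process_mining/event_log.xes")

# Discover a Petri net with the inductive miner and measure token-based replay fitness.
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)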

Use Cases

  • Fraud Detection ML: Train supervised models with labeled fraud patterns
  • Graph Neural Networks: Entity relationship graphs for anomaly detection
  • AML/KYC Testing: Banking transaction data with structuring, layering, and mule patterns
  • Audit Analytics: Test audit procedures against known control exceptions
  • Process Mining: OCEL 2.0 and XES 2.0 event logs for process discovery and conformance checking
  • Conformance Checking: Reference process models (P2P, O2C, R2R) for process validation
  • ERP Testing: Load testing with realistic transaction volumes
  • Procurement Analytics: Source-to-contract pipeline with spend analysis, RFx, bids, and supplier scorecards
  • HR & Payroll Testing: Payroll runs, time tracking, and expense management with policy compliance
  • Manufacturing Simulation: Production orders, BOM explosion, WIP costing, quality inspections
  • Financial Reporting: Balance sheet, income statement, cash flow, KPIs, and budget variance
  • Bank Reconciliation: Statement matching, outstanding items, net difference validation
  • SOX Compliance: Test internal control monitoring systems
  • COSO Framework: COSO 2013 control mapping with 5 components, 17 principles, and maturity levels
  • Standards Compliance: IFRS/US GAAP revenue recognition, lease accounting, fair value, impairment testing
  • Audit Standards: ISA/PCAOB procedure mapping, analytical procedures, confirmations, audit opinions
  • Data Quality ML: Train models to detect missing values, typos, duplicates
  • RustGraph Integration: Stream data directly to RustAssureTwin knowledge graphs
  • Hypergraph Analytics: 3-layer hypergraph export (Governance, Process, Accounting) for multi-relational GNN models
  • Causal Analysis: Generate interventional and counterfactual datasets for causal ML research
  • LLM Training Data: LLM-enriched metadata with realistic vendor names, descriptions, and explanations
  • Pipeline Orchestration: Airflow operators, dbt sources, MLflow tracking, Spark DataFrames

Performance

  • Single-threaded throughput: over 100,000 entries/second
  • Parallel scaling: linear with available cores
  • Memory efficiency: streaming generation for large volumes

Server Usage

# Start REST + gRPC server
cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051 --worker-threads 4

# With API key authentication
cargo run -p datasynth-server -- --api-keys "key1,key2"

# With JWT/OIDC authentication (requires jwt feature)
cargo run -p datasynth-server --features jwt -- \
  --jwt-issuer "https://auth.example.com" \
  --jwt-audience "datasynth-api" \
  --jwt-public-key /path/to/public.pem

# With RBAC and audit logging
cargo run -p datasynth-server -- --api-keys "key1" --rbac-enabled --audit-log

# With TLS (requires tls feature)
cargo run -p datasynth-server --features tls -- --tls-cert cert.pem --tls-key key.pem

# API endpoints
curl http://localhost:3000/health              # Health check
curl http://localhost:3000/ready               # Readiness probe (config + memory + disk)
curl http://localhost:3000/metrics             # Prometheus metrics
curl -H "Authorization: Bearer <key>" http://localhost:3000/api/config
curl -H "Authorization: Bearer <key>" -X POST http://localhost:3000/api/stream/start

WebSocket streaming available at ws://localhost:3000/ws/events.

Docker

# Build and run the server
docker build -t datasynth:latest .
docker run -p 50051:50051 -p 3000:3000 datasynth:latest

# Or use Docker Compose for full stack (server + Prometheus + Grafana)
docker compose up -d
# REST API: http://localhost:3000 | gRPC: localhost:50051
# Prometheus: http://localhost:9090 | Grafana: http://localhost:3001

See the Deployment Guide for Docker, SystemD, and reverse proxy setup.


Desktop UI

cd crates/datasynth-ui
npm install
npm run tauri dev

The desktop application provides visual configuration, real-time streaming, and preset management.

Features:

  • 40+ config pages with form controls for every generation parameter
  • Info cards on feature pages explaining capabilities before enabling
  • Sidebar navigation with collapsible sections and scroll indicator for 10 section groups
  • Web preview mode — run npm run dev for config editing without Tauri; dashboard requires npm run tauri dev
  • Visual regression testing — 56 Playwright screenshot baselines for UI consistency

Fingerprinting

Extract privacy-preserving fingerprints from real data and generate matching synthetic data:

# Extract fingerprint from CSV data
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Validate fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf

# Show fingerprint info
datasynth-data fingerprint info ./fingerprint.dsf --detailed

# Compare fingerprints
datasynth-data fingerprint diff ./fp1.dsf ./fp2.dsf

# Evaluate synthetic data fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.8
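
The evaluate command reports per-column fidelity metrics such as Wasserstein-1, Jensen-Shannon divergence, and KS statistics. For intuition, the same metrics can be computed offline for a single numeric column with NumPy/SciPy (stand-in data below; this is not how the crate computes them internally):

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance

real = np.random.lognormal(6.0, 1.5, size=10_000)        # stand-in for a real column
synthetic = np.random.lognormal(6.1, 1.4, size=10_000)   # stand-in for synthetic output

# Wasserstein-1 and Kolmogorov-Smirnov operate directly on the samples.
w1 = wasserstein_distance(real, synthetic)
ks_stat, _ = ks_2samp(real, synthetic)

# Jensen-Shannon divergence needs a shared histogram binning.
bins = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=50)
p, _ = np.histogram(real, bins=bins, density=True)
q, _ = np.histogram(synthetic, bins=bins, density=True)
js = jensenshannon(p, q) ** 2    # jensenshannon returns the distance (square root of the divergence)

print(f"W1={w1:.2f}  KS={ks_stat:.3f}  JS={js:.4f}")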

Privacy Levels:

  • minimal: ε = 5.0, k = 3 (low privacy, high utility)
  • standard: ε = 1.0, k = 5 (balanced; default)
  • high: ε = 0.5, k = 10 (higher privacy)
  • maximum: ε = 0.1, k = 20 (maximum privacy)

See the Fingerprinting Guide for complete documentation.


Python Wrapper

A Python wrapper (v1.0.0) is available for programmatic access:

cd python
pip install -e ".[all]"    # Includes pandas, polars, jupyter, streaming

from datasynth_py import DataSynth, AsyncDataSynth
from datasynth_py import to_pandas, to_polars, list_tables
from datasynth_py.config import blueprints

# Basic generation
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)

# DataFrame loading
tables = list_tables(result)          # ['journal_entries', 'vendors', ...]
df = to_pandas(result, "journal_entries")
pl_df = to_polars(result, "vendors")

# Async generation
async with AsyncDataSynth() as synth:
    result = await synth.generate(config=config)

# Fingerprint operations
synth.fingerprint.extract("./real_data/", "./fingerprint.dsf", privacy_level="standard")
report = synth.fingerprint.evaluate("./fingerprint.dsf", "./synthetic/")
print(f"Fidelity score: {report.overall_score}")

Optional dependencies: [pandas], [polars], [jupyter], [streaming], [airflow], [dbt], [mlflow], [spark], [all].

Ecosystem Integrations:

from datasynth_py.config import blueprints

# LLM-enriched generation
config = blueprints.with_llm_enrichment(provider="mock")

# Diffusion-enhanced generation
config = blueprints.with_diffusion(schedule="cosine", hybrid_weight=0.3)

# Causal data generation
config = blueprints.with_causal(template="fraud_detection")

# Airflow operator
from datasynth_py.integrations.airflow import DataSynthOperator

# dbt integration
from datasynth_py.integrations.dbt import DbtSourceGenerator

# MLflow tracking
from datasynth_py.integrations.mlflow_tracker import DataSynthMlflowTracker

# Spark connector
from datasynth_py.integrations.spark import DataSynthSparkReader

See the Python Wrapper Guide for complete documentation.


Documentation


License

Copyright 2024-2026 Michael Ivertowski, Ernst & Young Ltd., Zurich, Switzerland

Licensed under the Apache License, Version 2.0. See LICENSE for details.


Support

Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.


Acknowledgments

This project incorporates research on statistical distributions in accounting data and implements industry-standard patterns for enterprise financial systems.


SyntheticData is provided "as is" without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.
