# SyntheticData

A high-performance, configurable synthetic data generator for enterprise financial simulation. SyntheticData produces realistic, interconnected General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, ML-ready graph exports, and complete enterprise process chains (S2C sourcing, HR/payroll, manufacturing, financial reporting) at scale.

Developed by Ernst & Young Ltd., Zurich, Switzerland.
SyntheticData generates coherent enterprise financial data that mirrors the characteristics of real corporate accounting systems. The generated data is suitable for:
- Machine Learning Model Development: Training fraud detection, anomaly detection, and graph neural network models
- Audit Analytics Testing: Validating audit procedures and analytical tools with realistic data patterns
- SOX Compliance Testing: Testing internal controls and segregation of duties monitoring systems
- System Integration Testing: Load and stress testing for ERP and accounting platforms
- Process Mining: Generating realistic event logs for process discovery and conformance checking
- Training and Education: Providing realistic accounting data for professional development
The generator produces statistically realistic data grounded in empirical research on real-world general ledger patterns, ensuring that synthetic datasets exhibit the same characteristics as production data, including Benford's Law compliance, temporal patterns, and document flow integrity.
| Feature | Description |
|---|---|
| Statistical Distributions | Line item counts, amounts, and patterns based on empirical GL research |
| Mixture Models | Gaussian and Log-Normal mixture distributions with weighted components |
| Copula Correlations | Cross-field dependencies via Gaussian, Clayton, Gumbel, Frank, Student-t copulas |
| Benford's Law Compliance | First and second-digit distribution following Benford's Law with anomaly injection |
| Regime Changes | Economic cycles, acquisition effects, and structural breaks in time series |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology, and more |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Temporal Patterns | Month-end, quarter-end, year-end volume spikes with working hour modeling |
| Regional Calendars | Holiday calendars for US, DE, GB, CN, JP, IN with lunar calendar support |
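To make the distribution features above concrete, here is a minimal sketch, not the library's internals, of the core idea: draw amounts from a weighted log-normal mixture and check the first digits against Benford's Law. The weights, mu, and sigma values mirror the sample configuration later in this README.

```python
# Sketch of the mixture-model + Benford approach described above.
# Parameters mirror the sample YAML config below; this is NOT the
# crate's implementation.
import numpy as np

rng = np.random.default_rng(42)

# Three-component log-normal mixture: routine / significant / major
weights = [0.60, 0.30, 0.10]
mus     = [6.0, 8.5, 11.0]
sigmas  = [1.5, 1.0, 0.8]

n = 100_000
component = rng.choice(3, size=n, p=weights)
amounts = rng.lognormal(np.take(mus, component), np.take(sigmas, component))

# Empirical first-digit frequencies vs. Benford's Law
digits = np.array([int(str(a)[0]) for a in amounts.astype(np.int64) if a >= 1])
empirical = np.bincount(digits, minlength=10)[1:] / len(digits)
benford = np.log10(1 + 1 / np.arange(1, 10))
print(f"Benford MAD: {np.abs(empirical - benford).mean():.4f}")
```

Because the mixture spans several orders of magnitude, the first digits land close to the Benford distribution; the `threshold_mad` validation test in the sample configuration bounds exactly this statistic.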
- Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
- Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes
- Source-to-Contract (S2C): Spend analysis → sourcing projects → supplier qualification → RFx → bids → evaluation → contracts → catalogs → scorecards
- Hire-to-Retire (H2R): Payroll runs with tax/deduction calculations, time & attendance tracking, expense report management
- Manufacturing: Production orders with BOM explosion, routing operations, WIP costing, quality inspections, cycle counting
- Financial Reporting: Balance sheet, income statement, cash flow statement, changes in equity with BS equation enforcement
- Sales Quotes: Quote-to-order pipeline with win rate modeling and pricing negotiation
- Management KPIs & Budgets: Financial ratio computation (liquidity, profitability, efficiency, leverage) and budget variance analysis
- Revenue Recognition: ASC 606/IFRS 15 contract generation with performance obligations and standalone selling price allocation
- Impairment Testing: Asset impairment workflow with fair value estimation and journal entry generation
- Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
- Balance Coherence: Opening balances, running balance tracking, trial balance generation
- Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
- Currency & FX: Realistic exchange rates, currency translation, CTA generation
- Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
- Bank Reconciliation: Automated statement matching, outstanding checks, deposits in transit, net difference validation
- Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering)
- Process Mining: OCEL 2.0 and XES 2.0 event logs with object-centric relationships across 8 process families
- OCEL 2.0 JSON/XML export for object-centric process mining
- XES 2.0 XML export for ProM, Celonis, Disco, pm4py compatibility
- 88 activity types across 8 process families: P2P, O2C, R2R/A2R, S2C, H2R, MFG, BANK, AUDIT
- 52 object types with lifecycle states and relationships
- 6 new OCPM generators: S2C (sourcing), H2R (payroll/time/expense), MFG (production/quality), BANK (customer/transactions), AUDIT (engagement lifecycle), Bank Recon (statement matching)
- Three variant types per generator: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%)
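For orientation, the sketch below shows the abridged shape of an OCEL 2.0 JSON log as defined by the public OCEL 2.0 standard. Field names follow the published spec; the exporter's real output covers far more object and event types than this toy example.

```python
# Abridged OCEL 2.0 JSON structure (per the public OCEL 2.0 spec).
# A real export contains many more objects, events, and attributes.
import json

ocel = {
    "objectTypes": [{"name": "PurchaseOrder",
                     "attributes": [{"name": "amount", "type": "float"}]}],
    "eventTypes":  [{"name": "Approve Purchase Order", "attributes": []}],
    "objects": [{
        "id": "PO-0001",
        "type": "PurchaseOrder",
        "attributes": [{"name": "amount",
                        "time": "2024-01-05T09:00:00Z",
                        "value": 12500.0}],
    }],
    "events": [{
        "id": "e1",
        "type": "Approve Purchase Order",
        "time": "2024-01-05T10:30:00Z",
        "attributes": [],
        # Object-centric: one event may relate to many objects, with qualifiers
        "relationships": [{"objectId": "PO-0001", "qualifier": "approved"}],
    }],
}
print(json.dumps(ocel, indent=2))
```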
- Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments
- COSO 2013 Framework: Full internal control framework with 5 components, 17 principles, and maturity levels
- Accounting Standards: US GAAP and IFRS support with ASC 606/IFRS 15 (revenue), ASC 842/IFRS 16 (leases with 5 bright-line tests), ASC 820/IFRS 13 (fair value), ASC 360/IAS 36 (impairment)
- Audit Standards: ISA (34 standards), PCAOB (19+ standards), SOX 302/404 compliance with deficiency classification
- Multi-Tier Vendor Networks: Tier1/Tier2/Tier3 supply chain modeling with parent-child hierarchies
- Vendor Clusters: ReliableStrategic, StandardOperational, Transactional, Problematic behavioral segmentation
- Customer Value Segmentation: Enterprise/MidMarket/SMB/Consumer with Pareto-like revenue distribution (see the sketch after this group)
- Customer Lifecycle: Prospect, New, Growth, Mature, AtRisk, Churned, WonBack stages
- Relationship Strength: Composite scoring from volume, count, duration, recency, and mutual connections
- Cross-Process Links: P2P↔O2C linkage via inventory (GoodsReceipt connects to Delivery)
- Entity Graphs: 16 entity types, 26 relationship types with graph metrics (connectivity, clustering, power law)
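As referenced in the segmentation bullet above, the sketch below illustrates the Pareto-like idea using the revenue/customer shares from the sample configuration later in this README. It shows the modeling intent only, not the generator's code.

```python
# Pareto-like customer value segmentation: few customers, most revenue.
# Shares taken from the customer_segmentation block in the sample config.
import numpy as np

segments = {
    # segment: (revenue_share, customer_share)
    "enterprise": (0.40, 0.05),
    "mid_market": (0.35, 0.20),
    "smb":        (0.20, 0.50),
    "consumer":   (0.05, 0.25),
}

rng = np.random.default_rng(7)
n_customers, total_revenue = 1_000, 50_000_000.0

names = list(segments)
assigned = rng.choice(names, size=n_customers,
                      p=[segments[s][1] for s in names])

for s in names:
    rev_share, _ = segments[s]
    count = max(int((assigned == s).sum()), 1)
    print(f"{s:11s} {count:4d} customers, avg revenue "
          f"{rev_share * total_revenue / count:12,.0f}")
```

Note how roughly 5% of customers carry 40% of revenue, an order of magnitude more per customer than the SMB segment.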
- Organizational Events: Acquisitions (volume multipliers, integration errors), divestitures, mergers, reorganizations
- Process Evolution: S-curve automation rollout, workflow changes, policy updates, control enhancements
- Technology Transitions: ERP migrations with phased rollout (parallel run, cutover, stabilization, hypercare)
- Behavioral Drift: Vendor payment term extensions, customer payment delays, employee learning curves
- Market Drift: Economic cycles (sinusoidal, asymmetric, mean-reverting), commodity price shocks, recession modeling
- Regulatory Events: Accounting standard adoptions, tax rate changes, compliance requirement impacts
- Drift Detection Ground Truth: Labeled drift events with magnitude and detection difficulty for ML training
- ACFE-Aligned Fraud Taxonomy: Fraud classification based on ACFE Report to the Nations statistics (scheme categories overlap, so the case percentages below sum to more than 100%)
- Asset Misappropriation (86% of cases): Cash fraud, billing schemes, expense reimbursement, payroll fraud
- Corruption (33% of cases): Conflicts of interest, bribery, kickbacks, bid rigging
- Financial Statement Fraud (10% of cases): Revenue manipulation, expense timing, improper disclosures
- Collusion & Conspiracy Modeling: Multi-party fraud networks with coordinated schemes
- 9 ring types (EmployeePair, DepartmentRing, EmployeeVendor, VendorRing, etc.)
- Role-based conspirators (Initiator, Executor, Approver, Concealer, Lookout, Beneficiary)
- Defection and escalation modeling based on detection risk
- Management Override Patterns: Senior-level fraud with override techniques and fraud triangle modeling
- Red Flag Generation: 40+ probabilistic fraud indicators with calibrated Bayesian probabilities
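The red-flag bullet above mentions calibrated Bayesian probabilities. As an illustration of how such indicators can combine into a fraud score, the sketch below performs a naive-Bayes posterior update; the likelihood ratios are invented for this example and are not the library's calibrated values.

```python
# Naive-Bayes posterior for fraud given observed red flags.
# The per-flag likelihoods are illustrative assumptions only.
from math import prod

prior_fraud = 0.005   # matches fraud_rate in the sample config below

# flag: (P(flag | fraud), P(flag | legitimate)) -- assumed values
flags = {
    "amount_just_below_approval_limit": (0.30, 0.02),
    "weekend_posting":                  (0.25, 0.05),
    "new_vendor_rapid_payments":        (0.20, 0.01),
}

observed = ["amount_just_below_approval_limit", "new_vendor_rapid_payments"]

likelihood_ratio = prod(flags[f][0] / flags[f][1] for f in observed)
posterior_odds = likelihood_ratio * prior_fraud / (1 - prior_fraud)
print(f"P(fraud | flags) = {posterior_odds / (1 + posterior_odds):.3f}")
```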
- Industry-Specific Transactions: Authentic transaction modeling per industry
- Manufacturing: Work orders, BOM, routings, production variances, WIP tracking
- Retail: POS sales, returns, inventory, promotions, shrinkage tracking
- Healthcare: Revenue cycle, charge capture, claims, ICD-10/CPT/DRG coding
- Technology: License revenue, subscription billing, R&D capitalization
- Financial Services: Loan origination, trading, customer deposits
- Professional Services: Time & billing, engagement management, trust accounts
- Industry-Specific Anomalies: Authentic fraud patterns per industry
- Manufacturing: Yield manipulation, phantom production, obsolete inventory concealment
- Retail: Sweethearting, skimming, refund fraud, receiving fraud
- Healthcare: Upcoding, unbundling, phantom billing, physician kickbacks
- ACFE-Calibrated Benchmarks: ML evaluation benchmarks aligned with ACFE statistics
- Graph Export: PyTorch Geometric, Neo4j, DGL, RustGraph, and RustGraph Hypergraph formats with train/val/test splits
- Multi-Layer Hypergraph: 3-layer hypergraph (Governance, Process Events, Accounting Network) spanning all 8 process families with OCPM events as hyperedges, 24 entity type codes (100-400), and cross-process edge linking
- Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
- Data Quality Variations: Missing values, format variations, duplicates, typos
- Relationship Generation: Configurable entity relationships with cardinality rules
- Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry
- Fingerprint Extraction: Extract statistical properties from real data into `.dsf` files
- Differential Privacy: Laplace mechanism with configurable epsilon budget (a minimal sketch follows this group)
- Formal DP Composition: Rényi DP and zCDP accounting with tighter composition bounds
- K-Anonymity: Suppression of rare categorical values
- Custom Privacy Levels: Configurable (ε, δ) tuples with preset levels (minimal, standard, high, maximum)
- Privacy Budget Management: Global budget tracking across multiple extraction runs
- Privacy Audit Trail: Complete logging of all privacy decisions with composition metadata
- Fidelity Evaluation: Wasserstein-1, Jensen-Shannon divergence, and KS statistics per column
- Privacy Evaluation: Membership inference attack (MIA) testing, linkage attack assessment, NIST SP 800-226 alignment, SynQP matrix
- Federated Fingerprinting: Extract partial fingerprints from distributed data sources and aggregate without centralizing raw data (weighted average, median, trimmed mean)
- Synthetic Data Certificates: Cryptographic attestation of DP guarantees and quality metrics with HMAC-SHA256 signing and verification
- Pareto Privacy-Utility Frontier: Explore and navigate the optimal tradeoff between privacy (epsilon) and data utility
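As a concrete reference for the differential-privacy bullets above, here is the textbook Laplace mechanism applied to one extracted statistic, using the epsilon values from the privacy-levels table later in this README. The crate's actual machinery (Rényi DP, zCDP composition, budget tracking) goes well beyond this sketch.

```python
# Textbook Laplace mechanism for a single numeric statistic.
# Epsilon values correspond to the preset privacy levels; the
# sensitivity bound here is an assumption for illustration.
import numpy as np

def laplace_release(value: float, sensitivity: float, epsilon: float,
                    rng: np.random.Generator) -> float:
    """Release value + Laplace noise with scale b = sensitivity / epsilon."""
    return value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
mean_amount = 4_812.37   # e.g., a column mean extracted from real data
sensitivity = 50.0       # assumed bounded per-record contribution

for level, eps in [("minimal", 5.0), ("standard", 1.0),
                   ("high", 0.5), ("maximum", 0.1)]:
    noisy = laplace_release(mean_amount, sensitivity, eps, rng)
    print(f"{level:8s} eps={eps:4.1f} -> {noisy:10.2f}")
```

Smaller epsilon means a larger noise scale, which is why the "maximum" preset trades the most utility for privacy.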
- Provider Abstraction: Pluggable `LlmProvider` trait with mock (deterministic) and HTTP (OpenAI-compatible) backends
- Metadata Enrichment: LLM-generated vendor names, transaction descriptions, memo fields, and anomaly explanations
- Natural Language Configuration: Generate YAML configs from plain English (e.g., "1 year of retail data for a German company")
- Response Caching: In-memory LRU cache keyed by prompt hash for deduplication
- Graceful Fallback: All enrichment falls back to template-based generation when LLM is disabled or unavailable
- Backend Trait: Extensible `DiffusionBackend` trait with forward (noise) and reverse (denoise) processes
- Noise Schedules: Linear, cosine, and sigmoid schedules with precomputed alpha/beta values (sketched after this group)
- Statistical Diffusion: Pure-Rust Langevin-inspired reverse process guided by fingerprint statistics (no ML framework dependency)
- Hybrid Generation: Blend rule-based and diffusion outputs via interpolation, selection, or per-column ensemble strategies
- Training Pipeline: Fit diffusion models from column statistics, persist as JSON, evaluate with mean/std/correlation error metrics
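To make the noise-schedule bullet concrete, the sketch below precomputes a standard cosine schedule (in the style of Nichol & Dhariwal) for the 100-step setup shown in the sample configuration. It is independent of the crate's `DiffusionBackend`.

```python
# Cosine noise schedule: precompute cumulative signal retention
# (alpha_bar) and per-step noise (beta). Independent of the crate.
import numpy as np

def cosine_schedule(n_steps: int, s: float = 0.008) -> np.ndarray:
    """alpha_bar[t]: fraction of original signal remaining at step t."""
    t = np.arange(n_steps + 1) / n_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

alpha_bar = cosine_schedule(100)               # n_steps: 100 as in the config
betas = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
print(alpha_bar[[0, 25, 50, 75, 100]])         # decays smoothly from 1 to ~0
```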
- Causal Graphs: Directed acyclic graphs with linear, threshold, polynomial, and logistic mechanisms
- Structural Causal Models: Generate samples respecting causal structure via topological traversal
- do-Calculus Interventions: Fix variables to specific values and measure average treatment effects with confidence intervals
- Counterfactual Generation: Abduction-action-prediction framework for "what-if" scenario analysis
- Causal Validation: Verify edge correlations, non-edge weakness, and topological consistency
- Built-in Templates: Pre-configured fraud detection and revenue cycle causal models
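To illustrate the structural-causal-model and do-calculus bullets above, the sketch below samples a three-variable linear SCM in topological order and estimates an average treatment effect under an intervention. The variable names are illustrative and are not the built-in templates.

```python
# Minimal linear SCM sampled in topological order, with a do()
# intervention. Names and coefficients are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

def sample(do_pressure=None):
    # pressure -> override -> misstatement, plus a direct pressure edge
    pressure = do_pressure if do_pressure is not None else rng.normal(0, 1, n)
    override = 0.8 * pressure + rng.normal(0, 1, n)
    return 0.5 * override + 0.3 * pressure + rng.normal(0, 1, n)

# Average treatment effect of do(pressure=1) vs do(pressure=0)
ate = sample(np.ones(n)).mean() - sample(np.zeros(n)).mean()
print(f"ATE ≈ {ate:.3f}   (theory: 0.8 * 0.5 + 0.3 = 0.7)")
```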
- Apache Airflow: `DataSynthOperator`, `DataSynthSensor`, and `DataSynthValidateOperator` for DAG-based orchestration
- dbt: Source YAML generation, seed export, and project scaffolding from DataSynth output
- MLflow: Track generation runs as experiments with parameters, metrics, and artifact logging
- Apache Spark: Read DataSynth output as Spark DataFrames with schema inference and temp view registration
- REST & gRPC APIs: Streaming generation with Argon2id authentication and rate limiting
- JWT/OIDC Authentication: RS256 JWT validation with Keycloak, Auth0, and Entra ID support (feature-gated)
- Role-Based Access Control: Admin/Operator/Viewer roles with 7 permission types and structured JSON audit logging
- gRPC Auth Interceptor: Bearer token validation for gRPC endpoints with API versioning headers
- Quality Gates: Configurable pass/fail thresholds (strict/default/lenient) with 8 metrics and CLI enforcement
- Plugin SDK: Extensible `GeneratorPlugin`, `SinkPlugin`, and `TransformPlugin` traits with a thread-safe registry
- Webhook Notifications: Fire-and-forget event dispatch for RunStarted, RunCompleted, RunFailed, GateViolation
- EU AI Act Compliance: Article 50 synthetic content marking and Article 10 data governance reports
- Compliance Documentation: SOC 2 Type II readiness, ISO 27001 Annex A alignment, NIST AI RMF, GDPR templates
- Async Job Queue: Submit/poll/cancel pattern for long-running generation jobs
- Security Hardening: Security headers, request validation, request ID propagation, timing-safe auth
- TLS Support: Native rustls TLS or reverse proxy (nginx/envoy) with documented configuration
- OpenTelemetry: Feature-gated OTEL integration with OTLP traces and Prometheus metrics
- Structured Logging: JSON-formatted logs with request IDs, method, path, status, and latency
- Docker & Compose: Multi-stage distroless containers, local dev stack with Prometheus + Grafana
- Kubernetes Helm Chart: Production-ready chart with HPA, PDB, optional Redis subchart, and Prometheus ServiceMonitor
- CI/CD Pipeline: 7-job GitHub Actions (fmt, clippy, cross-platform test, MSRV, security, coverage, benchmarks)
- Release Automation: Binary builds for 5 platforms, GHCR container publishing, Trivy scanning
- Data Lineage & Provenance: Per-file checksums, lineage graph, W3C PROV-JSON export, CLI `verify` command
- Distributed Rate Limiting: Redis-backed sliding window rate limiting for multi-instance deployments
- Streaming Output API: Async generation with backpressure handling (Block, DropOldest, DropNewest, Buffer)
- Rate Limiting: Token bucket rate limiter for controlled generation throughput (see the sketch after this feature list)
- Load Testing: k6 scripts for health, bulk generation, WebSocket, job queue, and soak testing
- Temporal Attributes: Bi-temporal data support (valid time + transaction time) with version chains
- Desktop UI: Cross-platform Tauri/SvelteKit application
- Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
- Panic-Free Library Crates: `#![deny(clippy::unwrap_used)]` enforced across all library crates
- Fuzzing: cargo-fuzz targets for config parsing, fingerprint loading, and validation
- Evaluation Framework: Auto-tuning with quality gate enforcement and configuration recommendations
- Deterministic Generation: Seeded RNG for reproducible output
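As referenced in the rate-limiting bullet above, here is a behavioral sketch of a token bucket using the `entities_per_second` and `burst_size` keys from the sample configuration. It shows the semantics only and is not the crate's implementation.

```python
# Token bucket: allow bursts up to capacity, refill at a steady rate.
# Key names mirror the rate_limit block in the sample config.
import time

class TokenBucket:
    def __init__(self, entities_per_second: float, burst_size: int):
        self.rate = entities_per_second
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(entities_per_second=10_000, burst_size=100)
granted = sum(bucket.try_acquire() for _ in range(1_000))
print(f"granted {granted} of 1000 back-to-back requests")  # ~burst_size
```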
SyntheticData is organized as a Rust workspace with 16 modular crates:
```
datasynth-cli            Command-line interface (binary: datasynth-data)
datasynth-server         REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui             Tauri/SvelteKit desktop application
             │
datasynth-runtime        Orchestration layer (parallel execution, resource guards)
             │
datasynth-generators     Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking        KYC/AML banking transaction generator
datasynth-ocpm           Object-Centric Process Mining (OCEL 2.0, XES 2.0, 8 process families)
datasynth-fingerprint    Privacy-preserving fingerprint extraction and synthesis
datasynth-standards      Accounting/audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
             │
datasynth-graph          Graph/network export (PyTorch Geometric, Neo4j, DGL, RustGraph Multi-Layer Hypergraph)
datasynth-eval           Evaluation framework with auto-tuning
             │
datasynth-config         Configuration schema, validation, industry presets
             │
datasynth-core           Domain models, traits, distributions, resource guards
             │
datasynth-output         Output sinks (CSV, JSON, NDJSON, Parquet/Zstd) with streaming support
datasynth-test-utils     Test utilities, fixtures, mocks
```
See individual crate READMEs for detailed documentation.
```bash
# Install the CLI tool
cargo install datasynth-cli

# Or add individual crates to your project
cargo add datasynth-core datasynth-generators datasynth-config
```

Or build from source:

```bash
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release
```

The binary is available at `target/release/datasynth-data`.
| Crate | Description |
|---|---|
| `datasynth-core` | Domain models, traits, distributions |
| `datasynth-config` | Configuration schema and validation |
| `datasynth-generators` | Data generators |
| `datasynth-banking` | KYC/AML banking transactions |
| `datasynth-fingerprint` | Privacy-preserving fingerprint extraction |
| `datasynth-standards` | Accounting/audit standards (IFRS, US GAAP, ISA, SOX, PCAOB) |
| `datasynth-graph` | Graph/network export |
| `datasynth-eval` | Evaluation framework |
| `datasynth-runtime` | Orchestration layer |
| `datasynth-cli` | Command-line interface |
| `datasynth-server` | REST/gRPC server |
- Rust 1.88 or later
- For the desktop UI: Node.js 18+ and platform-specific Tauri dependencies
```bash
# Generate a configuration file for a manufacturing company
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Validate the configuration
datasynth-data validate --config config.yaml

# Generate synthetic data
datasynth-data generate --config config.yaml --output ./output

# View available presets and options
datasynth-data info
```

```bash
# Quick demo with default settings
datasynth-data generate --demo --output ./demo-output

# Generate with graph export for ML training
datasynth-data generate --demo --output ./demo-output --graph-export
```

SyntheticData uses YAML configuration files with comprehensive options:

```yaml
global:
seed: 42 # For reproducible generation
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 1.0 # Transaction volume weight
transactions:
target_count: 100000
benford:
enabled: true
fraud:
enabled: true
fraud_rate: 0.005 # 0.5% fraud rate
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true # For supervised learning
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
- rustgraph # RustGraph/RustAssureTwin compatible JSON
- rustgraph_hypergraph # 3-layer hypergraph JSONL for RustGraph
hypergraph:
enabled: true
max_nodes: 50000
aggregation_strategy: pool_by_counterparty
streaming:
enabled: true
buffer_size: 1000
backpressure: block # block, drop_oldest, drop_newest, buffer
rate_limit:
enabled: true
entities_per_second: 10000
burst_size: 100
distributions:
enabled: true
industry_profile: retail # retail, manufacturing, financial_services
amounts:
enabled: true
distribution_type: lognormal
components:
- { weight: 0.60, mu: 6.0, sigma: 1.5, label: "routine" }
- { weight: 0.30, mu: 8.5, sigma: 1.0, label: "significant" }
- { weight: 0.10, mu: 11.0, sigma: 0.8, label: "major" }
benford_compliance: true
correlations:
enabled: true
copula_type: gaussian # gaussian, clayton, gumbel, frank, student_t
fields: [amount, line_items, approval_level]
matrix:
- [1.00, 0.65, 0.72]
- [0.65, 1.00, 0.55]
- [0.72, 0.55, 1.00]
regime_changes:
enabled: true
economic_cycle:
enabled: true
cycle_period_months: 48
amplitude: 0.15
recession_probability: 0.1
validation:
enabled: true
tests:
- { type: benford_first_digit, threshold_mad: 0.015 }
- { type: distribution_fit, target: lognormal, significance: 0.05 }
- { type: correlation_check, significance: 0.05 }
accounting_standards:
enabled: true
framework: us_gaap # us_gaap, ifrs, dual_reporting
revenue_recognition:
enabled: true
generate_contracts: true
leases:
enabled: true
finance_lease_percent: 0.30
audit_standards:
enabled: true
isa_compliance:
enabled: true
compliance_level: comprehensive
framework: dual # isa, pcaob, dual
sox:
enabled: true
materiality_threshold: 10000.0
# Enterprise Process Chain Extensions (v0.6.0)
source_to_pay:
enabled: true
sourcing:
projects_per_year: 20
qualification:
pass_rate: 0.80
rfx:
invited_vendors_min: 3
invited_vendors_max: 8
contracts:
duration_months_min: 12
duration_months_max: 36
scorecards:
frequency: quarterly
financial_reporting:
enabled: true
generate_balance_sheet: true
generate_income_statement: true
generate_cash_flow: true
management_kpis:
enabled: true
frequency: monthly
budgets:
enabled: true
revenue_growth_rate: 0.05
hr:
enabled: true
payroll:
enabled: true
pay_frequency: monthly
time_attendance:
enabled: true
overtime_rate: 0.10
expenses:
enabled: true
submission_rate: 0.30
manufacturing:
enabled: true
production_orders:
orders_per_month: 50
yield_rate: 0.97
costing:
labor_rate_per_hour: 35.00
overhead_rate: 1.50
sales_quotes:
enabled: true
quotes_per_month: 30
win_rate: 0.35
validity_days: 30
vendor_network:
enabled: true
depth: 3 # Tier1/Tier2/Tier3
clusters:
reliable_strategic: 0.20
standard_operational: 0.50
transactional: 0.25
problematic: 0.05
dependencies:
max_single_vendor_concentration: 0.15
top_5_concentration: 0.45
customer_segmentation:
enabled: true
value_segments:
enterprise: { revenue_share: 0.40, customer_share: 0.05 }
mid_market: { revenue_share: 0.35, customer_share: 0.20 }
smb: { revenue_share: 0.20, customer_share: 0.50 }
consumer: { revenue_share: 0.05, customer_share: 0.25 }
relationship_strength:
enabled: true
calculation:
transaction_volume_weight: 0.30
transaction_count_weight: 0.25
relationship_duration_weight: 0.20
recency_weight: 0.15
mutual_connections_weight: 0.10
ocpm:
enabled: true
generate_lifecycle_events: true
compute_variants: true
output:
ocel_json: true # OCEL 2.0 JSON format
ocel_xml: false # OCEL 2.0 XML format
xes: true # XES 2.0 for ProM/Celonis/Disco
xes_include_lifecycle: true # Include start/complete transitions
xes_include_resources: true # Include resource attributes
export_reference_models: true # Export P2P/O2C/R2R reference models
llm:
enabled: true
provider: mock # mock, openai, anthropic, custom
enrichment:
vendor_names: true
transaction_descriptions: true
anomaly_explanations: true
diffusion:
enabled: true
n_steps: 100
schedule: cosine # linear, cosine, sigmoid
sample_size: 100
causal:
enabled: true
template: fraud_detection # fraud_detection, revenue_cycle
sample_size: 500
validate: true
output:
format: csv # csv, json, parquet
  compression: none         # none, gzip, zstd (parquet uses zstd by default)
```

See the Configuration Guide for complete documentation.
```
output/
├── master_data/ Vendors, customers, materials, assets, employees
├── transactions/ Journal entries, purchase orders, invoices, payments
├── sourcing/ S2C sourcing pipeline outputs
│ ├── sourcing_projects.csv
│ ├── supplier_qualifications.csv
│ ├── rfx_events.csv
│ ├── supplier_bids.csv
│ ├── bid_evaluations.csv
│ ├── procurement_contracts.csv
│ ├── catalog_items.csv
│ └── supplier_scorecards.csv
├── subledgers/ AR, AP, FA, inventory detail records
├── hr/ HR & payroll outputs
│ ├── payroll_runs.csv
│ ├── payslips.csv
│ ├── time_entries.csv
│ └── expense_reports.csv
├── manufacturing/ Production & quality outputs
│ ├── production_orders.csv
│ ├── routing_operations.csv
│ ├── quality_inspection_lots.csv
│ └── cycle_count_records.csv
├── period_close/ Trial balances, accruals, closing entries
├── financial_reporting/ Financial statements & management reporting
│ ├── balance_sheet.csv
│ ├── income_statement.csv
│ ├── cash_flow_statement.csv
│ ├── changes_in_equity.csv
│ ├── financial_kpis.csv
│ └── budget_variance.csv
├── sales/ Sales pipeline outputs
│ ├── sales_quotes.csv
│ └── sales_quote_items.csv
├── consolidation/ Eliminations, currency translation
├── fx/ Exchange rates, CTA adjustments
├── banking/ KYC profiles, bank transactions, AML typology labels
│ ├── bank_statement_lines.csv
│ ├── bank_reconciliations.csv
│ └── reconciling_items.csv
├── process_mining/ Event logs and process models
│ ├── event_log.json OCEL 2.0 JSON format
│ ├── event_log.xes XES 2.0 XML format (for ProM, Celonis, Disco)
│ ├── process_variants/ Discovered process variants
│ └── reference_models/ Canonical P2P, O2C, R2R process models
├── audit/ Engagements, workpapers, findings, risk assessments
├── graphs/ PyTorch Geometric, Neo4j, DGL, RustGraph exports
│ └── hypergraph/ Multi-layer hypergraph (nodes.jsonl, edges.jsonl, hyperedges.jsonl)
├── labels/ Anomaly, fraud, and data quality labels for ML
├── controls/ Internal controls, COSO mappings, SoD rules
└── standards/ Accounting & audit standards outputs
    ├── accounting/       Contracts, leases, fair value, impairment tests
    └── audit/            ISA mappings, confirmations, opinions, SOX assessments
```
| Use Case | Description |
|---|---|
| Fraud Detection ML | Train supervised models with labeled fraud patterns |
| Graph Neural Networks | Entity relationship graphs for anomaly detection |
| AML/KYC Testing | Banking transaction data with structuring, layering, mule patterns |
| Audit Analytics | Test audit procedures with known control exceptions |
| Process Mining | OCEL 2.0 and XES 2.0 event logs for process discovery and conformance checking |
| Conformance Checking | Reference process models (P2P, O2C, R2R) for process validation |
| ERP Testing | Load testing with realistic transaction volumes |
| Procurement Analytics | Source-to-contract pipeline with spend analysis, RFx, bids, and supplier scorecards |
| HR & Payroll Testing | Payroll runs, time tracking, expense management with policy compliance |
| Manufacturing Simulation | Production orders, BOM explosion, WIP costing, quality inspections |
| Financial Reporting | Balance sheet, income statement, cash flow, KPIs, and budget variance |
| Bank Reconciliation | Statement matching, outstanding items, net difference validation |
| SOX Compliance | Test internal control monitoring systems |
| COSO Framework | COSO 2013 control mapping with 5 components, 17 principles, maturity levels |
| Standards Compliance | IFRS/US GAAP revenue recognition, lease accounting, fair value, impairment testing |
| Audit Standards | ISA/PCAOB procedure mapping, analytical procedures, confirmations, audit opinions |
| Data Quality ML | Train models to detect missing values, typos, duplicates |
| RustGraph Integration | Stream data directly to RustAssureTwin knowledge graphs |
| Hypergraph Analytics | 3-layer hypergraph export (Governance, Process, Accounting) for multi-relational GNN models |
| Causal Analysis | Generate interventional and counterfactual datasets for causal ML research |
| LLM Training Data | LLM-enriched metadata with realistic vendor names, descriptions, and explanations |
| Pipeline Orchestration | Airflow operators, dbt sources, MLflow tracking, Spark DataFrames |
| Metric | Performance |
|---|---|
| Single-threaded throughput | 100,000+ entries/second |
| Parallel scaling | Linear with available cores |
| Memory efficiency | Streaming generation for large volumes |
```bash
# Start REST + gRPC server
cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051 --worker-threads 4

# With API key authentication
cargo run -p datasynth-server -- --api-keys "key1,key2"

# With JWT/OIDC authentication (requires jwt feature)
cargo run -p datasynth-server --features jwt -- \
  --jwt-issuer "https://auth.example.com" \
  --jwt-audience "datasynth-api" \
  --jwt-public-key /path/to/public.pem

# With RBAC and audit logging
cargo run -p datasynth-server -- --api-keys "key1" --rbac-enabled --audit-log

# With TLS (requires tls feature)
cargo run -p datasynth-server --features tls -- --tls-cert cert.pem --tls-key key.pem
```

```bash
# API endpoints
curl http://localhost:3000/health    # Health check
curl http://localhost:3000/ready     # Readiness probe (config + memory + disk)
curl http://localhost:3000/metrics   # Prometheus metrics
curl -H "Authorization: Bearer <key>" http://localhost:3000/api/config
curl -H "Authorization: Bearer <key>" -X POST http://localhost:3000/api/stream/start
```

WebSocket streaming is available at `ws://localhost:3000/ws/events`.
```bash
# Build and run the server
docker build -t datasynth:latest .
docker run -p 50051:50051 -p 3000:3000 datasynth:latest

# Or use Docker Compose for the full stack (server + Prometheus + Grafana)
docker compose up -d
# REST API: http://localhost:3000 | gRPC: localhost:50051
# Prometheus: http://localhost:9090 | Grafana: http://localhost:3001
```

See the Deployment Guide for Docker, SystemD, and reverse proxy setup.
```bash
cd crates/datasynth-ui
npm install
npm run tauri dev
```

The desktop application provides visual configuration, real-time streaming, and preset management.
Features:
- 40+ config pages with form controls for every generation parameter
- Info cards on feature pages explaining capabilities before enabling
- Sidebar navigation with collapsible sections and scroll indicator for 10 section groups
- Web preview mode — run `npm run dev` for config editing without Tauri; the dashboard requires `npm run tauri dev`
- Visual regression testing — 56 Playwright screenshot baselines for UI consistency
Extract privacy-preserving fingerprints from real data and generate matching synthetic data:
```bash
# Extract fingerprint from CSV data
datasynth-data fingerprint extract \
  --input ./real_data.csv \
  --output ./fingerprint.dsf \
  --privacy-level standard

# Validate fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf

# Show fingerprint info
datasynth-data fingerprint info ./fingerprint.dsf --detailed

# Compare fingerprints
datasynth-data fingerprint diff ./fp1.dsf ./fp2.dsf

# Evaluate synthetic data fidelity
datasynth-data fingerprint evaluate \
  --fingerprint ./fingerprint.dsf \
  --synthetic ./synthetic_data/ \
  --threshold 0.8
```

Privacy Levels:
| Level | Epsilon (ε) | k (k-anonymity) | Use Case |
|---|---|---|---|
| minimal | 5.0 | 3 | Low privacy, high utility |
| standard | 1.0 | 5 | Balanced (default) |
| high | 0.5 | 10 | Higher privacy |
| maximum | 0.1 | 20 | Maximum privacy |
See the Fingerprinting Guide for complete documentation.
A Python wrapper (v1.0.0) is available for programmatic access:
```bash
cd python
pip install -e ".[all]"   # Includes pandas, polars, jupyter, streaming
```

```python
from datasynth_py import DataSynth, AsyncDataSynth
from datasynth_py import to_pandas, to_polars, list_tables
from datasynth_py.config import blueprints
# Basic generation
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)
# DataFrame loading
tables = list_tables(result) # ['journal_entries', 'vendors', ...]
df = to_pandas(result, "journal_entries")
pl_df = to_polars(result, "vendors")
# Async generation
async with AsyncDataSynth() as synth:
    result = await synth.generate(config=config)
# Fingerprint operations
synth.fingerprint.extract("./real_data/", "./fingerprint.dsf", privacy_level="standard")
report = synth.fingerprint.evaluate("./fingerprint.dsf", "./synthetic/")
print(f"Fidelity score: {report.overall_score}")Optional dependencies: [pandas], [polars], [jupyter], [streaming], [airflow], [dbt], [mlflow], [spark], [all].
Ecosystem Integrations:
```python
from datasynth_py.config import blueprints
# LLM-enriched generation
config = blueprints.with_llm_enrichment(provider="mock")
# Diffusion-enhanced generation
config = blueprints.with_diffusion(schedule="cosine", hybrid_weight=0.3)
# Causal data generation
config = blueprints.with_causal(template="fraud_detection")
# Airflow operator
from datasynth_py.integrations.airflow import DataSynthOperator
# dbt integration
from datasynth_py.integrations.dbt import DbtSourceGenerator
# MLflow tracking
from datasynth_py.integrations.mlflow_tracker import DataSynthMlflowTracker
# Spark connector
from datasynth_py.integrations.spark import DataSynthSparkReader
```

See the Python Wrapper Guide for complete documentation.
- Configuration Guide
- API Reference
- Architecture Overview
- Python Wrapper Guide
- Compliance & Regulatory
- Contributing Guidelines
Copyright 2024-2026 Michael Ivertowski, Ernst & Young Ltd., Zurich, Switzerland
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.
This project incorporates research on statistical distributions in accounting data and implements industry-standard patterns for enterprise financial systems.
SyntheticData is provided "as is" without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.