GPU-accelerated synthetic genomic data generation for machine learning
BioForge generates realistic synthetic DNA, RNA, and protein sequences with known ground truth, designed for training and benchmarking ML models in computational biology. It provides GPU acceleration, comprehensive evolutionary models, and full reproducibility through configuration tracking.
- Fast generation: GPU-accelerated with CuPy, CPU fallback with NumPy
- Multiple sequence types: DNA, RNA, protein, and codon-aware coding sequences
- Realistic sequences: Configurable GC content, amino acid frequencies, codon usage bias
- Evolutionary simulation: JC69, K80, HKY85, GTR substitution models with rate heterogeneity
- Population genetics: Coalescent, structured coalescent, selection, demographic models
- Genome evolution: Gene duplication/loss, whole genome duplication, segmental duplication
- Recombination: Crossover simulation, ancestral recombination graphs, gene conversion
- Structural variants: Inversions, deletions, duplications, translocations, CNVs
- Indel models: General indel simulation, TKF91/TKF92 probabilistic models
- Noise models: Platform-specific sequencing errors (Illumina, PacBio, Nanopore), coverage bias, contamination
- Motif insertion: Insert known binding sites with ground truth positions
- Tree generation: Yule, birth-death, coalescent, balanced, and caterpillar topologies
- ML-ready outputs: One-hot encoding, PyTorch Datasets, HDF5 export
- Data augmentation: Reverse complement, masking, k-mer shuffling, random mutations
- Pre-built benchmarks: Ready-to-use datasets for common ML tasks
- Simulation pipeline: Chain models together for complex workflows
- Reproducibility: Registry system for tracking configurations and provenance
# Basic installation
pip install bioforge
# With GPU support
pip install bioforge[gpu]
# With ML frameworks
pip install bioforge[ml]
# Everything
pip install bioforge[all]

from bioforge import DNAGenerator
# Create generator with 50% GC content
gen = DNAGenerator(gc_content=0.5, seed=42)
# Generate 10,000 sequences of length 150bp
sequences = gen.generate(n=10000, length=150)
print(sequences.shape) # (10000, 150)
# One-hot encoded for neural networks
onehot = gen.generate(n=10000, length=150, encoding="onehot")
print(onehot.shape) # (10000, 150, 4)
# As strings for traditional tools
strings = gen.generate(n=100, length=150, encoding="string")
print(strings[0]) # "ATCGATCG..."

from bioforge import RNAGenerator, ProteinGenerator, CodonGenerator
# RNA with custom base frequencies
rna_gen = RNAGenerator(base_frequencies=[0.25, 0.25, 0.25, 0.25], seed=42)
rna_seqs = rna_gen.generate(n=1000, length=100)
# Protein with amino acid frequency profiles
prot_gen = ProteinGenerator(frequency_profile="uniform", seed=42)
proteins = prot_gen.generate(n=1000, length=50)
# Codon-aware sequences with usage bias
codon_gen = CodonGenerator(codon_table="standard", seed=42)
coding_seqs = codon_gen.generate(n=1000, length=300) # length must be divisible by 3

from bioforge import SimulationPipeline
from bioforge.models import HKY85, GammaRates
# Build a complete simulation pipeline
pipeline = (
    SimulationPipeline(seed=42)
    .generate(n=1000, length=500, gc_content=0.45)
    .evolve(
        HKY85(kappa=2.0, base_frequencies=[0.3, 0.2, 0.2, 0.3]),
        branch_length=0.1,
        track_mutations=True
    )
    .with_rate_variation(GammaRates(alpha=0.5, n_categories=4))
    .with_indels(insertion_rate=0.01, deletion_rate=0.01)
    .add_noise(platform="illumina", error_rate=1.0)
)
# Execute the pipeline
result = pipeline.run()
print(f"Generated {result.n_sequences} sequences")
print(f"Mutation positions tracked: {len(result.mutation_positions)}")

from bioforge import get_benchmark, list_benchmarks
# See available benchmarks
print(list_benchmarks())
# Get a benchmark dataset
dataset = get_benchmark("promoter_classification", n_samples=10000, seed=42)
# Ready for ML
train, test = dataset.train_test_split(test_size=0.2)
X_train, y_train = train.sequences, train.labels

from bioforge import DNAGenerator
from bioforge.generators import MotifInserter, tata_box, e_box
# Generate background sequences
gen = DNAGenerator(gc_content=0.5, seed=42)
sequences = gen.generate(n=1000, length=200, encoding="index")
# Insert motifs with known positions
inserter = MotifInserter(seed=42)
inserter.add_motif(tata_box(), frequency=0.4)
inserter.add_motif(e_box(), frequency=0.3)
result = inserter.insert(sequences)
print(f"Sequences with motifs: {result.n_positive}")
# Access ground truth positions via result.positions

from bioforge.models import RecombinationMap, Recombinator, AncestralRecombinationGraph
# Define position-specific recombination rates
recomb_map = RecombinationMap(
    positions=[0, 1000, 2000, 3000],
    rates=[1e-8, 5e-8, 1e-8] # hotspot in the middle
)
# Simulate crossovers between haplotypes
# (haplotype1 and haplotype2 stand for two previously generated sequences)
recombinator = Recombinator(recombination_map=recomb_map, seed=42)
recombined = recombinator.recombine(haplotype1, haplotype2)
# Simulate ancestral recombination graph
arg = AncestralRecombinationGraph(n_samples=10, sequence_length=10000, seed=42)
arg_result = arg.simulate(recombination_rate=1e-8)

from bioforge.models import Coalescent, StructuredCoalescent, SelectionModel, DemographicModel
# Standard coalescent
coal = Coalescent(n_samples=50, seed=42)
tree = coal.simulate(effective_population_size=10000)
# Structured coalescent with migration
struct_coal = StructuredCoalescent(
    n_populations=3,
    n_samples=[20, 20, 10],
    migration_matrix=[[0, 1e-4, 1e-5], [1e-4, 0, 1e-4], [1e-5, 1e-4, 0]],
    seed=42
)
# Selection models
selection = SelectionModel(
    selection_coefficient=0.01,
    selection_type="positive",
    seed=42
)

from bioforge.models import SequencingErrorModel, NoiseLayer, CoverageBiasModel
# Platform-specific sequencing errors
illumina_errors = SequencingErrorModel(platform="illumina", error_rate=0.01, seed=42)
noisy_reads = illumina_errors.apply(sequences)
# Combine multiple noise sources
noise = NoiseLayer(seed=42)
noise.add(SequencingErrorModel(platform="illumina", error_rate=0.01))
noise.add(CoverageBiasModel(gc_bias_strength=0.3))
noisy_data = noise.apply(sequences)

from bioforge.io import write_fasta, write_fastq, to_hdf5, to_torch_dataset
# Save as FASTA
write_fasta(sequences, "sequences.fa", sequence_type="dna")
# Save as FASTQ with quality scores
from bioforge.io import generate_quality_scores
quality = generate_quality_scores(n_sequences=1000, length=150)
write_fastq(sequences, "reads.fq", quality_scores=quality)
# Save as HDF5 (compressed, good for large datasets)
to_hdf5(sequences, "dataset.h5", labels=labels)
# Create PyTorch Dataset
from torch.utils.data import DataLoader
dataset = to_torch_dataset(sequences, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

from bioforge import SequenceAugmenter
aug = SequenceAugmenter(seed=42)
# Reverse complement
rc_seqs = aug.reverse_complement(sequences)
# Random mutations (with transition bias)
mutated = aug.random_mutation(sequences, rate=0.05)
# BERT-style masking (mask targets kept separate from the class labels used below)
masked, mask_labels = aug.mask_random(sequences, mask_rate=0.15)
# Create augmented dataset
X_aug, y_aug = aug.create_augmented_dataset(
    sequences, labels,
    augmentation_factor=3
)

from bioforge import DNAGenerator
from bioforge.registry import tracked, load, reproduce
# Generate with automatic tracking
gen = DNAGenerator(gc_content=0.5, seed=42)
with tracked(name="my_dataset", version="1.0") as session:
    sequences = gen.generate(n=1000, length=150)
    session.save(sequences)
# Later: retrieve by name
dataset = load("my_dataset/1.0")
sequences = dataset.data
# Or reproduce from scratch using the config
new_sequences = reproduce(dataset.config)

from bioforge import DNAGenerator
# Enable GPU (requires CuPy)
gen = DNAGenerator(gc_content=0.5, use_gpu=True)
# Generate 1M sequences on the GPU
sequences = gen.generate(n=1_000_000, length=150)

| Operation | CPU (NumPy) | GPU (CuPy) | Speedup |
|---|---|---|---|
| 100K seqs x 150bp | 2.3s | 0.08s | 29x |
| 1M seqs x 150bp | 23s | 0.7s | 33x |
| 100K seqs one-hot | 4.1s | 0.12s | 34x |
Benchmarked on an AMD Ryzen 7 7700X (CPU) and an NVIDIA RTX 5070 Ti (GPU).
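The speedup column follows directly from the two timing columns. As a quick check, the short sketch below recomputes it from the reported timings and derives the GPU throughput for the 1M-sequence case. It uses only the numbers in the table; the variable names are illustrative and not part of BioForge.

```python
# (CPU seconds, GPU seconds) as reported in the benchmark table
timings = {
    "100K seqs x 150bp": (2.3, 0.08),
    "1M seqs x 150bp": (23.0, 0.7),
    "100K seqs one-hot": (4.1, 0.12),
}

# Speedup is CPU time divided by GPU time, rounded to the nearest integer.
speedups = {name: round(cpu / gpu) for name, (cpu, gpu) in timings.items()}

# Throughput for the largest run: 1M sequences of 150 bases in 0.7 s.
gpu_bases_per_second = 1_000_000 * 150 / 0.7

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
print(f"GPU throughput: {gpu_bases_per_second:.2e} bases/s")
```

Timings depend on hardware, driver, and CuPy version, so treat the figures as indicative for this one machine.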
- DNAGenerator: DNA sequences with configurable GC content
- RNAGenerator: RNA sequences with base frequency control
- ProteinGenerator: Protein sequences with amino acid frequency profiles
- CodonGenerator: Codon-aware coding sequences with usage bias
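To make "configurable GC content" and one-hot encoding concrete, here is a minimal standard-library sketch of the idea. It is not BioForge's implementation: `sample_dna` and `one_hot` are hypothetical helpers, and the A, C, G, T channel order is an assumption.

```python
import random

BASES = "ACGT"

def sample_dna(length, gc_content, rng):
    """Draw a DNA string where each base is G or C with probability gc_content."""
    # G and C share the GC mass equally; A and T share the remainder.
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    return "".join(rng.choices(BASES, weights=[at, gc, gc, at], k=length))

def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values in A, C, G, T order (assumed)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

rng = random.Random(42)
seq = sample_dna(10_000, gc_content=0.6, rng=rng)
gc_fraction = sum(base in "GC" for base in seq) / len(seq)
encoded = one_hot(seq[:5])

print(f"Observed GC fraction: {gc_fraction:.3f}")
print(encoded)
```

The observed GC fraction fluctuates around the target because each base is drawn independently; longer sequences track the target more closely.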
- Substitution: JC69, K80, HKY85, GTR
- Rate variation: GammaRates, InvariantSites, GammaInvariant, FreeRates, CodonRates
- Indels: IndelModel, TKF91, TKF92
- Tree generation: Yule, Birth-Death, Coalescent, Balanced, Caterpillar
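The substitution models listed above are nested: K80 is HKY85 with equal base frequencies, and JC69 is K80 with kappa = 1. The sketch below builds a textbook HKY85 rate matrix to show this. It is independent of BioForge's classes; `hky85_rate_matrix` is a hypothetical helper, and the A, C, G, T ordering of the frequency vector is an assumption.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """A<->G and C<->T are transitions; every other change is a transversion."""
    return (a in PURINES) == (b in PURINES)

def hky85_rate_matrix(kappa, pi):
    """Return a 4x4 HKY85 rate matrix, scaled to one expected substitution per unit time."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                # Transitions are kappa times faster than transversions.
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * pi[j]
        # The diagonal is still 0.0 here, so this sums the off-diagonal entries.
        q[i][i] = -sum(q[i])
    # Normalise so the expected rate at stationarity, -sum_i pi_i * q_ii, equals 1.
    scale = -sum(pi[i] * q[i][i] for i in range(4))
    return [[x / scale for x in row] for row in q]

pi = [0.3, 0.2, 0.2, 0.3]
q = hky85_rate_matrix(kappa=2.0, pi=pi)
jc69 = hky85_rate_matrix(kappa=1.0, pi=[0.25] * 4)

for row in q:
    print(["%.3f" % x for x in row])
```

With the normalisation used here, a branch length of 0.1 corresponds to 0.1 expected substitutions per site.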
- BirthDeathModel: Gene duplication and loss
- WholeGenomeDuplication: WGD and polyploidy events
- SegmentalDuplication: Local duplications
- GeneConversion: Concerted evolution between paralogs
- RecombinationMap: Position-specific recombination rates
- Recombinator: Crossover and gene conversion simulation
- AncestralRecombinationGraph: ARG simulation
- IntrageneRecombination: Within-gene recombination
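A position-specific recombination map can be read as a piecewise-constant rate per base. The sketch below shows one way such a map could drive a single crossover: pick an interval in proportion to rate times length, then a uniform position within it. This is an illustration under the assumption of exactly one crossover per event, not BioForge's algorithm; `sample_breakpoint` and `crossover` are hypothetical helpers.

```python
import random

def sample_breakpoint(positions, rates, rng):
    """Pick a crossover position with probability proportional to the local rate."""
    # Weight of each interval = per-base rate * interval length.
    weights = [r * (end - start) for r, start, end in zip(rates, positions, positions[1:])]
    k = rng.choices(range(len(rates)), weights=weights, k=1)[0]
    # The rate is constant inside an interval, so the position is uniform there.
    return rng.randrange(positions[k], positions[k + 1])

def crossover(hap1, hap2, breakpoint):
    """Single crossover: hap1 left of the breakpoint, hap2 from the breakpoint onward."""
    return hap1[:breakpoint] + hap2[breakpoint:]

rng = random.Random(42)
positions = [0, 1000, 2000, 3000]
rates = [1e-8, 5e-8, 1e-8]  # hotspot in the middle interval

draws = [sample_breakpoint(positions, rates, rng) for _ in range(7000)]
hotspot_fraction = sum(1000 <= b < 2000 for b in draws) / len(draws)
child = crossover("A" * 3000, "C" * 3000, draws[0])

print(f"Fraction of breakpoints in the hotspot: {hotspot_fraction:.3f}")
```

With these rates the hotspot carries 5/7 of the total recombination mass, so roughly 71% of breakpoints should fall between positions 1000 and 2000.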
- Coalescent: Standard coalescent simulation
- StructuredCoalescent: Multiple populations with migration
- SelectionModel: Purifying, positive, balancing selection
- DemographicModel: Bottlenecks, expansions, demographic scenarios
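For intuition about the standard coalescent: while k lineages remain, the waiting time to the next merger is exponential with rate k(k-1)/2 in coalescent time units. The sketch below simulates these waiting times and compares the mean time to the most recent common ancestor (TMRCA) with the analytical value 2(1 - 1/n). It is a textbook illustration, not BioForge's implementation; `coalescent_times` is a hypothetical helper, and the conversion to generations assumes diploid scaling (one unit = 2N generations).

```python
import random

def coalescent_times(n_samples, rng):
    """Waiting times between successive coalescent events, in coalescent units."""
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could merge
        times.append(rng.expovariate(rate))
    return times

rng = random.Random(42)
n = 50
tmrca_draws = [sum(coalescent_times(n, rng)) for _ in range(5000)]
mean_tmrca = sum(tmrca_draws) / len(tmrca_draws)
expected_tmrca = 2 * (1 - 1 / n)

# Assumed diploid scaling: one coalescent unit corresponds to 2N generations.
effective_population_size = 10_000
mean_tmrca_generations = mean_tmrca * 2 * effective_population_size

print(f"Simulated mean TMRCA: {mean_tmrca:.3f} (expected {expected_tmrca:.3f})")
```

Most of the tree height comes from the final merger of the last two lineages, which is why the TMRCA varies widely between replicates.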
- StructuralVariantModel: Inversions, translocations, CNVs, deletions, duplications
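Three of the variant types listed above are simple to express on a string. The sketch below is a plain-Python illustration of what each operation does to a sequence; the helper functions are hypothetical and do not reflect the `StructuralVariantModel` interface.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def invert(seq, start, end):
    """Inversion: replace seq[start:end] with its reverse complement."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]

def delete(seq, start, end):
    """Deletion: drop seq[start:end]."""
    return seq[:start] + seq[end:]

def duplicate(seq, start, end):
    """Tandem duplication: repeat seq[start:end] immediately after itself."""
    return seq[:end] + seq[start:end] + seq[end:]

genome = "AAACCGTTT"
inverted = invert(genome, 3, 6)       # "AAACGGTTT"
deleted = delete(genome, 3, 6)        # "AAATTT"
duplicated = duplicate(genome, 3, 6)  # "AAACCGCCGTTT"

print(inverted, deleted, duplicated)
```

Inverting the same interval twice restores the original sequence, which makes a convenient sanity check when tracking ground truth breakpoints.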
- SequencingErrorModel: Platform-specific errors (Illumina, PacBio, Nanopore)
- MissingDataModel: Random and block missingness patterns
- CoverageBiasModel: GC bias, low-coverage noise
- ContaminationModel: Foreign sequence injection
- AlignmentArtifactModel: Gap placement errors
- NoiseLayer: Combines multiple noise sources
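As a rough picture of what a per-base error rate means, the sketch below applies uniform substitution errors to a read and relates the rate to a Phred quality score (Q = -10 log10 p, so Q20 corresponds to p = 0.01). Real platform error profiles are position- and context-dependent; this uniform model is a simplification, and both helpers are hypothetical rather than BioForge functions.

```python
import random

def phred_to_error_prob(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def add_substitution_errors(seq, error_rate, rng):
    """Replace each base with a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(42)
read = "ACGT" * 2500
error_rate = phred_to_error_prob(20)  # Q20
noisy = add_substitution_errors(read, error_rate, rng)
observed_rate = sum(a != b for a, b in zip(read, noisy)) / len(read)

print(f"Target error rate: {error_rate:.4f}, observed: {observed_rate:.4f}")
```

Keeping the clean read alongside the noisy one gives exact error positions as ground truth, which is the main advantage of synthetic noise over real sequencing data.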
- MotifInserter: Insert motifs with ground truth tracking
- Built-in motifs: TATA box, Kozak sequence, poly-A signal, Sp1, NF-kB, CREB, E-box, CpG island
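The idea behind motif insertion with ground truth can be sketched in a few lines: overwrite a random window of a background sequence with the motif and record where it went. The code below is a simplified standard-library illustration, not the `MotifInserter` implementation. It uses the common TATA box consensus TATAAA as a fixed string, whereas a real inserter may sample from a position weight matrix.

```python
import random

def insert_motif(seq, motif, rng):
    """Overwrite a random window of seq with motif; return the new sequence and start index."""
    start = rng.randrange(len(seq) - len(motif) + 1)
    return seq[:start] + motif + seq[start + len(motif):], start

rng = random.Random(42)
background = ["".join(rng.choices("ACGT", k=200)) for _ in range(1000)]
motif = "TATAAA"
frequency = 0.4  # fraction of sequences that should receive the motif

sequences, positions = [], []
for seq in background:
    if rng.random() < frequency:
        seq, start = insert_motif(seq, motif, rng)
        positions.append(start)
    else:
        positions.append(None)  # negative example, no planted motif
    sequences.append(seq)

n_positive = sum(p is not None for p in positions)
print(f"Sequences with motifs: {n_positive}")
```

Note that a short motif such as TATAAA also occurs by chance in random background, so "negative" sequences are only guaranteed to lack a planted copy, not every copy.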
The examples/ directory contains working scripts demonstrating common workflows:
- quickstart.py: Basic sequence generation
- advanced_pipeline.py: Complex simulation pipelines with multiple models
- ml_training.py: Preparing data for ML training
- motif_discovery.py: Generating motif discovery benchmarks
- DNA/RNA/Protein sequence generation
- Codon-aware sequence generation with usage bias
- Evolutionary models (JC69, K80, HKY85, GTR)
- Rate heterogeneity (Gamma, +I, codon models)
- Indel simulation (IndelModel, TKF91/92)
- Tree generation (Yule, Birth-Death, Coalescent)
- Genome evolution (duplication, WGD, gene conversion)
- Recombination (crossover, ARG simulation)
- Structural variants (inversions, translocations, CNVs)
- Population genetics (coalescent, selection, demographics)
- Noise models (sequencing errors, contamination)
- Motif insertion with ground truth positions
- FASTA/FASTQ I/O with quality scores
- HDF5/NumPy/PyTorch export
- Simulation pipeline for chaining models
- Pre-built ML benchmarks
- Data augmentation utilities
- Dataset registry for reproducibility
- Custom CUDA kernels for maximum performance
- Integration with msprime/tskit
If you use BioForge in your research, please cite:
@software{bioforge,
  author = {Kessler, Mark},
  title = {BioForge: GPU-accelerated synthetic genomic data generation},
  url = {https://github.com/markkessler66/bioforge},
  year = {2025}
}

Contributions welcome! Please check out our contributing guidelines.
MIT License - see LICENSE for details.