GPU-Accelerated Python Library for generating realistic synthetic genomic data sets that are AI/ML ready!

markkessler66/BioForge

BioForge

GPU-accelerated synthetic genomic data generation for machine learning

Python 3.9+ | License: MIT

BioForge generates realistic synthetic DNA, RNA, and protein sequences with known ground truth, designed for training and benchmarking ML models in computational biology. It provides GPU acceleration, comprehensive evolutionary models, and full reproducibility through configuration tracking.

Features

  • Fast generation: GPU-accelerated with CuPy, CPU fallback with NumPy
  • Multiple sequence types: DNA, RNA, protein, and codon-aware coding sequences
  • Realistic sequences: Configurable GC content, amino acid frequencies, codon usage bias
  • Evolutionary simulation: JC69, K80, HKY85, GTR substitution models with rate heterogeneity
  • Population genetics: Coalescent, structured coalescent, selection, demographic models
  • Genome evolution: Gene duplication/loss, whole genome duplication, segmental duplication
  • Recombination: Crossover simulation, ancestral recombination graphs, gene conversion
  • Structural variants: Inversions, deletions, duplications, translocations, CNVs
  • Indel models: General indel simulation, TKF91/TKF92 probabilistic models
  • Noise models: Platform-specific sequencing errors (Illumina, PacBio, Nanopore), coverage bias, contamination
  • Motif insertion: Insert known binding sites with ground truth positions
  • Tree generation: Yule, birth-death, coalescent, balanced, and caterpillar topologies
  • ML-ready outputs: One-hot encoding, PyTorch Datasets, HDF5 export
  • Data augmentation: Reverse complement, masking, k-mer shuffling, random mutations
  • Pre-built benchmarks: Ready-to-use datasets for common ML tasks
  • Simulation pipeline: Chain models together for complex workflows
  • Reproducibility: Registry system for tracking configurations and provenance

Quick Start

Installation

# Basic installation
pip install bioforge

# With GPU support (the quotes stop shells such as zsh from expanding the brackets)
pip install "bioforge[gpu]"

# With ML frameworks
pip install "bioforge[ml]"

# Everything
pip install "bioforge[all]"

Generate DNA Sequences

from bioforge import DNAGenerator

# Create generator with 50% GC content
gen = DNAGenerator(gc_content=0.5, seed=42)

# Generate 10,000 sequences of length 150bp
sequences = gen.generate(n=10000, length=150)
print(sequences.shape)  # (10000, 150)

# One-hot encoded for neural networks
onehot = gen.generate(n=10000, length=150, encoding="onehot")
print(onehot.shape)  # (10000, 150, 4)

# As strings for traditional tools
strings = gen.generate(n=100, length=150, encoding="string")
print(strings[0])  # "ATCGATCG..."
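
To make the `gc_content` and `encoding="onehot"` options concrete, here is a minimal plain-Python sketch of the underlying idea. It is not BioForge's implementation: `sample_dna`, `one_hot`, and the A, C, G, T column order are assumptions made for illustration only.

```python
import random

BASES = "ACGT"  # assumed column order for the one-hot encoding


def sample_dna(n, length, gc_content=0.5, seed=None):
    """Draw n sequences, sampling every base independently."""
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0  # probability of A, and of T
    gc = gc_content / 2.0          # probability of C, and of G
    weights = [at, gc, gc, at]     # matches the order in BASES
    return ["".join(rng.choices(BASES, weights=weights, k=length)) for _ in range(n)]


def one_hot(seq):
    """Encode one sequence as a length x 4 nested list of 0/1 values."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


seqs = sample_dna(n=200, length=150, gc_content=0.5, seed=42)
encoded = [one_hot(s) for s in seqs]
gc_fraction = sum(s.count("G") + s.count("C") for s in seqs) / (200 * 150)
```

Each row of the encoding contains a single 1, so a batch has the `(n, length, 4)` layout shown in the shapes above.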

Generate RNA and Protein Sequences

from bioforge import RNAGenerator, ProteinGenerator, CodonGenerator

# RNA with custom base frequencies
rna_gen = RNAGenerator(base_frequencies=[0.25, 0.25, 0.25, 0.25], seed=42)
rna_seqs = rna_gen.generate(n=1000, length=100)

# Protein with amino acid frequency profiles
prot_gen = ProteinGenerator(frequency_profile="uniform", seed=42)
proteins = prot_gen.generate(n=1000, length=50)

# Codon-aware sequences with usage bias
codon_gen = CodonGenerator(codon_table="standard", seed=42)
coding_seqs = codon_gen.generate(n=1000, length=300)  # length in nucleotides; must be divisible by 3

Simulation Pipeline

from bioforge import SimulationPipeline
from bioforge.models import HKY85, GammaRates

# Build a complete simulation pipeline
pipeline = (
    SimulationPipeline(seed=42)
    .generate(n=1000, length=500, gc_content=0.45)
    .evolve(
        HKY85(kappa=2.0, base_frequencies=[0.3, 0.2, 0.2, 0.3]),
        branch_length=0.1,
        track_mutations=True
    )
    .with_rate_variation(GammaRates(alpha=0.5, n_categories=4))
    .with_indels(insertion_rate=0.01, deletion_rate=0.01)
    .add_noise(platform="illumina", error_rate=0.01)
)

# Execute the pipeline
result = pipeline.run()
print(f"Generated {result.n_sequences} sequences")
print(f"Mutation positions tracked: {len(result.mutation_positions)}")
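
For readers unfamiliar with the `kappa` and `base_frequencies` parameters, the sketch below builds the standard HKY85 instantaneous rate matrix in plain Python. It is an illustration under stated assumptions, not BioForge's code: the function name `hky85_rate_matrix` is hypothetical, the frequencies are assumed to be ordered A, C, G, T, and the matrix is left unnormalized.

```python
BASES = "ACGT"  # assumed ordering of base_frequencies
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(kappa, base_frequencies):
    """Return the unnormalized 4x4 HKY85 rate matrix as nested lists."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            rate = base_frequencies[j]      # rate is proportional to the target base frequency
            if (x, y) in TRANSITIONS:
                rate *= kappa               # transitions are scaled by kappa
            q[i][j] = rate
        q[i][i] = -sum(q[i])                # diagonal makes each row sum to zero
    return q


pi = [0.3, 0.2, 0.2, 0.3]
Q = hky85_rate_matrix(kappa=2.0, base_frequencies=pi)
```

With `kappa=1.0` and equal frequencies the matrix reduces to JC69 up to scaling, which is a quick sanity check on any implementation.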

Pre-built Benchmarks

from bioforge import get_benchmark, list_benchmarks

# See available benchmarks
print(list_benchmarks())

# Get a benchmark dataset
dataset = get_benchmark("promoter_classification", n_samples=10000, seed=42)

# Ready for ML
train, test = dataset.train_test_split(test_size=0.2)
X_train, y_train = train.sequences, train.labels

Motif Discovery Data

from bioforge import DNAGenerator
from bioforge.generators import MotifInserter, tata_box, e_box

# Generate background sequences
gen = DNAGenerator(gc_content=0.5, seed=42)
sequences = gen.generate(n=1000, length=200, encoding="index")

# Insert motifs with known positions
inserter = MotifInserter(seed=42)
inserter.add_motif(tata_box(), frequency=0.4)
inserter.add_motif(e_box(), frequency=0.3)

result = inserter.insert(sequences)
print(f"Sequences with motifs: {result.n_positive}")
# Access ground truth positions via result.positions
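
The idea of planting motifs while keeping ground truth can be sketched in a few lines of plain Python. Everything here is hypothetical: `insert_motif` is not a BioForge function, and the string `TATAAA` is only a stand-in that may differ from what `tata_box()` returns.

```python
import random


def insert_motif(sequences, motif, frequency, seed=None):
    """Overwrite a random window with the motif in roughly `frequency` of the sequences.

    Returns the modified sequences and, per sequence, the start position
    of the planted motif (None when nothing was planted).
    """
    rng = random.Random(seed)
    out, positions = [], []
    for seq in sequences:
        if rng.random() < frequency and len(seq) >= len(motif):
            start = rng.randrange(len(seq) - len(motif) + 1)
            seq = seq[:start] + motif + seq[start + len(motif):]
            positions.append(start)
        else:
            positions.append(None)
        out.append(seq)
    return out, positions


bg_rng = random.Random(0)
background = ["".join(bg_rng.choices("ACGT", k=200)) for _ in range(1000)]
with_motif, positions = insert_motif(background, "TATAAA", frequency=0.4, seed=42)
n_positive = sum(p is not None for p in positions)
```

A random background can contain the motif by chance, so in this sketch the recorded positions mark planted copies only.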

Recombination Simulation

from bioforge.models import RecombinationMap, Recombinator, AncestralRecombinationGraph

# Define position-specific recombination rates
recomb_map = RecombinationMap(
    positions=[0, 1000, 2000, 3000],
    rates=[1e-8, 5e-8, 1e-8]  # hotspot in the middle
)

# Simulate crossovers between haplotypes
# (haplotype1 and haplotype2 are placeholders for two sequences you have already generated)
recombinator = Recombinator(recombination_map=recomb_map, seed=42)
recombined = recombinator.recombine(haplotype1, haplotype2)

# Simulate ancestral recombination graph
arg = AncestralRecombinationGraph(n_samples=10, sequence_length=10000, seed=42)
arg_result = arg.simulate(recombination_rate=1e-8)
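
As a rough illustration of what a piecewise recombination map and a crossover mean, the plain-Python sketch below computes the expected number of crossovers across a map and builds a recombinant from explicit breakpoints. The helper names are hypothetical, and the rates are assumed to be per base pair per generation, which the snippet above does not state.

```python
def expected_crossovers(positions, rates):
    """Sum rate * interval length over a piecewise-constant map."""
    return sum(
        rate * (end - start)
        for rate, start, end in zip(rates, positions, positions[1:])
    )


def crossover(hap1, hap2, breakpoints):
    """Build one recombinant by switching template at each breakpoint."""
    templates = (hap1, hap2)
    pieces, current, previous = [], 0, 0
    for bp in sorted(breakpoints):
        pieces.append(templates[current][previous:bp])
        current = 1 - current  # switch to the other haplotype
        previous = bp
    pieces.append(templates[current][previous:])
    return "".join(pieces)


recombinant = crossover("AAAAAAAA", "TTTTTTTT", breakpoints=[3, 6])
map_length = expected_crossovers([0, 1000, 2000, 3000], [1e-8, 5e-8, 1e-8])
```

Under that assumption the map above yields about 7e-5 expected crossovers per generation over 3 kb, so most simulated meioses would produce no breakpoint in this region.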

Population Genetics

from bioforge.models import Coalescent, StructuredCoalescent, SelectionModel, DemographicModel

# Standard coalescent
coal = Coalescent(n_samples=50, seed=42)
tree = coal.simulate(effective_population_size=10000)

# Structured coalescent with migration
struct_coal = StructuredCoalescent(
    n_populations=3,
    n_samples=[20, 20, 10],
    migration_matrix=[[0, 1e-4, 1e-5], [1e-4, 0, 1e-4], [1e-5, 1e-4, 0]],
    seed=42
)

# Selection models
selection = SelectionModel(
    selection_coefficient=0.01,
    selection_type="positive",
    seed=42
)
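
The standard coalescent can be illustrated with a short plain-Python sketch: while `k` lineages remain, the waiting time to the next coalescence is exponential with rate `k(k-1)/2` in coalescent units. This is a textbook sketch, not BioForge's simulator; the function names are hypothetical, and how `effective_population_size` rescales these times inside BioForge is not covered here.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Time to the most recent common ancestor, in coalescent units."""
    t = 0.0
    for k in range(n_samples, 1, -1):
        pair_rate = k * (k - 1) / 2.0   # number of lineage pairs that can coalesce
        t += rng.expovariate(pair_rate)
    return t


def expected_tmrca(n_samples):
    """Analytical expectation 2 * (1 - 1/n), in coalescent units."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(42)
replicates = [simulate_tmrca(50, rng) for _ in range(2000)]
mean_tmrca = sum(replicates) / len(replicates)
```

Comparing `mean_tmrca` with `expected_tmrca(50)` is a cheap correctness check for any coalescent simulator.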

Noise and Error Models

from bioforge.models import SequencingErrorModel, NoiseLayer, CoverageBiasModel

# Platform-specific sequencing errors
# (`sequences` is an array produced by one of the generators above)
illumina_errors = SequencingErrorModel(platform="illumina", error_rate=0.01, seed=42)
noisy_reads = illumina_errors.apply(sequences)

# Combine multiple noise sources
noise = NoiseLayer(seed=42)
noise.add(SequencingErrorModel(platform="illumina", error_rate=0.01))
noise.add(CoverageBiasModel(gc_bias_strength=0.3))
noisy_data = noise.apply(sequences)
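
To show what a per-base `error_rate` means, here is a minimal plain-Python sketch of uniform substitution errors. It is a simplification and an assumption: real platform profiles (and presumably BioForge's models) also involve position-dependent rates and indels, and `add_substitution_errors` is a hypothetical helper.

```python
import random


def add_substitution_errors(seq, error_rate, rng):
    """Replace each base with a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(42)
read = "ACGT" * 25
clean = add_substitution_errors(read, 0.0, rng)      # no base changes
noisy = add_substitution_errors(read, 0.01, rng)     # about 1 error per 100 bases
scrambled = add_substitution_errors(read, 1.0, rng)  # every base changes
```

In this sketch the rate is a probability, so `0.01` means roughly one error per 100 bases and `1.0` would corrupt every base.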

Export for ML Pipelines

from bioforge.io import write_fasta, write_fastq, to_hdf5, to_torch_dataset

# Save as FASTA
write_fasta(sequences, "sequences.fa", sequence_type="dna")

# Save as FASTQ with quality scores
from bioforge.io import generate_quality_scores
quality = generate_quality_scores(n_sequences=1000, length=150)
write_fastq(sequences, "reads.fq", quality_scores=quality)

# Save as HDF5 (compressed, good for large datasets)
# (`labels` is your per-sequence label array)
to_hdf5(sequences, "dataset.h5", labels=labels)

# Create PyTorch Dataset
from torch.utils.data import DataLoader
dataset = to_torch_dataset(sequences, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
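
For reference, the sketch below shows how a FASTQ record pairs each base with a Phred quality character using the common Phred+33 encoding. The helper names are hypothetical and this is not BioForge's writer; only the file format itself is standard.

```python
def phred_to_ascii(scores, offset=33):
    """Encode integer Phred scores as a Phred+33 quality string."""
    return "".join(chr(q + offset) for q in scores)


def error_probability(q):
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def fastq_record(name, seq, scores):
    """Format one four-line FASTQ record."""
    if len(seq) != len(scores):
        raise ValueError("need exactly one quality score per base")
    return f"@{name}\n{seq}\n+\n{phred_to_ascii(scores)}\n"


record = fastq_record("read_1", "ACGT", [40, 40, 30, 2])
```

A score of 20 corresponds to a 1% error probability and 30 to 0.1%, which is why quality strings are useful ground truth alongside simulated sequencing errors.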

Data Augmentation

from bioforge import SequenceAugmenter

aug = SequenceAugmenter(seed=42)

# Reverse complement
rc_seqs = aug.reverse_complement(sequences)

# Random mutations (with transition bias)
mutated = aug.random_mutation(sequences, rate=0.05)

# BERT-style masking
masked, labels = aug.mask_random(sequences, mask_rate=0.15)

# Create augmented dataset
X_aug, y_aug = aug.create_augmented_dataset(
    sequences, labels,
    augmentation_factor=3
)
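
Two of these augmentations are simple enough to sketch in plain Python. The functions below are illustrative stand-ins, not the `SequenceAugmenter` methods: they work on strings, use `N` as an assumed mask token, and return `None` as the label for unmasked positions.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_random(seq, mask_rate, rng, mask_token="N"):
    """Mask each base with probability mask_rate.

    Returns the masked sequence and per-position targets: the original
    base where a mask was applied, None elsewhere.
    """
    masked, targets = [], []
    for base in seq:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append(base)
        else:
            masked.append(base)
            targets.append(None)
    return "".join(masked), targets


rng = random.Random(42)
original = "ATCGGCTAAT" * 10
rc = reverse_complement(original)
masked, targets = mask_random(original, mask_rate=0.15, rng=rng)
```

This masking is simplified: the original BERT recipe also leaves some selected tokens unchanged or swaps them for random ones, which this sketch omits.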

Dataset Registry and Reproducibility

from bioforge import DNAGenerator
from bioforge.registry import tracked, load, reproduce

# Generate with automatic tracking
gen = DNAGenerator(gc_content=0.5, seed=42)
with tracked(name="my_dataset", version="1.0") as session:
    sequences = gen.generate(n=1000, length=150)
    session.save(sequences)

# Later: retrieve by name
dataset = load("my_dataset/1.0")
sequences = dataset.data

# Or reproduce from scratch using the config
new_sequences = reproduce(dataset.config)
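
The reproducibility idea behind such a registry can be sketched independently of BioForge: if generation depends only on a stored configuration, including the seed, the data can be rebuilt from that configuration, and a stable hash can identify it. The config keys and function names below are assumptions for illustration and do not describe the registry's actual format.

```python
import hashlib
import json
import random


def config_fingerprint(config):
    """Hash a configuration dict independently of key order."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def generate_from_config(config):
    """Regenerate sequences using only the values stored in the config."""
    rng = random.Random(config["seed"])
    gc = config["gc_content"] / 2.0
    at = (1.0 - config["gc_content"]) / 2.0
    return [
        "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=config["length"]))
        for _ in range(config["n"])
    ]


config = {"generator": "dna", "gc_content": 0.5, "n": 5, "length": 20, "seed": 42}
reordered = dict(reversed(list(config.items())))
first = generate_from_config(config)
second = generate_from_config(reordered)
```

Storing the fingerprint next to the data makes it easy to detect when a dataset and the configuration claimed for it have drifted apart.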

GPU Acceleration

# Enable GPU (requires CuPy)
gen = DNAGenerator(gc_content=0.5, use_gpu=True)

# Generate 1M sequences on the GPU (see Benchmarks for measured timings)
sequences = gen.generate(n=1_000_000, length=150)

Benchmarks

| Operation | CPU (NumPy) | GPU (CuPy) | Speedup |
|---|---|---|---|
| 100K seqs x 150bp | 2.3s | 0.08s | 29x |
| 1M seqs x 150bp | 23s | 0.7s | 33x |
| 100K seqs one-hot | 4.1s | 0.12s | 34x |

Benchmarked on an AMD Ryzen 7 7700X (CPU) and an NVIDIA RTX 5070 Ti (GPU).

Available Models

Sequence Generators

  • DNAGenerator - DNA sequences with configurable GC content
  • RNAGenerator - RNA sequences with base frequency control
  • ProteinGenerator - Protein sequences with amino acid frequency profiles
  • CodonGenerator - Codon-aware coding sequences with usage bias

Evolutionary Models

  • Substitution: JC69, K80, HKY85, GTR
  • Rate variation: GammaRates, InvariantSites, GammaInvariant, FreeRates, CodonRates
  • Indels: IndelModel, TKF91, TKF92
  • Tree generation: Yule, Birth-Death, Coalescent, Balanced, Caterpillar
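
As a reference point for the simplest substitution model listed above, JC69 has closed-form transition probabilities. The sketch below assumes the branch length is measured in expected substitutions per site; the function name is hypothetical and it is not taken from BioForge.

```python
import math


def jc69_probabilities(branch_length):
    """Return (P[same base], P[one specific different base]) under JC69."""
    decay = math.exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


p_same, p_diff = jc69_probabilities(0.1)
```

At a branch length of zero nothing changes, and for long branches all four bases approach probability 0.25, which is a useful check when validating simulated alignments.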

Genome Evolution

  • BirthDeathModel - Gene duplication and loss
  • WholeGenomeDuplication - WGD and polyploidy events
  • SegmentalDuplication - Local duplications
  • GeneConversion - Concerted evolution between paralogs

Recombination

  • RecombinationMap - Position-specific recombination rates
  • Recombinator - Crossover and gene conversion simulation
  • AncestralRecombinationGraph - ARG simulation
  • IntrageneRecombination - Within-gene recombination

Population Genetics

  • Coalescent - Standard coalescent simulation
  • StructuredCoalescent - Multiple populations with migration
  • SelectionModel - Purifying, positive, balancing selection
  • DemographicModel - Bottlenecks, expansions, demographic scenarios

Structural Variants

  • StructuralVariantModel - Inversions, translocations, CNVs, deletions, duplications

Noise Models

  • SequencingErrorModel - Platform-specific errors (Illumina, PacBio, Nanopore)
  • MissingDataModel - Random and block missingness patterns
  • CoverageBiasModel - GC bias, low-coverage noise
  • ContaminationModel - Foreign sequence injection
  • AlignmentArtifactModel - Gap placement errors
  • NoiseLayer - Combines multiple noise sources

Motifs

  • MotifInserter - Insert motifs with ground truth tracking
  • Built-in motifs: TATA box, Kozak sequence, poly-A signal, Sp1, NF-kB, CREB, E-box, CpG island

Examples

The examples/ directory contains working scripts demonstrating common workflows:

  • quickstart.py - Basic sequence generation
  • advanced_pipeline.py - Complex simulation pipelines with multiple models
  • ml_training.py - Preparing data for ML training
  • motif_discovery.py - Generating motif discovery benchmarks

Roadmap

  • DNA/RNA/Protein sequence generation
  • Codon-aware sequence generation with usage bias
  • Evolutionary models (JC69, K80, HKY85, GTR)
  • Rate heterogeneity (Gamma, +I, codon models)
  • Indel simulation (IndelModel, TKF91/92)
  • Tree generation (Yule, Birth-Death, Coalescent)
  • Genome evolution (duplication, WGD, gene conversion)
  • Recombination (crossover, ARG simulation)
  • Structural variants (inversions, translocations, CNVs)
  • Population genetics (coalescent, selection, demographics)
  • Noise models (sequencing errors, contamination)
  • Motif insertion with ground truth positions
  • FASTA/FASTQ I/O with quality scores
  • HDF5/NumPy/PyTorch export
  • Simulation pipeline for chaining models
  • Pre-built ML benchmarks
  • Data augmentation utilities
  • Dataset registry for reproducibility
  • Custom CUDA kernels for maximum performance
  • Integration with msprime/tskit

Citation

If you use BioForge in your research, please cite:

@software{bioforge,
  author = {Kessler, Mark},
  title = {BioForge: GPU-accelerated synthetic genomic data generation},
  url = {https://github.com/markkessler66/bioforge},
  year = {2025}
}

Contributing

Contributions welcome! Please check out our contributing guidelines.

License

MIT License - see LICENSE for details.
