Extract haplotype-specific sequences centered on BED intervals from phased VCFs.
pixi install
# Run test data
pixi run test
# Run on real GM12878 data
pixi run real
# Validate outputs
pixi run validate

# 1. Generate haplotype FASTAs and chain files
src/generate_haplotypes.sh <ref.fa> <phased.vcf.gz> <sample_name> <output_prefix>
# 2. Extract centered sequences
python src/extract_centered_haplotypes.py \
--bed sites.bed \
--hap1-fa haplotypes.hap1.fa \
--hap2-fa haplotypes.hap2.fa \
--hap1-chain haplotypes.hap1.chain \
--hap2-chain haplotypes.hap2.chain \
--size 1000000 \
--no-pad \
--out-dir output/

hap1.fa, hap2.fa - Multi-record FASTAs with extracted sequences
hap1_intervals.bed, hap2_intervals.bed - Interval positions in each haplotype
intervals_combined.bed.gz - Combined file with coordinates and full sequences
chrom | start | end | name | hap1_chrom | hap1_start | hap1_end | hap2_chrom | hap2_start | hap2_end | hap1_seq | hap2_seq
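As an illustration of how the combined file might be consumed downstream, here is a minimal Python sketch that maps each row onto the twelve columns listed above. It assumes the file is tab-separated (the usual BED convention, not confirmed here) and that any header line starts with `#` or with the literal `chrom`. The function names `parse_combined` and `read_combined` are hypothetical and not part of this repository. Plain `str.split` is used instead of the `csv` module because sequence fields of the size used above exceed the `csv` default field size limit.

```python
import gzip

# Column order as documented for intervals_combined.bed.gz.
COLUMNS = [
    "chrom", "start", "end", "name",
    "hap1_chrom", "hap1_start", "hap1_end",
    "hap2_chrom", "hap2_start", "hap2_end",
    "hap1_seq", "hap2_seq",
]
INT_COLUMNS = ("start", "end", "hap1_start", "hap1_end", "hap2_start", "hap2_end")


def parse_combined(handle):
    """Yield one dict per data row from an open text handle.

    Assumption: fields are tab-separated; lines starting with '#' or whose
    first field is 'chrom' are treated as headers and skipped.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] == "chrom":
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        record = dict(zip(COLUMNS, fields))
        for key in INT_COLUMNS:
            record[key] = int(record[key])
        yield record


def read_combined(path):
    """Open a gzip-compressed combined file and yield parsed records."""
    with gzip.open(path, "rt") as handle:
        yield from parse_combined(handle)
```

For example, `for rec in read_combined("output/intervals_combined.bed.gz"): print(rec["name"], len(rec["hap1_seq"]), len(rec["hap2_seq"]))` would list each interval with the length of its two haplotype sequences, provided the assumptions above hold for your output.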
Case is preserved from the input reference. To validate interval tracking, use a soft-masked reference where intervals are lowercase:
bedtools maskfasta -soft -fi ref.fa -bed intervals.bed -fo ref.softmasked.fa

Then verify that the extracted intervals contain lowercase bases (uppercase = VCF insertions).
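The case check can be automated. The sketch below is a hypothetical helper, not part of this repository: it reads a multi-record FASTA with plain Python, counts lowercase and uppercase bases per record, and reports records that contain no lowercase bases at all. It assumes the haplotype FASTAs were generated from the soft-masked reference. Because a window may extend beyond the masked interval, the check only flags records with zero lowercase bases rather than requiring every base to be lowercase.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def case_summary(seq):
    """Return (lowercase_count, uppercase_count) for a sequence."""
    lower = sum(1 for base in seq if base.islower())
    upper = sum(1 for base in seq if base.isupper())
    return lower, upper


def records_without_lowercase(records):
    """Return headers of records that contain no lowercase bases."""
    return [name for name, seq in records if case_summary(seq)[0] == 0]
```

A possible usage, assuming the output directory from the command above: open `output/hap1.fa`, pass the handle to `read_fasta`, and inspect the list returned by `records_without_lowercase`. An empty list suggests every extracted record overlaps its masked interval.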