Extract haplotype-specific sequences centered on BED intervals from phased VCFs

vollgerlab/hap-windows

hap-windows

Extract haplotype-specific sequences centered on BED intervals from phased VCFs.

Installation

pixi install

Usage

# Run test data
pixi run test

# Run on real GM12878 data
pixi run real

# Validate outputs
pixi run validate

Manual Usage

# 1. Generate haplotype FASTAs and chain files
src/generate_haplotypes.sh <ref.fa> <phased.vcf.gz> <sample_name> <output_prefix>

# 2. Extract centered sequences
python src/extract_centered_haplotypes.py \
    --bed sites.bed \
    --hap1-fa haplotypes.hap1.fa \
    --hap2-fa haplotypes.hap2.fa \
    --hap1-chain haplotypes.hap1.chain \
    --hap2-chain haplotypes.hap2.chain \
    --size 1000000 \
    --no-pad \
    --out-dir output/
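
The exact centering rule used by `extract_centered_haplotypes.py` is not spelled out here. As a rough illustration only, the sketch below shows one way a window of `--size` bases could be centered on a BED interval's midpoint and clipped to the sequence bounds. The function name `centered_window` and the clipping behavior are assumptions for illustration, not the script's confirmed logic, and the sketch says nothing about what `--no-pad` does.

```python
def centered_window(start, end, size, chrom_length):
    """Return a half-open (win_start, win_end) window of up to `size` bases
    centered on the midpoint of the BED interval [start, end).

    Hypothetical helper: clipping to [0, chrom_length] is an assumption,
    not necessarily how the real script treats chromosome edges.
    """
    mid = (start + end) // 2
    win_start = mid - size // 2
    win_end = win_start + size
    # Clip the window so it never runs off either end of the sequence.
    return max(0, win_start), min(chrom_length, win_end)
```

For example, `centered_window(1000, 2000, 1000, 10000)` gives `(1000, 2000)`, while an interval near the start of a chromosome yields a shorter, clipped window.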

Output

  • hap1.fa, hap2.fa - Multi-record FASTAs with extracted sequences
  • hap1_intervals.bed, hap2_intervals.bed - Interval positions in each haplotype
  • intervals_combined.bed.gz - Combined file with coordinates and full sequences
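
To get a quick feel for how the two haplotypes differ, the FASTA outputs can be loaded and compared by record. The sketch below is a minimal reader with no dependencies. It assumes that matching records carry the same name in `hap1.fa` and `hap2.fa`, which is not confirmed here; `read_fasta` and `length_differences` are hypothetical helper names.

```python
def read_fasta(path):
    """Read a FASTA file into a dict of {record_name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    Sequence case is kept exactly as written in the file.
    """
    records = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name = (line[1:].split() or [""])[0]
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def length_differences(hap1, hap2):
    """For records present in both dicts, return len(hap2) - len(hap1).

    Assumes (unverified) that the same record name refers to the same
    site in both haplotype files.
    """
    shared = hap1.keys() & hap2.keys()
    return {name: len(hap2[name]) - len(hap1[name]) for name in sorted(shared)}
```

A non-zero difference for a record suggests that the two haplotypes carry different indels within that window.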

Combined BED columns

chrom | start | end | name | hap1_chrom | hap1_start | hap1_end | hap2_chrom | hap2_start | hap2_end | hap1_seq | hap2_seq
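
The column list above maps naturally onto a small reader. The sketch below parses `intervals_combined.bed.gz` into dictionaries, assuming the file is tab-delimited, has no header row other than optional `#` comment lines, and stores every coordinate as an integer. None of these assumptions is confirmed here, and `read_combined_bed` is a hypothetical helper name.

```python
import gzip

COLUMNS = [
    "chrom", "start", "end", "name",
    "hap1_chrom", "hap1_start", "hap1_end",
    "hap2_chrom", "hap2_start", "hap2_end",
    "hap1_seq", "hap2_seq",
]
# Columns holding coordinates, converted to int on read.
INT_COLUMNS = [c for c in COLUMNS if c.endswith("start") or c.endswith("end")]


def read_combined_bed(path):
    """Yield one dict per row of the combined BED file.

    Assumes tab-delimited text with the 12 columns listed in COLUMNS
    and integer coordinates (assumption, not verified against the tool).
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(
                    f"expected {len(COLUMNS)} columns, got {len(fields)}"
                )
            row = dict(zip(COLUMNS, fields))
            for col in INT_COLUMNS:
                row[col] = int(row[col])
            yield row
```

Because each row carries two full sequences, iterating with the generator rather than loading the whole file at once keeps memory use modest for large `--size` values.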

Validation

Case is preserved from the input reference. To validate interval tracking, use a soft-masked reference where intervals are lowercase:

bedtools maskfasta -soft -fi ref.fa -bed intervals.bed -fo ref.softmasked.fa

Then verify that the extracted intervals contain lowercase bases. Any uppercase bases inside an interval come from VCF insertions.
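
The case check described above can be scripted. The sketch below counts lowercase and uppercase bases inside a span of an extracted sequence and lists the lowercase runs. The offsets passed to `check_interval` are assumed to be 0-based, half-open positions relative to the extracted sequence itself. How those offsets relate to the coordinates in the output BED files is not stated here, so that conversion is left to the caller. Both function names are hypothetical.

```python
import re


def lowercase_runs(seq):
    """Return half-open (start, end) spans of consecutive lowercase bases."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]


def check_interval(seq, start, end):
    """Count lowercase and uppercase bases in seq[start:end].

    With a soft-masked reference, lowercase bases are expected inside
    the interval, and uppercase bases there point to VCF insertions.
    """
    window = seq[start:end]
    n_lower = sum(1 for base in window if base.islower())
    n_upper = sum(1 for base in window if base.isupper())
    return {"lower": n_lower, "upper": n_upper, "has_lowercase": n_lower > 0}
```

An interval that reports `has_lowercase` as false would suggest the interval position was not tracked correctly through the haplotype coordinates.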
