This project contains a complete genetics data analysis pipeline (based on Nextflow) and accompanying Python/R/Java script libraries. The project integrates whole genome sequencing data processing, variant calling, population genetics analysis, and GWAS features.
.
├── environment_*.yml # Conda environment configuration files
├── setup.py # Python package installation script
├── README.md # Project documentation
├── src/ # Source code directory
│ ├── python/ # Python core library (python_script)
│ │ ├── genetics/ # Genetics analysis modules (genomics, gwas, phenotype, etc.)
│ │ ├── infra/ # Infrastructure and tools (server, stats, utils)
│ │ └── WeaTE/ # Transposon analysis module
│ ├── r/ # R statistics and plotting scripts
│ └── java/ # Java tools (e.g., BamHeader, TaxaBamMap)
├── workflow/ # Nextflow workflows
│ ├── Genetics/ # Core genetics analysis pipelines
│ │ ├── genotype/ # Genotype processing (alignment, calling, statistics)
│ │ ├── static/ # Static analysis (GWAS, phenotype)
│ │ ├── dynamic/ # Dynamic/Population analysis (Kinship, XP-CLR)
│ │ └── main.nf # Main entry script
│ └── ALiYun/ # Alibaba Cloud WDL workflow backup
└── note/ # Analysis notes (Jupyter Notebooks)
This project relies on Conda for environment management and provides multiple dedicated environments for different tasks.
Create the appropriate environment based on your task requirements:
- run: Workflow execution environment (Nextflow, Screen)
    conda env create -f environment_run.yml
- stats: Main statistical analysis environment (Python 3.12, Hail, Plink, Samtools, Bcftools)
  - Used for script development and most Python analysis tasks
    conda env create -f environment_stats.yml
- stats_r: R language statistics and plotting environment (R 4.3.1, Tidyverse, BioConductor)
    conda env create -f environment_stats_r.yml
- tiger: Compute-intensive task environment (BWA-MEM2, Samtools)
  - Used for alignment and variant calling
    conda env create -f environment_tiger.yml
- dbone: Database and basic operations environment
    conda env create -f environment_dbone.yml
To enable calls to code under src/python within the environment, you need to install this project in editable mode in the stats environment (or other environments requiring Python scripts):
    conda activate stats
    pip install -e .
This will install the python_script package and its dependencies (pandas, numpy, scipy, seaborn, hail, etc.).
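A quick way to confirm that the editable install worked is to ask Python whether the package can be located. The sketch below is a minimal check using only the standard library; the helper name `is_installed` is hypothetical and is not part of this project.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    name = "python_script"  # package name as stated in this README
    if is_installed(name):
        print(f"{name}: importable")
    else:
        print(f"{name}: not found; run `pip install -e .` in this environment")
```

Run it inside each Conda environment that needs the Python scripts, since an editable install only applies to the environment in which `pip` was invoked.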
The main workflow scripts are located in the workflow/Genetics directory.
Path: workflow/Genetics/genotype/
- align.nf: Sequence alignment workflow. Includes FASTQ QC, BWA-MEM2 alignment, Samtools sort/dedup/index, and Mosdepth depth calculation.
  - New Feature: Includes smart USB file distribution (cp_based_on_usb_size), which can copy files in parallel using multiple threads based on available disk space.
- stats.nf: Genotype statistics workflow. Integrates Python scripts to compute and plot metrics such as missing rates, heterozygosity, IBS, King kinship, etc.
- caller.nf: Variant calling workflow.
- hail.nf: Distributed data processing related to Hail.
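The idea of distributing files across drives by available space and copying them with several threads can be sketched as follows. This is an illustrative sketch only, not the project's cp_based_on_usb_size implementation: the function names `plan_copies` and `copy_in_parallel` and the greedy "most free space first" rule are assumptions made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def plan_copies(files, free_bytes):
    """Assign each file to the destination with the most remaining space.

    files: list of (path, size_in_bytes).
    free_bytes: dict mapping destination -> free bytes.
    Returns a list of (path, destination); raises OSError if a file fits nowhere.
    """
    remaining = dict(free_bytes)
    plan = []
    # Place large files first so they are not stranded by small ones.
    for path, size in sorted(files, key=lambda item: item[1], reverse=True):
        dest = max(remaining, key=remaining.get)
        if remaining[dest] < size:
            raise OSError(f"no destination has {size} bytes free for {path}")
        remaining[dest] -= size
        plan.append((path, dest))
    return plan


def copy_in_parallel(sources, destinations, max_workers=4):
    """Copy sources into destination directories chosen by free space, using threads."""
    files = [(Path(p), Path(p).stat().st_size) for p in sources]
    free = {Path(d): shutil.disk_usage(d).free for d in destinations}
    plan = plan_copies(files, free)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # shutil.copy2 preserves timestamps; list() re-raises any worker error.
        return list(pool.map(lambda job: shutil.copy2(job[0], job[1]), plan))
```

Threads suit this task because copying is I/O-bound. Note that the sketch treats each destination as a separate device; two directories on the same drive would have their free space counted twice.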
Path: workflow/Genetics/static/
- gwas/: Genome-wide association study workflow.
- phenotype/: Phenotype data processing and statistics.
Path: workflow/Genetics/dynamic/
- kinship.nf: Kinship analysis.
- xp_clr.nf: Selective sweep analysis.
Core logic is encapsulated in the python_script package:
- genetics.genomics: Genomics analysis (Sample QC, Variant QC, IBS, Kinship, etc.).
- infra.server.cp: Contains smart file copying logic (run_copy_process) for use by Nextflow pipelines.
- infra.utils.graph: General plotting utility library.
- WeaTE: Wheat transposon analysis specialized module.
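To make the sample QC metrics named above concrete, the following sketch computes per-sample missing rate, heterozygosity, and pairwise IBS similarity in plain Python. It does not reproduce the genetics.genomics API: the function names and the genotype encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call) are assumptions chosen for illustration.

```python
def sample_metrics(genotypes):
    """Return (missing_rate, heterozygosity) for each sample.

    genotypes: one list per sample holding alternate-allele counts
    0, 1, or 2, or None where the call is missing.
    """
    results = []
    for calls in genotypes:
        called = [g for g in calls if g is not None]
        missing_rate = 1 - len(called) / len(calls) if calls else float("nan")
        # Heterozygosity is measured over called sites only.
        het_rate = sum(g == 1 for g in called) / len(called) if called else float("nan")
        results.append((missing_rate, het_rate))
    return results


def ibs_similarity(a, b):
    """Mean identity-by-state similarity (0 to 1) over sites called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("nan")
    # Two diploid genotypes share 2 - |x - y| alleles IBS at a biallelic site.
    return sum(2 - abs(x - y) for x, y in shared) / (2 * len(shared))
```

For real cohorts these calculations are normally vectorized or delegated to tools such as Hail or Plink, which the stats environment provides; the loop form here is only meant to show the definitions.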
src/r contains R scripts for GWAS result visualization (Manhattan/QQ plots), PCA visualization, etc.
src/java contains utility classes for handling BAM headers (BamHeader.java) and Taxa mapping (TaxaBamMap.java).
    conda activate run
    nextflow run workflow/Genetics/genotype/align.nf \
        --fastq_dir ./raw_data \
        --reference ref.fa \
        -profile tiger

    conda activate run
    nextflow run workflow/Genetics/genotype/stats.nf \
        --output_dir ./results \
        -profile stats

(Note: Please adjust parameters according to the specific nextflow.config file)
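When these runs are launched from a script rather than typed by hand, it can help to build the command as an argument list. The sketch below assumes a hypothetical helper, `nextflow_command`, that is not part of this project; it relies only on the `nextflow run` invocation and the parameters shown in the examples above.

```python
import shlex
import subprocess


def nextflow_command(script, profile, **params):
    """Build the argument list for `nextflow run`.

    Pipeline parameters take a double dash (--name value); Nextflow's own
    options such as -profile take a single dash.
    """
    cmd = ["nextflow", "run", script]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["-profile", profile]
    return cmd


if __name__ == "__main__":
    cmd = nextflow_command(
        "workflow/Genetics/genotype/align.nf",
        profile="tiger",
        fastq_dir="./raw_data",
        reference="ref.fa",
    )
    print(shlex.join(cmd))  # Inspect the command before launching it.
    # subprocess.run(cmd, check=True)  # Uncomment to launch; requires Nextflow on PATH.
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces. The parameter names a given workflow accepts should still be checked against its nextflow.config.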