Genetics Analysis Pipeline & Script Library

This project contains a Nextflow-based genetics data analysis pipeline and accompanying Python, R, and Java script libraries. It covers whole-genome sequencing data processing, variant calling, population genetics analysis, and GWAS.

📂 Project Structure

.
├── environment_*.yml       # Conda environment configuration files
├── setup.py               # Python package installation script
├── README.md              # Project documentation
├── src/                   # Source code directory
│   ├── python/            # Python core library (python_script)
│   │   ├── genetics/      # Genetics analysis modules (genomics, gwas, phenotype, etc.)
│   │   ├── infra/         # Infrastructure and tools (server, stats, utils)
│   │   └── WeaTE/         # Transposon analysis module
│   ├── r/                 # R statistics and plotting scripts
│   └── java/              # Java tools (e.g., BamHeader, TaxaBamMap)
├── workflow/              # Nextflow workflows
│   ├── Genetics/          # Core genetics analysis pipelines
│   │   ├── genotype/      # Genotype processing (alignment, calling, statistics)
│   │   ├── static/        # Static analysis (GWAS, phenotype)
│   │   ├── dynamic/       # Dynamic/Population analysis (Kinship, XP-CLR)
│   │   └── main.nf        # Main entry script
│   └── ALiYun/            # Alibaba Cloud WDL workflow backup
└── note/                  # Analysis notes (Jupyter Notebooks)

🛠️ Environment Setup

This project relies on Conda for environment management and provides multiple dedicated environments for different tasks.

1. Create Conda Environments

Create the appropriate environment based on your task requirements:

run: Workflow execution environment (Nextflow, Screen)
```
conda env create -f environment_run.yml
```
stats: Main statistical analysis environment (Python 3.12, Hail, PLINK, Samtools, Bcftools)
- Used for script development and most Python analysis tasks
```
conda env create -f environment_stats.yml
```
stats_r: R statistics and plotting environment (R 4.3.1, Tidyverse, Bioconductor)
```
conda env create -f environment_stats_r.yml
```
tiger: Compute-intensive task environment (BWA-MEM2, Samtools)
- Used for alignment and variant calling
```
conda env create -f environment_tiger.yml
```
dbone: Database and basic operations environment
```
conda env create -f environment_dbone.yml
```

2. Install Python Package

To make the code under src/python importable, install this project in editable mode in the stats environment (or in any other environment that runs the Python scripts):

conda activate stats
pip install -e .

This will install the python_script package and its dependencies (pandas, numpy, scipy, seaborn, hail, etc.).
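To confirm that the editable install worked, a quick import check can help. The following is a minimal sketch; the helper name `is_installed` is hypothetical and not part of this project, and it only assumes that the package is exposed under the top-level name `python_script` as stated above.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be located."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec(package_name) is not None


# "python_script" is the package name given in this README.
if is_installed("python_script"):
    print("python_script is importable in this environment")
else:
    print("python_script not found; run `pip install -e .` in this environment")
```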

🚀 Workflows

The main workflow scripts are located in the workflow/Genetics directory.

Genotype

Path: workflow/Genetics/genotype/

align.nf: Sequence alignment workflow. Includes FASTQ QC, BWA-MEM2 alignment, Samtools sort/dedup/index, and Mosdepth depth calculation.
- New Feature: Includes smart USB file distribution (cp_based_on_usb_size), which can copy files in parallel using multiple threads based on available disk space.
stats.nf: Genotype statistics workflow. Integrates Python scripts to compute and plot metrics such as missing rate, heterozygosity, IBS, and KING kinship.
caller.nf: Variant calling workflow.
hail.nf: Distributed data processing with Hail.
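To illustrate the kind of per-sample metrics that stats.nf reports, here is a minimal pure-Python sketch of missing rate, heterozygosity, and pairwise IBS similarity. It is not the project's implementation: the function names and the genotype encoding (0, 1, 2 alternate-allele counts, `None` for a missing call) are assumptions made for this example.

```python
from typing import Optional, Sequence

# Assumed encoding: 0, 1, 2 = alternate-allele count; None = missing call.
Genotype = Optional[int]


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of sites with no genotype call."""
    if not calls:
        raise ValueError("no sites given")
    return sum(g is None for g in calls) / len(calls)


def heterozygosity(calls: Sequence[Genotype]) -> float:
    """Fraction of called sites that are heterozygous (coded 1)."""
    called = [g for g in calls if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)


def ibs_similarity(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Mean proportion of alleles shared identical-by-state over jointly called sites."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("nan")
    # Two diploid genotypes share 2 - |x - y| alleles out of 2 at each site.
    return sum(1 - abs(x - y) / 2 for x, y in pairs) / len(pairs)


sample_a = [0, 1, None, 2, 1]
sample_b = [0, 2, 1, 0, None]
print(missing_rate(sample_a), heterozygosity(sample_a), ibs_similarity(sample_a, sample_b))
```

The pipeline itself relies on dedicated tools (the stats environment lists Hail and PLINK), which handle these calculations at genome scale; the sketch only shows what the numbers mean.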

Static (Association Analysis)

Path: workflow/Genetics/static/

gwas/: Genome-wide association study workflow.
phenotype/: Phenotype data processing and statistics.

Dynamic (Population/Evolution Analysis)

Path: workflow/Genetics/dynamic/

kinship.nf: Kinship analysis.
xp_clr.nf: Selective sweep analysis.

📦 Source Code Library (Src)

Python (`src/python`)

Core logic is encapsulated in the python_script package:

genetics.genomics: Genomics analysis (Sample QC, Variant QC, IBS, Kinship, etc.).
infra.server.cp: Contains smart file copying logic (run_copy_process) for use by Nextflow pipelines.
infra.utils.graph: General plotting utility library.
WeaTE: Specialized module for wheat transposon analysis.
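The space-aware copying described for infra.server.cp can be pictured as two steps: plan which destination receives each file based on free space, then copy in parallel threads. The sketch below is an illustration under assumptions, not the actual run_copy_process logic; `plan_copies` and `copy_all` are hypothetical names, and the greedy "largest file to emptiest destination" rule is one possible strategy.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def plan_copies(file_sizes: dict[str, int], free_space: dict[str, int]) -> dict[str, str]:
    """Greedily assign each file (largest first) to the destination with the most room left."""
    if not free_space:
        raise ValueError("no destinations given")
    remaining = dict(free_space)
    plan: dict[str, str] = {}
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        dest = max(remaining, key=remaining.get)
        if remaining[dest] < size:
            raise OSError(f"no destination has room for {name}")
        remaining[dest] -= size
        plan[name] = dest
    return plan


def copy_all(plan: dict[str, str], workers: int = 4) -> None:
    """Copy every source file to its planned destination directory using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(shutil.copy2, src, dest) for src, dest in plan.items()]
        for future in futures:
            future.result()  # re-raise any copy error


# Sizes and free space are made-up numbers; in practice they could come from
# os.path.getsize(path) and shutil.disk_usage(mount_point).free.
print(plan_copies({"a.bam": 5, "b.bam": 3, "c.bam": 2}, {"/mnt/usb1": 6, "/mnt/usb2": 6}))
```

Keeping the planning step free of file-system calls makes it easy to test with plain numbers before any data is moved.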

R (`src/r`)

Contains R scripts for GWAS result visualization (Manhattan/QQ plots), PCA visualization, etc.

Java (`src/java`)

Contains utility classes for handling BAM headers (BamHeader.java) and Taxa mapping (TaxaBamMap.java).

📝 Usage Examples

Run Alignment Workflow

conda activate run
nextflow run workflow/Genetics/genotype/align.nf \
    --fastq_dir ./raw_data \
    --reference ref.fa \
    -profile tiger

Run Statistics Workflow

conda activate run
nextflow run workflow/Genetics/genotype/stats.nf \
    --output_dir ./results \
    -profile stats

(Note: Adjust the parameters to match your nextflow.config file.)
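When the same workflows are launched repeatedly with different parameters, it can be convenient to assemble the command line from Python. The following is a minimal sketch; `build_nextflow_cmd` is a hypothetical helper, not part of this project, and the parameter names are the ones used in the examples above.

```python
import shlex


def build_nextflow_cmd(script: str, profile: str, **params: str) -> list[str]:
    """Assemble `nextflow run` arguments: pipeline parameters use `--name`, the profile uses `-profile`."""
    cmd = ["nextflow", "run", script]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["-profile", profile]
    return cmd


cmd = build_nextflow_cmd(
    "workflow/Genetics/genotype/align.nf",
    profile="tiger",
    fastq_dir="./raw_data",
    reference="ref.fa",
)
# Print the shell-quoted command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```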
