Research from:
State Key Laboratory of Tree Genetics and Breeding, Chinese Academy of Forestry, Beijing 100091, China
Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou 311400, China
State Key Laboratory of Tree Genetics and Breeding, Nanjing Forestry University, Nanjing 210037, China
PKDP (Prior Knowledge Dual-Path CNN) is a dual-path convolutional neural network framework designed to enhance genomic selection (GS) by integrating genome-wide association study (GWAS) results with genome-wide minor-effect markers.
# Create a new conda environment
conda create -n PKDP_env python=3.8
# Activate the environment (works on all platforms)
conda activate PKDP_env
# Install PKDP
git clone https://github.com/aiPGAB/PKDP-GS.git
cd ./PKDP-GS
chmod +x ./PKDP.py
# Install dependencies
pip install -r requirements.txt

- Python >= 3.8
- PyTorch >= 1.10.0
- NumPy >= 1.20.0
- Pandas >= 1.3.0
- Matplotlib >= 3.4.0
- Seaborn >= 0.11.0
- Scikit-learn >= 1.0.0
- SciPy >= 1.7.0
- Optuna >= 2.10.0
python ./PKDP.py train -h

| Parameter | Description |
|---|---|
| `--train_phe` | Path to the training phenotype file |
| `--geno` | Path to the genotype file |
| `--output_path` | Directory to save outputs |

| Parameter | Description | Default Value |
|---|---|---|
| `--pnum` | Phenotype column index or name | First column |
| `--prefix` | Prefix for output files | Timestamp |
| `--batch_size` | Batch size for training | 32 |
| `--epochs` | Number of training epochs | 100 |
| `--optuna_trials` | Number of Optuna trials for hyperparameter tuning | 50 |
| `--device` | Device to use (cuda or cpu) | cuda:0 |
| `--optimizer` | Optimizer type (Adam, SGD, AdamW) | Adam |
| `--early_stop` | Enable early stopping | False |
| `--prior_features` | Prior knowledge features (space-separated IDs) | None |
| `--prior_features_file` | Path to a file with one prior feature ID per line | None |
| `--adjust_encoding` | Adjust genotype encoding from {0,1,2} to {-1,0,1} | False |
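
The file passed to `--prior_features_file` holds one feature ID per line. As a minimal sketch of how such a file might be produced from GWAS output, the example below filters a hypothetical summary table by p-value. The column names `snp_id` and `p_value`, the threshold of 1e-5, the output location, and the helper names are all illustrative assumptions; none of them is part of PKDP.

```python
import csv
import io
import os
import tempfile

# Hypothetical GWAS summary table. The column names "snp_id" and "p_value"
# are assumptions for illustration, not a format required by PKDP.
GWAS_CSV = """snp_id,p_value
snp_001,0.20
snp_002,0.000001
snp_003,0.04
snp_004,0.0000003
"""


def select_prior_features(gwas_text, threshold):
    """Return SNP IDs with p-value below the threshold, most significant first."""
    rows = csv.DictReader(io.StringIO(gwas_text))
    hits = [(float(r["p_value"]), r["snp_id"]) for r in rows]
    return [snp for p, snp in sorted(hits) if p < threshold]


def write_prior_features(snp_ids, path):
    """Write one feature ID per line."""
    with open(path, "w") as handle:
        for snp in snp_ids:
            handle.write(snp + "\n")


selected = select_prior_features(GWAS_CSV, threshold=1e-5)
out_path = os.path.join(tempfile.gettempdir(), "prior_features.txt")
write_prior_features(selected, out_path)
print(selected)
```

The significance threshold is a modelling choice; a stricter or looser cutoff changes how many markers are routed to the prior-knowledge path.
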
python ./PKDP.py train \
--train_phe demo/train_phe.csv \
--geno demo/train_geno.csv \
--output_path results/ \
--prior_features_file ./demo/prior_features.txt

In train mode, hyperparameters are tuned automatically by cross-validation on the training set. Once the best hyperparameters are found, the model is retrained on the entire training set. This mode outputs the trained model file, which can then be used for prediction in predict mode. For genotypes coded as 0/1/2, setting `--adjust_encoding` is recommended.
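
To make the effect of `--adjust_encoding` concrete, the sketch below shifts allele counts from {0,1,2} to {-1,0,1}. It assumes the adjustment is a plain subtraction of 1 (so 0 becomes -1, 1 becomes 0, and 2 becomes 1); the function name `adjust_encoding` is hypothetical, and how PKDP itself treats missing genotypes is not covered here.

```python
def adjust_encoding(genotypes):
    """Shift allele counts {0, 1, 2} to the centered coding {-1, 0, 1}.

    Assumption: the adjustment is a subtraction of 1 per genotype value.
    """
    allowed = {0, 1, 2}
    adjusted = []
    for row in genotypes:
        if any(value not in allowed for value in row):
            raise ValueError("expected genotype values 0, 1 or 2")
        adjusted.append([value - 1 for value in row])
    return adjusted


samples = [[0, 1, 2], [2, 2, 0]]
print(adjust_encoding(samples))  # [[-1, 0, 1], [1, 1, -1]]
```
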
python ./PKDP.py predict -h

| Parameter | Description |
|---|---|
| `--geno` | Path to the genotype file |
| `--model_path` | Path to the trained model file |
| `--output_path` | Directory to save outputs |

| Parameter | Description | Default Value |
|---|---|---|
| `--test_phe` | Path to the testing phenotype file (optional) | None |
| `--pnum` | Phenotype column name | First column |
| `--prefix` | Prefix for output files | Timestamp |
| `--device` | Device to use (cuda or cpu) | cuda:0 |
| `--adjust_encoding` | Adjust genotype encoding from {0,1,2} to {-1,0,1} | False |
| `--prior_features` | Prior knowledge features (space-separated IDs) | None |
| `--prior_features_file` | Path to a file with one prior feature ID per line | None |
python ./PKDP.py predict \
--geno demo/test_geno.csv \
--test_phe demo/test_phe.csv \
--prior_features_file ./demo/prior_features.txt \
--model_path results/best_model.pth --output_path predictions/

- When no prior features are provided, the model will only utilize the main convolutional path.
- It is recommended to use the `--prior_features_file` parameter instead of the `--prior_features` parameter to specify prior features.
- The `--pnum` parameter can be used to specify the phenotype column to predict.
- During model training, samples with NA values in the phenotype will be automatically ignored, so there is no need to manually remove samples with NA values.
- The order of SNPs in the prediction should match the order used during training.
- Input phenotype data format: see `./demo/demo_phenotypes.csv`
- Input genotype data format: see `./demo/demo_genotypes.csv`
- During model training, please pay close attention to adjusting the following hyperparameters, as they significantly impact model performance:
  - `--main_channels`: Number of channels in the main network.
  - `--prior_channels`: Number of channels in the prior knowledge network.
  - `--learning_rate`: Learning rate.
  - `--batch_size`: Batch size.
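
Because the SNP order at prediction time should match the order used in training, it can help to reorder the prediction genotype table before running `predict`. The sketch below is one way to do that with the standard library. It assumes a CSV layout with samples in rows, a sample ID in the first column, and one column per SNP; this layout and the helper name `align_snp_order` are assumptions for illustration, so check `./demo/demo_genotypes.csv` for the format PKDP actually expects.

```python
import csv
import io


def align_snp_order(test_csv_text, training_snp_order):
    """Reorder the SNP columns of a genotype table to match the training order.

    Assumed layout: first column is the sample ID, remaining columns are SNPs.
    """
    reader = csv.reader(io.StringIO(test_csv_text))
    header = next(reader)
    missing = [snp for snp in training_snp_order if snp not in header[1:]]
    if missing:
        raise ValueError(f"SNPs missing from the test genotypes: {missing}")
    # Column position of every training SNP in the test file.
    positions = [header.index(snp) for snp in training_snp_order]
    aligned = [[header[0]] + list(training_snp_order)]
    for row in reader:
        aligned.append([row[0]] + [row[i] for i in positions])
    return aligned


# Hypothetical test genotypes whose columns are in a different order.
TEST_CSV = "sample,snp_b,snp_a,snp_c\ns1,0,1,2\ns2,2,0,1\n"
aligned_rows = align_snp_order(TEST_CSV, ["snp_a", "snp_b", "snp_c"])
for row in aligned_rows:
    print(",".join(row))
```

Raising an error on missing SNPs, rather than silently filling them in, keeps a mismatch between training and prediction marker sets visible.
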
- Initial release of PKDP.
- Added dual-path CNN architecture.
- Integrated support for prior knowledge features.
- Implemented training with cross-validation and hyperparameter optimization.
- Added visualization tools for training progress and predictions.
- In the previous version, the optional parameter `test_phe` was only used for evaluating the predictive capability after model training. It has been removed in the new version.
Han F, Gao M, Zhao Y, Bi C, et al. Improving genomic selection accuracy using a dual-path convolutional neural network framework: a terpenoid case study. Unpublished.

