This is the official repository for our paper, "Genomic heterogeneity inflates the performance of variant pathogenicity predictions".
It provides a genome-wide, variant-type-stratified benchmark dataset (>250,000 ClinVar variants) and the code to evaluate state-of-the-art DNA-based and protein-based models for variant pathogenicity prediction.
We provide one-click Jupyter notebook examples for each evaluated model, benchmark creation, and results visualization.
- DNA-based models: AlphaGenome, DNABERT2, Evo2, GPN-MSA, Nucleotide Transformer (NT), PhyloGPN, PhyloP
  → Notebooks are available in the `DNA-based Models/` directory.
- Protein-based models: ESM family models, AlphaMissense, PrimateAI-3D
  → Notebooks are available in the `protein_models/` directory.
- Benchmark creation:
  → See `VEP_ClinVar_Benchmarking_RefSeq.ipynb`.
- Visualization:
  → See `VEP_AUROC_figure.ipynb`.
Figure 1. Pathogenicity prediction performance of frontier sequence-based models across variant types.
Evaluation and comparison of DNA and protein sequence AI models for their capacity to distinguish between pathogenic and benign variants across variant types, measured by the area under the receiver operating characteristic curve (AUROC). Error bars denote 95% confidence intervals estimated by stratified bootstrap resampling (1,000 iterations) within each variant group.
- %P indicates the proportion of pathogenic variants in each group.
- Some groups are defined by multiple annotated effects (e.g., both missense and 3′ UTR, with respect to different transcripts).
- DNA models are shown as solid bars, protein models as dashed bars.
Note: The evaluation of PrimateAI-3D on stop-gain variants includes only 19,795 variants.
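
For readers who want to reproduce the kind of error bars described in the figure caption, the following is a minimal sketch of an AUROC with a stratified bootstrap confidence interval for a single variant group. It is not taken from the repository notebooks: the function names `auroc` and `stratified_bootstrap_ci` are hypothetical, and it assumes that "stratified" means pathogenic and benign variants are resampled separately (preserving %P) and that the interval is a simple percentile interval.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Average 1-based rank of each score; tied scores share the same rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def stratified_bootstrap_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Return (AUROC, lower, upper) for one variant group.

    Assumption: resampling is stratified by label, so every bootstrap
    replicate keeps the original numbers of pathogenic and benign variants.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels)
    neg_idx = np.flatnonzero(~labels)
    boot = np.empty(n_iter)
    for i in range(n_iter):
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size, replace=True),
            rng.choice(neg_idx, size=neg_idx.size, replace=True),
        ])
        boot[i] = auroc(labels[idx], scores[idx])
    # Percentile interval (an assumption; other interval types are possible).
    lower, upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auroc(labels, scores), lower, upper


# Synthetic demo data (not from the benchmark): 1 = pathogenic, 0 = benign.
demo_rng = np.random.default_rng(42)
demo_labels = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
demo_scores = np.concatenate([demo_rng.normal(1.0, 1.0, 200),
                              demo_rng.normal(0.0, 1.0, 200)])
point, lower, upper = stratified_bootstrap_ci(demo_labels, demo_scores)
print(f"AUROC = {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

To mirror the figure, this would be applied once per variant group and per model, with each model's scores oriented so that higher values indicate pathogenicity.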
If you find this benchmark useful for your research, please cite our paper:
@article{genomic2025biorxiv,
author = {Baiyu Lu and Xueshen Liu and Po-Yu Lin and Nadav Brandes},
title = {Genomic heterogeneity inflates the performance of variant pathogenicity predictions},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.09.05.674459},
url = {https://www.biorxiv.org/content/10.1101/2025.09.05.674459v2},
eprint = {https://www.biorxiv.org/content/10.1101/2025.09.05.674459v2.full.pdf}
}