Conventional self-supervised pretraining paradigms constrain the capacity of genomic language models for regulatory decoding
LingoDNABench was developed to address a fundamental yet underexplored question:
Does canonical masked language modeling (MLM) pretraining truly equip genomic language models (gLMs) with the ability to decode complex regulatory codes?
Rather than focusing exclusively on downstream leaderboard rankings, this benchmark scrutinizes the alignment between pretraining dynamics and downstream regulatory task performance.
The benchmark evaluates a diverse suite of models:
- Genomic Language Models (gLMs): Please refer to the table in `benchmark/README.md` for details.
- Random baselines (using the architecture defined in `pretrain_gLM/BERT`):
  - RandomWeight: Models initialized with completely random parameters (a minimal sketch of such a baseline is shown below).

Please refer to the table in `benchmark/datasets/README.md` for detailed task information.
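A RandomWeight baseline can be obtained by instantiating the encoder from a config rather than from a pretrained checkpoint. The sketch below assumes a Hugging Face `transformers` BERT-style model; the config values are illustrative, and the benchmark's actual architecture is the one defined in `pretrain_gLM/BERT`.

```python
# Minimal sketch of a RandomWeight baseline: a BERT-style encoder whose
# parameters are freshly initialized and never pretrained. Config values
# below are illustrative, not the benchmark's actual hyperparameters.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=4096,                # e.g. a k-mer/BPE vocabulary over A/C/G/T (illustrative)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)

# Building from a config (rather than from_pretrained) draws all weights from
# the random initializer, giving an untrained control for the pretrained gLMs.
random_baseline = BertForMaskedLM(config)
random_baseline.eval()
```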
- Standardized Metrics: Task-specific performance metrics (AUROC for classification, SpearmanR for regression applications); a toy scoring sketch follows this list.
- Unified Pipeline: A consistent training and evaluation protocol across models to ensure fair comparison.
- Robustness Measures: Each adapter model is trained with five random seeds to assess the statistical robustness of the results.
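As a concrete, non-authoritative illustration of the scoring and five-seed protocol above, the snippet below selects AUROC or SpearmanR by task type and aggregates synthetic results across seeds. The function name and data are hypothetical; the benchmark's real pipeline lives under `benchmark/`.

```python
# Hypothetical scoring helper: AUROC for classification, SpearmanR for regression,
# plus a toy five-seed aggregation reported as mean ± standard deviation.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score


def score_task(y_true, y_pred, task_type):
    if task_type == "classification":
        return roc_auc_score(y_true, y_pred)
    return spearmanr(y_true, y_pred).correlation


y_true = np.random.default_rng(0).integers(0, 2, size=200)   # synthetic binary labels
seed_scores = []
for seed in range(5):
    noise = np.random.default_rng(seed).random(200)          # stand-in for one fine-tuning run
    y_pred = 0.6 * y_true + 0.4 * noise
    seed_scores.append(score_task(y_true, y_pred, "classification"))

print(f"AUROC = {np.mean(seed_scores):.3f} ± {np.std(seed_scores):.3f}")
```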
Beyond the benchmark itself, the repository provides two pretraining analyses:
- Pretrain–downstream alignment: a systematic analysis of the correlation between pretraining loss trajectories and the resulting downstream task performance (see `pretrain_analysis/pretrain_downstream_alignment`; a toy version of this correlation check is sketched after the list).
- Region-dependent mutual information structure: gLMs show significantly different masked-token prediction performance across genomic regions (see `pretrain_analysis/mask_token_prediction`; a per-region evaluation sketch also follows the list).
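One simple way to quantify the alignment mentioned in the first item is to correlate checkpoint-level pretraining loss with downstream scores. The values below are placeholders; the actual analysis lives in `pretrain_analysis/pretrain_downstream_alignment`.

```python
# Placeholder illustration: Spearman correlation between the MLM loss of
# successive pretraining checkpoints and their downstream AUROC.
from scipy.stats import spearmanr

pretrain_loss = [5.8, 4.9, 4.3, 4.0, 3.9, 3.85]          # one value per checkpoint (placeholder)
downstream_auroc = [0.55, 0.61, 0.66, 0.67, 0.67, 0.68]  # matched downstream scores (placeholder)

rho, pval = spearmanr(pretrain_loss, downstream_auroc)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g})")
```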
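The per-region effect in the second item can be probed roughly as follows: mask tokens at random in sequences drawn from different genomic regions and measure how often the model recovers them. The checkpoint identifier and the toy sequences below are placeholders; see `pretrain_analysis/mask_token_prediction` for the actual implementation.

```python
# Hedged sketch: compare masked-token recovery across genomic regions
# (e.g. promoter vs. intergenic). The checkpoint identifier is a placeholder.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "path/to/a-pretrained-gLM"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()


def masked_token_accuracy(sequences, mask_prob=0.15):
    """Fraction of randomly masked tokens the model recovers exactly."""
    correct, total = 0, 0
    for seq in sequences:
        enc = tokenizer(seq, return_tensors="pt", return_special_tokens_mask=True)
        input_ids = enc["input_ids"].clone()
        special = enc["special_tokens_mask"].bool()
        # Mask ~15% of ordinary tokens, never special tokens such as [CLS]/[SEP].
        mask = (torch.rand(input_ids.shape) < mask_prob) & ~special
        labels = input_ids[mask]
        input_ids[mask] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=input_ids, attention_mask=enc["attention_mask"]).logits
        preds = logits[mask].argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)


# Toy region buckets; in the real analysis these come from genome annotations.
regions = {"promoter": ["ACGTACGT" * 32], "intergenic": ["TTGACCAA" * 32]}
for name, seqs in regions.items():
    print(name, masked_token_accuracy(seqs))
```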
```
LingoDNABench/
├── benchmark/          # datasets and evaluation pipelines
├── pretrain_gLM/       # pretraining scripts or configs
└── pretrain_analysis/  # pretrain–downstream analysis
```