gao-lab/LingoDNABench
LingoDNABench

Conventional self-supervised pretraining paradigms constrain the capacity of genomic language models on regulatory decoding

LingoDNABench was developed to address a fundamental yet underexplored question:

Does canonical Masked Language Model (MLM) pretraining truly equip genomic language models (gLMs) with the ability to decode complex regulatory codes?

Rather than focusing exclusively on downstream leaderboard rankings, this benchmark scrutinizes the alignment between pretraining dynamics and downstream regulatory task performance.


Benchmark Overview

Models

The benchmark evaluates a diverse suite of models:

  • Genomic Language Models (gLMs): Please refer to the table in benchmark/README.md for detailed model information.
  • Random baselines (using the architecture defined in pretrain_gLM/BERT):
    • RandomWeight: Models initialized with completely random parameters.

Tasks

Please refer to the table in benchmark/datasets/README.md for detailed task information.

Evaluation Protocol

  • Standardized Metrics: Task-specific performance metrics (AUROC for classification tasks, Spearman correlation for regression tasks).
  • Unified Pipeline: A consistent training and evaluation protocol across models to ensure fair comparison.
  • Robustness Measures: Each adapter model is trained with five random seeds so that the variability of its performance can be assessed.
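The seed-based protocol amounts to reporting a mean and spread per model per task. A minimal sketch of this aggregation, using invented AUROC numbers purely for illustration (they are not actual benchmark results):

```python
import numpy as np

# Hypothetical AUROC scores for one adapter model across the five seeds
# (illustrative values only, not results from the benchmark).
seed_scores = np.array([0.81, 0.79, 0.83, 0.80, 0.82])

mean_auroc = seed_scores.mean()
std_auroc = seed_scores.std(ddof=1)  # sample standard deviation across seeds

print(f"AUROC: {mean_auroc:.3f} +/- {std_auroc:.3f}")
```

Reporting the sample standard deviation alongside the mean is what makes cross-model comparisons on the leaderboard meaningful rather than single-seed noise.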

Pretraining Analysis

Alignment: Pretraining vs. Downstream Performance

  • Systematically analyzes the correlation between pretraining loss trajectories and the resulting downstream task performance (see pretrain_analysis/pretrain_downstream_alignment).
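The core computation behind such an alignment analysis is a rank correlation between per-checkpoint pretraining loss and the downstream metric each checkpoint achieves after fine-tuning. A self-contained sketch with invented checkpoint values (not real benchmark data), using a tie-free Spearman formula:

```python
# Hypothetical per-checkpoint values (illustrative only): pretraining loss at
# successive checkpoints, and the downstream AUROC obtained by fine-tuning
# each checkpoint on one regulatory task.
pretrain_loss = [2.10, 1.85, 1.60, 1.42, 1.30]
downstream_auroc = [0.71, 0.74, 0.78, 0.77, 0.80]

def ranks(xs):
    """Assign ranks 1..n by ascending value (no tie handling needed here)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rho = spearman(pretrain_loss, downstream_auroc)
print(f"Spearman rho (loss vs. AUROC): {rho:.3f}")  # strongly negative here
```

A strongly negative rho would indicate that pretraining loss is a good proxy for downstream regulatory performance; weak or inconsistent correlations are exactly the misalignment this benchmark probes.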

A Mutual Information Perspective

  • Distinct mutual-information structure across genomic regions: gLMs show significantly different masked-token prediction performance in different genomic regions (see pretrain_analysis/mask_token_prediction).
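Stratifying masked-token accuracy by region annotation is the basic operation here. A minimal sketch with hypothetical region labels and prediction outcomes (illustrative only, not the repository's actual data format):

```python
from collections import defaultdict

# Hypothetical masked-token predictions, each tagged with the genomic region
# its position falls in (labels and outcomes invented for illustration).
predictions = [
    ("exon", True), ("exon", True), ("exon", False),
    ("intron", True), ("intron", False), ("intron", False),
    ("intergenic", False), ("intergenic", False), ("intergenic", True),
]

hits = defaultdict(int)
totals = defaultdict(int)
for region, correct in predictions:
    totals[region] += 1
    hits[region] += int(correct)

accuracy = {r: hits[r] / totals[r] for r in totals}
for region, acc in sorted(accuracy.items()):
    print(f"{region:10s} masked-token accuracy: {acc:.2f}")
```

Large gaps between regions suggest that the MLM objective extracts very different amounts of usable information depending on local sequence statistics.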

Repository Structure

LingoDNABench/
├── benchmark/           # datasets and evaluation pipelines
├── pretrain_gLM/        # pretraining scripts and configs
└── pretrain_analysis/   # pretraining–downstream analysis
