📄 Manuscript • 🛠️ Installation • 📦 Data • 🧪 Demo • 🧬 Embedding Generation • 🔬 Perturbation Analysis
## 🛠️ Installation

**Option 1: create the environment from the provided YAML file.**

```bash
# clone the GitHub repository
git clone https://github.com/FunctionLab/mahi.git
cd mahi

# create the Conda environment from the YAML file
conda env create -f environment.yaml
conda activate mahi

# install the PyTorch Geometric dependencies
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

**Option 2: create the environment manually.**

```bash
# create a new Conda environment
conda create --name mahi python=3.10 pytorch=2.1 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate mahi

# install dependencies
pip install "numpy<2"
pip install torch-geometric wandb pytorch-lightning ipykernel umap-learn biopython pyfaidx seaborn xgboost
conda install scikit-learn matplotlib pandas -c conda-forge

# install the PyTorch Geometric dependencies
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.1.0+cu121.html

# install the transformers package
pip install "transformers[torch]"
```

## 📦 Data

Download the required datasets from:
https://drive.google.com/drive/folders/1xWfPkC8bs3aQCsI6YMqYpXnSn6f6E1-B?usp=share_link
Then unzip into the repository root:
```bash
unzip <data>.zip
```

## 🧪 Demo

This demo runs gene essentiality prediction on a single cell line to verify your installation (roughly 30 minutes, depending on your hardware):
```bash
# attach gene essentiality labels to the Mahi demo embeddings for lung tissue
python scripts/gene_essentiality/add_labels.py \
    --mahi_root data/demo/mahi_embeddings_lung \
    --data_dir data/demo

# evaluate gene essentiality (5-fold CV + held-out test evaluation)
python scripts/gene_essentiality/evaluate_mahi_gene_essentiality.py \
    --out_dir outputs/demo \
    --mahi_root data/demo/mahi_embeddings_lung \
    --mapping_file resources/cell_lines.txt \
    --cell_line ACH-000012  # cell line associated with lung tissue
```

For faster runtime on CPU-only machines, you can also submit the demo as a SLURM job:
```bash
sbatch demo.slurm
```

The demo writes its results to:

```
outputs/demo/mahi_gene_essentiality_eval/
├── mahi.metrics_by_cellline_and_tissue.csv  # summary metrics on the training set
├── cv_preds/                                # per-gene out-of-fold predictions
└── test_preds/                              # per-gene test predictions
```

Mahi can be run entirely on CPU (unless you are re-training the multigraph GNN).
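The evaluation script's internals are not shown here, but the 5-fold cross-validation it performs can be sketched on synthetic data. In the sketch below, the embedding dimension, classifier, and label construction are illustrative assumptions, not the repository's actual choices:

```python
# Illustrative sketch: 5-fold CV for gene essentiality classification.
# The 64-dim "embeddings" and binary labels are synthetic stand-ins for
# per-gene Mahi embeddings and essentiality labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))            # 500 genes x 64-dim embeddings
y = (X[:, 0] + rng.normal(size=500)) > 0  # synthetic essentiality labels

oof = np.zeros(len(y))                    # out-of-fold predictions (cf. cv_preds/)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]

print(f"out-of-fold AUROC: {roc_auc_score(y, oof):.3f}")
```

The out-of-fold predictions play the role of the `cv_preds/` files above: every gene is scored by a model that never saw it during training.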
## 🧬 Embedding Generation

Please download the functional networks using the links in the manuscript, then convert the .dab files to .dat format with Dat2Dab from Sleipnir (https://github.com/FunctionLab/sleipnir.git):

```bash
./sleipnir/build/tools/Dat2Dab -i data/dab_networks/<data.dab> -o data/dat_networks/<data.dat>
```

After conversion, filter the networks to the top 3% of edges (recommended on SLURM):

```bash
sbatch scripts/networks/process_networks.slurm
```

If you do not have SLURM, you can run the same script locally:

```bash
bash scripts/networks/process_networks.slurm
```

This generates the filtered networks in:

```
data/dat_networks/*_filtered_top3.dat
```

To generate wild-type embeddings for a single tissue:

```bash
python wt_mahi.py \
    --dir data \
    --tissue lung \
    --checkpoint checkpoints/best-checkpoint.ckpt
```

For multiple tissues:

```bash
python wt_mahi.py \
    --dir data \
    --tissues lung heart kidney \
    --checkpoint checkpoints/best-checkpoint.ckpt
```

You can also provide a tissue list file, e.g. tissues.txt:

```
# tissues.txt
lung
heart
colon
```

```bash
python wt_mahi.py \
    --dir data \
    --tissues_txt tissues.txt \
    --checkpoint checkpoints/best-checkpoint.ckpt
```

You can specify a single tissue (--tissue), multiple tissues (--tissues), or a tissue list file (--tissues_txt).
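The top-3% edge filtering step above can be sketched as follows, assuming Sleipnir's .dat format of one tab-separated `gene1 gene2 weight` record per line; the repository's actual `process_networks` script may differ in details such as tie handling or streaming large files:

```python
# Illustrative sketch: keep only the heaviest 3% of network edges by weight.
# Assumes the Sleipnir .dat format: one tab-separated "gene1 gene2 weight" per line.
import io

def filter_top_edges(dat_text: str, frac: float = 0.03) -> str:
    edges = []
    for line in io.StringIO(dat_text):
        g1, g2, w = line.split("\t")
        edges.append((g1, g2, float(w)))
    edges.sort(key=lambda e: e[2], reverse=True)  # heaviest edges first
    keep = max(1, int(len(edges) * frac))         # top 3% (at least one edge)
    return "".join(f"{g1}\t{g2}\t{w}\n" for g1, g2, w in edges[:keep])

# toy network with 100 edges -> the filter keeps the 3 heaviest
toy = "".join(f"g{i}\tg{i + 1}\t{i / 100:.2f}\n" for i in range(100))
print(filter_top_edges(toy))
```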
## 🔬 Perturbation Analysis

To run an in silico perturbation of a gene:

```bash
python perturb_mahi.py \
    --dir data \
    --gene <Entrez ID> \
    --tissue lung \
    --checkpoint checkpoints/best-checkpoint.ckpt
```

As with wt_mahi.py, you can specify a single tissue (--tissue), multiple tissues (--tissues), or a tissue list file (--tissues_txt).
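Conceptually, downstream perturbation analysis can be thought of as measuring how far each gene's embedding moves between the wild-type and perturbed runs, then ranking genes by that shift. A minimal numpy sketch of this idea, using hypothetical arrays rather than the repository's actual data layout or distance measure:

```python
# Illustrative sketch: rank genes by embedding shift after an in silico
# perturbation. `wt` and `perturbed` are hypothetical (n_genes x dim) arrays
# standing in for wild-type and post-perturbation embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim = 1000, 32
wt = rng.normal(size=(n_genes, dim))
perturbed = wt.copy()
perturbed[:50] += rng.normal(scale=2.0, size=(50, dim))  # 50 genes respond strongly

shift = np.linalg.norm(perturbed - wt, axis=1)  # per-gene Euclidean shift
top = np.argsort(shift)[::-1][:100]             # the 100 most-affected genes
print(f"{np.isin(top[:50], np.arange(50)).mean():.0%} of the top 50 are truly perturbed")
```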
To retrieve the genes most affected by a perturbation:

```bash
python get_top_genes.py \
    --dir data \
    --gene <Entrez ID> \
    --tissue lung \
    --avg resources/averaged_distances.csv \
    --top 1000
```

If you use Mahi in your research, please cite:
```bibtex
@article{aggarwal2026mahi,
  title   = {Multi-modal tissue-aware graph neural network for in silico genetic discovery},
  author  = {Aggarwal, Anusha and Sokolova, Ksenia and Troyanskaya, Olga G},
  journal = {bioRxiv},
  year    = {2026},
  month   = feb,
  doi     = {10.64898/2026.02.17.706433},
  url     = {https://www.biorxiv.org/content/10.64898/2026.02.17.706433v1},
}
```
