Rohit Menon
Department of Computer Science
University of Colorado, Colorado Springs
Email: rmenonch@uccs.edu
Oluwatosin Oluwadare, PhD
Department of Computer Science
University of Colorado, Colorado Springs
Email: ooluwada@uccs.edu
ScHiCAtt is a deep learning model designed to enhance the resolution of Single-Cell Hi-C contact matrices using various attention mechanisms, such as self-attention, local attention, global attention, and dynamic attention. The model leverages GAN-based training to optimize the quality of Hi-C contact maps through a composite loss function consisting of MSE, perceptual, total variation, and adversarial losses.
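As a rough illustration of how a composite loss of this kind can be assembled, the NumPy sketch below combines an MSE term and a total variation term with precomputed perceptual and adversarial terms. The function names, the particular total variation variant, and the equal default weights are assumptions made for illustration; they are not taken from the ScHiCAtt training code.

```python
import numpy as np


def mse_loss(pred, target):
    """Mean squared error between predicted and target patches."""
    return float(np.mean((pred - target) ** 2))


def total_variation_loss(pred):
    """One common total variation variant: mean squared difference
    between vertically and horizontally adjacent pixels."""
    dh = pred[..., 1:, :] - pred[..., :-1, :]
    dw = pred[..., :, 1:] - pred[..., :, :-1]
    return float(np.mean(dh ** 2) + np.mean(dw ** 2))


def composite_loss(pred, target, perceptual=0.0, adversarial=0.0,
                   weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms.

    `perceptual` and `adversarial` are passed in as precomputed scalars
    because they depend on a feature network and a discriminator.
    The weights are hypothetical placeholders, not the paper's values.
    """
    w_mse, w_perc, w_tv, w_adv = weights
    return (w_mse * mse_loss(pred, target)
            + w_perc * perceptual
            + w_tv * total_variation_loss(pred)
            + w_adv * adversarial)
```

In a GAN setup the perceptual and adversarial scalars would come from the feature extractor and the discriminator at each training step.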
ScHiCAtt runs in a Docker-containerized environment. Follow these steps to set up ScHiCAtt.
- Clone this repository:
git clone https://github.com/OluwadareLab/ScHiCAtt.git && cd ScHiCAtt
- Pull the ScHiCAtt Docker image:
docker pull oluwadarelab/schicatt:latest
- Run the container and mount the present working directory to the container:
docker run --rm --gpus all -it --name schicatt -v ${PWD}:${PWD} oluwadarelab/schicatt
- You can now navigate within the container and run the model.
All necessary dependencies are bundled within the Docker environment. The core dependencies include:
- Python 3.8
- PyTorch 1.10.0 (CUDA 11.3)
- NumPy 1.21.1
- SciPy 1.7.0
- Pandas 1.3.1
- Scikit-learn 0.24.2
- Matplotlib 3.4.2
- tqdm 4.61.2
Note: GPU usage for training and testing is highly recommended.
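If you want to confirm that an environment matches the versions listed above, a small check along these lines may help. It is a sketch that assumes the usual PyPI distribution names (for example `torch` for PyTorch and `scikit-learn` for Scikit-learn) and uses a prefix comparison, since PyTorch builds often carry a suffix such as `+cu113`.

```python
from importlib import metadata

# Versions from the dependency list; the distribution names are assumed.
EXPECTED = {
    "torch": "1.10.0",
    "numpy": "1.21.1",
    "scipy": "1.7.0",
    "pandas": "1.3.1",
    "scikit-learn": "0.24.2",
    "matplotlib": "3.4.2",
    "tqdm": "4.61.2",
}


def check_versions(expected=EXPECTED):
    """Return {name: (wanted, installed, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed
        matches = installed is not None and installed.startswith(wanted)
        report[name] = (wanted, installed, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```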
To train the ScHiCAtt model on your Hi-C data, navigate to the Training folder and use the following command:
python3 train.py \
--train_data path/to/train_data.npz \
--valid_data path/to/valid_data.npz \
--epochs 50 \
--batch_size 32 \
--lr 0.0001 \
--save_path checkpoints/schicatt.pth
Your .npz files must contain the following keys:
- 'data': Low-resolution Hi-C patches (shape: [N, 1, 40, 40])
- 'target': Ground truth high-resolution patches (same shape)
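A file with this layout could be written with NumPy roughly as follows. This is a minimal sketch: the helper name `save_training_npz` and the float32 dtype are assumptions, and how the 40x40 patches are cut from your own matrices is left to your preprocessing.

```python
import numpy as np


def save_training_npz(path, low_res, high_res):
    """Write paired patches under the 'data' and 'target' keys.

    Both arrays are expected to have shape [N, 1, 40, 40].
    The float32 cast is an assumption, not a documented requirement.
    """
    low_res = np.asarray(low_res, dtype=np.float32)
    high_res = np.asarray(high_res, dtype=np.float32)
    if low_res.ndim != 4 or low_res.shape[1:] != (1, 40, 40):
        raise ValueError(f"'data' must be [N, 1, 40, 40], got {low_res.shape}")
    if high_res.shape != low_res.shape:
        raise ValueError("'target' must have the same shape as 'data'")
    np.savez_compressed(path, data=low_res, target=high_res)
```

The same helper can be used for both the training and the validation file.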
| Argument | Description | Default |
|---|---|---|
| --train_data | Path to the training .npz dataset | Required |
| --valid_data | Path to the validation .npz dataset | Required |
| --epochs | Number of training epochs | 1 |
| --batch_size | Batch size for training | 64 |
| --lr | Learning rate for the optimizer | 0.0003 |
| --save_path | Path to save the best model checkpoint | schicatt.pth |
The script saves the best-performing model (based on validation loss) to the specified path.
By default, the model will be saved to schicatt.pth.
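The "keep the best checkpoint" behavior can be pictured with a small framework-free sketch like the one below. The class name `BestCheckpoint` and the callback design are hypothetical and only illustrate the idea; the actual logic lives in train.py and may differ.

```python
import math


class BestCheckpoint:
    """Call `save_fn` whenever the validation loss improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn      # callable that writes the checkpoint
        self.best_loss = math.inf   # lowest validation loss seen so far

    def update(self, val_loss):
        """Save and return True if `val_loss` is a new best, else False."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.save_fn()
            return True
        return False
```

With PyTorch, `save_fn` might be something like `lambda: torch.save(model.state_dict(), "schicatt.pth")`, assuming the checkpoint stores a state dict.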
After training, run inference on test data with the saved checkpoint (the inference script is also in the Training folder):
python3 infer.py \
--input path/to/test_data.npz \
--checkpoint path/to/schicatt.pth \
--output path/to/output_directory \
--cuda 0
The input .npz file must contain the following keys:
- 'data': Low-resolution Hi-C patches (shape: [N, 1, 40, 40])
- 'inds': Index array indicating patch positions and chromosome IDs (shape: [N, 4])
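Before running inference, it can be useful to confirm that a file has the expected keys and shapes. The helper below is a sketch; the name `check_inference_npz` is hypothetical, and it checks only the shapes stated above, not the meaning of the four 'inds' columns.

```python
import numpy as np


def check_inference_npz(path):
    """Validate the keys and shapes of an inference input file.

    Returns the number of patches N if the file looks consistent.
    """
    with np.load(path) as archive:
        missing = {"data", "inds"} - set(archive.files)
        if missing:
            raise KeyError(f"missing keys: {sorted(missing)}")
        data = archive["data"]
        inds = archive["inds"]
    if data.ndim != 4 or data.shape[1:] != (1, 40, 40):
        raise ValueError(f"'data' must be [N, 1, 40, 40], got {data.shape}")
    if inds.shape != (data.shape[0], 4):
        raise ValueError(f"'inds' must be [N, 4], got {inds.shape}")
    return data.shape[0]
```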
| Argument | Description | Default |
|---|---|---|
| --input | Path to input .npz test dataset | Required |
| --checkpoint | Path to the trained model checkpoint .pth file | Required |
| --output | Directory to save reconstructed full matrices | Required |
| --multi-chrom | (Optional) Enable multi-chromosome handling if needed | False |
| --cuda | CUDA device ID (e.g., 0 for GPU, -1 for CPU) | -1 |
The script reconstructs chromosome-wise Hi-C matrices and saves them as compressed .npz files inside the specified output directory.
Each output file is named as:
chr<chromosome_id>_schicatt.npz
and contains the key 'schicatt' with the predicted high-resolution matrix.
For example:
python3 infer_schicatt.py \
--input data/test_chr11.npz \
--checkpoint schicatt.pth \
--output results/ \
--cuda 0
This will generate the file results/chr11_schicatt.npz containing the inferred matrix.
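The saved matrix can then be read back with NumPy. This is a minimal sketch; the helper name `load_prediction` is hypothetical, while the 'schicatt' key is the one described above.

```python
import numpy as np


def load_prediction(path):
    """Return the predicted matrix stored under the 'schicatt' key."""
    with np.load(path) as archive:
        return archive["schicatt"]
```

For instance, `load_prediction("results/chr11_schicatt.npz").shape` gives the dimensions of the reconstructed chr11 matrix.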
All analysis scripts are available in the analysis folder.
- We used https://github.com/zengguangjie/deDoc2 for TAD detection from scHiC data. We used the lower TAD-like domains.
- Download deDoc2.
- Edit the necessary variables in the call_tads.py script, such as INPUT_FILEPATH, and adjust other settings as needed.
- Run
python3 call_tads.py
- We used https://xiaotaowang.github.io/TADLib/index.html to produce the TAD figures.
- We provide a sample Python script, draw_tad_plots.py, to produce the plots. Update INPUT_FILEPATH, MATRIX_FILEPATH, FILENAMES, CHROMOSOMES, ALGORITHMS, and OUTPUT_PATH.
- It takes a matrix and TADs as input. The TAD file should have the following structure (without a header row):
| Chromosome | Start Position | End Position |
|---|---|---|
| chr12 | 80000 | 1960000 |
| chr12 | 2000000 | 2720000 |
| chr12 | 2760000 | 4040000 |
| chr12 | 4080000 | 5480000 |
| chr12 | 5520000 | 5720000 |
- Run
python3 draw_tad_plots.py
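A TAD file with the three-column layout shown above could be parsed as in the sketch below. The helper name `read_tads` is hypothetical, and the tab delimiter is an assumption; pass a different `delimiter` if your files are separated differently.

```python
import csv


def read_tads(path, delimiter="\t"):
    """Read a headerless TAD file into (chromosome, start, end) tuples.

    The tab delimiter is an assumption about the file format.
    """
    tads = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            chrom, start, end = row[:3]
            tads.append((chrom, int(start), int(end)))
    return tads
```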
- Edit draw_l2_norm.py with your filenames and paths, including MATRIX_FILEPATH, FILENAMES, CHROMOSOMES, and ALGORITHMS.
- It takes a matrix as input.
- Run
python3 draw_l2_norm.py
- The ScHiCAtt project is publicly available at https://github.com/OluwadareLab/ScHiCAtt.
- The ScHiCAtt web server is publicly available at http://schicatt.hicrobin.online/.
- The Drosophila Hi-C data is publicly available at https://doi.org/10.5281/zenodo.10535486.
- The human cell Hi-C data is available at https://salkinstitute.app.box.com/s/fp63a4j36m5k255dhje3zcj5kfuzkyj1.
If you use ScHiCAtt in your research, please cite the following:
Menon, R., Chowdhury, H. M., & Oluwadare, O. (2025). ScHiCAtt: Enhancing single-cell Hi-C data resolution using attention-based models. Computational and Structural Biotechnology Journal, 27, 978-991. https://doi.org/10.1016/j.csbj.2025.02.031
This project is licensed under the MIT License. See the LICENSE file for details.