EGRET (Estimating Genome-wide Regulatory Effects on the Transcriptome)

EGRET is a multivariate linear model designed to identify genome-wide loci that are predictive of gene expression levels. EGRET integrates predictions from existing trans-eQTL mapping approaches (Matrix eQTL, GBAT, and trans-PCO) and determines the weighted combination of regulatory variants that best explains gene expression. Finally, EGRET uses a genome-wide, summary-statistics-based transcriptome-wide association study (TWAS) to identify novel gene-disease associations.

Installation

git clone https://github.com/krishnasbrunton/EGRET.git
cd EGRET

Setting up the environment

Install dependencies

Set up the environment using the yml file.

conda env create -f environment.yml
conda activate egret_env

Required R packages

data.table
optparse
plink2R
MatrixEQTL

To download PLINK, go to: https://www.cog-genomics.org/plink/1.9/

Recommended

xtune

To install xtune you will need to use devtools.

library(devtools)
devtools::install_github("JingxuanH/xtune", build_vignettes = TRUE)
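
Once the environment is active, a quick check can confirm that the R packages listed above are visible to R. The snippet below is a minimal sketch and not part of EGRET: it assumes `Rscript` is on the PATH and that Matrix eQTL is installed under its R package name `MatrixEQTL`. It only reports which packages load and installs nothing.

```shell
# Package names taken from the lists above (assumption: Matrix eQTL is
# installed under its R package name, MatrixEQTL).
packages='c("data.table", "optparse", "plink2R", "MatrixEQTL", "xtune")'

if command -v Rscript >/dev/null 2>&1; then
    r_status="found"
    # requireNamespace() returns TRUE when a package can be loaded.
    Rscript -e "for (p in $packages) cat(p, requireNamespace(p, quietly = TRUE), '\n')"
else
    r_status="missing"
    echo "Rscript not found on PATH; activate egret_env first." >&2
fi
echo "Rscript: $r_status"
```

A package reported as FALSE needs to be installed before running the scripts that depend on it.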

Setup

To set up expression and genotype files in the format that EGRET requires, we have created a wrapper script that preprocesses the genotype and expression data. All outputs are stored in the output_dir.

Argument	Description
`genotypes_file_path`	Path to PLINK `.bed/.bim/.fam` files containing genotype data.
`expression_file_path`	Gene expression matrix (rows = genes, columns = individuals). Must include a `gene_id` column in ENSG format.
`covariates_file_path`	Covariate file (rows = individuals, columns = covariates to regress out).
`tissue`	Name of tissue for analysis. Used for labeling and output organization.
`individuals_file_path`	File containing list of individual IDs (one per line, no header).
`LD_prune_r2`	LD pruning threshold (default: `0.9`).
`plink_path`	Path to PLINK2 executable.
`genotype_output_prefix`	Prefix for pruned genotype output files.
`folds`	Number of cross-validation folds (e.g., 5).
`gene_info_file_path`	File containing gene metadata (gene ID, name, chromosome, start position).
`FDR`	False discovery rate (FDR) threshold used to select significant variants.
`output_dir`	Directory to store all processed outputs.
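
Before running the wrapper, it can help to confirm that the inputs match the layout described in the table. The sketch below is not part of EGRET. It builds two toy files in a temporary directory (the file names, sample IDs, and gene ID are made up for illustration) and checks that the expression matrix has a `gene_id` column and that every ID in the individuals file appears as a column in the expression header. A tab-delimited expression file is an assumption here.

```shell
# Toy inputs for illustration only; file names and IDs are hypothetical.
tmp_dir=$(mktemp -d)
expression_file_path="$tmp_dir/expression.txt"
individuals_file_path="$tmp_dir/individuals.txt"

# Expression matrix: rows = genes, columns = individuals, plus a gene_id column.
printf 'gene_id\tIND1\tIND2\tIND3\n' > "$expression_file_path"
printf 'ENSG00000000001\t0.12\t-0.40\t1.05\n' >> "$expression_file_path"

# Individuals file: one ID per line, no header.
printf 'IND1\nIND2\nIND3\n' > "$individuals_file_path"

# Check 1: the header must contain a gene_id column (tab-delimited assumed).
if head -n 1 "$expression_file_path" | tr '\t' '\n' | grep -qx 'gene_id'; then
    has_gene_id=yes
else
    has_gene_id=no
fi

# Check 2: count individuals that are missing from the expression header.
header_ids="$tmp_dir/header_ids.txt"
head -n 1 "$expression_file_path" | tr '\t' '\n' > "$header_ids"
missing_count=$(grep -vxF -f "$header_ids" "$individuals_file_path" | wc -l | tr -d ' ')

echo "gene_id column present: $has_gene_id"
echo "individuals missing from expression matrix: $missing_count"

# Remove the toy files.
rm -rf "$tmp_dir"
```

To check real data, point `expression_file_path` and `individuals_file_path` at your own files and drop the lines that create the toy inputs.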

./setup_genotype_and_expression.sh \
    $genotypes_file_path \
    $expression_file_path \
    $covariates_file_path \
    $tissue \
    $individuals_file_path \
    $LD_prune_r2 \
    $plink_path \
    $genotype_output_prefix \
    $folds \
    $gene_info_file_path \
    $output_dir
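
As an illustration of how the positional arguments line up, the sketch below fills in each variable and prints the resulting command without running it. Every path and name is a placeholder chosen for this example, not a file shipped with EGRET. Only the `LD_prune_r2` default of 0.9 and the example of 5 folds come from the argument table.

```shell
# All values below are placeholders; replace them with your own.
genotypes_file_path="data/genotypes/cohort"
expression_file_path="data/expression/Whole_Blood.expression.txt"
covariates_file_path="data/covariates/Whole_Blood.covariates.txt"
tissue="Whole_Blood"
individuals_file_path="data/individuals.txt"
LD_prune_r2=0.9                      # default listed in the argument table
plink_path="/usr/local/bin/plink2"   # hypothetical install location
genotype_output_prefix="cohort_pruned"
folds=5
gene_info_file_path="data/gene_info.txt"
output_dir="egret_output"

# Collect the positional arguments in the order the wrapper is called with above.
set -- \
    "$genotypes_file_path" \
    "$expression_file_path" \
    "$covariates_file_path" \
    "$tissue" \
    "$individuals_file_path" \
    "$LD_prune_r2" \
    "$plink_path" \
    "$genotype_output_prefix" \
    "$folds" \
    "$gene_info_file_path" \
    "$output_dir"
n_args=$#

# Dry run: print the command instead of executing it.
echo "./setup_genotype_and_expression.sh $*"

# When the printed command looks right, run it for real:
# ./setup_genotype_and_expression.sh "$@"
```

Printing the command first makes it easy to spot an argument that is empty or out of order, which matters because the wrapper reads its inputs by position.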

Running Matrix eQTL, GBAT, and trans-PCO

In EGRET, we implement the Matrix eQTL, GBAT, and trans-PCO methods to identify trans-variants with the potential to enhance gene expression models. Each method is run separately, and the outputs of each are used in the final training of the genome-wide model.

Running Matrix eQTL

We use the R package Matrix eQTL to test pairwise associations between all genetic variants and gene expression. As input, this script takes the tissue to analyze. It assumes the setup script has already been run and looks for its outputs.

./MatrixeQTL_scripts.sh \
    $tissue \
    $FDR \
    $output_dir

Running GBAT

GBAT utilizes previously trained cis-expression models to find genome-wide regulators. This code is designed to take as input models fitted by FUSION.

If FUSION models have not already been generated, model weights can be downloaded from http://gusevlab.org/projects/fusion/

./GBAT_scripts.sh \
    $FDR \
    $folds \
    $gene_info \
    $output_dir

Running trans-PCO

trans-PCO finds associations between a module of genes and a variant.

./transPCO_scripts.sh \
    $tissue \
    $folds \
    $FDR \
    $num_PCs \
    $output_dir

Training EGRET models

This script takes the outputs of the methods above, trains the genome-wide EGRET model, and outputs the model weights and model R2. We recommend using at least Matrix eQTL and GBAT.

In addition, this script filters out cross-mappable regions, which arise from alignment errors. By default, we supply BED files containing the cross-mappable regions for each gene in the hg38 genome build. To filter cross-mappable regions in a different genome build, run the script 12.0_make_crossmappable_gene_beds.R with the gene-gene mappability scores provided by https://doi.org/10.12688/f1000research.17145.2.

./train_EGRET_models.sh \
    $tissue \
    $folds \
    $models \
    $output_dir

Running EGRET-TWAS

To conduct gene-trait association testing, we provide the following script, which takes as input GWAS summary statistics and the models trained by EGRET and produces gene-trait associations.

./run_TWAS_scripts.sh \
    $tissue \
    $trait \
    $gwas_sumstat_dir \
    $output_dir
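
The steps after setup run in the order documented above: the three trans-eQTL methods, then model training, then TWAS. The sketch below chains them with a small helper that prints each command and executes it only when `DRY_RUN=0`. The `run_step` helper, the `DRY_RUN` switch, and every value assigned here are assumptions made for illustration. In particular, the expected formats of `models`, `trait`, and `num_PCs` are not described in this README, so they are left as obvious placeholders.

```shell
# Placeholder values; none of these come from the EGRET repository.
tissue="${tissue:-Whole_Blood}"
folds="${folds:-5}"
FDR="${FDR:-FDR_PLACEHOLDER}"
num_PCs="${num_PCs:-NUM_PCS_PLACEHOLDER}"
gene_info="${gene_info:-data/gene_info.txt}"
models="${models:-MODELS_PLACEHOLDER}"
trait="${trait:-TRAIT_PLACEHOLDER}"
gwas_sumstat_dir="${gwas_sumstat_dir:-data/gwas_sumstats}"
output_dir="${output_dir:-egret_output}"

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them
steps_log=""

run_step() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    steps_log="$steps_log $1"
    if [ "$DRY_RUN" = "0" ]; then
        # Stop at the first failing step so later steps never see partial outputs.
        "$@" || { echo "step failed: $1" >&2; exit 1; }
    fi
}

run_step ./MatrixeQTL_scripts.sh "$tissue" "$FDR" "$output_dir"
run_step ./GBAT_scripts.sh "$FDR" "$folds" "$gene_info" "$output_dir"
run_step ./transPCO_scripts.sh "$tissue" "$folds" "$FDR" "$num_PCs" "$output_dir"
run_step ./train_EGRET_models.sh "$tissue" "$folds" "$models" "$output_dir"
run_step ./run_TWAS_scripts.sh "$tissue" "$trait" "$gwas_sumstat_dir" "$output_dir"
```

With the placeholders replaced by real values, setting `DRY_RUN=0` in the environment runs the full sequence.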
