Note: This repository was archived by the owner on Feb 1, 2026. It is now read-only.

Microbiome Data Analysis Platform


Overview

A comprehensive platform for analyzing microbiome data from the LucKi cohort, featuring machine learning models for age group prediction from gut microbiome taxonomic profiles. The platform includes both an interactive Streamlit web application and Jupyter notebook-based analyses.

Keywords

microbiome, metagenomics, machine-learning, bioinformatics, MetaPhlAn, age-prediction, taxonomic-profiling, compositional-data, CLR-transformation, feature-selection, model-interpretability, LIME, SHAP, Random-Forest, XGBoost, neural-networks, streamlit, python, data-science

Features

Data Processing

  • CLR Transformation: Handles compositional nature of microbiome data
  • Label Encoding: Automatic encoding of categorical variables
  • Missing Value Handling: Robust preprocessing pipeline
  • Train-Test Split: Stratified splitting to maintain class balance
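The stratified split can be sketched with scikit-learn (a minimal illustration on synthetic data; the feature matrix, labels, and class count here are placeholders, not the project's real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 5 features, 4 age-group classes
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = rng.integers(0, 4, size=100)

# stratify=y keeps the age-group proportions (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, a rare age group could end up entirely in one split; with it, each class's share differs between train and test only by rounding.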

Machine Learning Models

  • Random Forest: Ensemble learning with decision trees
  • XGBoost: Gradient boosting with regularization
  • Gradient Boosting: Sequential ensemble learning
  • LightGBM: High-efficiency gradient boosting
  • AdaBoost: Adaptive boosting algorithm
  • Neural Networks: Feature selection with gatekeeper layers

Model Interpretability

  • LIME: Local interpretable model-agnostic explanations
  • SHAP: SHapley additive explanations
  • Feature Importance: Analysis and visualization
  • Cross-Validation: K-fold validation for robustness
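As a generic illustration of the feature-importance step (synthetic data and made-up taxon names; not the project's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative abundance-like data: 50 samples, 20 taxon features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((50, 20)),
                 columns=[f"g__Taxon_{i}" for i in range(20)])
y = X["g__Taxon_3"] * 2 + rng.normal(0, 0.1, 50)  # signal in one taxon

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Rank taxa by impurity-based importance (values sum to 1)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```

The planted signal in `g__Taxon_3` dominates the ranking; on real microbiome data the same pattern flags the taxa most predictive of age group.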

Interactive Platform

  • Streamlit Application: Web-based interactive interface
  • Real-time Analysis: Dynamic model training and evaluation
  • Visualization: Comprehensive plotting and comparison tools
  • User-Friendly: No coding required for basic analysis

Organization

Notebooks:

| Notebook | Contents |
| --- | --- |
| data-pipeline | End-to-end workflow for the entire project, from dataset preprocessing and exploration to model finalization. Includes exploratory tests. |
| data_analysis | Data preprocessing and exploratory data analysis (EDA). |
| predicting_models | The various models trained and tested at different stages. |
| model_results | Visualizations of each model's performance. |

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • (Optional) GPU support for TensorFlow
  • Minimum 8GB RAM (16GB recommended for neural network feature selection)

Step-by-Step Installation Guide

1. Clone the Repository

git clone https://github.com/MAI-David/Data-analysis.git
cd Data-analysis

2. Create Virtual Environment (Recommended)

Using venv:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Using conda:

conda create -n microbiome python=3.10
conda activate microbiome

3. Install Dependencies

For Streamlit Application:

pip install -r requirements.txt

For Jupyter Notebook Analysis:

cd notebooks
pip install -r requirements.txt

For Development (with code quality tools):

pip install -e ".[dev]"

4. Verify Installation

python -c "import streamlit; import pandas; import sklearn; import xgboost; print('Installation successful!')"

Hardware Requirements

Minimum Requirements

| Component | Specification |
| --- | --- |
| CPU | 2 cores, 2.0 GHz |
| RAM | 8 GB |
| Storage | 2 GB available space |
| OS | Linux, macOS, Windows 10+ |

Recommended Requirements

| Component | Specification |
| --- | --- |
| CPU | 4+ cores, 3.0+ GHz |
| RAM | 16 GB |
| GPU | NVIDIA GPU with 4GB+ VRAM (for neural network feature selection) |
| Storage | 5 GB available space |
| OS | Linux (Ubuntu 20.04+), macOS 11+, Windows 10+ |

GPU Support

  • NVIDIA GPUs: Requires CUDA 11.2+ and cuDNN 8.1+
  • AMD GPUs: ROCm support (experimental)
  • Apple Silicon (M1/M2): TensorFlow Metal plugin

Note: GPU is optional. All models can run on CPU, though neural network feature selection will be slower.


Quick Start

Using Streamlit Application

streamlit run app.py

Then open your browser to http://localhost:8501

Direct Navigation: You can navigate directly to specific sections using URL parameters:

http://localhost:8501/?page=interpretability
http://localhost:8501/?page=eda

See streamlit/URL_NAVIGATION.md for all available page identifiers.

Using Jupyter Notebook

cd notebooks
jupyter notebook data-pipeline.ipynb

Command Line Analysis (Quick Demo)

from utils.data_loader import get_train_test_split, apply_clr_transformation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Load and preprocess data
X_train, X_test, y_train, y_test, _ = get_train_test_split()
X_train_clr, X_test_clr = apply_clr_transformation(X_train, X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_clr, y_train)

# Evaluate
score = r2_score(y_test, model.predict(X_test_clr))
print(f"Test R² Score: {score:.4f}")

Usage

Streamlit Application

The web application provides five main sections:

  1. Home: Project overview and dataset statistics
  2. Data Preprocessing: Interactive data transformation and visualization
  3. Model Training: Train and compare multiple ML models
  4. Model Interpretability: Understand model predictions with LIME/SHAP
  5. Results Comparison: Cross-validation and ensemble analysis

Jupyter Notebook Workflow

The notebook is organized into sections:

  1. Housekeeping: Library imports and settings
  2. Data Preprocessing: Loading, merging, and cleaning
  3. Exploratory Data Analysis: Visualizations and statistics
  4. Model Training: Multiple ML algorithms
  5. Feature Selection: Neural network-based selection
  6. Model Interpretability: LIME and SHAP analysis
  7. Cross-Validation: Taxonomic level comparison

Data Description

Dataset Characteristics

The subset of the LucKi cohort data used here consists of:

  • 930 stool samples from multiple individuals across different families
  • ~6,900 microbiome features (taxonomic clades)
  • MetaPhlAn 4.1.1 taxonomic profiling
  • Age groups as target variable for prediction

Data Files

Located in data/raw/:

MAI3004_lucki_mpa411.csv

  • Format: CSV (converted from TSV)
  • Dimensions: 6903 rows × 932 columns
  • Content: Taxonomic profiles with relative abundances
  • Row Index: Taxonomic clade names (species to kingdom level)
  • Columns: Sample IDs prefixed with mpa411_
  • Values: Relative abundance (0-100%)
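A rough sketch of reshaping such a table to samples-as-rows, following the layout described above (the tiny in-memory frame stands in for the real CSV; the sample IDs are made up):

```python
import pandas as pd

# Tiny stand-in for MAI3004_lucki_mpa411.csv:
# rows are taxonomic clades, columns are samples prefixed with mpa411_
data = pd.DataFrame({
    "clade_name": ["k__Bacteria|g__Blautia", "k__Bacteria|g__Dorea"],
    "mpa411_S001": [12.5, 0.0],
    "mpa411_S002": [3.1, 7.4],
})

# Keep only the sample columns, then transpose so rows = samples, columns = clades
sample_cols = [c for c in data.columns if c.startswith("mpa411_")]
sample_abundances = (
    data.set_index("clade_name")[sample_cols].T.rename_axis("sample_id")
)
print(sample_abundances.shape)  # (2, 2): samples × clades
```

On the real file this yields a 930 × 6903 samples-by-features matrix, ready to merge with the metadata on sample ID.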

MetaPhlAn 4 Taxonomic Format:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Blautia_obeum

Taxonomic levels:

  • k__: Kingdom
  • p__: Phylum
  • c__: Class
  • o__: Order
  • f__: Family
  • g__: Genus
  • s__: Species
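A clade string in this format can be split into named levels with a few lines of Python (a generic helper for illustration, not part of the repository):

```python
# Map MetaPhlAn level prefixes to human-readable names
PREFIXES = {"k": "Kingdom", "p": "Phylum", "c": "Class", "o": "Order",
            "f": "Family", "g": "Genus", "s": "Species"}

def parse_clade(clade: str) -> dict:
    """Split a MetaPhlAn clade string into {level: name} pairs."""
    levels = {}
    for part in clade.split("|"):
        prefix, _, name = part.partition("__")
        levels[PREFIXES[prefix]] = name
    return levels

clade = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales"
         "|f__Lachnospiraceae|g__Blautia|s__Blautia_obeum")
print(parse_clade(clade)["Species"])  # → Blautia_obeum
```

The same idea drives the genus-level filtering mentioned later: keep only clades whose deepest prefix is `g__`.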

MAI3004_lucki_metadata_safe.csv

  • Format: CSV
  • Dimensions: 930 rows × 6 columns
  • Content: Sample metadata and demographics

Columns:

  • sample_id: Unique sample identifier
  • family_id: Family grouping identifier
  • sex: Biological sex (categorical)
  • age_group_at_sample: Age group category (target variable)
  • year_of_birth: Birth year (removed during preprocessing)
  • body_product: Sample type (removed during preprocessing)

Data Quality Metrics

| Metric | Value |
| --- | --- |
| Total samples | 930 |
| Total features | ~6,900 |
| Average genera per sample | ~1,200 |
| Average species per sample | ~300 |
| Missing values | Minimal (<1%) |
| Data sparsity | High (~80% zeros) |
| Distribution | Log-normal (typical for microbiome) |

Methodology

Preprocessing Pipeline

  1. Data Integration
  • Merge abundance table with metadata
  • Filter for common samples
  • Remove unnecessary columns
  2. Encoding
  • Label encoding for categorical variables (family_id, sex, age_group)
  • Preserve ordinal relationships where applicable
  3. Quality Control
  • Missing value detection and removal
  • Outlier analysis using IQR method
  • Normality testing with Shapiro-Wilk
  4. Normalization
  • CLR (Centered Log-Ratio) transformation
  • Accounts for compositional nature of microbiome data
  • Formula: CLR(x) = log(x / geometric_mean(x))
  5. Feature Selection
  • Genus-level filtering
  • Neural network-based selection (optional)
  • Prevalence and variance filtering
Machine Learning Pipeline

Raw Data → Preprocessing → Train/Test Split → CLR Transform → 
Feature Selection → Model Training → Evaluation → Interpretation
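The pipeline above can be sketched with scikit-learn's Pipeline API (synthetic data; the repository applies CLR through its own apply_clr_transformation helper rather than a FunctionTransformer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def clr(X, pseudocount=1e-6):
    """Stateless CLR step: log, then center each row on its mean log."""
    logged = np.log(X + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Stand-in abundance matrix with signal in the first feature
rng = np.random.default_rng(42)
X = rng.random((120, 10))
y = X[:, 0] * 3 + rng.normal(0, 0.1, 120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("clr", FunctionTransformer(clr)),                       # normalization
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"Test R² = {pipe.score(X_test, y_test):.3f}")
```

Bundling the transform and model in one object guarantees the CLR step is fitted/applied identically at train and predict time.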

Model Evaluation Metrics

  • RMSE (Root Mean Squared Error): Prediction error magnitude
  • R² Score: Proportion of variance explained
  • MAE (Mean Absolute Error): Average prediction error
  • Cross-Validation: K-fold validation for robustness
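A minimal example of computing these metrics with scikit-learn (toy arrays and synthetic data, for illustration only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # error magnitude
mae = mean_absolute_error(y_true, y_pred)           # average error
r2 = r2_score(y_true, y_pred)                       # variance explained
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")

# K-fold cross-validation on synthetic data for robustness
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.05, 100)
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, cv=5, scoring="r2")
print(f"5-fold R²: {scores.mean():.3f} ± {scores.std():.3f}")
```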

Reproducibility

Setting Random Seeds

All analyses use fixed random seeds for reproducibility:

import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Environment Specification

Full environment captured in:

  • requirements.txt: Pinned package versions
  • pyproject.toml: Development dependencies
  • Python version: 3.8+

Reproducing Results

  1. Clone exact repository version:

    git clone https://github.com/MAI-David/Data-analysis.git
    git checkout <commit-hash>  # Use specific commit for exact reproduction
  2. Install exact dependencies:

    pip install -r requirements.txt
  3. Run analysis:

    jupyter notebook notebooks/data-pipeline.ipynb

    Or:

    streamlit run app.py

Version Control

  • All code changes tracked in Git
  • See CHANGELOG.md for version history
  • Tagged releases for major versions

Documentation

UML activity diagram

---
config:
  layout: elk
  look: neo
  theme: default
---
flowchart TB
    n1["Data"] --> n3["Merged DataFrame"]
    n2["Metadata"] --> n3
    n3 --> n4["Preprocessing"] & n6@{ label: "Year of Birth<br>Body Product" } & n7["Exploratory Data Analysis"]
    n6 --> n5["Drop unused columns"]
    n4 --> n8["LabelEncoding"]
    n9["Family ID<br>Sex<br>Age group"] --> n8
    n8 --> n10["Missingness check"]
    n11["Rows with NaN Age Group"] --> n12["Drop unknown samples"]
    n10 --> n11 & n13["Outlier check"]
    n13 --> n14["Summary"] & n15["Normalisation check"]
    n15 --> n16["Summary"]
    n7 --> n17["Shape measure"]
    n17 --> n18["Samples per child"]
    n18 --> n19["Samples per age group"]
    n19 --> n20["Bacterial abundance"]
    n20 --> n21["Feature analysis"]

    n1@{ shape: db}
    n2@{ shape: db}
    n6@{ shape: manual-input}
    n5@{ shape: event}
    n9@{ shape: manual-input}
    n11@{ shape: display}
    n14@{ shape: summary}
    n16@{ shape: summary}
    n1:::Aqua
    n2:::Aqua
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    style n1 color:#000000
    style n2 color:#000000

Important variables and objects

| Name | Purpose |
| --- | --- |
| data, metadata | Raw abundance table and sample metadata loaded from ../data/raw/MAI3004_lucki_mpa411.csv and ../data/raw/MAI3004_lucki_metadata_safe.csv; shapes asserted at (6903, 932) and (930, 6). |
| sample_cols | List of abundance columns prefixed with mpa411_, used to isolate sample-level measurements. |
| sample_abundances | Transposed abundance table keyed by sample_id, created from sample_cols and clade_name. |
| metadata_common | Subset of metadata with sample IDs present in sample_abundances. |
| merged_samples | Inner merge of metadata_common and sample_abundances; drops year_of_birth and body_product. |
| encoded_samples | Copy of merged_samples with sex and family_id encoded and rows missing age_group_at_sample removed. |
| age_encoder, age_groups | LabelEncoder fitted on age_group_at_sample; age_groups maps age group labels to encoded integers. |
| missing_table | Summary of missing values per column in encoded_samples, including percentage of missing data. |
| numeric_cols, outlier_table | Numeric column list and corresponding IQR-based outlier bounds/counts. |
| normalized_samples | Copy used for Shapiro-Wilk normality checks across numeric_cols. |
| X, feature_cols | Feature matrix derived from merged_samples after removing metadata columns; drives prevalence and PCA analysis. |
| top_features, X_sub, X_scaled, X_pca | PCA prep artifacts: top 500 prevalent features, their subset matrix, scaled values, and resulting 2D projection. |
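The PCA-prep artifacts in the last row can be reproduced in miniature (synthetic data; the notebook's top 500 features reduced to top 3 for the demo):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Sparse abundance-like matrix: 30 samples × 8 taxa, ~60% zeros
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((30, 8)) * (rng.random((30, 8)) > 0.6),
                 columns=[f"taxon_{i}" for i in range(8)])

# Prevalence = fraction of samples in which a feature is non-zero
prevalence = (X > 0).mean(axis=0)
top_features = prevalence.sort_values(ascending=False).head(3).index

X_sub = X[top_features]                          # subset to prevalent taxa
X_scaled = StandardScaler().fit_transform(X_sub) # zero mean, unit variance
X_pca = PCA(n_components=2).fit_transform(X_scaled)
print(X_pca.shape)  # (30, 2)
```

Filtering by prevalence before PCA keeps the projection from being dominated by taxa that are absent in most samples.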

Project Structure

Data-analysis/
├── README.md                    # This file
├── CHANGELOG.md                 # Version history
├── pyproject.toml              # Project configuration and dependencies
├── requirements.txt            # Python package requirements
├── LICENSE                     # MIT License
├── app.py                      # Streamlit application entry point
│
├── data/                       # Data directory
│   └── raw/                    # Raw data files
│       ├── MAI3004_lucki_mpa411.csv           # Abundance data
│       ├── MAI3004_lucki_metadata_safe.csv    # Sample metadata
│       └── metaphlan411_data_description.md   # Data format docs
│
├── notebooks/                  # Jupyter notebooks
│   ├── data-pipeline.ipynb    # Main analysis notebook
│   ├── functions.py           # Helper functions
│   └── requirements.txt       # Notebook-specific dependencies
│
├── pages/                      # Streamlit pages
│   ├── __init__.py
│   ├── home.py                # Home page
│   ├── preprocessing.py       # Data preprocessing page
│   ├── models.py              # Model training page
│   ├── interpretability.py    # Model interpretability page
│   └── results.py             # Results comparison page
│
├── utils/                      # Utility modules
│   ├── __init__.py
│   └── data_loader.py         # Data loading and caching functions
│
└── outputs/                    # Analysis outputs (generated)
    └── data-pipeline-1150-2001.ipynb  # Example output

Documentation

Function Documentation

All critical functions include NumPy-style docstrings:

def apply_clr_transformation(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Apply Centered Log-Ratio transformation to microbiome abundance data.
    
    Parameters
    ----------
    X_train : pd.DataFrame
        Training feature matrix with abundance values.
    X_test : pd.DataFrame
        Test feature matrix with abundance values.
    
    Returns
    -------
    Tuple[pd.DataFrame, pd.DataFrame]
        CLR-transformed training and test datasets.
    
    Notes
    -----
    The CLR transformation accounts for the compositional nature of microbiome data.
    A small pseudocount (1e-6) is added to avoid log(0).
    """

API Documentation

For detailed API documentation, see individual module docstrings:

  • utils/data_loader.py: Data loading functions
  • notebooks/functions.py: Analysis functions
  • pages/*.py: Streamlit page implementations



Related Publications

The LucKi cohort is described in:

  • Luckey et al. (2015). "2015 LucKi cohort description." BMC Public Health. DOI: 10.1186/s12889-015-2255-7

Last Updated: 2024-01-25 Version: 1.0.0
