Note: This repository was archived by the owner on Feb 1, 2026. It is now read-only.

Microbiome Data Analysis Platform


Overview

A comprehensive platform for analyzing microbiome data from the LucKi cohort, featuring machine learning models for age group prediction from gut microbiome taxonomic profiles. The platform includes both an interactive Streamlit web application and Jupyter notebook-based analyses.

Keywords

microbiome, metagenomics, machine-learning, bioinformatics, MetaPhlAn, age-prediction, taxonomic-profiling, compositional-data, CLR-transformation, feature-selection, model-interpretability, LIME, SHAP, Random-Forest, XGBoost, neural-networks, streamlit, python, data-science

Features

Data Processing

  • CLR Transformation: Handles compositional nature of microbiome data
  • Label Encoding: Automatic encoding of categorical variables
  • Missing Value Handling: Robust preprocessing pipeline
  • Train-Test Split: Stratified splitting to maintain class balance
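The stratified split can be sketched with scikit-learn (a minimal illustration on synthetic data; the feature matrix, labels, and class count here are placeholders, not the project's real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 5 features, 4 age-group classes
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = rng.integers(0, 4, size=100)

# stratify=y keeps the age-group proportions (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, a rare age group could end up entirely in one split; with it, each class's share differs between train and test only by rounding.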

Machine Learning Models

  • Random Forest: Ensemble learning with decision trees
  • XGBoost: Gradient boosting with regularization
  • Gradient Boosting: Sequential ensemble learning
  • LightGBM: High-efficiency gradient boosting
  • AdaBoost: Adaptive boosting algorithm
  • Neural Networks: Feature selection with gatekeeper layers

Model Interpretability

  • LIME: Local interpretable model-agnostic explanations
  • SHAP: SHapley additive explanations
  • Feature Importance: Analysis and visualization
  • Cross-Validation: K-fold validation for robustness
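As a generic illustration of the feature-importance step (synthetic data and made-up taxon names; not the project's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative abundance-like data: 50 samples, 20 taxon features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((50, 20)),
                 columns=[f"g__Taxon_{i}" for i in range(20)])
y = X["g__Taxon_3"] * 2 + rng.normal(0, 0.1, 50)  # signal in one taxon

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Rank taxa by impurity-based importance (values sum to 1)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```

The planted signal in `g__Taxon_3` dominates the ranking; on real microbiome data the same pattern flags the taxa most predictive of age group.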

Interactive Platform

  • Streamlit Application: Web-based interactive interface
  • Real-time Analysis: Dynamic model training and evaluation
  • Visualization: Comprehensive plotting and comparison tools
  • User-Friendly: No coding required for basic analysis

Organization

Notebooks:

| Notebook | Contents |
| --- | --- |
| data-pipeline | End-to-end workflow for the entire project, from dataset preprocessing and exploration to model finalization. Includes exploratory tests. |
| data_analysis | Data preprocessing and exploratory data analysis (EDA). |
| predicting_models | The various models trained and tested at different stages. |
| model_results | Visualizations of each model's performance. |

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • (Optional) GPU support for TensorFlow
  • Minimum 8GB RAM (16GB recommended for neural network feature selection)

Step-by-Step Installation Guide

1. Clone the Repository

git clone https://github.com/MAI-David/Data-analysis.git
cd Data-analysis

2. Create Virtual Environment (Recommended)

Using venv:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Using conda:

conda create -n microbiome python=3.10
conda activate microbiome

3. Install Dependencies

For Streamlit Application:

pip install -r requirements.txt

For Jupyter Notebook Analysis:

cd notebooks
pip install -r requirements.txt

For Development (with code quality tools):

pip install -e ".[dev]"

4. Verify Installation

python -c "import streamlit; import pandas; import sklearn; import xgboost; print('Installation successful!')"

Hardware Requirements

Minimum Requirements

| Component | Specification |
| --- | --- |
| CPU | 2 cores, 2.0 GHz |
| RAM | 8 GB |
| Storage | 2 GB available space |
| OS | Linux, macOS, Windows 10+ |

Recommended Requirements

| Component | Specification |
| --- | --- |
| CPU | 4+ cores, 3.0+ GHz |
| RAM | 16 GB |
| GPU | NVIDIA GPU with 4GB+ VRAM (for neural network feature selection) |
| Storage | 5 GB available space |
| OS | Linux (Ubuntu 20.04+), macOS 11+, Windows 10+ |

GPU Support

  • NVIDIA GPUs: Requires CUDA 11.2+ and cuDNN 8.1+
  • AMD GPUs: ROCm support (experimental)
  • Apple Silicon (M1/M2): TensorFlow Metal plugin

Note: GPU is optional. All models can run on CPU, though neural network feature selection will be slower.


Quick Start

Using Streamlit Application

streamlit run app.py

Then open your browser to http://localhost:8501

Direct Navigation: You can navigate directly to specific sections using URL parameters:

http://localhost:8501/?page=interpretability
http://localhost:8501/?page=eda

See streamlit/URL_NAVIGATION.md for all available page identifiers.

Using Jupyter Notebook

cd notebooks
jupyter notebook data-pipeline.ipynb

Command Line Analysis (Quick Demo)

from utils.data_loader import get_train_test_split, apply_clr_transformation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Load and preprocess data
X_train, X_test, y_train, y_test, _ = get_train_test_split()
X_train_clr, X_test_clr = apply_clr_transformation(X_train, X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_clr, y_train)

# Evaluate
score = r2_score(y_test, model.predict(X_test_clr))
print(f"Test R² Score: {score:.4f}")

Usage

Streamlit Application

The web application provides five main sections:

  1. Home: Project overview and dataset statistics
  2. Data Preprocessing: Interactive data transformation and visualization
  3. Model Training: Train and compare multiple ML models
  4. Model Interpretability: Understand model predictions with LIME/SHAP
  5. Results Comparison: Cross-validation and ensemble analysis

Jupyter Notebook Workflow

The notebook is organized into sections:

  1. Housekeeping: Library imports and settings
  2. Data Preprocessing: Loading, merging, and cleaning
  3. Exploratory Data Analysis: Visualizations and statistics
  4. Model Training: Multiple ML algorithms
  5. Feature Selection: Neural network-based selection
  6. Model Interpretability: LIME and SHAP analysis
  7. Cross-Validation: Taxonomic level comparison

Data Description

Dataset Characteristics

The subset of the LucKi cohort data used here consists of:

  • 930 stool samples from multiple individuals across different families
  • ~6,900 microbiome features (taxonomic clades)
  • MetaPhlAn 4.1.1 taxonomic profiling
  • Age groups as target variable for prediction

Data Files

Located in data/raw/:

MAI3004_lucki_mpa411.csv

  • Format: CSV (converted from TSV)
  • Dimensions: 6903 rows × 932 columns
  • Content: Taxonomic profiles with relative abundances
  • Row Index: Taxonomic clade names (species to kingdom level)
  • Columns: Sample IDs prefixed with mpa411_
  • Values: Relative abundance (0-100%)
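A rough sketch of reshaping such a table to samples-as-rows, following the layout described above (the tiny in-memory frame stands in for the real CSV; the sample IDs are made up):

```python
import pandas as pd

# Tiny stand-in for MAI3004_lucki_mpa411.csv:
# rows are taxonomic clades, columns are samples prefixed with mpa411_
data = pd.DataFrame({
    "clade_name": ["k__Bacteria|g__Blautia", "k__Bacteria|g__Dorea"],
    "mpa411_S001": [12.5, 0.0],
    "mpa411_S002": [3.1, 7.4],
})

# Keep only the sample columns, then transpose so rows = samples, columns = clades
sample_cols = [c for c in data.columns if c.startswith("mpa411_")]
sample_abundances = (
    data.set_index("clade_name")[sample_cols].T.rename_axis("sample_id")
)
print(sample_abundances.shape)  # (2, 2): samples × clades
```

On the real file this yields a 930 × 6903 samples-by-features matrix, ready to merge with the metadata on sample ID.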

MetaPhlAn 4 Taxonomic Format:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Blautia_obeum

Taxonomic levels:

  • k__: Kingdom
  • p__: Phylum
  • c__: Class
  • o__: Order
  • f__: Family
  • g__: Genus
  • s__: Species
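A clade string in this format can be split into named levels with a few lines of Python (a generic helper for illustration, not part of the repository):

```python
# Map MetaPhlAn level prefixes to human-readable names
PREFIXES = {"k": "Kingdom", "p": "Phylum", "c": "Class", "o": "Order",
            "f": "Family", "g": "Genus", "s": "Species"}

def parse_clade(clade: str) -> dict:
    """Split a MetaPhlAn clade string into {level: name} pairs."""
    levels = {}
    for part in clade.split("|"):
        prefix, _, name = part.partition("__")
        levels[PREFIXES[prefix]] = name
    return levels

clade = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales"
         "|f__Lachnospiraceae|g__Blautia|s__Blautia_obeum")
print(parse_clade(clade)["Species"])  # → Blautia_obeum
```

The same idea drives the genus-level filtering mentioned later: keep only clades whose deepest prefix is `g__`.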

MAI3004_lucki_metadata_safe.csv

  • Format: CSV
  • Dimensions: 930 rows × 6 columns
  • Content: Sample metadata and demographics

Columns:

  • sample_id: Unique sample identifier
  • family_id: Family grouping identifier
  • sex: Biological sex (categorical)
  • age_group_at_sample: Age group category (target variable)
  • year_of_birth: Birth year (removed during preprocessing)
  • body_product: Sample type (removed during preprocessing)

Data Quality Metrics

| Metric | Value |
| --- | --- |
| Total samples | 930 |
| Total features | ~6,900 |
| Average genera per sample | ~1,200 |
| Average species per sample | ~300 |
| Missing values | Minimal (<1%) |
| Data sparsity | High (~80% zeros) |
| Distribution | Log-normal (typical for microbiome) |

Methodology

Preprocessing Pipeline

  1. Data Integration
  • Merge abundance table with metadata
  • Filter for common samples
  • Remove unnecessary columns
  2. Encoding
  • Label encoding for categorical variables (family_id, sex, age_group)
  • Preserve ordinal relationships where applicable
  3. Quality Control
  • Missing value detection and removal
  • Outlier analysis using IQR method
  • Normality testing with Shapiro-Wilk
  4. Normalization
  • CLR (Centered Log-Ratio) transformation
  • Accounts for compositional nature of microbiome data
  • Formula: CLR(x) = log(x / geometric_mean(x))
  5. Feature Selection
  • Genus-level filtering
  • Neural network-based selection (optional)
  • Prevalence and variance filtering
Machine Learning Pipeline

Raw Data → Preprocessing → Train/Test Split → CLR Transform → 
Feature Selection → Model Training → Evaluation → Interpretation
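The pipeline above can be sketched with scikit-learn's Pipeline API (synthetic data; the repository applies CLR through its own apply_clr_transformation helper rather than a FunctionTransformer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def clr(X, pseudocount=1e-6):
    """Stateless CLR step: log, then center each row on its mean log."""
    logged = np.log(X + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Stand-in abundance matrix with signal in the first feature
rng = np.random.default_rng(42)
X = rng.random((120, 10))
y = X[:, 0] * 3 + rng.normal(0, 0.1, 120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("clr", FunctionTransformer(clr)),                       # normalization
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"Test R² = {pipe.score(X_test, y_test):.3f}")
```

Bundling the transform and model in one object guarantees the CLR step is fitted/applied identically at train and predict time.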

Model Evaluation Metrics

  • RMSE (Root Mean Squared Error): Prediction error magnitude
  • R² Score: Proportion of variance explained
  • MAE (Mean Absolute Error): Average prediction error
  • Cross-Validation: K-fold validation for robustness
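A minimal example of computing these metrics with scikit-learn (toy arrays and synthetic data, for illustration only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # error magnitude
mae = mean_absolute_error(y_true, y_pred)           # average error
r2 = r2_score(y_true, y_pred)                       # variance explained
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")

# K-fold cross-validation on synthetic data for robustness
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.05, 100)
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, cv=5, scoring="r2")
print(f"5-fold R²: {scores.mean():.3f} ± {scores.std():.3f}")
```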

Reproducibility

Setting Random Seeds

All analyses use fixed random seeds for reproducibility:

import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Environment Specification

Full environment captured in:

  • requirements.txt: Pinned package versions
  • pyproject.toml: Development dependencies
  • Python version: 3.8+

Reproducing Results

  1. Clone exact repository version:

    git clone https://github.com/MAI-David/Data-analysis.git
    git checkout <commit-hash>  # Use specific commit for exact reproduction
  2. Install exact dependencies:

    pip install -r requirements.txt
  3. Run analysis:

    jupyter notebook notebooks/data-pipeline.ipynb

    Or:

    streamlit run app.py

Version Control

  • All code changes tracked in Git
  • See CHANGELOG.md for version history
  • Tagged releases for major versions

Documentation

UML activity diagram

---
config:
  layout: elk
  look: neo
  theme: default
---
flowchart TB
    n1["Data"] --> n3["Merged DataFrame"]
    n2["Metadata"] --> n3
    n3 --> n4["Preprocessing"] & n6@{ label: "Year of Birth<br>Body Product" } & n7["Exploratory Data Analysis"]
    n6 --> n5["Drop unused columns"]
    n4 --> n8["LabelEncoding"]
    n9["Family ID<br>Sex<br>Age group"] --> n8
    n8 --> n10["Missingness check"]
    n11["Rows with NaN Age Group"] --> n12["Drop unknown samples"]
    n10 --> n11 & n13["Outlier check"]
    n13 --> n14["Summary"] & n15["Normalisation check"]
    n15 --> n16["Summary"]
    n7 --> n17["Shape measure"]
    n17 --> n18["Samples per child"]
    n18 --> n19["Samples per age group"]
    n19 --> n20["Bacterial abundance"]
    n20 --> n21["Feature analysis"]

    n1@{ shape: db}
    n2@{ shape: db}
    n6@{ shape: manual-input}
    n5@{ shape: event}
    n9@{ shape: manual-input}
    n11@{ shape: display}
    n14@{ shape: summary}
    n16@{ shape: summary}
    n1:::Aqua
    n2:::Aqua
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    style n1 color:#000000
    style n2 color:#000000

Important variables and objects

| Name | Purpose |
| --- | --- |
| data, metadata | Raw abundance table and sample metadata loaded from ../data/raw/MAI3004_lucki_mpa411.csv and ../data/raw/MAI3004_lucki_metadata_safe.csv; shapes asserted at (6903, 932) and (930, 6). |
| sample_cols | List of abundance columns prefixed with mpa411_, used to isolate sample-level measurements. |
| sample_abundances | Transposed abundance table keyed by sample_id, created from sample_cols and clade_name. |
| metadata_common | Subset of metadata with sample IDs present in sample_abundances. |
| merged_samples | Inner merge of metadata_common and sample_abundances; drops year_of_birth and body_product. |
| encoded_samples | Copy of merged_samples with sex and family_id encoded and rows missing age_group_at_sample removed. |
| age_encoder, age_groups | LabelEncoder fitted on age_group_at_sample; age_groups maps age group labels to encoded integers. |
| missing_table | Summary of missing values per column in encoded_samples, including percentage of missing data. |
| numeric_cols, outlier_table | Numeric column list and corresponding IQR-based outlier bounds/counts. |
| normalized_samples | Copy used for Shapiro-Wilk normality checks across numeric_cols. |
| X, feature_cols | Feature matrix derived from merged_samples after removing metadata columns; drives prevalence and PCA analysis. |
| top_features, X_sub, X_scaled, X_pca | PCA prep artifacts: top 500 prevalent features, their subset matrix, scaled values, and resulting 2D projection. |
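The PCA-prep artifacts in the last row can be reproduced in miniature (synthetic data; the notebook's top 500 features reduced to top 3 for the demo):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Sparse abundance-like matrix: 30 samples × 8 taxa, ~60% zeros
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((30, 8)) * (rng.random((30, 8)) > 0.6),
                 columns=[f"taxon_{i}" for i in range(8)])

# Prevalence = fraction of samples in which a feature is non-zero
prevalence = (X > 0).mean(axis=0)
top_features = prevalence.sort_values(ascending=False).head(3).index

X_sub = X[top_features]                          # subset to prevalent taxa
X_scaled = StandardScaler().fit_transform(X_sub) # zero mean, unit variance
X_pca = PCA(n_components=2).fit_transform(X_scaled)
print(X_pca.shape)  # (30, 2)
```

Filtering by prevalence before PCA keeps the projection from being dominated by taxa that are absent in most samples.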

Project Structure

Data-analysis/
├── README.md                    # This file
├── CHANGELOG.md                 # Version history
├── pyproject.toml              # Project configuration and dependencies
├── requirements.txt            # Python package requirements
├── LICENSE                     # MIT License
├── app.py                      # Streamlit application entry point
│
├── data/                       # Data directory
│   └── raw/                    # Raw data files
│       ├── MAI3004_lucki_mpa411.csv           # Abundance data
│       ├── MAI3004_lucki_metadata_safe.csv    # Sample metadata
│       └── metaphlan411_data_description.md   # Data format docs
│
├── notebooks/                  # Jupyter notebooks
│   ├── data-pipeline.ipynb    # Main analysis notebook
│   ├── functions.py           # Helper functions
│   └── requirements.txt       # Notebook-specific dependencies
│
├── pages/                      # Streamlit pages
│   ├── __init__.py
│   ├── home.py                # Home page
│   ├── preprocessing.py       # Data preprocessing page
│   ├── models.py              # Model training page
│   ├── interpretability.py    # Model interpretability page
│   └── results.py             # Results comparison page
│
├── utils/                      # Utility modules
│   ├── __init__.py
│   └── data_loader.py         # Data loading and caching functions
│
└── outputs/                    # Analysis outputs (generated)
    └── data-pipeline-1150-2001.ipynb  # Example output

Documentation

Function Documentation

All critical functions include NumPy-style docstrings:

def apply_clr_transformation(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Apply Centered Log-Ratio transformation to microbiome abundance data.
    
    Parameters
    ----------
    X_train : pd.DataFrame
        Training feature matrix with abundance values.
    X_test : pd.DataFrame
        Test feature matrix with abundance values.
    
    Returns
    -------
    Tuple[pd.DataFrame, pd.DataFrame]
        CLR-transformed training and test datasets.
    
    Notes
    -----
    The CLR transformation accounts for the compositional nature of microbiome data.
    A small pseudocount (1e-6) is added to avoid log(0).
    """

API Documentation

For detailed API documentation, see individual module docstrings:

  • utils/data_loader.py: Data loading functions
  • notebooks/functions.py: Analysis functions
  • pages/*.py: Streamlit page implementations



Related Publications

The LucKi cohort is described in:

  • Luckey et al. (2015). "2015 LucKi cohort description." BMC Public Health. DOI: 10.1186/s12889-015-2255-7

Last Updated: 2024-01-25 Version: 1.0.0
