A comprehensive platform for analyzing microbiome data from the LucKi cohort, featuring machine learning models that predict age group from gut microbiome taxonomic profiles. The platform includes both an interactive Streamlit web application and a Jupyter notebook-based analysis.
Keywords: microbiome, metagenomics, machine-learning, bioinformatics, MetaPhlAn, age-prediction, taxonomic-profiling, compositional-data, CLR-transformation, feature-selection, model-interpretability, LIME, SHAP, Random-Forest, XGBoost, neural-networks, streamlit, python, data-science
- Repository: https://github.com/MAI-David/Data-analysis
- Project Name: Microbiome Data Analysis Platform
- Version: 1.0.0
- DOI: 10.5281/zenodo.18302927
- Data Source: LucKi Cohort
- License: AGPL-3.0
- Features
- Organization
- Installation
- Hardware Requirements
- Quick Start
- Usage
- Data Description
- Methodology
- Reproducibility
- Project Structure
- Documentation
- Contributing
- Citation
- License
- CLR Transformation: Handles compositional nature of microbiome data
- Label Encoding: Automatic encoding of categorical variables
- Missing Value Handling: Robust preprocessing pipeline
- Train-Test Split: Stratified splitting to maintain class balance
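The stratified split can be sketched with scikit-learn's `train_test_split` (toy data shown here as a stand-in; the real pipeline loads the LucKi tables):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix and imbalanced labels (illustrative only)
X = pd.DataFrame(np.random.rand(100, 5))
y = pd.Series([0] * 70 + [1] * 30, name="age_group")

# stratify=y keeps the 70/30 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

With `test_size=0.2`, the 20 test samples preserve the original class balance (14 of class 0, 6 of class 1).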
- Random Forest: Ensemble learning with decision trees
- XGBoost: Gradient boosting with regularization
- Gradient Boosting: Sequential ensemble learning
- LightGBM: High-efficiency gradient boosting
- AdaBoost: Adaptive boosting algorithm
- Neural Networks: Feature selection with gatekeeper layers
- LIME: Local interpretable model-agnostic explanations
- SHAP: SHapley additive explanations
- Feature Importance: Analysis and visualization
- Cross-Validation: K-fold validation for robustness
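K-fold validation as listed above can be sketched with scikit-learn (synthetic regression data standing in for the CLR-transformed abundances):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data as a stand-in for the real feature matrix
X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=42)

# 5-fold CV with shuffling and a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y, cv=cv, scoring="r2",
)
print(f"R² per fold: {scores.round(3)}; mean = {scores.mean():.3f}")
```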
- Streamlit Application: Web-based interactive interface
- Real-time Analysis: Dynamic model training and evaluation
- Visualization: Comprehensive plotting and comparison tools
- User-Friendly: No coding required for basic analysis
| Notebook Name | Contents |
|---|---|
| `data-pipeline` | Full process of the entire project, from pre-processing/exploration of the dataset to finalization of the model. Includes exploratory tests. |
| `data_analysis` | Pre-processing of data and exploratory data analysis (EDA). |
| `predicting_models` | The various models trained and tested at various stages. |
| `model_results` | Visualizations of each model's performance. |
- Python 3.8 or higher
- pip package manager
- (Optional) GPU support for TensorFlow
- Minimum 8GB RAM (16GB recommended for neural network feature selection)
```bash
git clone https://github.com/MAI-David/Data-analysis.git
cd Data-analysis2
```

Using venv:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Using conda:

```bash
conda create -n microbiome python=3.10
conda activate microbiome
```

For Streamlit Application:

```bash
pip install -r requirements.txt
```

For Jupyter Notebook Analysis:

```bash
cd notebooks
pip install -r requirements.txt
```

For Development (with code quality tools):

```bash
pip install -e ".[dev]"
```

Verify the installation:

```bash
python -c "import streamlit; import pandas; import sklearn; import xgboost; print('Installation successful!')"
```

Minimum:

| Component | Specification |
|---|---|
| CPU | 2 cores, 2.0 GHz |
| RAM | 8 GB |
| Storage | 2 GB available space |
| OS | Linux, macOS, Windows 10+ |
Recommended:

| Component | Specification |
|---|---|
| CPU | 4+ cores, 3.0+ GHz |
| RAM | 16 GB |
| GPU | NVIDIA GPU with 4GB+ VRAM (for neural network feature selection) |
| Storage | 5 GB available space |
| OS | Linux (Ubuntu 20.04+), macOS 11+, Windows 10+ |
- NVIDIA GPUs: Requires CUDA 11.2+ and cuDNN 8.1+
- AMD GPUs: ROCm support (experimental)
- Apple Silicon (M1/M2): TensorFlow Metal plugin
Note: GPU is optional. All models can run on CPU, though neural network feature selection will be slower.
```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501
Direct Navigation: You can navigate directly to specific sections using URL parameters:
http://localhost:8501/?page=interpretability
http://localhost:8501/?page=eda
See streamlit/URL_NAVIGATION.md for all available page identifiers.
```bash
cd notebooks
jupyter notebook data-pipeline.ipynb
```

```python
from utils.data_loader import get_train_test_split, apply_clr_transformation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Load and preprocess data
X_train, X_test, y_train, y_test, _ = get_train_test_split()
X_train_clr, X_test_clr = apply_clr_transformation(X_train, X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_clr, y_train)

# Evaluate
score = r2_score(y_test, model.predict(X_test_clr))
print(f"Test R² Score: {score:.4f}")
```

The web application provides five main sections:
- Home: Project overview and dataset statistics
- Data Preprocessing: Interactive data transformation and visualization
- Model Training: Train and compare multiple ML models
- Model Interpretability: Understand model predictions with LIME/SHAP
- Results Comparison: Cross-validation and ensemble analysis
The notebook is organized into sections:
- Housekeeping: Library imports and settings
- Data Preprocessing: Loading, merging, and cleaning
- Exploratory Data Analysis: Visualizations and statistics
- Model Training: Multiple ML algorithms
- Feature Selection: Neural network-based selection
- Model Interpretability: LIME and SHAP analysis
- Cross-Validation: Taxonomic level comparison
The LucKi cohort subdataset consists of:
- 930 stool samples from multiple individuals across different families
- ~6,900 microbiome features (taxonomic clades)
- MetaPhlAn 4.1.1 taxonomic profiling
- Age groups as target variable for prediction
Located in data/raw/:
- Format: CSV (converted from TSV)
- Dimensions: 6903 rows × 932 columns
- Content: Taxonomic profiles with relative abundances
- Row Index: Taxonomic clade names (species to kingdom level)
- Columns: Sample IDs prefixed with `mpa411_`
- Values: Relative abundance (0-100%)
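Given that layout, isolating the sample columns by their prefix and transposing to a samples-by-clades matrix might look like this (a tiny stand-in table is used; the real file has 6903 rows × 932 columns):

```python
import pandas as pd

# Tiny stand-in for the abundance table: clades as rows, samples as columns
data = pd.DataFrame(
    {"mpa411_S001": [98.0, 2.0], "mpa411_S002": [60.0, 40.0]},
    index=["k__Bacteria", "k__Archaea"],
)
data.index.name = "clade_name"

# Select sample columns by prefix, then transpose to samples x clades
sample_cols = [c for c in data.columns if c.startswith("mpa411_")]
sample_abundances = data[sample_cols].T
```

This mirrors how `sample_cols` and `sample_abundances` are built in the notebook pipeline.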
MetaPhlAn 4 Taxonomic Format:

```
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Blautia_obeum
```

Taxonomic levels:
- `k__`: Kingdom
- `p__`: Phylum
- `c__`: Class
- `o__`: Order
- `f__`: Family
- `g__`: Genus
- `s__`: Species
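A small helper (hypothetical, not part of the codebase) can recover the deepest taxonomic level from a clade string by inspecting the last `|`-separated segment:

```python
LEVELS = {"k": "Kingdom", "p": "Phylum", "c": "Class", "o": "Order",
          "f": "Family", "g": "Genus", "s": "Species"}

def clade_level(clade_name: str) -> str:
    """Return the deepest taxonomic level encoded in a MetaPhlAn clade string."""
    prefix = clade_name.split("|")[-1].split("__")[0]  # e.g. "s" from "s__Blautia_obeum"
    return LEVELS[prefix]

clade = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|"
         "f__Lachnospiraceae|g__Blautia|s__Blautia_obeum")
print(clade_level(clade))  # Species
```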
- Format: CSV
- Dimensions: 930 rows × 6 columns
- Content: Sample metadata and demographics
Columns:
- `sample_id`: Unique sample identifier
- `family_id`: Family grouping identifier
- `sex`: Biological sex (categorical)
- `age_group_at_sample`: Age group category (target variable)
- `year_of_birth`: Birth year (removed during preprocessing)
- `body_product`: Sample type (removed during preprocessing)
| Metric | Value |
|---|---|
| Total samples | 930 |
| Total features | ~6,900 |
| Average genera per sample | ~1,200 |
| Average species per sample | ~300 |
| Missing values | Minimal (<1%) |
| Data sparsity | High (~80% zeros) |
| Distribution | Log-normal (typical for microbiome) |
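The sparsity figure above can be reproduced by counting zero entries; `sparsity` here is an illustrative helper, not a project function:

```python
import pandas as pd

def sparsity(df: pd.DataFrame) -> float:
    """Fraction of zero entries in an abundance matrix."""
    return float((df.to_numpy() == 0).mean())

# Toy matrix: 6 of 8 entries are zero
toy = pd.DataFrame([[0.0, 5.0], [0.0, 0.0], [3.0, 0.0], [0.0, 0.0]])
print(sparsity(toy))  # 0.75
```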
- Data Integration
- Merge abundance table with metadata
- Filter for common samples
- Remove unnecessary columns
- Encoding
- Label encoding for categorical variables (family_id, sex, age_group)
- Preserve ordinal relationships where applicable
- Quality Control
- Missing value detection and removal
- Outlier analysis using IQR method
- Normality testing with Shapiro-Wilk
- Normalization
- CLR (Centered Log-Ratio) transformation
- Accounts for compositional nature of microbiome data
- Formula: `CLR(x) = log(x / geometric_mean(x))`
- Feature Selection
- Genus-level filtering
- Neural network-based selection (optional)
- Prevalence and variance filtering
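The IQR-based outlier check from the quality-control step can be sketched with Tukey's standard 1.5×IQR fences (`iqr_bounds` is an illustrative helper):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([1, 2, 3, 4, 100])
low, high = iqr_bounds(s)           # (-1.0, 7.0)
outliers = s[(s < low) | (s > high)]  # the value 100
```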
Raw Data → Preprocessing → Train/Test Split → CLR Transform →
Feature Selection → Model Training → Evaluation → Interpretation
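The CLR step in this flow follows the formula given above; a minimal sketch (the project's actual implementation is `apply_clr_transformation` in `utils/data_loader.py`, which uses a 1e-6 pseudocount):

```python
import numpy as np
import pandas as pd

def clr_transform(df: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """CLR(x) = log(x / geometric_mean(x)); pseudocount avoids log(0)."""
    logged = np.log(df + pseudocount)
    # Subtracting the row-wise mean of logs equals dividing by the geometric mean
    return logged.sub(logged.mean(axis=1), axis=0)

abund = pd.DataFrame([[0.0, 10.0, 90.0], [50.0, 25.0, 25.0]],
                     columns=["taxon_a", "taxon_b", "taxon_c"])
clr = clr_transform(abund)
```

A useful sanity check: each CLR-transformed row sums to zero, reflecting the compositional constraint.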
- RMSE (Root Mean Squared Error): Prediction error magnitude
- R² Score: Proportion of variance explained
- MAE (Mean Absolute Error): Average prediction error
- Cross-Validation: K-fold validation for robustness
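These metrics map directly onto scikit-learn functions; a worked toy example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 2.5, 3.5])  # every prediction off by 0.5

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, R²={r2:.2f}")  # RMSE=0.50, MAE=0.50, R²=0.80
```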
All analyses use fixed random seeds for reproducibility:

```python
import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```

Full environment captured in:
- `requirements.txt`: Pinned package versions
- `pyproject.toml`: Development dependencies
- Python version: 3.8+
1. Clone the exact repository version:

   ```bash
   git clone https://github.com/MAI-David/Data-analysis.git
   cd Data-analysis2
   git checkout <commit-hash>  # Use specific commit for exact reproduction
   ```

2. Install exact dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run analysis:

   ```bash
   jupyter notebook notebooks/data-pipeline.ipynb
   ```

   Or:

   ```bash
   streamlit run app.py
   ```
- All code changes tracked in Git
- See `CHANGELOG.md` for version history
- Tagged releases for major versions
```mermaid
---
config:
  layout: elk
  look: neo
  theme: default
---
flowchart TB
    n1["Data"] --> n3["Merged DataFrame"]
    n2["Metadata"] --> n3
    n3 --> n4["Preprocessing"] & n6@{ label: "Year of Birth<br>Body Product" } & n7["Exploratory Data Analysis"]
    n6 --> n5["Drop unused columns"]
    n4 --> n8["LabelEncoding"]
    n9["Family ID<br>Sex<br>Age group"] --> n8
    n8 --> n10["Missingness check"]
    n11["Rows with NaN Age Group"] --> n12["Drop unknown samples"]
    n10 --> n11 & n13["Outlier check"]
    n13 --> n14["Summary"] & n15["Normalisation check"]
    n15 --> n16["Summary"]
    n7 --> n17["Shape measure"]
    n17 --> n18["Samples per child"]
    n18 --> n19["Samples per age group"]
    n19 --> n20["Bacterial abundance"]
    n20 --> n21["Feature analysis"]
    n1@{ shape: db}
    n2@{ shape: db}
    n6@{ shape: manual-input}
    n5@{ shape: event}
    n9@{ shape: manual-input}
    n11@{ shape: display}
    n14@{ shape: summary}
    n16@{ shape: summary}
    n1:::Aqua
    n2:::Aqua
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    style n1 color:#000000
    style n2 color:#000000
```
| Name | Purpose |
|---|---|
| `data`, `metadata` | Raw abundance table and sample metadata loaded from `../data/raw/MAI3004_lucki_mpa411.csv` and `../data/raw/MAI3004_lucki_metadata_safe.csv`; shapes asserted at (6903, 932) and (930, 6). |
| `sample_cols` | List of abundance columns prefixed with `mpa411_`, used to isolate sample-level measurements. |
| `sample_abundances` | Transposed abundance table keyed by `sample_id`, created from `sample_cols` and `clade_name`. |
| `metadata_common` | Subset of metadata with sample IDs present in `sample_abundances`. |
| `merged_samples` | Inner merge of `metadata_common` and `sample_abundances`; drops `year_of_birth` and `body_product`. |
| `encoded_samples` | Copy of `merged_samples` with `sex` and `family_id` encoded and rows missing `age_group_at_sample` removed. |
| `age_encoder`, `age_groups` | `LabelEncoder` fitted on `age_group_at_sample`; `age_groups` maps age group labels to encoded integers. |
| `missing_table` | Summary of missing values per column in `encoded_samples`, including percentage of missing data. |
| `numeric_cols`, `outlier_table` | Numeric column list and corresponding IQR-based outlier bounds/counts. |
| `normalized_samples` | Copy used for Shapiro-Wilk normality checks across `numeric_cols`. |
| `X`, `feature_cols` | Feature matrix derived from `merged_samples` after removing metadata columns; drives prevalence and PCA analysis. |
| `top_features`, `X_sub`, `X_scaled`, `X_pca` | PCA prep artifacts: top 500 prevalent features, their subset matrix, scaled values, and resulting 2D projection. |
```
Data-analysis2/
├── README.md                 # This file
├── CHANGELOG.md              # Version history
├── pyproject.toml            # Project configuration and dependencies
├── requirements.txt          # Python package requirements
├── LICENSE                   # AGPL-3.0 License
├── app.py                    # Streamlit application entry point
│
├── data/                     # Data directory
│   └── raw/                  # Raw data files
│       ├── MAI3004_lucki_mpa411.csv          # Abundance data
│       ├── MAI3004_lucki_metadata_safe.csv   # Sample metadata
│       └── metaphlan411_data_description.md  # Data format docs
│
├── notebooks/                # Jupyter notebooks
│   ├── data-pipeline.ipynb   # Main analysis notebook
│   ├── functions.py          # Helper functions
│   └── requirements.txt      # Notebook-specific dependencies
│
├── pages/                    # Streamlit pages
│   ├── __init__.py
│   ├── home.py               # Home page
│   ├── preprocessing.py      # Data preprocessing page
│   ├── models.py             # Model training page
│   ├── interpretability.py   # Model interpretability page
│   └── results.py            # Results comparison page
│
├── utils/                    # Utility modules
│   ├── __init__.py
│   └── data_loader.py        # Data loading and caching functions
│
└── outputs/                  # Analysis outputs (generated)
    └── data-pipeline-1150-2001.ipynb  # Example output
```
All critical functions include NumPy-style docstrings:

```python
def apply_clr_transformation(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Apply Centered Log-Ratio transformation to microbiome abundance data.

    Parameters
    ----------
    X_train : pd.DataFrame
        Training feature matrix with abundance values.
    X_test : pd.DataFrame
        Test feature matrix with abundance values.

    Returns
    -------
    Tuple[pd.DataFrame, pd.DataFrame]
        CLR-transformed training and test datasets.

    Notes
    -----
    The CLR transformation accounts for the compositional nature of
    microbiome data. A small pseudocount (1e-6) is added to avoid log(0).
    """
```

For detailed API documentation, see individual module docstrings:
- `utils/data_loader.py`: Data loading functions
- `notebooks/functions.py`: Analysis functions
- `pages/*.py`: Streamlit page implementations
The LucKi cohort is described in:
- Luckey et al. (2015). "2015 LucKi cohort description." BMC Public Health. DOI: 10.1186/s12889-015-2255-7
Last Updated: 2024-01-25 Version: 1.0.0