A comprehensive framework for converting tabular data to transformer-ready token sequences with extensive analysis of dimensionality reduction methods for mental health prediction tasks.
This repository implements a novel approach to applying transformer models to longitudinal healthcare data through intelligent tokenization strategies. The system converts structured tabular data (daily hassles, life events, GHQ measurements) into token sequences optimized for transformer processing, with comprehensive comparison of multiple dimensionality reduction techniques.
The project explores feature engineering strategies for transformer models when applied to healthcare time series data from the LORA (Longitudinal Research on Aging) study. The system processes daily hassles (DH), life events (LE), and General Health Questionnaire (GHQ) measurements to predict future stress patterns and mental health outcomes.
- 🔄 Table-to-Token Conversion: Intelligent tokenization of tabular healthcare data for transformer processing
- 📊 Comprehensive Dimensionality Reduction Analysis:
  - Linear Autoencoders (41 runs)
  - Non-linear Autoencoders (49 runs)
  - PCA (41 runs)
  - Semantic Embeddings (51 runs)
  - One-hot Encoding variants (94 runs)
- 🧠 Custom Transformer Architecture: Specialized for longitudinal healthcare sequences
- ⏰ Advanced Temporal Modeling: Age-aware and time-aware positional encoding
- 📈 Statistical Feature Grouping: Chi-square based stressor dependency analysis
- 🎯 Mental Health Prediction: GHQ outcome prediction with comprehensive evaluation
- 📊 Rich Visualization Suite: Attention heatmaps, UMAP projections, similarity matrices
- 📋 Extensive Analysis Framework: 276 experimental runs with statistical comparisons
- Julia 1.6+
- Required Julia packages (see `Project.toml`)
- Clone the repository:

```bash
git clone <repository-url>
cd table2token4transformer
```

- Activate the Julia environment:

```julia
using Pkg
Pkg.activate(".")
Pkg.instantiate()
```

- Install dependencies:

```julia
Pkg.resolve()
```

Run the main training pipeline:

```bash
julia main.jl
```

The project includes a comprehensive hyperparameter tuning script (`tune_configurations.jl`) that can sweep across multiple model parameters:

```bash
julia --project=. tune_configurations.jl
```
- Boolean Flags:
  - Age Encoding (`age_encoding`)
  - Actual Time Encoding (`actual_time_encoding`)
  - Absolute Positional Encoding (`absolute_positional_encoding`)
  - Relative Positional Encoding (`relative_positional_encoding`)
  - End Flag (`end_flag`)
- Model Architecture:
  - Attention Head Size: [32, 64]
  - Number of Attention Heads: [4, 8]
  - Embedding Size: [64, 128]
  - Projection Size: [32, 64]
- Training Parameters:
  - Learning Rate: [1e-4, 5e-4, 1e-3]
  - Epochs: [50, 100]
The tuning script automatically:
- Generates all valid parameter combinations
- Skips invalid configurations (e.g., projection size > embedding size)
- Creates temporary config files for each run
- Executes training with each configuration
- Records validation loss
- Produces a summary table of results
Results are stored in the `runs/` directory with unique timestamps for each experiment.
The project includes several analysis scripts that automatically generate comprehensive reports and visualizations:
```bash
# Full analysis with all plots and tables
python tensorboard_analysis.py --mode full

# Only generate UMAP plots
python tensorboard_analysis.py --mode umap

# Only generate word embedding similarity matrices
python tensorboard_analysis.py --mode word_embedding
```

```bash
# Generate 2D embedding comparisons and similarity heatmaps
julia plotting_dim_reduction.jl
```

```bash
# Analyze attention patterns across different model configurations
julia analyze_attention.jl
```

Note: All analysis scripts automatically create the necessary output directories (`analysis_results/`, `analysis_results/attention_analysis/`, etc.) if they don't exist. No manual directory creation is required.
All model parameters are configurable through `config.txt`. Key parameters include:

```
# Stressor Selection
included_stressors: ["dh_10", "dh_14", "dh_15", ...]  # Specific stressors to include

# Data Processing
min_streak_size: 1                    # Minimum streak length for denoising
grouping_threshold: 1                 # Chi-square grouping threshold
appearance_disappearance_flag: false  # Enable appearance/disappearance encoding

# Model Architecture
d_model: 26                           # Model dimension
number_of_heads: 1                    # Attention heads
number_of_transformer_encoders: 1     # Number of encoder layers
latent_dim: 32                        # Dimensionality reduction target

# Training
batch_size: 8                         # Batch size
epochs: 50                            # Training epochs
lr: 0.005                             # Learning rate

# Positional Encoding
relative_positional_encoding: true    # Enable relative positional encoding
age_encoding: false                   # Include age information
actual_time_encoding: false           # Include actual time intervals
```
- Daily Hassles (DH): 58 categories representing routine stressors
- Life Events (LE): 27 categories representing significant life changes
- Mental Health Indicators: GHQ subscales measuring psychological distress
The system implements comprehensive chi-square analysis to identify statistical dependencies between stressors:
- Pairwise Analysis: Constructs 2×2 contingency tables for all stressor pairs
- Participant-Level Statistics: Computes chi-square statistics for each individual
- Population Aggregation: Two methods available:
  - Count Method: Counts significant associations (χ² > 3.841, p < 0.05)
  - Median Method: Robust central tendency measure
- Dependency-Based Grouping: Hierarchical clustering based on statistical dependencies
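To make the pairwise step concrete, the sketch below builds a 2×2 contingency table from two binary stressor indicators over one participant's visits and computes the uncorrected chi-square statistic against the 3.841 threshold. It is a minimal Python illustration (function name and example sequences are hypothetical), not the repository's Julia implementation.

```python
def chi2_2x2(a, b):
    """Uncorrected chi-square statistic for the 2x2 contingency table
    built from two binary stressor sequences over a participant's visits."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    n = n11 + n10 + n01 + n00
    r1, r0 = n11 + n10, n01 + n00   # row totals
    c1, c0 = n11 + n01, n10 + n00   # column totals
    if min(r1, r0, c1, c0) == 0:
        return 0.0  # degenerate table: no evidence of dependence
    return n * (n11 * n00 - n10 * n01) ** 2 / (r1 * r0 * c1 * c0)

a = [1, 1, 0, 1, 0, 0, 1, 0]  # stressor A present/absent per visit
b = [1, 1, 0, 1, 0, 1, 1, 0]  # stressor B present/absent per visit
print(chi2_2x2(a, b))          # 4.8
print(chi2_2x2(a, b) > 3.841)  # True: significant at p < 0.05 (1 dof)
```

In the count aggregation method described above, this comparison against 3.841 is repeated per participant and the significant pairs are tallied across the population.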
- Binarization: Converts raw scores to binary indicators with configurable thresholds
- Denoising: Removes isolated occurrences below minimum streak size
- Sequence Representation:
  - Standard: Visit-level feature vectors
  - Appearance-Disappearance: Explicit onset/cessation modeling
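The denoising step can be sketched as follows, assuming (as the `min_streak_size` parameter suggests) that runs of occurrences shorter than the minimum streak length are zeroed out. This is a Python illustration of the idea, not the repository's Julia code.

```python
def denoise(seq, min_streak_size=2):
    """Zero out runs of 1s shorter than min_streak_size in a binary
    per-visit stressor sequence (isolated occurrences are treated as noise)."""
    out = list(seq)
    i = 0
    while i < len(seq):
        if seq[i] == 1:
            j = i
            while j < len(seq) and seq[j] == 1:
                j += 1          # find the end of the current run of 1s
            if j - i < min_streak_size:
                for k in range(i, j):
                    out[k] = 0  # run too short: remove it
            i = j
        else:
            i += 1
    return out

print(denoise([0, 1, 0, 1, 1, 1, 0, 1], min_streak_size=2))
# [0, 0, 0, 1, 1, 1, 0, 0]: both isolated occurrences removed
```

With the default `min_streak_size: 1` from `config.txt`, no occurrences are removed.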
- Embedding Layer: Converts dimensionality-reduced inputs with optional positional encoding
- Multi-Head Attention: Scaled dot-product attention with causal masking
- Feed-Forward Networks: Position-wise dense layers with residual connections
- Prediction Head: Temperature-scaled softmax for next-token prediction
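The causally masked attention named above follows the standard formulation: position t may only attend to positions ≤ t. A minimal single-head sketch in Python/NumPy (the repository implements this in Julia):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly future positions
    scores[mask] = -np.inf                            # masked out before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(np.allclose(np.triu(w, k=1), 0))  # True: no attention to the future
```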
- Absolute Positional Encoding: Sinusoidal encoding based on sequence position
- Relative Positional Encoding: Distance-based encoding with learnable scaling
- Time-Aware Encoding: Incorporates actual time intervals and participant age
- Visit-Aware Encoding: Models follow-up visit boundaries using separator tokens
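For the absolute variant, the standard sinusoidal formulation applies; a natural reading of the time-aware variant is to feed actual elapsed times (or ages) instead of integer positions, though that interpretation is an assumption here. A minimal Python sketch:

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Sinusoidal positional encoding. `positions` may be integer sequence
    indices (absolute encoding) or, speculatively, actual elapsed times
    (time-aware variant). Returns an array of shape (len(positions), d_model)."""
    positions = np.asarray(positions, dtype=float)[:, None]              # (T, 1)
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((len(positions), d_model))
    pe[:, 0::2] = np.sin(positions * div)  # even dimensions
    pe[:, 1::2] = np.cos(positions * div)  # odd dimensions
    return pe

pe = sinusoidal_pe(range(6), 8)          # regular integer positions
pe_t = sinusoidal_pe([0, 3, 7, 30], 8)   # irregular visit times in days
print(pe.shape, pe_t.shape)  # (6, 8) (4, 8)
```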
The framework implements and compares six dimensionality reduction approaches:
| Method | Runs | Performance (Val Loss) | Description |
|---|---|---|---|
| One-hot (linear) | 49 | 1.362 ± 0.153 ⭐ | Direct linear embedding of categorical data |
| One-hot (lookup) | 45 | 1.372 ± 0.275 | Lookup table-based embeddings |
| PCA | 41 | 1.398 ± 0.156 | Principal component analysis |
| Semantic Embeddings | 51 | 1.397 ± 0.172 | Pre-trained semantic representations |
| Linear Autoencoders | 41 | 1.516 ± 0.240 | Neural network-based linear compression |
| Non-linear Autoencoders | 49 | 2.000 ± 0.433 | Deep non-linear dimensionality reduction |
Results show that simple one-hot encodings outperform complex dimensionality reduction methods for this healthcare prediction task.
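For reference, the PCA baseline in the table reduces to a centered SVD projection onto the top `latent_dim` components. A minimal NumPy sketch (the feature count 85 = 58 DH + 27 LE and the sample data are illustrative, not the study data):

```python
import numpy as np

def pca_reduce(X, latent_dim):
    """Project rows of X onto the top `latent_dim` principal components."""
    Xc = X - X.mean(axis=0)                            # center features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
    return Xc @ Vt[:latent_dim].T

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 85))  # e.g. 58 DH + 27 LE features per visit
Z = pca_reduce(X, 32)           # latent_dim from config.txt
print(Z.shape)  # (100, 32)
```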
- Specialized Loss Function: Visit-aware cross-entropy with a reduced penalty on end tokens
- Adam Optimizer: Configurable learning rates with gradient clipping
- Validation Strategy: Participant-level splitting to prevent data leakage
- Early Stopping: Validation-based convergence detection
- Visit-Level Accuracy: Exact match between predicted and actual visits
- Event-Level Accuracy: Element-wise accuracy for partial matches
- GHQ-Specific Metrics: Mental health outcome prediction accuracy
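The distinction between the first two metrics can be sketched as follows, with visits represented as binary event vectors (function names and example data are hypothetical):

```python
def visit_level_accuracy(pred_visits, true_visits):
    """Fraction of visits predicted exactly (every event matches)."""
    hits = sum(p == t for p, t in zip(pred_visits, true_visits))
    return hits / len(true_visits)

def event_level_accuracy(pred_visits, true_visits):
    """Element-wise accuracy, giving credit for partially correct visits."""
    correct = total = 0
    for p, t in zip(pred_visits, true_visits):
        correct += sum(a == b for a, b in zip(p, t))
        total += len(t)
    return correct / total

pred  = [[1, 0, 1], [0, 0, 1]]
truth = [[1, 0, 1], [0, 1, 1]]
print(visit_level_accuracy(pred, truth))  # 0.5: one of two visits exact
print(event_level_accuracy(pred, truth))  # 5/6: one wrong event out of six
```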
- Cross-Entropy Loss: Standard sequence modeling loss
- Repetition Baseline: Previous visit repetition strategy
- Linear Regression: Traditional regression on flattened sequences
- Ridge Regression: Regularized linear models
- Intercept-Only: Mean prediction baseline
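The repetition baseline listed above simply predicts each visit as a copy of the previous one; a minimal sketch (names and data are illustrative):

```python
def repetition_baseline(visits):
    """Predict each visit as an exact copy of the previous one.
    Returns (predictions, targets), starting from the second visit."""
    return visits[:-1], visits[1:]

visits = [[1, 0], [1, 1], [1, 1], [0, 1]]
preds, targets = repetition_baseline(visits)
print(sum(p == t for p, t in zip(preds, targets)) / len(targets))
# 1 of 3 repeated visits is an exact match
```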
- Attention Heatmaps: Visualization of attention patterns
- Embedding Projections: 2D visualizations using PCA and t-SNE
- UMAP Clustering: Exploration of learned representations
- Chi-Square Dependency Matrices: Stressor co-occurrence patterns
- Baseline Variable Analysis: Demographic and clinical correlations
```
table2token4transformer/
├── main.jl                        # Main training script
├── main_figure.pdf                # Main methodology figure
├── config.json                    # Configuration parameters
├── Project.toml                   # Julia package dependencies
├── tensorboard_analysis.py        # Comprehensive analysis framework
├── src/                           # Source code
│   ├── transformer_layer.jl       # Transformer implementation
│   ├── prepare_data.jl            # Data preprocessing pipeline
│   ├── eval.jl                    # Evaluation metrics
│   ├── visualization.jl           # Plotting and visualization
│   ├── regression.jl              # Baseline regression models
│   ├── parameters.jl              # Parameter structures
│   ├── logging.jl                 # Experiment logging
│   └── learn_word_embedding/      # Dimensionality reduction
│       ├── word_embedding.jl      # Autoencoder implementations
│       ├── loading.jl             # Data loading utilities
│       ├── create_data_sequence.jl # Sequence creation
│       └── get_data_report.jl     # Chi-square analysis
├── analysis_results/              # Analysis outputs (auto-generated)
│   ├── dimensionality_reduction_table.tex
│   ├── hyperparameter_analysis.csv
│   ├── word_embedding_similarity_matrices.pdf
│   └── umap_*.png                 # UMAP visualizations
├── data/                          # Data directory
├── runs_*/                        # Experiment outputs (multiple sets)
└── notebooks/                     # Jupyter notebooks
```
Each experimental run creates:
- Model Checkpoints: Trained model weights (`.bson`)
- Configuration Logs: Complete parameter settings
- Evaluation Reports: Comprehensive metrics (`eval.txt`)
- Visualizations:
  - Attention heatmaps (`.pdf`)
  - Embedding projections (`.pdf`)
  - Chi-square dependency matrices (`.pdf`)
  - UMAP clusterings (`.pdf`)
- TensorBoard Logs: Training progress monitoring
The analysis scripts generate comprehensive reports in the `analysis_results/` directory:
- LaTeX Tables: Performance summaries and hyperparameter analysis (`.tex`)
- CSV Reports: Detailed metrics and statistics (`.csv`)
- Visualizations:
  - 2D embedding comparisons (`.pdf`)
  - Token similarity heatmaps (`.pdf`)
  - UMAP clustering plots (`.png`)
  - Attention pattern analysis (`.pdf`)
- Recommendations: JSON files with optimal hyperparameter suggestions
Directory Management: All output directories are created automatically by the scripts. No manual setup required.
- Automatic timestamping and versioning
- Reproducible random seeds
- Complete parameter logging
- Git integration for code versioning
This methodology is particularly suitable for:
- Longitudinal Healthcare Analysis: Time series prediction in clinical settings
- Mental Health Research: Stress pattern analysis and prediction
- Behavioral Modeling: Understanding temporal dependencies in human behavior
- Feature Engineering Research: Novel approaches to sequence representation
- Transformer Adaptations: Specialized architectures for healthcare data
If you use this code in your research, please cite:
```bibtex
@misc{table2token4transformer,
  title={table2token4transformer: Converting Tabular Data to Transformer-Ready Token Sequences for Mental Health Prediction},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/[username]/table2token4transformer}}
}
```

[Specify your license here]
[Contributing guidelines if applicable]
[Your contact information]
Note: This implementation is part of ongoing research in feature engineering for transformer models applied to healthcare data. The methodology emphasizes interpretability, clinical relevance, and statistical rigor in feature construction.
