table2token4transformer 🔄

A comprehensive framework for converting tabular data to transformer-ready token sequences with extensive analysis of dimensionality reduction methods for mental health prediction tasks.

Main figure: methodology overview (see main_figure.pdf)

This repository implements a novel approach to applying transformer models to longitudinal healthcare data through intelligent tokenization strategies. The system converts structured tabular data (daily hassles, life events, GHQ measurements) into token sequences optimized for transformer processing, with comprehensive comparison of multiple dimensionality reduction techniques.

Overview

The project explores feature engineering strategies for transformer models when applied to healthcare time series data from the LORA (Longitudinal Research on Aging) study. The system processes daily hassles (DH), life events (LE), and General Health Questionnaire (GHQ) measurements to predict future stress patterns and mental health outcomes.

Key Features

🔄 Table-to-Token Conversion: Intelligent tokenization of tabular healthcare data for transformer processing
📊 Comprehensive Dimensionality Reduction Analysis:
  • Linear Autoencoders (41 runs)
  • Non-linear Autoencoders (49 runs)
  • PCA (41 runs)
  • Semantic Embeddings (51 runs)
  • One-hot Encoding variants (94 runs)
🧠 Custom Transformer Architecture: Specialized for longitudinal healthcare sequences
⏰ Advanced Temporal Modeling: Age-aware and time-aware positional encoding
📈 Statistical Feature Grouping: Chi-square based stressor dependency analysis
🎯 Mental Health Prediction: GHQ outcome prediction with comprehensive evaluation
📊 Rich Visualization Suite: Attention heatmaps, UMAP projections, similarity matrices
📋 Extensive Analysis Framework: 276 experimental runs with statistical comparisons

Installation

Prerequisites

  • Julia 1.6+
  • Required Julia packages (see Project.toml)

Setup

  1. Clone the repository:
git clone <repository-url>
cd table2token4transformer
  2. Activate the Julia environment and install the dependencies (Pkg.instantiate() installs everything pinned in the manifest):
using Pkg
Pkg.activate(".")
Pkg.instantiate()
  3. If package versions conflict, resolve the dependency graph:
Pkg.resolve()

Usage

Basic Usage

Run the main training pipeline:

julia main.jl

Hyperparameter Tuning

The project includes a comprehensive hyperparameter tuning script (tune_configurations.jl) that can sweep across multiple model parameters:

julia --project=. tune_configurations.jl

Tunable Parameters

  1. Boolean Flags:

    • Age Encoding (age_encoding)
    • Actual Time Encoding (actual_time_encoding)
    • Absolute Positional Encoding (absolute_positional_encoding)
    • Relative Positional Encoding (relative_positional_encoding)
    • End Flag (end_flag)
  2. Model Architecture:

    • Attention Head Size: [32, 64]
    • Number of Attention Heads: [4, 8]
    • Embedding Size: [64, 128]
    • Projection Size: [32, 64]
  3. Training Parameters:

    • Learning Rate: [1e-4, 5e-4, 1e-3]
    • Epochs: [50, 100]

The tuning script automatically:

  • Generates all valid parameter combinations
  • Skips invalid configurations (e.g., projection size > embedding size)
  • Creates temporary config files for each run
  • Executes training with each configuration
  • Records validation loss
  • Produces a summary table of results

Results are stored in the runs_*/ directories, with a unique timestamp for each experiment.
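The sweep amounts to a filtered Cartesian product over the grid above. The following minimal Julia sketch illustrates the idea (boolean flags omitted for brevity; names such as run_training are hypothetical and not the actual tune_configurations.jl code):

# Enumerate all combinations, then drop invalid ones.
grid = Iterators.product(
    [32, 64],             # attention head size
    [4, 8],               # number of attention heads
    [64, 128],            # embedding size
    [32, 64],             # projection size
    [1e-4, 5e-4, 1e-3],   # learning rate
    [50, 100])            # epochs

configs = [(; head, nheads, emb, proj, lr, epochs)
           for (head, nheads, emb, proj, lr, epochs) in grid
           if proj <= emb]   # skip invalid configurations

for cfg in configs
    # write a temporary config file, launch training, record the validation loss
    # run_training(cfg)    # hypothetical helper
end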

Analysis and Visualization

The project includes several analysis scripts that automatically generate comprehensive reports and visualizations:

TensorBoard Analysis

# Full analysis with all plots and tables
python tensorboard_analysis.py --mode full

# Only generate UMAP plots
python tensorboard_analysis.py --mode umap

# Only generate word embedding similarity matrices
python tensorboard_analysis.py --mode word_embedding

Dimensionality Reduction Visualization

# Generate 2D embedding comparisons and similarity heatmaps
julia plotting_dim_reduction.jl

Attention Analysis

# Analyze attention patterns across different model configurations
julia analyze_attention.jl

Note: All analysis scripts automatically create the necessary output directories (analysis_results/, analysis_results/attention_analysis/, etc.) if they don't exist. No manual directory creation is required.

Configuration

All model parameters are configurable through config.txt. Key parameters include:

# Stressor Selection
included_stressors: ["dh_10", "dh_14", "dh_15", ...] # Specific stressors to include

# Data Processing
min_streak_size: 1                    # Minimum streak length for denoising
grouping_threshold: 1                 # Chi-square grouping threshold
appearance_disappearance_flag: false  # Enable appearance/disappearance encoding

# Model Architecture
d_model: 26                          # Model dimension
number_of_heads: 1                   # Attention heads
number_of_transformer_encoders: 1    # Number of encoder layers
latent_dim: 32                       # Dimensionality reduction target

# Training
batch_size: 8                        # Batch size
epochs: 50                           # Training epochs
lr: 0.005                           # Learning rate

# Positional Encoding
relative_positional_encoding: true   # Enable relative positional encoding
age_encoding: false                  # Include age information
actual_time_encoding: false          # Include actual time intervals
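These key: value entries can be read with a few lines of Julia. The sketch below is a hypothetical illustration, not the project's actual parameter handling (which lives in src/parameters.jl):

# Parse config.txt into a String => String dictionary.
function read_config(path::AbstractString)
    cfg = Dict{String, String}()
    for line in eachline(path)
        entry = strip(first(split(line, '#')))   # drop comments and section headers
        isempty(entry) && continue
        key, value = strip.(split(entry, ':'; limit = 2))
        cfg[key] = value   # values are converted to their proper types downstream
    end
    return cfg
end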

Methodology

Data Preprocessing and Feature Engineering

Stressor Categorization

  • Daily Hassles (DH): 58 categories representing routine stressors
  • Life Events (LE): 27 categories representing significant life changes
  • Mental Health Indicators: GHQ subscales measuring psychological distress

Chi-Square Dependency Analysis

The system implements comprehensive chi-square analysis to identify statistical dependencies between stressors:

  1. Pairwise Analysis: Constructs 2×2 contingency tables for all stressor pairs (sketched after this list)
  2. Participant-Level Statistics: Computes chi-square statistics for each individual
  3. Population Aggregation: Two methods available:
    • Count Method: Counts significant associations (χ² > 3.841, p < 0.05)
    • Median Method: Robust central tendency measure
  4. Dependency-Based Grouping: Hierarchical clustering based on statistical dependencies
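To make step 1 concrete, the statistic for a single stressor pair can be computed directly from its 2×2 contingency table. This standalone sketch uses the textbook formula and the χ² > 3.841 cutoff cited above; it is an illustration, not the code in get_data_report.jl:

# Chi-square statistic for a 2×2 contingency table [a b; c d], where the
# cells count joint occurrence / non-occurrence of two stressors.
function chi_square_2x2(a, b, c, d)
    n = a + b + c + d
    e11 = (a + b) * (a + c) / n   # expected counts under independence
    e12 = (a + b) * (b + d) / n
    e21 = (c + d) * (a + c) / n
    e22 = (c + d) * (b + d) / n
    return (a - e11)^2 / e11 + (b - e12)^2 / e12 +
           (c - e21)^2 / e21 + (d - e22)^2 / e22
end

is_significant(stat) = stat > 3.841   # p < 0.05 at one degree of freedom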

Temporal Sequence Construction

  1. Binarization: Converts raw scores to binary indicators with configurable thresholds
  2. Denoising: Removes isolated occurrences shorter than the configured minimum streak length (see the sketch after this list)
  3. Sequence Representation:
    • Standard: Visit-level feature vectors
    • Appearance-Disappearance: Explicit onset/cessation modeling
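The denoising step (item 2) can be pictured as removing runs of 1s shorter than the configured streak length. A minimal sketch with hypothetical naming (the actual preprocessing is in src/prepare_data.jl):

# Zero out runs of 1s shorter than min_streak_size.
function denoise(x::AbstractVector{<:Integer}, min_streak_size::Int)
    y = copy(x)
    i = 1
    while i <= length(y)
        if y[i] == 1
            j = i
            while j < length(y) && y[j + 1] == 1
                j += 1                 # extend to the end of the current streak
            end
            if j - i + 1 < min_streak_size
                y[i:j] .= 0            # isolated occurrence: treat as noise
            end
            i = j + 1
        else
            i += 1
        end
    end
    return y
end

denoise([0, 1, 0, 1, 1, 1, 0], 2)      # -> [0, 0, 0, 1, 1, 1, 0]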

Transformer Architecture

Custom Design Features

  • Embedding Layer: Converts dimensionality-reduced inputs with optional positional encoding
  • Multi-Head Attention: Scaled dot-product attention with causal masking (sketched after this list)
  • Feed-Forward Networks: Position-wise dense layers with residual connections
  • Prediction Head: Temperature-scaled softmax for next-token prediction
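To make the attention component concrete, here is a compact standalone sketch of scaled dot-product attention with a causal mask, using the features × sequence-length layout common in Julia. It illustrates the mechanism named above rather than reproducing src/transformer_layer.jl:

# Column-wise softmax helper.
softmax_cols(x) = (e = exp.(x .- maximum(x; dims = 1)); e ./ sum(e; dims = 1))

# Q, K, V are d × n matrices: one column per sequence position.
function causal_attention(Q, K, V)
    d, n = size(K)
    scores = (K' * Q) ./ sqrt(Float32(d))   # n × n; column j scores query j against all keys
    mask = [key > query ? -Inf32 : 0f0 for key in 1:n, query in 1:n]
    A = softmax_cols(scores .+ mask)        # each query attends only to current and past positions
    return V * A                            # d × n attended values
end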

Positional Encoding Strategies

  1. Absolute Positional Encoding: Sinusoidal encoding based on sequence position (sketched after this list)
  2. Relative Positional Encoding: Distance-based encoding with learnable scaling
  3. Time-Aware Encoding: Incorporates actual time intervals and participant age
  4. Visit-Aware Encoding: Models follow-up visit boundaries using separator tokens
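Strategy 1 has a standard closed form; the sketch below gives the usual sinusoidal encoding, which the time- and age-aware variants above extend with actual intervals and participant age (those extensions are not reproduced here):

# Sinusoidal absolute positional encoding: one column per position.
function positional_encoding(seq_len::Int, d_model::Int)
    pe = zeros(Float32, d_model, seq_len)
    for pos in 1:seq_len, i in 0:2:(d_model - 1)
        angle = (pos - 1) / 10000f0^(i / d_model)
        pe[i + 1, pos] = sin(angle)
        if i + 2 <= d_model
            pe[i + 2, pos] = cos(angle)
        end
    end
    return pe   # added element-wise to the token embeddings
end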

Dimensionality Reduction Methods

The framework implements and compares six dimensionality reduction approaches:

Method                    Runs   Val Loss (mean ± std)   Description
One-hot (linear)          49     1.362 ± 0.153 ⭐         Direct linear embedding of categorical data
One-hot (lookup)          45     1.372 ± 0.275           Lookup table-based embeddings
PCA                       41     1.398 ± 0.156           Principal component analysis
Semantic Embeddings       51     1.397 ± 0.172           Pre-trained semantic representations
Linear Autoencoders       41     1.516 ± 0.240           Neural network-based linear compression
Non-linear Autoencoders   49     2.000 ± 0.433           Deep non-linear dimensionality reduction

Results show that simple one-hot encodings outperform complex dimensionality reduction methods for this healthcare prediction task.
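For intuition, the best-performing scheme (one-hot with a linear embedding) reduces to a single dense matrix whose columns are the token embeddings. A minimal sketch, with sizes assumed from the 58 DH + 27 LE categories (the trained variants live in src/learn_word_embedding/):

vocab_size, d = 85, 26                         # assumed: 58 DH + 27 LE categories, d_model = 26
W = 0.02f0 .* randn(Float32, d, vocab_size)    # learnable embedding matrix
onehot(i, n) = (v = zeros(Float32, n); v[i] = 1f0; v)
token = W * onehot(3, vocab_size)              # equals W[:, 3]: the one-hot picks out a column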

Training and Optimization

  • Specialized Loss Function: Visit-aware cross-entropy with a reduced penalty on the end token (sketched after this list)
  • Adam Optimizer: Configurable learning rates with gradient clipping
  • Validation Strategy: Participant-level splitting to prevent data leakage
  • Early Stopping: Validation-based convergence detection
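As an illustration of the loss bullet, a cross-entropy that down-weights the end token could look like the following; the weighting scheme and names are assumptions for illustration, not the exact loss used here:

# Column-wise softmax, as in the attention sketch above.
softmax_cols(x) = (e = exp.(x .- maximum(x; dims = 1)); e ./ sum(e; dims = 1))

# logits: vocab × n; targets: length-n vector of true token indices.
function visit_aware_crossentropy(logits, targets, end_token; end_weight = 0.1f0)
    p = softmax_cols(logits)
    w = [t == end_token ? end_weight : 1f0 for t in targets]   # reduced end-token penalty
    return -sum(w[j] * log(p[targets[j], j]) for j in eachindex(targets)) / sum(w)
end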

Evaluation Framework

Metrics

  1. Visit-Level Accuracy: Exact match between predicted and actual visits
  2. Event-Level Accuracy: Element-wise accuracy that credits partial matches (metrics 1 and 2 are sketched after this list)
  3. GHQ-Specific Metrics: Mental health outcome prediction accuracy
  4. Cross-Entropy Loss: Standard sequence modeling loss
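Metrics 1 and 2 differ only in granularity, as the following illustrative definitions show (hypothetical names; the project's metrics live in src/eval.jl). A sequence is a vector of visits, each visit a binary feature vector:

# Exact-match accuracy over whole visits.
visit_level_accuracy(pred, actual) =
    count(p == a for (p, a) in zip(pred, actual)) / length(actual)

# Element-wise accuracy, giving credit for partial matches.
event_level_accuracy(pred, actual) =
    sum(count(pv .== av) for (pv, av) in zip(pred, actual)) / sum(length, actual)

pred   = [[1, 0], [1, 1]]
actual = [[1, 0], [0, 1]]
visit_level_accuracy(pred, actual)   # 0.5: only the first visit matches exactly
event_level_accuracy(pred, actual)   # 0.75: three of four events match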

Baseline Comparisons

  • Repetition Baseline: Predicts each visit by repeating the previous one (sketched after this list)
  • Linear Regression: Traditional regression on flattened sequences
  • Ridge Regression: Regularized linear models
  • Intercept-Only: Mean prediction baseline
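The repetition baseline is simple enough to state in one line; this hypothetical sketch pairs with the accuracy functions above:

# Predict that each visit repeats the previous one.
repetition_baseline(visits) = visits[1:end-1]   # predictions for visits 2:end
# e.g. event_level_accuracy(repetition_baseline(seq), seq[2:end])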

Visualization and Analysis

  • Attention Heatmaps: Visualization of attention patterns
  • Embedding Projections: 2D visualizations using PCA and t-SNE
  • UMAP Clustering: Exploration of learned representations
  • Chi-Square Dependency Matrices: Stressor co-occurrence patterns
  • Baseline Variable Analysis: Demographic and clinical correlations

File Structure

table2token4transformer/
├── main.jl                     # Main training script
├── main_figure.pdf             # Main methodology figure
├── config.txt                  # Configuration parameters
├── Project.toml                # Julia package dependencies
├── tensorboard_analysis.py     # Comprehensive analysis framework
├── src/                        # Source code
│   ├── transformer_layer.jl    # Transformer implementation
│   ├── prepare_data.jl         # Data preprocessing pipeline
│   ├── eval.jl                 # Evaluation metrics
│   ├── visualization.jl        # Plotting and visualization
│   ├── regression.jl           # Baseline regression models
│   ├── parameters.jl           # Parameter structures
│   ├── logging.jl              # Experiment logging
│   └── learn_word_embedding/   # Dimensionality reduction
│       ├── word_embedding.jl   # Autoencoder implementations
│       ├── loading.jl          # Data loading utilities
│       ├── create_data_sequence.jl # Sequence creation
│       └── get_data_report.jl  # Chi-square analysis
├── analysis_results/           # Analysis outputs (auto-generated)
│   ├── dimensionality_reduction_table.tex
│   ├── hyperparameter_analysis.csv
│   ├── word_embedding_similarity_matrices.pdf
│   └── umap_*.png             # UMAP visualizations
├── data/                       # Data directory
├── runs_*/                     # Experiment outputs (multiple sets)
└── notebooks/                  # Jupyter notebooks

Output and Results

Generated Files

Each experimental run creates:

  • Model Checkpoints: Trained model weights (.bson)
  • Configuration Logs: Complete parameter settings
  • Evaluation Reports: Comprehensive metrics (eval.txt)
  • Visualizations:
    • Attention heatmaps (.pdf)
    • Embedding projections (.pdf)
    • Chi-square dependency matrices (.pdf)
    • UMAP clusterings (.pdf)
  • TensorBoard Logs: Training progress monitoring

Analysis Outputs

The analysis scripts generate comprehensive reports in the analysis_results/ directory:

  • LaTeX Tables: Performance summaries and hyperparameter analysis (.tex)
  • CSV Reports: Detailed metrics and statistics (.csv)
  • Visualizations:
    • 2D embedding comparisons (.pdf)
    • Token similarity heatmaps (.pdf)
    • UMAP clustering plots (.png)
    • Attention pattern analysis (.pdf)
  • Recommendations: JSON files with optimal hyperparameter suggestions

Directory Management: All output directories are created automatically by the scripts. No manual setup required.

Experiment Tracking

  • Automatic timestamping and versioning
  • Reproducible random seeds
  • Complete parameter logging
  • Git integration for code versioning

Research Applications

This methodology is particularly suitable for:

  • Longitudinal Healthcare Analysis: Time series prediction in clinical settings
  • Mental Health Research: Stress pattern analysis and prediction
  • Behavioral Modeling: Understanding temporal dependencies in human behavior
  • Feature Engineering Research: Novel approaches to sequence representation
  • Transformer Adaptations: Specialized architectures for healthcare data

Citation

If you use this code in your research, please cite:

@misc{table2token4transformer,
  title={table2token4transformer: Converting Tabular Data to Transformer-Ready Token Sequences for Mental Health Prediction},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/[username]/table2token4transformer}}
}

License

[Specify your license here]

Contributing

[Contributing guidelines if applicable]

Contact

[Your contact information]


Note: This implementation is part of ongoing research in feature engineering for transformer models applied to healthcare data. The methodology emphasizes interpretability, clinical relevance, and statistical rigor in feature construction.
