A rigorous framework for diagnosing token-level embedding geometry pathologies in Large Language Models.
DCTT provides tools to identify tokens with problematic local geometry in LLM embedding spaces. The core contribution is a diagnostic and causal validation framework; repair methods are exploratory, with a current finding that single-token local optimization is insufficient when entire neighborhoods exhibit pathological geometry.
- Multi-stage diagnostics: Fast screening (Stage 1) followed by detailed spectral analysis (Stage 2)
- Spectral geometry metrics: Participation ratio, condition number, effective dimension, log-determinant
- Constrained repair optimization: Fix geometry while preserving semantics
- Causal validation: Stress tests and matched control experiments
- Apple Silicon optimized: USearch HNSW with ARM NEON acceleration
- Reproducibility first: W&B tracking, config snapshots, seed management
# Clone the repository
git clone https://github.com/MJ-Ref/DCTT.git
cd DCTT
# Install with development dependencies
pip install -e ".[dev]"
# For Apple Silicon with MLX support
pip install -e ".[dev,mlx]"
# For cloud GPU with Modal
pip install -e ".[dev,modal]"- Python 3.11+
- PyTorch 2.3+
- 16GB+ RAM (96GB recommended for full vocabulary analysis)
# Extract and normalize embeddings from a model
dctt extract --model Qwen/Qwen2.5-Coder-7B --output outputs/embeddings.npy# Analyze geometry for all tokens
python experiments/run_census.py model=qwen2_5_coder_7b
# Or sample 1000 tokens for quick analysis
python experiments/run_census.py model=qwen2_5_coder_7b \
experiment.tokens.mode=sample \
experiment.tokens.sample_size=1000# Repair high-severity tokens and compare to matched controls
python experiments/run_causal_repair.py model=qwen2_5_coder_7bdctt/
├── core/ # Types, exceptions, registry pattern
├── embeddings/ # Extraction, normalization, caching
├── neighbors/ # USearch HNSW index (M3-optimized)
├── metrics/ # Stage 1 & 2 diagnostic metrics
├── repair/ # Constrained optimization repair
├── stress_tests/ # Code syntax, math formatting tests
├── evaluation/ # Statistical analysis, causal inference
└── tracking/ # W&B integration, reproducibility
Fast, cheap metrics for initial screening:
μ_k: Mean k-NN distancemed_k: Median k-NN distancespread_q: Quantile spread ratio (q90/q10)
Core contribution - displacement matrix-based metrics:
dim95: Effective dimension at 95% variancePR: Participation ratio (eigenvalue spread)cond: Local condition numberlogdet: Log-determinant of covarianceanisotropy: Dominant direction ratio
Robust z-scores within (frequency_tier, token_type) buckets ensure fair comparison across token categories.
DCTT uses Hydra for configuration management:
# Override model
python experiments/run_census.py model=llama3_8b
# Override compute settings
python experiments/run_census.py compute=modal_gpu
# Override multiple settings
python experiments/run_census.py \
model=qwen2_5_coder_7b \
neighbors.k=100 \
seed=123| File | Purpose |
|---|---|
configs/config.yaml |
Root configuration |
configs/model/*.yaml |
Model-specific settings |
configs/compute/*.yaml |
Hardware optimization |
configs/pipeline/*.yaml |
Pipeline stages |
configs/experiment/*.yaml |
Experiment configurations |
PR = (Σλ)² / Σ(λ²)
Measures effective dimensionality. PR = d for uniform eigenvalues, PR ≈ 1 for single dominant direction.
cond = (λ₁ + ε) / (λₘ + ε)
High condition numbers indicate ill-conditioned local geometry that may cause gradient instabilities.
logdet = Σ log(λᵢ + ε)
Measures local "volume" or dispersion. Very negative values indicate collapsed geometry.
The initial repair optimizer uses projected gradient descent with tangent space projection:
for outer_iter in range(max_outer_iters):
neighbors = knn(index, embedding, k) # Recompute neighbors
for step in range(inner_steps):
loss = geometry_loss + λ_anchor * anchor_loss + λ_nn * nn_preserve_loss
grad_tangent = gradient - (gradient · embedding) * embedding
embedding = embedding - lr * grad_tangent
embedding = project_to_constraints(embedding) # Unit norm, max deltaFinding: Single-token local optimization preserves semantics (Jaccard > 0.7) but does NOT improve geometry metrics. This occurs because centered displacement covariance makes the loss independent of a single moving token when neighbors are fixed.
To address the single-token limitation, we implemented cluster-level repair that jointly optimizes connected components of pathological tokens:
python experiments/run_cluster_repair.py model=qwen2_5_coder_7b \
cluster_repair.mutual_k=50 cluster_repair.min_cluster_size=2Algorithm:
- Build mutual k-NN graph on high-severity tokens
- Find connected components (clusters)
- Jointly optimize all tokens in each cluster
- Centered covariance CAN change when multiple reference points shift together
Results on Qwen2.5-Coder-7B:
| Metric | Value | Status |
|---|---|---|
| Clusters found | 69 | With mutual_k=50, min_size=2 |
| Clusters improved | 5/5 (100%) | All clusters show improvement |
| Condition number reduction | 0.427 ± 0.157 | 10-17% improvement |
| Jaccard overlap | 0.836 ± 0.030 | Excellent semantic preservation |
| Similarity to original | 0.992 | Minimal movement required |
Key Finding: Cluster-level repair successfully improves geometry (condition number decreases) while preserving semantics, validating the hypothesis that pathological tokens need to move together.
# Run tests
make test
# Run fast tests only
make test-fast
# Lint and format
make format
make lint
# Type check
make typecheck
# Install pre-commit hooks
make pre-commitFull vocabulary census on 152,064 tokens (3,584 dimensions):
| Metric | Result |
|---|---|
| Tokens analyzed | 152,064 |
| Flagged tokens (poor geometry) | 3,325 (2.19%) |
| Processing speed | ~330 tokens/sec |
| Total census time | 7.5 minutes |
High-severity token outputs in the current census artifact:
- The saved
diagnostic_results.jsoncurrently uses placeholder token strings (token_<id>) - Token type is
UNKNOWNin that artifact - Lexical examples should be treated as pending until census writes decoded tokenizer strings
| Criterion | Status | Value |
|---|---|---|
| Embeddings moved | ✓ YES | similarity = 0.98 |
| Semantic preservation | ✓ PASS | Jaccard 0.75 - 1.0 |
| Geometry improved | ✗ NO | cond/PR unchanged |
Finding: Single-token repair preserves semantics but doesn't improve geometry when neighborhoods are uniformly pathological.
| Criterion | Status | Value |
|---|---|---|
| Clusters detected | ✓ YES | 69 clusters found |
| Geometry improved | ✓ YES | cond reduced 0.43 ± 0.16 |
| Semantic preservation | ✓ PASS | Jaccard = 0.836 |
| Improvement rate | ✓ 100% | 5/5 clusters improved |
| Claim | Status | Evidence |
|---|---|---|
| Geometry improves vs placebo | ✓ Supported | Treatment cond -0.27 vs control +0.04 |
| Behavior improves causally | ✗ Not yet | DiD not significant (p=0.81), simulated outcomes |
Supported Claim: "Cluster-level repair improves local geometry (cond) relative to placebo with minimal embedding movement."
Not Yet Supported: "Repair causally improves downstream behavior." Requires real stress tests with model inference, better matching, and larger samples.
Latest strict real-label sweep (forced-token minimal-pair, logprob_choice scoring, no proxy confounds):
qwen2_5_coder_7b(3 seeds, 100 tokens/run): delta mean-0.166(positive in 0/3 runs)qwen2_5_7b(3 seeds, 100 tokens/run): delta mean-0.248(positive in 0/3 runs)
Cross-family pilot (2 seeds/model, strict no-proxy setup):
mistral_7b: delta mean-0.053(gateFAIL)tinyllama_1_1b: delta mean-0.101(gateFAIL)
Current interpretation: geometry-only signal is negative under strict confound controls, so the predictive claim is not supported for publication at this time.
Reproduce with:
# Build/update counts vector (one-time per tokenizer/corpus).
# For Qwen strict runs, align to model output vocab size:
python experiments/build_token_frequency_counts.py \
--model-name Qwen/Qwen2.5-Coder-7B \
--input-root /path/to/corpus \
--output configs/confounds/qwen2_5_coder_7b_repo_counts_aligned.npy \
--target-vocab-size 152064
python experiments/run_predictive_validity_sweep.py \
--models qwen2_5_coder_7b,qwen2_5_7b \
--seeds 70,71,72 \
--sample-size 100 \
--n-prompts 2 \
--max-new-tokens 8 \
--n-bootstrap 30 \
--scoring-mode logprob_choice \
--compute-device cuda \
--frequency-counts-path configs/confounds/qwen2_5_coder_7b_repo_counts_aligned.npy \
--fail-on-proxy-confounds
# Gate decision (PASS/FAIL):
python scripts/evaluate_predictive_gate.py \
--sweep-results outputs/sweeps/predictive_validity/<run_stamp>/sweep_results.json \
--output-json outputs/sweeps/predictive_validity/<run_stamp>/gate_evaluation.json \
--output-markdown outputs/sweeps/predictive_validity/<run_stamp>/gate_evaluation.md
# Full-power cross-family rescue launcher (5 seeds/model, per-model confound files):
python scripts/launch_cross_family_rescue.py \
--config configs/experiment/cross_family_rescue.yaml
# Wait/pull/finalize modal sweep artifacts once a stamp is known:
python scripts/finalize_modal_predictive_sweep.py \
--stamp <run_stamp> \
--wait \
--min-runs-per-model 5This is a research codebase under active development. Current status:
- Core types and abstractions
- Stage 1 & 2 metrics
- USearch index integration
- Single-token repair optimizer
- Stress test framework
- Statistical evaluation
- W&B integration
- Benchmark wrappers (HumanEval, GSM8k)
- Stage 3 TDA metrics
- Paper figures generation
- Full census on Qwen2.5-Coder-7B
- Single-token repair validation (negative result)
- Cluster-level repair (positive result - geometry improves!)
- Forced-token minimal-pair stress tests
- Predictive-validity analysis pipeline (real-label runs complete)
- Causal experiment framework (mechanistic claim validated)
- Cross-family pilot replication (Mistral, TinyLlama; strict negative)
- Causal behavioral evidence (needs real stress tests)
- Full-power cross-family rescue sweep (5 seeds/model)
If you use DCTT in your research, please cite:
@software{dctt2024,
title = {DCTT: Discrete-to-Continuous Transition Testing for LLM Embedding Geometry},
year = {2024},
url = {https://github.com/MJ-Ref/DCTT}
}MIT License - see LICENSE for details.
- USearch for fast HNSW indexing
- Weights & Biases for experiment tracking
- Hydra for configuration management