Ben Wade · Independent Researcher · ORCID 0009-0009-5857-7447
Speculative decoding accelerates inference by having a small model draft tokens that a large model verifies. We invert this paradigm entirely.
Rather than using a small model to speed up a large one, we use a large Diffusion Language Model (DLM) to elevate a small one. We exploit a mathematical property of absorbing-state masked diffusion — that token commitments during denoising are permanent and structurally load-bearing — to extract a sparse skeleton of anchor tokens from as few as 10% of denoising steps. A sub-billion-parameter autoregressive model then fills the gaps between these anchors, reproducing the DLM's full output at 0.82–0.93 F1 while running entirely on edge hardware.
The DLM never finishes generating. It only needs to start.
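The anchor-extraction idea can be sketched in a few lines. This is an illustrative toy, not the repo's implementation: `toy_denoiser` stands in for Dream-7B, and the names (`extract_anchors`, `MASK`, `tok{i}`) are invented for the example. The key property it demonstrates is the absorbing-state one: once a position is unmasked it stays unmasked, so stopping at 10% of the step budget still yields a valid, sparse set of committed tokens.

```python
import random

MASK = "<mask>"

def extract_anchors(denoise_step, seq_len, total_steps, step_fraction):
    """Run only the first `step_fraction` of denoising steps and
    collect the tokens the DLM has already committed.

    In absorbing-state masked diffusion a committed token is never
    re-masked, so these early commitments are valid anchors for the
    final output."""
    tokens = [MASK] * seq_len
    budget = max(1, int(total_steps * step_fraction))
    for step in range(budget):
        tokens = denoise_step(tokens, step)  # may unmask a few positions
    return {i: t for i, t in enumerate(tokens) if t != MASK}

# Toy denoiser standing in for Dream-7B: unmasks up to 2 positions
# per step with placeholder tokens.
def toy_denoiser(tokens, step):
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for i in random.sample(masked, min(2, len(masked))):
        tokens[i] = f"tok{i}"
    return tokens

anchors = extract_anchors(toy_denoiser, seq_len=64, total_steps=128,
                          step_fraction=0.10)
print(f"{len(anchors)} anchors from 10% of steps")
```

A small autoregressive model then decodes the gaps between these anchor positions, which is what Phase B measures.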
Evaluated on 190 prompts across MMLU, ARC, GSM8K, and HumanEval:
| Step Fraction | Qwen-0.5B F1 | Qwen-1.5B F1 | Mean Coverage |
|---|---|---|---|
| 10% (13 steps) | 0.821 | 0.830 | ~30% |
| 15% (19 steps) | 0.887 | 0.896 | ~50% |
| 25% (32 steps) | 0.921 | 0.930 | ~70% |
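The F1 columns compare the gap-filled output against the full DLM output. The paper's exact metric is not spelled out here; a common choice for this kind of string comparison is SQuAD-style bag-of-tokens F1, sketched below as an assumption:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Bag-of-tokens F1 between two strings (SQuAD-style).
    The paper's exact metric may differ; this is the common variant."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "the cat sat on a mat"))
```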
Ablation (N=50): Token identity drives anchor effectiveness, not position.
- Real vs Random-Token: Cohen's d = 6.41, p = 1.21×10⁻¹⁰
- Real vs Random-Position: Cohen's d = 0.10, p = 1.52×10⁻²
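The effect sizes above are standard pooled-SD Cohen's d over per-prompt F1 samples. A minimal sketch (the F1 values here are illustrative placeholders, not the paper's data):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

real = [0.82, 0.85, 0.80, 0.84]        # illustrative F1 samples
random_tok = [0.10, 0.12, 0.09, 0.11]  # illustrative F1 samples
print(round(cohens_d(real, random_tok), 2))
```

When the two groups barely overlap, as in the real-vs-random-token comparison, d becomes very large, which is how values like 6.41 on real data arise.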
Gap-filler size: 0.5B matches 1.5B when anchors are provided (Δ = −0.009), confirming the DLM provides the dominant semantic signal.
Compute reduction: 128 DLM forward passes → 13, plus ~100 forward passes through a model 14× smaller. The pipeline requires approximately 10–15% of the full DLM's compute budget.
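The budget claim follows from a back-of-envelope model in which cost scales with (forward passes) × (parameter count). This ignores KV-cache savings on the autoregressive side, which push the true ratio lower:

```python
# Back-of-envelope compute ratio: cost ~ passes x params.
DLM_PARAMS, FILLER_PARAMS = 7e9, 0.5e9   # Dream-7B and Qwen-0.5B (14x smaller)

full = 128 * DLM_PARAMS                   # full DLM generation
pipeline = 13 * DLM_PARAMS + 100 * FILLER_PARAMS  # 10% of steps + gap-filling

print(f"{pipeline / full:.1%}")
```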
inverse-speculation-dlm/
├── inverse_speculation.pdf ← Paper
├── README.md
├── requirements.txt
├── cloud_setup.md ← RunPod/Vast.ai setup instructions
├── run_full_suite.py ← Main experiment (Phase A/B/C + ablation)
├── bridge_viability.py ← Layer-depth vs anchor quality (32B probe)
└── results/
├── dlm_anchor_results.json
├── ablation_anchors.json
├── gap_only_full.json
├── llm_size_results.json
├── supplement_timing.json
└── supplement_summary.txt
`run_full_suite.py` runs all phases on a single A100 instance:
- Phase A: Dream-7B commit curves on 310 prompts (MMLU / ARC / GSM8K / HumanEval)
- Phase B: Qwen-0.5B and Qwen-1.5B gap-fill at step fractions 0.10, 0.15, 0.25
- Phase C: Pearl reasoning point — minimum DLM steps preserving ≥95% F1 on GSM8K
- Ablation: Real vs random-token vs random-position anchor controls
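The two ablation controls can be built mechanically from a real anchor set. A sketch under assumed data structures (anchors as a `{position: token}` dict; `control_anchors` is an invented name, not a function from the repo):

```python
import random

def control_anchors(real_anchors, vocab, seq_len, seed=0):
    """Build the two ablation controls from a real anchor set.

    - random-token: same positions, tokens resampled from the vocab
      (destroys identity, keeps position)
    - random-position: same tokens, positions reshuffled
      (destroys position, keeps identity)"""
    rng = random.Random(seed)
    positions, tokens = zip(*sorted(real_anchors.items()))
    rand_tok = {p: rng.choice(vocab) for p in positions}
    new_pos = rng.sample(range(seq_len), len(tokens))
    rand_pos = dict(zip(sorted(new_pos), tokens))
    return rand_tok, rand_pos

real = {3: "Paris", 7: "capital", 12: "France"}
rt, rp = control_anchors(real, vocab=["a", "b", "c"], seq_len=16)
print(sorted(rt), sorted(rp.values()))
```

Comparing gap-fill F1 across the three conditions isolates whether anchors help through what they say (token identity) or merely where they sit (position), which is what the Cohen's d figures above quantify.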
```bash
# Smoke test (5 prompts per benchmark, ~5 min)
python run_full_suite.py --subset 5

# Full run (310 prompts, ~45 min on A100, ~$1)
python run_full_suite.py
```

Outputs to `/workspace/results/`.
Tests whether Qwen2.5-32B intermediate-layer activations can seed Dream's denoising process. It measures per-layer confidence and accuracy, then injects viable anchors into Dream via a hook function.

```bash
python bridge_viability.py
```

Outputs to `/workspace/bridge_viability/`.
See cloud_setup.md for full instructions.
Quick start on an A100 instance:
```bash
pip install -r requirements.txt
python run_full_suite.py --subset 5   # smoke test
```

Minimal example (1 prompt per benchmark, ~2 min):

```bash
python run_full_suite.py --subset 1
```

Note: `transformers==4.46.2` is required for Dream-7B compatibility.
| Model | Role | VRAM |
|---|---|---|
| Dream-org/Dream-v0-Instruct-7B | DLM anchor source | ~14 GB |
| Qwen/Qwen2.5-0.5B-Instruct | Primary gap-filler | ~1 GB |
| Qwen/Qwen2.5-1.5B-Instruct | Secondary gap-filler | ~3 GB |
| Qwen/Qwen2.5-32B-Instruct | Bridge viability probe | ~20 GB (4-bit) |
Dream-7B (Dream-org/Dream-v0-Instruct-7B)
- Checkpoint: `Dream-v0-Instruct-7B` (bf16, ~14 GB)
- Requires `transformers==4.46.2` and `trust_remote_code=True`
- Tokenizer: shared with the Qwen family, so no mapping is required for the Qwen gap-fillers
- Download: `AutoModel.from_pretrained("Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True)`
Qwen gap-fillers (Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B-Instruct)
- Standard HuggingFace download, no special flags required
- Dream-7B and Qwen2.5 share compatible tokenizer vocabularies — forced decoding works without token translation
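Because the vocabularies match, forced decoding reduces to overriding the gap-filler's choice at anchor positions. A minimal greedy sketch, assuming anchors as a `{position: token_id}` dict and an abstract `step_logits` callable standing in for the Qwen forward pass (both names are invented for illustration):

```python
def anchored_decode(step_logits, anchors, seq_len):
    """Greedy decoding that forces anchor tokens at their positions.

    `step_logits(prefix)` returns next-token logits for the prefix;
    because Dream and Qwen share a vocabulary, anchor token ids can
    be forced directly with no remapping."""
    out = []
    for pos in range(seq_len):
        if pos in anchors:
            out.append(anchors[pos])  # take the DLM's committed token
        else:
            logits = step_logits(out)
            out.append(max(range(len(logits)), key=logits.__getitem__))
    return out

# Toy model over a 3-token vocab that always prefers token id 0.
toy = lambda prefix: [1.0, 0.5, 0.2]
print(anchored_decode(toy, {1: 2, 3: 1}, seq_len=5))  # → [0, 2, 0, 1, 0]
```

Note that the forced anchor tokens also enter the prefix, so they condition every later free position, which is how a sparse skeleton steers the whole generation.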
All three models together: ~18 GB VRAM — fits comfortably on a single A100-80GB.
See cloud_setup.md for the full download script.
If you use this work, please cite:
```bibtex
@misc{wade2026inverse,
  title  = {Inverse Speculation: Structural Anchoring from Diffusion Language
            Models for Edge-Scale Generation},
  author = {Wade, Ben},
  year   = {2026},
  orcid  = {0009-0009-5857-7447},
  note   = {Preprint}
}
```

MIT