A.T.L.A.S

Adaptive Test-time Learning and Autonomous Specialization

A.T.L.A.S achieves 36-41% LiveCodeBench pass@1 with a frozen 14B model on a single consumer GPU through intelligent test-time compute allocation. No fine-tuning, no API calls, no cloud -- just a $500 GPU and smart inference.

Benchmark Results

Run ID: v2_run_20260217_125310 | Hardware: RTX 5060 Ti 16GB | Throughput: 109 tasks/hr

Benchmark	Score	Tasks	Method
LiveCodeBench v5	36-41% pass@1	599	k=3, Geometric Lens selection, 4 epochs
GPQA Diamond	47.0%	198	k=5, multiple-choice knowledge reasoning
SciCode	14.7% (sub-problems)	341	k=1, cross-domain scientific coding

Single run, not averaged. LCB range reflects epoch 0-3 of Lens retraining, not a confidence interval.

Lens learning curve (LiveCodeBench, k=3)

Epoch	Tasks	Pass Rate	First-Pick Accuracy	Energy Gap
0 (baseline, no Lens)	100	36.0%	n/a	n/a
1 (1st retrain)	200	38.0%	82.9%	5.3
2 (2nd retrain)	200	35.5%	78.9%	11.5
3 (3rd retrain)	99	41.4%	78.0%	11.3

First-pick accuracy = how often the Lens's lowest-energy candidate actually passes. The energy gap between pass and fail candidates doubled after retraining (5.3 to 11.3), showing the Lens learned to separate passing from failing code. Val AUC reached 0.968 at epoch 3.

Note: The V2.5 ablation study found that under 768-dim nomic embeddings, C(x) was statistically indistinguishable from random selection (37.7% vs 37.1%). V2.5.1 confirmed this was an embedding source limitation, not an architecture failure. With Qwen3-14B self-embeddings (5120-dim), C(x) selects correctly 87.8% of the time on mixed-result tasks vs 48.3% random (+39.5pp, p < 0.000001). The Lens requires the model's own internal representations to discriminate candidates.

V2.5.1 Result (2026-02-23): Self-embeddings restore full discrimination. C(x) is a verified candidate verifier AND difficulty router. G(x) metric tensor contributes zero value and will be removed. See V2_5_ABLATION_STUDY.md for the full confirmation ablation report.

V2.5 Ablation Study + V2.5.1 Confirmation

A systematic ablation (2026-02-21) tested whether the Geometric Lens C(x) energy scoring provides real candidate selection value beyond diversity. Under 768-dim nomic embeddings, Lens scoring was statistically indistinguishable from random selection (37.7% vs 37.1%, +0.6pp within 3.4pp seed variance).

✅ V2.5.1 CONFIRMED (2026-02-23): This result was caused by the embedding source switch, not the Lens architecture. With Qwen3-14B self-embeddings (5120-dim), C(x) selects correctly 87.8% of the time on mixed-result tasks vs 48.3% random (+39.5pp, p < 0.000001). Reverse energy selects only 4.3%, proving a strong directional signal. The Lens is a verified candidate discriminator — it just needs the model's own internal representations.

Metric	V2.5 (nomic 768-dim)	V2.5.1 (self 5120-dim)
Selection accuracy (mixed tasks)	37.7%	87.8%
Selection - Random delta	+0.6pp	+39.5pp
Energy separation (PASS - FAIL)	~3.0	21.75
G(x) metric tensor value	dormant	zero (0.0pp at any alpha)

The V2.5 study also discovered that llama.cpp's --embeddings flag silently breaks speculative decoding (forcing n_batch=512). This led to a two-server sidecar architecture: generation with spec decode (~100 tok/s) on the main server, embeddings via nomic sidecar (~300 MiB VRAM). C(x) energy is confirmed as both a candidate verifier (87.8% selection accuracy) and difficulty router (Q1=100% solvable, Q4=0.3%).

Full results: V2_5_ABLATION_STUDY.md | Architecture change: V2_TO_V2_5_MIGRATION.md

Architecture

flowchart TB
  subgraph Input
    Problem[Coding Problem]
  end

  subgraph Routing["Confidence Router"]
    DE[Difficulty Estimator<br/>Weights: 0.30 / 0.25 / 0.20 / 0.25]
    AK[Adaptive-k Selection<br/>CACHE_HIT k=0 / FAST k=1<br/>STANDARD k=5 / HARD k=20]
  end

  subgraph Generation["Best-of-K Pipeline"]
    LS[Server A: llama-server<br/>Qwen3-14B-Q4_K_M<br/>+ Qwen3-0.6B Draft<br/>Spec decode ON]
    EM[Server B: Embeddings<br/>nomic-embed-text-v1.5<br/>768-dim]
    PC[Pattern Cache<br/>Redis + Ebbinghaus Decay]
  end

  subgraph Evaluation["Candidate Selection"]
    GL[Geometric Lens<br/>Energy-based scoring<br/>Cost Field C x ~0.5M params]
    SB[Sandbox<br/>Code Execution + Testing]
  end

  subgraph Knowledge["Context Retrieval"]
    PI[PageIndex RAG<br/>Tree Index + LLM Reasoning]
  end

  Problem --> DE
  DE --> AK
  AK --> LS
  PC -.->|strategy hints| LS
  PI -.->|relevant context| LS
  LS -->|k candidates| GL
  EM -->|768-dim embeddings| GL
  GL -->|sorted by energy| SB
  SB -->|result + feedback| PC

  style GL fill:#2d5016,color:#fff
  style LS fill:#1a3a5c,color:#fff
  style EM fill:#1a3a5c,color:#fff
  style DE fill:#5c3a1a,color:#fff

A.T.L.A.S runs entirely on K3s with a single GPU. The Confidence Router estimates task difficulty from 4 signals and selects how many candidates to generate (k=0 to k=20). The Best-of-K Pipeline generates candidates via speculative decoding (~100 tok/s), scores them with the Geometric Lens energy field, and tests them in an isolated Sandbox with early exit on first pass. A Pattern Cache with Ebbinghaus memory decay stores successful strategies for future routing.

The system also includes an optional MaaS layer (API Portal + LLM Proxy) for multi-user access with JWT auth, API key management, and rate limiting.

Full architecture details: docs/ARCHITECTURE.md

The Geometric Lens

The Lens implements an ARM-EBM (Adaptive Riemannian Metric / Energy-Based Model) duality. A cost field C(x) maps code embeddings to scalar energy: passing code concentrates near energy 2.99, failing code near 24.73 (under self-embeddings; V2.5.1 results).


Candidate verifier	With 5120-dim self-embeddings, C(x) selects the passing candidate 87.8% of the time on mixed-result tasks (+39.5pp vs random, p < 0.000001). Val AUC 0.9934. Reverse energy selects only 4.3%, proving a strong directional signal.
Difficulty router	C(x) energy perfectly stratifies task difficulty: Q1 (low energy) = 100% solvable, Q4 (high energy) = 0.3%. Dual use as verifier + router validated.
Embedding source matters	Under 768-dim nomic embeddings (V2.5), C(x) ≈ random (+0.6pp). V2.5.1 confirmed this was an embedding source limitation — the Lens requires the model's own internal representations.

G(x) metric tensor contributes zero value at any correction strength and will be removed or redesigned for V3 (5.2M parameters, 0.0pp net contribution).

Quick Start

# 1. Clone
git clone https://github.com/itigges22/A.T.L.A.S.git && cd A.T.L.A.S

# 2. Configure
cp A.T.L.A.S.conf.example A.T.L.A.S.conf
# Edit A.T.L.A.S.conf: set MODEL_PATH, DATA_DIR, GPU device

# 3. Install
sudo ./scripts/install.sh

# 4. Verify
./scripts/verify-install.sh

# 5. Run benchmark
benchmark/run_v2_benchmark.sh

See docs/SETUP.md for full installation instructions.

Hardware Requirements

Resource	Minimum	Tested
Python	3.10+	3.11
GPU VRAM	16 GB	RTX 5060 Ti 16 GB
System RAM	14 GB	16 GB
Storage	~20 GB	150 GB SSD
OS	RHEL 9 / Ubuntu 24	RHEL 9 (Proxmox VM)

Project Structure

api-portal/      API key management portal (JWT auth, web UI)
benchmark/       V2 benchmark suite (LCB, GPQA, SciCode, Custom, IFBench)
docs/            Architecture, setup, configuration, troubleshooting
manifests/       K3s deployment manifests
rag-api/         Core API: Geometric Lens, router, RAG, cache
llama-server/    llama.cpp server container
A.T.L.A.S/sandbox/   Isolated code execution environment
scripts/         Installation and management scripts
tests/           Test suite

Documentation

Document	Description
ARCHITECTURE.md	Full system architecture, component deep-dives, data flows
V2_5_ABLATION_STUDY.md	Geometric Lens ablation results and analysis
V2_TO_V2_5_MIGRATION.md	Two-server sidecar migration details
SETUP.md	Installation and deployment guide
CONFIGURATION.md	Configuration reference
API.md	API endpoint documentation
TROUBLESHOOTING.md	Common issues and solutions

Roadmap

V2.5.1 — Embedding Source Hypothesis (CONFIRMED, 2026-02-23)

V2.5.1 confirmed that the V2.5 finding (Lens ≈ random) was caused by the embedding source switch from self-embeddings to nomic, not a Lens architecture failure.

Result: C(x) selects correctly 87.8% on mixed tasks (+39.5pp vs random, p < 0.000001) with 5120-dim self-embeddings
G(x): Zero value at any alpha. Remove or fundamentally redesign.
Next step: Restore self-embeddings in production while maintaining spec decode throughput

V3 — Performance Target

V3 targets 70%+ LiveCodeBench through C(x) candidate verification (87.8% accuracy), test synthesis for the remaining 12.2%, and difficulty-adaptive routing. The core thesis: a frozen model with the right selection and routing infrastructure can match models 10x its size. V2.5.1 resolved the blocking dependency — the Lens is the verifier, and V3 Phase 4 builds a test synthesis module for cases beyond C(x)'s ceiling.

License

Licensed under the A.T.L.A.S Source Available License v1.0 -- see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
api-portal		api-portal
atlas		atlas
benchmark		benchmark
docs		docs
llama-server		llama-server
llm-proxy		llm-proxy
manifests		manifests
rag-api		rag-api
scripts		scripts
templates		templates
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
atlas.conf.example		atlas.conf.example
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A.T.L.A.S

Benchmark Results

Architecture

The Geometric Lens

Quick Start

Hardware Requirements

Project Structure

Documentation

Roadmap

V2.5.1 — Embedding Source Hypothesis (CONFIRMED, 2026-02-23)

V3 — Performance Target

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

A.T.L.A.S

Benchmark Results

Architecture

The Geometric Lens

Quick Start

Hardware Requirements

Project Structure

Documentation

Roadmap

V2.5.1 — Embedding Source Hypothesis (CONFIRMED, 2026-02-23)

V3 — Performance Target

License

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages