
Llama inference implementation in Scala #82

Open
szymon-rd wants to merge 9 commits into main from llm.scala

Conversation


@szymon-rd commented on Jan 31, 2026

WIP (shows spurious conflicts because the branch still needs to be rebased).
The DSL still needs cleanup; GShared is awkward in its current form.
Some source files are still to be removed.

- Bump SPIR-V version to 1.3 (required for GroupNonUniform ops)
- Add Float16 capability to SPIR-V headers
- Fix loop-invariant expression hoisting in GIOCompiler (see the sketch after this list):
  - Expressions defined outside loops but first referenced inside
    were being compiled in the loop body, causing SPIR-V dominance errors
  - Invariant expressions are now hoisted before the loop structure
  - Loop-dependent and scope-dependent (When, etc.) expressions are filtered out and not hoisted
- Add AGENTS.md with codebase documentation
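
Roughly, the hoisting rule looks like the sketch below. The node shapes and helper names (`Expr`, `LoopVar`, `invariant`, `hoist`) are hypothetical; GIOCompiler's actual AST and compile pipeline differ.

```scala
// Hypothetical GIO node shapes; the real GIOCompiler AST looks different.
sealed trait Expr { def deps: Set[Expr] }
final case class Const(v: Float) extends Expr { val deps = Set.empty[Expr] }
final case class BinOp(a: Expr, b: Expr) extends Expr { val deps: Set[Expr] = Set(a, b) }
final case class LoopVar(name: String) extends Expr { val deps = Set.empty[Expr] }
final case class When(cond: Expr, body: Expr) extends Expr { val deps: Set[Expr] = Set(cond, body) }

// An expression is safe to hoist above the loop header if it is not a scope
// construct (When, ...) and does not transitively depend on the loop variable.
def invariant(e: Expr, iter: LoopVar): Boolean = e match {
  case `iter`  => false
  case _: When => false
  case _       => e.deps.forall(invariant(_, iter))
}

// Split the expressions first referenced inside the loop body into a hoisted
// prefix (compiled before the loop structure, so definitions dominate uses)
// and the ones that must stay inside the body.
def hoist(firstUsedInBody: List[Expr], iter: LoopVar): (List[Expr], List[Expr]) =
  firstUsedInBody.partition(invariant(_, iter))
```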

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

For repeated execute() calls with the same pipeline and buffers,
skip the O(n) GExecution tree traversal by caching interpret results.
Keyed by (execution identity, layout bindings identity hash).
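
A minimal sketch of the caching idea, with `Exec`, `Bindings` and `Result` standing in for the real GExecution, layout-binding and interpret-result types (the real cache lives inside the executor):

```scala
import scala.collection.mutable

final case class CacheKey(executionId: Int, bindingsId: Int)

final class InterpretCache[Exec <: AnyRef, Bindings <: AnyRef, Result](
    interpret: (Exec, Bindings) => Result) {

  private val cache = mutable.HashMap.empty[CacheKey, Result]

  // Keyed by object identity, not value equality: re-running the exact same
  // GExecution tree with the exact same buffer bindings hits the cache, while
  // binding different buffers produces a new key and a fresh O(n) traversal.
  def resultFor(exec: Exec, bindings: Bindings): Result = {
    val key = CacheKey(System.identityHashCode(exec), System.identityHashCode(bindings))
    cache.getOrElseUpdate(key, interpret(exec, bindings))
  }
}
```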

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

On cache hit (same pipeline + same buffers):
- Wait for previous GPU fence
- Resubmit same command buffer
- Skip tree traversal, descriptor allocation, command recording

Eliminates O(n) per-call overhead in decode loop.
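
The shape of that fast path is sketched below; `Fence`, `CommandBuffer` and `recordFully` are hypothetical stand-ins for the Vulkan-backed pieces in ExecutionHandler:

```scala
trait Fence         { def waitAndReset(): Unit }
trait CommandBuffer { def submit(signal: Fence): Unit }

final case class Recorded(cmd: CommandBuffer, fence: Fence)

final class CachedExecutor(recordFully: () => Recorded) {
  private var last: Option[Recorded] = None

  def execute(cacheHit: Boolean): Unit = last match {
    case Some(r) if cacheHit =>
      r.fence.waitAndReset() // wait for the previous GPU run of this command buffer
      r.cmd.submit(r.fence)  // resubmit as-is: no traversal, descriptors, or recording
    case _ =>
      val r = recordFully()  // slow path: traverse the GExecution tree and record commands
      r.cmd.submit(r.fence)
      last = Some(r)
  }
}
```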

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Replace per-submission fences with a single timeline semaphore:
- GPU-GPU sync via semaphore wait/signal (no CPU involvement)
- Only sync to CPU when reading results (unavoidable for sampling)
- Eliminates fence creation/destruction overhead per token

Performance: 57.8 -> 62.0 tok/s (+8%)
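
A sketch of the submission pattern, assuming a hypothetical `Timeline` wrapper around a Vulkan timeline semaphore; the project's actual Vulkan layer is not shown here:

```scala
trait Timeline {
  def submit(cmd: AnyRef, waitValue: Long, signalValue: Long): Unit // queued GPU wait/signal
  def hostWait(value: Long): Unit                                   // blocks the CPU
}

final class DecodeLoop(timeline: Timeline) {
  private var tick = 0L

  // Each dispatch waits on the previous counter value and signals the next one,
  // so per-token ordering is enforced entirely on the GPU queue with no fences.
  def dispatch(cmd: AnyRef): Unit = {
    timeline.submit(cmd, waitValue = tick, signalValue = tick + 1)
    tick += 1
  }

  // Only block the host when results are actually needed (e.g. logits for sampling).
  def readBack[A](read: => A): A = {
    timeline.hostWait(tick)
    read
  }
}
```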

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Optimize F16MatmulVecHybridProgram to compute 2 output rows per workgroup
instead of 8 warps computing 1 row each. This matches llama.cpp's approach.

Key changes:
- Change WARPS_PER_WORKGROUP=8 to NUM_ROWS=2, BLOCK_SIZE=32 (single warp)
- Each workgroup computes 2 consecutive output rows sharing input loads
- Use separate accumulation loops per row (DSL limitation with Vec2)
- Use for-comprehension to sequence writes (DSL bug workaround)

Performance: 20 tok/s → 70 tok/s on RTX 2070 Max-Q
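
As a scalar model of what one workgroup now computes: the real kernel is written in the GPU DSL with a single 32-lane warp and a subgroup reduction, and `workgroupRows` and its parameters are illustrative names, not the actual program.

```scala
val NUM_ROWS   = 2   // output rows per workgroup (previously: 8 warps x 1 row each)
val BLOCK_SIZE = 32  // one warp; lanes stride over the input vector

def workgroupRows(row0: Int, weights: Array[Array[Float]], x: Array[Float]): Array[Float] = {
  val out = new Array[Float](NUM_ROWS)
  // One accumulation loop per row (the DSL cannot yet accumulate into a Vec2),
  // but both rows reuse the same strided loads of x on the GPU.
  for (r <- 0 until NUM_ROWS) {
    var acc = 0.0f
    var i = 0
    while (i < x.length) {                // on the GPU, each of the 32 lanes sums a strided
      acc += weights(row0 + r)(i) * x(i)  // slice and the partials are subgroup-reduced
      i += 1
    }
    out(r) = acc
  }
  out
}
```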

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Apply same multi-row optimization to output projection kernel.
Each workgroup now computes 4 vocab logits instead of 8 warps
computing 1 each.

Tested NUM_ROWS values:
- NUM_ROWS=2: ~70 tok/s (too many workgroups)
- NUM_ROWS=4: ~73 tok/s (best)
- NUM_ROWS=8: ~67 tok/s (too much loop overhead)

Performance: 70 → 73 tok/s (+4%)
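
The tradeoff behind the sweep, as a rough model; `vocabSize` is left as a parameter rather than a measured value:

```scala
// ceil(vocabSize / numRows) workgroups are dispatched per output projection.
def workgroupCount(vocabSize: Int, numRows: Int): Int =
  (vocabSize + numRows - 1) / numRows

// Going from NUM_ROWS = 2 to 4 halves the dispatch/scheduling overhead; NUM_ROWS = 8
// halves it again but doubles each workgroup's accumulation loop relative to 4.
```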

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Previously, ExecutionHandler tracked ALL buffer bindings as "dirty" after each
dispatch, causing unnecessary barriers between operations that only READ the
same buffer. This serialized Q/K/V matmuls that should run in parallel.

Changes:
- Add getWrittenBuffers() to DSLCompiler to extract written buffers from GIO
- Extend SpirvProgram to track per-binding Read/Write/ReadWrite operations
- Modify ExecutionHandler to only mark WRITTEN bindings as dirty
- Read-after-read is now barrier-free, enabling parallel execution

This reduces unnecessary pipeline barriers and allows independent read
operations (like Q/K/V matmuls reading from attnNormOut) to overlap on GPU.

Benchmark: ~35% improvement in tok/s on LLM inference (64.7 → 87.2 tok/s)
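
The dirty-buffer tracking amounts to something like the sketch below, where `Access` mirrors the per-binding Read/Write/ReadWrite metadata and `Buf` stands in for a buffer binding; recording the actual Vulkan barrier is left out.

```scala
import scala.collection.mutable

sealed trait Access
case object Read      extends Access
case object Write     extends Access
case object ReadWrite extends Access

final class BarrierTracker[Buf] {
  private val dirty = mutable.Set.empty[Buf] // buffers written by earlier dispatches

  // A barrier is only needed when this dispatch touches a buffer that a previous
  // dispatch wrote; read-after-read stays barrier-free, so e.g. the Q/K/V matmuls
  // that all read attnNormOut can overlap on the GPU.
  def needsBarrier(accesses: Map[Buf, Access]): Boolean =
    accesses.keysIterator.exists(dirty.contains)

  def recordDispatch(accesses: Map[Buf, Access]): Unit = {
    if (needsBarrier(accesses)) dirty.clear() // the recorded barrier flushes prior writes
    accesses.foreach {
      case (buf, Write | ReadWrite) => dirty += buf // only WRITTEN bindings become dirty
      case _                        => ()           // pure reads don't dirty anything
    }
  }
}
```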

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>