
Llama inference implementation in Scala #82

Open
szymon-rd wants to merge 9 commits into main from llm.scala

Conversation


@szymon-rd commented on Jan 31, 2026

WIP (shows spurious conflicts because the branch still needs to be rebased).
The DSL still needs cleanup; GShared is awkward in its current form.
Some source files are still to be removed.

- Bump SPIR-V version to 1.3 (required for GroupNonUniform ops)
- Add Float16 capability to SPIR-V headers
- Fix loop-invariant expression hoisting in GIOCompiler (see the sketch after this list):
  - Expressions defined outside loops but first referenced inside
    were being compiled in the loop body, causing SPIR-V dominance errors
  - Invariant expressions are now hoisted before the loop structure
  - Loop-dependent and scope-dependent (When, etc.) expressions are filtered out and not hoisted
- Add AGENTS.md with codebase documentation
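
Roughly, the hoisting rule looks like the sketch below. The node shapes and helper names (`Expr`, `LoopVar`, `invariant`, `hoist`) are hypothetical; GIOCompiler's actual AST and compile pipeline differ.

```scala
// Hypothetical GIO node shapes; the real GIOCompiler AST looks different.
sealed trait Expr { def deps: Set[Expr] }
final case class Const(v: Float) extends Expr { val deps = Set.empty[Expr] }
final case class BinOp(a: Expr, b: Expr) extends Expr { val deps: Set[Expr] = Set(a, b) }
final case class LoopVar(name: String) extends Expr { val deps = Set.empty[Expr] }
final case class When(cond: Expr, body: Expr) extends Expr { val deps: Set[Expr] = Set(cond, body) }

// An expression is safe to hoist above the loop header if it is not a scope
// construct (When, ...) and does not transitively depend on the loop variable.
def invariant(e: Expr, iter: LoopVar): Boolean = e match {
  case `iter`  => false
  case _: When => false
  case _       => e.deps.forall(invariant(_, iter))
}

// Split the expressions first referenced inside the loop body into a hoisted
// prefix (compiled before the loop structure, so definitions dominate uses)
// and the ones that must stay inside the body.
def hoist(firstUsedInBody: List[Expr], iter: LoopVar): (List[Expr], List[Expr]) =
  firstUsedInBody.partition(invariant(_, iter))
```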

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

For repeated execute() calls with the same pipeline and buffers,
skip the O(n) GExecution tree traversal by caching interpret results.
Keyed by (execution identity, layout bindings identity hash).
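
A minimal sketch of the caching idea, with `Exec`, `Bindings` and `Result` standing in for the real GExecution, layout-binding and interpret-result types (the real cache lives inside the executor):

```scala
import scala.collection.mutable

final case class CacheKey(executionId: Int, bindingsId: Int)

final class InterpretCache[Exec <: AnyRef, Bindings <: AnyRef, Result](
    interpret: (Exec, Bindings) => Result) {

  private val cache = mutable.HashMap.empty[CacheKey, Result]

  // Keyed by object identity, not value equality: re-running the exact same
  // GExecution tree with the exact same buffer bindings hits the cache, while
  // binding different buffers produces a new key and a fresh O(n) traversal.
  def resultFor(exec: Exec, bindings: Bindings): Result = {
    val key = CacheKey(System.identityHashCode(exec), System.identityHashCode(bindings))
    cache.getOrElseUpdate(key, interpret(exec, bindings))
  }
}
```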

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

On cache hit (same pipeline + same buffers):
- Wait for previous GPU fence
- Resubmit same command buffer
- Skip tree traversal, descriptor allocation, command recording

Eliminates O(n) per-call overhead in decode loop.
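
The shape of that fast path is sketched below; `Fence`, `CommandBuffer` and `recordFully` are hypothetical stand-ins for the Vulkan-backed pieces in ExecutionHandler:

```scala
trait Fence         { def waitAndReset(): Unit }
trait CommandBuffer { def submit(signal: Fence): Unit }

final case class Recorded(cmd: CommandBuffer, fence: Fence)

final class CachedExecutor(recordFully: () => Recorded) {
  private var last: Option[Recorded] = None

  def execute(cacheHit: Boolean): Unit = last match {
    case Some(r) if cacheHit =>
      r.fence.waitAndReset() // wait for the previous GPU run of this command buffer
      r.cmd.submit(r.fence)  // resubmit as-is: no traversal, descriptors, or recording
    case _ =>
      val r = recordFully()  // slow path: traverse the GExecution tree and record commands
      r.cmd.submit(r.fence)
      last = Some(r)
  }
}
```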

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Replace per-submission fences with a single timeline semaphore:
- GPU-GPU sync via semaphore wait/signal (no CPU involvement)
- Only sync to CPU when reading results (unavoidable for sampling)
- Eliminates fence creation/destruction overhead per token

Performance: 57.8 -> 62.0 tok/s (+8%)
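
A sketch of the submission pattern, assuming a hypothetical `Timeline` wrapper around a Vulkan timeline semaphore; the project's actual Vulkan layer is not shown here:

```scala
trait Timeline {
  def submit(cmd: AnyRef, waitValue: Long, signalValue: Long): Unit // queued GPU wait/signal
  def hostWait(value: Long): Unit                                   // blocks the CPU
}

final class DecodeLoop(timeline: Timeline) {
  private var tick = 0L

  // Each dispatch waits on the previous counter value and signals the next one,
  // so per-token ordering is enforced entirely on the GPU queue with no fences.
  def dispatch(cmd: AnyRef): Unit = {
    timeline.submit(cmd, waitValue = tick, signalValue = tick + 1)
    tick += 1
  }

  // Only block the host when results are actually needed (e.g. logits for sampling).
  def readBack[A](read: => A): A = {
    timeline.hostWait(tick)
    read
  }
}
```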

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Optimize F16MatmulVecHybridProgram to compute 2 output rows per workgroup
instead of 8 warps computing 1 row each. This matches llama.cpp's approach.

Key changes:
- Change WARPS_PER_WORKGROUP=8 to NUM_ROWS=2, BLOCK_SIZE=32 (single warp)
- Each workgroup computes 2 consecutive output rows sharing input loads
- Use separate accumulation loops per row (DSL limitation with Vec2)
- Use for-comprehension to sequence writes (DSL bug workaround)

Performance: 20 tok/s → 70 tok/s on RTX 2070 Max-Q
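
As a scalar model of what one workgroup now computes: the real kernel is written in the GPU DSL with a single 32-lane warp and a subgroup reduction, and `workgroupRows` and its parameters are illustrative names, not the actual program.

```scala
val NUM_ROWS   = 2   // output rows per workgroup (previously: 8 warps x 1 row each)
val BLOCK_SIZE = 32  // one warp; lanes stride over the input vector

def workgroupRows(row0: Int, weights: Array[Array[Float]], x: Array[Float]): Array[Float] = {
  val out = new Array[Float](NUM_ROWS)
  // One accumulation loop per row (the DSL cannot yet accumulate into a Vec2),
  // but both rows reuse the same strided loads of x on the GPU.
  for (r <- 0 until NUM_ROWS) {
    var acc = 0.0f
    var i = 0
    while (i < x.length) {                // on the GPU, each of the 32 lanes sums a strided
      acc += weights(row0 + r)(i) * x(i)  // slice and the partials are subgroup-reduced
      i += 1
    }
    out(r) = acc
  }
  out
}
```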

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Apply same multi-row optimization to output projection kernel.
Each workgroup now computes 4 vocab logits instead of 8 warps
computing 1 each.

Tested NUM_ROWS values:
- NUM_ROWS=2: ~70 tok/s (too many workgroups)
- NUM_ROWS=4: ~73 tok/s (best)
- NUM_ROWS=8: ~67 tok/s (too much loop overhead)

Performance: 70 → 73 tok/s (+4%)
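
The tradeoff behind the sweep, as a rough model; `vocabSize` is left as a parameter rather than a measured value:

```scala
// ceil(vocabSize / numRows) workgroups are dispatched per output projection.
def workgroupCount(vocabSize: Int, numRows: Int): Int =
  (vocabSize + numRows - 1) / numRows

// Going from NUM_ROWS = 2 to 4 halves the dispatch/scheduling overhead; NUM_ROWS = 8
// halves it again but doubles each workgroup's accumulation loop relative to 4.
```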

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>

Previously, ExecutionHandler tracked ALL buffer bindings as "dirty" after each
dispatch, causing unnecessary barriers between operations that only READ the
same buffer. This serialized Q/K/V matmuls that should run in parallel.

Changes:
- Add getWrittenBuffers() to DSLCompiler to extract written buffers from GIO
- Extend SpirvProgram to track per-binding Read/Write/ReadWrite operations
- Modify ExecutionHandler to only mark WRITTEN bindings as dirty
- Read-after-read is now barrier-free, enabling parallel execution

This reduces unnecessary pipeline barriers and allows independent read
operations (like Q/K/V matmuls reading from attnNormOut) to overlap on GPU.

Benchmark: ~35% improvement in tok/s on LLM inference (64.7 → 87.2 tok/s)
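
The dirty-buffer tracking amounts to something like the sketch below, where `Access` mirrors the per-binding Read/Write/ReadWrite metadata and `Buf` stands in for a buffer binding; recording the actual Vulkan barrier is left out.

```scala
import scala.collection.mutable

sealed trait Access
case object Read      extends Access
case object Write     extends Access
case object ReadWrite extends Access

final class BarrierTracker[Buf] {
  private val dirty = mutable.Set.empty[Buf] // buffers written by earlier dispatches

  // A barrier is only needed when this dispatch touches a buffer that a previous
  // dispatch wrote; read-after-read stays barrier-free, so e.g. the Q/K/V matmuls
  // that all read attnNormOut can overlap on the GPU.
  def needsBarrier(accesses: Map[Buf, Access]): Boolean =
    accesses.keysIterator.exists(dirty.contains)

  def recordDispatch(accesses: Map[Buf, Access]): Unit = {
    if (needsBarrier(accesses)) dirty.clear() // the recorded barrier flushes prior writes
    accesses.foreach {
      case (buf, Write | ReadWrite) => dirty += buf // only WRITTEN bindings become dirty
      case _                        => ()           // pure reads don't dirty anything
    }
  }
}
```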

💘 Generated with Crush

Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>