
feat(benchmark): Add CPU benchmark tool with context length sweep #639

Open

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main from huangzhenhua111:pr-627

Conversation

@huangzhenhua111 commented Feb 14, 2026

Added a CPU benchmark tool for context length sweeps. The existing benchmark tools had PP/TG settings but no way to automate testing across different context lengths, so I wrote a bash script that sweeps from 256 to 4096 tokens and logs everything to CSV.

Tested on my machine (WSL2, Ryzen 7 6800H) with TinyLLaMA fp32. The script runs two modes per context length: one measures TTFT, the other measures decode throughput. Also added KV cache size estimation.

Usage is in the README. Basically just export BIN/MODEL/CFG and run the sweep script, or use the benchmark binary directly for single runs.

Code structure:

  • Added Llama.hpp for LLaMA/TinyLLaMA models
  • Updated BenchmarkTemplate to support KV estimation
  • Scripts for sweeping + plotting results
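The KV cache size estimation mentioned above follows the usual LLaMA-style formula: two tensors (K and V) per layer, each of size num_kv_heads × head_dim × seq_len × bytes per element. A minimal sketch of that arithmetic in Python — the TinyLlama-like config values below are illustrative assumptions, not values read from this PR:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Estimate KV cache size: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative values loosely based on TinyLlama-1.1B (22 layers, 4 KV heads,
# head_dim 64); fp32 means 4 bytes per element. These are assumptions for the
# example, not the PR's actual config.
est = kv_cache_bytes(num_layers=22, num_kv_heads=4, head_dim=64,
                     seq_len=2048, bytes_per_elem=4)
print(f"{est / (1024 ** 2):.1f} MiB")  # 88.0 MiB at a 2048-token context
```

The estimate grows linearly with context length, which is exactly what the sweep is designed to expose.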

coderabbitai bot commented Feb 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds benchmark artifact ignores and docs, context‑sweep orchestration and plotting tools, offline GEMM/GEMV shape logging in matmul, an x86 SIMD include, extended benchmark CLI with CSV/multi‑run/cooldown/KV estimation, and a new LLaMA benchmark implementation.

Changes

  • Repo ignores (.gitignore): Append ignore patterns for local benchmark artifacts (perf.data, perf.data.old, bench_context/, snapshots/, *.bak, *.bak.*, bench_mem_context.csv).
  • Bench artifacts & snapshot docs (bench_artifacts/README.md, bench_artifacts/snapshot.md): Add documentation and a profiling snapshot describing the context-sweep workflow, outputs (data/, plots/, snapshot.md), plotting steps, and profiling findings for TinyLlama on x86.
  • Matmul shape logging (mllm/backends/cpu/kernels/common/ggml/matmul.cpp): Add anonymous-namespace shape logging (MMShapeKey, MMAgg, mm_shape_init, mm_shape_record) to aggregate GEMM/GEMV shapes and FLOPs with thread safety and atexit reporting.
  • SIMD header (mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp): Include <immintrin.h> to enable x86 SIMD intrinsics.
  • Context sweep & snapshot tooling (sweep_context_v2.sh, scripts/make_snapshot_nopandas.py): Add a Bash orchestrator for context sweeps (run launcher, log parsing, CSV aggregation, KV memory estimator) and a Python script to normalize CSVs and emit summary CSVs and PNG plots.
  • Benchmark CLI & registry (tools/mllm-llm-benchmark/main.cpp, tools/mllm-llm-benchmark/models/All.hpp): Enhance the CLI with multi-run, cooldown, CSV output, schema_version, kv_dtype_bytes, and pp/tg pair support; make createBenchmark inline and add LLaMA/TinyLlama model-name detection.
  • Benchmark interfaces & LLaMA model (tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp, tools/mllm-llm-benchmark/models/Llama.hpp): Add KVCacheEstimateInfo and a virtual kvEstimateInfo(); implement Llama_Benchmark with init, warmup, run, printModelInfo, clear, and KV-cache estimation.

Sequence Diagram(s)

sequenceDiagram
  participant Sweep as "sweep_context_v2.sh"
  participant Bench as "mllm-llm-benchmark (binary)"
  participant Model as "Llama_Benchmark"
  participant FS as "Filesystem / Logs / CSV"
  participant Plot as "make_snapshot_nopandas.py"

  Sweep->>Bench: launch run (mode, pp, tg, threads)
  Bench->>Model: init()/warmup()/run() (prefill + decode)
  Model-->>Bench: timing metrics, KV estimate
  Bench-->>FS: append CSV row, write per-run logs, perf data
  Sweep->>FS: collect logs, extract metrics
  Sweep->>Plot: feed CSV
  Plot-->>FS: write summary CSV and PNGs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • yirongjie
  • chenghuaWang
  • oreomaker

Poem

"🐰 I hopped through logs and tiny‑llama trails,
I counted GEMMs and plotted perf scales,
CSV in paw, snapshots in sight,
I logged each run from morning to night,
carrots, charts, and kernels take flight!"

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 21.05%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Merge Conflict Detection — ⚠️ Warning: ❌ Merge conflicts detected (66 files):

⚔️ .gitignore (content)
⚔️ README-ZH.md (content)
⚔️ README.md (content)
⚔️ docs/qnn_backend/index.rst (content)
⚔️ docs/qnn_backend/setup_env.rst (content)
⚔️ examples/llama_qnn_aot/compile.cpp (content)
⚔️ examples/llama_qnn_aot/compile_sha.cpp (content)
⚔️ examples/llama_qnn_aot/modeling_llama_qnn_aot.hpp (content)
⚔️ examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp (content)
⚔️ examples/qwen2_qnn_aot/compile.cpp (content)
⚔️ examples/qwen2_qnn_aot/compile_sha.cpp (content)
⚔️ examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot.hpp (content)
⚔️ examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp (content)
⚔️ examples/qwen3_qnn_aot/aot_run.cpp (content)
⚔️ examples/qwen3_qnn_aot/compile.cpp (content)
⚔️ examples/qwen3_qnn_aot/compile_sha.cpp (content)
⚔️ examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (content)
⚔️ examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp (content)
⚔️ mllm/CMakeLists.txt (content)
⚔️ mllm/backends/ascend/AscendBackend.cpp (content)
⚔️ mllm/backends/ascend/AscendCommon.cpp (content)
⚔️ mllm/backends/ascend/AscendCommon.hpp (content)
⚔️ mllm/backends/ascend/ops/AscendElewiseOps.cpp (content)
⚔️ mllm/backends/ascend/ops/AscendElewiseOps.hpp (content)
⚔️ mllm/backends/cpu/kernels/common/ggml/matmul.cpp (content)
⚔️ mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp (content)
⚔️ mllm/backends/qnn/QNNAllocator.cpp (content)
⚔️ mllm/backends/qnn/QNNAllocator.hpp (content)
⚔️ mllm/backends/qnn/QNNBackend.cpp (content)
⚔️ mllm/backends/qnn/QNNBackend.hpp (content)
⚔️ mllm/backends/qnn/QNNModel.cpp (content)
⚔️ mllm/backends/qnn/QNNModel.hpp (content)
⚔️ mllm/backends/qnn/QNNUtils.cpp (content)
⚔️ mllm/backends/qnn/QNNUtils.hpp (content)
⚔️ mllm/backends/qnn/aot/QnnWrappersAPI.cpp (content)
⚔️ mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (content)
⚔️ mllm/backends/qnn/aot/visitor/CastType.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/KVCacheManager.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/KVCacheManager.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/PromptProcessor.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/PromptProcessor.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/TokenGenerator.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/TokenGenerator.hpp (content)
⚔️ mllm/backends/qnn/passes/QNNGraphBuildPass.cpp (content)
⚔️ mllm/compile/ir/Node.cpp (content)
⚔️ mllm/compile/ir/Node.hpp (content)
⚔️ mllm/core/DataTypes.cpp (content)
⚔️ mllm/core/DataTypes.hpp (content)
⚔️ mllm/core/SlicePrimitives.hpp (content)
⚔️ mllm/core/aops/CmpOp.cpp (content)
⚔️ mllm/engine/Context.cpp (content)
⚔️ mllm/engine/Context.hpp (content)
⚔️ mllm/mllm.cpp (content)
⚔️ mllm/models/minicpm_o2_6/modeling_resampler.hpp (content)
⚔️ mllm/models/minicpm_o2_6/streaming_generation.cpp (content)
⚔️ mllm/nn/Functional.cpp (content)
⚔️ mllm/nn/Functional.hpp (content)
⚔️ pymllm/backends/qualcomm/transformers/core/qdq.py (content)
⚔️ tasks/build_arm_ascend.yaml (content)
⚔️ tests/ascend/AscendKernelTest.hpp (content)
⚔️ tests/ascend/KernelTest.cpp (content)
⚔️ tools/mllm-llm-benchmark/main.cpp (content)
⚔️ tools/mllm-llm-benchmark/models/All.hpp (content)
⚔️ tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
  • Description check — ❓ Inconclusive: The PR description covers the key contributions (shape capture, KV estimation, context sweep script) with usage details, but lacks the structured sections the template calls for. Resolution: restructure the description using the template sections (Problem Statement, Solution, Changes Made, Testing, Usage) and add specifics on the AOT profiling toolchain objectives and validation results.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title 'feat(benchmark): Add CPU benchmark tool with context length sweep' clearly and specifically summarizes the main change: introducing a new CPU benchmark tool with context-length sweep functionality.




@coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mllm-llm-benchmark/main.cpp (1)

175-222: ⚠️ Potential issue | 🟡 Minor

Average results are computed but never printed to console — only written to CSV.

When --output_csv is not provided, the averages (avg_ttft, avg_prefill_speed, avg_decode_speed, latencies, KV estimates) are computed and then discarded. Users running without CSV output never see the summary. Consider printing a console summary regardless of CSV output.

Proposed addition after line 192
     float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f;
 
+    mllm::print("\n  Average Results (", R, " runs):");
+    mllm::print("    Avg TTFT          :", avg_ttft, "ms");
+    mllm::print("    Avg Prefill Speed :", avg_prefill_speed, "tokens/s");
+    mllm::print("    Avg Decode Speed  :", avg_decode_speed, "tokens/s");
+    mllm::print("    Avg Prefill Lat   :", avg_prefill_ms, "ms");
+    mllm::print("    Avg Decode Lat    :", avg_decode_ms_per_tok, "ms/tok");
+
     // Rough KV cache estimate (bytes)
🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 22-30: The README contains hardcoded personal paths (e.g., BIN,
MODEL, CFG variables pointing to ~/mllm-runok and /home/huangzhenhua/...) which
are not portable; update the example to use neutral placeholders and environment
variables (e.g., $PROJECT_ROOT or $MODEL_DIR and relative paths) for BIN, MODEL,
CFG, and any user-specific directories so other developers can reproduce the
commands without leaking usernames or personal directories.

In `@bench_artifacts/snapshot.md`:
- Around line 12-24: The markdown images are missing alt text causing
accessibility lint MD045; update each image tag to include a short descriptive
alt string that conveys the content (e.g., "Decode latency vs context plot",
"Decode per-token latency vs context plot", "TTFT and prefill vs context plot",
"Peak RSS memory growth plot", "KV estimate growth (MB) plot") — specifically
edit the five image references referencing
plots/context_sweep_v2.decode_heavy.decode_ms.png,
plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png,
plots/context_sweep_v2.prefill_ttft.ttft_ms.png,
plots/context_sweep_v2.memory.peak_rss_gb.png, and
plots/context_sweep_v2.memory.kv_est_mb.png to replace ![](...) with
![descriptive alt text](...) for each.

In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Line 137: The declaration int64_t batch_count_mm = 1; in function mat_mul is
mis-indented (starts at column 0); fix by indenting it to match the surrounding
function body style (align with other local variable declarations inside
mat_mul) so batch_count_mm sits at the same indent level as nearby locals and
code blocks.
- Around line 136-141: Remove the unconditional top-level mm_shape_record(0,
...) in mat_mul (the initial mm_shape_record that uses batch_count_mm) and
instead record shapes only inside the actual compute branches: add
mm_shape_record(0, ...) in the first llamafile sgemm path when batch_count == 1
(inside check_llamafile_sgemm success branch) and add a batched
mm_shape_record(0, ...) in the llamafile batched path (the branch that handles
batch_count > 1), mirroring the existing post-quantize placement; ensure you do
not emit the kind=0 record when later branches will emit kind=1 or kind=2
records to avoid double-counting.

In `@mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp`:
- Line 10: Wrap the unconditional `#include <immintrin.h>` with the same x86 guard
used elsewhere (e.g., `#if defined(__SSE__) || defined(__x86_64__)`), so the
header is only included on x86 builds; locate the include line at the top of
fwd_bshd.hpp and surround it with the existing platform macro guard used for
intrinsics in this file to maintain cross-platform portability.

In `@scripts/make_snapshot_nopandas.py`:
- Around line 8-14: The helpers use bare except which swallows
KeyboardInterrupt/SystemExit and to_int is unused; change both functions to
catch only conversion-related exceptions (e.g., ValueError, TypeError) in
to_float and to_int, and either remove the unused to_int function or use it
where intended (search for to_int and to_float to determine intent) — do not use
a bare except and ensure interrupted signals are allowed to propagate.
- Around line 117-122: The printed "Wrote:" list omits two PNGs—prefill and
decode per-ms plots—so update the print block that uses summary_path, out_dir
and stamp to include the missing file paths
f"{stamp}.prefill_ttft.prefill_ms.png" and f"{stamp}.decode_heavy.decode_ms.png"
alongside the existing entries (the block that currently prints
f"{stamp}.prefill_ttft.ttft_ms.png" and
f"{stamp}.decode_heavy.decode_ms_per_tok.png"); ensure you print
os.path.join(out_dir, ...) for those two filenames so all six generated PNGs are
listed.

In `@sweep_context_v2.sh`:
- Around line 16-18: The default MODEL value contains a hardcoded user-specific
path; update the MODEL variable assignment in sweep_context_v2.sh
(MODEL="${MODEL:-...}") to use $HOME (or another portable relative/config-driven
location) instead of "/home/huangzhenhua", mirroring how BIN and CFG are defined
so the script works for other users/CI environments.
- Line 128: The CSV echo currently hardcodes the model name "tiny_llama"; change
it to use a variable so the correct model is recorded by replacing the hardcoded
string in the echo that writes to "$OUTCSV" with a dynamic value derived from
either a new environment variable MODEL_NAME or from parsing the existing $MODEL
path (e.g., basename or strip extensions) before the echo; update the script to
set MODEL_NAME if unset by extracting it from $MODEL and reference MODEL_NAME in
the echo line so other models (Qwen, Llama-3, etc.) are correctly labeled.

In `@tools/mllm-llm-benchmark/main.cpp`:
- Around line 197-202: Validate the KV dtype byte option (kv_dtype_bytes) right
after it's parsed and convert it into a validated variable (e.g., kv_bpe) that
only allows 1, 2, or 4; if the value is outside that set, log an error and exit
(or clamp/fallback to a safe default) to prevent zero/negative values. Replace
direct uses of kv_dtype_bytes.get() in the KV estimate calculations (the
kv_est_bytes_pp / kv_est_bytes_final assignments) with the validated kv_bpe
variable so bytes_per is always one of 1, 2, or 4.

In `@tools/mllm-llm-benchmark/models/Llama.hpp`:
- Around line 89-92: The clear() override in Llama.hpp is a no-op so the KV
cache persists across runs (LlamaForCausalLM), leaking state between
benchmark->clear() calls from main.cpp and corrupting multi-run / multi-pair
results; to fix, either implement a real KV reset when available (call the
model's KV reset API from clear() once LlamaForCausalLM exposes it) or, as an
immediate mitigation, (1) modify clear() to emit a clear-cache warning into the
benchmark output/CSV and (2) add an argument validation in main.cpp that
enforces --runs == 1 (or refuses >1) when using this Llama implementation, so
benchmark->clear() cannot be relied upon until a proper reset API is added.
- Around line 136-140: The decode throughput calculation in Llama.hpp computes
decode_tokens as (token_count > 0 ? token_count - 1 : 0) and uses that for
r.decode_speed, but Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making
benchmarks inconsistent; update the Qwen3 implementation (in run()) to match
Llama's convention by excluding the first token from decode throughput (use
token_count - 1 with the same guards: ensure decode_tokens is non-negative and
only compute r.decode_speed when decode_us and decode_tokens are > 0) so both
benchmarks use the same definition of decode throughput.
- Around line 17-24: kvEstimateInfo() currently divides cfg_->hidden_size by
cfg_->num_attention_heads without a guard; mirror the defensive behavior used in
printModelInfo(): check cfg_ and that cfg_->num_attention_heads is non‑zero
before computing head_dim. If cfg_->num_attention_heads == 0 (or cfg_ is null)
return std::nullopt (or otherwise avoid the division) so
KVCacheEstimateInfo.head_dim is never computed with a zero divisor; update the
kvEstimateInfo() function to perform this check using the existing cfg_ and
KVCacheEstimateInfo symbols.
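As a reference for the decode-throughput convention requested above (exclude the first generated token, guard against zero counts), here is a minimal Python sketch; the function and parameter names are illustrative, not the benchmark's actual API:

```python
def decode_speed_tokens_per_s(token_count: int, decode_us: float) -> float:
    """Decode throughput excluding the first generated token.

    The first token's latency is attributed to prefill (TTFT), so it is
    not counted toward decode throughput. Returns 0.0 when there is
    nothing to measure rather than dividing by zero.
    """
    decode_tokens = token_count - 1 if token_count > 0 else 0
    if decode_us <= 0 or decode_tokens <= 0:
        return 0.0
    return decode_tokens / (decode_us / 1_000_000)
```

Applying the same definition in both the Llama and Qwen3 implementations keeps cross-model numbers comparable.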
🧹 Nitpick comments (11)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (4)

3-12: Duplicate and unused includes.

<cstdint> is included twice (lines 4 and 12), and <tuple> (line 8) is not used anywhere in this file.

Proposed fix
 #include <algorithm>
-#include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
 #include <cstdint>

74-98: atexit handler accesses file-scope statics — safe only if no detached threads outlive main().

The atexit callback locks g_mm_shape_mu and iterates g_mm_shape. This works because these statics are constructed before the atexit registration (so they are destroyed after the handler runs). However, if any detached thread is still calling mm_shape_record when the process exits, the mutex lock in the handler could race or deadlock. Since this is a profiling/debug feature gated behind an env var, the risk is low — just worth noting.


101-109: Mutex acquisition on every matmul call when shape logging is enabled.

When MLLM_MATMUL_SHAPE_LOG is active, every GEMM invocation contends on g_mm_shape_mu. Since this tool measures latency and throughput, the synchronization overhead could skew the benchmarks it is designed to support. A thread-local accumulation flushed at the end would avoid contention on the hot path.


52-61: Non-English comments reduce accessibility for contributors.

Several comments in this file are in Chinese (lines 52, 60–61, 288, 331, 341). The rest of the codebase uses English. Please translate these for consistency.

tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)

72-72: Consider adding [[nodiscard]] to kvEstimateInfo().

The return value should always be checked by callers. This aligns with the static analysis hint (modernize-use-nodiscard).

Proposed fix
-  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+  [[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
tools/mllm-llm-benchmark/models/All.hpp (1)

31-35: Redundant conditions: "tinyllama" and "tiny_llama" already contain "llama".

The first condition find("llama") != npos already matches any string containing "tinyllama" or "tiny_llama", making the two extra checks dead code. Not a bug, but could be simplified.

Simplified condition
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
tools/mllm-llm-benchmark/main.cpp (1)

22-28: Duplicate stringification macros: STR/STR_HELPER are identical to STRINGIFY/STRINGIFY_INTERNAL (lines 17-18) and appear unused.

STR and STR_HELPER are never referenced in the file. They duplicate STRINGIFY/STRINGIFY_INTERNAL already defined above.

Remove unused macros
-#ifndef MLLM_GIT_COMMIT_HASH
-#define MLLM_GIT_COMMIT_HASH unknown
-#endif
-
-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
-
+#ifndef MLLM_GIT_COMMIT_HASH
+#define MLLM_GIT_COMMIT_HASH unknown
+#endif
scripts/make_snapshot_nopandas.py (1)

1-1: math is imported but never used.

Proposed fix
-import sys, os, csv, math
+import sys, os, csv
sweep_context_v2.sh (3)

64-69: Unreachable guard on line 69.

If cl <= tg, the function returns on line 66. Otherwise pp = cl - tg ≥ 1, so the pp < 1 check on line 69 is dead code. Harmless, but can be removed for clarity.


111-113: rg (ripgrep) is a non-standard dependency.

The metric parsing relies on rg which isn't available by default on most systems. Consider either documenting this prerequisite in the usage comments or falling back to grep -oP (available on most GNU/Linux).

Alternative using grep -oP
-  TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  PREFILL_MS="$(rg -o 'Prefill Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  DECODE_MS="$(rg -o 'Decode Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+  TTFT_MS="$(grep -oP 'TTFT\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  PREFILL_MS="$(grep -oP 'Prefill Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  DECODE_MS="$(grep -oP 'Decode Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"

131-135: A single failed run aborts the entire sweep due to set -e.

Since run_one returns 1 on failure and set -e is active, any failed benchmark run (e.g., OOM at a large context length) will terminate the entire sweep. If the intent is to collect as much data as possible, consider tolerating individual failures:

Proposed fix
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || echo "WARN: decode_heavy cl=$CL failed, continuing..."
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || echo "WARN: prefill_ttft cl=$CL failed, continuing..."
 done

Comment on lines 12 to 24
![](plots/context_sweep_v2.decode_heavy.decode_ms.png)

### **Fig.2: Decode per-token latency vs context (ms/tok)**
![](plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png)

### **Fig.3: TTFT/Prefill vs context (ms)**
![](plots/context_sweep_v2.prefill_ttft.ttft_ms.png)

### **Fig.4: Memory growth (Peak RSS, GB)**
![](plots/context_sweep_v2.memory.peak_rss_gb.png)

### **Fig.5: KV estimate growth (MB, formula-based)**
![](plots/context_sweep_v2.memory.kv_est_mb.png)

⚠️ Potential issue | 🟡 Minor

Images are missing alt text (accessibility).

All five image references use ![]() with empty alt text. Markdownlint MD045 flags this. Adding descriptive alt text improves accessibility and provides context when images fail to load.

Example fix for one image
-![](plots/context_sweep_v2.decode_heavy.decode_ms.png)
+![Decode latency vs context length](plots/context_sweep_v2.decode_heavy.decode_ms.png)

Apply similarly to all five image references.

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 12-12: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 15-15: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 18-18: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 21-21: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 24-24: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` around lines 12 - 24, The markdown images are
missing alt text causing accessibility lint MD045; update each image tag to
include a short descriptive alt string that conveys the content (e.g., "Decode
latency vs context plot", "Decode per-token latency vs context plot", "TTFT and
prefill vs context plot", "Peak RSS memory growth plot", "KV estimate growth
(MB) plot") — specifically edit the five image references referencing
plots/context_sweep_v2.decode_heavy.decode_ms.png,
plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png,
plots/context_sweep_v2.prefill_ttft.ttft_ms.png,
plots/context_sweep_v2.memory.peak_rss_gb.png, and
plots/context_sweep_v2.memory.kv_est_mb.png to replace ![](...) with
![descriptive alt text](...) for each.

Comment on lines 136 to 141
// NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
int64_t batch_count_mm = 1;
for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
mm_shape_record(0, M, N, K,
(uint64_t)batch_count_mm,
(uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));

⚠️ Potential issue | 🟠 Major

Double-counting of GEMM shapes when the first llamafile path fails.

This mm_shape_record(kind=0, ...) fires unconditionally at the top of mat_mul. If the first check_llamafile_sgemm (line 170) fails and execution falls through to the post-quantize llamafile path (line 284), a second mm_shape_record(kind=0, ...) is emitted at line 289. Similarly, if execution reaches the gemv/gemm path (line 330), the top-level kind=0 record is redundant alongside the kind=1/kind=2 records.

This inflates FLOPs totals in the shape log. The top-level record should be removed and shape recording should happen only in the path that actually executes the compute.

Proposed fix — remove unconditional top-level recording
-  // NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
-int64_t batch_count_mm = 1;
-  for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
-  mm_shape_record(0, M, N, K,
-                  (uint64_t)batch_count_mm,
-                  (uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));
-
-
   auto src0_dtype = src0_.dtype();

Then add a shape record inside the first llamafile batch_count == 1 path (around line 176) and the batched path (around line 190), mirroring what is already done for the post-quantize path at line 289.

🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` around lines 136 - 141,
Remove the unconditional top-level mm_shape_record(0, ...) in mat_mul (the
initial mm_shape_record that uses batch_count_mm) and instead record shapes only
inside the actual compute branches: add mm_shape_record(0, ...) in the first
llamafile sgemm path when batch_count == 1 (inside check_llamafile_sgemm success
branch) and add a batched mm_shape_record(0, ...) in the llamafile batched path
(the branch that handles batch_count > 1), mirroring the existing post-quantize
placement; ensure you do not emit the kind=0 record when later branches will
emit kind=1 or kind=2 records to avoid double-counting.

const int N = transpose1 ? src1_shape[src1_shape.size() - 2] : src1_shape[src1_shape.size() - 1];

// NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
int64_t batch_count_mm = 1;

⚠️ Potential issue | 🟡 Minor

Indentation is broken — int64_t batch_count_mm starts at column 0.

This line is inside the mat_mul function body but has no indentation, breaking the file's consistent style.

-int64_t batch_count_mm = 1;
+  int64_t batch_count_mm = 1;
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` at line 137, The
declaration int64_t batch_count_mm = 1; in function mat_mul is mis-indented
(starts at column 0); fix by indenting it to match the surrounding function body
style (align with other local variable declarations inside mat_mul) so
batch_count_mm sits at the same indent level as nearby locals and code blocks.

#include "mllm/core/Parallel.hpp"
#include "mllm/backends/cpu/kernels/common/paged_attn/arch.hpp"
#include "mllm/utils/UnsafeMacros.hpp"
#include <immintrin.h>

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the file to verify the include and its context
cat -n mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 7335


🏁 Script executed:

# Search for other x86-specific includes in the CPU backend
rg -n '#include\s*<(immintrin|xmmintrin|emmintrin|pmmintrin)\.h>' mllm/backends/cpu/ -A2 -B2

Repository: UbiquitousLearning/mllm

Length of output: 2384


🏁 Script executed:

# Check if immintrin.h is used elsewhere and how it's handled
rg -l 'immintrin\.h' mllm/ --type cpp --type hpp

Repository: UbiquitousLearning/mllm

Length of output: 94


Guard the x86-specific include for cross-platform portability.

The unconditional include of <immintrin.h> will fail on non-x86 platforms (ARM, RISC-V, etc.) where this header does not exist. Since all intrinsics usage is properly guarded by #if defined(__SSE__) || defined(__x86_64__) throughout the file, and every other x86 intrinsic include in the codebase follows the same pattern, this include should be similarly guarded to maintain consistency and portability.

Proposed fix
+#if defined(__SSE__) || defined(__x86_64__)
 #include <immintrin.h>
+#endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include <immintrin.h>
+#if defined(__SSE__) || defined(__x86_64__)
+#include <immintrin.h>
+#endif
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp` at line 10, wrap
the unconditional `#include <immintrin.h>` with the same x86 guard used elsewhere
(e.g., `#if defined(__SSE__) || defined(__x86_64__)`), so the header is only
included on x86 builds; locate the include line at the top of fwd_bshd.hpp and
surround it with the existing platform macro guard used for intrinsics in this
file to maintain cross-platform portability.


echo "TTFT=$TTFT_MS ms Prefill=$PREFILL_MS ms Decode=$DECODE_MS ms Decode/tok=$DECODE_PER_TOK ms peakRSS=$PEAK_RSS_KB KB KV_est=$KV_EST_KB KB"

echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,$TTFT_MS,$PREFILL_MS,$DECODE_MS,$DECODE_PER_TOK,$PEAK_RSS_KB,$KV_EST_KB" >> "$OUTCSV"

⚠️ Potential issue | 🟡 Minor

Model name is hardcoded as tiny_llama in CSV output.

If this script is reused for other models (Qwen, Llama-3, etc.), the CSV will contain an incorrect model identifier. Consider deriving it from the $MODEL path or introducing a MODEL_NAME environment variable.

Proposed fix
+MODEL_NAME="${MODEL_NAME:-$(basename "$MODEL" | sed 's/\.[^.]*$//')}"
+
 # (in the CSV echo on line 128)
-  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
+  echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` at line 128, The CSV echo currently hardcodes the model
name "tiny_llama"; change it to use a variable so the correct model is recorded
by replacing the hardcoded string in the echo that writes to "$OUTCSV" with a
dynamic value derived from either a new environment variable MODEL_NAME or from
parsing the existing $MODEL path (e.g., basename or strip extensions) before the
echo; update the script to set MODEL_NAME if unset by extracting it from $MODEL
and reference MODEL_NAME in the echo line so other models (Qwen, Llama-3, etc.)
are correctly labeled.

Comment on lines +197 to +202
if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
const int32_t bytes_per = kv_dtype_bytes.get(); // 1/2/4
// LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
}

⚠️ Potential issue | 🟡 Minor

No validation on kv_dtype_bytes — accepts arbitrary values including 0 and negatives.

The help text states valid values are 1, 2, or 4, but any integer is accepted. A value of 0 would silently produce zero estimates; a negative value would produce negative estimates.

Proposed validation (e.g., after line 89)
+  int kv_bpe = kv_dtype_bytes.get();
+  if (kv_bpe != 1 && kv_bpe != 2 && kv_bpe != 4) {
+    mllm::print("[WARN] --kv_dtype_bytes should be 1, 2, or 4; got:", kv_bpe, "— defaulting to 4");
+    kv_bpe = 4;
+  }

Then use kv_bpe instead of kv_dtype_bytes.get() on line 198.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` around lines 197 - 202, Validate the KV
dtype byte option (kv_dtype_bytes) right after it's parsed and convert it into a
validated variable (e.g., kv_bpe) that only allows 1, 2, or 4; if the value is
outside that set, log an error and exit (or clamp/fallback to a safe default) to
prevent zero/negative values. Replace direct uses of kv_dtype_bytes.get() in the
KV estimate calculations (the kv_est_bytes_pp / kv_est_bytes_final assignments)
with the validated kv_bpe variable so bytes_per is always one of 1, 2, or 4.
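
The LLaMA-style formula in the snippet above is easy to sanity-check offline. A minimal sketch, assuming hypothetical TinyLlama-like config values (22 layers, 4 KV heads, head_dim 64; these numbers are illustrative, not read from the PR):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per):
    """LLaMA-like KV estimate: K and V each hold
    num_layers * num_kv_heads * head_dim * seq_len elements."""
    if bytes_per not in (1, 2, 4):  # the validation the review asks for
        raise ValueError("bytes_per must be 1, 2, or 4")
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per

# fp32 KV cache for a 2048-token context with the assumed config:
print(kv_cache_bytes(22, 4, 64, 2048, 4) // 1024, "KB")  # 90112 KB
```

With validation in place, a bytes_per of 0 or a negative value raises instead of silently producing a zero or negative estimate.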

Comment on lines 17 to 24
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
return info;
}

⚠️ Potential issue | 🟡 Minor

Missing division-by-zero guard in kvEstimateInfo() unlike printModelInfo().

Line 22 computes hidden_size / num_attention_heads without guarding against zero, whereas printModelInfo() (line 58) has a defensive check. For consistency and robustness:

Proposed fix
   KVCacheEstimateInfo info;
   info.num_layers = cfg_->num_hidden_layers;
   info.num_kv_heads = cfg_->num_key_value_heads;
-  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
+  info.head_dim = (cfg_->num_attention_heads > 0)
+                    ? (cfg_->hidden_size / cfg_->num_attention_heads)
+                    : 0;
   return info;
📝 Committable suggestion


Suggested change
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
return info;
}
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = (cfg_->num_attention_heads > 0)
? (cfg_->hidden_size / cfg_->num_attention_heads)
: 0;
return info;
}
🧰 Tools
🪛 Clang (14.0.6)

[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 17 - 24,
kvEstimateInfo() currently divides cfg_->hidden_size by
cfg_->num_attention_heads without a guard; mirror the defensive behavior used in
printModelInfo(): check cfg_ and that cfg_->num_attention_heads is non‑zero
before computing head_dim. If cfg_->num_attention_heads == 0 (or cfg_ is null)
return std::nullopt (or otherwise avoid the division) so
KVCacheEstimateInfo.head_dim is never computed with a zero divisor; update the
kvEstimateInfo() function to perform this check using the existing cfg_ and
KVCacheEstimateInfo symbols.

Comment on lines 89 to 92
void clear() override {
// TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
// For now, keep it as no-op to minimize API changes in PR1.
}
@coderabbitai coderabbitai bot Feb 14, 2026

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 89 - 92, The clear()
override in Llama.hpp is a no-op so the KV cache persists across runs
(LlamaForCausalLM), leaking state between benchmark->clear() calls from main.cpp
and corrupting multi-run / multi-pair results; to fix, either implement a real
KV reset when available (call the model's KV reset API from clear() once
LlamaForCausalLM exposes it) or, as an immediate mitigation, (1) modify clear()
to emit a clear-cache warning into the benchmark output/CSV and (2) add an
argument validation in main.cpp that enforces --runs == 1 (or refuses >1) when
using this Llama implementation, so benchmark->clear() cannot be relied upon
until a proper reset API is added.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


@huangzhenhua111

  1. Please see this CodeRabbit review.
  2. In your sweep_context_v2.sh, does 'cl' mean context length (i.e., prompt length plus test generation length)? In the initial code, 'cl' meant cache length, which may cause some misunderstanding.
  3. I don't see any information about 'cache length' in your code. Why?

https://github.com/UbiquitousLearning/mllm/pull/639/changes#r2807444788

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.


Comment on lines 136 to 140
// NOTE: decode_us is measured from first token timestamp; exclude that first token from decode throughput.
int decode_tokens = (token_count > 0) ? (token_count - 1) : 0;
r.decode_speed = (decode_us > 0 && decode_tokens > 0)
? (static_cast<float>(decode_tokens) / decode_us) * 1e6f
: 0.f;

⚠️ Potential issue | 🟡 Minor

Decode throughput calculation differs from Qwen3 benchmark.

This implementation correctly excludes the first token from decode throughput (token_count - 1), since the first token's latency is attributed to TTFT/prefill. However, Qwen3_W4A32_KAI.hpp's run() uses token_count directly (includes the first token in decode speed), making the two benchmarks non-comparable.

Consider aligning both implementations to the same convention (the Llama approach is arguably more correct).

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 136 - 140, The decode
throughput calculation in Llama.hpp computes decode_tokens as (token_count > 0 ?
token_count - 1 : 0) and uses that for r.decode_speed, but
Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making benchmarks
inconsistent; update the Qwen3 implementation (in run()) to match Llama's
convention by excluding the first token from decode throughput (use token_count
- 1 with the same guards: ensure decode_tokens is non-negative and only compute
r.decode_speed when decode_us and decode_tokens are > 0) so both benchmarks use
the same definition of decode throughput.

@huangzhenhua111
Author

Hi @chenghuaWang, thank you very much for your guidance by email. Could you take a look at this PR? I sincerely hope it can help you.

@chenghuaWang
Collaborator

Thank you for your PR.

  1. Your PR does not benchmark the NPU. This does not match the description in your PR's README.
  2. Please describe what Context Sweep is testing.
  3. We do not object to using AI programming, but this PR has very obvious AI traces and it is unclear what it is doing. Please check this PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@bench_artifacts/snapshot.md`:
- Line 4: The phrase "highly stable" over the 0.67–0.88 ms/tok range is
misleading because that span is ~31% variation; update the sentence in
bench_artifacts/snapshot.md to either replace "highly stable" with "relatively
consistent" or similar, and append the precise variation (≈31% variation) or the
numeric range explicitly to clarify; ensure the revised sentence still notes
that TTFT/prefill latency scales and becomes the primary bottleneck for long
contexts.
- Line 14: The term "VRAM" is incorrect for CPU benchmarks in the "Memory
Planning" section; update the phrase "VRAM lower bounds" to use CPU-appropriate
terminology such as "RAM lower bounds", "system memory lower bounds", or simply
"memory lower bounds" in the sentence under "Memory Planning" so it correctly
refers to x86_64 CPU baseline measurements and aligns with the rest of the
document.
🧹 Nitpick comments (3)
bench_artifacts/snapshot.md (3)

4-4: Add CPU model and system specifications for reproducibility.

Specify the CPU model (e.g., Intel i7-12700K, AMD Ryzen 7 5800X) and other relevant system details (OS version, RAM, compiler flags) to make these benchmark results reproducible and comparable.


6-6: Quantify "heavily concentrated" with profiling percentages.

Replace the vague "heavily concentrated" with specific metrics from perf output, such as "accounts for X% of total CPU time" or "dominates the top-N hotspots by Y%."


14-14: Quantify "strictly aligns" claim with error margins.

The phrase "strictly aligns" is vague. Provide quantitative support, such as "aligns within ±X% error" or "deviates by less than Y MB," to make this claim verifiable and meaningful.

# MLLM x86 TinyLlama: Context Sweep & Profiling Snapshot

## 1. Summary
On the x86_64 CPU baseline (8 threads), TinyLlama's decode per-token latency remains highly stable (~0.67–0.88 ms/tok) across context lengths from CL=512 to 4096. However, TTFT (Time-to-First-Token) / prefill latency scales significantly and becomes the primary compute bottleneck for long contexts.

⚠️ Potential issue | 🟡 Minor

Clarify "highly stable" with 31% variation range.

The latency range 0.67–0.88 ms/tok represents approximately 31% variation. Consider whether "highly stable" accurately characterizes this range, or use terminology like "relatively consistent" with explicit acknowledgment of the variation.

🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 4, The phrase "highly stable" over the
0.67–0.88 ms/tok range is misleading because that span is ~31% variation; update
the sentence in bench_artifacts/snapshot.md to either replace "highly stable"
with "relatively consistent" or similar, and append the precise variation (≈31%
variation) or the numeric range explicitly to clarify; ensure the revised
sentence still notes that TTFT/prefill latency scales and becomes the primary
bottleneck for long contexts.
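
The ≈31% figure quoted above is just the relative spread of the reported range:

```python
lo, hi = 0.67, 0.88  # ms/tok range reported in the snapshot
variation = (hi - lo) / lo
print(f"{variation:.1%}")  # 31.3%
```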


## 3. Implications for Static Graph & AOT
1. **Shape Bucketing:** The heavily reused GEMM shapes captured here provide the exact target dimensions required for AOT static-graph nearest-padding.
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.

⚠️ Potential issue | 🟡 Minor

Incorrect "VRAM" terminology for CPU benchmarks.

This document describes "x86_64 CPU baseline" benchmarks (line 4), but line 14 references "VRAM lower bounds." VRAM (Video RAM) is GPU/NPU memory terminology. For CPU benchmarks, use "RAM," "memory," or "system memory" instead.

📝 Suggested fix
-2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
+2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable RAM lower bounds for AOT memory pre-allocation.
📝 Committable suggestion


Suggested change
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable RAM lower bounds for AOT memory pre-allocation.
🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 14, The term "VRAM" is incorrect for CPU
benchmarks in the "Memory Planning" section; update the phrase "VRAM lower
bounds" to use CPU-appropriate terminology such as "RAM lower bounds", "system
memory lower bounds", or simply "memory lower bounds" in the sentence under
"Memory Planning" so it correctly refers to x86_64 CPU baseline measurements and
aligns with the rest of the document.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@sweep_context_v2.sh`:
- Around line 106-111: The heredoc used to compute DECODE_PER_TOK injects
shell-expanded variables into the Python snippet; change the delimiter to a
quoted heredoc (use <<'PY' instead of <<PY) so the Python here-doc is not
subject to shell expansion, keeping the computation for DECODE_PER_TOK (the
python3 - <<PY block) consistent with the kv_est_kb pattern and preventing
accidental interpolation of $tg and $DECODE_MS.
- Around line 121-124: The loop over context lengths (variable CLS) calls
run_one for tasks "decode_heavy" and "prefill_ttft" but because the script is
running with set -e any non‑zero return from run_one will abort the whole sweep;
change the loop to run each run_one call in a way that captures failures without
exiting (e.g., append || true or capture the exit code into a variable), record
failures per CL (store context lengths or task names that failed), and after the
loop exit non‑zero only if you want to signal overall failure while still
allowing the loop to continue; update the block where run_one is invoked for
decode_heavy and prefill_ttft to use this non‑fatal invocation and add a final
summary report of failures.
🧹 Nitpick comments (2)
sweep_context_v2.sh (2)

74-77: Hardcoded -n tiny_llama benchmark model name.

The model identifier passed to the benchmark binary is hardcoded. Combined with the hardcoded tiny_llama in the CSV (line 118), this means reusing the script for another model (Qwen, Llama-3) requires edits in multiple places. Extract a single MODEL_NAME variable and reference it in both locations.

Proposed fix

Add near the top defaults (e.g., after line 16):

+MODEL_NAME="${MODEL_NAME:-tiny_llama}"

Then update line 75 and line 118:

-    "$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
+    "$BIN" -n "$MODEL_NAME" -m "$MODEL" -c "$CFG" \
-  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,..." >> "$OUTCSV"
+  echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,..." >> "$OUTCSV"

101-113: rg (ripgrep) is not universally available — consider grep -oE for portability.

The script requires ripgrep, which is not installed by default on most Linux distributions or CI images. Standard grep -oE provides equivalent functionality here and improves portability.

Example replacement for line 101
-  TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+  TTFT_MS="$(grep -oE 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | grep -oE '[0-9.]+' | head -n 1 || echo 0)"

Apply the same pattern to lines 102, 103, and 113.

Comment on lines 106 to 111
DECODE_PER_TOK="$(python3 - <<PY
tg=float("$tg")
d=float("$DECODE_MS")
print(d/tg if tg>0 else 0.0)
PY
)"

⚠️ Potential issue | 🟡 Minor

Unquoted heredoc allows shell expansion into Python code — use a quoted delimiter for consistency and safety.

Unlike the kv_est_kb function (line 38) which correctly uses a quoted heredoc (<<'PY'), this block uses an unquoted <<PY. This causes $tg and $DECODE_MS to be interpolated directly into the Python source. While these values are expected to be numeric, using a quoted heredoc with environment variables (matching the pattern in kv_est_kb) is both safer and more consistent.

Proposed fix
   local DECODE_PER_TOK
-  DECODE_PER_TOK="$(python3 - <<PY
-tg=float("$tg")
-d=float("$DECODE_MS")
-print(d/tg if tg>0 else 0.0)
-PY
+  DECODE_PER_TOK="$(TG="$tg" DECODE_MS="$DECODE_MS" python3 - <<'PY'
+import os
+tg=float(os.environ["TG"])
+d=float(os.environ["DECODE_MS"])
+print(d/tg if tg>0 else 0.0)
+PY
 )"

Based on learnings: "Identify potential security issues (e.g., injection risks) and recommend using secure alternatives."

📝 Committable suggestion


Suggested change
DECODE_PER_TOK="$(python3 - <<PY
tg=float("$tg")
d=float("$DECODE_MS")
print(d/tg if tg>0 else 0.0)
PY
)"
DECODE_PER_TOK="$(TG="$tg" DECODE_MS="$DECODE_MS" python3 - <<'PY'
import os
tg=float(os.environ["TG"])
d=float(os.environ["DECODE_MS"])
print(d/tg if tg>0 else 0.0)
PY
)"
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 106 - 111, The heredoc used to compute
DECODE_PER_TOK injects shell-expanded variables into the Python snippet; change
the delimiter to a quoted heredoc (use <<'PY' instead of <<PY) so the Python
here-doc is not subject to shell expansion, keeping the computation for
DECODE_PER_TOK (the python3 - <<PY block) consistent with the kv_est_kb pattern
and preventing accidental interpolation of $tg and $DECODE_MS.

Comment on lines 121 to 124
for CL in $CLS; do
run_one "decode_heavy" "$CL" "$TG_DH"
run_one "prefill_ttft" "$CL" "$TG_TTFT"
done

⚠️ Potential issue | 🟠 Major

A single failed run aborts the entire sweep due to set -e.

Because run_one returns 1 on failure and set -e is active, a failure at any context length (e.g., OOM at 4096) will terminate the loop and skip all remaining configurations. For a benchmarking sweep, you typically want best-effort execution so partial results are still collected.

Proposed fix — continue on failure and report at the end
+FAILURES=0
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || ((FAILURES++))
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || ((FAILURES++))
 done
 
 echo
 echo "DONE -> $OUTCSV"
+if (( FAILURES > 0 )); then
+  echo "WARNING: $FAILURES run(s) failed — check logs in $OUTDIR"
+  exit 1
+fi
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 121 - 124, The loop over context lengths
(variable CLS) calls run_one for tasks "decode_heavy" and "prefill_ttft" but
because the script is running with set -e any non‑zero return from run_one will
abort the whole sweep; change the loop to run each run_one call in a way that
captures failures without exiting (e.g., append || true or capture the exit code
into a variable), record failures per CL (store context lengths or task names
that failed), and after the loop exit non‑zero only if you want to signal
overall failure while still allowing the loop to continue; update the block
where run_one is invoked for decode_heavy and prefill_ttft to use this non‑fatal
invocation and add a final summary report of failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Around line 293-298: The inline Chinese comment near the id==0 path should be
replaced with an English comment (e.g., explaining flops = 2*M*N*K and noting
llamafile_sgemm parameter ordering) for consistency; additionally, the batched
branch (where batch_count > 1) must call mm_shape_record before iterating the
batch so batched post-quantize sgemm calls are logged — call
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count *
(uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the start
of the else/batched block that uses batch_count, id, M, N, K).
🧹 Nitpick comments (1)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (1)

3-12: Duplicate <cstdint> include and unused <tuple>.

<cstdint> is included on both line 4 and line 12. <tuple> (line 8) does not appear to be used anywhere in this file.

Proposed fix
 #include <algorithm>
 #include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
-#include <cstdint>

Comment on lines 293 to 298
if (id == 0) {
// flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}

⚠️ Potential issue | 🟡 Minor

Non-English comment; also, shape recording is missing for the batched branch.

Line 294 has a Chinese comment — the rest of the codebase uses English. Please translate for consistency.

Additionally, the batch_count == 1 branch records shapes (lines 293-298), but the else batched branch (lines 306-330) does not record any shapes at all. Batched post-quantize sgemm calls will be invisible in the shape log.

Proposed fix
-        if (id == 0) {
-          // flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
-          mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
-                          /*calls=*/1,
-                          /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
-        }
+        if (id == 0) {
+          // flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent)
+          mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
+                          /*calls=*/1,
+                          /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
+        }

And in the batched branch (around line 312), add shape recording before the loop:

      // Record shape once for the entire batched operation
      mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
                      /*calls=*/(uint64_t)batch_count,
                      /*flops=*/2ull * (uint64_t)batch_count * (uint64_t)M * (uint64_t)N * (uint64_t)K);
📝 Committable suggestion


Suggested change
if (id == 0) {
// flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}
if (id == 0) {
// flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` around lines 293 - 298, The
inline Chinese comment near the id==0 path should be replaced with an English
comment (e.g., explaining flops = 2*M*N*K and noting llamafile_sgemm parameter
ordering) for consistency; additionally, the batched branch (where batch_count >
1) must call mm_shape_record before iterating the batch so batched post-quantize
sgemm calls are logged — call mm_shape_record(/*kind=*/0, (int)M, (int)N,
(int)K, /*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count
* (uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the
start of the else/batched block that uses batch_count, id, M, N, K).
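
As a rough illustration of what such shape logging aggregates, here is a Python stand-in for mm_shape_record (the names mirror the C++ helper but are hypothetical, not the actual API):

```python
from collections import defaultdict

# (M, N, K) -> [total_calls, total_flops]
shape_log = defaultdict(lambda: [0, 0])

def mm_shape_record(M, N, K, calls=1):
    entry = shape_log[(M, N, K)]
    entry[0] += calls
    entry[1] += 2 * calls * M * N * K  # flops = 2*M*N*K per call

# one single-call GEMM and one batched GEMV-like call with batch_count=8
mm_shape_record(32, 2048, 2048)
mm_shape_record(1, 2048, 2048, calls=8)

# most frequently hit shapes first: candidate bucket-padding targets
for shape, (calls, flops) in sorted(shape_log.items(),
                                    key=lambda kv: -kv[1][0]):
    print(shape, calls, flops)
```

Recording the batched branch with calls=batch_count (as the review suggests) keeps the call counts comparable to the single-call path.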

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 62-66: Replace the hardcoded personal path and undefined variables
in the perf repro snippet: remove or replace `cd ~/mllm-runok` with a relative
project-root reference (e.g., use an existing $ROOT or run from repo root), and
add/inline definitions for the referenced variables `$MODEL` and `$CFG` (reuse
the same variable names defined in the "Quick Repro" section or declare them
near `$BIN`) so the perf record command uses defined symbols (`$BIN`, `$MODEL`,
`$CFG`) and is portable and reproducible across environments.
🧹 Nitpick comments (1)
bench_artifacts/README.md (1)

8-8: Consider simpler wording.

"prior to execution" could be simplified to "before execution" for better readability.

✏️ Suggested simplification
-2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements prior to execution, establishing strict lower bounds for AOT memory pre-allocation.
+2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements before execution, establishing strict lower bounds for AOT memory pre-allocation.

Comment on lines 62 to 66
cd ~/mllm-runok
rm -f perf.data perf.data.old
perf record -F 99 -g -- \
"$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
-pp 32 -tg 256 -t 8 -cl 2048 -r 1 -cs 0

⚠️ Potential issue | 🟠 Major

Hardcoded path and undefined variables prevent reproduction.

This section has multiple portability issues:

  1. Line 62: cd ~/mllm-runok is a hardcoded personal path that other users cannot reproduce.
  2. Lines 65-66: Variables $MODEL and $CFG are referenced but never defined in this document (only $BIN was defined earlier in line 29).
  3. Inconsistent with the portable approach used in the "Quick Repro" section above.
📝 Proposed fix for portability
-cd ~/mllm-runok
+# Note: Define these variables first or use actual paths
+export MODEL=/path/to/models/tinyllama-fp32.mllm
+export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
 perf record -F 99 -g -- \

Alternatively, reference the variables defined in the Quick Repro section:

-cd ~/mllm-runok
+# Prerequisites: Ensure BIN, MODEL, and CFG are set (see Quick Repro section)
+# Example:
+# export BIN=./build/bin/mllm-llm-benchmark
+# export MODEL=/path/to/your/model.mllm
+# export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
🤖 Prompt for AI Agents
In `@bench_artifacts/README.md` around lines 62 - 66, Replace the hardcoded
personal path and undefined variables in the perf repro snippet: remove or
replace `cd ~/mllm-runok` with a relative project-root reference (e.g., use an
existing $ROOT or run from repo root), and add/inline definitions for the
referenced variables `$MODEL` and `$CFG` (reuse the same variable names defined
in the "Quick Repro" section or declare them near `$BIN`) so the perf record
command uses defined symbols (`$BIN`, `$MODEL`, `$CFG`) and is portable and
reproducible across environments.

@huangzhenhua111 huangzhenhua111 changed the title [Bench] AOT Profiling Toolchain: Shape Capture, KV-Memory & Context Sweep [Bench] x86 CPU Baseline Profiler & Context Sweep Tool Feb 15, 2026
@huangzhenhua111 huangzhenhua111 changed the title [Bench] x86 CPU Baseline Profiler & Context Sweep Tool Add CPU benchmark tool with context length sweep Feb 15, 2026
@huangzhenhua111 huangzhenhua111 force-pushed the pr-627 branch 2 times, most recently from 8d45d81 to 04013c8 Compare February 16, 2026 09:01
@huangzhenhua111
Author

Thank you for your PR.

  1. Your PR does not benchmark the NPU. This does not match the description in your PR's README.
  2. Please describe what Context Sweep is testing.
  3. We do not object to using AI programming, but this PR has very obvious AI traces and it is unclear what it is doing. Please check this PR.

Hi Senior Wang, thank you very much for taking the time to review, and sorry for the rough initial submission.

  1. I know this PR is CPU-only; it just prepares some data for NPU/AOT that I hope can help with your current work. I've updated the README and description to state my intention clearly.
  2. Added a description of what the context sweep tests: it measures prefill/decode latency and memory at context lengths 256–4096 to find CPU bottlenecks.
  3. I have cleaned up the unclear code, including AI traces, and updated my PR.
    Could you please take another look when you have time? If you have any further requests, please let me know.
    Wishing you a Happy New Year in advance!

@chenghuaWang
Collaborator

There is no convenient and unified benchmark to evaluate the performance of mllm with different context lengths.

The MLLM benchmark tools include settings for Prefill Length and Decode Length.

Besides, I learned that you are focusing on NPU/AOT, so I added the shape log and KV estimation functions. The shape log helps identify which matrix shapes are used most frequently and can serve as bucket-padding targets. KV memory size estimation helps determine how much memory should be reserved to avoid OOM.

An LLM program is deterministic: its shapes are known in advance, and the KV memory can be calculated beforehand, so I don't think there is a need to capture shapes. MLLM provides the Perfetto tool to view kernel flame graphs and analyze specific bottlenecks.
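As a concrete illustration of the point that KV memory can be calculated beforehand, here is a minimal sketch. It assumes the standard transformer cache layout (one K and one V tensor per layer) and TinyLlama-1.1B hyperparameters taken from its public config (22 layers, 4 KV heads, head_dim 64); the function is illustrative, not code from this PR.

```cpp
#include <cstdint>

// Theoretical KV-cache lower bound for a standard transformer:
// two tensors (K and V) per layer, each holding
// num_kv_heads * ctx_len * head_dim elements of dtype_size bytes.
int64_t kvCacheBytes(int64_t num_layers, int64_t num_kv_heads,
                     int64_t head_dim, int64_t ctx_len, int64_t dtype_size) {
  return 2 * num_layers * num_kv_heads * head_dim * ctx_len * dtype_size;
}
```

For TinyLlama-1.1B in fp32 at ctx_len = 2048, this gives 2 * 22 * 4 * 64 * 2048 * 4 bytes, about 88 MiB; it is only a lower bound because it excludes weights, activations, and allocator overhead.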

For your PR:

  1. Please do not directly submit images and CSV files. You can describe the performance of MLLM on TinyLLaMA in the README, including the test methodology, the machine used for testing (system, CPU model, number of threads used).

  2. Please do not modify the MLLM kernel code, as it is not related to the benchmark. I suggest you only submit code related to the benchmark tools.

@chenghuaWang chenghuaWang changed the title Add CPU benchmark tool with context length sweep feat(benchmark): Add CPU benchmark tool with context length sweep Feb 17, 2026
@huangzhenhua111
Author

Senior Wang, Happy New Year! Thank you for your review. I've revised the PR according to your requests. I hope you'll take another look when you have time. If there's anything else I need to change, please feel free to let me know.


#include "models/All.hpp"

#ifndef MLLM_GIT_COMMIT_HASH
Collaborator

MLLM_GIT_COMMIT_HASH is provided by the CMake build system. Do not modify it.

avg_prefill_speed /= 3.0f;
avg_decode_speed /= 3.0f;

// Print average results
Collaborator

Please keep the CLI output.


### Single run
```bash
./mllm-llm-benchmark \
Collaborator

The README is for developers, not a description of your PR. You should add your section at the top of the original README.md.

int32_t head_dim = 0; // hidden_size / num_attention_heads
};

/**
Collaborator

Do not delete all comments here!

@chenghuaWang
Collaborator

  1. Please use the build method from the mllm documentation: python task.py tasks/build_x86.py
  2. Please format your code with .clang-format before you submit.
  3. Follow the review and modify your code.

Added CPU benchmark tool for context length sweeps. The existing benchmark
tools had PP/TG settings but no way to automate testing across different
context lengths. So I wrote a bash script to run sweeps automatically.

Key changes:
- Add Llama.hpp for LLaMA/TinyLLaMA models
- Update BenchmarkTemplate to support KV cache estimation
- Add context sweep automation scripts with plotting
- Add comprehensive README documentation

Tested on WSL2 (Ryzen 7 6800H) with TinyLLaMA fp32
@huangzhenhua111
Author

Hi Senior Wang, thank you for your review. I have revised the PR according to your requirements; please take another look when you have time.

@@ -1,5 +1,11 @@
# MLLM LLM Benchmark Tool

## Why
Collaborator

There is no need for a Why section.

@chenghuaWang
Collaborator

  1. Please follow the review and modify your code.
  2. @jialilve, please check this PR again.

@huangzhenhua111
Author

Hi @jialilve, I have deleted my Why section. Could you please take a look when you have time?

Comment on lines 89 to 92
void clear() override {
// TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
// For now, keep it as no-op to minimize API changes in PR1.
}
Contributor

@huangzhenhua111

  1. Please see CodeRabbit's review.
  2. In your sweep_context_v2.sh, does 'cl' mean context length (including prompt length and test generation length)? In the original code, 'cl' means cache length, which may cause some misunderstanding.
  3. I don't see any information about 'cache length' in your code. Why?

https://github.com/UbiquitousLearning/mllm/pull/639/changes#r2807444788

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.


@huangzhenhua111
Author

Hi @jialilve, thank you for your review:

  1. StaticCache::clearCache() is private in LlamaForCausalLM. Chenghua suggested not modifying the core library in this PR, so I added a safeguard that forces --runs=1 with a warning if the user requests more. The script already defaults to 1 anyway. I wanted to imitate your clear method in Qwen, but unlike Qwen3ForCausalLM, LlamaForCausalLM does not provide a public kvCache() method to obtain a reference to the KV cache. If this is not enough, @chenghuaWang, may I add a kvCache() getter to LlamaForCausalLM to match Qwen3's API? That would let me implement a proper KV-cache reset in the benchmark.
  2. The script used cl for context length, but the binary's -cl flag means cache length. I renamed it to ctx_len in the scripts to avoid confusion.
  3. Cache length is now printed in the console output and the CSV.
    Could you please take another look when you have time? If you have any further requests, please let me know.

- rename cl->ctx_len in scripts (was confusing with -cl flag)
- print cache_length in output for clarity
- force runs=1 for LLaMA only (clear() is no-op, Qwen3 unaffected)
- update README with CTX_LENS and RUNS limitation note

per @jialilve feedback
@chenghuaWang
Collaborator

can I add a kvCache() getter to LlamaForCausalLM to match Qwen3's API?

Please add one.
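A sketch of what such a getter could look like; the class shapes and member names here are simplified stand-ins mirroring the Qwen3 pattern discussed above, not the real mllm API.

```cpp
#include <cassert>

// Simplified stand-in for the real StaticCache (member names are assumed).
class StaticCache {
 public:
  void clearCache() { cached_tokens_ = 0; }    // reset between benchmark runs
  void append(int n) { cached_tokens_ += n; }  // simulate cache growth
  int cachedTokens() const { return cached_tokens_; }
 private:
  int cached_tokens_ = 0;
};

class LlamaForCausalLM {
 public:
  // Proposed public getter, matching the Qwen3ForCausalLM pattern:
  // exposes the cache so a benchmark can call kvCache().clearCache().
  StaticCache& kvCache() { return kv_cache_; }
 private:
  StaticCache kv_cache_;
};
```

With such a getter, the benchmark's clear() override could call model.kvCache().clearCache() instead of being a no-op, and the --runs=1 restriction could be lifted.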
