
feat(benchmark): Add CPU benchmark tool with context length sweep #639

Open

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main from huangzhenhua111:pr-627

Conversation

@huangzhenhua111 commented Feb 14, 2026

Added a CPU benchmark tool for context length sweeps. The existing benchmark tools had PP/TG settings but no way to automate testing across different context lengths, so I wrote a bash script that sweeps from 256 to 4096 tokens and logs everything to CSV.

Tested on my machine (WSL2, Ryzen 7 6800H) with TinyLLaMA fp32. The script runs two modes per context length: one measures TTFT, the other measures decode throughput. Also added KV cache size estimation.

Usage is in the README. Basically just export BIN/MODEL/CFG and run the sweep script, or use the benchmark binary directly for single runs.

Code structure:

  • Added Llama.hpp for LLaMA/TinyLLaMA models
  • Updated BenchmarkTemplate to support KV estimation
  • Scripts for sweeping + plotting results
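The KV cache size estimation mentioned above follows the usual LLaMA-style formula: two tensors (K and V) per layer, each of size num_kv_heads × head_dim × seq_len × bytes per element. A minimal sketch of that arithmetic in Python — the TinyLlama-like config values below are illustrative assumptions, not values read from this PR:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Estimate KV cache size: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative values loosely based on TinyLlama-1.1B (22 layers, 4 KV heads,
# head_dim 64); fp32 means 4 bytes per element. These are assumptions for the
# example, not the PR's actual config.
est = kv_cache_bytes(num_layers=22, num_kv_heads=4, head_dim=64,
                     seq_len=2048, bytes_per_elem=4)
print(f"{est / (1024 ** 2):.1f} MiB")  # 88.0 MiB at a 2048-token context
```

The estimate grows linearly with context length, which is exactly what the sweep is designed to expose.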

coderabbitai bot commented Feb 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds benchmark artifact ignores and docs, context‑sweep orchestration and plotting tools, offline GEMM/GEMV shape logging in matmul, an x86 SIMD include, extended benchmark CLI with CSV/multi‑run/cooldown/KV estimation, and a new LLaMA benchmark implementation.

Changes

  • Repo ignores (.gitignore): Append ignore patterns for local benchmark artifacts (perf.data, perf.data.old, bench_context/, snapshots/, *.bak, *.bak.*, bench_mem_context.csv).
  • Bench artifacts & snapshot docs (bench_artifacts/README.md, bench_artifacts/snapshot.md): Add documentation and a profiling snapshot describing the context-sweep workflow, outputs (data/, plots/, snapshot.md), plotting steps, and profiling findings for TinyLlama on x86.
  • Matmul shape logging (mllm/backends/cpu/kernels/common/ggml/matmul.cpp): Add anonymous-namespace shape logging (MMShapeKey, MMAgg, mm_shape_init, mm_shape_record) to aggregate GEMM/GEMV shapes and FLOPs with thread safety and atexit reporting.
  • SIMD header (mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp): Include <immintrin.h> to enable x86 SIMD intrinsics.
  • Context sweep & snapshot tooling (sweep_context_v2.sh, scripts/make_snapshot_nopandas.py): Add a Bash orchestrator for context sweeps (run launcher, log parsing, CSV aggregation, KV memory estimator) and a Python script to normalize CSVs and emit summary CSVs and PNG plots.
  • Benchmark CLI & registry (tools/mllm-llm-benchmark/main.cpp, tools/mllm-llm-benchmark/models/All.hpp): Enhance the CLI with multi-run, cooldown, CSV output, schema_version, kv_dtype_bytes, and pp/tg pair support; make createBenchmark inline and add LLaMA/TinyLlama model-name detection.
  • Benchmark interfaces & LLaMA model (tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp, tools/mllm-llm-benchmark/models/Llama.hpp): Add KVCacheEstimateInfo and a virtual kvEstimateInfo(); implement Llama_Benchmark with init, warmup, run, printModelInfo, clear, and KV-cache estimation.

Sequence Diagram(s)

sequenceDiagram
  participant Sweep as "sweep_context_v2.sh"
  participant Bench as "mllm-llm-benchmark (binary)"
  participant Model as "Llama_Benchmark"
  participant FS as "Filesystem / Logs / CSV"
  participant Plot as "make_snapshot_nopandas.py"

  Sweep->>Bench: launch run (mode, pp, tg, threads)
  Bench->>Model: init()/warmup()/run() (prefill + decode)
  Model-->>Bench: timing metrics, KV estimate
  Bench-->>FS: append CSV row, write per-run logs, perf data
  Sweep->>FS: collect logs, extract metrics
  Sweep->>Plot: feed CSV
  Plot-->>FS: write summary CSV and PNGs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • yirongjie
  • chenghuaWang
  • oreomaker

Poem

"🐰 I hopped through logs and tiny‑llama trails,
I counted GEMMs and plotted perf scales,
CSV in paw, snapshots in sight,
I logged each run from morning to night,
carrots, charts, and kernels take flight!"

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 21.05%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Merge Conflict Detection — ⚠️ Warning: ❌ Merge conflicts detected (66 files):

⚔️ .gitignore (content)
⚔️ README-ZH.md (content)
⚔️ README.md (content)
⚔️ docs/qnn_backend/index.rst (content)
⚔️ docs/qnn_backend/setup_env.rst (content)
⚔️ examples/llama_qnn_aot/compile.cpp (content)
⚔️ examples/llama_qnn_aot/compile_sha.cpp (content)
⚔️ examples/llama_qnn_aot/modeling_llama_qnn_aot.hpp (content)
⚔️ examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp (content)
⚔️ examples/qwen2_qnn_aot/compile.cpp (content)
⚔️ examples/qwen2_qnn_aot/compile_sha.cpp (content)
⚔️ examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot.hpp (content)
⚔️ examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp (content)
⚔️ examples/qwen3_qnn_aot/aot_run.cpp (content)
⚔️ examples/qwen3_qnn_aot/compile.cpp (content)
⚔️ examples/qwen3_qnn_aot/compile_sha.cpp (content)
⚔️ examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (content)
⚔️ examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp (content)
⚔️ mllm/CMakeLists.txt (content)
⚔️ mllm/backends/ascend/AscendBackend.cpp (content)
⚔️ mllm/backends/ascend/AscendCommon.cpp (content)
⚔️ mllm/backends/ascend/AscendCommon.hpp (content)
⚔️ mllm/backends/ascend/ops/AscendElewiseOps.cpp (content)
⚔️ mllm/backends/ascend/ops/AscendElewiseOps.hpp (content)
⚔️ mllm/backends/cpu/kernels/common/ggml/matmul.cpp (content)
⚔️ mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp (content)
⚔️ mllm/backends/qnn/QNNAllocator.cpp (content)
⚔️ mllm/backends/qnn/QNNAllocator.hpp (content)
⚔️ mllm/backends/qnn/QNNBackend.cpp (content)
⚔️ mllm/backends/qnn/QNNBackend.hpp (content)
⚔️ mllm/backends/qnn/QNNModel.cpp (content)
⚔️ mllm/backends/qnn/QNNModel.hpp (content)
⚔️ mllm/backends/qnn/QNNUtils.cpp (content)
⚔️ mllm/backends/qnn/QNNUtils.hpp (content)
⚔️ mllm/backends/qnn/aot/QnnWrappersAPI.cpp (content)
⚔️ mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (content)
⚔️ mllm/backends/qnn/aot/visitor/CastType.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/KVCacheManager.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/KVCacheManager.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/PromptProcessor.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/PromptProcessor.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (content)
⚔️ mllm/backends/qnn/aot_rt/TokenGenerator.cpp (content)
⚔️ mllm/backends/qnn/aot_rt/TokenGenerator.hpp (content)
⚔️ mllm/backends/qnn/passes/QNNGraphBuildPass.cpp (content)
⚔️ mllm/compile/ir/Node.cpp (content)
⚔️ mllm/compile/ir/Node.hpp (content)
⚔️ mllm/core/DataTypes.cpp (content)
⚔️ mllm/core/DataTypes.hpp (content)
⚔️ mllm/core/SlicePrimitives.hpp (content)
⚔️ mllm/core/aops/CmpOp.cpp (content)
⚔️ mllm/engine/Context.cpp (content)
⚔️ mllm/engine/Context.hpp (content)
⚔️ mllm/mllm.cpp (content)
⚔️ mllm/models/minicpm_o2_6/modeling_resampler.hpp (content)
⚔️ mllm/models/minicpm_o2_6/streaming_generation.cpp (content)
⚔️ mllm/nn/Functional.cpp (content)
⚔️ mllm/nn/Functional.hpp (content)
⚔️ pymllm/backends/qualcomm/transformers/core/qdq.py (content)
⚔️ tasks/build_arm_ascend.yaml (content)
⚔️ tests/ascend/AscendKernelTest.hpp (content)
⚔️ tests/ascend/KernelTest.cpp (content)
⚔️ tools/mllm-llm-benchmark/main.cpp (content)
⚔️ tools/mllm-llm-benchmark/models/All.hpp (content)
⚔️ tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
  • Description check — ❓ Inconclusive: The PR description covers the key contributions (shape capture, KV estimation, context sweep script) with usage details, but lacks the structured sections the template calls for. Resolution: restructure the description using the template sections (Problem Statement, Solution, Changes Made, Testing, Usage) and add specifics on the AOT profiling toolchain objectives and validation results.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title 'feat(benchmark): Add CPU benchmark tool with context length sweep' clearly and specifically summarizes the main change: introducing a new CPU benchmark tool with context-length sweep functionality.




@coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mllm-llm-benchmark/main.cpp (1)

175-222: ⚠️ Potential issue | 🟡 Minor

Average results are computed but never printed to console — only written to CSV.

When --output_csv is not provided, the averages (avg_ttft, avg_prefill_speed, avg_decode_speed, latencies, KV estimates) are computed and then discarded. Users running without CSV output never see the summary. Consider printing a console summary regardless of CSV output.

Proposed addition after line 192
     float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f;
 
+    mllm::print("\n  Average Results (", R, " runs):");
+    mllm::print("    Avg TTFT          :", avg_ttft, "ms");
+    mllm::print("    Avg Prefill Speed :", avg_prefill_speed, "tokens/s");
+    mllm::print("    Avg Decode Speed  :", avg_decode_speed, "tokens/s");
+    mllm::print("    Avg Prefill Lat   :", avg_prefill_ms, "ms");
+    mllm::print("    Avg Decode Lat    :", avg_decode_ms_per_tok, "ms/tok");
+
     // Rough KV cache estimate (bytes)
🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 22-30: The README contains hardcoded personal paths (e.g., BIN,
MODEL, CFG variables pointing to ~/mllm-runok and /home/huangzhenhua/...) which
are not portable; update the example to use neutral placeholders and environment
variables (e.g., $PROJECT_ROOT or $MODEL_DIR and relative paths) for BIN, MODEL,
CFG, and any user-specific directories so other developers can reproduce the
commands without leaking usernames or personal directories.

In `@bench_artifacts/snapshot.md`:
- Around line 12-24: The markdown images are missing alt text causing
accessibility lint MD045; update each image tag to include a short descriptive
alt string that conveys the content (e.g., "Decode latency vs context plot",
"Decode per-token latency vs context plot", "TTFT and prefill vs context plot",
"Peak RSS memory growth plot", "KV estimate growth (MB) plot") — specifically
edit the five image references referencing
plots/context_sweep_v2.decode_heavy.decode_ms.png,
plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png,
plots/context_sweep_v2.prefill_ttft.ttft_ms.png,
plots/context_sweep_v2.memory.peak_rss_gb.png, and
plots/context_sweep_v2.memory.kv_est_mb.png to replace ![](...) with
![descriptive alt text](...) for each.

In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Line 137: The declaration int64_t batch_count_mm = 1; in function mat_mul is
mis-indented (starts at column 0); fix by indenting it to match the surrounding
function body style (align with other local variable declarations inside
mat_mul) so batch_count_mm sits at the same indent level as nearby locals and
code blocks.
- Around line 136-141: Remove the unconditional top-level mm_shape_record(0,
...) in mat_mul (the initial mm_shape_record that uses batch_count_mm) and
instead record shapes only inside the actual compute branches: add
mm_shape_record(0, ...) in the first llamafile sgemm path when batch_count == 1
(inside check_llamafile_sgemm success branch) and add a batched
mm_shape_record(0, ...) in the llamafile batched path (the branch that handles
batch_count > 1), mirroring the existing post-quantize placement; ensure you do
not emit the kind=0 record when later branches will emit kind=1 or kind=2
records to avoid double-counting.

In `@mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp`:
- Line 10: Wrap the unconditional `#include <immintrin.h>` with the same x86 guard
used elsewhere (e.g., `#if defined(__SSE__) || defined(__x86_64__)`), so the
header is only included on x86 builds; locate the include line at the top of
fwd_bshd.hpp and surround it with the existing platform macro guard used for
intrinsics in this file to maintain cross-platform portability.

In `@scripts/make_snapshot_nopandas.py`:
- Around line 8-14: The helpers use bare except which swallows
KeyboardInterrupt/SystemExit and to_int is unused; change both functions to
catch only conversion-related exceptions (e.g., ValueError, TypeError) in
to_float and to_int, and either remove the unused to_int function or use it
where intended (search for to_int and to_float to determine intent) — do not use
a bare except and ensure interrupted signals are allowed to propagate.
- Around line 117-122: The printed "Wrote:" list omits two PNGs—prefill and
decode per-ms plots—so update the print block that uses summary_path, out_dir
and stamp to include the missing file paths
f"{stamp}.prefill_ttft.prefill_ms.png" and f"{stamp}.decode_heavy.decode_ms.png"
alongside the existing entries (the block that currently prints
f"{stamp}.prefill_ttft.ttft_ms.png" and
f"{stamp}.decode_heavy.decode_ms_per_tok.png"); ensure you print
os.path.join(out_dir, ...) for those two filenames so all six generated PNGs are
listed.

In `@sweep_context_v2.sh`:
- Around line 16-18: The default MODEL value contains a hardcoded user-specific
path; update the MODEL variable assignment in sweep_context_v2.sh
(MODEL="${MODEL:-...}") to use $HOME (or another portable relative/config-driven
location) instead of "/home/huangzhenhua", mirroring how BIN and CFG are defined
so the script works for other users/CI environments.
- Line 128: The CSV echo currently hardcodes the model name "tiny_llama"; change
it to use a variable so the correct model is recorded by replacing the hardcoded
string in the echo that writes to "$OUTCSV" with a dynamic value derived from
either a new environment variable MODEL_NAME or from parsing the existing $MODEL
path (e.g., basename or strip extensions) before the echo; update the script to
set MODEL_NAME if unset by extracting it from $MODEL and reference MODEL_NAME in
the echo line so other models (Qwen, Llama-3, etc.) are correctly labeled.

In `@tools/mllm-llm-benchmark/main.cpp`:
- Around line 197-202: Validate the KV dtype byte option (kv_dtype_bytes) right
after it's parsed and convert it into a validated variable (e.g., kv_bpe) that
only allows 1, 2, or 4; if the value is outside that set, log an error and exit
(or clamp/fallback to a safe default) to prevent zero/negative values. Replace
direct uses of kv_dtype_bytes.get() in the KV estimate calculations (the
kv_est_bytes_pp / kv_est_bytes_final assignments) with the validated kv_bpe
variable so bytes_per is always one of 1, 2, or 4.

In `@tools/mllm-llm-benchmark/models/Llama.hpp`:
- Around line 89-92: The clear() override in Llama.hpp is a no-op so the KV
cache persists across runs (LlamaForCausalLM), leaking state between
benchmark->clear() calls from main.cpp and corrupting multi-run / multi-pair
results; to fix, either implement a real KV reset when available (call the
model's KV reset API from clear() once LlamaForCausalLM exposes it) or, as an
immediate mitigation, (1) modify clear() to emit a clear-cache warning into the
benchmark output/CSV and (2) add an argument validation in main.cpp that
enforces --runs == 1 (or refuses >1) when using this Llama implementation, so
benchmark->clear() cannot be relied upon until a proper reset API is added.
- Around line 136-140: The decode throughput calculation in Llama.hpp computes
decode_tokens as (token_count > 0 ? token_count - 1 : 0) and uses that for
r.decode_speed, but Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making
benchmarks inconsistent; update the Qwen3 implementation (in run()) to match
Llama's convention by excluding the first token from decode throughput (use
token_count - 1 with the same guards: ensure decode_tokens is non-negative and
only compute r.decode_speed when decode_us and decode_tokens are > 0) so both
benchmarks use the same definition of decode throughput.
- Around line 17-24: kvEstimateInfo() currently divides cfg_->hidden_size by
cfg_->num_attention_heads without a guard; mirror the defensive behavior used in
printModelInfo(): check cfg_ and that cfg_->num_attention_heads is non‑zero
before computing head_dim. If cfg_->num_attention_heads == 0 (or cfg_ is null)
return std::nullopt (or otherwise avoid the division) so
KVCacheEstimateInfo.head_dim is never computed with a zero divisor; update the
kvEstimateInfo() function to perform this check using the existing cfg_ and
KVCacheEstimateInfo symbols.
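As a reference for the decode-throughput convention requested above (exclude the first generated token, guard against zero counts), here is a minimal Python sketch; the function and parameter names are illustrative, not the benchmark's actual API:

```python
def decode_speed_tokens_per_s(token_count: int, decode_us: float) -> float:
    """Decode throughput excluding the first generated token.

    The first token's latency is attributed to prefill (TTFT), so it is
    not counted toward decode throughput. Returns 0.0 when there is
    nothing to measure rather than dividing by zero.
    """
    decode_tokens = token_count - 1 if token_count > 0 else 0
    if decode_us <= 0 or decode_tokens <= 0:
        return 0.0
    return decode_tokens / (decode_us / 1_000_000)
```

Applying the same definition in both the Llama and Qwen3 implementations keeps cross-model numbers comparable.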
🧹 Nitpick comments (11)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (4)

3-12: Duplicate and unused includes.

<cstdint> is included twice (lines 4 and 12), and <tuple> (line 8) is not used anywhere in this file.

Proposed fix
 #include <algorithm>
-#include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
 #include <cstdint>

74-98: atexit handler accesses file-scope statics — safe only if no detached threads outlive main().

The atexit callback locks g_mm_shape_mu and iterates g_mm_shape. This works because these statics are constructed before the atexit registration (so they are destroyed after the handler runs). However, if any detached thread is still calling mm_shape_record when the process exits, the mutex lock in the handler could race or deadlock. Since this is a profiling/debug feature gated behind an env var, the risk is low — just worth noting.


101-109: Mutex acquisition on every matmul call when shape logging is enabled.

When MLLM_MATMUL_SHAPE_LOG is active, every GEMM invocation contends on g_mm_shape_mu. Since this tool measures latency and throughput, the synchronization overhead could skew the benchmarks it is designed to support. A thread-local accumulation flushed at the end would avoid contention on the hot path.


52-61: Non-English comments reduce accessibility for contributors.

Several comments in this file are in Chinese (lines 52, 60–61, 288, 331, 341). The rest of the codebase uses English. Please translate these for consistency.

tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)

72-72: Consider adding [[nodiscard]] to kvEstimateInfo().

The return value should always be checked by callers. This aligns with the static analysis hint (modernize-use-nodiscard).

Proposed fix
-  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+  [[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
tools/mllm-llm-benchmark/models/All.hpp (1)

31-35: Redundant conditions: "tinyllama" and "tiny_llama" already contain "llama".

The first condition find("llama") != npos already matches any string containing "tinyllama" or "tiny_llama", making the two extra checks dead code. Not a bug, but could be simplified.

Simplified condition
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
tools/mllm-llm-benchmark/main.cpp (1)

22-28: Duplicate stringification macros: STR/STR_HELPER are identical to STRINGIFY/STRINGIFY_INTERNAL (lines 17-18) and appear unused.

STR and STR_HELPER are never referenced in the file. They duplicate STRINGIFY/STRINGIFY_INTERNAL already defined above.

Remove unused macros
-#ifndef MLLM_GIT_COMMIT_HASH
-#define MLLM_GIT_COMMIT_HASH unknown
-#endif
-
-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
-
+#ifndef MLLM_GIT_COMMIT_HASH
+#define MLLM_GIT_COMMIT_HASH unknown
+#endif
scripts/make_snapshot_nopandas.py (1)

1-1: math is imported but never used.

Proposed fix
-import sys, os, csv, math
+import sys, os, csv
sweep_context_v2.sh (3)

64-69: Unreachable guard on line 69.

If cl <= tg, the function returns on line 66. Otherwise pp = cl - tg ≥ 1, so the pp < 1 check on line 69 is dead code. Harmless, but can be removed for clarity.


111-113: rg (ripgrep) is a non-standard dependency.

The metric parsing relies on rg which isn't available by default on most systems. Consider either documenting this prerequisite in the usage comments or falling back to grep -oP (available on most GNU/Linux).

Alternative using grep -oP
-  TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  PREFILL_MS="$(rg -o 'Prefill Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  DECODE_MS="$(rg -o 'Decode Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+  TTFT_MS="$(grep -oP 'TTFT\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  PREFILL_MS="$(grep -oP 'Prefill Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  DECODE_MS="$(grep -oP 'Decode Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"

131-135: A single failed run aborts the entire sweep due to set -e.

Since run_one returns 1 on failure and set -e is active, any failed benchmark run (e.g., OOM at a large context length) will terminate the entire sweep. If the intent is to collect as much data as possible, consider tolerating individual failures:

Proposed fix
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || echo "WARN: decode_heavy cl=$CL failed, continuing..."
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || echo "WARN: prefill_ttft cl=$CL failed, continuing..."
 done

Comment on lines 12 to 24
![](plots/context_sweep_v2.decode_heavy.decode_ms.png)

### **Fig.2: Decode per-token latency vs context (ms/tok)**
![](plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png)

### **Fig.3: TTFT/Prefill vs context (ms)**
![](plots/context_sweep_v2.prefill_ttft.ttft_ms.png)

### **Fig.4: Memory growth (Peak RSS, GB)**
![](plots/context_sweep_v2.memory.peak_rss_gb.png)

### **Fig.5: KV estimate growth (MB, formula-based)**
![](plots/context_sweep_v2.memory.kv_est_mb.png)

⚠️ Potential issue | 🟡 Minor

Images are missing alt text (accessibility).

All five image references use ![]() with empty alt text. Markdownlint MD045 flags this. Adding descriptive alt text improves accessibility and provides context when images fail to load.

Example fix for one image
-![](plots/context_sweep_v2.decode_heavy.decode_ms.png)
+![Decode latency vs context length](plots/context_sweep_v2.decode_heavy.decode_ms.png)

Apply similarly to all five image references.

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 12-12: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 15-15: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 18-18: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 21-21: Images should have alternate text (alt text)

(MD045, no-alt-text)


[warning] 24-24: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` around lines 12 - 24, The markdown images are
missing alt text causing accessibility lint MD045; update each image tag to
include a short descriptive alt string that conveys the content (e.g., "Decode
latency vs context plot", "Decode per-token latency vs context plot", "TTFT and
prefill vs context plot", "Peak RSS memory growth plot", "KV estimate growth
(MB) plot") — specifically edit the five image references referencing
plots/context_sweep_v2.decode_heavy.decode_ms.png,
plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png,
plots/context_sweep_v2.prefill_ttft.ttft_ms.png,
plots/context_sweep_v2.memory.peak_rss_gb.png, and
plots/context_sweep_v2.memory.kv_est_mb.png to replace ![](...) with
![descriptive alt text](...) for each.

Comment on lines 136 to 141
// NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
int64_t batch_count_mm = 1;
for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
mm_shape_record(0, M, N, K,
(uint64_t)batch_count_mm,
(uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));

⚠️ Potential issue | 🟠 Major

Double-counting of GEMM shapes when the first llamafile path fails.

This mm_shape_record(kind=0, ...) fires unconditionally at the top of mat_mul. If the first check_llamafile_sgemm (line 170) fails and execution falls through to the post-quantize llamafile path (line 284), a second mm_shape_record(kind=0, ...) is emitted at line 289. Similarly, if execution reaches the gemv/gemm path (line 330), the top-level kind=0 record is redundant alongside the kind=1/kind=2 records.

This inflates FLOPs totals in the shape log. The top-level record should be removed and shape recording should happen only in the path that actually executes the compute.

Proposed fix — remove unconditional top-level recording
-  // NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
-int64_t batch_count_mm = 1;
-  for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
-  mm_shape_record(0, M, N, K,
-                  (uint64_t)batch_count_mm,
-                  (uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));
-
-
   auto src0_dtype = src0_.dtype();

Then add a shape record inside the first llamafile batch_count == 1 path (around line 176) and the batched path (around line 190), mirroring what is already done for the post-quantize path at line 289.

🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` around lines 136 - 141,
Remove the unconditional top-level mm_shape_record(0, ...) in mat_mul (the
initial mm_shape_record that uses batch_count_mm) and instead record shapes only
inside the actual compute branches: add mm_shape_record(0, ...) in the first
llamafile sgemm path when batch_count == 1 (inside check_llamafile_sgemm success
branch) and add a batched mm_shape_record(0, ...) in the llamafile batched path
(the branch that handles batch_count > 1), mirroring the existing post-quantize
placement; ensure you do not emit the kind=0 record when later branches will
emit kind=1 or kind=2 records to avoid double-counting.

const int N = transpose1 ? src1_shape[src1_shape.size() - 2] : src1_shape[src1_shape.size() - 1];

// NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
int64_t batch_count_mm = 1;

⚠️ Potential issue | 🟡 Minor

Indentation is broken — int64_t batch_count_mm starts at column 0.

This line is inside the mat_mul function body but has no indentation, breaking the file's consistent style.

-int64_t batch_count_mm = 1;
+  int64_t batch_count_mm = 1;
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` at line 137, The
declaration int64_t batch_count_mm = 1; in function mat_mul is mis-indented
(starts at column 0); fix by indenting it to match the surrounding function body
style (align with other local variable declarations inside mat_mul) so
batch_count_mm sits at the same indent level as nearby locals and code blocks.

#include "mllm/core/Parallel.hpp"
#include "mllm/backends/cpu/kernels/common/paged_attn/arch.hpp"
#include "mllm/utils/UnsafeMacros.hpp"
#include <immintrin.h>

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the file to verify the include and its context
cat -n mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 7335


🏁 Script executed:

# Search for other x86-specific includes in the CPU backend
rg -n '#include\s*<(immintrin|xmmintrin|emmintrin|pmmintrin)\.h>' mllm/backends/cpu/ -A2 -B2

Repository: UbiquitousLearning/mllm

Length of output: 2384


🏁 Script executed:

# Check if immintrin.h is used elsewhere and how it's handled
rg -l 'immintrin\.h' mllm/ --type cpp --type hpp

Repository: UbiquitousLearning/mllm

Length of output: 94


Guard the x86-specific include for cross-platform portability.

The unconditional include of <immintrin.h> will fail on non-x86 platforms (ARM, RISC-V, etc.) where this header does not exist. Since all intrinsics usage is properly guarded by #if defined(__SSE__) || defined(__x86_64__) throughout the file, and every other x86 intrinsic include in the codebase follows the same pattern, this include should be similarly guarded to maintain consistency and portability.

Proposed fix
+#if defined(__SSE__) || defined(__x86_64__)
 #include <immintrin.h>
+#endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include <immintrin.h>
+#if defined(__SSE__) || defined(__x86_64__)
+#include <immintrin.h>
+#endif
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp` at line 10, wrap
the unconditional `#include <immintrin.h>` with the same x86 guard used elsewhere
(e.g., `#if defined(__SSE__) || defined(__x86_64__)`), so the header is only
included on x86 builds; locate the include line at the top of fwd_bshd.hpp and
surround it with the existing platform macro guard used for intrinsics in this
file to maintain cross-platform portability.


echo "TTFT=$TTFT_MS ms Prefill=$PREFILL_MS ms Decode=$DECODE_MS ms Decode/tok=$DECODE_PER_TOK ms peakRSS=$PEAK_RSS_KB KB KV_est=$KV_EST_KB KB"

echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,$TTFT_MS,$PREFILL_MS,$DECODE_MS,$DECODE_PER_TOK,$PEAK_RSS_KB,$KV_EST_KB" >> "$OUTCSV"

⚠️ Potential issue | 🟡 Minor

Model name is hardcoded as tiny_llama in CSV output.

If this script is reused for other models (Qwen, Llama-3, etc.), the CSV will contain an incorrect model identifier. Consider deriving it from the $MODEL path or introducing a MODEL_NAME environment variable.

Proposed fix
+MODEL_NAME="${MODEL_NAME:-$(basename "$MODEL" | sed 's/\.[^.]*$//')}"
+
 # (in the CSV echo on line 128)
-  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
+  echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` at line 128, The CSV echo currently hardcodes the model
name "tiny_llama"; change it to use a variable so the correct model is recorded
by replacing the hardcoded string in the echo that writes to "$OUTCSV" with a
dynamic value derived from either a new environment variable MODEL_NAME or from
parsing the existing $MODEL path (e.g., basename or strip extensions) before the
echo; update the script to set MODEL_NAME if unset by extracting it from $MODEL
and reference MODEL_NAME in the echo line so other models (Qwen, Llama-3, etc.)
are correctly labeled.

Comment on lines +197 to +202
if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
const int32_t bytes_per = kv_dtype_bytes.get(); // 1/2/4
// LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
}

⚠️ Potential issue | 🟡 Minor

No validation on kv_dtype_bytes — accepts arbitrary values including 0 and negatives.

The help text states valid values are 1, 2, or 4, but any integer is accepted. A value of 0 would silently produce zero estimates; a negative value would produce negative estimates.

Proposed validation (e.g., after line 89)
+  int kv_bpe = kv_dtype_bytes.get();
+  if (kv_bpe != 1 && kv_bpe != 2 && kv_bpe != 4) {
+    mllm::print("[WARN] --kv_dtype_bytes should be 1, 2, or 4; got:", kv_bpe, "— defaulting to 4");
+    kv_bpe = 4;
+  }

Then use kv_bpe instead of kv_dtype_bytes.get() on line 198.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` around lines 197 - 202, Validate the KV
dtype byte option (kv_dtype_bytes) right after it's parsed and convert it into a
validated variable (e.g., kv_bpe) that only allows 1, 2, or 4; if the value is
outside that set, log an error and exit (or clamp/fallback to a safe default) to
prevent zero/negative values. Replace direct uses of kv_dtype_bytes.get() in the
KV estimate calculations (the kv_est_bytes_pp / kv_est_bytes_final assignments)
with the validated kv_bpe variable so bytes_per is always one of 1, 2, or 4.
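
The LLaMA-style formula in the snippet above is easy to sanity-check offline. A minimal sketch, assuming hypothetical TinyLlama-like config values (22 layers, 4 KV heads, head_dim 64; these numbers are illustrative, not read from the PR):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per):
    """LLaMA-like KV estimate: K and V each hold
    num_layers * num_kv_heads * head_dim * seq_len elements."""
    if bytes_per not in (1, 2, 4):  # the validation the review asks for
        raise ValueError("bytes_per must be 1, 2, or 4")
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per

# fp32 KV cache for a 2048-token context with the assumed config:
print(kv_cache_bytes(22, 4, 64, 2048, 4) // 1024, "KB")  # 90112 KB
```

With validation in place, a bytes_per of 0 or a negative value raises instead of silently producing a zero or negative estimate.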

Comment on lines 17 to 24
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
return info;
}

⚠️ Potential issue | 🟡 Minor

Missing division-by-zero guard in kvEstimateInfo() unlike printModelInfo().

Line 22 computes hidden_size / num_attention_heads without guarding against zero, whereas printModelInfo() (line 58) has a defensive check. For consistency and robustness:

Proposed fix
   KVCacheEstimateInfo info;
   info.num_layers = cfg_->num_hidden_layers;
   info.num_kv_heads = cfg_->num_key_value_heads;
-  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
+  info.head_dim = (cfg_->num_attention_heads > 0)
+                    ? (cfg_->hidden_size / cfg_->num_attention_heads)
+                    : 0;
   return info;
📝 Committable suggestion


Suggested change
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
return info;
}
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
if (!cfg_) return std::nullopt;
KVCacheEstimateInfo info;
info.num_layers = cfg_->num_hidden_layers;
info.num_kv_heads = cfg_->num_key_value_heads;
info.head_dim = (cfg_->num_attention_heads > 0)
? (cfg_->hidden_size / cfg_->num_attention_heads)
: 0;
return info;
}
🧰 Tools
🪛 Clang (14.0.6)

[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 17 - 24,
kvEstimateInfo() currently divides cfg_->hidden_size by
cfg_->num_attention_heads without a guard; mirror the defensive behavior used in
printModelInfo(): check cfg_ and that cfg_->num_attention_heads is non‑zero
before computing head_dim. If cfg_->num_attention_heads == 0 (or cfg_ is null)
return std::nullopt (or otherwise avoid the division) so
KVCacheEstimateInfo.head_dim is never computed with a zero divisor; update the
kvEstimateInfo() function to perform this check using the existing cfg_ and
KVCacheEstimateInfo symbols.

Comment on lines 89 to 92
void clear() override {
// TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
// For now, keep it as no-op to minimize API changes in PR1.
}
@coderabbitai coderabbitai bot Feb 14, 2026

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 89 - 92, The clear()
override in Llama.hpp is a no-op so the KV cache persists across runs
(LlamaForCausalLM), leaking state between benchmark->clear() calls from main.cpp
and corrupting multi-run / multi-pair results; to fix, either implement a real
KV reset when available (call the model's KV reset API from clear() once
LlamaForCausalLM exposes it) or, as an immediate mitigation, (1) modify clear()
to emit a clear-cache warning into the benchmark output/CSV and (2) add an
argument validation in main.cpp that enforces --runs == 1 (or refuses >1) when
using this Llama implementation, so benchmark->clear() cannot be relied upon
until a proper reset API is added.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


@huangzhenhua111

  1. Please see this CodeRabbit review.
  2. In your sweep_context_v2.sh, does 'cl' mean context length (i.e., prompt length plus test generation length)? In the initial code, 'cl' meant cache length, which may cause some misunderstanding.
  3. I don't see any information about 'cache length' in your code. Why?

https://github.com/UbiquitousLearning/mllm/pull/639/changes#r2807444788

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.


Comment on lines 136 to 140
// NOTE: decode_us is measured from first token timestamp; exclude that first token from decode throughput.
int decode_tokens = (token_count > 0) ? (token_count - 1) : 0;
r.decode_speed = (decode_us > 0 && decode_tokens > 0)
? (static_cast<float>(decode_tokens) / decode_us) * 1e6f
: 0.f;

⚠️ Potential issue | 🟡 Minor

Decode throughput calculation differs from Qwen3 benchmark.

This implementation correctly excludes the first token from decode throughput (token_count - 1), since the first token's latency is attributed to TTFT/prefill. However, Qwen3_W4A32_KAI.hpp's run() uses token_count directly (includes the first token in decode speed), making the two benchmarks non-comparable.

Consider aligning both implementations to the same convention (the Llama approach is arguably more correct).

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 136 - 140, The decode
throughput calculation in Llama.hpp computes decode_tokens as (token_count > 0 ?
token_count - 1 : 0) and uses that for r.decode_speed, but
Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making benchmarks
inconsistent; update the Qwen3 implementation (in run()) to match Llama's
convention by excluding the first token from decode throughput (use token_count
- 1 with the same guards: ensure decode_tokens is non-negative and only compute
r.decode_speed when decode_us and decode_tokens are > 0) so both benchmarks use
the same definition of decode throughput.

@huangzhenhua111
Author

Hi @chenghuaWang, thank you very much for your guidance by email. Could you take a look at this PR? I sincerely hope it can help you.

@chenghuaWang
Collaborator

Thank you for your PR.

  1. Your PR does not benchmark the NPU. This does not match the description in your PR's README.
  2. Please describe what Context Sweep is testing.
  3. We do not object to using AI programming, but this PR has very obvious AI traces and it is unclear what it is doing. Please check this PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@bench_artifacts/snapshot.md`:
- Line 4: The phrase "highly stable" over the 0.67–0.88 ms/tok range is
misleading because that span is ~31% variation; update the sentence in
bench_artifacts/snapshot.md to either replace "highly stable" with "relatively
consistent" or similar, and append the precise variation (≈31% variation) or the
numeric range explicitly to clarify; ensure the revised sentence still notes
that TTFT/prefill latency scales and becomes the primary bottleneck for long
contexts.
- Line 14: The term "VRAM" is incorrect for CPU benchmarks in the "Memory
Planning" section; update the phrase "VRAM lower bounds" to use CPU-appropriate
terminology such as "RAM lower bounds", "system memory lower bounds", or simply
"memory lower bounds" in the sentence under "Memory Planning" so it correctly
refers to x86_64 CPU baseline measurements and aligns with the rest of the
document.
🧹 Nitpick comments (3)
bench_artifacts/snapshot.md (3)

4-4: Add CPU model and system specifications for reproducibility.

Specify the CPU model (e.g., Intel i7-12700K, AMD Ryzen 7 5800X) and other relevant system details (OS version, RAM, compiler flags) to make these benchmark results reproducible and comparable.


6-6: Quantify "heavily concentrated" with profiling percentages.

Replace the vague "heavily concentrated" with specific metrics from perf output, such as "accounts for X% of total CPU time" or "dominates the top-N hotspots by Y%."


14-14: Quantify "strictly aligns" claim with error margins.

The phrase "strictly aligns" is vague. Provide quantitative support, such as "aligns within ±X% error" or "deviates by less than Y MB," to make this claim verifiable and meaningful.

# MLLM x86 TinyLlama: Context Sweep & Profiling Snapshot

## 1. Summary
On the x86_64 CPU baseline (8 threads), TinyLlama's decode per-token latency remains highly stable (~0.67–0.88 ms/tok) across context lengths from CL=512 to 4096. However, TTFT (Time-to-First-Token) / prefill latency scales significantly and becomes the primary compute bottleneck for long contexts.

⚠️ Potential issue | 🟡 Minor

Clarify "highly stable" with 31% variation range.

The latency range 0.67–0.88 ms/tok represents approximately 31% variation. Consider whether "highly stable" accurately characterizes this range, or use terminology like "relatively consistent" with explicit acknowledgment of the variation.

🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 4, The phrase "highly stable" over the
0.67–0.88 ms/tok range is misleading because that span is ~31% variation; update
the sentence in bench_artifacts/snapshot.md to either replace "highly stable"
with "relatively consistent" or similar, and append the precise variation (≈31%
variation) or the numeric range explicitly to clarify; ensure the revised
sentence still notes that TTFT/prefill latency scales and becomes the primary
bottleneck for long contexts.
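
The ≈31% figure quoted above is just the relative spread of the reported range:

```python
lo, hi = 0.67, 0.88  # ms/tok range reported in the snapshot
variation = (hi - lo) / lo
print(f"{variation:.1%}")  # 31.3%
```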


## 3. Implications for Static Graph & AOT
1. **Shape Bucketing:** The heavily reused GEMM shapes captured here provide the exact target dimensions required for AOT static-graph nearest-padding.
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.

⚠️ Potential issue | 🟡 Minor

Incorrect "VRAM" terminology for CPU benchmarks.

This document describes "x86_64 CPU baseline" benchmarks (line 4), but line 14 references "VRAM lower bounds." VRAM (Video RAM) is GPU/NPU memory terminology. For CPU benchmarks, use "RAM," "memory," or "system memory" instead.

📝 Suggested fix
-2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
+2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable RAM lower bounds for AOT memory pre-allocation.
📝 Committable suggestion


Suggested change
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable RAM lower bounds for AOT memory pre-allocation.
🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 14, The term "VRAM" is incorrect for CPU
benchmarks in the "Memory Planning" section; update the phrase "VRAM lower
bounds" to use CPU-appropriate terminology such as "RAM lower bounds", "system
memory lower bounds", or simply "memory lower bounds" in the sentence under
"Memory Planning" so it correctly refers to x86_64 CPU baseline measurements and
aligns with the rest of the document.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@sweep_context_v2.sh`:
- Around line 106-111: The heredoc used to compute DECODE_PER_TOK injects
shell-expanded variables into the Python snippet; change the delimiter to a
quoted heredoc (use <<'PY' instead of <<PY) so the Python here-doc is not
subject to shell expansion, keeping the computation for DECODE_PER_TOK (the
python3 - <<PY block) consistent with the kv_est_kb pattern and preventing
accidental interpolation of $tg and $DECODE_MS.
- Around line 121-124: The loop over context lengths (variable CLS) calls
run_one for tasks "decode_heavy" and "prefill_ttft" but because the script is
running with set -e any non‑zero return from run_one will abort the whole sweep;
change the loop to run each run_one call in a way that captures failures without
exiting (e.g., append || true or capture the exit code into a variable), record
failures per CL (store context lengths or task names that failed), and after the
loop exit non‑zero only if you want to signal overall failure while still
allowing the loop to continue; update the block where run_one is invoked for
decode_heavy and prefill_ttft to use this non‑fatal invocation and add a final
summary report of failures.
🧹 Nitpick comments (2)
sweep_context_v2.sh (2)

74-77: Hardcoded -n tiny_llama benchmark model name.

The model identifier passed to the benchmark binary is hardcoded. Combined with the hardcoded tiny_llama in the CSV (line 118), this means reusing the script for another model (Qwen, Llama-3) requires edits in multiple places. Extract a single MODEL_NAME variable and reference it in both locations.

Proposed fix

Add near the top defaults (e.g., after line 16):

+MODEL_NAME="${MODEL_NAME:-tiny_llama}"

Then update line 75 and line 118:

-    "$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
+    "$BIN" -n "$MODEL_NAME" -m "$MODEL" -c "$CFG" \
-  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,..." >> "$OUTCSV"
+  echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,..." >> "$OUTCSV"

101-113: rg (ripgrep) is not universally available — consider grep -oE for portability.

The script requires ripgrep, which is not installed by default on most Linux distributions or CI images. Standard grep -oE provides equivalent functionality here and improves portability.

Example replacement for line 101
-  TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+  TTFT_MS="$(grep -oE 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | grep -oE '[0-9.]+' | head -n 1 || echo 0)"

Apply the same pattern to lines 102, 103, and 113.

Comment on lines 106 to 111
DECODE_PER_TOK="$(python3 - <<PY
tg=float("$tg")
d=float("$DECODE_MS")
print(d/tg if tg>0 else 0.0)
PY
)"

⚠️ Potential issue | 🟡 Minor

Unquoted heredoc allows shell expansion into Python code — use a quoted delimiter for consistency and safety.

Unlike the kv_est_kb function (line 38) which correctly uses a quoted heredoc (<<'PY'), this block uses an unquoted <<PY. This causes $tg and $DECODE_MS to be interpolated directly into the Python source. While these values are expected to be numeric, using a quoted heredoc with environment variables (matching the pattern in kv_est_kb) is both safer and more consistent.

Proposed fix
   local DECODE_PER_TOK
-  DECODE_PER_TOK="$(python3 - <<PY
-tg=float("$tg")
-d=float("$DECODE_MS")
-print(d/tg if tg>0 else 0.0)
-PY
+  DECODE_PER_TOK="$(TG="$tg" DECODE_MS="$DECODE_MS" python3 - <<'PY'
+import os
+tg=float(os.environ["TG"])
+d=float(os.environ["DECODE_MS"])
+print(d/tg if tg>0 else 0.0)
+PY
 )"

Based on learnings: "Identify potential security issues (e.g., injection risks) and recommend using secure alternatives."

📝 Committable suggestion


Suggested change
DECODE_PER_TOK="$(python3 - <<PY
tg=float("$tg")
d=float("$DECODE_MS")
print(d/tg if tg>0 else 0.0)
PY
)"
DECODE_PER_TOK="$(TG="$tg" DECODE_MS="$DECODE_MS" python3 - <<'PY'
import os
tg=float(os.environ["TG"])
d=float(os.environ["DECODE_MS"])
print(d/tg if tg>0 else 0.0)
PY
)"
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 106 - 111, The heredoc used to compute
DECODE_PER_TOK injects shell-expanded variables into the Python snippet; change
the delimiter to a quoted heredoc (use <<'PY' instead of <<PY) so the Python
here-doc is not subject to shell expansion, keeping the computation for
DECODE_PER_TOK (the python3 - <<PY block) consistent with the kv_est_kb pattern
and preventing accidental interpolation of $tg and $DECODE_MS.

Comment on lines 121 to 124
for CL in $CLS; do
run_one "decode_heavy" "$CL" "$TG_DH"
run_one "prefill_ttft" "$CL" "$TG_TTFT"
done

⚠️ Potential issue | 🟠 Major

A single failed run aborts the entire sweep due to set -e.

Because run_one returns 1 on failure and set -e is active, a failure at any context length (e.g., OOM at 4096) will terminate the loop and skip all remaining configurations. For a benchmarking sweep, you typically want best-effort execution so partial results are still collected.

Proposed fix — continue on failure and report at the end
+FAILURES=0
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || ((FAILURES++))
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || ((FAILURES++))
 done
 
 echo
 echo "DONE -> $OUTCSV"
+if (( FAILURES > 0 )); then
+  echo "WARNING: $FAILURES run(s) failed — check logs in $OUTDIR"
+  exit 1
+fi
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 121 - 124, The loop over context lengths
(variable CLS) calls run_one for tasks "decode_heavy" and "prefill_ttft" but
because the script is running with set -e any non‑zero return from run_one will
abort the whole sweep; change the loop to run each run_one call in a way that
captures failures without exiting (e.g., append || true or capture the exit code
into a variable), record failures per CL (store context lengths or task names
that failed), and after the loop exit non‑zero only if you want to signal
overall failure while still allowing the loop to continue; update the block
where run_one is invoked for decode_heavy and prefill_ttft to use this non‑fatal
invocation and add a final summary report of failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Around line 293-298: The inline Chinese comment near the id==0 path should be
replaced with an English comment (e.g., explaining flops = 2*M*N*K and noting
llamafile_sgemm parameter ordering) for consistency; additionally, the batched
branch (where batch_count > 1) must call mm_shape_record before iterating the
batch so batched post-quantize sgemm calls are logged — call
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count *
(uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the start
of the else/batched block that uses batch_count, id, M, N, K).
🧹 Nitpick comments (1)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (1)

3-12: Duplicate <cstdint> include and unused <tuple>.

<cstdint> is included on both line 4 and line 12. <tuple> (line 8) does not appear to be used anywhere in this file.

Proposed fix
 #include <algorithm>
 #include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
-#include <cstdint>

Comment on lines 293 to 298
if (id == 0) {
// flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}

⚠️ Potential issue | 🟡 Minor

Non-English comment; also, shape recording is missing for the batched branch.

Line 294 has a Chinese comment — the rest of the codebase uses English. Please translate for consistency.

Additionally, the batch_count == 1 branch records shapes (lines 293-298), but the else batched branch (lines 306-330) does not record any shapes at all. Batched post-quantize sgemm calls will be invisible in the shape log.

Proposed fix
-        if (id == 0) {
-          // flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
-          mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
-                          /*calls=*/1,
-                          /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
-        }
+        if (id == 0) {
+          // flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent)
+          mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
+                          /*calls=*/1,
+                          /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
+        }

And in the batched branch (around line 312), add shape recording before the loop:

      // Record shape once for the entire batched operation
      mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
                      /*calls=*/(uint64_t)batch_count,
                      /*flops=*/2ull * (uint64_t)batch_count * (uint64_t)M * (uint64_t)N * (uint64_t)K);
📝 Committable suggestion


Suggested change
if (id == 0) {
// flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}
if (id == 0) {
// flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent)
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/1,
/*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` around lines 293 - 298, The
inline Chinese comment near the id==0 path should be replaced with an English
comment (e.g., explaining flops = 2*M*N*K and noting llamafile_sgemm parameter
ordering) for consistency; additionally, the batched branch (where batch_count >
1) must call mm_shape_record before iterating the batch so batched post-quantize
sgemm calls are logged — call mm_shape_record(/*kind=*/0, (int)M, (int)N,
(int)K, /*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count
* (uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the
start of the else/batched block that uses batch_count, id, M, N, K).
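
As a rough illustration of what such shape logging aggregates, here is a Python stand-in for mm_shape_record (the names mirror the C++ helper but are hypothetical, not the actual API):

```python
from collections import defaultdict

# (M, N, K) -> [total_calls, total_flops]
shape_log = defaultdict(lambda: [0, 0])

def mm_shape_record(M, N, K, calls=1):
    entry = shape_log[(M, N, K)]
    entry[0] += calls
    entry[1] += 2 * calls * M * N * K  # flops = 2*M*N*K per call

# one single-call GEMM and one batched GEMV-like call with batch_count=8
mm_shape_record(32, 2048, 2048)
mm_shape_record(1, 2048, 2048, calls=8)

# most frequently hit shapes first: candidate bucket-padding targets
for shape, (calls, flops) in sorted(shape_log.items(),
                                    key=lambda kv: -kv[1][0]):
    print(shape, calls, flops)
```

Recording the batched branch with calls=batch_count (as the review suggests) keeps the call counts comparable to the single-call path.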

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 62-66: Replace the hardcoded personal path and undefined variables
in the perf repro snippet: remove or replace `cd ~/mllm-runok` with a relative
project-root reference (e.g., use an existing $ROOT or run from repo root), and
add/inline definitions for the referenced variables `$MODEL` and `$CFG` (reuse
the same variable names defined in the "Quick Repro" section or declare them
near `$BIN`) so the perf record command uses defined symbols (`$BIN`, `$MODEL`,
`$CFG`) and is portable and reproducible across environments.
🧹 Nitpick comments (1)
bench_artifacts/README.md (1)

8-8: Consider simpler wording.

"prior to execution" could be simplified to "before execution" for better readability.

✏️ Suggested simplification
-2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements prior to execution, establishing strict lower bounds for AOT memory pre-allocation.
+2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements before execution, establishing strict lower bounds for AOT memory pre-allocation.

Comment on lines 62 to 66
cd ~/mllm-runok
rm -f perf.data perf.data.old
perf record -F 99 -g -- \
"$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
-pp 32 -tg 256 -t 8 -cl 2048 -r 1 -cs 0

⚠️ Potential issue | 🟠 Major

Hardcoded path and undefined variables prevent reproduction.

This section has multiple portability issues:

  1. Line 62: cd ~/mllm-runok is a hardcoded personal path that other users cannot reproduce.
  2. Lines 65-66: Variables $MODEL and $CFG are referenced but never defined in this document (only $BIN was defined earlier in line 29).
  3. Inconsistent with the portable approach used in the "Quick Repro" section above.
📝 Proposed fix for portability
-cd ~/mllm-runok
+# Note: Define these variables first or use actual paths
+export MODEL=/path/to/models/tinyllama-fp32.mllm
+export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
 perf record -F 99 -g -- \

Alternatively, reference the variables defined in the Quick Repro section:

-cd ~/mllm-runok
+# Prerequisites: Ensure BIN, MODEL, and CFG are set (see Quick Repro section)
+# Example:
+# export BIN=./build/bin/mllm-llm-benchmark
+# export MODEL=/path/to/your/model.mllm
+# export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
🤖 Prompt for AI Agents
In `@bench_artifacts/README.md` around lines 62 - 66, Replace the hardcoded
personal path and undefined variables in the perf repro snippet: remove or
replace `cd ~/mllm-runok` with a relative project-root reference (e.g., use an
existing $ROOT or run from repo root), and add/inline definitions for the
referenced variables `$MODEL` and `$CFG` (reuse the same variable names defined
in the "Quick Repro" section or declare them near `$BIN`) so the perf record
command uses defined symbols (`$BIN`, `$MODEL`, `$CFG`) and is portable and
reproducible across environments.

@huangzhenhua111 huangzhenhua111 changed the title [Bench] AOT Profiling Toolchain: Shape Capture, KV-Memory & Context Sweep [Bench] x86 CPU Baseline Profiler & Context Sweep Tool Feb 15, 2026
@huangzhenhua111 huangzhenhua111 changed the title [Bench] x86 CPU Baseline Profiler & Context Sweep Tool Add CPU benchmark tool with context length sweep Feb 15, 2026
@huangzhenhua111 huangzhenhua111 force-pushed the pr-627 branch 2 times, most recently from 8d45d81 to 04013c8 Compare February 16, 2026 09:01
@huangzhenhua111
Author

Thank you for your PR.

  1. Your PR does not benchmark the NPU. This does not match the description in your PR's README.
  2. Please describe what Context Sweep is testing.
  3. We do not object to using AI programming, but this PR has very obvious AI traces and it is unclear what it is doing. Please check this PR.

Hi Senior Wang, thank you very much for taking the time to review, and sorry for the rough initial submission.

  1. I know this PR is CPU-only; it just prepares some data for NPU/AOT that I hope can help with your current work. I've updated the README and description to state my intention clearly.
  2. Added a description of what the context sweep tests: it measures prefill/decode latency and memory at context lengths 256–4096 to find CPU bottlenecks.
  3. I have cleaned up the unclear code, including AI traces, and updated my PR.
    Could you please take another look when you have time? If you have any further requests, please let me know.
    Wishing you a Happy New Year in advance!

@chenghuaWang
Collaborator

There is no convenient and unified benchmark to evaluate the performance of mllm with different context lengths.

The MLLM benchmark tools include settings for Prefill Length and Decode Length.

Besides, I learned that you are focusing on NPU/AOT, so I added the shape log and KV estimation functions. The shape log helps identify which matrix shapes are used most frequently and can serve as bucket-padding targets. KV memory size estimation helps determine how much memory should be reserved to avoid OOM.

An LLM program is deterministic: its shapes are known in advance, and the KV memory can be calculated beforehand, so I don't think there is a need to capture shapes. MLLM provides the Perfetto tool to view kernel flame graphs and analyze specific bottlenecks.
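As a concrete illustration of the point that KV memory can be calculated beforehand, here is a minimal sketch. It assumes the standard transformer cache layout (one K and one V tensor per layer) and TinyLlama-1.1B hyperparameters taken from its public config (22 layers, 4 KV heads, head_dim 64); the function is illustrative, not code from this PR.

```cpp
#include <cstdint>

// Theoretical KV-cache lower bound for a standard transformer:
// two tensors (K and V) per layer, each holding
// num_kv_heads * ctx_len * head_dim elements of dtype_size bytes.
int64_t kvCacheBytes(int64_t num_layers, int64_t num_kv_heads,
                     int64_t head_dim, int64_t ctx_len, int64_t dtype_size) {
  return 2 * num_layers * num_kv_heads * head_dim * ctx_len * dtype_size;
}
```

For TinyLlama-1.1B in fp32 at ctx_len = 2048, this gives 2 * 22 * 4 * 64 * 2048 * 4 bytes, about 88 MiB; it is only a lower bound because it excludes weights, activations, and allocator overhead.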

For your PR:

  1. Please do not directly submit images and CSV files. You can describe the performance of MLLM on TinyLLaMA in the README, including the test methodology, the machine used for testing (system, CPU model, number of threads used).

  2. Please do not modify the MLLM kernel code, as it is not related to the benchmark. I suggest you only submit code related to the benchmark tools.

@chenghuaWang chenghuaWang changed the title Add CPU benchmark tool with context length sweep feat(benchmark): Add CPU benchmark tool with context length sweep Feb 17, 2026
@huangzhenhua111
Author

Senior Wang, Happy New Year! Thank you for your review. I've revised the PR according to your requests. I hope you'll take another look when you have time. If there's anything else I need to change, please feel free to let me know.


#include "models/All.hpp"

#ifndef MLLM_GIT_COMMIT_HASH
Collaborator

MLLM_GIT_COMMIT_HASH is provided by the CMake build system. Do not modify it.

avg_prefill_speed /= 3.0f;
avg_decode_speed /= 3.0f;

// Print average results
Collaborator

Please keep the CLI output.


### Single run
```bash
./mllm-llm-benchmark \
Collaborator

The README is for developers, not a description of your PR. You should add your section at the top of the original README.md.

int32_t head_dim = 0; // hidden_size / num_attention_heads
};

/**
Collaborator

Do not delete all comments here!

@chenghuaWang
Collaborator

  1. Please use the build method from the mllm documentation: python task.py tasks/build_x86.py
  2. Please format your code with .clang-format before you submit.
  3. Follow the review and modify your code.

Added CPU benchmark tool for context length sweeps. The existing benchmark
tools had PP/TG settings but no way to automate testing across different
context lengths. So I wrote a bash script to run sweeps automatically.

Key changes:
- Add Llama.hpp for LLaMA/TinyLLaMA models
- Update BenchmarkTemplate to support KV cache estimation
- Add context sweep automation scripts with plotting
- Add comprehensive README documentation

Tested on WSL2 (Ryzen 7 6800H) with TinyLLaMA fp32
@huangzhenhua111
Author

Hi Senior Wang, thank you for your review. I have revised the PR according to your requirements; please take another look when you have time.

@@ -1,5 +1,11 @@
# MLLM LLM Benchmark Tool

## Why
Collaborator

There is no need for a Why section.

@chenghuaWang
Collaborator

  1. Please follow the review and modify your code.
  2. @jialilve, please check this PR again.

@huangzhenhua111
Author

Hi @jialilve, I have deleted my Why section. Could you please take a look when you have time?

Comment on lines 89 to 92
void clear() override {
// TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
// For now, keep it as no-op to minimize API changes in PR1.
}
Contributor

@huangzhenhua111

  1. Please see CodeRabbit's review.
  2. In your sweep_context_v2.sh, does 'cl' mean context length (including prompt length and test generation length)? In the original code, 'cl' means cache length, which may cause some misunderstanding.
  3. I don't see any information about 'cache length' in your code. Why?

https://github.com/UbiquitousLearning/mllm/pull/639/changes#r2807444788

⚠️ Potential issue | 🟠 Major

clear() is a no-op — KV cache persists across consecutive runs.

Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.

Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.


@huangzhenhua111
Author

Hi @jialilve, thank you for your review:

  1. StaticCache::clearCache() is private in LlamaForCausalLM. Chenghua suggested not modifying the core library in this PR, so I added a safeguard that forces --runs=1 with a warning if the user requests more. The script already defaults to 1 anyway. I wanted to imitate your clear method in Qwen, but unlike Qwen3ForCausalLM, LlamaForCausalLM does not provide a public kvCache() method to obtain a reference to the KV cache. If this is not enough, @chenghuaWang, may I add a kvCache() getter to LlamaForCausalLM to match Qwen3's API? That would let me implement a proper KV-cache reset in the benchmark.
  2. The script used cl for context length, but the binary's -cl flag means cache length. I renamed it to ctx_len in the scripts to avoid confusion.
  3. Cache length is now printed in the console output and the CSV.
    Could you please take another look when you have time? If you have any further requests, please let me know.

- rename cl->ctx_len in scripts (was confusing with -cl flag)
- print cache_length in output for clarity
- force runs=1 for LLaMA only (clear() is no-op, Qwen3 unaffected)
- update README with CTX_LENS and RUNS limitation note

per @jialilve feedback
@chenghuaWang
Collaborator

can I add a kvCache() getter to LlamaForCausalLM to match Qwen3's API?

Please add one.
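A sketch of what such a getter could look like; the class shapes and member names here are simplified stand-ins mirroring the Qwen3 pattern discussed above, not the real mllm API.

```cpp
#include <cassert>

// Simplified stand-in for the real StaticCache (member names are assumed).
class StaticCache {
 public:
  void clearCache() { cached_tokens_ = 0; }    // reset between benchmark runs
  void append(int n) { cached_tokens_ += n; }  // simulate cache growth
  int cachedTokens() const { return cached_tokens_; }
 private:
  int cached_tokens_ = 0;
};

class LlamaForCausalLM {
 public:
  // Proposed public getter, matching the Qwen3ForCausalLM pattern:
  // exposes the cache so a benchmark can call kvCache().clearCache().
  StaticCache& kvCache() { return kv_cache_; }
 private:
  StaticCache kv_cache_;
};
```

With such a getter, the benchmark's clear() override could call model.kvCache().clearCache() instead of being a no-op, and the --runs=1 restriction could be lifted.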
