feat(benchmark): Add CPU benchmark tool with context length sweep (#639)

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds benchmark artifact ignores and docs, context-sweep orchestration and plotting tools, offline GEMM/GEMV shape logging in matmul, an x86 SIMD include, an extended benchmark CLI with CSV/multi-run/cooldown/KV estimation, and a new LLaMA benchmark implementation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Sweep as "sweep_context_v2.sh"
    participant Bench as "mllm-llm-benchmark (binary)"
    participant Model as "Llama_Benchmark"
    participant FS as "Filesystem / Logs / CSV"
    participant Plot as "make_snapshot_nopandas.py"
    Sweep->>Bench: launch run (mode, pp, tg, threads)
    Bench->>Model: init()/warmup()/run() (prefill + decode)
    Model-->>Bench: timing metrics, KV estimate
    Bench-->>FS: append CSV row, write per-run logs, perf data
    Sweep->>FS: collect logs, extract metrics
    Sweep->>Plot: feed CSV
    Plot-->>FS: write summary CSV and PNGs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 3 failed (2 warnings, 1 inconclusive)
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/mllm-llm-benchmark/main.cpp (1)
175-222: ⚠️ Potential issue | 🟡 Minor. Average results are computed but never printed to console; they are only written to CSV.

When `--output_csv` is not provided, the averages (`avg_ttft`, `avg_prefill_speed`, `avg_decode_speed`, latencies, KV estimates) are computed and then discarded. Users running without CSV output never see the summary. Consider printing a console summary regardless of CSV output.

Proposed addition after line 192:

```diff
 float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f;
+mllm::print("\n Average Results (", R, "runs ):");
+mllm::print("  Avg TTFT          :", avg_ttft, "ms");
+mllm::print("  Avg Prefill Speed :", avg_prefill_speed, "tokens/s");
+mllm::print("  Avg Decode Speed  :", avg_decode_speed, "tokens/s");
+mllm::print("  Avg Prefill Lat   :", avg_prefill_ms, "ms");
+mllm::print("  Avg Decode Lat    :", avg_decode_ms_per_tok, "ms/tok");
 // Rough KV cache estimate (bytes)
```
🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 22-30: The README contains hardcoded personal paths (e.g., BIN,
MODEL, CFG variables pointing to ~/mllm-runok and /home/huangzhenhua/...) which
are not portable; update the example to use neutral placeholders and environment
variables (e.g., $PROJECT_ROOT or $MODEL_DIR and relative paths) for BIN, MODEL,
CFG, and any user-specific directories so other developers can reproduce the
commands without leaking usernames or personal directories.
In `@bench_artifacts/snapshot.md`:
- Around line 12-24: The markdown images are missing alt text causing
accessibility lint MD045; update each image tag to include a short descriptive
alt string that conveys the content (e.g., "Decode latency vs context plot",
"Decode per-token latency vs context plot", "TTFT and prefill vs context plot",
"Peak RSS memory growth plot", "KV estimate growth (MB) plot") — specifically
edit the five image references referencing
plots/context_sweep_v2.decode_heavy.decode_ms.png,
plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png,
plots/context_sweep_v2.prefill_ttft.ttft_ms.png,
plots/context_sweep_v2.memory.peak_rss_gb.png, and
plots/context_sweep_v2.memory.kv_est_mb.png to replace `![](path)` with
`![descriptive alt text](path)` for each.
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Line 137: The declaration int64_t batch_count_mm = 1; in function mat_mul is
mis-indented (starts at column 0); fix by indenting it to match the surrounding
function body style (align with other local variable declarations inside
mat_mul) so batch_count_mm sits at the same indent level as nearby locals and
code blocks.
- Around line 136-141: Remove the unconditional top-level mm_shape_record(0,
...) in mat_mul (the initial mm_shape_record that uses batch_count_mm) and
instead record shapes only inside the actual compute branches: add
mm_shape_record(0, ...) in the first llamafile sgemm path when batch_count == 1
(inside check_llamafile_sgemm success branch) and add a batched
mm_shape_record(0, ...) in the llamafile batched path (the branch that handles
batch_count > 1), mirroring the existing post-quantize placement; ensure you do
not emit the kind=0 record when later branches will emit kind=1 or kind=2
records to avoid double-counting.
In `@mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp`:
- Line 10: Wrap the unconditional `#include <immintrin.h>` with the same x86 guard
used elsewhere (e.g., `#if defined(__SSE__) || defined(__x86_64__)`), so the
header is only included on x86 builds; locate the include line at the top of
fwd_bshd.hpp and surround it with the existing platform macro guard used for
intrinsics in this file to maintain cross-platform portability.
In `@scripts/make_snapshot_nopandas.py`:
- Around line 8-14: The helpers use bare except which swallows
KeyboardInterrupt/SystemExit and to_int is unused; change both functions to
catch only conversion-related exceptions (e.g., ValueError, TypeError) in
to_float and to_int, and either remove the unused to_int function or use it
where intended (search for to_int and to_float to determine intent) — do not use
a bare except and ensure interrupted signals are allowed to propagate.
- Around line 117-122: The printed "Wrote:" list omits two PNGs—prefill and
decode per-ms plots—so update the print block that uses summary_path, out_dir
and stamp to include the missing file paths
f"{stamp}.prefill_ttft.prefill_ms.png" and f"{stamp}.decode_heavy.decode_ms.png"
alongside the existing entries (the block that currently prints
f"{stamp}.prefill_ttft.ttft_ms.png" and
f"{stamp}.decode_heavy.decode_ms_per_tok.png"); ensure you print
os.path.join(out_dir, ...) for those two filenames so all six generated PNGs are
listed.
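As a concrete illustration of the narrowed exception handling suggested above for scripts/make_snapshot_nopandas.py, here is a minimal sketch (the helper names come from the review; the default-value parameters are an assumption, not code from the PR):

```python
def to_float(value, default=0.0):
    """Convert to float; fall back to default only on conversion errors."""
    try:
        return float(value)
    except (ValueError, TypeError):  # narrow catch: KeyboardInterrupt/SystemExit still propagate
        return default


def to_int(value, default=0):
    """Convert to int; fall back to default only on conversion errors."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return default
```

A bare `except:` would also swallow `KeyboardInterrupt`, so Ctrl-C during a long CSV parse could be silently ignored; catching only the conversion errors avoids that.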
In `@sweep_context_v2.sh`:
- Around line 16-18: The default MODEL value contains a hardcoded user-specific
path; update the MODEL variable assignment in sweep_context_v2.sh
(MODEL="${MODEL:-...}") to use $HOME (or another portable relative/config-driven
location) instead of "/home/huangzhenhua", mirroring how BIN and CFG are defined
so the script works for other users/CI environments.
- Line 128: The CSV echo currently hardcodes the model name "tiny_llama"; change
it to use a variable so the correct model is recorded by replacing the hardcoded
string in the echo that writes to "$OUTCSV" with a dynamic value derived from
either a new environment variable MODEL_NAME or from parsing the existing $MODEL
path (e.g., basename or strip extensions) before the echo; update the script to
set MODEL_NAME if unset by extracting it from $MODEL and reference MODEL_NAME in
the echo line so other models (Qwen, Llama-3, etc.) are correctly labeled.
In `@tools/mllm-llm-benchmark/main.cpp`:
- Around line 197-202: Validate the KV dtype byte option (kv_dtype_bytes) right
after it's parsed and convert it into a validated variable (e.g., kv_bpe) that
only allows 1, 2, or 4; if the value is outside that set, log an error and exit
(or clamp/fallback to a safe default) to prevent zero/negative values. Replace
direct uses of kv_dtype_bytes.get() in the KV estimate calculations (the
kv_est_bytes_pp / kv_est_bytes_final assignments) with the validated kv_bpe
variable so bytes_per is always one of 1, 2, or 4.
In `@tools/mllm-llm-benchmark/models/Llama.hpp`:
- Around line 89-92: The clear() override in Llama.hpp is a no-op so the KV
cache persists across runs (LlamaForCausalLM), leaking state between
benchmark->clear() calls from main.cpp and corrupting multi-run / multi-pair
results; to fix, either implement a real KV reset when available (call the
model's KV reset API from clear() once LlamaForCausalLM exposes it) or, as an
immediate mitigation, (1) modify clear() to emit a clear-cache warning into the
benchmark output/CSV and (2) add an argument validation in main.cpp that
enforces --runs == 1 (or refuses >1) when using this Llama implementation, so
benchmark->clear() cannot be relied upon until a proper reset API is added.
- Around line 136-140: The decode throughput calculation in Llama.hpp computes
decode_tokens as (token_count > 0 ? token_count - 1 : 0) and uses that for
r.decode_speed, but Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making
benchmarks inconsistent; update the Qwen3 implementation (in run()) to match
Llama's convention by excluding the first token from decode throughput (use
token_count - 1 with the same guards: ensure decode_tokens is non-negative and
only compute r.decode_speed when decode_us and decode_tokens are > 0) so both
benchmarks use the same definition of decode throughput.
- Around line 17-24: kvEstimateInfo() currently divides cfg_->hidden_size by
cfg_->num_attention_heads without a guard; mirror the defensive behavior used in
printModelInfo(): check cfg_ and that cfg_->num_attention_heads is non‑zero
before computing head_dim. If cfg_->num_attention_heads == 0 (or cfg_ is null)
return std::nullopt (or otherwise avoid the division) so
KVCacheEstimateInfo.head_dim is never computed with a zero divisor; update the
kvEstimateInfo() function to perform this check using the existing cfg_ and
KVCacheEstimateInfo symbols.
🧹 Nitpick comments (11)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (4)
3-12: Duplicate and unused includes.

`<cstdint>` is included twice (lines 4 and 12), and `<tuple>` (line 8) is not used anywhere in this file.

Proposed fix:

```diff
 #include <algorithm>
-#include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
 #include <cstdint>
```
74-98: The `atexit` handler accesses file-scope statics; this is safe only if no detached threads outlive `main()`.

The `atexit` callback locks `g_mm_shape_mu` and iterates `g_mm_shape`. This works because these statics are constructed before the `atexit` registration (so they are destroyed after the handler runs). However, if any detached thread is still calling `mm_shape_record` when the process exits, the mutex lock in the handler could race or deadlock. Since this is a profiling/debug feature gated behind an env var, the risk is low, but worth noting.
101-109: Mutex acquisition on every matmul call when shape logging is enabled.

When `MLLM_MATMUL_SHAPE_LOG` is active, every GEMM invocation contends on `g_mm_shape_mu`. Since this tool measures latency and throughput, the synchronization overhead could skew the benchmarks it is designed to support. A thread-local accumulation flushed at the end would avoid contention on the hot path.
52-61: Non-English comments reduce accessibility for contributors.

Several comments in this file are in Chinese (lines 52, 60-61, 288, 331, 341). The rest of the codebase uses English. Please translate these for consistency.
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)
72-72: Consider adding `[[nodiscard]]` to `kvEstimateInfo()`.

The return value should always be checked by callers. This aligns with the static analysis hint (modernize-use-nodiscard).

Proposed fix:

```diff
-  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+  [[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
```

tools/mllm-llm-benchmark/models/All.hpp (1)
31-35: Redundant conditions: "tinyllama" and "tiny_llama" already contain "llama".

The first condition `find("llama") != npos` already matches any string containing "tinyllama" or "tiny_llama", making the two extra checks dead code. Not a bug, but could be simplified.

Simplified condition:

```diff
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
```

tools/mllm-llm-benchmark/main.cpp (1)
22-28: Duplicate stringification macros: `STR`/`STR_HELPER` are identical to `STRINGIFY`/`STRINGIFY_INTERNAL` (lines 17-18) and appear unused.

`STR` and `STR_HELPER` are never referenced in the file. They duplicate `STRINGIFY`/`STRINGIFY_INTERNAL` already defined above.

Remove unused macros:

```diff
-#ifndef MLLM_GIT_COMMIT_HASH
-#define MLLM_GIT_COMMIT_HASH unknown
-#endif
-
-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
-
+#ifndef MLLM_GIT_COMMIT_HASH
+#define MLLM_GIT_COMMIT_HASH unknown
+#endif
```

scripts/make_snapshot_nopandas.py (1)
1-1: `math` is imported but never used.

Proposed fix:

```diff
-import sys, os, csv, math
+import sys, os, csv
```

sweep_context_v2.sh (3)
64-69: Unreachable guard on line 69.

If `cl <= tg`, the function returns on line 66. Otherwise `pp = cl - tg >= 1`, so the `pp < 1` check on line 69 is dead code. Harmless, but can be removed for clarity.
111-113: `rg` (ripgrep) is a non-standard dependency.

The metric parsing relies on `rg`, which isn't available by default on most systems. Consider either documenting this prerequisite in the usage comments or falling back to `grep -oP` (available on most GNU/Linux).

Alternative using grep -oP:

```diff
-  TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  PREFILL_MS="$(rg -o 'Prefill Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
-  DECODE_MS="$(rg -o 'Decode Latency\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+  TTFT_MS="$(grep -oP 'TTFT\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  PREFILL_MS="$(grep -oP 'Prefill Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
+  DECODE_MS="$(grep -oP 'Decode Latency\s*:\s*\K[0-9.]+' "$ALLLOG" | head -n 1 || echo 0)"
```
131-135: A single failed run aborts the entire sweep due to `set -e`.

Since `run_one` returns 1 on failure and `set -e` is active, any failed benchmark run (e.g., OOM at a large context length) will terminate the entire sweep. If the intent is to collect as much data as possible, consider tolerating individual failures:

Proposed fix:

```diff
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || echo "WARN: decode_heavy cl=$CL failed, continuing..."
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || echo "WARN: prefill_ttft cl=$CL failed, continuing..."
 done
```
bench_artifacts/snapshot.md (outdated diff excerpt)

```markdown
![](plots/context_sweep_v2.decode_heavy.decode_ms.png)

### **Fig.2: Decode per-token latency vs context (ms/tok)**
![](plots/context_sweep_v2.decode_heavy.decode_ms_per_tok.png)

### **Fig.3: TTFT/Prefill vs context (ms)**
![](plots/context_sweep_v2.prefill_ttft.ttft_ms.png)

### **Fig.4: Memory growth (Peak RSS, GB)**
![](plots/context_sweep_v2.memory.peak_rss_gb.png)

### **Fig.5: KV estimate growth (MB, formula-based)**
![](plots/context_sweep_v2.memory.kv_est_mb.png)
```
Images are missing alt text (accessibility).
All five image references use ![]() with empty alt text. Markdownlint MD045 flags this. Adding descriptive alt text improves accessibility and provides context when images fail to load.
Example fix for one image:

```diff
-![](plots/context_sweep_v2.decode_heavy.decode_ms.png)
+![Decode latency vs context plot](plots/context_sweep_v2.decode_heavy.decode_ms.png)
```

Apply similarly to all five image references.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 12-12: Images should have alternate text (alt text)
(MD045, no-alt-text)
[warning] 15-15: Images should have alternate text (alt text)
(MD045, no-alt-text)
[warning] 18-18: Images should have alternate text (alt text)
(MD045, no-alt-text)
[warning] 21-21: Images should have alternate text (alt text)
(MD045, no-alt-text)
[warning] 24-24: Images should have alternate text (alt text)
(MD045, no-alt-text)
```cpp
  // NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
  int64_t batch_count_mm = 1;
  for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
  mm_shape_record(0, M, N, K,
                  (uint64_t)batch_count_mm,
                  (uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));
```
Double-counting of GEMM shapes when the first llamafile path fails.
This mm_shape_record(kind=0, ...) fires unconditionally at the top of mat_mul. If the first check_llamafile_sgemm (line 170) fails and execution falls through to the post-quantize llamafile path (line 284), a second mm_shape_record(kind=0, ...) is emitted at line 289. Similarly, if execution reaches the gemv/gemm path (line 330), the top-level kind=0 record is redundant alongside the kind=1/kind=2 records.
This inflates FLOPs totals in the shape log. The top-level record should be removed and shape recording should happen only in the path that actually executes the compute.
Proposed fix (remove unconditional top-level recording):

```diff
-  // NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
-  int64_t batch_count_mm = 1;
-  for (size_t i = 0; i + 2 < dst_shape.size(); ++i) batch_count_mm *= dst_shape[i];
-  mm_shape_record(0, M, N, K,
-                  (uint64_t)batch_count_mm,
-                  (uint64_t)(2.0 * (double)batch_count_mm * M * N * (double)K));
-
   auto src0_dtype = src0_.dtype();
```

Then add a shape record inside the first llamafile batch_count == 1 path (around line 176) and the batched path (around line 190), mirroring what is already done for the post-quantize path at line 289.
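For reference, the FLOPs expression used by these records (2·B·M·N·K, one multiply plus one add per multiply-accumulate) can be written out as a tiny helper; the function name is illustrative, not from the PR:

```python
def gemm_flops(m, n, k, batch=1):
    """FLOPs for a (possibly batched) GEMM: 2 * B * M * N * K.

    The factor 2 counts one multiply and one add per multiply-accumulate,
    matching the 2.0 * batch_count_mm * M * N * K expression in mm_shape_record.
    """
    return 2 * batch * m * n * k

# Example: a single-token decode step against a 4096x4096 weight (GEMV, M=1)
flops = gemm_flops(1, 4096, 4096)  # 33,554,432 FLOPs
```

Double-counting the top-level record alongside the per-branch records would add this amount again for every call that falls through, which is why the review asks for recording only in the branch that actually executes.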
```cpp
  const int N = transpose1 ? src1_shape[src1_shape.size() - 2] : src1_shape[src1_shape.size() - 1];

  // NOTE: batch_count_mm is ONLY for MLLM_MATMUL_SHAPE_LOG aggregation (does not affect compute path)
int64_t batch_count_mm = 1;
```
Indentation is broken — int64_t batch_count_mm starts at column 0.
This line is inside the mat_mul function body but has no indentation, breaking the file's consistent style.
```diff
-int64_t batch_count_mm = 1;
+  int64_t batch_count_mm = 1;
```
```cpp
#include "mllm/core/Parallel.hpp"
#include "mllm/backends/cpu/kernels/common/paged_attn/arch.hpp"
#include "mllm/utils/UnsafeMacros.hpp"
#include <immintrin.h>
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's read the file to verify the include and its context
cat -n mllm/backends/cpu/kernels/common/paged_attn/fwd_bshd.hpp | head -150
```

🏁 Script executed:

```shell
# Search for other x86-specific includes in the CPU backend
rg -n '#include\s*<(immintrin|xmmintrin|emmintrin|pmmintrin)\.h>' mllm/backends/cpu/ -A2 -B2
```

🏁 Script executed:

```shell
# Check if immintrin.h is used elsewhere and how it's handled
rg -l 'immintrin\.h' mllm/ --type cpp --type hpp
```
Guard the x86-specific include for cross-platform portability.
The unconditional include of <immintrin.h> will fail on non-x86 platforms (ARM, RISC-V, etc.) where this header does not exist. Since all intrinsics usage is properly guarded by #if defined(__SSE__) || defined(__x86_64__) throughout the file, and every other x86 intrinsic include in the codebase follows the same pattern, this include should be similarly guarded to maintain consistency and portability.
Proposed fix:

```diff
+#if defined(__SSE__) || defined(__x86_64__)
 #include <immintrin.h>
+#endif
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#if defined(__SSE__) || defined(__x86_64__)
#include <immintrin.h>
#endif
```
```shell
  echo "TTFT=$TTFT_MS ms Prefill=$PREFILL_MS ms Decode=$DECODE_MS ms Decode/tok=$DECODE_PER_TOK ms peakRSS=$PEAK_RSS_KB KB KV_est=$KV_EST_KB KB"

  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,$TTFT_MS,$PREFILL_MS,$DECODE_MS,$DECODE_PER_TOK,$PEAK_RSS_KB,$KV_EST_KB" >> "$OUTCSV"
```
Model name is hardcoded as tiny_llama in CSV output.
If this script is reused for other models (Qwen, Llama-3, etc.), the CSV will contain an incorrect model identifier. Consider deriving it from the $MODEL path or introducing a MODEL_NAME environment variable.
Proposed fix
```diff
+MODEL_NAME="${MODEL_NAME:-$(basename "$MODEL" | sed 's/\.[^.]*$//')}"
+
 # (in the CSV echo on line 128)
-  echo "$TS,$GIT,$ARCH,tiny_llama,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
+  echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,$cl,$pp,$tg,$THREADS,..." >> "$OUTCSV"
```
```cpp
  if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
    const int32_t bytes_per = kv_dtype_bytes.get();  // 1/2/4
    // LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
    kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
    kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
  }
```
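The same LLaMA-style formula can be sanity-checked offline; here is a small sketch (the helper name and the TinyLlama-like example numbers are assumptions for illustration, not values from the PR):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """LLaMA-style KV cache estimate: K and V tensors (factor 2) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative TinyLlama-like shape: 22 layers, 4 KV heads, head_dim 64,
# 2048-token context, fp32 cache (4 bytes/elem) -> 92,274,688 bytes = 88 MiB.
est = kv_cache_bytes(22, 4, 64, 2048, 4)
```

Plugging in `bytes_per_elem = 0` (the unvalidated case flagged above) silently yields a zero estimate, which is exactly why the review asks for validation of `kv_dtype_bytes`.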
No validation on kv_dtype_bytes — accepts arbitrary values including 0 and negatives.
The help text states valid values are 1, 2, or 4, but any integer is accepted. A value of 0 would silently produce zero estimates; a negative value would produce negative estimates.
Proposed validation (e.g., after line 89):

```diff
+  int kv_bpe = kv_dtype_bytes.get();
+  if (kv_bpe != 1 && kv_bpe != 2 && kv_bpe != 4) {
+    mllm::print("[WARN] --kv_dtype_bytes should be 1, 2, or 4; got:", kv_bpe, "- defaulting to 4");
+    kv_bpe = 4;
+  }
```

Then use `kv_bpe` instead of `kv_dtype_bytes.get()` on line 198.
```cpp
  std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
    if (!cfg_) return std::nullopt;
    KVCacheEstimateInfo info;
    info.num_layers = cfg_->num_hidden_layers;
    info.num_kv_heads = cfg_->num_key_value_heads;
    info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
    return info;
  }
```
Missing division-by-zero guard in kvEstimateInfo() unlike printModelInfo().
Line 22 computes hidden_size / num_attention_heads without guarding against zero, whereas printModelInfo() (line 58) has a defensive check. For consistency and robustness:
Proposed fix:

```diff
   KVCacheEstimateInfo info;
   info.num_layers = cfg_->num_hidden_layers;
   info.num_kv_heads = cfg_->num_key_value_heads;
-  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
+  info.head_dim = (cfg_->num_attention_heads > 0)
+                    ? (cfg_->hidden_size / cfg_->num_attention_heads)
+                    : 0;
   return info;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  if (!cfg_) return std::nullopt;
  KVCacheEstimateInfo info;
  info.num_layers = cfg_->num_hidden_layers;
  info.num_kv_heads = cfg_->num_key_value_heads;
  info.head_dim = (cfg_->num_attention_heads > 0)
                    ? (cfg_->hidden_size / cfg_->num_attention_heads)
                    : 0;
  return info;
}
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
```cpp
  void clear() override {
    // TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
    // For now, keep it as no-op to minimize API changes in PR1.
  }
```
clear() is a no-op — KV cache persists across consecutive runs.
Since main.cpp calls benchmark->clear() before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks.
Consider at minimum documenting this limitation prominently in the CSV/output, or restricting --runs to 1 until the reset API is available.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
- Please see CodeRabbit's review above.
- In your sweep_context_v2.sh, does 'cl' mean context length (including prompt length and test generation length)? In the original code, 'cl' means cache length, which may cause some misunderstanding.
- I don't see any information about 'cache length' in your code; why is that?
https://github.com/UbiquitousLearning/mllm/pull/639/changes#r2807444788
> ⚠️ Potential issue | 🟠 Major
>
> `clear()` is a no-op; the KV cache persists across consecutive runs. Since `main.cpp` calls `benchmark->clear()` before each run within a pp/tg pair loop, this no-op means the KV cache from run N leaks into run N+1 (and into different pp/tg pairs). This can inflate prefill timings and corrupt throughput numbers for multi-run or multi-pair benchmarks. Consider at minimum documenting this limitation prominently in the CSV/output, or restricting `--runs` to 1 until the reset API is available.
```cpp
  // NOTE: decode_us is measured from first token timestamp; exclude that first token from decode throughput.
  int decode_tokens = (token_count > 0) ? (token_count - 1) : 0;
  r.decode_speed = (decode_us > 0 && decode_tokens > 0)
                     ? (static_cast<float>(decode_tokens) / decode_us) * 1e6f
                     : 0.f;
```
Decode throughput calculation differs from Qwen3 benchmark.
This implementation correctly excludes the first token from decode throughput (token_count - 1), since the first token's latency is attributed to TTFT/prefill. However, Qwen3_W4A32_KAI.hpp's run() uses token_count directly (includes the first token in decode speed), making the two benchmarks non-comparable.
Consider aligning both implementations to the same convention (the Llama approach is arguably more correct).
🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 136-140: The decode throughput calculation in Llama.hpp computes decode_tokens as (token_count > 0 ? token_count - 1 : 0) and uses that for r.decode_speed, but Qwen3_W4A32_KAI.hpp::run() uses token_count directly, making benchmarks inconsistent; update the Qwen3 implementation (in run()) to match Llama's convention by excluding the first token from decode throughput (use token_count - 1 with the same guards: ensure decode_tokens is non-negative and only compute r.decode_speed when decode_us and decode_tokens are > 0) so both benchmarks use the same definition of decode throughput.
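The difference between the two conventions is easy to see with a small numeric sketch (a hypothetical helper for illustration, not code from either benchmark):

```python
def decode_speed_tok_per_s(token_count: int, decode_us: int, exclude_first: bool) -> float:
    """Decode throughput in tokens/s.

    exclude_first=True mirrors the Llama.hpp convention (the first token's
    latency is attributed to TTFT/prefill); False mirrors the current
    Qwen3_W4A32_KAI.hpp behavior.
    """
    tokens = max(token_count - 1, 0) if exclude_first else token_count
    if decode_us <= 0 or tokens <= 0:
        return 0.0
    return tokens / decode_us * 1e6

# For a short 8-token generation measured over 10 ms of decode time,
# the two conventions differ by one full token of throughput:
llama_style = decode_speed_tok_per_s(8, 10_000, exclude_first=True)   # ~700 tok/s
qwen_style = decode_speed_tok_per_s(8, 10_000, exclude_first=False)   # ~800 tok/s
```

The relative gap shrinks as tg grows, which is why the discrepancy matters most for short decode runs.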
Hi @chenghuaWang, thank you very much for your email guidance. Can you take a look at this PR? I sincerely hope it can help you.
Thank you for your PR.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@bench_artifacts/snapshot.md`:
- Line 4: The phrase "highly stable" over the 0.67–0.88 ms/tok range is
misleading because that span is ~31% variation; update the sentence in
bench_artifacts/snapshot.md to either replace "highly stable" with "relatively
consistent" or similar, and append the precise variation (≈31% variation) or the
numeric range explicitly to clarify; ensure the revised sentence still notes
that TTFT/prefill latency scales and becomes the primary bottleneck for long
contexts.
- Line 14: The term "VRAM" is incorrect for CPU benchmarks in the "Memory
Planning" section; update the phrase "VRAM lower bounds" to use CPU-appropriate
terminology such as "RAM lower bounds", "system memory lower bounds", or simply
"memory lower bounds" in the sentence under "Memory Planning" so it correctly
refers to x86_64 CPU baseline measurements and aligns with the rest of the
document.
🧹 Nitpick comments (3)
bench_artifacts/snapshot.md (3)
4-4: Add CPU model and system specifications for reproducibility. Specify the CPU model (e.g., Intel i7-12700K, AMD Ryzen 7 5800X) and other relevant system details (OS version, RAM, compiler flags) to make these benchmark results reproducible and comparable.
6-6: Quantify "heavily concentrated" with profiling percentages. Replace the vague "heavily concentrated" with specific metrics from `perf` output, such as "accounts for X% of total CPU time" or "dominates the top-N hotspots by Y%."
14-14: Quantify "strictly aligns" claim with error margins. The phrase "strictly aligns" is vague. Provide quantitative support, such as "aligns within ±X% error" or "deviates by less than Y MB," to make this claim verifiable and meaningful.
bench_artifacts/snapshot.md
Outdated
```markdown
# MLLM x86 TinyLlama: Context Sweep & Profiling Snapshot

## 1. Summary
On the x86_64 CPU baseline (8 threads), TinyLlama's decode per-token latency remains highly stable (~0.67–0.88 ms/tok) across context lengths from CL=512 to 4096. However, TTFT (Time-to-First-Token) / prefill latency scales significantly and becomes the primary compute bottleneck for long contexts.
```
Clarify "highly stable" with 31% variation range.
The latency range 0.67–0.88 ms/tok represents approximately 31% variation. Consider whether "highly stable" accurately characterizes this range, or use terminology like "relatively consistent" with explicit acknowledgment of the variation.
🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 4, The phrase "highly stable" over the
0.67–0.88 ms/tok range is misleading because that span is ~31% variation; update
the sentence in bench_artifacts/snapshot.md to either replace "highly stable"
with "relatively consistent" or similar, and append the precise variation (≈31%
variation) or the numeric range explicitly to clarify; ensure the revised
sentence still notes that TTFT/prefill latency scales and becomes the primary
bottleneck for long contexts.
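For reference, the "~31% variation" figure follows directly from the quoted latency range:

```python
lo_ms, hi_ms = 0.67, 0.88  # decode per-token latency range from the snapshot
variation = (hi_ms - lo_ms) / lo_ms  # relative spread against the low end
print(f"{variation:.1%}")
```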
bench_artifacts/snapshot.md
Outdated
```markdown
## 3. Implications for Static Graph & AOT
1. **Shape Bucketing:** The heavily reused GEMM shapes captured here provide the exact target dimensions required for AOT static-graph nearest-padding.
2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
```
Incorrect "VRAM" terminology for CPU benchmarks.
This document describes "x86_64 CPU baseline" benchmarks (line 4), but line 14 references "VRAM lower bounds." VRAM (Video RAM) is GPU/NPU memory terminology. For CPU benchmarks, use "RAM," "memory," or "system memory" instead.
📝 Suggested fix
```diff
-2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable VRAM lower bounds for AOT memory pre-allocation.
+2. **Memory Planning:** Peak RSS scales predictably with context length. The formula-based KV-cache estimation strictly aligns with the real-time footprint, establishing reliable RAM lower bounds for AOT memory pre-allocation.
```
🤖 Prompt for AI Agents
In `@bench_artifacts/snapshot.md` at line 14, The term "VRAM" is incorrect for CPU
benchmarks in the "Memory Planning" section; update the phrase "VRAM lower
bounds" to use CPU-appropriate terminology such as "RAM lower bounds", "system
memory lower bounds", or simply "memory lower bounds" in the sentence under
"Memory Planning" so it correctly refers to x86_64 CPU baseline measurements and
aligns with the rest of the document.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@sweep_context_v2.sh`:
- Around line 106-111: The heredoc used to compute DECODE_PER_TOK injects
shell-expanded variables into the Python snippet; change the delimiter to a
quoted heredoc (use <<'PY' instead of <<PY) so the Python here-doc is not
subject to shell expansion, keeping the computation for DECODE_PER_TOK (the
python3 - <<PY block) consistent with the kv_est_kb pattern and preventing
accidental interpolation of $tg and $DECODE_MS.
- Around line 121-124: The loop over context lengths (variable CLS) calls
run_one for tasks "decode_heavy" and "prefill_ttft" but because the script is
running with set -e any non‑zero return from run_one will abort the whole sweep;
change the loop to run each run_one call in a way that captures failures without
exiting (e.g., append || true or capture the exit code into a variable), record
failures per CL (store context lengths or task names that failed), and after the
loop exit non‑zero only if you want to signal overall failure while still
allowing the loop to continue; update the block where run_one is invoked for
decode_heavy and prefill_ttft to use this non‑fatal invocation and add a final
summary report of failures.
🧹 Nitpick comments (2)
sweep_context_v2.sh (2)
74-77: Hardcoded `-n tiny_llama` benchmark model name. The model identifier passed to the benchmark binary is hardcoded. Combined with the hardcoded `tiny_llama` in the CSV (line 118), this means reusing the script for another model (Qwen, Llama-3) requires edits in multiple places. Extract a single `MODEL_NAME` variable and reference it in both locations.

Proposed fix: add near the top defaults (e.g., after line 16):

```diff
+MODEL_NAME="${MODEL_NAME:-tiny_llama}"
```

Then update line 75 and line 118:

```diff
- "$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
+ "$BIN" -n "$MODEL_NAME" -m "$MODEL" -c "$CFG" \

- echo "$TS,$GIT,$ARCH,tiny_llama,$mode,..." >> "$OUTCSV"
+ echo "$TS,$GIT,$ARCH,$MODEL_NAME,$mode,..." >> "$OUTCSV"
```
101-113: `rg` (ripgrep) is not universally available; consider `grep -oE` for portability. The script requires ripgrep, which is not installed by default on most Linux distributions or CI images. Standard `grep -oE` provides equivalent functionality here and improves portability.

Example replacement for line 101:

```diff
- TTFT_MS="$(rg -o 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | rg -o '[0-9.]+' | head -n 1 || echo 0)"
+ TTFT_MS="$(grep -oE 'TTFT\s*: *[0-9.]+ ms' "$ALLLOG" | grep -oE '[0-9.]+' | head -n 1 || echo 0)"
```

Apply the same pattern to lines 102, 103, and 113.
sweep_context_v2.sh
Outdated
```bash
DECODE_PER_TOK="$(python3 - <<PY
tg=float("$tg")
d=float("$DECODE_MS")
print(d/tg if tg>0 else 0.0)
PY
)"
```
Unquoted heredoc allows shell expansion into Python code — use a quoted delimiter for consistency and safety.
Unlike the kv_est_kb function (line 38) which correctly uses a quoted heredoc (<<'PY'), this block uses an unquoted <<PY. This causes $tg and $DECODE_MS to be interpolated directly into the Python source. While these values are expected to be numeric, using a quoted heredoc with environment variables (matching the pattern in kv_est_kb) is both safer and more consistent.
Proposed fix

```diff
 local DECODE_PER_TOK
- DECODE_PER_TOK="$(python3 - <<PY
-tg=float("$tg")
-d=float("$DECODE_MS")
-print(d/tg if tg>0 else 0.0)
-PY
+ DECODE_PER_TOK="$(TG="$tg" DECODE_MS="$DECODE_MS" python3 - <<'PY'
+import os
+tg=float(os.environ["TG"])
+d=float(os.environ["DECODE_MS"])
+print(d/tg if tg>0 else 0.0)
+PY
 )"
```

Based on learnings: "Identify potential security issues (e.g., injection risks) and recommend using secure alternatives."
🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 106 - 111, The heredoc used to compute
DECODE_PER_TOK injects shell-expanded variables into the Python snippet; change
the delimiter to a quoted heredoc (use <<'PY' instead of <<PY) so the Python
here-doc is not subject to shell expansion, keeping the computation for
DECODE_PER_TOK (the python3 - <<PY block) consistent with the kv_est_kb pattern
and preventing accidental interpolation of $tg and $DECODE_MS.
```bash
for CL in $CLS; do
  run_one "decode_heavy" "$CL" "$TG_DH"
  run_one "prefill_ttft" "$CL" "$TG_TTFT"
done
```
A single failed run aborts the entire sweep due to set -e.
Because run_one returns 1 on failure and set -e is active, a failure at any context length (e.g., OOM at 4096) will terminate the loop and skip all remaining configurations. For a benchmarking sweep, you typically want best-effort execution so partial results are still collected.
Proposed fix — continue on failure and report at the end
```diff
+FAILURES=0
 for CL in $CLS; do
-  run_one "decode_heavy" "$CL" "$TG_DH"
-  run_one "prefill_ttft" "$CL" "$TG_TTFT"
+  run_one "decode_heavy" "$CL" "$TG_DH" || ((FAILURES++))
+  run_one "prefill_ttft" "$CL" "$TG_TTFT" || ((FAILURES++))
 done
 echo
 echo "DONE -> $OUTCSV"
+if (( FAILURES > 0 )); then
+  echo "WARNING: $FAILURES run(s) failed — check logs in $OUTDIR"
+  exit 1
+fi
```

🤖 Prompt for AI Agents
In `@sweep_context_v2.sh` around lines 121 - 124, The loop over context lengths
(variable CLS) calls run_one for tasks "decode_heavy" and "prefill_ttft" but
because the script is running with set -e any non‑zero return from run_one will
abort the whole sweep; change the loop to run each run_one call in a way that
captures failures without exiting (e.g., append || true or capture the exit code
into a variable), record failures per CL (store context lengths or task names
that failed), and after the loop exit non‑zero only if you want to signal
overall failure while still allowing the loop to continue; update the block
where run_one is invoked for decode_heavy and prefill_ttft to use this non‑fatal
invocation and add a final summary report of failures.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp`:
- Around line 293-298: The inline Chinese comment near the id==0 path should be
replaced with an English comment (e.g., explaining flops = 2*M*N*K and noting
llamafile_sgemm parameter ordering) for consistency; additionally, the batched
branch (where batch_count > 1) must call mm_shape_record before iterating the
batch so batched post-quantize sgemm calls are logged — call
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
/*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count *
(uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the start
of the else/batched block that uses batch_count, id, M, N, K).
🧹 Nitpick comments (1)
mllm/backends/cpu/kernels/common/ggml/matmul.cpp (1)
3-12: Duplicate `<cstdint>` include and unused `<tuple>`. `<cstdint>` is included on both line 4 and line 12. `<tuple>` (line 8) does not appear to be used anywhere in this file.

Proposed fix

```diff
 #include <algorithm>
 #include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <mutex>
-#include <tuple>
 #include <unordered_map>
 #include <vector>
 #include <cassert>
-#include <cstdint>
```
```cpp
if (id == 0) {
  // flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
  mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
                  /*calls=*/1,
                  /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
}
```
Non-English comment; also, shape recording is missing for the batched branch.
Line 294 has a Chinese comment — the rest of the codebase uses English. Please translate for consistency.
Additionally, the batch_count == 1 branch records shapes (lines 293-298), but the else batched branch (lines 306-330) does not record any shapes at all. Batched post-quantize sgemm calls will be invisible in the shape log.
Proposed fix

```diff
-  if (id == 0) {
-    // flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价)
-    mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
-                    /*calls=*/1,
-                    /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
-  }
+  if (id == 0) {
+    // flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent)
+    mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
+                    /*calls=*/1,
+                    /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K);
+  }
```

And in the batched branch (around line 312), add shape recording before the loop:

```cpp
// Record shape once for the entire batched operation
mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K,
                /*calls=*/(uint64_t)batch_count,
                /*flops=*/2ull * (uint64_t)batch_count * (uint64_t)M * (uint64_t)N * (uint64_t)K);
```
/*flops=*/2ull * (uint64_t)batch_count * (uint64_t)M * (uint64_t)N * (uint64_t)K);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (id == 0) { | |
| // flops = 2*M*N*K(注意 llamafile_sgemm 参数顺序是 N,M,K/... 但数学等价) | |
| mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K, | |
| /*calls=*/1, | |
| /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K); | |
| } | |
| if (id == 0) { | |
| // flops = 2*M*N*K (llamafile_sgemm arg order is N,M,K but mathematically equivalent) | |
| mm_shape_record(/*kind=*/0, (int)M, (int)N, (int)K, | |
| /*calls=*/1, | |
| /*flops=*/2ull * (uint64_t)M * (uint64_t)N * (uint64_t)K); | |
| } |
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/ggml/matmul.cpp` around lines 293 - 298, The
inline Chinese comment near the id==0 path should be replaced with an English
comment (e.g., explaining flops = 2*M*N*K and noting llamafile_sgemm parameter
ordering) for consistency; additionally, the batched branch (where batch_count >
1) must call mm_shape_record before iterating the batch so batched post-quantize
sgemm calls are logged — call mm_shape_record(/*kind=*/0, (int)M, (int)N,
(int)K, /*calls=*/(uint64_t)batch_count, /*flops=*/2ull * (uint64_t)batch_count
* (uint64_t)M * (uint64_t)N * (uint64_t)K) in that branch (place it near the
start of the else/batched block that uses batch_count, id, M, N, K).
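For intuition, the kind of aggregation `mm_shape_record` performs can be sketched in Python (names and structure are illustrative assumptions only; the real implementation lives in matmul.cpp):

```python
from collections import defaultdict

def make_shape_recorder():
    """Return (table, record), where record() aggregates GEMM shapes.

    flops per call follows the convention in the review: 2*M*N*K.
    """
    table = defaultdict(lambda: {"calls": 0, "flops": 0})

    def record(kind, m, n, k, calls=1):
        entry = table[(kind, m, n, k)]
        entry["calls"] += calls
        entry["flops"] += 2 * calls * m * n * k

    return table, record

table, record = make_shape_recorder()
record(0, 32, 2048, 2048)           # prefill-style GEMM, one call
record(0, 1, 2048, 2048, calls=8)   # decode-style GEMVs, recorded once with calls=batch_count
```

Recording a batched operation once with `calls=batch_count` (rather than per iteration) keeps the shape log compact while preserving exact call and flop totals.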
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@bench_artifacts/README.md`:
- Around line 62-66: Replace the hardcoded personal path and undefined variables
in the perf repro snippet: remove or replace `cd ~/mllm-runok` with a relative
project-root reference (e.g., use an existing $ROOT or run from repo root), and
add/inline definitions for the referenced variables `$MODEL` and `$CFG` (reuse
the same variable names defined in the "Quick Repro" section or declare them
near `$BIN`) so the perf record command uses defined symbols (`$BIN`, `$MODEL`,
`$CFG`) and is portable and reproducible across environments.
🧹 Nitpick comments (1)
bench_artifacts/README.md (1)
8-8: Consider simpler wording. "prior to execution" could be simplified to "before execution" for better readability.

✏️ Suggested simplification

```diff
-2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements prior to execution, establishing strict lower bounds for AOT memory pre-allocation.
+2. **KV Cache Memory Bounds:** Calculates theoretical and empirical (Peak RSS) VRAM requirements before execution, establishing strict lower bounds for AOT memory pre-allocation.
```
bench_artifacts/README.md
Outdated
```bash
cd ~/mllm-runok
rm -f perf.data perf.data.old
perf record -F 99 -g -- \
  "$BIN" -n tiny_llama -m "$MODEL" -c "$CFG" \
  -pp 32 -tg 256 -t 8 -cl 2048 -r 1 -cs 0
```
Hardcoded path and undefined variables prevent reproduction.
This section has multiple portability issues:
- Line 62: `cd ~/mllm-runok` is a hardcoded personal path that other users cannot reproduce.
- Lines 65-66: Variables `$MODEL` and `$CFG` are referenced but never defined in this document (only `$BIN` was defined earlier, on line 29).
- Inconsistent with the portable approach used in the "Quick Repro" section above.
📝 Proposed fix for portability

```diff
-cd ~/mllm-runok
+# Note: Define these variables first or use actual paths
+export MODEL=/path/to/models/tinyllama-fp32.mllm
+export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
 perf record -F 99 -g -- \
```

Alternatively, reference the variables defined in the Quick Repro section:

```diff
-cd ~/mllm-runok
+# Prerequisites: Ensure BIN, MODEL, and CFG are set (see Quick Repro section)
+# Example:
+# export BIN=./build/bin/mllm-llm-benchmark
+# export MODEL=/path/to/your/model.mllm
+# export CFG=./examples/llama/config_tiny_llama.json
+
 rm -f perf.data perf.data.old
```
In `@bench_artifacts/README.md` around lines 62 - 66, Replace the hardcoded
personal path and undefined variables in the perf repro snippet: remove or
replace `cd ~/mllm-runok` with a relative project-root reference (e.g., use an
existing $ROOT or run from repo root), and add/inline definitions for the
referenced variables `$MODEL` and `$CFG` (reuse the same variable names defined
in the "Quick Repro" section or declare them near `$BIN`) so the perf record
command uses defined symbols (`$BIN`, `$MODEL`, `$CFG`) and is portable and
reproducible across environments.
8d45d81 to
04013c8
Compare
Hi Senior Wang, thank you very much for taking the time to review, and sorry for the rough initial submission.
The MLLM benchmark tools include settings for Prefill Length and Decode Length.
The program for an LLM is deterministic: its shapes are known in advance, and the memory for the KV cache can be calculated beforehand. I don't think there is a need to capture shapes. MLLM provides the perfetto tool to view kernel flame graphs and analyze specific bottlenecks. For your PR:
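To make the "computable beforehand" point concrete, here is a minimal sketch of the usual KV-cache accounting. The function name and the example dimensions are illustrative assumptions (roughly TinyLlama-1.1B-shaped), not values read from the mllm code:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 4) -> int:
    # Per layer, K and V each hold [ctx_len, num_kv_heads * head_dim] elements,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * ctx_len * bytes_per_elem

# Example: 22 layers, 4 KV heads, head_dim 64, fp32, context length 2048
est = kv_cache_bytes(22, 4, 64, 2048)
print(f"{est / 2**20:.0f} MiB")
```

Because the formula is linear in `ctx_len`, doubling the context length exactly doubles the estimated KV footprint, which is what a sweep over context lengths should observe.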
81cb984 to
c96ea48
Compare
Senior Wang, Happy New Year! Thank you for your review. I've revised it according to your requests. I hope you'll take another look when you have time. If there's anything else I need to change, please feel free to let me know.
tools/mllm-llm-benchmark/main.cpp
Outdated
```cpp
#include "models/All.hpp"

#ifndef MLLM_GIT_COMMIT_HASH
```
MLLM_GIT_COMMIT_HASH is provided by the CMake system. Do not modify it.
```cpp
avg_prefill_speed /= 3.0f;
avg_decode_speed /= 3.0f;

// Print average results
```
Please keep the CLI output.
````markdown
### Single run
```bash
./mllm-llm-benchmark \
````
The README is for developers, not a description of your PR. You should add your part at the top of the original README.md.
```cpp
int32_t head_dim = 0;  // hidden_size / num_attention_heads
};

/**
```
Do not delete all comments here!
7caabc1 to
4c42b62
Compare
4c42b62 to
edfba9c
Compare
Added CPU benchmark tool for context length sweeps. The existing benchmark tools had PP/TG settings but no way to automate testing across different context lengths, so I wrote a bash script to run sweeps automatically.

Key changes:
- Add Llama.hpp for LLaMA/TinyLLaMA models
- Update BenchmarkTemplate to support KV cache estimation
- Add context sweep automation scripts with plotting
- Add comprehensive README documentation

Tested on WSL2 (Ryzen 7 6800H) with TinyLLaMA fp32.
edfba9c to
fc46527
Compare
Hi Senior Wang, thank you for your review. I have revised it according to your requirements; please take another look when you have time.
tools/mllm-llm-benchmark/README.md
Outdated
```
@@ -1,5 +1,11 @@
# MLLM LLM Benchmark Tool

## Why
```
There is no need to have a "Why" section.
```cpp
void clear() override {
  // TODO: expose a public KV-cache reset API for LlamaForCausalLM (if needed).
  // For now, keep it as no-op to minimize API changes in PR1.
}
```
Hi @jialilve, thank you for your review:
- rename cl -> ctx_len in scripts (was confusing with the -cl flag)
- print cache_length in output for clarity
- force runs=1 for LLaMA only (clear() is a no-op; Qwen3 unaffected)
- update README with CTX_LENS and RUNS limitation note per @jialilve feedback
60a549f to
31fc935
Compare
Please add one.
Added CPU benchmark tool for context length sweeps. The existing benchmark tools had PP/TG settings but no way to automate testing across different context lengths, so I wrote a bash script that sweeps from 256 to 4096 tokens and logs everything to CSV. Tested on my machine (WSL2, Ryzen 7 6800H) with TinyLLaMA fp32. The script runs two modes per context length: one measures TTFT, the other measures decode throughput. Also added KV cache size estimation. Usage is in the README: basically just export BIN/MODEL/CFG and run the sweep script, or use the benchmark binary directly for single runs.
Code structure: