feat:upgrade to 0ccbfde by lochjin · Pull Request #4 · Qitmeer/llama.cpp

lochjin · 2026-02-14T02:39:12Z

Make sure to read the contributing guidelines before submitting a PR

* from previous PR * Make instruction(system) as first message * Convert [input_message] (text/image/file) * Rename convert_responses_to_chatcmpl(body) -> response_body * Initial tool call support * Erase instructions field from chatcmpl body * Feed reasoning texts to chat template * Use std::vector instead of opaque json array * Make output_item.added events consistent * Move `server_task_result_cmpl_partial::update` from header to source * Match ID of output_item.added and .done events * Add function_call only if there is no "fc_" prefix * Add function call output at non-streaming API * Test if ID is persistent * Add doc * Fix style - use trailing comma * Rewrite state management * catch up with upstream/master * Fix style - "type" is the first item of SSE data * Explicitly check "instructions" from response_body * Make lambdas static * Check if reasoning content exists * Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final * Reject `input_file` since it is not supported by chatcmpl * Add "fc_" prefix to non-straming function call id as coderabbit pointed out --------- Co-authored-by: openingnow <>

…ml-org#18987) Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…18945) * vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused. * fix crash with perf logger

…8997) This commit removes the mention of RoPE in the comment for the Q and K computation as RoPE is not applied.

* fix: Use `tabular-nums` for chat message statistics * fix: Rebuild WebUI

* jinja: support none|string * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add as_string() --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add Ministral3ForCausalLM architeture This adds support for newer architectres like Devstral-2 * removed blank line found after function decorator Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes ggml-org#18848

* Move `task_result_state::update_chat_msg` to match with header * Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header --------- Co-authored-by: openingnow <>

)

… for specialized kqv kernel for adreno (ggml-org#18970) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well

…g#18866)

* mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…gml-org#18994) * support reasoning_content input * report template caps to webui * add docs * rm commented code

* cli : load parser definition * cont : only unload if a parser is defined

…ions (i8mm) (ggml-org#18860) * Boilerplate for q5_Kx8 REPACK on ARM and fallback Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Implements make_block_q5_Kx8 by extending make_block_q4_Kx8 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q5_K repack gemm and gemv generics * Gemm and Gemv ARM implementations (i8mm) * Improved qh manipulation looking at non-repack vec_dot implementation * Full unroll * Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments. Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix wrong fallback definitions of Q5_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed comments. Reverted unnecessary formatting Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed typo in generic definitions * Switching AND + Shift with Shift Insert. Better op interleaving. * Vectorize + unroll the block scales * Apply gemm optimizations to gemv * Improve bias calculation --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

) This commit modifies all the utility scripts to use an optional BUILD_DIR variable/argument to specify the build directory. The motivation for this is that Commit 3d55846 ("model-conversion : add BUILD_DIR variable to run-converted-model scripts") introduced this variable to the causal and embeddings scripts, but I missed the scripts in the utils directory.

…rg#19019)

…g#18992) * use malloc to support both iGPU and dGPU in same time * support windows --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

…gml-org#18898) * graph : avoid branches between embedding and token inputs * models : make deepstack graphs (e.g. Qwen3 VL) have constant topology * ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI * cont : pad token embeddings to n_embd_inp

* optimize flash attention kernel by improving score computation and online softmax update * wip * Refactor online softmax update in flash attention kernel for improved performance * Optimize flash attention kernel by replacing float array with HVX_Vector for score computation * wip

* ggml-cuda: add split-wise cuda graph * add n-cpu-moe compare_llama_bench.py * fix hip/musa builds

…gml-org#19460) * model: support GLM MoE DSA arch * working version * pyright * keep indexer tensors * add indexer gguf params * loaded now * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * update * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * minor fix and cleanup --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common : remove legacy .json to .etag migration code Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : simplify common_download_file_single_online This commit also force a redownload if the file exists but has no .etag file. Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…gml-org#19583) * ggml-hexagon: fa improvements ggml-hexagon: optimize flash attention calculations with improved variable handling ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32 ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements ggml-hexagon: optimize flash attention by changing slope vector type to F16 * hexfa: fixed test-backend-ops failurs due to leftover element handling * hexagon: refactor and optimize fa to use local context struct * ggml-hexagon: optimize flash-attention using hvx_vec_expf Use HVX for online softmax. --------- Co-authored-by: chraac <chraac@gmail.com>

* Faster tensors (ggml-org#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (ggml-org#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Fix .gitignore * Add memory64 option and remove unneeded macros for setting threads to 1 --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* FlashAttention (ggml-org#13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (inlcluding xielu) working * removed unnecesarry checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (ggml-org#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (ggml-org#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (ggml-org#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * Refactor flashattention to increase parallelism, use direct loads for KV in somce cases * Checkpoint * formatting * Update to account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness

jeffbolznv and others added 30 commits January 21, 2026 17:43

vulkan: support flash attention GQA/split_k with small batches (ggml-…

33f890e

…org#18938)

common : improve error message when HTTPS is missing but required (gg…

14be5a3

…ml-org#18987) Signed-off-by: Adrien Gallouët <angt@huggingface.co>

llama : clarify nemotron-h.cpp comment about RoPE [no ci] (ggml-org#1…

9da3dcd

…8997) This commit removes the mention of RoPE in the comment for the Q and K computation as RoPE is not applied.

fix: Use tabular-nums for chat message statistics (ggml-org#18915)

3802d3c

* fix: Use `tabular-nums` for chat message statistics * fix: Rebuild WebUI

ci : update GitHub Actions versions [no ci] (ggml-org#18935)

6b99a22

ggml-zdnn : mark zDNN buffers as non-host (ggml-org#18967)

94242a6

While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes ggml-org#18848

opencl: add TRI op support (ggml-org#18979)

5516b9c

CUDA: add gqa_ratio 4 for GLM 4.7 flash (ggml-org#18953)

b70d251

server: Reorder methods in server-task.cpp (ggml-org#19016)

c6926d1

* Move `task_result_state::update_chat_msg` to match with header * Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header --------- Co-authored-by: openingnow <>

mtmd : update docs to use llama_model_n_embd_inp (ggml-org#18999)

9eb5bfe

release: update github api (ggml-org#19022)

8b30840

quant : manual overrides of tensor types take precedence (ggml-org#18952

0e4ebeb

)

server: do not log certain endpoints (avoid log spam) (ggml-org#19028)

4e595b2

convert_hf_to_gguf.py: refactor modify_tensors to call super (ggml-or…

e34d6d0

…g#18866)

CUDA: fix alignment check for FA (ggml-org#19023)

e2baf02

server : support preserving reasoning_content in assistant message (g…

51fa458

…gml-org#18994) * support reasoning_content input * report template caps to webui * add docs * rm commented code

cli : load parser definition (ggml-org#19031)

a3e8128

* cli : load parser definition * cont : only unload if a parser is defined

chat : fix translategemma crash on common_chat_format_example (ggml-o…

b5b8fa1

…rg#19019)

[SYCL] use malloc to support both iGPU and dGPU in same time (ggml-or…

cb6caca

…g#18992) * use malloc to support both iGPU and dGPU in same time * support windows --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

ggml-cuda: enable cuda-graphs for n-cpu-moe (ggml-org#18934)

81ab64f

* ggml-cuda: add split-wise cuda graph * add n-cpu-moe compare_llama_bench.py * fix hip/musa builds

Alcpz and others added 7 commits February 13, 2026 20:32

Fix wrong memcpy length for block_interleave == 4 (ggml-org#19575)

cc2aa81

vulkan: restore -inf check in FA shaders (ggml-org#19582)

05a6f0e

github : add missing backends to issue templates (ggml-org#19603)

94a602d

feat:upgrade to 0ccbfde

bf45430

github-actions bot added documentation Improvements or additions to documentation Apple Metal SYCL Nvidia GPU Vulkan IBM zDNN testing build examples devops python script server ggml nix Ascend NPU OpenCL android model jinja parser labels Feb 14, 2026

dindinw merged commit 6adc0b9 into Qitmeer:master Feb 14, 2026
69 of 86 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat:upgrade to 0ccbfde#4

feat:upgrade to 0ccbfde#4
dindinw merged 1301 commits intoQitmeer:masterfrom
lochjin:master

lochjin commented Feb 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

lochjin commented Feb 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants