Weekly GitHub Report for Llama.cpp: February 01, 2026 - February 08, 2026 (15:58:16)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and user experience, focusing on improved stability and new feature integrations. Notable highlights include streamlined workflows and expanded compatibility with emerging technologies.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG-UNCONFIRMED] Eval bug: Llama.cpp crashes when running Qwen Next 80B Coder: This issue reports a crash in llama.cpp when running the Qwen3 Coder Next 80B model with very long context lengths (around 80K tokens) and multiple tool calls, resulting in a runtime error related to an unexpected empty grammar stack during token acceptance. The problem appears reproducible across different hardware and backends (CUDA, ROCm, Vulkan) and is specifically triggered by the combination of long context and tool usage, with some users confirming that a particular branch (autoparser) mitigates the crash for certain configurations (a simplified sketch of the failing check appears after this list).
- Multiple users confirm the crash occurs consistently with Qwen3 Coder Next models under long context and tool call conditions, sharing detailed stack traces and logs; attempts to reproduce vary, but the issue is widespread across platforms and backends. A suggested fix on the autoparser branch shows promise for some, while others report related but distinct errors with newer Codex versions or different setups.
- Number of comments this week: 24
- [BUG-UNCONFIRMED] Eval bug: Llama.cpp 40% slower than VLLM + high CPU usage when running Qwen Coder Next: This issue reports that when running the Qwen 3 Coder Next model at FP8 precision, the llama.cpp implementation is approximately 40% slower than VLLM and exhibits unusually high CPU usage, reaching 100%. The user provides detailed logs, benchmarks, and comparisons across different backends and hardware, highlighting that llama.cpp’s CUDA backend may lack certain optimizations (such as for unified KV cache), and that the Vulkan backend shows better performance; suggestions include disabling unified KV cache and simplifying the GGML graph to improve speed and reduce CPU load.
- The comments discuss profiling results pinpointing CUDA kernel overhead, GPU utilization comparisons, benchmarking across models and backends, confirmation that the correct model variant is used, and attempts to tweak launch parameters and cache settings; overall, the interaction converges on the need for further optimization in llama.cpp’s CUDA backend, especially regarding unified KV cache and graph simplification, while Vulkan backend shows promising improvements.
- Number of comments this week: 18
- [BUG-UNCONFIRMED] Eval bug: GLM 4.7 Flash not working with flash attention on: This issue describes a problem where the GLM 4.7 Flash model fails to work correctly with flash attention enabled when quantization of the key-value cache is turned on, causing server crashes due to an assertion failure related to GPU block configuration. The user reports that disabling flash attention avoids the crash but results in slower performance, and through discussion it is revealed that the problem is likely tied to specific ROCm versions, with ROCm 7.1.1 causing the issue and downgrading to ROCm 7.0.1 resolving it.
- The comments confirm the issue occurs on AMD GPUs with ROCm 7.1.1 but not with ROCm 7.0, with multiple users reproducing the crash and sharing logs; some users note that using different hardware setups or downgrading ROCm fixes the problem, and there is ongoing discussion about performance trade-offs and related issues.
- Number of comments this week: 11
- [BUG-UNCONFIRMED] Eval bug: llama.cpp server crashed when running QWen3-Coder-Next GGUF model on Ubuntu 24.04 on Strix Halo: This issue reports a reproducible crash of the llama.cpp server when running the QWen3-Coder-Next GGUF model on Ubuntu 24.04 with an AMD Ryzen AI MAX+ 395 GPU using ROCm 7.2. The crash occurs after processing a large number of tokens and is caused by an unexpected empty grammar stack error during token acceptance, leading to a std::runtime_error and core dump.
- The comments discuss similar crashes on different hardware and configurations, confirm the issue is reproducible, share details about ROCm versions and compilation methods, and note that some users do not experience the crash under certain conditions; the issue is suggested to be a duplicate of #19304.
- Number of comments this week: 11
- [BUG] [AMD GPU] [3RD PARTY] Compile bug: ROCm 7.2 + rocwmma: This issue reports a compilation error encountered when building llama.cpp with ROCm 7.2 and rocwmma support on Ubuntu 24.04, specifically related to ambiguous partial specializations in rocwmma headers causing build failures. The user notes that earlier ROCm versions 6.4 and 7.1 do not have this problem, and the error appears to be linked to an upstream bug triggered by LLVM 21+ affecting rocwmma.
- The comments clarify that the issue is reproducible on certain AMD GPUs and ROCm 7.2, with users confirming the error and sharing environment details including rocwmma version 2.2.0; it is identified as an upstream bug in the ROCm libraries triggered by LLVM 21+, and some users report successful builds on different hardware or with adjusted build targets.
- Number of comments this week: 7
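For readers unfamiliar with the grammar-stack crashes described in the first issue above, the sketch below illustrates the general shape of the failing condition: a token is accepted against a grammar whose rule stack has already been fully consumed. This is a hypothetical, simplified illustration; the struct and function names are invented and do not correspond to llama.cpp's actual grammar code.

```cpp
#include <cstdint>
#include <stack>

// Hypothetical, simplified types; not llama.cpp's actual grammar internals.
struct grammar_state {
    std::stack<int> rule_stack; // partially matched grammar rules
};

// Accept a sampled token against the grammar. Returning false (instead of
// throwing) when the stack is already empty is the kind of defensive check
// that keeps a server alive rather than aborting with std::runtime_error.
bool accept_token(grammar_state & state, int32_t token_id) {
    if (state.rule_stack.empty()) {
        return false; // grammar already fully consumed; reject gracefully
    }
    // ... normal matching of token_id against the top rule would go here ...
    (void) token_id;
    state.rule_stack.pop();
    return true;
}
```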
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 46
Summarized Issues:
- Model loading and startup errors: Multiple issues report problems with loading models or starting the llama-server, including missing model files, memory errors, and backend incompatibilities. These errors cause crashes, assertion failures, or incorrect outputs, often related to specific hardware setups or quantization options.
- Performance anomalies and regressions: Several reports highlight unexpected performance issues such as slower inference speeds, high CPU usage, or regressions in output quality and speed across different models and hardware. These include slower token generation on Apple Silicon, slower FP8 precision runs, and deteriorated long-context output quality after specific commits.
- Backend and hardware compatibility issues: Problems arise with Vulkan, ROCm, SYCL, Metal, and CUDA backends on various GPUs and integrated graphics, causing crashes, corrupted output, or unsupported features. These include GPU resets due to driver timeouts, flash attention incompatibilities, and missing int8 support in Metal.
- Memory and resource management problems: Issues include high GPU load despite sleep mode, subprocesses not terminating and holding VRAM, and shared libraries installed in incorrect directories causing loading failures. These problems lead to inefficient resource usage and require manual intervention or workarounds.
- Model and feature requests: Users request support for new models such as STEP3-VL-10B, GLM-OCR, and Yuan3.0-Flash-4bit, as well as features like distributed parallel processing, search functionality for Anthropic API, cache snapshotting, and automatic model downloading in router mode. These aim to expand capabilities and improve usability.
- issues/19258, issues/19259, issues/19277, issues/19298, issues/19335, issues/19342, [issues/19351](https://github.com/issues/19351)
- Crashes related to grammar stack and token processing: Multiple issues report crashes caused by unexpected empty grammar stacks during token acceptance, especially with Qwen3 Coder Next models on ROCm and other backends. These crashes cause server aborts and core dumps during long context or tool call processing.
- issues/19304, issues/19355, [issues/19421](https://github.com/issues/19421)
- Output correctness and formatting bugs: Problems include corrupted or gibberish output with flash attention enabled, invalid JSON output with duplicate fields, and garbled CJK characters due to JSON serialization settings. These issues affect usability and require fixes in serialization or output handling.
- issues/19276, issues/19336, issues/19382, [issues/19391](https://github.com/issues/19391)
- Concurrency and race conditions: ThreadSanitizer detected race conditions in CI tests involving concurrent access to shared data structures during multi-threaded operations, indicating potential stability and correctness issues under parallel workloads.
- Tokenization and embedding discrepancies: Research and reports highlight challenges in GPU-accelerated BPE tokenization on Vulkan and significant mismatches in embedding results between llama_cpp and Huggingface models, indicating areas needing further investigation and optimization.
- issues/19401, [issues/19410](https://github.com/issues/19410)
- Security vulnerability: A remote Denial of Service (DoS) vulnerability exists due to unvalidated floating-point sampling parameters in the completion API, allowing attackers to crash or hang the server by sending extreme values (a minimal validation sketch appears after this list).
- [issues/19367](https://github.com/issues/19367)
- Compilation and build errors: Compilation fails on Ubuntu 24.04 with ROCm 7.2 due to ambiguous template specializations in rocwmma headers, preventing successful builds with HIP support.
- [issues/19269](https://github.com/issues/19269)
- Command line and tool usability issues: The llama-cli tool no longer supports shell redirection for output files, and attempts to use the -o flag result in errors, reducing usability for saving outputs.
- [issues/19256](https://github.com/issues/19256)
- Model conversion and compatibility: Requests to update converter tools to support new architectures like YuanForCausalLM are made to enable running custom multimodal foundation models requiring trust_remote_code=True.
- [issues/19342](https://github.com/issues/19342)
- Docker and GPU offloading issues: The server-cuda13 Docker image fails to offload processing to the GPU unlike server-cuda12, causing unexpected performance and resource usage differences under Kubernetes.
- [issues/19279](https://github.com/issues/19279)
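As a companion to the sampling-parameter DoS item above, here is a minimal sketch of the kind of server-side validation that mitigates such reports: reject non-finite values and clamp the rest before they reach the sampler. The struct, field names, and ranges are assumptions for illustration and are not taken from llama.cpp's completion endpoint.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical request struct; field names and defaults are assumptions.
struct sampling_request {
    float temperature = 0.8f;
    float top_p       = 0.95f;
    float min_p       = 0.05f;
};

// Reject non-finite values and clamp the rest to sane ranges before they
// reach the sampler, where extreme values can cause hangs or crashes.
void validate_sampling_params(sampling_request & req) {
    const auto check_finite = [](float v, const char * name) {
        if (!std::isfinite(v)) {
            throw std::invalid_argument(std::string(name) + " must be a finite number");
        }
    };
    check_finite(req.temperature, "temperature");
    check_finite(req.top_p,       "top_p");
    check_finite(req.min_p,       "min_p");

    req.temperature = std::clamp(req.temperature, 0.0f, 10.0f); // assumed upper bound
    req.top_p       = std::clamp(req.top_p,       0.0f, 1.0f);
    req.min_p       = std::clamp(req.min_p,       0.0f, 1.0f);
}
```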
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 18
Summarized Issues:
- Template and Tool Integration Errors: Multiple issues describe problems related to template parsing and tool integration, including unknown tests causing server errors, missing tool descriptions leading to fallback behavior, and tool type validation errors breaking integrations. These issues result in server crashes, incorrect tool handling, and degraded user experience when using models like Qwen3-Coder and llamacpp backend.
- issues/19004, issues/19009, issues/19343
- Qwen3-Coder-Next Model Crashes and Bugs: Several reports highlight crashes and bugs specific to the Qwen3-Coder-Next model, including decoding failures due to inconsistent token positions, syntax error output problems caused by chunking logic bugs, and forced full prompt re-processing due to cache issues. These problems lead to invalid input errors, poor output quality, and practically unusable tools in certain scenarios.
- issues/19267, issues/19305, issues/19394
- Inference Crashes and Runtime Errors: There are multiple crash reports during inference or evaluation, including runtime errors from empty grammar stacks, crashes in grammar token acceptance, and invalid input batch errors triggered by specific prompts. These crashes cause server termination and require workarounds such as disabling certain flags.
- issues/19329, issues/19350, issues/19353
- Performance and Hardware Issues: Performance regressions and hardware-specific problems are noted, such as increased memory usage and degraded performance on Intel integrated GPUs due to faulty GPU detection, and tuning decode kernel parameters on Apple M5 hardware to improve performance. These issues affect efficiency and resource utilization during model execution.
- issues/19221, issues/19303
- Build and Compilation Failures: Several issues report build errors on different platforms, including MacOS Tahoe and Android Termux, caused by system header incompatibilities and missing files. These failures prevent successful compilation and require platform-specific fixes.
- issues/19385, issues/19388
- Feature Requests for Model Support and Router Behavior: Requests include adding support for the StepFun Step 3.5 model architecture and preventing automatic model loading in router mode when accessing metrics endpoints. These features aim to expand model compatibility and improve operational control.
- issues/19311, issues/19346
- Template Parameter Passing Issues: One issue describes the failure of the translategemma template to receive necessary language code parameters due to attribute dropping during message processing, causing translation failures. This impacts the correct application of language codes in translation workflows.
- issues/19295
- Web UI Message Editing Bug: A bug in the Web UI causes all subsequent messages to disappear when editing a generation message without branching, resulting in loss of message history. This affects user interaction and message management.
- issues/19300
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 41
Key Open Pull Requests
1. Notebook page on webui: This pull request adds a Notebook page to the llama.cpp web UI that enables raw text generation using new /completion and /tokenize API calls. It features continuing and aborting generation, rudimentary undo/redo, keyboard shortcuts, auto-scroll, generation statistics, and safeguards against accidental data loss, all tested across devices and themes, while reusing existing components for model selection and error handling.
- URL: pull/19339
- Associated Commits: 6d967, 3af9b, c9f98, ff2f0, e80ba, 8a711, 301c3, 11e3c, f041a, 7892b, 3657a, fb209, f42d8, 9dc75, 210dc, 03077, 9cf47, f20b1, ad3b8, efd27, 251ba, 393fa, 031e4, 77dc9, a5744, 4659a, a0c5c
2. sampling : blue noise rng: This pull request implements a new blue noise shaped random number generator as an optional replacement for the default white noise mt19937 RNG in the sampling distribution step, adding it as a CLI option and request parameter to improve sampling quality by leveraging blue noise’s natural error diffusion properties to reduce repetition and rambling in generated text outputs (a generic illustration of blue-noise-style sampling appears after this list).
- URL: pull/19409
- Associated Commits: e856c, b2ee2, f2715, 3b406, 766d8, ad731, d5def, e829f, 6f7f6, 267cd, 1b1b2, 15ade, a0323, 23b5a
3. [WIP] ggml-hexagon: convert f32 to f16 - fa opt part3: This pull request focuses on optimizing the ggml-hexagon backend by implementing an in-place conversion of float32 to float16 data to eliminate redundant conversions, introducing new fused multiply-add (MAD) compute primitives for FP16 inputs with separate scaling factors, and streamlining flash attention operations for improved performance and clarity.
- URL: pull/19282
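For context on the blue-noise sampling pull request above, the sketch below shows one classic, generic way to produce blue-noise-distributed values in [0, 1): best-candidate selection, which keeps successive draws well spaced. It is an illustration of the concept only and is not the algorithm implemented in the pull request; the class and constant names are invented.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <random>
#include <vector>

// Best-candidate ("blue noise"-like) generator of uniform values in [0, 1).
// Each draw picks, out of several candidates, the one farthest from recent
// history, so consecutive draws avoid clumping. Illustration only.
class blue_noise_rng {
    std::mt19937 gen;
    std::uniform_real_distribution<float> dist{0.0f, 1.0f};
    std::vector<float> history;                    // recently emitted values
    static constexpr std::size_t k_candidates = 8;
    static constexpr std::size_t k_history    = 16;

    // Distance with wrap-around so 0.99 and 0.01 count as close.
    static float wrapped_dist(float a, float b) {
        const float d = std::fabs(a - b);
        return std::min(d, 1.0f - d);
    }

public:
    explicit blue_noise_rng(uint32_t seed) : gen(seed) {}

    float next() {
        float best       = dist(gen);
        float best_score = -1.0f;
        for (std::size_t c = 0; c < k_candidates; ++c) {
            const float cand  = dist(gen);
            float       score = 1.0f;          // min distance to recent history
            for (const float h : history) {
                score = std::min(score, wrapped_dist(cand, h));
            }
            if (score > best_score) {
                best_score = score;
                best       = cand;
            }
        }
        history.push_back(best);
        if (history.size() > k_history) {
            history.erase(history.begin());
        }
        return best; // drop-in for a plain uniform draw when inverting the token CDF
    }
};

int main() {
    blue_noise_rng rng(42);
    for (int i = 0; i < 8; ++i) {
        std::printf("%.3f\n", rng.next());
    }
    return 0;
}
```

The intuition, as described in the pull request summary above, is that draws which spread out over time are less likely to keep landing in the same region of the token distribution on consecutive steps.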
Other Open Pull Requests
- ARM64 GEMM and GEMV Optimizations: Multiple pull requests focus on implementing and optimizing GEMM and GEMV operations for ARM64 CPUs with various quantizations like q5_K and Q6_K. These changes deliver significant performance improvements on devices such as Apple M4 Max and Raspberry Pi 5 while maintaining correctness verified through perplexity and output consistency tests.
- Backend Enhancements and Multi-Stream Support: Several pull requests introduce new backend features including multi-stream support for the CANN backend, Hexagon backend operator additions and optimizations, and the GGML-VirtGPU backend documentation. These updates improve parallel execution, operator coverage, and usability across different hardware platforms.
- Memory Management and Leak Fixes: A pull request addresses memory leaks in the WebGPU implementation by properly destroying buffers and converting heap allocations to smart pointers. This ensures all allocated resources are freed on shutdown to prevent heap memory leaks.
- Tensor Parallelism and Multi-GPU Support: One pull request introduces backend-agnostic tensor parallelism with a meta backend that orchestrates multiple simple backends for distributed tensor operations. It extends backend interfaces with asynchronous tensor copy and AllReduce functions, enabling multi-GPU setups with ongoing optimizations and some current limitations.
- Quantization and Kernel Support for SME2 and INT2: A pull request adds support for INT2 quantization and KleidiAI SME2 gemm/gemv kernels targeting SME2-capable devices like MacBook M4 and Vivo X300, with plans for future NEON kernel support.
- N-gram Caching and Statistics Updates: Updates to the n-gram caching mechanism introduce score-based pruning to improve processing speed by pruning low-scored n-grams while retaining useful ones (see the generic pruning sketch after this list). Follow-up changes remove a configuration parameter, rename statistics variables, add new counters, and disable the key-map-stats feature.
- Rope Functionality Fixes in CUDA Backend: A pull request ports fixes from the CPU implementation to the CUDA backend to correct broken rope logic, including variable renaming and addressing multiple rope-related components for consistency and proper memory layout.
- Server and API Enhancements: Several pull requests add features to the server such as hiding models from the WebUI in router mode, adding OpenAI and Anthropic compatible prompt token cache information in API responses, and updating the compare-logprobs corpus to use server documentation for improved accuracy.
- OpenCL Improvements and Operator Refactoring: Pull requests introduce general implementations of Q6_K and Q4_K matrix operations in OpenCL and refactor EXPM1 and Softplus OpenCL operators to improve code clarity and reduce duplication.
- Platform-Specific Performance Optimizations: Optimizations include SIMD integration for ggml_vec_dot_bf16 on s390x platforms yielding over 83% speedup, and Hexagon backend kernel tuning for parallel 2x2 dot product computations resulting in over 10x prompt processing speed improvements.
- Compilation and Compatibility Updates: Updates add compiler flags to enable llama.cpp compilation in UWP environments, update ROCm docker container and CI to version 7.2 with hardware support changes, and fix segmentation faults on Intel CPUs by disabling bfloat16 by default.
- Vulkan Backend Fixes: A pull request fixes the Vulkan implementation of ggml_acc to work correctly in 4D (previously only 3D) and adds a test case to verify the fix.
- MCP Support in llama-cli: A pull request introduces minimal Model Context Protocol (MCP) support in the llama-cli tool using subprocesses to pass JSON configs and disable approval queries, serving as a proof of concept without full production features like memory or session management.
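To make the score-based n-gram pruning item above concrete, here is a generic sketch of pruning a cache by acceptance ratio. The key type, entry fields, scoring rule, and threshold are assumptions for illustration, not the data structures used in llama.cpp.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ngram_entry {
    std::vector<int32_t> continuation; // tokens predicted to follow the n-gram
    int32_t hits   = 0;                // times this entry produced accepted drafts
    int32_t misses = 0;                // times it was tried and rejected
};

using ngram_key = uint64_t;            // e.g. a hash of the n-gram's tokens

// Drop entries whose acceptance ratio falls below a threshold, keeping the
// cache small without discarding n-grams that still speculate well.
void prune_ngram_cache(std::unordered_map<ngram_key, ngram_entry> & cache,
                       float min_score) {
    for (auto it = cache.begin(); it != cache.end(); ) {
        const auto &  e     = it->second;
        const int32_t total = e.hits + e.misses;
        const float   score = total > 0 ? float(e.hits) / float(total) : 0.0f;
        if (total > 0 && score < min_score) {
            it = cache.erase(it);
        } else {
            ++it;
        }
    }
}
```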
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 85
Key Closed Pull Requests
1. Support Step3.5-Flash: This pull request adds comprehensive support for the Step3.5-Flash model architecture to the project, including improvements such as simplified GGUF conversion, adjustments for normalization, disabling KV shifting for the new architecture, and various code refinements based on review feedback.
- URL: pull/19283
- Associated Commits: 02092, 5c7c6, d9d74, 34a4d, 60ab1, 0293d, ff62b, 512a7, 19cff, f7ca9, aea96, 4e6e2, a7e96, 46e84, f542d, 430da, 402fc, 5737b
2. ggml-virtgpu: make the code thread safe: This pull request improves the ggml-virtgpu backend by making the code thread safe through the use of mutexes for accessing shared memory buffers, pre-caching constant backend values during initialization, and deprecating the unused buffer_type_is_host method, along with commits addressing error handling and code cleanup (a generic sketch of the locking pattern appears after this list).
- URL: pull/19204
3. vendor : add missing llama_add_compile_flags: This pull request ensures that the httplib and boringssl/libressl components are built with the appropriate sanitizer compile flags by adding the missing llama_add_compile_flags and adjusting CMake configurations to fix CI issues and manage compiler warnings effectively.
- URL: pull/19322
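The thread-safety pull request above follows a common pattern: guard a shared buffer table with a mutex and cache immutable device properties once at initialization. The sketch below illustrates that pattern with invented names; it is not the ggml-virtgpu code.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct virt_buffer {
    uint64_t host_handle = 0;
    size_t   size        = 0;
};

class virt_device {
    std::mutex mtx;                                   // guards buffers
    std::unordered_map<uint64_t, virt_buffer> buffers;

    // Values that never change after init, cached so hot paths don't need
    // to query the (virtual) device again.
    size_t cached_max_alloc = 0;

public:
    void init(size_t max_alloc) {
        cached_max_alloc = max_alloc;                 // pre-cache constant value
    }

    size_t max_alloc() const { return cached_max_alloc; }

    void register_buffer(uint64_t id, virt_buffer buf) {
        std::lock_guard<std::mutex> lock(mtx);        // serialize table updates
        buffers[id] = buf;
    }

    bool lookup_buffer(uint64_t id, virt_buffer & out) {
        std::lock_guard<std::mutex> lock(mtx);
        auto it = buffers.find(id);
        if (it == buffers.end()) {
            return false;
        }
        out = it->second;
        return true;
    }
};
```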
Other Closed Pull Requests
- Performance improvements in matrix operations and backend optimizations: Multiple pull requests focus on enhancing performance through various means, including improving CUDA performance for small batch sizes using the mmvq method, refactoring OpenCL operations for faster evaluation, and optimizing Metal kernel implementations. These changes collectively result in significant speedups and better resource handling across different hardware and backends.
- pull/18958, pull/19226, pull/19390
- Memory and concurrency management enhancements: Some pull requests address concurrency and memory issues by removing unnecessary mutexes in pipeline caches due to per-thread contexts, fixing memory-related regex errors in MSVC builds, and improving Vulkan specialization constants for better compiler optimization. These updates improve stability and performance in multi-threaded and cross-platform environments.
- pull/19195, pull/19340, pull/19309
- New features and model support: Several pull requests introduce new capabilities such as a manual thinking budget for reasoning models with related CLI options, a unified delta net implementation compatible with Kimi Linear, and integration attempts of an updated Step3.5 model from an external fork. These additions expand the functionality and model compatibility of the project.
- pull/19358, pull/19125, pull/19271
- Mixture of Experts (MoE) model support: One pull request adds a new operator, GGML_OP_MOE_SUM, to efficiently sum outputs from multiple experts in MoE models, featuring optimized CPU and CUDA implementations with fallback and benchmarking. This improves performance and flexibility for MoE architectures (a naive reference sketch of the operation appears after this list).
- pull/19362
- Backend and shader improvements: Enhancements to the ggml-webgpu backend include just-in-time compilation for binary operators and shader generation that correctly handles overlapping buffer bindings, ensuring proper execution of binary operations. These changes improve the robustness and efficiency of the WebGPU backend.
- pull/19310
- Testing and bug fixes in caching and rope operations: Tests were added for non-contiguous and inplace rope operations in the KV cache, addressing tensor contiguity issues and noting Vulkan backend limitations. Additionally, fixes were made to ngram-map and ngram-simple spec methods to improve message regeneration and conversation branching.
- pull/19296, pull/19253, pull/19261
- Build and CI improvements: Adjustments to the continuous integration process include reducing the number of jobs when building with sanitizers to prevent random terminations and increasing the maximum allowed CMake version to fix build failures on Windows Snapdragon devices without breaking other platforms.
- pull/19411, pull/19188
- Documentation and usability enhancements: Improvements include clearer formatting of the llama-quantize --help output, documentation updates for the ngram-mod specification and reasoning budget usage, and housekeeping changes such as copyright year bumps and URL replacements.
- pull/19317, pull/19252
- Data type and operation support expansions: Support for the F16 data type was added to the GGML_OP_CEIL operation in the SYCL backend, aligning with CPU semantics and passing functional tests. This extends the range of supported data types for element-wise operations.
- pull/19306
- Dependency updates: The sentencepiece dependency in the gguf-py package was updated to a newer version to resolve issues and improve compatibility, especially for users relying on conda-forge.
- pull/19319
- Miscellaneous improvements: Other changes include improving the readability of debug print functions, simplifying embedding processing by removing unnecessary loops and clamps, and adding support for virtual Metal devices to simulate multi-GPU environments on Mac.
- pull/19331, pull/19286, pull/18919
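For the GGML_OP_MOE_SUM item above, the following naive CPU loop illustrates the operator's semantics of collapsing per-expert outputs into one row per token. The memory layout and the assumption that expert outputs are already weighted are illustrative guesses, not details taken from the pull request or the ggml kernel itself.

```cpp
#include <cstddef>
#include <vector>

// src is laid out as [n_tokens][n_expert][n_embd]; dst as [n_tokens][n_embd].
void moe_sum_ref(const std::vector<float> & src, std::vector<float> & dst,
                 size_t n_tokens, size_t n_expert, size_t n_embd) {
    dst.assign(n_tokens * n_embd, 0.0f);
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t e = 0; e < n_expert; ++e) {
            const float * row = src.data() + (t * n_expert + e) * n_embd;
            float *       out = dst.data() + t * n_embd;
            for (size_t i = 0; i < n_embd; ++i) {
                out[i] += row[i];   // expert outputs assumed already weighted upstream
            }
        }
    }
}
```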
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 110 | 25 | 0 | 20 |
| CISC | 65 | 5 | 0 | 48 |
| ngxson | 69 | 5 | 0 | 24 |
| pwilkin | 35 | 3 | 2 | 21 |
| am17an | 43 | 5 | 0 | 2 |
| max-krasnyansky | 47 | 3 | 0 | 0 |
| danbev | 38 | 8 | 0 | 1 |
| Alcpz | 42 | 2 | 0 | 1 |
| 0cc4m | 35 | 1 | 0 | 5 |
| nikhilJain17 | 33 | 2 | 0 | 4 |