Weekly GitHub Report for Llama.cpp: April 25, 2026 - May 02, 2026 (19:20:07)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces significant updates enhancing overall performance and user experience, with notable improvements in system stability and new feature integrations that streamline workflows. This release reflects a continued focus on optimizing functionality and addressing user feedback.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 17
Summarized Issues:
- Compilation and Build Issues: Several issues report problems during the compilation or build process of llama.cpp on different platforms and configurations. These include indefinite hangs during CUDA compiler checks on Linux, multiple compilation errors on Windows with the HIP backend, and failures converting MLX-fused models due to unrecognized tensor artifacts.
- issues/22383, issues/22390, issues/22431
- Backend Crashes and Performance Problems: Multiple backend-related issues cause crashes or poor performance during model execution. Crashes occur in the Vulkan backend after processing many tokens and in the Meta backend with tensor split mode on CUDA GPUs due to memory exhaustion. Additionally, the SYCL backend shows extremely poor performance on Battlemage hardware compared to Vulkan on Windows.
- issues/22404, issues/22413, issues/22425
- Model Output and Grammar Generation Bugs: There are issues with model output correctness and grammar-constrained generation. GBNF grammar-constrained generation produces degenerate loops or unconstrained output on larger MoE models starting from a specific commit, and the QWEN3.6MOE model outputs incorrect or garbled text on the macOS Vulkan backend. The Qwen 3.6 27B model also outputs unexpected tags that cause agentic workloads to stop.
- issues/22381, issues/22398, issues/22430
- API and Protocol Compatibility Issues: The llama.cpp server does not support the new OpenAI Responses API tool call format requiring a "type": "function" field, causing tool calls to fail with 400 errors on newer Codex versions. Additionally, evaluation fails to parse certain XML-like tool call commands on Linux CUDA backends, resulting in server errors. (A hedged sketch of the tool-call payload shape appears after this list.)
- issues/22389, issues/22422
- Feature Requests and Usability Improvements: There is a request to handle maximum context size errors on CPU by either trimming older tokens automatically or stopping generation safely with clear messaging. Also, a UI usability issue is reported where the sidebar toggle button's z-index causes it to overlap other interactive elements, suggesting UI restructuring or z-index adjustment.
- issues/22392, issues/22395
- Parameter Handling and Configuration Bugs: A bug in the llama-server causes ngram parameters to be ignored due to incorrect use of max and min functions, always setting values to 1024 instead of respecting user input.
- issues/22414
- Quantization and Perplexity Anomalies: Unexpected extreme increases in perplexity are observed when aggressively quantizing the Gemma 4 E4B base model, unlike the more gradual changes in the E2B model, raising questions about the expected behavior given the models' PLE mechanism.
- issues/22407
- Router Mode Endpoint Accessibility: The `/slots` endpoint is inaccessible when running llama-server in router mode, causing errors and limiting management of model slots such as saving and restoring context, which works correctly outside router mode.
- issues/22373
- Sampler Initialization Failures with JSON-Schema: Using the `--json-schema` flag with Gemma 4 models in `llama-cli` causes sampler initialization to fail with a `std::exception`, while generation works without the flag or with hand-written grammars, indicating issues in the JSON-Schema to grammar conversion or sampler interaction on Gemma 4.
- issues/22396
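For the Responses API compatibility issue above (issues/22389), the following is a minimal, hedged sketch of the kind of request fragment the report describes, where each tool entry carries a "type": "function" field. Only that field is taken from the issue summary; every other field name and value here is illustrative, not copied from the OpenAI specification or from llama.cpp.

```cpp
// Hypothetical illustration only: a tool definition fragment in the newer
// Responses-style shape that the report says the server rejected with HTTP 400.
// Only the "type": "function" field comes from the issue summary; the remaining
// fields are placeholders.
#include <iostream>
#include <string>

int main() {
    const std::string responses_style_tools = R"json(
    {
      "tools": [
        {
          "type": "function",
          "name": "get_weather",
          "description": "Look up the current weather",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } }
          }
        }
      ]
    })json";
    std::cout << responses_style_tools << std::endl;  // print the sketch
    return 0;
}
```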
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 10
Summarized Issues:
- GPU and Backend Crashes and Bugs: Several issues report crashes and bugs related to GPU usage and backend implementations. These include llama-server crashing with the "-sm tensor" option on NVIDIA GPUs, segmentation faults when running Qwen3.6 35B model on AMD ROCm GPUs with tensor splitting, and Vulkan backend incorrectly selecting memory heaps causing allocation failures.
- issues/22268, issues/22351, issues/22368
- Model and Tensor Conversion Failures: Problems with model conversion and tensor handling are highlighted, such as the `convert_hf_to_gguf.py` script failing to convert FP8 or NVFP4 Nemotron 3 Super models due to a ValueError in tensor mapping. Additionally, tensor parallelism on three or more GPUs causes infinite output streams in llama-server with Qwen3.6-35B-A3B models.
- issues/22346, issues/22391
- Performance Regressions and Improvements: Performance issues include a 32-39% regression in prompt processing on MoE models using the Vulkan RADV driver on Radeon 8060S GPUs, while token generation remains stable. Conversely, fixing the context checkpoint restore mechanism for hybrid and recurrent models significantly reduces prompt re-processing time from seconds to milliseconds.
- issues/22375, issues/22384
- Parsing and Content Handling Bugs: A bug in the Gemma 4 PEG parser causes premature truncation of assistant content at the `<|tool_call>` token due to incorrect delimiter usage, resulting in silent content loss during multi-turn tool calls.
- issues/22371
- Build and Compilation Issues: There is a compile bug reported with the Musa GGML backend, though details about the problem and logs are incomplete.
- issues/22416
- Windows Program Hanging: The test-chat-template.exe program hangs indefinitely on Windows when built with specific MSYS2 UCRT64 and GCC configurations after a certain commit, causing the test to complete but not exit properly and preventing interruption via Ctrl-C.
- issues/22142
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 7
Key Open Pull Requests
1. Wip/deepseek v4 support: This pull request introduces comprehensive support for DeepSeek V4 in the llama.cpp project, including GGUF conversion, native FP4 and FP8 quantization, runtime graph and memory management, CUDA performance optimizations, fused kernel tuning, and enhanced expert routing with fast top-k operations.
- URL: pull/22378
- Associated Commits: afa35, 77f42, c3b9f, 97517, 172df, 9805e, 4eee9, c4268, c9dd6, d9a1f, ba173, 48669
2. ggml-cpu: fuse RMS_NORM + MUL on CPU backend: This pull request introduces a fused RMS_NORM plus MUL kernel on the CPU backend that computes the combined operation in a single pass to eliminate intermediate result materialization and significantly improve performance, along with enhancements to the test-backend-ops benchmarking framework to support accurate multi-operation performance measurement. (A minimal single-pass illustration appears after these key pull requests.)
- URL: pull/22423
3. Windows: raise stdio limit for loading many GGUF shards: This pull request raises the maximum number of standard I/O handles on Windows at startup to 2048 by calling _setmaxstdio(), enabling llama.cpp to load models split across many .gguf shards without hitting the default Windows/MSVC file handle limit and thereby preventing shard-loading failures for heavily sharded models. (A short sketch of the _setmaxstdio() pattern also appears after these key pull requests.)
- URL: pull/22385
- Associated Commits: d51f6
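As a point of reference for the fused RMS_NORM + MUL pull request above (pull/22423), here is a minimal single-threaded sketch of what fusing the two operations into one pass means. It is not the kernel from the PR; the epsilon value, function name, and shapes are assumptions made for illustration.

```cpp
// Fused RMS_NORM + MUL sketch: normalize x by its root-mean-square and scale by w
// in a single pass, without materializing the normalized intermediate buffer.
#include <cmath>
#include <cstddef>
#include <vector>

void rms_norm_mul_fused(const float * x, const float * w, float * out,
                        std::size_t n, float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / n + eps);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (x[i] * scale) * w[i];   // normalize and multiply in one pass
    }
}

int main() {
    std::vector<float> x   = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> w   = {0.5f, 0.5f, 0.5f, 0.5f};
    std::vector<float> out(x.size());
    rms_norm_mul_fused(x.data(), w.data(), out.data(), x.size());
    return 0;
}
```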
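The Windows stdio-limit pull request above (pull/22385) centers on the MSVC CRT call _setmaxstdio(). Below is a minimal sketch of that pattern, guarded for non-Windows builds; the 2048 value comes from the PR description, while the function name and error handling around it are illustrative.

```cpp
// Sketch of raising the CRT stdio handle limit on Windows so that many .gguf
// shards can be opened at once. _setmaxstdio() is an MSVC CRT function (default
// limit is 512 FILE* streams); on other platforms this sketch is a no-op.
#include <cstdio>

void raise_stdio_limit_for_shards() {
#ifdef _WIN32
    if (_setmaxstdio(2048) == -1) {   // returns -1 on failure
        std::fprintf(stderr, "warning: failed to raise stdio handle limit\n");
    }
#endif
}

int main() {
    raise_stdio_limit_for_shards();
    return 0;
}
```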
Other Open Pull Requests
- API Enhancements for Runtime Configuration: This set of pull requests introduces new getter and setter methods for the `slot_prompt_similarity` field in the `server_context` API, allowing runtime querying and modification of the slot-selection similarity threshold. These changes restore previous embedder behavior and maintain API parity, improving flexibility for users.
- pull/22393
- Speculative Decoding Improvements: These pull requests add the ability to partially roll back speculative decoding in GDN models by storing intermediate states up to a specified draft maximum, which reduces redundant computation after rejected draft tokens. The feature currently supports CPU and CUDA only and introduces breaking changes to the GDN API.
- pull/22400
- Backend Support for Layer Normalization: This pull request adds support for layer normalization operations to the ggml-webgpu backend, including a shader implementation that uses workgroup barriers. All test cases pass with no observed performance regressions, ensuring stable integration.
- pull/22406
- Speculative N-gram Parameter Clamping Fixes: This pull request fixes the clamping logic for speculative n-gram request parameters in `llama-server` to properly limit user-provided values within [1, 1024]. It preserves valid settings instead of forcing them to 1024 and adds regression tests to verify the correct clamping behavior. (A minimal sketch of the corrected clamping appears after this list.)
- pull/22432
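The n-gram clamping fix above (pull/22432; see also issues/22414 in section 2.3) comes down to ordering the min/max bounds correctly. Here is a minimal sketch of the corrected behavior using std::clamp; the function and parameter names are placeholders, not the server's actual identifiers.

```cpp
// Sketch of clamping a user-supplied speculative n-gram parameter into [1, 1024].
// The bug described in issues/22414 effectively ignored the user's value and always
// produced 1024; clamping preserves values that are already inside the range.
#include <algorithm>
#include <cassert>

int clamp_ngram_param(int requested) {
    return std::clamp(requested, 1, 1024);   // equivalent to max(1, min(requested, 1024))
}

int main() {
    assert(clamp_ngram_param(16)    == 16);    // valid user value is preserved
    assert(clamp_ngram_param(0)     == 1);     // too small -> lower bound
    assert(clamp_ngram_param(99999) == 1024);  // too large -> upper bound
    return 0;
}
```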
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 49
Key Closed Pull Requests
1. webui: Server tools: This pull request introduces server tools integration into the web user interface, including features such as a /tools endpoint, built-in and JSON schema tools, UI improvements, and reorganized settings sections to enhance server management capabilities.
- URL: pull/21237
- Associated Commits: 8d0eb, 684ed, 155af, c800a, 62c8a, 44193, f4baf, 35076, 3994a, 7fc5b, bbb2b, 79999, 7c520, 94f7d, 5970f, 7eeee, ea5b7, b22ae, 4ddda, 9c922, 8c55e, c3520, 5acfc, cfd5a, 7a13b, ec630, 2d2ef, 8bf19, 156b9, b0749, ad9e9, 5468f, 6ec8a, c12c0, 8e557, 1dafe, c374e, d24e0
2. ggml-cuda: Repost of 21896: Blackwell native NVFP4 support: This pull request is a restored repost of a previously closed pull request that adds native NVFP4 support for Blackwell GPUs in the ggml-cuda backend, including kernel implementations, quantizer guards, and various refactorings to improve FP4 matrix multiplication and quantization functionality.
- URL: pull/22196
- Associated Commits: a0818, 9fb7e, 0bcf7, 4625a, 3ea6b, db595, 83b41, c3188, 78596, a6832, 6e31a, 58e27, 0e2c7, 72fc0, 7fcc8, 7c731, 6b26a, e34b6, 02df2, 92045, 667cc, 553c3, 0d9e0
3. hexagon: hmx flash attention: This pull request implements HMX-based flash attention for the Hexagon backend, introducing an FP16 exponential function to optimize performance despite some numerical loss, addresses multi-threading and pipeline improvements, and includes various bug fixes and refinements while noting an unresolved non-deterministic compilation issue that currently keeps the PR in draft status.
- URL: pull/22347
- Associated Commits: 29853, 35b2f, 3c770, 0b9b5, 8ae33, c82b0, a9cd7, 5ce4a, 4f42c, 3d1b4, a5a4d, ae78a, c9cb1, 3bda0, 1888d, 98f09, 4595f, 85dd8, 56ee0
Other Closed Pull Requests
- Logger and Resource Cleanup Fixes: Multiple pull requests address issues related to logger lifecycle and resource management to prevent crashes and hangs. These include intentionally leaking the logger instance on Windows to avoid DLL teardown issues, replacing dynamic vectors with static arrays in the logger on Linux, and adding explicit logger cleanup at the end of unit tests to resolve timing conflicts with asynchronous logger threads.
- Security and Access Control Enhancements: Pull requests introduce IP-based access control with CIDR support for the llama-server, allowing users to restrict HTTP server access via a new `--whitelist` CLI argument and environment variable. This middleware validation returns structured JSON errors for unauthorized requests, improving security without relying on external proxies or firewalls.
- Mixture of Experts (MoE) Pipeline and GPU Optimizations: A pull request redesigns the MoE pipeline with optimizations for the MxFP4 data type on Adreno GPUs, including router table reordering, pre-transposing expert weights, and separate kernels for prefill and decode phases. It maintains fallback to generic implementations for other GPU vendors to ensure broad compatibility.
- OpenCL Backend and Adreno GPU Support: Comprehensive support for the iq4_nl feature is added to the OpenCL backend, including general implementation and specific optimizations targeting Adreno GPUs. This enhances performance and compatibility for OpenCL workloads on these devices.
- Stability Fixes for CPU Implementations: To prevent segmentation faults on AIX systems, the tiled matrix multiplication path in the ggml-cpu sgemm implementation is disabled, falling back to the mnpack implementation for stable execution. This addresses issues caused by `vec_xst` operations near 4KB page boundaries.
- Dynamic Library Export Fixes: Missing exports in the `llama-common` dynamic library that caused linker errors during LTO builds are fixed by marking certain callback instantiations with `LLAMA_API`. This ensures proper symbol visibility and accessibility by other dynamic libraries.
- Parser and Chat Functionality Fixes: The parser's handling of spaces in reasoning markers within chat functionality is corrected by extracting the fix from a larger change set. This improves the robustness of chat message parsing.
- Scale Tensor Refactoring and FP8 Optimization: Handling of scale tensors is refactored by introducing reusable methods and removing the `input_scale` parameter specifically for dequantized FP8 model optimization tensors. This streamlines tensor management in the project.
- Model Addition and Conversion Script Updates: A pull request proposes adding a new model to the project's list and updates the `convert_hf_to_gguf.py` script accordingly, although it was not merged.
- DeepSeek-V4 GGUF Documentation: Comprehensive documentation is added for DeepSeek-V4 GGUF support, covering model conversion, metadata standards, quantization rules, and deployment practices. Existing documentation references are updated to ensure consistent and auditable handling of DeepSeek-V4-Pro models.
- Model Prefetching Feature: Support is introduced for proactively prefetching llama.cpp models from a preset file using a new `--prefetch` flag. This enables background prioritized downloads with cancellation and progress bar display to improve model availability and loading efficiency when starting the server.
- Router Multipart/Form-Data Forwarding Fix: The router is fixed to properly forward multipart/form-data to the model server by regenerating the multipart body. This enables correct use of the `/v1/audio/transcriptions` API in router mode.
- Flash-Attention Support for Mistral Small 4 Model: Flash-attention support is added for the Mistral Small 4 model with specific head sizes by introducing MMA-f16 and tile kernel configurations. This prevents fallback to CPU and significantly improves CUDA backend throughput.
- CMake Configuration for RISC-V SpacemiT Toolchain: The CMake configuration is updated to append the custom `xsmtvdotii` extension to the march string when `GGML_CPU_RISCV64_SPACEMIT` is enabled. This enables successful assembly of inline vmadot instructions and prevents build errors related to unrecognized opcodes.
- Gemma 4 Tool Call Parsing Fix: Ambiguity in the Gemma 4 multi-turn tool call parsing is resolved by changing the delimiter to `<|tool_call>call:`. This prevents premature termination of content when a literal `<|tool_call>` appears inside the content, and includes regression tests for verification.
- Upscale Shader Addition to ggml-webgpu: An upscale shader is added to the ggml-webgpu project implementing nearest, bilinear (with and without antialiasing), and bicubic interpolation methods with optional aligned_corner flags. Tests pass successfully though no performance comparisons are included.
- Additional Gemma4 Parsing Test Cases: Additional positive and negative test cases are added for parsing edge cases in the common/gemma4 module using a real model file. This complements a previous fix and ensures robustness.
- SVE-Optimized Quantized Matrix Multiplication Kernel: A Scalable Vector Extensions (SVE) optimized implementation of the `ggml_gemm_q8_0_4x8_q8_0()` kernel is added, improving LLM inference performance by about 20% on ARM Graviton3E processors while maintaining accuracy comparable to the NEON version.
- SPIR-V Header Detection Improvement: Detection of SPIR-V headers in `ggml-vulkan.cpp` is improved by introducing a `__has_include`-based mechanism that automatically selects the correct header path across diverse build environments. The original `_WIN32` platform-specific logic is preserved as a fallback, fixing build failures without affecting standard builds. (A minimal sketch of this pattern appears after this list.)
- GGUF Quantization Tag Regex Improvement: The GGUF quantization tag regex is improved to accurately identify tags with optional uppercase prefixes followed by hyphens, such as "UD-Q8_K_XL." This ensures consistent and correct parsing of model quantization tags during download. (A hedged regex sketch also appears after this list.)
- WebGPU Kernel Tuning Parameter Updates: Tuning parameters for WebGPU register tiling and subgroup matrix multiplication kernels are updated based on extensive performance data from multiple GPUs. This enhances average performance and reduces worst-case slowdowns by allowing independent kernel configurations and proposing new default tile sizes and workgroup dimensions.
- Backend and Device Registration Prevention: A change is introduced to prevent the registration of backends and devices that have already been registered, avoiding redundant processing in the ggml project.
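For the SPIR-V header detection change above, the following is a minimal sketch of the `__has_include` pattern with a `_WIN32` fallback, as described in the summary. The header paths shown are placeholders rather than the ones actually used in `ggml-vulkan.cpp`.

```cpp
// Sketch of preferring __has_include to pick a header path, falling back to the
// older platform-specific choice when the compiler lacks __has_include.
// The header names below are placeholders, not the actual ones in ggml-vulkan.cpp.
#if defined(__has_include)
#  if __has_include(<spirv/unified1/spirv.hpp>)
#    include <spirv/unified1/spirv.hpp>
#  elif __has_include(<spirv.hpp>)
#    include <spirv.hpp>
#  endif
#elif defined(_WIN32)
#  include <spirv.hpp>   // original platform-specific fallback
#endif

int main() { return 0; }
```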
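And for the quantization-tag regex item above, here is a hedged sketch of a pattern that accepts an optional uppercase prefix plus hyphen before a quant tag such as "UD-Q8_K_XL". The pattern itself is an assumption for illustration; the actual regex used by the downloader may differ.

```cpp
// Sketch of matching a GGUF quantization tag with an optional uppercase prefix
// followed by a hyphen, e.g. "Q8_K_XL" or "UD-Q8_K_XL". Illustrative only.
#include <cassert>
#include <regex>
#include <string>

bool looks_like_quant_tag(const std::string & s) {
    static const std::regex re(R"(^([A-Z]+-)?I?Q[0-9]+(_[A-Z0-9]+)*$)");
    return std::regex_match(s, re);
}

int main() {
    assert(looks_like_quant_tag("Q8_K_XL"));
    assert(looks_like_quant_tag("UD-Q8_K_XL"));
    assert(!looks_like_quant_tag("q8_k_xl"));   // lowercase is not matched by this sketch
    return 0;
}
```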
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| kainlan | 147 | 0 | 0 | 0 |
| TheTom | 98 | 1 | 0 | 0 |
| No author found | 90 | 0 | 0 | 0 |
| ggerganov | 77 | 4 | 0 | 0 |
| ngxson | 79 | 1 | 0 | 0 |
| max-krasnyansky | 79 | 1 | 0 | 0 |
| aldehir | 48 | 0 | 0 | 0 |
| michaelw9999 | 46 | 1 | 0 | 0 |
| gary149 | 41 | 0 | 0 | 0 |
| Constannnnnt | 36 | 3 | 0 | 0 |