Weekly GitHub Report for Llama.cpp: April 25, 2026 - May 02, 2026 (19:20:07)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces significant updates enhancing overall performance and user experience, with notable improvements in system stability and new feature integrations that streamline workflows. This release reflects a continued focus on optimizing functionality and addressing user feedback.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 17
Summarized Issues:
- Compilation and Build Issues: Several issues report problems during the compilation or build process of llama.cpp on different platforms and configurations. These include indefinite hangs during CUDA compiler checks on Linux, multiple compilation errors on Windows with the HIP backend, and failures converting MLX-fused models due to unrecognized tensor artifacts.
- issues/22383, issues/22390, issues/22431
- Backend Crashes and Performance Problems: Multiple backend-related issues cause crashes or poor performance during model execution. Crashes occur in the Vulkan backend after processing many tokens and in the Meta backend with tensor split mode on CUDA GPUs due to memory exhaustion. Additionally, the SYCL backend shows extremely poor performance on Battlemage hardware compared to Vulkan on Windows.
- issues/22404, issues/22413, issues/22425
- Model Output and Grammar Generation Bugs: There are issues with model output correctness and grammar-constrained generation. GBNF grammar-constrained generation produces degenerate loops or unconstrained output on larger MoE models starting from a specific commit, and the QWEN3.6MOE model outputs incorrect or garbled text on the macOS Vulkan backend. The Qwen 3.6 27B model also outputs unexpected tags that cause agentic workloads to stop.
- issues/22381, issues/22398, issues/22430
- API and Protocol Compatibility Issues: The llama.cpp server does not support the new OpenAI Responses API tool call format requiring a "type": "function" field, causing tool calls to fail with 400 errors on newer Codex versions. Additionally, evaluation fails to parse certain XML-like tool call commands on Linux CUDA backends, resulting in server errors. (A hedged sketch of the tool-call payload shape appears after this list.)
- issues/22389, issues/22422
- Feature Requests and Usability Improvements: There is a request to handle maximum context size errors on CPU by either trimming older tokens automatically or stopping generation safely with clear messaging. Also, a UI usability issue is reported where the sidebar toggle button's z-index causes it to overlap other interactive elements, suggesting UI restructuring or z-index adjustment.
- issues/22392, issues/22395
- Parameter Handling and Configuration Bugs: A bug in the llama-server causes ngram parameters to be ignored due to incorrect use of max and min functions, always setting values to 1024 instead of respecting user input.
- issues/22414
- Quantization and Perplexity Anomalies: Unexpected extreme increases in perplexity are observed when aggressively quantizing the Gemma 4 E4B base model, unlike the more gradual changes in the E2B model, raising questions about the expected behavior given the models' PLE mechanism.
- issues/22407
- Router Mode Endpoint Accessibility: The `/slots` endpoint is inaccessible when running llama-server in router mode, causing errors and limiting management of model slots such as saving and restoring context, which works correctly outside router mode.
- issues/22373
- Sampler Initialization Failures with JSON-Schema: Using the `--json-schema` flag with Gemma 4 models in `llama-cli` causes sampler initialization to fail with a `std::exception`, while generation works without the flag or with hand-written grammars, indicating issues in the JSON-Schema to grammar conversion or sampler interaction on Gemma 4.
- issues/22396
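For the Responses API compatibility issue above (issues/22389), the following is a minimal, hedged sketch of the kind of request fragment the report describes, where each tool entry carries a "type": "function" field. Only that field is taken from the issue summary; every other field name and value here is illustrative, not copied from the OpenAI specification or from llama.cpp.

```cpp
// Hypothetical illustration only: a tool definition fragment in the newer
// Responses-style shape that the report says the server rejected with HTTP 400.
// Only the "type": "function" field comes from the issue summary; the remaining
// fields are placeholders.
#include <iostream>
#include <string>

int main() {
    const std::string responses_style_tools = R"json(
    {
      "tools": [
        {
          "type": "function",
          "name": "get_weather",
          "description": "Look up the current weather",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } }
          }
        }
      ]
    })json";
    std::cout << responses_style_tools << std::endl;  // print the sketch
    return 0;
}
```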
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 10
Summarized Issues:
- GPU and Backend Crashes and Bugs: Several issues report crashes and bugs related to GPU usage and backend implementations. These include llama-server crashing with the "-sm tensor" option on NVIDIA GPUs, segmentation faults when running Qwen3.6 35B model on AMD ROCm GPUs with tensor splitting, and Vulkan backend incorrectly selecting memory heaps causing allocation failures.
- issues/22268, issues/22351, issues/22368
- Model and Tensor Conversion Failures: Problems with model conversion and tensor handling are highlighted, such as the `convert_hf_to_gguf.py` script failing to convert FP8 or NVFP4 Nemotron 3 Super models due to a ValueError in tensor mapping. Additionally, tensor parallelism on three or more GPUs causes infinite output streams in llama-server with Qwen3.6-35B-A3B models.
- issues/22346, issues/22391
- Performance Regressions and Improvements: Performance issues include a 32-39% regression in prompt processing on MoE models using the Vulkan RADV driver on Radeon 8060S GPUs, while token generation remains stable. Conversely, fixing the context checkpoint restore mechanism for hybrid and recurrent models significantly reduces prompt re-processing time from seconds to milliseconds.
- issues/22375, issues/22384
- Parsing and Content Handling Bugs: A bug in the Gemma 4 PEG parser causes premature truncation of assistant content at the `<|tool_call>` token due to incorrect delimiter usage, resulting in silent content loss during multi-turn tool calls.
- issues/22371
- Build and Compilation Issues: There is a compile bug reported with the Musa GGML backend, though details about the problem and logs are incomplete.
- issues/22416
- Windows Program Hanging: The test-chat-template.exe program hangs indefinitely on Windows when built with specific MSYS2 UCRT64 and GCC configurations after a certain commit, causing the test to complete but not exit properly and preventing interruption via Ctrl-C.
- issues/22142
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 7
Key Open Pull Requests
1. Wip/deepseek v4 support: This pull request introduces comprehensive support for DeepSeek V4 in the llama.cpp project, including GGUF conversion, native FP4 and FP8 quantization, runtime graph and memory management, CUDA performance optimizations, fused kernel tuning, and enhanced expert routing with fast top-k operations.
- URL: pull/22378
- Associated Commits: afa35, 77f42, c3b9f, 97517, 172df, 9805e, 4eee9, c4268, c9dd6, d9a1f, ba173, 48669
2. ggml-cpu: fuse RMS_NORM + MUL on CPU backend: This pull request introduces a fused RMS_NORM plus MUL kernel on the CPU backend that computes the combined operation in a single pass to eliminate intermediate result materialization and significantly improve performance, along with enhancements to the test-backend-ops benchmarking framework to support accurate multi-operation performance measurement. (A minimal single-pass illustration appears after these key pull requests.)
- URL: pull/22423
3. Windows: raise stdio limit for loading many GGUF shards: This pull request raises the maximum number of standard I/O handles on Windows at startup to 2048 by calling _setmaxstdio(), enabling llama.cpp to load models split across many .gguf shards without hitting the default Windows/MSVC file handle limit and thereby preventing shard-loading failures for heavily sharded models. (A short sketch of the _setmaxstdio() pattern also appears after these key pull requests.)
- URL: pull/22385
- Associated Commits: d51f6
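As a point of reference for the fused RMS_NORM + MUL pull request above (pull/22423), here is a minimal single-threaded sketch of what fusing the two operations into one pass means. It is not the kernel from the PR; the epsilon value, function name, and shapes are assumptions made for illustration.

```cpp
// Fused RMS_NORM + MUL sketch: normalize x by its root-mean-square and scale by w
// in a single pass, without materializing the normalized intermediate buffer.
#include <cmath>
#include <cstddef>
#include <vector>

void rms_norm_mul_fused(const float * x, const float * w, float * out,
                        std::size_t n, float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / n + eps);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (x[i] * scale) * w[i];   // normalize and multiply in one pass
    }
}

int main() {
    std::vector<float> x   = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> w   = {0.5f, 0.5f, 0.5f, 0.5f};
    std::vector<float> out(x.size());
    rms_norm_mul_fused(x.data(), w.data(), out.data(), x.size());
    return 0;
}
```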
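The Windows stdio-limit pull request above (pull/22385) centers on the MSVC CRT call _setmaxstdio(). Below is a minimal sketch of that pattern, guarded for non-Windows builds; the 2048 value comes from the PR description, while the function name and error handling around it are illustrative.

```cpp
// Sketch of raising the CRT stdio handle limit on Windows so that many .gguf
// shards can be opened at once. _setmaxstdio() is an MSVC CRT function (default
// limit is 512 FILE* streams); on other platforms this sketch is a no-op.
#include <cstdio>

void raise_stdio_limit_for_shards() {
#ifdef _WIN32
    if (_setmaxstdio(2048) == -1) {   // returns -1 on failure
        std::fprintf(stderr, "warning: failed to raise stdio handle limit\n");
    }
#endif
}

int main() {
    raise_stdio_limit_for_shards();
    return 0;
}
```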
Other Open Pull Requests
- API Enhancements for Runtime Configuration: This set of pull requests introduces new getter and setter methods for the `slot_prompt_similarity` field in the `server_context` API, allowing runtime querying and modification of the slot-selection similarity threshold. These changes restore previous embedder behavior and maintain API parity, improving flexibility for users.
- pull/22393
- Speculative Decoding Improvements: These pull requests add the ability to partially roll back speculative decoding in GDN models by storing intermediate states up to a specified draft maximum, which reduces redundant computation after rejected draft tokens. The feature currently supports CPU and CUDA only and introduces breaking changes to the GDN API.
- pull/22400
- Backend Support for Layer Normalization: This pull request adds support for layer normalization operations to the ggml-webgpu backend, including a shader implementation that uses workgroup barriers. All test cases pass with no observed performance regressions, ensuring stable integration.
- pull/22406
- Speculative N-gram Parameter Clamping Fixes: This pull request fixes the clamping logic for speculative n-gram request parameters in `llama-server` to properly limit user-provided values within [1, 1024]. It preserves valid settings instead of forcing them to 1024 and adds regression tests to verify the correct clamping behavior. (A minimal sketch of the corrected clamping appears after this list.)
- pull/22432
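The n-gram clamping fix above (pull/22432; see also issues/22414 in section 2.3) comes down to ordering the min/max bounds correctly. Here is a minimal sketch of the corrected behavior using std::clamp; the function and parameter names are placeholders, not the server's actual identifiers.

```cpp
// Sketch of clamping a user-supplied speculative n-gram parameter into [1, 1024].
// The bug described in issues/22414 effectively ignored the user's value and always
// produced 1024; clamping preserves values that are already inside the range.
#include <algorithm>
#include <cassert>

int clamp_ngram_param(int requested) {
    return std::clamp(requested, 1, 1024);   // equivalent to max(1, min(requested, 1024))
}

int main() {
    assert(clamp_ngram_param(16)    == 16);    // valid user value is preserved
    assert(clamp_ngram_param(0)     == 1);     // too small -> lower bound
    assert(clamp_ngram_param(99999) == 1024);  // too large -> upper bound
    return 0;
}
```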
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 49
Key Closed Pull Requests
1. webui: Server tools: This pull request introduces server tools integration into the web user interface, including features such as a /tools endpoint, built-in and JSON schema tools, UI improvements, and reorganized settings sections to enhance server management capabilities.
- URL: pull/21237
- Associated Commits: 8d0eb, 684ed, 155af, c800a, 62c8a, 44193, f4baf, 35076, 3994a, 7fc5b, bbb2b, 79999, 7c520, 94f7d, 5970f, 7eeee, ea5b7, b22ae, 4ddda, 9c922, 8c55e, c3520, 5acfc, cfd5a, 7a13b, ec630, 2d2ef, 8bf19, 156b9, b0749, ad9e9, 5468f, 6ec8a, c12c0, 8e557, 1dafe, c374e, d24e0
2. ggml-cuda: Repost of 21896: Blackwell native NVFP4 support: This pull request is a restored repost of a previously closed pull request that adds native NVFP4 support for Blackwell GPUs in the ggml-cuda backend, including kernel implementations, quantizer guards, and various refactorings to improve FP4 matrix multiplication and quantization functionality.
- URL: pull/22196
- Associated Commits: a0818, 9fb7e, 0bcf7, 4625a, 3ea6b, db595, 83b41, c3188, 78596, a6832, 6e31a, 58e27, 0e2c7, 72fc0, 7fcc8, 7c731, 6b26a, e34b6, 02df2, 92045, 667cc, 553c3, 0d9e0
3. hexagon: hmx flash attention: This pull request implements HMX-based flash attention for the Hexagon backend, introducing an FP16 exponential function to optimize performance despite some numerical loss, addresses multi-threading and pipeline improvements, and includes various bug fixes and refinements while noting an unresolved non-deterministic compilation issue that currently keeps the PR in draft status.
- URL: pull/22347
- Associated Commits: 29853, 35b2f, 3c770, 0b9b5, 8ae33, c82b0, a9cd7, 5ce4a, 4f42c, 3d1b4, a5a4d, ae78a, c9cb1, 3bda0, 1888d, 98f09, 4595f, 85dd8, 56ee0
Other Closed Pull Requests
- Logger and Resource Cleanup Fixes: Multiple pull requests address issues related to logger lifecycle and resource management to prevent crashes and hangs. These include intentionally leaking the logger instance on Windows to avoid DLL teardown issues, replacing dynamic vectors with static arrays in the logger on Linux, and adding explicit logger cleanup at the end of unit tests to resolve timing conflicts with asynchronous logger threads.
- Security and Access Control Enhancements: Pull requests introduce IP-based access control with CIDR support for the llama-server, allowing users to restrict HTTP server access via a new `--whitelist` CLI argument and environment variable. This middleware validation returns structured JSON errors for unauthorized requests, improving security without relying on external proxies or firewalls.
- Mixture of Experts (MoE) Pipeline and GPU Optimizations: A pull request redesigns the MoE pipeline with optimizations for the MxFP4 data type on Adreno GPUs, including router table reordering, pre-transposing expert weights, and separate kernels for prefill and decode phases. It maintains fallback to generic implementations for other GPU vendors to ensure broad compatibility.
- OpenCL Backend and Adreno GPU Support: Comprehensive support for the iq4_nl feature is added to the OpenCL backend, including general implementation and specific optimizations targeting Adreno GPUs. This enhances performance and compatibility for OpenCL workloads on these devices.
- Stability Fixes for CPU Implementations: To prevent segmentation faults on AIX systems, the tiled matrix multiplication path in the ggml-cpu sgemm implementation is disabled, falling back to the mnpack implementation for stable execution. This addresses issues caused by `vec_xst` operations near 4KB page boundaries.
- Dynamic Library Export Fixes: Missing exports in the `llama-common` dynamic library that caused linker errors during LTO builds are fixed by marking certain callback instantiations with `LLAMA_API`. This ensures proper symbol visibility and accessibility by other dynamic libraries.
- Parser and Chat Functionality Fixes: The parser's handling of spaces in reasoning markers within chat functionality is corrected by extracting the fix from a larger change set. This improves the robustness of chat message parsing.
- Scale Tensor Refactoring and FP8 Optimization: Handling of scale tensors is refactored by introducing reusable methods and removing the `input_scale` parameter specifically for dequantized FP8 model optimization tensors. This streamlines tensor management in the project.
- Model Addition and Conversion Script Updates: A pull request proposes adding a new model to the project's list and updates the `convert_hf_to_gguf.py` script accordingly, although it was not merged.
- DeepSeek-V4 GGUF Documentation: Comprehensive documentation is added for DeepSeek-V4 GGUF support, covering model conversion, metadata standards, quantization rules, and deployment practices. Existing documentation references are updated to ensure consistent and auditable handling of DeepSeek-V4-Pro models.
- Model Prefetching Feature: Support is introduced for proactively prefetching llama.cpp models from a preset file using a new `--prefetch` flag. This enables background prioritized downloads with cancellation and progress bar display to improve model availability and loading efficiency when starting the server.
- Router Multipart/Form-Data Forwarding Fix: The router is fixed to properly forward multipart/form-data to the model server by regenerating the multipart body. This enables correct use of the `/v1/audio/transcriptions` API in router mode.
- Flash-Attention Support for Mistral Small 4 Model: Flash-attention support is added for the Mistral Small 4 model with specific head sizes by introducing MMA-f16 and tile kernel configurations. This prevents fallback to CPU and significantly improves CUDA backend throughput.
- CMake Configuration for RISC-V SpacemiT Toolchain: The CMake configuration is updated to append the custom `xsmtvdotii` extension to the march string when `GGML_CPU_RISCV64_SPACEMIT` is enabled. This enables successful assembly of inline vmadot instructions and prevents build errors related to unrecognized opcodes.
- Gemma 4 Tool Call Parsing Fix: Ambiguity in the Gemma 4 multi-turn tool call parsing is resolved by changing the delimiter to `<|tool_call>call:`. This prevents premature termination of content when a literal `<|tool_call>` appears inside the content, and includes regression tests for verification.
- Upscale Shader Addition to ggml-webgpu: An upscale shader is added to the ggml-webgpu project implementing nearest, bilinear (with and without antialiasing), and bicubic interpolation methods with optional aligned_corner flags. Tests pass successfully though no performance comparisons are included.
- Additional Gemma4 Parsing Test Cases: Additional positive and negative test cases are added for parsing edge cases in the common/gemma4 module using a real model file. This complements a previous fix and ensures robustness.
- SVE-Optimized Quantized Matrix Multiplication Kernel: A Scalable Vector Extensions (SVE) optimized implementation of the `ggml_gemm_q8_0_4x8_q8_0()` kernel is added, improving LLM inference performance by about 20% on ARM Graviton3E processors while maintaining accuracy comparable to the NEON version.
- SPIR-V Header Detection Improvement: Detection of SPIR-V headers in `ggml-vulkan.cpp` is improved by introducing a `__has_include`-based mechanism that automatically selects the correct header path across diverse build environments. The original `_WIN32` platform-specific logic is preserved as a fallback, fixing build failures without affecting standard builds. (A minimal sketch of this pattern appears after this list.)
- GGUF Quantization Tag Regex Improvement: The GGUF quantization tag regex is improved to accurately identify tags with optional uppercase prefixes followed by hyphens, such as "UD-Q8_K_XL." This ensures consistent and correct parsing of model quantization tags during download. (A hedged regex sketch also appears after this list.)
- WebGPU Kernel Tuning Parameter Updates: Tuning parameters for WebGPU register tiling and subgroup matrix multiplication kernels are updated based on extensive performance data from multiple GPUs. This enhances average performance and reduces worst-case slowdowns by allowing independent kernel configurations and proposing new default tile sizes and workgroup dimensions.
- Backend and Device Registration Prevention: A change is introduced to prevent the registration of backends and devices that have already been registered, avoiding redundant processing in the ggml project.
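For the SPIR-V header detection change above, the following is a minimal sketch of the `__has_include` pattern with a `_WIN32` fallback, as described in the summary. The header paths shown are placeholders rather than the ones actually used in `ggml-vulkan.cpp`.

```cpp
// Sketch of preferring __has_include to pick a header path, falling back to the
// older platform-specific choice when the compiler lacks __has_include.
// The header names below are placeholders, not the actual ones in ggml-vulkan.cpp.
#if defined(__has_include)
#  if __has_include(<spirv/unified1/spirv.hpp>)
#    include <spirv/unified1/spirv.hpp>
#  elif __has_include(<spirv.hpp>)
#    include <spirv.hpp>
#  endif
#elif defined(_WIN32)
#  include <spirv.hpp>   // original platform-specific fallback
#endif

int main() { return 0; }
```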
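And for the quantization-tag regex item above, here is a hedged sketch of a pattern that accepts an optional uppercase prefix plus hyphen before a quant tag such as "UD-Q8_K_XL". The pattern itself is an assumption for illustration; the actual regex used by the downloader may differ.

```cpp
// Sketch of matching a GGUF quantization tag with an optional uppercase prefix
// followed by a hyphen, e.g. "Q8_K_XL" or "UD-Q8_K_XL". Illustrative only.
#include <cassert>
#include <regex>
#include <string>

bool looks_like_quant_tag(const std::string & s) {
    static const std::regex re(R"(^([A-Z]+-)?I?Q[0-9]+(_[A-Z0-9]+)*$)");
    return std::regex_match(s, re);
}

int main() {
    assert(looks_like_quant_tag("Q8_K_XL"));
    assert(looks_like_quant_tag("UD-Q8_K_XL"));
    assert(!looks_like_quant_tag("q8_k_xl"));   // lowercase is not matched by this sketch
    return 0;
}
```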
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| kainlan | 147 | 0 | 0 | 0 |
| TheTom | 98 | 1 | 0 | 0 |
| No author found | 90 | 0 | 0 | 0 |
| ggerganov | 77 | 4 | 0 | 0 |
| ngxson | 79 | 1 | 0 | 0 |
| max-krasnyansky | 79 | 1 | 0 | 0 |
| aldehir | 48 | 0 | 0 | 0 |
| michaelw9999 | 46 | 1 | 0 | 0 |
| gary149 | 41 | 0 | 0 | 0 |
| Constannnnnt | 36 | 3 | 0 | 0 |