Weekly GitHub Report for Llama.cpp: April 27, 2026 - May 04, 2026
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and stability, reflecting a continued focus on optimizing user experience and system reliability. Notable highlights include improved processing speed and bug fixes addressing previous issues.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 0
Summarized Issues:
As of our latest update, there are no open issues for the project this week.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 3
Summarized Issues:
- Program hang and freeze on Windows: The test-chat-template.exe program hangs indefinitely on Windows when built with specific MSYS2 UCRT64 and GCC configurations, causing the test to complete but then freeze without responding to interrupts like Ctrl-C. This issue appeared after a certain commit and affects the program's responsiveness during testing.
- issues/22142
- Memory allocation bug in Vulkan backend: The Vulkan backend incorrectly selects the smallest memory heap for device memory allocation, leading to allocation failures despite there being sufficient free memory on the intended device. This causes problems in managing device memory efficiently and prevents proper allocation.
- issues/22368
- Tensor parallelism output error on multiple GPUs: Using tensor parallelism with three or more GPUs on Qwen3.6-35B-A3B models in llama-server results in an infinite stream of slashes as output, while using two GPUs or smaller models does not cause this problem. This bug affects the correctness of output when scaling tensor parallelism beyond two GPUs.
- issues/22391
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 0
As of our latest update, there are no open pull requests for the project this week.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 29
Key Closed Pull Requests
1. webui: Server tools: This pull request introduces server tools integration into the web user interface, including features such as a /tools endpoint, built-in and JSON schema tools, UI improvements, and reorganized settings sections to enhance server management capabilities.
- URL: pull/21237
- Associated Commits: 8d0eb, 684ed, 155af, c800a, 62c8a, 44193, f4baf, 35076, 3994a, 7fc5b, bbb2b, 79999, 7c520, 94f7d, 5970f, 7eeee, ea5b7, b22ae, 4ddda, 9c922, 8c55e, c3520, 5acfc, cfd5a, 7a13b, ec630, 2d2ef, 8bf19, 156b9, b0749, ad9e9, 5468f, 6ec8a, c12c0, 8e557, 1dafe, c374e, d24e0
2. model: move load_hparams and load_tensors to per-model definition: This pull request restructures the codebase by moving the load_hparams and load_tensors functions into per-model definitions to enable a deterministic migration process via a heuristic script, ensuring clearer model architecture organization and inheritance rules while preparing for subsequent cleanup and deduplication.
- URL: pull/22004
- Associated Commits: 05905, 59f82, eefe3, e078d, 7e71b, 4d873, ede26, 96a95, bc5f2, 2c918, 589de, 71270, e56f5, f1549, e4e52, e73ac, 80e75, 9445c, 86130, 10aa6, b8e91, 55569, e95c4, 4f58c, 9d3bd, 6c6ec, 5096a, b3dc2, 7f22f, 6d39d, 47d7a, 64ce0, 01b4b, ae406, f4ee2
3. ggml-cuda: Repost of 21896: Blackwell native NVFP4 support: This pull request is a restored repost of a previously closed pull request that adds native NVFP4 support for Blackwell GPUs in the ggml-cuda backend, including kernel implementations, quantizer guards, and various refactorings to optimize FP4 operations specifically for Blackwell architecture.
- URL: pull/22196
- Associated Commits: a0818, 9fb7e, 0bcf7, 4625a, 3ea6b, db595, 83b41, c3188, 78596, a6832, 6e31a, 58e27, 0e2c7, 72fc0, 7fcc8, 7c731, 6b26a, e34b6, 02df2, 92045, 667cc, 553c3, 0d9e0
Other Closed Pull Requests
- Flash Attention and CUDA Backend Optimizations: Multiple pull requests introduce and enhance flash attention support and CUDA backend optimizations, including HMX-based flash attention for the Hexagon backend with FP16 exponential function and multi-threaded operations, as well as flash-attention support for the Mistral Small 4 model with specialized kernel configurations to improve throughput and prevent CPU fallback. These changes collectively aim to boost performance and correctness across different hardware targets.
- Logger and Resource Management Fixes: Several pull requests address stability issues related to logger lifecycle and resource management, including intentional logger instance leaks to avoid DLL teardown crashes on Windows, replacing dynamic vectors with static arrays to prevent Linux crashes, and adding explicit logger cleanup in unit tests to resolve asynchronous startup timing conflicts. These fixes improve robustness across platforms by managing resource cleanup more safely.
- Speculative Decoding and Vocabulary Compatibility: Pull requests refactor speculative decoding parameters with breaking CLI changes and update vocabulary compatibility checks in speculative examples by correcting logging and porting previous fixes. These updates ensure better parameter management and consistency in speculative decoding workflows.
- Network Security and Router Improvements: Enhancements include adding IP whitelist functionality with CIDR support to restrict HTTP server access and fixing multipart/form-data forwarding in the router to enable proper use of audio transcription APIs. These changes improve network-level security and API functionality.
- Mixture of Experts (MoE) Pipeline and GPU Optimizations: A redesigned MoE pipeline targets MxFP4 data type on Adreno GPUs with router table reordering, expert weight pre-transposing, and kernel differentiation, while maintaining fallback for other GPUs. This optimizes MoE performance on specific hardware.
- Build and Compilation Fixes: Pull requests fix build issues by reverting problematic library linking changes that broke CUDA compilation, updating CMake to append custom extensions for RISC-V toolchains, and improving SPIR-V header detection with `__has_include` to fix build failures across environments. These ensure smoother and more portable builds.
- Matrix Multiplication and Quantization Enhancements: Several pull requests improve matrix multiplication implementations, including disabling tiled sgemm on AIX to prevent faults, adding SVE-optimized quantized gemm kernels for ARM Graviton3E with ~20% speedup, and adding WebGPU support for Q1_0 quantization and fast matrix-vector multiplication for all i-quant types. These changes enhance performance and stability across platforms.
- Documentation and Testing Improvements: Documentation for DeepSeek-V4 GGUF support is added with detailed guidelines on model conversion and deployment, while additional positive and negative test cases are introduced for parsing edge cases in the common/gemma4 module to ensure robustness. These efforts improve user guidance and code reliability.
- Model Prefetching and Cache Management: A new `--prefetch` flag enables proactive background downloading of models with cancellation and progress display, and cache directory creation is fixed on Windows with improved logging of cache file names. These changes enhance model loading efficiency and cache usability.
- Shader and WebGPU Backend Enhancements: Pull requests add an upscale shader with multiple interpolation methods, layer normalization support with workgroup barriers, and fix shader validation errors by silencing subgroup uniformity diagnostics, enabling better WebGPU backend functionality and compatibility.
- Bug Fixes in Reasoning and Tokenization: Fixes include correcting the reasoning budget state machine to properly re-arm budgets on new tags, with regression tests, and resolving tokenization errors in the OpenAI-compatible chat completions API by preserving media markers in templates. These ensure correct model behavior and API reliability.
- Backend Registration and Redundancy Prevention: A change prevents redundant registration of backends and devices already registered, reducing unnecessary processing and potential conflicts.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| kainlan | 107 | 0 | 0 | 0 |
| TheTom | 98 | 1 | 0 | 0 |
| No author found | 88 | 0 | 0 | 0 |
| max-krasnyansky | 79 | 0 | 0 | 0 |
| ngxson | 77 | 1 | 0 | 0 |
| ggerganov | 73 | 2 | 0 | 0 |
| michaelw9999 | 46 | 1 | 0 | 0 |
| Constannnnnt | 36 | 2 | 0 | 0 |
| scutler-nv | 33 | 1 | 0 | 0 |
| gary149 | 33 | 0 | 0 | 0 |