Weekly GitHub Report for Llama.cpp: April 27, 2026 - May 04, 2026
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and stability, reflecting a continued focus on optimizing user experience and system reliability. Notable highlights include improved processing speed and bug fixes addressing previous issues.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 0
Summarized Issues:
As of our latest update, there are no open issues for the project this week.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 3
Summarized Issues:
- Program hang and freeze on Windows: The test-chat-template.exe program hangs indefinitely on Windows when built with specific MSYS2 UCRT64 and GCC configurations, causing the test to complete but then freeze without responding to interrupts like Ctrl-C. This issue appeared after a certain commit and affects the program's responsiveness during testing.
- issues/22142
- Memory allocation bug in Vulkan backend: The Vulkan backend incorrectly selects the smallest memory heap for device memory allocation, leading to allocation failures despite there being sufficient free memory on the intended device. This causes problems in managing device memory efficiently and prevents proper allocation.
- issues/22368
- Tensor parallelism output error on multiple GPUs: Using tensor parallelism with three or more GPUs on Qwen3.6-35B-A3B models in llama-server results in an infinite stream of slashes as output, while using two GPUs or smaller models does not cause this problem. This bug affects the correctness of output when scaling tensor parallelism beyond two GPUs.
- issues/22391
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 0
As of our latest update, there are no open pull requests for the project this week.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 29
Key Closed Pull Requests
1. webui: Server tools: This pull request introduces server tools integration into the web user interface, including features such as a /tools endpoint, built-in and JSON schema tools, UI improvements, and reorganized settings sections to enhance server management capabilities.
- URL: pull/21237
- Associated Commits: 8d0eb, 684ed, 155af, c800a, 62c8a, 44193, f4baf, 35076, 3994a, 7fc5b, bbb2b, 79999, 7c520, 94f7d, 5970f, 7eeee, ea5b7, b22ae, 4ddda, 9c922, 8c55e, c3520, 5acfc, cfd5a, 7a13b, ec630, 2d2ef, 8bf19, 156b9, b0749, ad9e9, 5468f, 6ec8a, c12c0, 8e557, 1dafe, c374e, d24e0
2. model: move load_hparams and load_tensors to per-model definition: This pull request restructures the codebase by moving the load_hparams and load_tensors functions into per-model definitions to enable a deterministic migration process via a heuristic script, ensuring clearer model architecture organization and inheritance rules while preparing for subsequent cleanup and deduplication.
- URL: pull/22004
- Associated Commits: 05905, 59f82, eefe3, e078d, 7e71b, 4d873, ede26, 96a95, bc5f2, 2c918, 589de, 71270, e56f5, f1549, e4e52, e73ac, 80e75, 9445c, 86130, 10aa6, b8e91, 55569, e95c4, 4f58c, 9d3bd, 6c6ec, 5096a, b3dc2, 7f22f, 6d39d, 47d7a, 64ce0, 01b4b, ae406, f4ee2
3. ggml-cuda: Repost of 21896: Blackwell native NVFP4 support: This pull request is a restored repost of a previously closed pull request that adds native NVFP4 support for Blackwell GPUs in the ggml-cuda backend, including kernel implementations, quantizer guards, and various refactorings to optimize FP4 operations specifically for Blackwell architecture.
- URL: pull/22196
- Associated Commits: a0818, 9fb7e, 0bcf7, 4625a, 3ea6b, db595, 83b41, c3188, 78596, a6832, 6e31a, 58e27, 0e2c7, 72fc0, 7fcc8, 7c731, 6b26a, e34b6, 02df2, 92045, 667cc, 553c3, 0d9e0
Other Closed Pull Requests
- Flash Attention and CUDA Backend Optimizations: Multiple pull requests introduce and enhance flash attention support and CUDA backend optimizations, including HMX-based flash attention for the Hexagon backend with FP16 exponential function and multi-threaded operations, as well as flash-attention support for the Mistral Small 4 model with specialized kernel configurations to improve throughput and prevent CPU fallback. These changes collectively aim to boost performance and correctness across different hardware targets.
- Logger and Resource Management Fixes: Several pull requests address stability issues related to logger lifecycle and resource management, including intentional logger instance leaks to avoid DLL teardown crashes on Windows, replacing dynamic vectors with static arrays to prevent Linux crashes, and adding explicit logger cleanup in unit tests to resolve asynchronous startup timing conflicts. These fixes improve robustness across platforms by managing resource cleanup more safely.
- Speculative Decoding and Vocabulary Compatibility: Pull requests refactor speculative decoding parameters with breaking CLI changes and update vocabulary compatibility checks in speculative examples by correcting logging and porting previous fixes. These updates ensure better parameter management and consistency in speculative decoding workflows.
- Network Security and Router Improvements: Enhancements include adding IP whitelist functionality with CIDR support to restrict HTTP server access and fixing multipart/form-data forwarding in the router to enable proper use of audio transcription APIs. These changes improve network-level security and API functionality.
- Mixture of Experts (MoE) Pipeline and GPU Optimizations: A redesigned MoE pipeline targets MxFP4 data type on Adreno GPUs with router table reordering, expert weight pre-transposing, and kernel differentiation, while maintaining fallback for other GPUs. This optimizes MoE performance on specific hardware.
- Build and Compilation Fixes: Pull requests fix build issues by reverting problematic library linking changes that broke CUDA compilation, updating CMake to append custom extensions for RISC-V toolchains, and improving SPIR-V header detection with `__has_include` to fix build failures across environments. These ensure smoother and more portable builds.
- Matrix Multiplication and Quantization Enhancements: Several pull requests improve matrix multiplication implementations, including disabling tiled sgemm on AIX to prevent faults, adding SVE-optimized quantized gemm kernels for ARM Graviton3E with ~20% speedup, and adding WebGPU support for Q1_0 quantization and fast matrix-vector multiplication for all i-quant types. These changes enhance performance and stability across platforms.
- Documentation and Testing Improvements: Documentation for DeepSeek-V4 GGUF support is added with detailed guidelines on model conversion and deployment, while additional positive and negative test cases are introduced for parsing edge cases in the common/gemma4 module to ensure robustness. These efforts improve user guidance and code reliability.
- Model Prefetching and Cache Management: A new `--prefetch` flag enables proactive background downloading of models with cancellation and progress display, and cache directory creation is fixed on Windows with improved logging of cache file names. These changes enhance model loading efficiency and cache usability.
- Shader and WebGPU Backend Enhancements: Pull requests add an upscale shader with multiple interpolation methods, layer normalization support with workgroup barriers, and fix shader validation errors by silencing subgroup uniformity diagnostics, enabling better WebGPU backend functionality and compatibility.
- Bug Fixes in Reasoning and Tokenization: Fixes include correcting the reasoning budget state machine to properly re-arm budgets on new tags, with regression tests, and resolving tokenization errors in the OpenAI-compatible chat completions API by preserving media markers in templates. These ensure correct model behavior and API reliability.
- Backend Registration and Redundancy Prevention: A change prevents redundant registration of backends and devices already registered, reducing unnecessary processing and potential conflicts.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| kainlan | 107 | 0 | 0 | 0 |
| TheTom | 98 | 1 | 0 | 0 |
| No author found | 88 | 0 | 0 | 0 |
| max-krasnyansky | 79 | 0 | 0 | 0 |
| ngxson | 77 | 1 | 0 | 0 |
| ggerganov | 73 | 2 | 0 | 0 |
| michaelw9999 | 46 | 1 | 0 | 0 |
| Constannnnnt | 36 | 2 | 0 | 0 |
| scutler-nv | 33 | 1 | 0 | 0 |
| gary149 | 33 | 0 | 0 | 0 |