Weekly GitHub Report for Llama.cpp: May 04, 2026 - May 11, 2026 (14:43:50)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and user experience, with notable improvements in system stability and feature optimization. This release reflects a continued focus on refining core functionalities and addressing user feedback.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 0
Summarized Issues:
As of our latest update, no issues were opened for the project this week.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 4
Summarized Issues:
- Vulkan Backend Stability and Performance Issues: Multiple issues report crashes and freezes when using the Vulkan backend for different models and hardware configurations, including Gemma-4-26B-A4B on AMD 8840U and Qwen3.6-35B-A3B on Ryzen AI 9 HX 370. These problems highlight instability and device loss errors after extended token processing, indicating potential driver or compatibility problems with Vulkan on certain systems.
- [issues/21497, issues/22425]
- Backend Performance Discrepancies Across Platforms: There is a significant performance gap between SYCL and Vulkan backends on different hardware, with SYCL showing poor GPU acceleration on Battlemage GPUs under Linux, while Vulkan performs much better on Windows. This raises concerns about the maturity of SYCL support, hardware age, and system resource limitations affecting GPU utilization.
- [issues/22413]
- Validation Errors in TaskUpdate Tool Integration: The TaskUpdate tool used with llama.cpp fails to validate anyOf schema types correctly because string values include literal quote characters, causing valid enum values like "in_progress" to be rejected. This issue points to a problem in how the tool processes string inputs during tool call validation.
- [issues/22240]
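The quoting failure described in the last item can be illustrated with a minimal sketch. It is not the TaskUpdate tool's or llama.cpp's actual validation code: the function names, the simplified anyOf check, and every enum value other than "in_progress" are assumptions made for illustration.

```python
import json

# Hypothetical enum; only "in_progress" is named in the issue summary.
ALLOWED_STATUS = ["pending", "in_progress", "completed"]


def matches_any_of(value, schemas):
    """Simplified anyOf: True if the value satisfies at least one sub-schema."""
    for schema in schemas:
        if "enum" in schema and value in schema["enum"]:
            return True
        if schema.get("type") == "null" and value is None:
            return True
    return False


def strip_literal_quotes(value):
    """Undo one level of accidental JSON string encoding, if present."""
    if isinstance(value, str) and len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return value
        return decoded if isinstance(decoded, str) else value
    return value


status_schema = [{"enum": ALLOWED_STATUS}, {"type": "null"}]
raw = '"in_progress"'  # the string itself contains the quote characters

rejected = matches_any_of(raw, status_schema)                        # quotes break the enum match
accepted = matches_any_of(strip_literal_quotes(raw), status_schema)  # matches after unquoting
```

In this sketch the quoted string fails the enum comparison because the quote characters are part of the value, while the unquoted form matches. Where such normalization belongs in a real tool call pipeline is a separate design question that the issue summary does not settle.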
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 0
As of our latest update, no pull requests were opened for the project this week.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 11
Key Closed Pull Requests
1. internal AllReduce kernel for CUDA provider: This pull request implements an internal CUDA AllReduce kernel for tensor parallelism mode in multi-GPU setups. It improves performance over the NCCL-based approach by using a single-phase CUDA kernel with pipelined device-to-host communication and cross-GPU synchronization. It also includes environment-variable-based provider selection, a watchdog system for detecting hangs, and support for up to 2 GPUs with FP32 tensors up to 256 KB.
- URL: pull/22299
- Associated Commits: 4c117, 52af9, 8c0a7, 10c47, 2c1a1, c5207, 172cb, 3b584, 860ee, 433c3, bc8b0, 8da7e, 372d4, 50282, cfdb0, 239f2, cfcae, bb019, ebc31, 014ad, a6981, 77c0e, 43743, 2573b, 4d773, f0460, fbcae, f3d32, 7449c, 2b91e, 485ee, 892b2, b34ad
2. Modality conditional adapters: This pull request introduces a mechanism for automatically toggling LoRA adapters based on detected input modalities, so that modular models can switch between text and modality-specific modes without needing separate boots. This improves support for models like ibm-granite’s speech and vision variants by preserving the base LLM and conditionally activating modality adapters.
- URL: pull/22184
- Associated Commits: 63e84, 17b31, 2a2c2, 76d44, ef303, 3faed, 12ebf, 992a5, f5240, b3305, 7780f, a1779, d2227, c5313, 63dc7, 0c552, a0498, bed59, 11795, 831a4, 43e52, f8c9c, 5600f, fc881, f6d44, 9a9ef
3. Wip/deepseek v4 support: This pull request introduces comprehensive support for DeepSeek V4 in the llama.cpp project, including GGUF conversion, runtime graph and memory management, native FP4/FP8 quantization, CUDA performance optimizations, and enhanced activation quantization and kernel tuning to improve efficiency and accuracy.
- URL: pull/22378
- Associated Commits: afa35, 77f42, c3b9f, 97517, 172df, 9805e, 4eee9, c4268, c9dd6, d9a1f, ba173, 48669
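For the internal AllReduce pull request (item 1 above), the stated limits (up to 2 GPUs, FP32 tensors, up to 256 KB) and the environment-variable-based provider selection suggest a dispatch decision along the following lines. This is only a sketch under stated assumptions: the environment variable name, the provider labels, the default provider, the fallback to NCCL, and reading 256 KB as 256 * 1024 bytes are all hypothetical and not taken from the pull request.

```python
import os

# All names below are hypothetical placeholders, not llama.cpp identifiers.
ENV_VAR = "EXAMPLE_ALLREDUCE_PROVIDER"
MAX_GPUS = 2                 # limit stated in the summary
MAX_BYTES = 256 * 1024       # assumes "256 KB" means 256 * 1024 bytes
SUPPORTED_DTYPE = "f32"      # summary mentions FP32 tensors only


def pick_allreduce_provider(n_gpus, dtype, n_bytes, env=None):
    """Return "internal" when the request fits the stated limits, else "nccl"."""
    env = os.environ if env is None else env
    requested = env.get(ENV_VAR, "internal")  # assumed default
    if requested != "internal":
        return "nccl"
    fits = (
        n_gpus <= MAX_GPUS
        and dtype == SUPPORTED_DTYPE
        and n_bytes <= MAX_BYTES
    )
    return "internal" if fits else "nccl"


small = pick_allreduce_provider(2, "f32", 64 * 1024, env={})
too_large = pick_allreduce_provider(2, "f32", 512 * 1024, env={})
forced = pick_allreduce_provider(2, "f32", 64 * 1024, env={ENV_VAR: "nccl"})
```

Passing the environment as a parameter keeps the sketch testable without mutating the process environment.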
Other Closed Pull Requests
- Model support and integration: Multiple pull requests add support for new models and advanced audio preprocessing techniques. These include integration of the ibm-granite/granite-4.0-1b-speech model with a Conformer encoder and QFormer projector, as well as support for the Sarashina2.2 Vision 3B model with new projector types and compatibility updates across tools and scripts.
- OpenCL backend improvements: Several pull requests enhance the OpenCL backend by refactoring code for clarity, adding a new Adreno-specific xmem attention path to improve throughput and memory robustness, and enabling Ahead-Of-Time (AOT) builds for Intel GPUs with subgroup size annotations. These changes improve performance, maintain correctness, and document kernel behavior.
- Quantization and reorder optimizations: Updates extend the reorder-quantized codepath to support the new Q5_K format and enhance the Q8_0 reorder MMVQ path with specialized block layouts and new kernels. These improvements are integrated into the SYCL dispatch system to boost matrix-vector multiplication and dequantization performance.
- Performance kernel enhancements: A fused RMS_NORM plus MUL kernel is introduced on the CPU backend to compute combined operations in a single pass, eliminating intermediate results and improving performance by up to 2.07x. The benchmarking framework is also extended to measure multi-operation performance accurately.
- Backend loading optimization: One pull request optimizes the project by modifying it to load backends only when required, addressing multiple issues and improving resource management.
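The idea behind the fused RMS_NORM plus MUL kernel mentioned above can be sketched in plain Python. This is an illustration of the general technique, not the CPU backend kernel from the pull request; the function names and the epsilon value are assumptions.

```python
import math


def rms_norm_then_mul(x, w, eps=1e-6):
    """Unfused reference: writes the normalized row, then scales it."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v * scale for v in x]               # intermediate buffer
    return [n * wi for n, wi in zip(normed, w)]   # separate multiply step


def fused_rms_norm_mul(x, w, eps=1e-6):
    """Fused variant: one reduction for the scale, then a single output
    loop that normalizes and multiplies without an intermediate buffer."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * wi for v, wi in zip(x, w)]


row = [1.0, -2.0, 3.0, -4.0]
weight = [0.5, 1.0, 1.5, 2.0]
reference = rms_norm_then_mul(row, weight)
fused = fused_rms_norm_mul(row, weight)
```

Both functions compute the same values; the fused form skips writing and re-reading the normalized row, which is the kind of saving the summary attributes to the new kernel.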
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| TheTom | 97 | 0 | 0 | 0 |
| ngxson | 72 | 1 | 0 | 0 |
| No author found | 67 | 0 | 0 | 0 |
| ggerganov | 60 | 0 | 0 | 0 |
| michaelw9999 | 46 | 0 | 0 | 0 |
| scutler-nv | 33 | 1 | 0 | 0 |
| chraac | 27 | 0 | 0 | 0 |
| gabe-l-hart | 26 | 1 | 0 | 0 |
| Constannnnnt | 25 | 0 | 0 | 0 |
| johndpope | 25 | 0 | 0 | 0 |