Weekly GitHub Report for Llama.cpp: July 07, 2025 - July 14, 2025 (12:06:29)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and changes; however, specific details are not provided in the data, so notable highlights or trends cannot be identified without additional information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Feature Request: Support Kimi K2: This issue is a feature request to support the Kimi K2 model in the llama.cpp project, highlighting two main problems: the number of experts exceeding 256 and a missing tokenizer file. The user emphasizes the model's potential based on benchmark data and seeks a solution to integrate it effectively.
- The comments discuss potential solutions and workarounds for the issues, including modifying code to handle the expert count and addressing tokenizer problems. Users share experiences with converting model formats, suggest using specific scripts for tensor conversion, and provide patches to facilitate the conversion process. There is also a discussion about the compatibility of certain hardware with the conversion scripts.
- Number of comments this week: 11
- Eval bug: Gemma 3n incoherent with HIP when prompt length > ubatch: This issue reports a bug in the Gemma 3n model when using the HIP backend on Linux, where the model becomes incoherent if the prompt length exceeds the ubatch size, although it works fine on Vulkan and CPU. The problem seems to be related to the ROCm device and is less frequent when certain layers are offloaded to the CPU, with a potential connection to a previous issue specific to gfx11.
- The comments discuss the issue's reproducibility and potential fixes, with some users confirming the problem on different setups, including CUDA. A specific commit is identified as resolving the issue, and users confirm the fix works across different systems, with a brief discussion on the operating systems used.
- Number of comments this week: 9
- Eval bug: server: unnecessary prompt re-processing with Jamba models: This issue is about a bug in the llama-server where the Jamba models unnecessarily re-process the entire prompt context due to a lack of cache data, even though it should only need to restart from the last generation point. The user suspects that the check for Jamba models, which do not use SWA, might be incorrect, leading to this inefficient behavior.
- The comments discuss the possibility that the initial assumption about the bug might be incorrect due to the recurrent nature of Jamba models, which cannot roll back like a typical KV cache. A clarification is requested regarding the status of rolling back recurrent states, and it is confirmed that rollback is not yet implemented but could be supported in the future with changes to the `llama_memory` API. Additionally, it is suggested that the warning message should be updated to mention the recurrent state cache.
- Number of comments this week: 3
- Support for Ovis2 models: This issue is about the lack of support for Ovis2 models in the llama.cpp project, despite their high performance and previous discussions on the topic. The user has opened this issue to bring attention to the need for integration of these models, particularly highlighting their impressive visual capabilities.
- The comments express strong support for the Ovis-U1-3B model, noting its accuracy and effectiveness compared to other vision models. Examples are provided to illustrate its superior performance in specific scenarios, such as recognizing details that other models miss.
- Number of comments this week: 2
- Eval bug: ROCm error: batched GEMM not supported: This issue reports a bug encountered when running a model using the llama software on a system with an AMD Radeon RX 7900 XTX GPU, where the process fails with a "ROCm error: CUBLAS_STATUS_NOT_SUPPORTED" message, indicating that batched GEMM operations are not supported. The problem seems to be related to the use of HIP and ROCm backends, and the error occurs specifically when attempting to use GPU resources, while CPU execution works as expected.
- The comments discuss attempts to resolve the issue by modifying the code in `ggml-cuda.cu` and running tests with `hipblas-bench`, which initially suggested that the operation might be supported. Further investigation revealed that the problem was due to outdated headers from a previous version of hipblas, and updating these headers to align with changes in HipBlas version 6.0 resolved the compilation and runtime issues.
- Number of comments this week: 2
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is exhibiting a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The problem has been open for a significant duration of 469 days and 19 hours, indicating a potentially complex or unresolved technical challenge.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is looking for any documented or known methods to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress reporting during these parallel download operations (a minimal libcurl sketch follows this list).
- kubernetes example: This issue is about the need for a Helm chart for the `llama.cpp` server to facilitate its deployment on Kubernetes, which is a popular platform for deploying applications at scale in the industry. The issue has been open for a significant amount of time, and while initial work has been started, the contributor is seeking additional help from the community to advance the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the CUDA backend on a system with an NVIDIA GeForce RTX 3060 GPU, where the process fails due to an error related to tensor information not being read correctly. The error message indicates that a specific tensor, 'blk.0.ffn_down.weight', has an element count per row that is not a multiple of the expected block size, leading to a failure in loading the model from the specified file.
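The `CURLOPT_NOPROGRESS` item above concerns how each parallel shard download can report progress without clobbering the others. The following is a minimal, hypothetical libcurl sketch of that pattern, using per-handle progress state via `CURLOPT_XFERINFODATA`; it is not the llama.cpp implementation, and names such as `shard_progress` are illustrative only.

```cpp
// Minimal sketch (not the llama.cpp code): give each download handle its own
// progress state so parallel shard downloads do not fight over one indicator.
#include <curl/curl.h>
#include <cstdio>

struct shard_progress {
    int        shard_id;   // which shard this handle is downloading
    curl_off_t downloaded; // bytes received so far
    curl_off_t total;      // total bytes expected (0 if unknown)
};

// Called periodically by libcurl once CURLOPT_NOPROGRESS is set to 0.
static int on_progress(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    auto *p = static_cast<shard_progress *>(clientp);
    p->downloaded = dlnow;
    p->total      = dltotal;
    // each shard reports against its own counter
    std::fprintf(stderr, "shard %d: %lld / %lld bytes\n",
                 p->shard_id, (long long) dlnow, (long long) dltotal);
    return 0; // returning non-zero would abort this transfer
}

static void download_shard(const char *url, shard_progress *state) {
    CURL *h = curl_easy_init();
    if (!h) return;
    curl_easy_setopt(h, CURLOPT_URL,              url);
    curl_easy_setopt(h, CURLOPT_NOPROGRESS,       0L);          // enable progress callbacks
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, on_progress); // per-handle callback
    curl_easy_setopt(h, CURLOPT_XFERINFODATA,     state);       // per-handle state
    curl_easy_perform(h);
    curl_easy_cleanup(h);
}
```

Returning a non-zero value from the callback aborts that transfer, which is how libcurl lets a caller cancel an individual shard mid-download.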
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 20
Summarized Issues:
- Vulkan and GPU Compatibility Issues: Several issues highlight problems with Vulkan and GPU compatibility, such as crashes on Vulkan with new memory allocation size calculations and ROCm errors with AMD Radeon RX 7900 XTX. These issues suggest that specific environment variables or reverting to older methods may resolve the problems, and they highlight the impact of breaking changes in HipBlas version 6.0 on compatibility.
- Feature Requests for Enhanced Functionality: There are multiple feature requests aimed at enhancing the functionality of the llama.cpp project, such as adding built-in options for log probability output and exposing Top‑K/Top‑P candidate token lists. These enhancements are intended to improve tasks like confidence estimation and model transparency, while also increasing usability and appeal to new contributors.
- Bugs in Model and API Behavior: Various bugs have been reported in the llama.cpp project, including incorrect HTTP status returns, unexpected model responses, and issues with parameter passing. These bugs affect the functionality and reliability of the project, complicating client automation and causing unexpected behavior during model execution.
- Build and Compatibility Issues: Issues related to build and compatibility have been reported, such as Docker builds failing due to missing submodule URLs and problems with model quantization. These issues highlight challenges in maintaining compatibility with newer versions and ensuring successful builds from official Dockerfiles.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 6
Summarized Issues:
- Vulkan and AMD Compatibility Issues: Users have reported problems when running models on Vulkan with AMD hardware, including a significant drop in generation speed and incoherent output after generating around 200 tokens. These issues were linked to AMD driver compatibility and were resolved by updating the graphics driver or through subsequent commits.
- CMake and CUDA Compiler Errors: A compilation error was encountered where the CMake build system failed to find the CUDA compiler, resulting in a "CMAKE_CUDA_COMPILER-NOTFOUND" error. This issue was resolved by addressing a missing `nvcc` command after installing NVIDIA JetPack.
- Continuous Integration Pipeline Failures: The `ubuntu-22-cmake-vulkan` process in the CI pipeline experienced spurious failures, often due to timeouts. This process is the longest-running task in the pipeline, indicating potential inefficiencies that need addressing.
- Model Conversion and Quantization Errors: Users faced issues during model conversion and quantization, including a RuntimeError due to an inability to parse the ModelProto from a tokenizer file and a failure in the quantization step due to a missing key. These problems were linked to potential library issues or incorrect model usage.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 20
Key Open Pull Requests
1. model : add PLaMo-2 model: This pull request introduces the PLaMo-2 model to the llama.cpp project, incorporating a custom tokenizer and a model architecture that combines Mamba and Attention, and provides detailed instructions for retrieving, modifying, and testing the model, as well as converting it into the gguf format and building the necessary binaries.
- URL: pull/14560
- Merged: No
- Associated Commits: 27110, 8db1e, 00280, 0c8b3, d6684, a09db, c460f, b6faf, b7ec1, 3b57b, 7e13f, cbc74, 0fd13, 61a88, ea2e6, fc594, 181da, 3a414, 4e4c4, 3587a, 5d3c7, 72eea, 18d1c, 61200, eb589, 8fb57, 17f6c, fee3c, 6840a, 37248, 43d8d, ff794, 33425, 10c3c, 9b38f, bc320, fcb88, a03e3, 9d3f4, 5f62d, 375de, 4bb4b, 63ac3, 124c2, 8006f, 69169, e3fe6, 2bcaf, 908e6, 4682e, 20f8e, 07c25, f7163, f6567, 4728e, 6acaf, 7e4c5, 149b9, 77865, 0424a, ea95a, 2d76b, fccec, 5231e, 521c1, df95f, 498b8, 6afd3, 34360, 71abd, fb2ae
2. HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3: This pull request introduces support for Matrix cores using MFMA instructions in MMQ kernels, enables stream-K for CDNA3, and removes the hardcoded WARP_SIZE constant to enhance compatibility with AMD GPUs, while also discussing future plans for additional backend support and performance improvements.
- URL: pull/14624
- Merged: No
- Associated Commits: 68da4, 79f34, 89ba8, e57e5, 9784a, dad79, ff60f, e8eeb, a1619, 75d38, 0215a, aa35f, ba17f, 5ab14
3. metal : fuse add: This pull request introduces a feature for fusing certain operations in the Metal backend, as indicated by the title "metal : fuse add." Remaining tasks include disabling the feature with an environment variable, printing fusion statistics, and potentially implementing a cleaner kernel, while the associated commits focus on reusing compute graphs, adding and removing parameters, and refactoring the graph context.
- URL: pull/14596
- Merged: No
Other Open Pull Requests
- Model Architecture and Support Enhancements: This topic covers the introduction of new model architectures and support for existing models. The EXAONE 4.0 model architecture was added, including modeling code and conversion tools for HuggingFace transformers checkpoints. Additionally, support for the Kimi-K2 model was introduced, utilizing a set_vocab approach similar to HunYuanMoE.
- Utility and Pipeline Improvements: New utilities and build pipelines were introduced to enhance the project's functionality. A utility named `convert-to-train-gguf` was created to convert datasets into the GGUF format, and a Docker-based build pipeline was developed to integrate with Huawei's CANN backend.
- Performance Optimizations: Several pull requests focused on optimizing performance across different backends and operations. A 1D kernel was proposed for the SYCL backend's `set_rows` function, and a new `mul_mat_f16_f32_image` kernel was designed to enhance performance on mobile GPUs.
- Model Configuration and Support: Configuration presets and basic support for new models were added. The Falcon model received configuration presets, and basic support for diffusion models was introduced with the Dream 7B model.
- Backend and Operation Enhancements: Enhancements were made to backend operations and support for new operations. CUDA non-contiguous unary operations were supported, and RTE variants for Vulkan API functions were added.
- Quantization and Optimization Fixes: Fixes and optimizations were applied to improve quantization and performance. A logic flaw in the quantization process was addressed, and micro-optimizations were made to the kv-cache input KQ mask setting process.
- SYCL Backend Improvements: Improvements were made to the SYCL backend, including a rework of batched matrix multiplications and a proof-of-concept for reusing Metal graphs. These changes aimed to optimize performance and handling of non-continuous data.
- Web UI Enhancements: Enhancements to the web UI included modifications to the download function and the addition of a feature for managing settings presets. These changes aimed to improve user experience and functionality.
- Graph Context Refactoring: The graph context was refactored to keep the graph reference within `llm_graph_context`, removing the need to pass it explicitly. This change also involved removing the `llm_graph_result_i` abstraction (a simplified sketch of the pattern follows this list).
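For illustration, here is a much-simplified sketch of the refactoring pattern described in the last item: keeping the compute graph as a member of the context instead of threading it through every builder call. Apart from `llm_graph_context`, which the pull request names, the types, members, and methods below are hypothetical stand-ins rather than the actual llama.cpp declarations.

```cpp
// Simplified, hypothetical sketch of the "keep the graph in the context"
// refactor; the real llama.cpp types are considerably more involved.
struct ggml_cgraph;  // opaque stand-in for the ggml compute graph
struct ggml_tensor;  // opaque stand-in for a ggml tensor

// Before: each builder helper needs the graph passed in explicitly.
struct llm_graph_context_old {
    ggml_tensor * build_attn(ggml_cgraph * gf, ggml_tensor * cur);
};

// After: the context holds the graph reference for the duration of the build,
// so builder methods drop the extra parameter and call sites get simpler.
struct llm_graph_context {
    ggml_cgraph * gf = nullptr;                  // set once when the build starts

    ggml_tensor * build_attn(ggml_tensor * cur); // uses this->gf internally
};
```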
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 45
Key Closed Pull Requests
1. Smoldocling support: This pull request introduces support for the "smoldocling" text model architecture by adding an end-of-token check, updating tensor mappings, and incorporating a regex for a pre-tokenizer, while ensuring compatibility with existing llama architecture and verifying tensor consistency between Hugging Face and gguf formats.
- URL: pull/14597
- Merged: Yes
- Associated Commits: 61cfa, 8c184, 2b209, c3670, 6ea3b, 7e16c, f4a6f, 5050f, fbfcd, 43942, 00d2f, e1b0b, ea265, 52955, 996c5, 6a3ed, bd137, dd6d3, c971e, ca3cc, b9b53, 376ea, 5856f, 5c3d9, 69fa6, f5e4d, e47dc, 037db, 43319, 661ab, a3be5, 85900, db54e, f021d, 3f4c5, 310eb, bcf4a, a4f66
2. llama : support LiquidAI LFM2 hybrid model family: This pull request adds support for the LiquidAI LFM2 hybrid model family, including models LFM2-350M, LFM2-700M, and LFM2-1.2B, as well as the LFM2-Tokenizer and the `ShortConv` operator, while also implementing conversion to gguf and quantization, with the prerequisite of installing the transformers library from source due to the LFM2 models being merged but not yet released.
- URL: pull/14620
- Merged: Yes
3. SYCL: Initial set_rows kernel implementation: This pull request introduces an initial implementation of the `set_rows` kernel for SYCL, specifically targeting fp32 to fp32 and fp32 to fp16 conversions, and while it passes the test-backend-ops, it results in a reduction of decoding speed by approximately 3 tokens per second during inference with LLAMA 3.2 3B when enabled, prompting a need for further discussion on optimizing threads per block and threads per row (a reference sketch of the operation follows this entry).
- URL: pull/14562
- Merged: Yes
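To make the above concrete, here is a plain C++ reference sketch of what a `set_rows` operation computes: scattering source rows into destination rows selected by an index array. This is illustrative only and is not the SYCL kernel from the pull request; the function name and signature are hypothetical.

```cpp
// Reference sketch of a set_rows operation: write each source row into the
// destination at the row index given by `rows`. The pull request covers
// fp32 -> fp32 and fp32 -> fp16; the fp16 variant additionally converts each
// element. Name and signature here are hypothetical, for illustration only.
#include <cstdint>
#include <cstddef>

void set_rows_f32(const float   *src,   // [n_rows_src * n_cols] rows to write
                  const int64_t *rows,  // [n_rows_src] destination row indices
                  float         *dst,   // [n_rows_dst * n_cols] output buffer
                  size_t n_rows_src, size_t n_cols) {
    for (size_t r = 0; r < n_rows_src; ++r) {
        const int64_t dst_row = rows[r];          // where this source row lands
        for (size_t c = 0; c < n_cols; ++c) {
            dst[dst_row * n_cols + c] = src[r * n_cols + c];
        }
    }
}
// A GPU backend parallelizes these loops, e.g. one work-item per element or
// per row; the threads-per-block / threads-per-row tuning mentioned in the
// pull request is about choosing that mapping.
```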
Other Closed Pull Requests
- SmolLM3 Model Introduction: This topic covers the introduction of the SmolLM3 model, which supersedes a previous pull request and includes performance metrics and instructions for enabling a "thinking mode" using the `--jinja` flag. The initial implementation by a contributor is acknowledged, and a fix in the SmolLM3 Jinja template is also addressed by removing the `[:]` from the chat template string.
- README Updates: This topic involves updates to the README file, including adding a section for hot pull requests, removing the roadmap, and updating the title. Additionally, the LFM2 model is added to the models section, acknowledging a previous omission.
- CUDA and Vulkan Enhancements: This topic includes enhancements to the CUDA and Vulkan implementations, such as adding set rows functionality for f32 and f16 data types and optimizing Vulkan's deepseek prompt processing. The Vulkan implementation of flash attention is also optimized by allowing split_k operations with smaller key-value sizes.
- Bug Fixes in CUDA and Vulkan: This topic addresses bug fixes in the CUDA and Vulkan implementations, including issues with ropes with partial rotation and non-contiguous source data. A critical issue with batch splits for recurrent models is also resolved to ensure stability and correctness.
- Documentation and Synchronization: This topic covers the introduction of a script to automatically generate documentation for ggml operations and the synchronization of the ggml component in the llama.cpp project. Specific implementations for Vulkan, such as ggml_roll and bilinear interpolation, are included.
- OpenCL and Vulkan Set Rows Support: This topic introduces the `set_rows` functionality for `f16` and `f32` data types in the OpenCL component and adds support for the SET_ROWS operation in the Vulkan backend. Improvements for selecting the optimal workgroup size and optimizing work distribution are included.
- Model Support and Fixes: This topic adds support for the skt/A.X-4.0 and skt/A.X-4.0-Light models and addresses a fix for the "hunyuan moe chat template." The incorrect shape of `minicpm3 v_states` is also corrected.
- Build Process Improvements: This topic eliminates the manual search for curl libraries in the CMake configuration by using `find_package(CURL)` and addresses several build warnings related to unused variables. These changes simplify the build process and resolve associated errors.
- CUDA FlashAttention and Bilinear Interpolation: This topic introduces 4-dimensional CUDA FlashAttention support and bilinear interpolation for upscaling in CUDA. These features ensure compatibility with existing optimizations and refine the code for better performance.
- Vulkan GPU Selection and ggml_cont Removal: This topic aims to add Vulkan support for selecting an Integrated GPU on AIPC and involves removing unnecessary `ggml_cont()` calls from the codebase. These changes result in a significant performance boost for the affected models.
- GPTNeoX Reversion and CI Timeout Adjustment: This topic aims to revert the removal of GPTNeoX content due to a mysterious issue and addresses issue #14569 by increasing the CI timeout. Plans to revert Vulkan testing back to GPU are also included.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 127 | 15 | 0 | 65 |
| CISC | 46 | 7 | 0 | 90 |
| taronaeo | 108 | 1 | 0 | 2 |
| chraac | 79 | 1 | 0 | 0 |
| jeffbolznv | 38 | 8 | 0 | 23 |
| am17an | 45 | 8 | 2 | 10 |
| JohannesGaessler | 7 | 2 | 0 | 56 |
| ngxson | 23 | 6 | 0 | 17 |
| ryan-mangeno | 33 | 1 | 1 | 4 |
| gabe-l-hart | 32 | 1 | 0 | 4 |