Weekly GitHub Report for Llama.cpp: May 12, 2025 - May 19, 2025 (12:03:53)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, although specific details about those updates were not available when this report was generated.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: HIP backend performs poorly on AMD Ryzen AI MAX 395 (Strix Halo gfx1151): This issue highlights the poor performance of the HIP backend on an AMD Ryzen AI MAX 395 (Strix Halo gfx1151) when compared to other architectures, particularly in terms of token processing efficiency. The user provides detailed benchmarking results and observations, noting that the HIP backend's performance is significantly lower than expected, even when compared to the Vulkan backend, and suggests that the problem may be related to the implementation of matrix multiplication kernels.
- The comments discuss the potential reasons for the poor performance, focusing on the inefficiency of the matrix multiplication kernels used by rocBLAS for gfx115x. Suggestions are made to test alternative configurations, such as using hipBLASLt, which significantly improves performance to match Vulkan speeds. The user also shares additional testing results and files a bug report regarding kernel regressions, while another commenter inquires about benchmarking with different data types. (A reproduction sketch for the hipBLASLt comparison appears just after this list.)
- Number of comments this week: 8
- tutorials : list for llama.cpp: This issue is about creating a list of tutorials for the llama.cpp project, focusing on various topics such as computing embeddings and parallel inference using Hugging Face endpoints. The issue encourages contributions for writing tutorials and provides a list of potential topics and discussions to guide contributors.
- The comments discuss the complexity of frontend development in llama.cpp, emphasizing the need for modern web development practices like ReactJS and Typescript. One comment mentions the removal of a TODO item and encourages direct edits to the original post. Another comment briefly references a discussion on prompt caching.
- Number of comments this week: 3
- Phi-4-mini reasoning CRASH!!! (Vulkan): This issue reports a crash occurring during inference when using the Phi-4-mini-reasoning model with the Vulkan backend on an AMD RX 7600 GPU. The problem is suspected to be related to a pre-allocated tensor operation that the Vulkan buffer cannot execute, as indicated by the relevant log output.
- The comments discuss the possibility of the crash being related to an out-of-memory issue or a 4GB limit, but the original poster clarifies that the GPU has 8GB of VRAM and can run other models that fully utilize the VRAM, suggesting that the issue might not be due to memory limitations.
- Number of comments this week: 2
- Misc. bug: llama-cli stopped starting in release b4191 (c9b00a7): This issue reports a bug where the llama-cli application fails to start after upgrading to release b4191, resulting in an access violation error on Windows 10 Pro with an AMD Ryzen 7 3700X processor. The problem persists across different binary versions of the release, and the user provides detailed logs and a call stack to illustrate the issue.
- The user initially tried removing a Citrix DLL and rerunning the application, but the access violation persisted. A solution was found in a Stack Overflow post suggesting the installation of the latest MSVC redistributable or defining a specific preprocessor macro, which resolved the issue after the user installed the redistributable.
- Number of comments this week: 2
- Misc. bug: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed: This issue reports a bug in the llama.cpp project where using specific context sizes with the Llama-4-Scout-17B-16E-Instruct model causes an assertion failure, leading to the program hanging. The problem occurs when the --ctx-size is set to 10485760 or 5242880, but not when set to 2621440, suggesting a potential issue with memory handling or integer overflow.
- The comments discuss the possibility of an integer overflow causing the issue and request a stack trace to diagnose the problem further. The user inquires about how to obtain the stack trace, indicating a need for guidance on debugging. (A worked overflow example appears just after this list.)
- Number of comments this week: 2
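To make the integer-overflow hypothesis from the last issue concrete, the sketch below computes a hypothetical KV-cache byte count for the three reported context sizes and shows how the result wraps when squeezed into a signed 32-bit integer. The per-token byte figure is an illustrative assumption, not Llama-4-Scout's actual layout.

```python
# Illustrative only: how a KV-cache size computation can overflow a 32-bit int.
# The per-token byte count is an assumed example value, not the model's real one.
INT32_MAX = 2**31 - 1

def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value

bytes_per_token_per_layer = 512  # assumed value, chosen only for illustration
for n_ctx in (2_621_440, 5_242_880, 10_485_760):
    exact = n_ctx * bytes_per_token_per_layer
    print(f"n_ctx={n_ctx:>10}: exact={exact:>14,} bytes, "
          f"overflows int32: {exact > INT32_MAX}, "
          f"wrapped value: {wrap_int32(exact):,}")
```

Under this assumption only the two larger context sizes produce a product that no longer fits in 32 bits, which matches the reported pattern of 2621440 working while 5242880 and 10485760 fail.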
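For the first issue in this list (HIP on Strix Halo), a minimal A/B benchmarking sketch is shown below. It assumes llama-bench is on the PATH, uses a placeholder model path, and assumes rocBLAS can be routed through hipBLASLt via the ROCBLAS_USE_HIPBLASLT environment variable as suggested in the discussion; none of these are verified settings for that hardware.

```python
# Hypothetical A/B run of llama-bench with and without hipBLASLt routing.
# ROCBLAS_USE_HIPBLASLT and the model path are assumptions; adjust for your setup.
import os
import subprocess

MODEL = "model.gguf"  # placeholder path

def run_bench(use_hipblaslt: bool) -> None:
    env = os.environ.copy()
    env["ROCBLAS_USE_HIPBLASLT"] = "1" if use_hipblaslt else "0"
    print(f"--- ROCBLAS_USE_HIPBLASLT={env['ROCBLAS_USE_HIPBLASLT']} ---", flush=True)
    # -p / -n are llama-bench's prompt-processing and token-generation sizes.
    subprocess.run(["llama-bench", "-m", MODEL, "-p", "512", "-n", "128"],
                   env=env, check=True)

if __name__ == "__main__":
    run_bench(False)
    run_bench(True)
```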
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is exhibiting a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The issue has been open for over 413 days, indicating a potentially complex or unresolved problem that requires further investigation and resolution.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for the Hugging Face Candle project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate and non-conflicting progress updates during the download process (a rough sketch of per-download progress callbacks appears after this list).
- kubernetes example: This issue is about the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a popular platform for deploying applications at scale in the industry. The issue has been open for over a year, and while initial work has begun, the contributor is seeking additional help from the community to advance the project.
- gmake[2]: [tests/CMakeFiles/test-tokenizer-0.dir/build.make:107: bin/test-tokenizer-0] Error 1*: This issue involves a build error encountered while compiling the 'llama.cpp' project on Ubuntu 24.04, specifically during the linking stage of the 'test-tokenizer-0' executable. The error is caused by an undefined reference to ggml_backend_blas_reg in the 'libggml.so' library, resulting in a failure to complete the build process.
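As a rough illustration of the per-download progress idea from the download-progress issue above, the sketch below gives each parallel transfer its own labeled progress callback using pycurl, with CURLOPT_NOPROGRESS disabled so the callback fires. It is not llama.cpp's downloader; the URLs and file names are placeholders.

```python
# Sketch only (not llama.cpp's actual downloader): each parallel download gets
# its own labeled progress callback so the indicators do not clobber each other.
import threading
import pycurl

def download(url: str, dest: str, label: str) -> None:
    def on_progress(dl_total, dl_now, ul_total, ul_now):
        if dl_total > 0:
            print(f"[{label}] {dl_now / dl_total:6.1%}", flush=True)
        return 0  # returning non-zero would abort the transfer

    with open(dest, "wb") as f:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        c.setopt(pycurl.NOPROGRESS, False)           # enable progress reporting
        c.setopt(pycurl.XFERINFOFUNCTION, on_progress)
        c.perform()
        c.close()

# Placeholder shard URLs, one thread per shard.
shards = [
    ("https://example.com/model-00001-of-00002.gguf", "shard1.gguf"),
    ("https://example.com/model-00002-of-00002.gguf", "shard2.gguf"),
]
threads = [threading.Thread(target=download, args=(u, d, f"shard {i + 1}"))
           for i, (u, d) in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```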
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 30
Summarized Issues:
- Feature Requests: The llama.cpp project has several feature requests aimed at enhancing its functionality. One request is to incorporate a draft model into the llama-bench tool to evaluate performance speeds and determine optimal configurations. Another request is to support Apple's Fast-VLM multimodal language models, which are noted for their impressive performance. Additionally, there is a request to enable dynamic adjustment of the number of active experts in a Mixture of Experts model per request, allowing users to trade off intelligence for speed at runtime.
- Performance Issues: Several issues highlight performance discrepancies in the llama.cpp project. The mul_mat operation in the GGML library is significantly slower compared to the same operation in llama.cpp, with users seeking insights into the performance difference. Additionally, the llama_server module is reported to be 5% to 10% slower in token generation speed compared to llama_cli, with further performance decrease when enabling Flash Attention. The HIP backend also underperforms on an AMD Ryzen AI MAX 395 compared to the Vulkan backend and other RDNA3 architectures.
- Bugs and Crashes: The llama.cpp project has multiple issues related to bugs and crashes. Users have reported server crashes and segmentation faults when running specific models, potentially due to memory allocation problems or integer overflow. Additionally, there are issues with the llama-server web UI, such as incorrect scrolling behavior and JSON file downloads containing only metadata.
- Model and Training Issues: There are several issues related to model training and functionality in the llama.cpp project. Users have encountered problems with LoRA training approaches and the functionality of finetuned models on specific hardware. Additionally, there are issues with the GGUF conversion process and the integration of VITA 1.5 for multi-modal deployment.
- Tool and Configuration Issues: The llama.cpp project has issues related to tool functionality and configuration errors. Users have reported problems with the llama-cli application failing to start on Windows and the llama-batched-bench script generating no output. Additionally, there are issues with the gguf-new-metadata and gguf-editor-gui tools incorrectly converting integer arrays to INT32 format.
- Backend and Compatibility Issues: Several issues pertain to backend compatibility and functionality in the llama.cpp project. Users have reported problems with the Vulkan backend operation in the GGML library and the llama.cpp application crashing on specific GPUs. Additionally, there are issues with the hf-to-gguf tool failing due to missing tokenizer files and attribute errors.
- Documentation and Usability: The llama.cpp project has issues related to documentation and usability. Users have requested the creation and organization of tutorials for the project, including discussions on computing embeddings and parallel inference. Additionally, there is a potential documentation error regarding the logit-bias feature in the llama-server module.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 33
Summarized Issues:
- Bug in Qwen3 Model with Jinja Templates: The Qwen3 model encounters a bug where llama.cpp fails to parse a chat template due to unsupported list slicing syntax in Jinja, specifically the [::-1] reverse slicing. This leads to errors in template parsing and requires a workaround to function correctly (a small illustration of such a workaround appears at the end of this list).
- SYCL and Vulkan Buffer Allocation Issues: Attempting inference with the Qwen3 Q4_0 model using SYCL on Windows causes the screen to briefly go black and the application to malfunction. Additionally, using the Qwen2.5-VL model on an AMD Ryzen APU under Windows results in a failure to allocate a Vulkan buffer, likely due to device memory allocation limits.
- Server Crashes and Assertion Errors: The rpc-server crashes when started as a background process without specifying a cache, highlighting a warning about exposing the RPC server to an open network. Similarly, the llama-server module crashes with an assertion error when the total number of tokens processed exceeds the n_ctx_slot limit.
- Compilation and CUDA Architecture Errors: A compilation error occurs when replacing an NVIDIA 3090 GPU with a 5070, resolved by specifying a compatible CUDA architecture version. Additionally, a compilation error is encountered when attempting to compile the llama.cpp project with CUDA support on a fresh Ubuntu installation due to the CUDA compiler not being found.
- Segmentation Faults and Memory Allocation Issues: A segmentation fault error occurs when attempting to quantize a Llama-3.2-1B model from F16 to Q4_K_M format using the llama-quantize tool. Additionally, a segmentation fault error occurs when submitting an image to the ggml-org/Qwen2.5-VL-7B-Instruct-GGUF model using llama-server, leading to insufficient memory and a crash.
- Image Recognition and Multimodal Vision Errors: The Qwen2.5-VL model crashes during image recognition on an AMD GPU when the image resolution is set to 1242x881. Additionally, a server error 500 is encountered when attempting to use the llama-server for multimodal vision recognition, indicating that image input is not supported by the server.
- CI Tool and Editorconfig-checker Errors: A false positive error by the CI tool editorconfig-checker incorrectly flags a line in a pull request as having trailing whitespace, despite the absence of such whitespace. This affects the documentation and GitHub modules on a Linux operating system.
- Quantization and Conversion Process Errors: A bug in the conversion process of a finetuned Gemma3 model to the GGUF format results in an AssertionError due to an unexpected non-null value for the "attn_logit_softcapping" parameter. Additionally, the llama-quantize.exe executable crashes on Windows 11 systems starting from version b5298 during the quantization process.
- Web UI and Accessibility Enhancements: Enhancements are proposed for the accessibility of the llama.cpp server's web UI to comply with WCAG 2.2 standards, particularly by improving button labeling for blind users. Additionally, the missing "download" and "delete" buttons in the Web UI of the llama-server module on Windows for version b5392 were relocated to the three dots menu.
- DLL and Wrapper Functionality Issues: DLLs from version b5028 and above of a wrapper for llama.cpp in Pascal Delphi are not functioning as expected, particularly when using CUDA or GPU DLLs. Version b5026 appears to be the last version with valid DLLs, prompting the creation of a test project to investigate and resolve the problem.
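For the Qwen3 Jinja item at the top of this list, the snippet below illustrates, using the Python jinja2 package rather than llama.cpp's built-in template engine, how the unsupported [::-1] slice can be rewritten with the reverse filter; the template fragment is a toy example, not Qwen3's actual chat template.

```python
# Toy illustration with the jinja2 package (not llama.cpp's template engine):
# the same reversed iteration written with [::-1] and with the `reverse` filter.
from jinja2 import Template

messages = [
    {"role": "user", "content": "first"},
    {"role": "assistant", "content": "second"},
    {"role": "user", "content": "third"},
]

slice_form = Template("{% for m in messages[::-1] %}{{ m.content }} {% endfor %}")
filter_form = Template("{% for m in messages | reverse %}{{ m.content }} {% endfor %}")

# jinja2 itself accepts both; a minimal engine may only support the filter form.
print(slice_form.render(messages=messages))   # third second first
print(filter_form.render(messages=messages))  # third second first
```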
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 23
Key Open Pull Requests
1. Granite Four: This pull request introduces comprehensive architecture support for Granite 4.0 in the llama.cpp project, incorporating changes from other branches such as Mamba2 model support and Hybrid cache, while also replacing previous work on Bamba support and addressing outstanding questions related to model architecture, efficiency, and testing with various models.
- URL: pull/13550
- Merged: No
- Associated Commits: 1f0fe, dceff, 2bfe9, aff96, e0491, fa358, 38913, 0e601, 273e7, 7d6cb, 2c77d, 87b97, 03d0e, 7a351, 8b15b, 5b8ec, 62b09, 038d9, 80551, 7d16e, 3bc71, 8d8f0, b4e9c, 1ee6c, c9ecf, 35d06, cf4f0, 6def5, 79199, 94c3d, 929fe, d55b0, e94f3, 2fa19, f7a58, 70e6a, 1b8b2, 01c35, 9c026, 18534, 282ba, 4d7ce, 8f6c7, 259a9, fe618, 6630b, c3652, cef6d, 6fa6d, e8433, 7c1cb, 03333, 662f3, eebb3, b08a8, 2c0f0, fd5a6, 629bd, 589c7, 8ca90, 0304a, 62840, 009f1, 3a837, 12c55, dba46, ab2da, 8f380, 00abc
2. MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now): This pull request proposes a modification to the MLA and FlashAttention implementations in the llama.cpp project, aiming to reduce the KV-cache size by 47% by utilizing only the K-cache and eliminating the V-cache, which is currently tested to work with pull request #13435 and requires further validation with other backends to ensure compatibility with non-contiguous V-cache handling.
- URL: pull/13529
- Merged: No
3. feat(server): Add tool call support to WebUI (LLama Server): This pull request introduces tool calling support to the WebUI of the LLama Server, initially implementing a basic JavaScript interpreter with eval functionality, while also establishing a code structure that allows for future expansion of tool capabilities.
- URL: pull/13501
- Merged: No
Other Open Pull Requests
- Dependency and Compatibility Updates: This topic covers updates to project dependencies and compatibility adjustments for Python versions 3.10 to 3.13, PyTorch 2.5.1, and NumPy 2.1. The pull requests resolve type annotation issues and dependency conflicts to ensure smooth integration with updated libraries.
- SYCL Backend Enhancements: Multiple pull requests focus on improving the SYCL backend, including documentation updates, performance improvements, and addressing compatibility issues. These changes involve removing unnecessary workarounds, updating workflows, and ensuring compatibility with CUDA.
- Model and Configuration Updates: These pull requests introduce new configurations and address issues in existing models, such as the Qwen VL and GLM4 models. They aim to improve user experience, ensure accurate context tracking, and resolve token leakage problems.
- Performance and Optimization Enhancements: Several pull requests focus on performance improvements across different platforms, including CUDA, ARM, and Vulkan. These enhancements involve optimizing kernel support, addressing numerical issues, and improving speed for specific models.
- RPC and Build Improvements: Pull requests in this category aim to enhance RPC functionality and resolve build issues on various platforms. They include enabling RPC for Docker images, addressing race conditions, and ensuring compatibility with OpenBSD.
- Miscellaneous Enhancements: This includes various improvements such as adding new functions, updating the web UI, and setting compiler paths for better system compatibility. These changes aim to streamline development and enhance user interaction with the project.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 73
Key Closed Pull Requests
1. Create c-cpp.yml: This pull request involves the creation of a new YAML configuration file named c-cpp.yml for the GitHub project, which likely pertains to setting up or modifying the continuous integration (CI) workflow for C/C++ projects within the repository.
- URL: pull/13578
- Merged: No
- Associated Commits: 35370, aff9d, 64082, 37b9f, fb28f, 00137, 4ba9d, 66168, 66023, 2016f, 84a9b, 5368d, 1d735, 24345, 7b533, ab47d, dc39a, 65898, 2cca6, eb177, ecda2, 56304, b3b6d, 7604a, 80982, 7c727, 572b3, 13b45, b10d8, c6e8c, 63b49, 87616, 22625, 13be0, 553a5, 514c4, edb18, 558a7, 29535, d5fe4, 77d5e, 47537, 2d451, ca2bb, 59e99, e2914, ced44, c0a97, 85f36, 69699, f0dd6, e5d6c, 43f2b, d0a41, a4c34, 5fa9e, d2b20, fb047, 4e879, 1831f, 43dda, eaea3, 5f5e3, b6ce7, e98b3, 00e3e, 7d3af, cdf76, 5a639, d9d39, e2e1d, 19e89, a0f70, da84c, 5933e, 44cd8, 07c2e, 41631, e5007, 3b127, ceda2, 3e168, 16a45, 6f67c, e1e8e, 99985, 4254b, 8d33d, a7018, 13c9a, 89367, b5769, 99881, b1dd4, b0ecb, fc727, 79f26, e0f57, b6e4f, d7a14, f0578, 8efbd, d24d5, dcf88, fab64, e8477, 2af68, 62608, cb06a, c642b, 074e4, 7d212, 2f567, 3f376, a75cb, b3444, 1d36b, 3bf78, 36667, 3e959, 8ae5e, 8afbd, 93c4e, 9f2da, 86bd6, 6eb7d, 9fdfc, 27aa2, 66645, ae803, 5215b, 9b61a, b34c8, 23346, 90703, a7366, 15a28, 764b8, 2356f, 2f54e, 1e333, f4ed1, 91a86, ffc72, 32916, 141a9, 6c7fd, 4773d, 1f733, 39e73, bc4e1, bba9d, 13b0a, d8794, 814f7, 8733e, f0610, 70a69, 51fb9, 6562e, 0ccc1, 1a844, 8c834, ee01d, f05a6, 15e03, d9c4a, 02115, b486b, 3f96a, 2189f, 05277, efb8b, 5c86c, 27ebf, 0cf67, 611aa, 17512, 33eff, 7c28a, dc1d2, 7fef1, d8919, 05336, b064a, 43dfd, 3b24d, 15e61, d2a4e, 02083, 62d42, a634d, 3eac2, 7f323, 7474e, 09232, 9a390, c1040, 14492, df849, 95e18, a71a4, 22cda, 91159, 064cc, 10d2a, de4c0, f0d46, cf0a4, 1e280, d590c, bf793, b4726, b89d6, 4f711, c252e, f0995, 71bdb, e5c83, ab397, be1d4, 21ca9, d486d, bb168, 24e86, 09d13, 360a9, 05317, 5e7d9, 6da34, b7d26, 4696d, 017f1, f5170, 31984, 5ab5d, e3a94, aa48e, b2838, c753d, 3cc1f, 6c8b9, 9c404, 64bb5, 02cdd, c531e, 07ad2, c6a2c, bc098, 554cd
2. scripts : support arbitrary input file formats in compare-llama-bench.py: This pull request refactors the compare-llama-bench.py script to support arbitrary input file formats by introducing specialized classes for git/file handling, allowing the -i/--input parameter to be used multiple times for loading multiple files, and enabling the script to handle single files as SQLite3, JSON, JSONL, or CSV, while also supporting multiple JSON or CSV files (a toy sketch of this kind of format dispatch appears after this list).
- URL: pull/13455
- Merged: 2025-05-13T13:31:13Z
3. Model: Granite MoE shared: This pull request introduces the GraniteMoEShared architecture to the project, enhancing the existing GraniteMoE model by incorporating a shared expert into each MoE layer, aligning with the implementation found in the Hugging Face Transformers library, and serves as a foundational component for the newly released Granite 4 architecture (a simplified illustration of a shared expert appears after this list).
- URL: pull/13269
- Merged: 2025-05-13T13:12:01Z
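As a companion to the compare-llama-bench.py item above, here is a small, hypothetical sketch of extension-based input dispatch in the same spirit; the function names, the table name, and the merging helper are invented for illustration and are not the script's actual code.

```python
# Hypothetical sketch of extension-based input loading, in the spirit of the
# refactor described above; NOT the actual compare-llama-bench.py implementation.
import csv
import json
import sqlite3
from pathlib import Path

def load_rows(path: str) -> list[dict]:
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in {".sqlite", ".sqlite3", ".db"}:
        con = sqlite3.connect(p)
        con.row_factory = sqlite3.Row
        rows = [dict(r) for r in con.execute("SELECT * FROM test")]  # table name assumed
        con.close()
        return rows
    if suffix == ".jsonl":
        with p.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        with p.open() as f:
            return json.load(f)
    if suffix == ".csv":
        with p.open(newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported input format: {suffix}")

def load_all(paths: list[str]) -> list[dict]:
    """Mirrors the idea of passing -i/--input multiple times and merging results."""
    merged: list[dict] = []
    for path in paths:
        merged.extend(load_rows(path))
    return merged
```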
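And for the GraniteMoEShared item directly above, the PyTorch-style sketch below shows the general idea of a shared, always-active expert whose output is added to the routed experts' output; the dimensions, routing, and module layout are simplified illustrations, not the llama.cpp or Transformers implementation.

```python
# Simplified illustration of an MoE layer with a shared (always-active) expert.
# Dimensions and top-k routing are illustrative, not GraniteMoEShared's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExpert(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 128, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The shared expert processes every token, regardless of the router's choice.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    routed[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return routed + self.shared_expert(x)  # shared expert added on top of routing

x = torch.randn(8, 64)
print(MoEWithSharedExpert()(x).shape)  # torch.Size([8, 64])
```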
Other Closed Pull Requests
- Web UI Enhancements: This topic includes pull requests that introduce new features to the web UI, such as allowing users to paste files directly from the clipboard and upload PDFs. These enhancements aim to improve user experience by providing more flexible file handling options.
- Accessibility Improvements: The pull request under this topic focuses on enhancing the accessibility of the web user interface for visually impaired users. It addresses issues such as unlabeled buttons and non-functional toast messages to improve label readability and navigation.
- Build and Configuration Updates: This topic covers pull requests that address build and configuration issues, such as nondeterministic build results and incorrect build configurations. These updates ensure consistent and reliable builds across different environments.
- Parallel Processing Enhancements: Pull requests in this topic introduce enhancements to the parallel processing capabilities of the project. They add options for non-shared and larger prompts, improve chat formats, and provide new command-line options for better performance.
- Code Optimization and Refactoring: This topic includes pull requests that focus on optimizing the codebase and refactoring existing code. These changes aim to improve memory usage, streamline logic, and enhance performance without compromising compatibility.
- Server and Backend Enhancements: Pull requests under this topic enhance server functionality and backend integration. They introduce features like the --no-prefill-assistant flag and hints for loading models without backends, improving user control and transition to dynamic backends.
- Feature Additions and Improvements: This topic covers pull requests that add new features or improve existing ones, such as the ability for llama-bench to accept ranges for integer parameters and enhancements to pattern matching for tensors. These changes provide users with more flexibility and control.
- Bug Fixes and Issue Resolutions: Pull requests in this topic address and resolve various issues, such as out-of-bounds access in the kv-cache and MKL static linking problems. These fixes ensure the stability and reliability of the project.
- Documentation and Logging Updates: This topic includes pull requests that update documentation and logging systems. These updates enhance the clarity and usability of project documentation and refine exception messages for better debugging.
- Library and Tool Changes: Pull requests under this topic involve significant changes to libraries and tools, such as the removal of the libllava library and the introduction of a new GEMM SME kernel. These changes mark important updates and optimizations in the project.
- Project Synchronization and Integration: This topic covers pull requests that synchronize the project with external repositories and integrate new systems. These updates ensure the project remains up-to-date with the latest changes and improvements.
- Sampling and Benchmarking Enhancements: Pull requests in this topic focus on enhancing sampling and benchmarking capabilities. They introduce support for new sampling methods and improve benchmarking tools for more accurate performance evaluation.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ngxson | 311 | 18 | 7 | 61 |
| ggerganov | 97 | 12 | 2 | 28 |
| JohannesGaessler | 53 | 5 | 3 | 55 |
| slaren | 59 | 7 | 0 | 42 |
| CISC | 52 | 6 | 1 | 44 |
| gabe-l-hart | 46 | 4 | 1 | 3 |
| matteoserva | 36 | 2 | 4 | 5 |
| jeffbolznv | 30 | 4 | 0 | 11 |
| BradHutchings | 40 | 0 | 0 | 0 |
| danielhanchen | 33 | 0 | 1 | 3 |