Weekly GitHub Report for Llama.cpp: March 24, 2025 - March 31, 2025 (12:09:16)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
This release was created on March 29, 2025. No release notes were provided in the data, so specific highlights and changes cannot be summarized here.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- csm : implement Sesame-based conversation example: This issue involves implementing a local example of a Sesame-based conversation model using the CSM model, which is openly available, and integrating it with the existing talk-llama example to support audio generation. The task requires implementing Kyutai's Mimi audio codec similarly to the WavTokenizer and modifying the talk-llama example to enable the use of any LLM for text response generation while utilizing Sesame for speech input and output.
- The comments discuss various aspects of implementing the Mimi codec and CSM model, including challenges with the transformer and vector quantizer, the need for a decoder, and the importance of maintaining conversational context. Contributors share progress updates, links to their work, and suggestions for structuring the implementation, with some focusing on specific components like the encoder and decoder, while others propose creating a dedicated folder for audio models to facilitate future development.
- Number of comments this week: 19
- Feature Request: Qwen 2.5 VL: This issue is a feature request for implementing Qwen 2.5 VL in the llama.cpp project, with the original poster expressing interest in attempting the implementation despite being new to the source code. The motivation for this request is that the current version is not functioning, and the possible implementation could be based on the existing Qwen 2 VL framework.
- The comments discuss various aspects of implementing and testing Qwen 2.5 VL, including waiting for the official paper, sharing progress updates, encountering and troubleshooting errors, and sharing resources like model files and conversion scripts. Users express support, share insights on hardware requirements, and discuss potential improvements and future plans for the project.
- Number of comments this week: 8
- Compile bug: There was a errror while compiling support for the backend Vulkan: This issue involves a compilation error encountered while attempting to compile the llama.cpp project with Vulkan support on a Linux system using an Orange Pi 5. The error is related to outdated Vulkan headers, which are causing specific functions and features to be unrecognized during the build process.
- The comments discuss the outdated Vulkan headers as the root cause of the issue and suggest updating them. The user seeks guidance on updating the Vulkan SDK for an aarch64 architecture on Ubuntu 22.04, with suggestions including installing a Vulkan SDK or using a PPA. Links to potential resources for updating the SDK are shared, but there is uncertainty about the best approach for the aarch64 architecture, with a suggestion to build it manually if necessary.
- Number of comments this week: 8
- Misc. bug: Flash attention on Vulkan: This issue discusses a bug related to Flash Attention operations not being fully supported by the Vulkan backend, causing them to fall back to CPU processing, which affects performance on certain models and hardware configurations. The user is experiencing this issue on a Linux system with an AMD Radeon RX 6700 XT GPU and is seeking clarification on whether this is a known limitation or specific to their setup.
- The comments reveal that Flash Attention in Vulkan is currently only implemented for Nvidia GPUs using a specific beta driver, and the issue is due to limitations in the VK_KHR_cooperative_matrix extension. There is a discussion about the potential for future extensions or implementations that could address this, but it is noted that such developments are complex and not prioritized at the moment. The user acknowledges the explanations and suggests closing the issue or using it as a placeholder for future feature requests.
- Number of comments this week: 6
- Compile bug: SYCL backend build fail on debug config: This issue involves a compilation bug in the SYCL backend of a project when building in debug configuration on Linux, where the build fails due to a SYCL kernel attempting to call an undefined function without the SYCL_EXTERNAL attribute. The problem does not occur in the release configuration, and the user has provided specific steps and commands to reproduce the issue, highlighting the error message encountered during the build process.
- The comments discuss a workaround for the issue, suggesting disabling assertions in the debug build to bypass the error, although this is not considered a proper fix. It is noted that building with "Debug" is not recommended for SYCL kernels running on GPUs, and alternative debugging methods using gdb are suggested. The workaround is confirmed to work by the original poster, who expresses gratitude and a willingness to share any better solutions found in the future.
- Number of comments this week: 6
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error. It is noted that this error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Feature Request: Task Cancellation on Client Disconnection: This issue is about a feature request to enhance the current embedding server setup by implementing task cancellation when a client disconnects, as the existing system continues processing queued tasks even after a client cancels a request, leading to inefficiencies and potential server overload. The proposed modification aims to terminate task processing upon request cancellation to prevent delays in processing subsequent requests, especially in scenarios where a client makes numerous requests and then disconnects, potentially paralyzing the server.
- Question: How to generate an MPS gputrace: This issue is a query about generating a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for the Hugging Face Candle project. The user is seeking guidance on whether there is a documented or known method to produce this type of debugger output, similar to what is available in the Xcode Metal Debugger.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate and non-conflicting progress updates during the download process (a minimal libcurl sketch follows this list).
- Server: Add prompt processing progress endpoint?: This issue proposes the addition of a new server endpoint to provide real-time updates on the progress of prompt processing, which would include details such as whether processing is ongoing, the length of the prompt in uncached tokens, and the number of tokens remaining to be processed. The motivation behind this feature is to offer users insight into the progress of longer or slower prompt processing tasks, potentially benefiting other projects beyond the server in question.
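The CURLOPT_NOPROGRESS option mentioned above is standard libcurl behavior rather than anything specific to llama.cpp. As a minimal, illustrative C++ sketch (the callback and struct names are invented for this example, not taken from the project), enabling the per-transfer progress callback is what allows each parallel shard download to report its own, non-conflicting progress:

```cpp
#include <curl/curl.h>
#include <cstdio>

// Illustrative per-download state; real code would keep one per shard.
struct download_state {
    int shard_id;
};

// Progress callback invoked by libcurl; returning non-zero aborts the transfer.
static int progress_cb(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    auto * st = static_cast<download_state *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "shard %d: %3.0f%%\r",
                     st->shard_id, 100.0 * (double) dlnow / (double) dltotal);
    }
    return 0; // keep downloading
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    download_state st{0};

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L); // 0 enables the progress callback
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, progress_cb);
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, &st);

    const CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```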
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 26
Summarized Issues:
- Memory Management Issues: The llama.cpp project faces several memory-related challenges, including a potential memory allocation leak in the llama-server module, which causes the system to slow down significantly after long prompts. Additionally, there are memory allocation failures on CUDA devices due to attempts to allocate excessively large buffer sizes, leading to out-of-memory errors.
- Model Conversion and Compatibility Problems: Users encounter various issues when converting models to the GGUF format, such as unsupported model types and significant accuracy drops. These problems highlight the need for enhanced support in conversion scripts to handle different model architectures and maintain performance.
- Performance and Regression Issues: The llama.cpp project experiences performance regressions in various backends, such as Vulkan and SYCL, leading to decreased processing speeds and increased variance. These issues often require specific configuration adjustments or code modifications to mitigate their impact.
- Crashes and Errors in Execution: Several issues report crashes and errors during execution, such as segmentation faults and runtime errors, often due to specific configurations or hardware setups. These problems necessitate debugging and potential code fixes to ensure stable operation across different environments.
- Feature Requests for Model and Script Enhancements: Users request new features to support additional models and improve existing scripts, such as adding support for the Qwen2.5-Omni-7B model and implementing interleaved sliding window attention. These enhancements aim to expand the project's capabilities and optimize performance.
- Compilation and Build Failures: The project encounters various compilation and build failures, often due to outdated dependencies or specific build configurations. These issues require updates to headers or adjustments in the build process to ensure successful compilation across different platforms.
- Script and Model Execution Bugs: Bugs in scripts and model execution, such as assertion errors and incorrect data handling, disrupt the expected workflow and require debugging to resolve. These issues highlight the need for thorough testing and validation of code changes.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 52
Summarized Issues:
- Feature Requests: The llama.cpp project has received multiple feature requests aimed at enhancing its capabilities. These include requests for supporting new models like SmolVLM and Qwen2Model, as well as making the project compatible with gaming consoles like PS5 and Xbox. Additionally, there are requests for improving performance by offloading specific layers and supporting new parameters for user-controllable generation.
- Compilation and Build Issues: Several issues have been reported regarding compilation and build failures in the llama.cpp project. These include errors related to outdated toolkits, missing libraries, and platform-specific problems, which often require updates or configuration changes to resolve.
- Bugs in Model Execution: The project has encountered various bugs during model execution, affecting different components like the llama-cli, llama-server, and quantization processes. These bugs often result in crashes, incorrect outputs, or performance issues, and require debugging and code fixes to address.
- Related issues: issues/10929, issues/11078, issues/11704, issues/11764, issues/11799, issues/11823, issues/11825, issues/11828, issues/11829, issues/11841, issues/11868, issues/12277, issues/12341, issues/12433, issues/12474, issues/12504, issues/12517, issues/12528, issues/12542, issues/12561, issues/12567, issues/12572, issues/12574, issues/12587, issues/12588, issues/12596, issues/12614, issues/12644, issues/12647
- Backend and Performance Issues: The llama.cpp project has faced backend and performance-related issues, including GPU and CPU resource allocation problems, memory usage inefficiencies, and backend-specific errors. These issues often require optimizations and configuration adjustments to improve performance and stability.
- Conversion and Compatibility Issues: There are several issues related to model conversion and compatibility within the llama.cpp project. These include errors in conversion scripts, compatibility problems with newer software versions, and issues with specific model formats, which often require script updates or alternative methods to resolve.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 16
Key Open Pull Requests
1. ggml-quants : weighted rounding algorithms with cumulative search: This pull request introduces weighted rounding algorithms with cumulative search to improve the quantization process in the llama.cpp project, enhancing the accuracy and compatibility of various quantization types such as TQ1_0, TQ2_0, Q3_K, IQ4_NL, and IQ4_XS, while maintaining backward and forward compatibility with other versions, and includes detailed performance evaluations and algorithmic visualizations to demonstrate the improvements.
- URL: pull/12557
- Merged: No
- Associated Commits: dd6b8, d0060, 6f7fe, f27c1, 0c9e4, 30ad9, 3be11, f86b8, 3e4b6, af23a, a4113, 8b8b8, a5b19
2. Add Yandex instruct model template support: This pull request aims to integrate support for a custom chat template for Yandex's 8B instruct model into the llama.cpp project, ensuring compatibility and functionality as verified by local testing, without impacting other parts of the project.
- URL: pull/12621
- Merged: No
3. (draft) tts: Sesame support: This pull request is a draft for adding Sesame support to the project by translating safetensor models to the gguf format, implementing necessary changes to the safetensor configuration, and splitting the models for conversion, with ongoing verification and adjustments to ensure accuracy.
- URL: pull/12549
- Merged: No
Other Open Pull Requests
- Trillion-7B-preview Model Support: This topic covers the introduction of support for the Trillion-7B-preview model, a large language model compatible with the Llama architecture. The changes primarily focus on updating the tokenizer to support multiple languages including English, Korean, Chinese, and Japanese.
- BailingMoE (Ling) Model Support: This topic involves adding support for BailingMoE (Ling) models to the llama.cpp project. It includes several specific models hosted on Hugging Face, with a note that the Ling-plus model remains untested due to its size.
- Vulkan Backend Enhancements: This topic includes multiple enhancements to the Vulkan backend, such as implementing split_k for cooperative matrix flash attention and introducing a hybrid approach to reduce fence latency. These changes aim to optimize performance and improve throughput in various scenarios.
- ggml-sycl Backend Configuration: This topic covers the capability to configure and compile the ggml-sycl backend of llama.cpp as a Visual Studio project/solution on Windows. It ensures compatibility with the Intel official compiler and includes updates to CMake configuration and documentation.
- Hugepage Memory Allocation: This topic introduces hugepage memory allocation with page sizes of 2M or 1G to llama.cpp's memory-mapped model loading. This can significantly accelerate the loading of large models when the system has sufficient RAM to pre-allocate hugetlbfs model files (a minimal sketch of the underlying mmap flags follows this list).
- Bfloat16 Matrix Multiplication in Vulkan: This topic introduces support for bfloat16 matrix multiplication in Vulkan using the VK_KHR_shader_bfloat16 extension. It highlights the necessity of this extension for cooperative matrix multiplication and the requirement for a custom build of glslc.
- Matrix Multiply Assist for PowerPC: This topic aims to enable Matrix Multiply Assist (MMA) for BF16 data types on PowerPC architecture. The commit messages indicate a final version, but the pull request has not yet been merged.
- OpenCL Backend Enhancements: This topic includes enhancements to the OpenCL backend, such as adding support for multiple devices and addressing a compilation issue in the ggml-opencl_mm.cl file. These changes prioritize platforms with GPUs and ensure thread safety.
- Universal Assisted Decoding: This topic implements universal assisted decoding in the llama-server, allowing speculative decoding between a draft model and a main model with incompatible tokenizers. It suggests potential improvements like token healing and caching the translation process.
- Console Output Refactoring in llama-tts: This topic refactors the console output of llama tokens in the llama-tts project by replacing printf() with LOG_INF. It allows the --log-disable option to suppress token printing and the --log-file option to direct the output to a specified log file.
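The hugepage pull request summarized above ultimately comes down to which flags back the model allocation. The following sketch is not the pull request's code; it is a minimal illustration, under the assumption of a Linux system with pre-reserved huge pages, of how MAP_HUGETLB requests hugepage-backed memory and falls back to regular pages when none are available (an explicit page size can be selected with MAP_HUGE_2MB or MAP_HUGE_1GB from <linux/mman.h>):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t size = 1ull << 30; // 1 GiB buffer, purely illustrative

    // Try to back the allocation with huge pages; this requires pages reserved
    // beforehand, e.g. via /proc/sys/vm/nr_hugepages.
    void * buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB)");
        // Fall back to regular pages so loading still works without huge pages.
        buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            std::perror("mmap");
            return 1;
        }
    }

    // ... read model tensors into buf ...

    munmap(buf, size);
    return 0;
}
```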
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 58
Key Closed Pull Requests
1. Work in progress: This pull request involves renaming and updating various documentation files, such as README.md and BUILD-INSTRUCTIONS.md, to improve clarity and differentiation, alongside addressing issues in server.cpp related to argument file handling and Cosmo STL compatibility, although it was ultimately not merged.
- URL: pull/12550
- Merged: No
- Associated Commits: 82bb6, 2afcf, 72519, 95962, 767b2, 70125, b9bb5, a9ed7, bcc24, 176a2, 4bd01, 5d49e, 01b47, bb91f, 5f728, 84840, 1ed1a, c4537, 140b9, 32005, 4cb5f, 83ff3, 2f273, 24227, 3725f, 84b1f, bdaa0, 341c9, e5b05, dfe5b, 502e1, 7ca55, 12234, 145f8, 9e05f, 326c1, c197e, ee567, 1f8b9, 43e47, 44ed9, c584c, 86093, 93368, b2ffd, 7402b, c58ea, b7d30, 1abf3, 54502, 6d821, 83755, 85f06, f4fa3, 857b1, 8eb95, 90750, 096d7, dd774, 40643, 83c11, 8a3cc, a7c55, 1bed3, 34af8, 74369, df8c5, 24954, cd0b7, 4232c, 0e0b7, 6797f, 118b6
2. Add PLM GGUF Conversion & Inference Support: This pull request introduces support for converting and inferring the PLM-1.8B-Instruct model from Hugging Face to the GGUF format, incorporating features such as Sparse FFN with Squared ReLU and Multi-head Latent Attention (a brief sketch of the Squared ReLU activation follows this list), and has been successfully tested with quantized versions of the model.
- URL: pull/12457
- Merged: 2025-03-27T10:49:15Z
- Associated Commits: 563ec, f006d, c14ca, 1a47c, 21ed7, 9a542, 08b5a, 7813d, 731ed, b808f, f687e, 5646e, 25188, ff3d9, 444df, 22d35, 93cf1, 850d3, 42356, 0fcce, 55b86, 69d61, a7f4a, 06690, 9d47a, 95de3, d7a2f, 91f06, 4bd85, 5f754, cd460, 6d3ac, 0b8de, 64652, f5b52, 23391, 7772d, e9c7f, 1ec1c, 82889, 3a079
3. Add simple-tts example: This pull request introduces a new example called "simple-tts" to the llama.cpp project, which differs from the existing "tts" example by eliminating the dependency on common.h and instead utilizing only llama.h, as evidenced by a series of commits that include the development of functions, prompts, model and context loading, and various fixes and improvements.
- URL: pull/12261
- Merged: No
- Associated Commits: 0a001, 6ed17, 6429d, 2b206, 4f40f, 0b864, 6b25a, d39b6, 9c728, 8690e, 31127, e3de6, b97e3, 5a315, d087b, 0ce5d, 67ee4, 75b0b, 26b97, 08a1d, 40c02, fb28c, fac17, 8fb16
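The PLM pull request above (item 2) mentions Sparse FFN with Squared ReLU. As a point of reference only, and not code from that pull request, squared ReLU is simply the standard ReLU activation squared, applied element-wise to the feed-forward hidden state; a minimal sketch:

```cpp
#include <algorithm>
#include <vector>

// Squared ReLU: f(x) = max(0, x)^2, applied element-wise.
inline float squared_relu(float x) {
    const float r = std::max(x, 0.0f);
    return r * r;
}

// Illustrative helper applying the activation to an FFN hidden state.
void apply_squared_relu(std::vector<float> & hidden) {
    for (float & v : hidden) {
        v = squared_relu(v);
    }
}
```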
Other Closed Pull Requests
- GitHub Actions Workflow Automation: This pull request introduces a new GitHub Actions workflow to automate syncing the upstream repository and updating the dev branch. The workflow is scheduled to run daily at midnight UTC and can also be triggered manually to ensure the dev branch remains current with the latest upstream releases.
- FA and MoE Component Enhancements: Enhancements to the FA and MoE components include support for different head sizes in FA kernels and optimization of FA-vec kernels for specific head sizes and large contexts. These changes improve the condition for using the mat-mat version of mul_mat_id and provide significant performance improvements for quantized KV cache across various models.
- Matrix-Vector Code Refactoring: Refactoring the matrix-vector code in the "metal" component prepares for dynamic threadgroup allocation, resulting in performance improvements for certain models. Benchmark tests show speedups in processing times for models like llama 7B Q5_K_M and qwen2 1.5B IQ4_NL - 4.5 bpw.
- Support for New gfx1200 and gfx1201 Targets: This pull request adds support for the new gfx1200 and gfx1201 targets, addresses code review comments, and includes fixes for fp32 to fp16 to fp32 conversions on RDNA4. The changes were successfully merged on March 26, 2025.
- Compilation Warnings Resolution in MUSA Component: This pull request resolves all compilation warnings in the MUSA component and re-enables the -DLLAMA_FATAL_WARNINGS=ON flag in the CI script run.sh. It also updates documentation to include ccache installation, improving the build process and code quality.
- Enhancements and Fixes for Tensor Data and Cache: Enhancements include sending a hash when tensor data exceeds a threshold and storing cache under the user's home directory. The pull request also attempts to fix build issues on Windows 32-bit systems and removes a dependency on the llama library.
- Multiple Updates and Improvements to ggml Project: This pull request includes enhancements to the command.wasm example, adjustments to build instructions, suppression of compiler warnings, and refactoring of CPU operators. These changes aim to improve code quality and functionality.
- Support for 128-bit RISC-V V Extension: The pull request introduces support for 128-bit RISC-V V extension architectures by adding vec_dot compatibility and implementing dynamic kernel selection. It also enhances k-quant kernel performance and incorporates the RISC-V Zfhmin extension for float16 data type conversions.
- OpenCL Implementation Enhancements: This pull request adds support for multi and vision modes for rope, as well as the gelu_quick and im2col functionalities, to the OpenCL implementation in the llama.cpp project.
- Decoder Implementation for Kyutai's Mimi Model: Implementing a decoder for Kyutai's Mimi model within the llama.cpp project, this pull request includes tasks such as implementing the decode_frame function and testing with audio codes from Sesame. It also provides instructions for converting the model to GGUF format.
- Build Failure Resolution in Vulkan Project: This pull request addresses a build failure issue in the Vulkan project by ensuring the status of coopmat and coopmat2 support is correctly passed during cross-compilation. It enables proper shader generation and is tested on both native and cross-compiling environments.
- Hyperparameters Initialization Fixes: This pull request addresses the initialization issue of hyperparameters for the Mistral3 and Gemma3 models by implementing fixes across multiple commits. It includes setting positional arguments correctly and utilizing existing hyperparameters if provided.
- SYCL Backend memset Interface Implementation: Implementing the missing memset interface for the ggml backend buffer in the SYCL environment, this pull request addresses an issue discovered during debugging of test-opt CI failures.
- LoRA Adapters Compatibility Fix: This pull request ensures that LoRA adapters are now loaded into the default CPU buffer type, fixing issue #12587. It references a previous discussion in pull request #12181.
- Speculative Decoding Statistics in timings Object: New fields are added to the timings object to include speculative decoding statistics, such as the number of draft tokens generated and accepted. The server console output is updated to display the draft acceptance rate.
- Fedora CUDA Guide Updates: This pull request involves updating and improving the Fedora CUDA guide by relocating it to the backend folder and enhancing its content with various improvements and clarifications.
- mul_mat_id Function Fixes: Addressing issue #12528, this pull request fixes the mul_mat_id function to correctly handle the Q8_K type and adjusts the IQ4_NL parameter type to Q8_0. It also includes improvements to code indentations and repack templates.
- Quantization Error Fix in CUDA Backend: This pull request addresses a quantization error in the llama-llava-clip-quantize-cli by moving the quantization processes to the CPU backend. It fixes errors related to accessing video memory and adjusts function implementations.
- Bug Fix for "Squeeze" Operation on ssm_conv Tensors: This pull request resolves a bug fix for the "squeeze" operation on ssm_conv tensors, addressing an issue referenced in a previous pull request (#10784) and documented in issue #12572.
- Synchronization of 'ggml' Component: This pull request involves synchronizing the 'ggml' component, including merging PowerPC build commands via CMake, and was successfully merged on March 27, 2025.
- Synchronization of 'ggml' Component with Script Updates: This pull request involves synchronizing the 'ggml' component, including updates to scripts and fixes to CMake merge issues.
- Unix Socket Support for Example Server: This pull request introduces the capability for the example server to listen on a Unix socket: if the --host parameter ends with .sock, the server creates and listens on a Unix socket instead of a TCP socket (a brief sketch of this check follows this list).
- Consolidation of fmt and format Functions: This pull request consolidates the fmt and format functions into a single function to optimize the code by eliminating redundancy and reducing the need for multiple buffers.
- Verbose Output Consistency in Streaming Mode: This pull request addresses the inconsistency in the verbose output of the /chat/completions and /v1/completions endpoints by adding the "__verbose" field to the server response, aligning the streaming behavior with the non-streaming behavior.
- CPU Matrix Multiplication Kernels for ppc64le ISA: Implementing CPU matrix multiplication kernels for the ppc64le ISA using MMA builtins, this pull request enhances matrix multiplication between quantized datatypes and results in a 5% to 50% speed improvement across various batch sizes.
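For the Unix-socket item above, the mechanism is plain POSIX socket handling rather than anything llama.cpp-specific. The sketch below is only an illustration of the described behavior (the helper names are hypothetical and the actual server code may differ): it checks whether the --host value ends in .sock and, if so, binds a Unix-domain socket instead of a TCP one.

```cpp
#include <string>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper: true when the --host value names a Unix socket path.
static bool is_unix_socket_path(const std::string & host) {
    const std::string suffix = ".sock";
    return host.size() >= suffix.size() &&
           host.compare(host.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: create and bind a listening Unix-domain socket.
static int listen_on_unix_socket(const std::string & path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);

    unlink(path.c_str()); // remove a stale socket file from a previous run

    if (bind(fd, reinterpret_cast<const sockaddr *>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 16) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main() {
    const std::string host = "/tmp/llama-server.sock"; // e.g. --host /tmp/llama-server.sock
    const int fd = is_unix_socket_path(host) ? listen_on_unix_socket(host) : -1; // TCP path elided
    if (fd >= 0) {
        close(fd);
    }
    return 0;
}
```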
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ggerganov | 87 | 30 | 2 | 140 |
ngxson | 102 | 11 | 0 | 122 |
ochafik | 88 | 8 | 0 | 50 |
zhouwg | 94 | 5 | 2 | 39 |
CISC | 45 | 16 | 0 | 39 |
BradHutchings | 73 | 2 | 0 | 0 |
bandoti | 39 | 1 | 0 | 12 |
jeffbolznv | 17 | 14 | 0 | 16 |
Rbiessy | 15 | 1 | 0 | 30 |
No author found | 45 | 0 | 0 | 0 |