Weekly GitHub Report for Llama.cpp: March 31, 2025 - April 07, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 did not include release notes in the available data, so specific changes, notable highlights, and trends cannot be summarized here.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Cannot compile SYCL backend (`SYCL_LIBRARY=SYCL_LIBRARY-NOTFOUND`) as per documentation: This issue involves a user encountering a compilation error when attempting to compile the SYCL backend with the Intel oneAPI Base Toolkit version 2025.1.0: the SYCL library is not found, contrary to the documentation. The user followed the documented steps for Windows, but the process failed due to missing SYCL library support, leading to errors in the CMake configuration related to the `IntelSYCL` package.
- The comments identify the issue as specific to oneAPI 2025.1, with suggestions to use oneAPI 2025.0 as a workaround or to apply a patch to the `IntelSYCLConfig.cmake` file. Users share their experiences and solutions, including modifying the `SYCL_FEATURE_TEST_EXTRACT` function and addressing performance issues. The conversation also touches on potential environment-related causes and the need for a clean build or a CMake reinstallation to resolve the problem.
- Number of comments this week: 11
- When will llama.cpp's vulkan provide support for Intel Arc's matrix core?: This issue is about a user inquiring when the llama.cpp project will provide Vulkan support for Intel Arc's matrix core, highlighting the current lack of support and performance issues with Intel's implementation. The discussion reveals challenges with the `VK_KHR_cooperative_matrix` extension on Intel hardware, which currently results in reduced performance and incorrect outputs, and mentions ongoing efforts and limitations in addressing these issues.
- The comments discuss the performance and implementation challenges of the `VK_KHR_cooperative_matrix` extension on Intel hardware, with users sharing their experiences and testing results. There is mention of driver issues and incomplete support in the kernel, with some users trying alternative drivers without success. The conversation also touches on the hope for broader adoption of improved extensions like coopmat2, which are currently vendor-specific.
- Number of comments this week: 8
- Feature Request: llama 4: This issue is a feature request for the integration of Llama 4, a newly released multimodal large language model (LLM), into the ggml-org/llama.cpp project. The request highlights the potential benefits of using Llama 4, such as its improved multimodal capabilities and the availability of its technical details and weights.
- The comments discuss the differences between Llama 4 and its predecessor, Llama 3.3, including architectural changes and performance improvements. There is anticipation for more details to be revealed at an upcoming event, LLAMACon. Some comments provide technical insights into the model's architecture, such as interleaved attention layers and chunked attention, while others share links to related resources and forks.
- Number of comments this week: 7
- Compile bug: compilation warnings (clang) Introduced in #10558: This issue reports compilation warnings generated by the MUSA backend when compiled with Clang, specifically in the file `ssm-conv.cu`, which were introduced in a previous commit. The warnings include casts from `const float *` to `char *` that drop the `const` qualifier, as well as unused parameters, neither of which NVCC reports by default.
- The comments discuss the origin of the issue, with a plan to fix it later in the week. There is a discussion about whether the warnings are specific to the MUSA backend, concluding that they are likely compiler-related, since Clang reports these warnings while NVCC does not. A request is made to test a branch for warnings before submitting a fix, and instructions are provided for verifying the issue using a Docker container.
- Number of comments this week: 6
- Eval bug: Jinja not replacing `date_string`: This issue reports a bug in the Llama project where the Jinja template engine does not replace the `date_string` variable as expected. The problem occurs when running the Llama server with specific configurations, and the user suggests using the `strftime_now` function to address the issue, while also discussing potential workarounds and improvements.
- The comments discuss the inability to pass variables simply, suggesting the use of `strftime_now` for date replacement. There is a conversation about whether this feature will be supported, with a contributor expressing interest in making the time overridable and synced with `strftime_now`. The discussion also touches on the use of variables in other models and the limitations of `strftime_now` with certain date formats.
- Number of comments this week: 5
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows a `GGML_OP_GET_ROWS` error: This issue pertains to a problem with the Kompute-based Vulkan backend in a GitHub project, where it triggers a `GGML_OP_GET_ROWS` error. The error does not occur with other Vulkan backends, indicating a compatibility or implementation issue specific to the Kompute-based approach.
- Feature Request: Task Cancellation on Client Disconnection: This issue is a feature request for the current embedding server setup, aiming to implement task cancellation when a client disconnects to prevent unnecessary processing of queued tasks, which can lead to inefficiencies and potential server overload. The request highlights the need for the server to terminate task processing upon request cancellation, ensuring that new requests are processed promptly without delay, especially during high-load scenarios.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference. The user is working on improving the Metal backend for a project and is looking for documented methods or known practices to obtain debugger output similar to what is provided by the Metal Debugger in Xcode.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts between the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate, non-conflicting progress updates during the download process (a minimal libcurl sketch follows this list).
- Prompt eval is 5x slower than in Ollama and maxes out the CPU: This issue highlights a significant performance discrepancy between the `llama.cpp` and `ollama` implementations when running the same Q4_K_M model on similar hardware, with `ollama` achieving a prompt evaluation rate five times faster than `llama.cpp`. The user notes that despite both implementations utilizing the GPU, `llama.cpp` maxes out CPU usage during prompt evaluation, and differences in buffer sizes and graph splits may contribute to the performance gap.
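To make the `CURLOPT_NOPROGRESS` fix concrete: libcurl only invokes a transfer's progress callback when that option is cleared, so a parallel downloader must manage the flag and a per-transfer callback itself. Below is a minimal standalone sketch of that wiring, not llama.cpp's actual downloader; the URL and shard label are placeholders.

```cpp
#include <curl/curl.h>
#include <cstdio>

// Per-transfer progress callback; clientp carries a label so concurrent
// shard downloads can report without clobbering each other's status.
static int progress_cb(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char *label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "\r%s: %3d%%", label, (int)(100 * dlnow / dltotal));
    }
    return 0; // returning non-zero aborts the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h) return 1;
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, progress_cb);
    curl_easy_setopt(h, CURLOPT_XFERINFODATA, (void *)"shard 1/2");
    curl_easy_setopt(h, CURLOPT_NOPROGRESS, 0L); // 0 enables the callback; 1 suppresses it
    const CURLcode rc = curl_easy_perform(h);
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```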
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 25
Summarized Issues:
- Compilation Errors and Warnings: Users have reported various compilation issues across different environments and compilers. These include errors with g++ on Linux due to invalid type conversions, Clang warnings in the MUSA backend, and MSVC 2022 issues with `char8_t` conversions. Additionally, there are problems with Intel's oneAPI and the Accelerate backend on Mac, causing build failures.
- Feature Requests for Model Support: There are multiple requests for adding support for new models in the llama.cpp project. These include the StarVector-8b/1b model, Qwen2.5-Omni model, and Llama 4, each highlighting the need for enhanced capabilities in handling different data types and improved processing efficiency.
- Performance and Optimization Issues: Users have encountered performance degradation in various scenarios, such as the RDNA4 prefill process and Q4_K model weight repacking. These issues are often linked to specific hardware or software configurations, with suggestions for optimization and parallelization to improve speed.
- Bugs in Execution and Functionality: Several bugs have been reported affecting the execution of models and functions. These include issues with the `llama_tokenize` function, Vulkan backend memory preferences, and the `llama-quantize` module causing crashes. Additionally, there are problems with the `trim` method and the Jinja template engine not functioning as expected.
- GPU and Hardware Compatibility Issues: Users have faced challenges with GPU memory usage and compatibility, such as excessive GPU memory usage with Vulkan and CUDA errors on specific GPUs. These issues often require workarounds or hardware-specific solutions to resolve.
- Backend and Platform Support Issues: There are issues related to backend support and platform compatibility, such as the lack of support for Mac Catalyst and Intel Arc's matrix core. These issues highlight the need for broader platform support and improved backend implementations.
- Model Execution and Loading Errors: Users have reported errors related to model execution and loading, such as the Qwerky 72B model failing to load with specific options and the llama-server model insisting on GPU usage. These issues often require adjustments in execution parameters or configurations.
- Feature Requests for API Enhancements: There are requests for enhancements in the LLAVA_API, such as methods to return image token counts, which are crucial for managing complexity in multimodal models. These requests aim to improve the usability and functionality of the API.
- Execution and Performance Bugs: Bugs affecting execution and performance, such as system hangs with long prompts and performance issues with specific settings, have been reported. These issues often require detailed investigation and potential codebase changes to resolve.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 14
Summarized Issues:
- Segmentation Fault and Runtime Errors in Qwen2-VL Models: Issues have been reported regarding segmentation faults and runtime errors when using the Qwen2-VL models on different platforms. On a Mac with an M3 Max processor, a segmentation fault occurs with the Metal backend, while a runtime error is encountered due to missing metadata in the GGUF format model file.
- Bugs in GGML Backend and Llama-CLI Tool: The GGML backend and llama-cli tool have several bugs affecting output and performance. On Mac, the output is a repetitive sequence of '88888888', and on another occasion, repeated log messages indicate issues with KV cache updates, impacting model performance.
- Model Loading and Tensor Shape Mismatch Errors: Loading models on different systems has led to errors, such as a tensor shape mismatch in the Qwerky QwQ 32B model on Windows with CUDA, which was resolved by reconverting the model.
- Feature Requests and Activation Functions: A feature request has been made to support Scaled ReLU or SwiGLU activation functions in the DeepSeek-V3 model. The lack of these functions causes script failures and is believed to enhance model accuracy.
- Build Failures and Configuration Issues: Build failures and configuration issues have been reported, such as RISCV cross-compile warnings requiring a GCC upgrade and a CMake configuration failure with the SYCL backend due to filesystem mount options.
- Runtime and Performance Issues on ARM and Vulkan: Runtime issues on ARM processors and performance regressions in Vulkan have been noted. The Q4_0 quantized models fail on ARM, and Vulkan's token processing speed decreased on Iris Xe graphics.
- Bugs in LlamaSharp and Vulkan Buffer Allocation: LlamaSharp software has a bug in ubatch preparation on Windows with CUDA, and Vulkan faces buffer allocation failures due to device memory limits, affecting model execution.
- Command Option Bugs in Llama.cpp: The `examples/gguf-split` command has a bug where the `--merge` operation does not respect the `--dry-run` option, unlike the `--split` operation, leading to inconsistencies in command execution.
- Tokenization and Special Token Handling: Slow tokenization times in the Gemma 3 model are due to inefficient handling of special tokens, which can be improved by sorting tokens or applying a patch to reduce execution time.
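The tokenization item above describes a classic hot spot: testing every special token at every text position. A hypothetical sketch of the ordering idea, bucketing tokens by first byte and sorting each bucket longest-first, is shown below; it illustrates the general technique only and is not the actual llama.cpp patch.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <string_view>
#include <vector>

// Instead of trying all special tokens at every byte offset, keep them
// bucketed by first byte; only plausible candidates are tested, and sorting
// each bucket longest-first makes the first hit the longest match.
struct special_matcher {
    std::array<std::vector<std::string>, 256> buckets;

    explicit special_matcher(std::vector<std::string> tokens) {
        for (auto &t : tokens) {
            if (!t.empty()) buckets[(unsigned char)t[0]].push_back(std::move(t));
        }
        for (auto &b : buckets) {
            std::sort(b.begin(), b.end(), [](const std::string &a, const std::string &c) {
                return a.size() > c.size();
            });
        }
    }

    // Longest special token starting at text[pos] (pos < text.size()),
    // or an empty view if none matches.
    std::string_view match(std::string_view text, size_t pos) const {
        for (const auto &t : buckets[(unsigned char)text[pos]]) {
            if (text.compare(pos, t.size(), t) == 0) return t;
        }
        return {};
    }
};
```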
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 18
Key Open Pull Requests
1. DeepSeek V2/V3 with `-mla` option (final): This pull request introduces the final version of DeepSeek V2/V3 support with the `-mla` option, addressing issues related to tensor separation for `attn_k_b_trans` and `attn_v_b`. It includes various fixes and optimizations, such as renaming variables, improving code tidiness, and ensuring compatibility with different attention mechanisms; the author expresses a desire to conclude their involvement after extensive testing and development efforts.
- URL: pull/12772
- Merged: No
- Associated Commits: b4c16, 10207, ea3c0, 1f604, 1de07, 7f92e, 319e3, ee4b3, c00cd, 55ad3, 0c86f, b0c8a, 8c329, 68302, 937a4, 1fd0a, 4fb43, f9a0e, 5fe40, 9b862, 8e23e, b3840, 5dbf9, 01a61, c0ffe, 8d12c, 997a4
2. WIP: Add support for CogAgent: This pull request introduces support for CogAgent, a visual model designed for GUI recognition and visual grounding, by integrating two CLIP encoders—one for standard vision tasks and another for high-resolution images—into the existing infrastructure, while awaiting the completion of a new vision infrastructure to finalize its implementation.
- URL: pull/12679
- Merged: No
- Associated Commits: 2a458, 0a810, 6cabd, d0068, 4a7ab, 431bb, bd071, ad38e, 32daa, 9716c, ba489, c0d93, 8586d, 25a97, c3a65, b986a, b72d7, 0959c, 90eef, e884d, 07f58, 4c7ac, c4cf4, 5c19d, 1343d, b5184
3. cmake : enable curl by default: This pull request proposes enabling `curl` by default in the `llama.cpp` project, as it has become integral to the user experience in examples and is already included in most pre-built versions, including Docker images and release binaries, reflecting a shift from the initial decision to keep it disabled due to the potential absence of `libcurl` on target systems.
- URL: pull/12761
- Merged: No
- Associated Commits: 6080f, 64557, 2cc89, 79307, 2238e, 707f2, 79509, 21c42, 9bf42, a8a7e, a9637, 04edd, 64faa, 1c1c2
Other Open Pull Requests
- Enhancements to the `llama_tensor_get_type` function: This topic involves modifications to the `llama_tensor_get_type` function in `llama-quant.cpp` to improve compatibility with DeepSeek models. The changes focus on optimizing performance for models with varying numbers of experts and improving perplexity metrics.
- Support for gguf models from ModelScope: This topic covers the addition of support for downloading and using gguf models from the ModelScope community on multiple platforms. The pull request includes successful tests for Hugging Face and ModelScope downloads, along with various code improvements and fixes.
- Introduction of a `--show-statistics` option: This pull request introduces a new `--show-statistics` option for the imatrix tool. It generates a detailed report on the importance-score statistics of tensors and layers, aiding in layer-wise quantization analysis.
- Refactoring CPU operations and CUDA/MUSA checks: This topic involves refactoring CPU operations by moving operators into a separate C++ file and addressing warnings. It also includes improvements to the Arm fp16 CPU logic and reintroduces CUDA/MUSA checks.
- Chat memory interface implementation: This pull request proposes a proof of concept for a chat memory interface inspired by ChatGPT's memory feature. It aims to integrate a simple key/value store for session-specific memory management with minimal code changes; a minimal interface sketch appears after this list.
- Integration of Ultravox audio input: This topic covers the integration of Ultravox audio input using a Whisper encoder and a vanilla Llama 3.2 1B model. The goal is to enable an efficient audio-to-summary pipeline, although the current implementation produces incorrect output.
- Update to the `rope_multi` function: This pull request proposes an update to the `rope_multi` function by introducing an in-place version called `ggml_rope_multi_inplace`. It also replaces a hardcoded value with `GGML_MROPE_SECTIONS`.
- Resolution of Android file access issues: This pull request addresses file access permission problems causing abnormal exits on Android devices. The issue is resolved as detailed in a specific commit.
- Refactoring of CANN component: This topic involves refactoring the CANN component to minimize duplicate code. The pull request is open for review and aims to streamline the codebase.
- Removal of redundant memory copy operation: This pull request proposes the removal of a redundant memory copy operation in the `ggml_backend_sycl_buffer_set_tensor` function. The change aligns its logic with the default ggml backend and ggml-cann.
- Enhancement of Docker GPU images for CPU compatibility: This topic addresses issue #12500 by adding all CPU variants to Docker GPU images. The enhancement resolves compatibility issues with 'token_embd.weight' processing on CPUs.
- Improved identification of Adreno GPUs: This pull request enhances the identification of Adreno GPUs by checking for "Qualcomm" in the device name. It ensures the complete device name is accurately captured.
- Removal of unused 'min_compute_capability' code: This pull request proposes the removal of the unused 'min_compute_capability' code from the SYCL component. The code is not utilized anywhere in the codebase.
- Performance improvement with direct accumulation: This pull request replaces the traditional accumulate-to-zero pattern with direct accumulation into the output register, yielding a roughly 12% speedup in prompt evaluation performance on an AMD Ryzen 9 9950X platform; a scalar sketch of the pattern change follows this list.
- Resolution of Android continuous integration issue: This pull request addresses a long-standing continuous integration issue in the Android build. The issue was potentially introduced by a previously approved pull request and has been verified through a specific GitHub Actions run.
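For the chat memory proof of concept above, a session-scoped key/value store can be very small. The class and method names below are hypothetical, sketched for illustration rather than taken from the PR's actual interface.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical session-scoped memory: each session id maps to its own
// key/value facts, kept separate from every other conversation.
class chat_memory {
public:
    void remember(const std::string &session, const std::string &key, const std::string &value) {
        store_[session][key] = value; // e.g. remember("s1", "user_name", "Ada")
    }

    std::optional<std::string> recall(const std::string &session, const std::string &key) const {
        auto s = store_.find(session);
        if (s == store_.end()) return std::nullopt;
        auto kv = s->second.find(key);
        if (kv == s->second.end()) return std::nullopt;
        return kv->second;
    }

    void forget(const std::string &session) { store_.erase(session); } // drop a whole session

private:
    std::map<std::string, std::map<std::string, std::string>> store_;
};
```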
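And for the direct-accumulation item, the scalar analogue of the change is shown below. The actual PR targets SIMD kernels, so this is only a sketch of the idea: seeding the accumulator with the first product removes one operation from the dependency chain of every dot product.

```cpp
#include <cstddef>

// Before: the accumulator starts at zero, so the first iteration computes 0 + x*y.
float dot_accumulate_to_zero(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += x[i] * y[i];
    return acc;
}

// After: accumulate directly from the first element; no zero-initialized
// register and one fewer dependent add per dot product.
float dot_direct_accumulate(const float *x, const float *y, size_t n) {
    if (n == 0) return 0.0f;
    float acc = x[0] * y[0];
    for (size_t i = 1; i < n; ++i) acc += x[i] * y[i];
    return acc;
}
```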
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 59
Key Closed Pull Requests
1. ci: add Linux cross-compile build: This pull request introduces a cross-compile build process targeting RISC-V architecture on Linux, aiming to minimize regression issues related to cross-compiling and providing a guide for cross-compilation using Ubuntu, with potential future updates to store artifacts for broader hardware compatibility.
- URL: pull/12428
- Merged: 2025-04-04T17:05:13Z
- Associated Commits: b437b, 6a447, ddd7b, d14ed, 10edb, 344bd, ce528, ce6e5, b52bb, 829bc, d2ac2, d8d3b, 7e276, ef737, 05bfb, 2ec6e, bb935, 6140b, 16aef, ee518, f9235
2. clip : refactor clip_init, add tests: This pull request refactors the `clip_init` function by introducing a `clip_model_loader`, adds a testing script `llava/tests.sh` for evaluating multiple models, implements an enum `patch_merge_type` to replace string comparisons, and removes the `bool has_(tensor name)` pattern. It also includes various code improvements and fixes, such as style adjustments, logging-system refactoring, and model-specific updates, with successful test results for several models.
- URL: pull/12757
- Merged: 2025-04-05T15:17:40Z
- Associated Commits: 44adf, 79c56, dd508, 7b9e7, ee1fa, b41ac, 6fe68, eeea3, 17be2, 85370, 376f8, 84b35, 88aec, c4bb0, 9d4ba, 13b2d
3. Fix clang warning in MUSA compiler: This pull request attempts to fix warnings generated by the MUSA compiler in the project, specifically those highlighted in a previous pull request (#12685), and includes various commits such as optimizing the `ssm_scan` function, removing unused comments, applying clang formatting, and modifying unnecessary calculations.
- URL: pull/12703
- Merged: No
Other Closed Pull Requests
- Downloading System Refactor: This topic involves refactoring the downloading system by removing JSON usage, adding a `--mmproj-url` option, and simplifying model path handling. These changes aim to improve usability and address multi-shard download issues and platform compatibility.
- KV Cache Refactor: The refactoring of the KV cache guard mechanism simplifies its operation and prepares for a separate recurrent cache implementation. It ensures `llama_decode` returns `1` when a batch cannot fit and restores the KV cache state upon failure; a usage sketch appears after this list.
- Web UI Package Upgrade: This upgrade involves updating daisyui and tailwindcss packages in the server web UI, with code fixes and multiple commits. The changes include switching themes, reverting changes, updating formatting, and adding an index.html.gz file.
- SYCL Component Changes: The removal of the `ggml_sycl_op_flatten` function from the SYCL component is part of a series of changes. These include removing trailing whitespace, fixing the L2 norm, and adding a try-catch block for `sycl::exception`.
- Custom Chat Template Support: This pull request introduces support for a custom chat template to accommodate Yandex's upcoming 8B instruct model. The changes ensure compatibility and functionality, verified by local testing.
- CANN Backend Optimization: The optimization of the `get_rows` and `dup` operators in the CANN backend replaces the AscendC implementation with the aclnn library. This results in improved performance metrics, such as reduced sampling and evaluation times.
- Upstream Synchronization: This pull request synchronizes changes from an upstream repository, including file renaming and code modifications. It addresses compatibility issues with the Cosmo STL and adds new files, although it was not merged.
- Sesame Support Draft: This draft pull request adds Sesame support by translating safetensor models to gguf format. It includes scripts for splitting and converting models, with translation accuracy still being verified.
- Trillion-7B-preview Model Support: Support for the Trillion-7B-preview model is added, a large language model supporting multiple languages. Changes are primarily made to the tokenizer within the Llama architecture.
- BailingMoE Support: This pull request adds support for BailingMoE, including links to various models on Hugging Face. The Ling-plus model remains untested due to its size, and YaRN is not currently supported.
- Quantifier Reversion: Issues caused by possessive quantifiers are addressed by reverting them to greedy quantifiers. The pull request includes changing quantifiers, adding tokenizer test files, and deleting specific vocabulary files.
- CANN Backend Memory Fixes: This pull request resolves backend operation failures and memory inefficiencies in the CANN component. It includes fixes for memory waste, backend operation failures, and code formatting improvements.
- OpenCL Documentation Update: The documentation for the OpenCL backend is updated by adding OpenCL information to `build.md`. It refines tool requirements for Windows 11 arm64 and includes a link to `OPENCL.md`.
- FA Kernel Typedef Fix: The use of `constexpr` in FA kernels and a typedef issue are addressed. This pull request was successfully merged on March 30, 2025.
- Vulkan Cooperative Matrix Support: Synchronization of the 'ggml' component includes improvements to CMake configuration for better Vulkan cooperative matrix support checks. Minor adjustments like fixing whitespace issues are also made.
- CANN Backend Operator Optimization: The optimization of the `sin`, `cos`, and `argmax` operators in the CANN backend uses the aclnn library. It ensures all tests pass successfully and includes code-style adjustments.
- Custom Hugging Face Endpoints: Support for specifying custom Hugging Face endpoints via the `HF_ENDPOINT` environment variable is introduced. This allows users to configure endpoints similarly to the huggingface-cli; a small resolution sketch appears after this list.
- ggml-sycl Backend Configuration: The configuration and compilation of the ggml-sycl backend as a Visual Studio project/solution on Windows is enabled. It ensures compatibility with the Intel official compiler and has been tested on Windows 10.
- Vulkan Flash Attention Optimization: The "split_k" feature for cooperative matrix flash attention in Vulkan is implemented. It optimizes performance by distributing work across streaming multiprocessors, benefiting models with large KV caches.
- gguf-split Tool Update: The `gguf-split` tool is updated to respect the `--dry-run` option during merge operations. This pull request includes commits for implementing this feature and removing a trailing space.
- Clang Compiler Warning Fix: A Clang compiler warning in the `gguf_check_reserved_keys` function is addressed by properly handling the parameter `val`, as detected by the in-house CI for the MUSA backend.
- BailingMoE Bug Fix: A bug fix in the BailingMoE module corrects the qkv split logic when the head_dim is zero. The Ling-lite-base model remains broken until a related pull request is merged.
- FA Kernel Precision Update: Issue #12441 is addressed by updating FA kernels to use F32 precision in the Metal backend. There is no observed performance impact on the M2 Studio.
- JSON Dependency Removal: The `#include "json.hpp"` directive is removed from `common.cpp`, and the `common_grammar_trigger::from/to_json` functionality is relocated to the `server` module. This is part of a broader effort to eliminate JSON dependencies.
- MUSA Compiler Warning Resolution: MUSA compiler warnings are resolved by replacing `(void)` casts with the `GGML_UNUSED` macro, as illustrated below. This pull request was successfully merged on April 3, 2025.
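The pattern change in the MUSA warning fix is tiny. Assuming ggml's conventional macro definition (`#define GGML_UNUSED(x) (void)(x)` in ggml.h, reproduced inline here so the sketch stands alone), both forms silence unused-parameter warnings; the macro simply documents intent and is easy to grep for.

```cpp
// Assumed definition, mirroring ggml.h's convention.
#define GGML_UNUSED(x) (void)(x)

static void op_fallback(int dst_id, int src_id) {
    (void)dst_id;        // before: bare cast-to-void
    GGML_UNUSED(src_id); // after: project-standard macro
}

int main() {
    op_fallback(0, 1);
    return 0;
}
```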
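For the KV cache refactor listed earlier, the caller-facing contract is the interesting part. A hedged usage sketch follows, assuming the return-code semantics the summary describes: `0` on success, `1` when the batch cannot fit (with the cache state restored), and a negative value on hard errors. Details in the real API may differ.

```cpp
#include "llama.h"

// Returns true once the batch is decoded. A return of 1 is not a hard error:
// the guard restored the KV cache, so the caller can make room (for example
// by evicting old sequences or shrinking the batch) and retry.
static bool decode_or_make_room(llama_context * ctx, llama_batch batch) {
    for (int attempt = 0; attempt < 2; ++attempt) {
        const int32_t ret = llama_decode(ctx, batch);
        if (ret == 0) return true;  // decoded
        if (ret < 0)  return false; // hard error
        // ret == 1: batch did not fit; application-specific cleanup goes here
        // before the retry.
    }
    return false;
}
```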
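And for the custom Hugging Face endpoint item, resolution typically reduces to one environment lookup with a default fallback. The sketch below shows that behavior; the exact precedence and normalization logic inside llama.cpp may differ.

```cpp
#include <cstdlib>
#include <string>

// Honor HF_ENDPOINT when set, mirroring huggingface-cli; otherwise fall back
// to the public hub.
static std::string hf_endpoint() {
    const char * env = std::getenv("HF_ENDPOINT");
    std::string base = (env && *env) ? env : "https://huggingface.co";
    if (base.back() != '/') base += '/'; // normalize for path concatenation
    return base; // e.g. HF_ENDPOINT=https://hf-mirror.com -> "https://hf-mirror.com/"
}
```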
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ngxson | 125 | 8 | 0 | 60 |
| ggerganov | 99 | 9 | 2 | 64 |
| zhouwg | 94 | 4 | 1 | 37 |
| ochafik | 75 | 2 | 0 | 23 |
| BradHutchings | 79 | 1 | 0 | 0 |
| CISC | 34 | 8 | 0 | 25 |
| jukofyork | 39 | 2 | 0 | 3 |
| 0cc4m | 13 | 3 | 0 | 27 |
| EAddario | 40 | 3 | 0 | 0 |
| bandoti | 30 | 2 | 1 | 9 |