Weekly GitHub Report for Llama.cpp: April 14, 2025 - April 21, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, but specific details are not provided in the data, so notable highlights or trends cannot be identified without further information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: GLM-Z1-9B-0414: This issue involves a bug in the GLM-Z1-9B-0414 model where the text generation process enters an infinite loop after producing approximately 100 tokens, despite attempts to resolve it using different configurations and quantization levels. The problem persists even when using the highest quantization levels, and the issue does not occur when the model is run with Transformers, suggesting a potential implementation-specific error.
- The comments discuss various problems with the Z1 model, including crashes and infinite loops, and suggest potential fixes such as adjusting conversion code and avoiding certain flags. Users confirm that some fixes improve output, but issues like corrupted output and server errors persist. A pull request is mentioned as addressing some bugs, and users share commands and configurations that work for them, while also noting remaining issues with the model's behavior and server endpoints.
- Number of comments this week: 26
- Misc. bug: Vulkan performance depends on thread priority: This issue highlights a performance inconsistency in the ggml-vulkan module, where Vulkan performance is unexpectedly influenced by thread or process priority, with higher priorities yielding better and more stable results. The user suspects that this might be related to CPU latency issues after waiting on a fence, and is seeking community input to gather data on how different systems are affected, particularly focusing on CPU, GPU, driver, and OS versions.
- The comments discuss various observations and suggestions, including the impact of e-cores on performance, attempts to disable e-cores and other CPU settings in the BIOS, and the potential influence of Windows background tasks. Some users report no significant differences, while others suggest trying different BIOS settings or locking CPU clock speeds. A proposed solution involves using a thread information setting to prevent power throttling, which has been shown to restore performance in some cases (see the thread-priority sketch after this list).
- Number of comments this week: 13
- Eval bug: Quad P40 unable to run 70B models on recent releases: This issue involves a bug where the Quad Nvidia Tesla P40 setup is unable to run 70B models on recent releases of llama.cpp, causing failures in model generation and requiring a server reboot to reinitialize CUDA devices. The problem seems to be related to the configuration of CUDA devices, as certain combinations of devices work while others do not, and reverting to an older version resolves the issue.
- The comments discuss troubleshooting steps, including suggestions to fetch a fresh pull from GitHub, disable certain options, and test with smaller models. The user reports testing various configurations and versions, noting that some combinations of CUDA devices work while others fail. They express appreciation for the suggestion to use `git bisect` to identify the problematic commit, although they face challenges with automated detection due to CUDA initialization errors.
- Number of comments this week: 4
- Feature Request: Improve model load time when using the RPC backend: This issue is a feature request to improve model load time when using the RPC backend, specifically by exploring the possibility of storing pre-computed hashes in GGUF to avoid loading the entire model on the main host. The motivation behind this request is to enhance the efficiency of loading large models, such as Llama 4, across multiple RPC servers, with the suggestion that reloads could be made instant if hashes and main model weights are cached in system RAM.
- The comments discuss the potential benefits of storing pre-computed hashes to speed up model loading and suggest a manual method for managing cache directories. A user provides a script for launching RPC servers with automatic cache directory creation, while another user points out that the cache directory can be overridden using an environment variable, which is acknowledged as a cleaner solution. (A toy sketch of the hash-based caching idea appears after this list.)
- Number of comments this week: 3
- Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple: This issue highlights a performance discrepancy between the `llama-server` and `llama-speculative-simple` modules in the llama.cpp project, where the former is less efficient in speculative decoding despite using similar arguments. The problem is observed when `llama-server` generates fewer drafted tokens and performs approximately 10%-15% worse than `llama-speculative-simple` in a best-case scenario test.
- The comments discuss the discrepancy in behavior between the two modules, with one user suggesting enabling verbose output to identify the cause. Another user provides verbose logs, noting that `llama-speculative-simple` repeatedly invokes a specific function, while `llama-server` alternates between generating a single token and speculative decoding. The final comment acknowledges the potential for improvement in the server's implementation to match the efficiency of the simple example. (A toy sketch of the speculative draft-and-verify loop appears after this list.)
- Number of comments this week: 3
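For the Vulkan thread-priority issue above, the workaround described in the comments is a Windows thread-information setting that opts a thread out of power throttling. The snippet below is a minimal, Windows-only sketch of that idea using the public `SetThreadInformation` API; it is an illustration rather than the actual ggml/llama.cpp patch, and applying it to the current thread is an assumption made for the example.

```cpp
// Minimal Windows-only sketch: opt the calling thread out of power throttling
// (EcoQoS), which some commenters report restores Vulkan performance.
// Requires a recent Windows 10/11 SDK for THREAD_POWER_THROTTLING_STATE.
// Illustrative only; not the actual llama.cpp/ggml change.
#include <windows.h>
#include <cstdio>

static bool disable_power_throttling_for_current_thread() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // we control this policy
    state.StateMask   = 0;                                       // 0 = do not throttle

    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}

int main() {
    if (disable_power_throttling_for_current_thread()) {
        std::printf("power throttling disabled for this thread\n");
    } else {
        std::printf("SetThreadInformation failed (error %lu)\n", GetLastError());
    }
    return 0;
}
```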
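The RPC load-time feature request above revolves around hashing tensor data so that a host can skip re-transferring weights it has already cached. The following is a self-contained toy sketch of that lookup idea using an FNV-1a hash and an in-memory map; the hash choice, cache structure, and names are assumptions for illustration and are not llama.cpp's RPC implementation.

```cpp
// Toy illustration of the "hash the tensor, skip the transfer if cached" idea
// from the RPC load-time feature request. Names and structures are hypothetical.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// FNV-1a over a byte buffer (stand-in for a precomputed per-tensor hash).
static uint64_t fnv1a(const std::vector<uint8_t> & data) {
    uint64_t h = 14695981039346656037ull;
    for (uint8_t b : data) {
        h ^= b;
        h *= 1099511628211ull;
    }
    return h;
}

// Hypothetical server-side cache: hash -> tensor bytes already on this host.
static std::unordered_map<uint64_t, std::vector<uint8_t>> g_cache;

// Returns true if the data had to be "sent"; false if the cache was hit.
static bool upload_tensor(const std::vector<uint8_t> & data) {
    const uint64_t h = fnv1a(data);
    if (g_cache.find(h) != g_cache.end()) {
        return false; // cache hit: no transfer needed, reload is effectively instant
    }
    g_cache.emplace(h, data); // first load: transfer and remember the bytes
    return true;
}

int main() {
    std::vector<uint8_t> weights(1024, 0x42); // stand-in for a tensor's bytes
    std::printf("first load transferred: %s\n",  upload_tensor(weights) ? "yes" : "no");
    std::printf("second load transferred: %s\n", upload_tensor(weights) ? "yes" : "no");
    return 0;
}
```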
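The llama-server vs. llama-speculative-simple discrepancy above is about how many drafted tokens get accepted per step. As background, here is a toy, model-free sketch of the draft-and-verify loop that speculative decoding uses: a cheap draft proposes a batch of tokens and the target keeps only the longest agreeing prefix. The stand-in "models" and all names are assumptions for illustration, not llama.cpp code.

```cpp
// Toy, model-free sketch of the speculative decoding accept/reject loop:
// a cheap "draft" proposes K tokens, the "target" verifies them, and only the
// longest agreeing prefix (plus one target token) is kept per step.
#include <cstdio>
#include <string>

static const std::string target_text = "speculative decoding accepts drafted tokens";

// "Target model": the authoritative next character.
static char target_next(size_t pos) { return target_text[pos]; }

// "Draft model": usually right, but wrong every 7th character to force rejections.
static char draft_next(size_t pos) { return (pos % 7 == 6) ? '?' : target_text[pos]; }

int main() {
    const size_t K = 4; // drafted tokens per step
    std::string out;
    int steps = 0;

    while (out.size() < target_text.size()) {
        size_t accepted = 0;
        // Verify drafted tokens against the target; keep the matching prefix.
        while (accepted < K && out.size() + accepted < target_text.size() &&
               draft_next(out.size() + accepted) == target_next(out.size() + accepted)) {
            accepted++;
        }
        out += target_text.substr(out.size(), accepted);
        // One "real" target token is always produced per step.
        if (out.size() < target_text.size()) {
            out += target_next(out.size());
        }
        steps++;
    }
    std::printf("generated %zu chars in %d steps (vs %zu without speculation)\n",
                out.size(), steps, target_text.size());
    return 0;
}
```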
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The problem has been open for over a year, indicating a potentially complex or unresolved technical challenge within the project.
- Feature Request: Task Cancellation on Client Disconnection: This issue addresses the need for a feature in the embedding server setup that allows for task cancellation when a client disconnects, as currently, tasks continue processing even after a client cancels a request, leading to inefficiencies and potential server overload. The proposed feature aims to terminate task processing upon request cancellation to prevent delays in subsequent requests and avoid server paralysis when a client makes numerous requests and then disconnects.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for the Hugging Face Candle project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is a documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded model files, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process (see the libcurl sketch after this list).
- Feature Request: deep/recurrent processing like "thinking", but script based: This issue is a feature request for implementing deep or recurrent processing methods, akin to "thinking," in a script-based manner within the project. The requester is interested in exploring whether the model can support or be adapted to use different processing methods, specifically referencing a model from Hugging Face and a related research paper, to enhance the way prompts are processed and responses are generated.
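For the parallel-download progress item above, the relevant libcurl behavior is that `CURLOPT_NOPROGRESS` must be cleared before any progress callback fires. Below is a minimal sketch of wiring a per-handle `CURLOPT_XFERINFOFUNCTION` callback so each shard can report its own progress line; the URL and the shard label are placeholders, and this is not llama.cpp's actual downloader code.

```cpp
// Minimal libcurl sketch: enable per-transfer progress reporting by clearing
// CURLOPT_NOPROGRESS and installing an xferinfo callback. Placeholder URL and
// shard label; not llama.cpp's download implementation.
#include <curl/curl.h>
#include <cstdio>

// Called periodically by libcurl with download totals for this one transfer.
static int xferinfo_cb(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char * label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "[%s] %.1f%%\n", label, 100.0 * (double) dlnow / (double) dltotal);
    }
    return 0; // non-zero would abort the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) return 1;

    const char * label = "shard-00001"; // hypothetical tag for this shard's status line
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L); // 0 = progress callbacks enabled
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, xferinfo_cb);
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, (void *) label);

    CURLcode res = curl_easy_perform(curl);
    std::fprintf(stderr, "result: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```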
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 31
Summarized Issues:
- Runtime Errors in llama-server: The llama-server component encounters runtime errors when executing specific commands, resulting in a termination with a response code 500 and an "Internal Error" message. These issues are often linked to internal errors from the HF API on systems running Ubuntu 24.04.2 LTS.
- Model Conversion and Inconsistencies: Users face challenges converting fine-tuned models to gguf format using llama.cpp, experiencing inconsistent responses and mixed language outputs. There is uncertainty about the correctness of the conversion script and how to verify it.
- Bugs in Model Execution: Several models, including GLM-Z1-9B-0414 and Deepseek V2 Lite, exhibit bugs such as infinite loops and crashes during execution. These issues persist despite attempts to resolve them with different configurations and quantization methods.
- Performance and Resource Utilization Issues: Performance discrepancies and resource utilization issues are reported in various components, such as excessive power draw in dual GPU setups and inefficient speculative decoding in llama-server. These problems affect the overall efficiency and performance of the system.
- Feature Requests for Enhanced Functionality: Multiple feature requests aim to improve the functionality of the llama.cpp project, including faster model load times, automatic image conversion, and enhanced CLI tools for interactive sessions. These enhancements are particularly beneficial for users with specific needs, such as those who are blind.
- Build and Compilation Errors: Users encounter build and compilation errors in the llama.cpp project, such as undefined references and missing directories. These issues raise questions about the correct setup and configuration of the build environment.
- Memory and Resource Management Bugs: Memory leaks and resource management issues are reported, particularly when offloading models to GPUs. These problems are linked to the improper handling of backend resources and static contexts.
- Compatibility and Configuration Issues: Compatibility issues arise with certain hardware configurations, such as CUDA errors on specific GPUs and Vulkan backend limitations. These problems often require adjustments to configurations or additional support for specific tensor types.
- Feature Enhancements for Model Management: Proposed feature enhancements aim to improve model management, such as enabling server model switching at runtime and packing multiple GGUFs into a single file. These features would enhance the flexibility and efficiency of model handling.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 13
Summarized Issues:
- Runtime Errors in llama.cpp: The `llama-server` module on Linux crashes due to a runtime error when processing a JSON-based chat completion request, triggered by specific CUDA settings. Another issue involves a CUDA crash caused by the `-ot` option, which can be temporarily resolved by disabling CUDA graphs.
- Performance Issues in llama.cpp: The Llama 3.2 software experiences reduced processing speed as the CPU performs all the work instead of the GPU, despite the GPU's VRAM being filled. Additionally, a performance bottleneck in the libllama core library limits the tokens per second due to potential memory bandwidth limitations.
- Compilation Errors in llama.cpp: A CMake build process fails due to the inability to find the CURL library, which can be resolved by installing necessary packages or disabling the feature. Another compilation error occurs when building the "cann" component due to an undeclared identifier, resolved by updating the GCC version.
- Feature Requests in llama.cpp: A request for support for the Apriel-5B-Instruct model is made, though it may not be pursued further. Another request seeks to increase the limit of devices for RPC from 16 to a higher number, as the current limit restricts device utilization.
- Tensor and Model Issues in llama.cpp: A potential bug in the LLaVa Projector involves tensor shape discrepancies due to order differences, though it does not affect performance. A performance discrepancy is noted on an M3 Ultra with the phi-4 model, where GGUF format is slower than MLX format.
- Command and Input Handling in llama.cpp: The `llama-cli` module on Linux has a bug where Ctrl+D behaves like Enter, and Ctrl+C fails when input is held by `/dev/null`, possibly due to a regression. A user questions the compatibility of flash-attention with the V100 GPU, despite observing improved performance.
- Code Refactoring in llama.cpp: Code refactoring involves renaming a variable and updating logic to improve clarity, including introducing an enum to better handle conditions for using SWA.
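The refactoring item above mentions replacing ad-hoc conditions with an enum so the cases for using SWA (sliding-window attention) read explicitly at call sites. Purely as an illustration of that pattern, and not the actual enum or logic added in llama.cpp, a hypothetical version might look like this:

```cpp
// Hypothetical illustration of replacing an ad-hoc boolean with an enum so the
// conditions for using sliding-window attention (SWA) are explicit.
// Not the actual enum or logic from llama.cpp.
#include <cstdio>

enum class swa_mode {
    none,      // full attention, no sliding window
    standard,  // fixed sliding window over the whole context
    per_layer, // some layers use a window, others use full attention
};

static const char * swa_mode_name(swa_mode mode) {
    switch (mode) {
        case swa_mode::none:      return "none";
        case swa_mode::standard:  return "standard";
        case swa_mode::per_layer: return "per_layer";
    }
    return "unknown";
}

int main() {
    swa_mode mode = swa_mode::per_layer;
    std::printf("swa mode: %s\n", swa_mode_name(mode));
    return 0;
}
```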
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. Nix portability improvements: This pull request focuses on enhancing the portability of the Nix package by implementing a multi-output derivation to minimize runtime closure sizes, appropriately categorizing dependencies into native and propagated `buildInput` sets, avoiding linking `stdc++fs` on unsupported platforms, enabling test execution, allowing configuration of `LLAMA_BUILD_*` options through Nix, and adding a new maintainer for the Nix package, with successful testing on various platform configurations.
- URL: pull/13005
- Merged: No
2. Resolved half rope, multi-EOS issues in convert_hf_togguf.py for GLM4Z Model: This pull request addresses and resolves critical issues in the `convert_hf_togguf.py` script for the GLM4Z Model, specifically targeting the half rope problem, multi-EOS issues, and the GGGG output problem, while also enhancing the codebase's efficiency and maintainability by leveraging existing architecture, as detailed in the linked issue #12946.
- URL: pull/12957
- Merged: No
3. mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli`: This pull request aims to consolidate the command-line interfaces for the vision models llava, gemma3, and minicpmv into a single unified `llama-mtmd-cli` within the llama.cpp project, while also addressing the exclusion of Qwen2VL due to complications with M-RoPE and planning follow-up tasks such as documentation refactoring and support for additional models.
- URL: pull/13012
- Merged: No
Other Open Pull Requests
- Performance Enhancements in CUDA and AMD Architectures: The pull requests focus on optimizing performance for specific GPU configurations, including extending MMVQ support for noncontiguous and batched inputs, adding FP32 support, and manual tuning for AMD GCN architecture. These changes result in significant performance improvements for models on GPUs like the RX 6800, RTX 3090, and RX470.
- SYCL and Vulkan Implementation Fixes: These pull requests address issues in the SYCL and Vulkan implementations, including enabling SYCL graphs, supporting non-contiguous input in ROPE, and fixing the Deepseek V2 inference problem. The changes aim to improve functionality and performance in AI applications.
- Conversion Script and Multimodal Support: The pull requests introduce experimental support for converting multimodal projectors and fix bugs in conversion scripts for Hugging Face variants. These updates streamline the conversion process and ensure compatibility with various model checkpoints.
- Thread Scheduling and Performance Metrics: Enhancements include introducing a low-priority scheduling option for GGML threads and adding the `n_graph_splits` performance metric to the `llama-bench` tool. These updates aim to reduce contention and provide a better assessment of hardware backend support.
- Optimization and Bug Fixes in SYCL: The pull requests focus on optimizing the reorder process in SYCL and dynamically adjusting memset operations during inference. These changes aim to improve performance and resolve issues with tensor reordering in large language models.
- Miscellaneous Enhancements and Fixes: Various pull requests address issues such as modifying the RPC_CMD_SET_TENSOR command, updating the OneAPI base toolkit, and enhancing quantization methods in the Bitnet model. These updates aim to improve efficiency and resolve specific bugs.
- Feature Additions and Bug Fixes in GLM and Embedding: The pull requests append new features to GLM components and fix embedding issues by adjusting variable settings. These changes aim to enhance functionality and resolve errors from previous implementations.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 37
Key Closed Pull Requests
1. gguf-py : GGUF Editor GUI - Python + Qt: This pull request introduces a visual editor for GGUF files, developed using Python and Qt, to be integrated into the gguf-py package, enhancing the project's functionality by providing a graphical user interface for editing these files.
- URL: pull/12930
- Merged: 2025-04-18T18:30:42Z
- Associated Commits: 70a56, e6620, ad1c9, 7f52f, 02931, 1d304, 04d81, 26286, 1fa90, 0675b, 3f0f3, 92680, af8d3, cf0f6, b6df5, 0349a, b021f, 229da, 34234, ad607, bf1be, 42bd4, a1e2c, a7c9c, dd18e, e8215
2. DeepSeek V2/V3 MLA implementation: This pull request implements the DeepSeek V2/V3 MLA (Multi-Head Linear Attention) feature in the llama.cpp project, ensuring backward compatibility with legacy non-MLA GGUF files while introducing context-shifting capabilities and optimized paths for MQA models, requiring new GGUF files with specific metadata and making several code adjustments to support MLA functionality.
- URL: pull/12801
- Merged: 2025-04-15T06:49:57Z
- Associated Commits: 8c024, ddab5, 76125, fed66, c4494, 2a4e1, 77fe5, e2153, 815f4, 57788, 77ad5, 5d037, 638b0
3. opencl: split `ggml-opencl.cl` into multiple files and cleanup: This pull request involves splitting the `ggml-opencl.cl` file into multiple `.cl` files and performing cleanup to enhance compatibility with older Adreno GPUs, such as the Adreno 660, while ensuring functionality with compilers newer than version E031.38.01.00.
- URL: pull/12886
- Merged: 2025-04-15T19:26:00Z
Other Closed Pull Requests
- Server Code Optimization: This topic focuses on optimizing server code by incorporating `std::move` to enhance performance. The pull requests include changes like adding and then removing a `std::unique_ptr<int> dummy`, and addressing code review suggestions, task creation scoping, and bug fixes. (A minimal sketch of the move-instead-of-copy pattern appears after this list.)
- DeepSeek V2/V3 MLA Optimizations: These pull requests involve optimizations for the DeepSeek V2/V3 MLA by permuting variables for contiguity and reintroducing MQA optimizations. They address performance regressions and anticipate further improvements from other contributors.
- Unified Memory Allocation for GPUs: This topic introduces unified memory allocation logic for NVIDIA and AMD GPUs by replacing `GGML_HIP_UMA` with `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. It enables a single binary for integrated and dedicated GPUs and includes a fallback to `hipMalloc()` with a warning for unsupported managed memory. (A hedged sketch of the managed-allocation-with-fallback pattern appears after this list.)
- Image Processing Enhancements: These pull requests introduce new functions and refactor existing ones to improve image processing capabilities. They include methods for obtaining the number of patches for images and enhancements for accessing `mtmd_image_tokens`.
- FA and MLA Compatibility: This topic enhances the compatibility of the FA (flash attention) mechanism with MLA by introducing initial Metal kernels and decompressing the FA result using the `v_mla` tensor. It also includes tests for MLA shapes and updates to naming conventions.
- SYCL and Vulkan Backend Improvements: These pull requests refactor SYCL binary broadcast operations and enhance the Vulkan backend. They correct dispatcher logic, enable FP16 support, and optimize cooperative matrix and split_k operations for performance improvements.
- Chat Memory Interface: This pull request proposes a proof-of-concept for a chat memory interface inspired by ChatGPT's memory feature. It integrates a simple key/value store using `unordered_map` and addresses issues like model hallucination and session memory expiration. (A minimal key/value store sketch appears after this list.)
- ROPE Operator Optimization: This topic optimizes the ROPE operator to enhance inference performance by approximately 10%. It reduces unnecessary memory allocations and eliminates redundant transpose operations.
- Image Manipulation Refactoring: This pull request refactors code to introduce `image_manipulation` and `llava_uhd` classes, enhancing vision model preprocessing. It implements a new algorithmic slicing and grid system for image manipulation.
- Quantization Process Enhancement: This pull request introduces the capability to quantize additional tensors beyond the token-embedding and output-tensor. It enhances the flexibility of the quantization process in the project.
- GEMM Operation Performance: This topic introduces an AVX512 implementation of the GEMM operation for the Q4_Kx8 model. It results in significant performance improvements in prompt processing compared to the previous AVX2 version.
- CANN Module Enhancements: These pull requests introduce asynchronous task submission and a new memory allocation method to the CANN module. They replace a macro with a function and allow users to select the allocation method via an environment variable.
- SYCL Implementation Fixes: These pull requests address fixes for the 'im2col' function and logging issues in the SYCL implementation. They ensure compatibility with Gemma 3 vision and correct the output of local_size dimensions in OpenCL profiling.
- Web UI Feature Addition: This pull request introduces a "Clear All Conversations" feature to the llama-server web UI. It allows users to delete all chat history from IndexedDB via a new button in the sidebar.
- Synchronization and Build Process Updates: These pull requests address synchronization updates for the 'ggml' component and introduce an x86 build CI process. They fix CPU backend support operations and monitor build failures specific to the x86 architecture.
- Prompt Evaluation Optimization: This pull request optimizes the accumulation process in the ggml library using specific instructions. It results in a ~12% improvement in prompt evaluation performance on an AMD Ryzen 9 9950X platform.
- ROPE Vision Kernel Introduction: This pull request introduces a new ROPE vision kernel to the SYCL framework. It is essential for Vision Transformers and includes image projectors for Vision-Language Models.
- CUDA Graphs Disabling: This pull request addresses issue #12798 by disabling CUDA graphs for unsupported DUP and CONT node types. It ensures compatibility and was successfully merged.
- Performance Measurement Function: This pull request adds the `llama_perf_context_print` function to the `gemma3-cli`. It facilitates easier performance measurement and ensures consistency with other llava examples.
- Build Script Update: This pull request addresses the issue of the build script being blocked due to the absence of the curl library. It disables the curl lib check, aligning with changes in the Windows build script.
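For the server-code optimization item above, the core idea is avoiding copies of large request payloads when handing them off to a task object; `std::move` transfers ownership of the buffer instead. The sketch below shows that pattern with hypothetical names rather than llama-server's actual task types.

```cpp
// Minimal sketch of the move-instead-of-copy pattern from the server
// optimization item: the prompt string's buffer is transferred into the task
// rather than duplicated. Names are hypothetical, not llama-server's types.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct server_task {
    std::string prompt;
};

static std::vector<server_task> g_queue;

static void enqueue(std::string prompt) {
    server_task task;
    task.prompt = std::move(prompt);    // steal the caller's buffer, no copy
    g_queue.push_back(std::move(task)); // move the task into the queue as well
}

int main() {
    std::string prompt(1 << 20, 'x');   // pretend this is a large request body
    enqueue(std::move(prompt));         // hand the buffer off without copying
    std::printf("queued %zu task(s), first prompt is %zu bytes\n",
                g_queue.size(), g_queue.front().prompt.size());
    return 0;
}
```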
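The unified-memory item above describes preferring a managed (unified) allocation and falling back to a plain device allocation with a warning when managed memory is unavailable. The sketch below shows that fallback shape with the HIP runtime API; it is an illustration under the assumption that managed allocation can fail at runtime, not the actual ggml allocation code.

```cpp
// Hedged sketch of the "try managed memory, fall back to plain device memory"
// pattern described for GGML_CUDA_ENABLE_UNIFIED_MEMORY on HIP.
// Illustrative only; not the actual ggml CUDA/HIP allocation code.
#include <hip/hip_runtime.h>
#include <cstdio>

static void * alloc_device_buffer(size_t size) {
    void * ptr = nullptr;

    // Prefer unified (managed) memory so integrated and dedicated GPUs can share
    // one binary; the runtime migrates pages between host and device as needed.
    hipError_t err = hipMallocManaged(&ptr, size);
    if (err == hipSuccess) {
        return ptr;
    }

    // Managed memory unsupported on this device/driver: warn and fall back to an
    // ordinary device allocation, as the pull request describes.
    std::fprintf(stderr, "warning: hipMallocManaged failed (%s), falling back to hipMalloc\n",
                 hipGetErrorString(err));

    err = hipMalloc(&ptr, size);
    if (err != hipSuccess) {
        std::fprintf(stderr, "error: hipMalloc failed (%s)\n", hipGetErrorString(err));
        return nullptr;
    }
    return ptr;
}

int main() {
    void * buf = alloc_device_buffer(64 * 1024 * 1024);
    if (buf) {
        std::printf("allocated 64 MiB device buffer\n");
        hipFree(buf);
    }
    return buf ? 0 : 1;
}
```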
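The chat-memory proof of concept above centers on a small key/value store kept alongside the conversation. The snippet below is a toy sketch of such a store using `std::unordered_map`; the class name, methods, and the idea of rendering stored facts back into the prompt are assumptions for illustration, not the pull request's actual interface.

```cpp
// Toy sketch of a key/value chat memory backed by std::unordered_map, in the
// spirit of the proof-of-concept PR. Names and the prompt-context format are
// hypothetical; this is not the PR's actual interface.
#include <cstdio>
#include <string>
#include <unordered_map>

class chat_memory {
public:
    void remember(const std::string & key, const std::string & value) {
        facts_[key] = value;
    }

    // Render stored facts as extra context to prepend to the system prompt.
    std::string as_prompt_context() const {
        std::string out = "Known facts about the user:\n";
        for (const auto & kv : facts_) {
            out += "- " + kv.first + ": " + kv.second + "\n";
        }
        return out;
    }

private:
    std::unordered_map<std::string, std::string> facts_;
};

int main() {
    chat_memory mem;
    mem.remember("name", "Alice");
    mem.remember("preferred_language", "C++");
    std::printf("%s", mem.as_prompt_context().c_str());
    return 0;
}
```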
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ngxson | 153 | 7 | 3 | 42 |
| ggerganov | 83 | 5 | 1 | 38 |
| zhouwg | 90 | 0 | 1 | 0 |
| BradHutchings | 79 | 0 | 0 | 1 |
| CISC | 30 | 1 | 0 | 26 |
| jukofyork | 49 | 2 | 1 | 1 |
| No author found | 43 | 0 | 0 | 0 |
| 0cc4m | 11 | 1 | 0 | 30 |
| qnixsynapse | 30 | 4 | 0 | 5 |
| jeffbolznv | 15 | 2 | 1 | 10 |