Weekly GitHub Report for Llama.cpp: July 14, 2025 - July 21, 2025 (12:22:22)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
This version was created on March 29, 2025. The release data does not include details about the changes it introduces, so no notable highlights or trends can be identified.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: Nondeterministic output with ROCm backend despite zero temperature: This issue involves a bug in the ROCm backend where nondeterministic outputs are generated despite using a zero temperature setting, which is expected to produce deterministic results. The problem is observed when using RDNA3 GPUs, such as the AMD Radeon RX 7800 XT and RX 7900 XT, while the same setup on a CPU or NVIDIA GPU yields deterministic behavior.
- The comments discuss various troubleshooting steps, including verifying temperature settings, using verbose logging, and testing with different configurations like CPU-only builds and enabling GGML_CUDA_FORCE_MMQ. It is noted that the issue is specific to the ROCm backend, and potential causes such as atomic operations in rocBLAS are considered. Suggestions include using the llama-eval-callback tool to identify the problematic operator and exploring rocBLAS settings for improved determinism.
- Number of comments this week: 15
- Exaone-4 gibberish when using jinja template: This issue involves a problem with the EXAONE-4 model producing gibberish output when using a Jinja template, specifically related to the configuration and compatibility of the model with certain backend settings and template scripts. The user reports encountering errors and unexpected behavior when attempting to run the model with specific configurations, such as using the YaRN and rope settings, and seeks guidance on resolving these issues.
- The comments discuss potential causes and solutions for the issue, including the incompatibility of YaRN with EXAONE-4, suggestions for using rope settings, and a bug in the Jinja template related to the 'not in' operator. Users share their experiences with different configurations, and a temporary fix is suggested to change the syntax in the template.
- Number of comments this week: 10
- Misc. bug: OpenAI API v1/responses llama-server: This issue is about a bug in the llama-server module where using the OpenAI compatible API with the `v1/responses` endpoint results in a 404 error, indicating that the endpoint might not be supported yet. The problem occurs when attempting to use the specified command line on a Windows operating system, and the user is unsure of the first bad commit that introduced this issue. (A minimal endpoint-probe sketch appears after this list.)
- The comments reveal that the v1/responses endpoint is not currently available in the server's registered endpoints. There is a discussion about the potential to add this endpoint to improve OpenAI API compatibility, although no plan is in place yet. A user mentions that tools will start using this endpoint as it is part of the API specification, and a Fabric maintainer offers to collaborate with the issue reporter to find a solution for similar cases.
- Number of comments this week: 5
- Feature Request: Server stream response for "prompt processing progress": This issue is a feature request for the server tool in the project, specifically asking for the `/completion` endpoint to return "prompt processing progress" in a manner similar to what is displayed in the server log. The motivation behind this request is to provide users with real-time evaluation progress for lengthy completion processes.
- The comments include expressions of gratitude for taking on the feature and dealing with feedback, offers of future collaboration, and a discussion about job opportunities and acknowledgments for contributions.
- Number of comments this week: 4
- Eval bug: CUDA error: operation not supported: This issue involves a CUDA error encountered when using the llama.cpp server to serve the DeepSeek V3 model, specifically reporting an "operation not supported" error during execution. Additionally, the user reports that setting a large context size results in the server being killed due to RAM out-of-memory issues, with the server consuming nearly 1TB of RAM.
- The comments discuss potential solutions to the CUDA error, suggesting the use of specific CMake options during compilation. The user initially misunderstood the suggestion as environment variables, but after clarification, they successfully compiled with the recommended options, resolving the error but experiencing performance issues with high concurrent requests.
- Number of comments this week: 4
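As a quick illustration of the `v1/responses` issue above, the following minimal Python sketch probes a locally running `llama-server` and falls back to the `v1/chat/completions` route that the server does register. The host, port, prompt, and model name below are placeholders, not values taken from the issue.

```python
# Minimal sketch: probe llama-server's OpenAI-compatible endpoints.
# Assumes a local server (e.g. started on port 8080); the host, port, prompt,
# and model name are placeholders.
import requests

BASE_URL = "http://127.0.0.1:8080"

payload = {
    "model": "local-model",  # placeholder; llama-server serves whatever model it loaded
    "input": "Say hello in one word.",
}

# As reported in the issue, /v1/responses is not among the registered routes,
# so this request is expected to return HTTP 404.
r = requests.post(f"{BASE_URL}/v1/responses", json=payload, timeout=30)
print("/v1/responses ->", r.status_code)

if r.status_code == 404:
    # Fall back to the chat completions endpoint, which llama-server does serve.
    chat_payload = {
        "model": "local-model",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    }
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=chat_payload, timeout=30)
    print("/v1/chat/completions ->", r.status_code)
    print(r.json()["choices"][0]["message"]["content"])
```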
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The issue has been open for a significant duration of 476 days, indicating a potentially complex problem that has yet to be resolved.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process.
- kubernetes example: This issue is about the need for a Helm chart for the `llama.cpp` server to facilitate its deployment on Kubernetes, which is a widely used platform for deploying applications at scale. The issue has been open for 468 days, and the original poster has made some progress but is seeking additional help from the community to continue the development.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered when attempting to load a model using the GGML backend with CUDA on a system equipped with an NVIDIA GeForce RTX 3060. The error arises from a tensor type mismatch and block size inconsistency, preventing the successful loading of the model file '.\ggml-model-i2_s.gguf'.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 26
Summarized Issues:
- Bugs in Model Conversion and Execution: Several issues highlight bugs related to model conversion and execution in the llama.cpp project. These include FileNotFoundErrors and TypeErrors during model conversion due to missing files and incorrect methods, as well as execution errors like CUDA and device memory allocation errors on various backends, affecting model loading and performance.
- Feature Requests for Enhanced Functionality: Multiple feature requests aim to enhance the functionality of the llama.cpp project. These requests include adding support for ARMv7 architecture, optimizing Mixture of Experts architecture, and enabling direct conversion of FP8 models, which would improve performance and compatibility across different systems and use cases.
- Regression and Compatibility Issues: Several issues report regression and compatibility problems in the llama.cpp project. These include JSON formatting errors affecting API compatibility, unsupported GPU configurations, and endpoint errors, which disrupt the expected functionality and integration with other systems.
- Compilation and Runtime Errors: Various issues describe compilation and runtime errors in the llama.cpp project. These include missing binaries, loop unrolling warnings, and runtime errors due to mismatched tensor sizes, which affect the build process and execution on different platforms.
- Bugs in Model and Server Functionality: Several issues highlight bugs in model and server functionality within the llama.cpp project. These include problems with tag closure, memory allocation, and unsupported operations, which hinder the proper functioning of models and servers across different environments.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 10
Summarized Issues:
- Script Argument Errors: Users have encountered issues with invalid arguments when using scripts, such as the `convert_hf_to_gguf.py` script, which does not support certain `--outtype` arguments. These errors lead to confusion about the capabilities of the script, particularly regarding quantization options beyond the default settings.
- Model Processing Failures: Several issues have been reported regarding failures in model processing, including assertion failures with parallel requests on Mac systems and errors with hybrid models due to memory handling bugs. These problems often result in crashes or failed model loads, indicating potential areas for improvement in model architecture and backend compatibility.
- Backend Compatibility Issues: Users have faced compatibility issues with different backends, such as the SYCL backend causing errors with Mixture of Experts models on Intel iGPUs due to work-group size limitations. These issues highlight the need for backend-specific optimizations to ensure smooth operation across various hardware configurations.
- Compilation and Build Errors: Compilation errors have been a recurring problem, with users experiencing failures when compiling with HIP or Vulkan support due to missing dependencies or incorrect compiler settings. These issues often require users to adjust their development environments or update their toolchains to resolve.
- Streaming and Token Handling Bugs: Bugs in streaming and token handling have been reported, particularly affecting the `llama-server` on Mac, where changes in streaming token handling lead to unexpected disconnections or limited token output. These issues suggest a need for more robust handling of streaming data in the server's architecture.
- Unresolved or Incomplete Issues: Some issues, such as "Data offline," lack sufficient detail or context, making it difficult to address or resolve them effectively. This highlights the importance of providing comprehensive information when reporting issues to facilitate quicker and more accurate troubleshooting.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 20
Key Open Pull Requests
1. Adding a simple-function-call example - hopefully not doing anything wrong: This pull request introduces a new example of a simple function call by adding a `simple-function-call.cpp` file, updating the `README.md`, and modifying the `CMakeLists.txt` files to include this example in the `llama.cpp` project.
- URL: pull/14682
- Merged: No
- Associated Commits: 9d755, 52767, 25fcd, 3bbe7, 7158e, 65f3c, 0f3e6, d8bd3, a4951, 82915, 53212, 10253, ddbde, d0b04, 52bea, 08e90, 7a915, 72ce7
2. metal: SSM_SCAN performance: This pull request aims to enhance the performance of the `SSM_SCAN` function in the `mamba2` implementation for the `metal` backend, specifically targeting improvements for the Granite Four model by optimizing kernel launch configurations and memory usage, resulting in significant speedups in both prefill and decode operations.
- URL: pull/14743
- Merged: No
3. Fix KleidiAI compilation errors with -DGGML_NATIVE=OFF (issue #14464): This pull request addresses and resolves compilation errors in the KleidiAI code when building with the `-DGGML_NATIVE=OFF` flag by implementing conditional compilation, adding null pointer checks, improving error handling, and ensuring compatibility with non-ARM systems, while also updating the CMake configuration and download methods for better reliability and stability.
- URL: pull/14700
- Merged: No
Other Open Pull Requests
- KleidiAI Acceleration for Q4_0 Matrix Multiplication: This pull request introduces support for KleidiAI acceleration of the Q4_0 matrix multiplication operation when the weight tensor is shared with the get_rows operator. It is particularly useful in scenarios like those found in whisper.cpp.
- Jinja Support in libmtmd for Qwen Models: This pull request introduces support for Jinja in the libmtmd library specifically for QwenVL and Qwen Omni models. It adds two optional metadata fields for GGUF to handle image and audio tokens, maintaining backward compatibility with the MTMD tokenizer.
- Bug Fix for Null Layers in Recurrent Memory: This pull request addresses a bug fix to handle the saving and loading of null layers in recurrent memory, which previously caused crashes. It particularly affects the new LiquidAI/LFM2 models and includes updates to the code and comments for improved styling and clarity.
- HIP Version Check Update for HIPBLAS V2 API: This pull request updates the HIP version check in the codebase to ensure compatibility with the HIPBLAS V2 API. It lowers the required version threshold from 70000000 to 50600000, aligning it with the introduction of HIPBLAS V2 features in ROCm 5.x (the packed version numbers are decoded in a short sketch after this list).
- Introduction of LLaDA 8B Diffusion Model: This pull request introduces the LLaDA 8B Diffusion model to the project, providing a new example command for its execution. It adds a README for guidance and initiates a discussion on integrating it with the server API.
- Prompt Processing Progress Streaming: This pull request introduces a new feature to the server by adding prompt processing progress streaming for the `/completion` endpoint. It includes implementing a `server_task_result_cmpl_progress` struct and a `send_progress_response()` function to provide real-time progress updates (a hypothetical client-side sketch follows this list).
- Enhancements to CUDA FlashAttention Kernel: This pull request introduces enhancements to the CUDA FlashAttention kernel by implementing logic to skip fully masked-out KQ slices. It aims to improve performance by reducing unnecessary data preloading and optimizing GPU compute resource utilization.
- Integration of Mistral Models with llama.cpp: This pull request aims to enhance the integration of Mistral models with llama.cpp by addressing conversion issues between formats. It introduces a direct conversion script from Hugging Face to GGUF and registers Mistral architecture in llama.cpp for native support.
- Example Feature for Predicted Outputs in Text Generation: This pull request introduces an example feature that allows users to specify predicted outputs to expedite text generation. It includes scripts for testing and comparing predicted outputs with speculative and lookup decoding.
- Fix for Hardcoded Values in MinicpmV Model Converter: This pull request addresses the issue of hardcoded values in the MinicpmV model converter and clip by implementing a fix. It has been tested using the llama-mtmd-cli with various MinicpmV models.
- Tests for Non-Contiguous K and V Tensors in FA Process: This pull request introduces tests in `test-backend-ops` to ensure that the K and V tensors, which can now be non-contiguous due to the implementation of a split KV cache, are properly handled in the FA process. It addresses an issue reported in a previous pull request.
- BF16 Copy Operations and CONT Operation in CUDA Backend: This pull request implements the missing BF16 copy operations and enables the BF16 CONT operation in the CUDA backend. It also fixes a cut-and-paste error for F16 to F16 operations in the `ggml_cuda_cpy_fn` function.
- Restoration of Missing Messages in Web UI's JSON Export: This pull request restores messages that have been missing from the web UI's JSON export for the past few weeks; the problem is tracked in issue #13552.
- Extended Sampling API: This pull request introduces an extended sampling API to the project, which includes the addition of a new `llama_sampling_result` struct. It enables developers to access detailed sampling information such as selected token IDs, logits, probabilities, and candidate token lists.
- Documentation Update for Installation Method: This pull request updates the documentation by adding information about the installation method using apt/deb, as indicated by the commit message and title, and is currently open and not yet merged.
- Command-Line Argument for Learning Rate and Optimizer Type: This pull request introduces a command-line argument to the `finetune.cpp` file for specifying the learning rate and optimizer type. It adds support for the SGD optimizer alongside the default AdamW.
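The HIP version thresholds quoted in the HIPBLAS V2 item above are packed integers. Assuming the conventional `HIP_VERSION` encoding (major * 10000000 + minor * 100000 + patch), the short sketch below decodes the old and new thresholds, showing that the relaxed check admits ROCm 5.6 and later rather than only 7.x.

```python
# Sketch: decode the packed HIP version thresholds mentioned in the PR summary.
# Assumes the usual HIP_VERSION packing: major * 10_000_000 + minor * 100_000 + patch.
def decode_hip_version(packed: int) -> str:
    major, rest = divmod(packed, 10_000_000)
    minor, patch = divmod(rest, 100_000)
    return f"{major}.{minor}.{patch}"

for threshold in (70000000, 50600000):
    print(threshold, "->", decode_hip_version(threshold))
# Expected output (under the assumed encoding):
# 70000000 -> 7.0.0
# 50600000 -> 5.6.0
```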
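To make the prompt-processing-progress item above more concrete, here is a minimal, hypothetical client sketch for a streaming `/completion` request against a local llama-server. The server URL and the shape of the progress payload (the `prompt_progress` field with processed/total counts) are assumptions for illustration; the pull request defines the server-side types, and the actual JSON may differ.

```python
# Hypothetical sketch of a client consuming streamed progress from llama-server's
# /completion endpoint. The URL and the "prompt_progress" payload shape are
# assumptions for illustration only.
import json
import requests

payload = {
    "prompt": "Summarize the history of the printing press in two paragraphs.",
    "n_predict": 256,
    "stream": True,
}

with requests.post("http://127.0.0.1:8080/completion", json=payload,
                   stream=True, timeout=300) as resp:
    for raw_line in resp.iter_lines():
        if not raw_line or not raw_line.startswith(b"data: "):
            continue
        body = raw_line[len(b"data: "):]
        if body.strip() == b"[DONE]":  # some endpoints terminate the stream this way
            break
        event = json.loads(body)
        # Hypothetical progress events emitted while the prompt is being processed.
        if "prompt_progress" in event:
            done = event["prompt_progress"].get("processed", 0)
            total = event["prompt_progress"].get("total", 1)
            print(f"prompt processing: {done}/{total} tokens")
        # Regular generation chunks carry the sampled text.
        elif "content" in event:
            print(event["content"], end="", flush=True)
```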
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 46
Key Closed Pull Requests
1. U: This pull request involves the creation and multiple updates of various configuration and script files, such as `start.sh`, `render.yaml`, `c-cpp.yml`, `run-llama.yml`, `build1.yml`, `editorconfig.yml`, `main.yml`, and `Issa.yml`, aimed at setting up and refining the build and execution processes for the `llama.cpp` project, although it was ultimately not merged.
- URL: pull/14774
- Merged: No
- Associated Commits: cfd64, d82d8, 7b7d8, be0bf, b1bde, 2efaf, fc383, 06c60, 10bab, 0c440, 0be67, bcc85, 2d661, a16fc, a014b, 97ac4, 99cfe, 73ff9, c0ad2, b64f3, 6f3a7, 7f2bf, 4910e, 91f8e
2. Model: Add support for Ernie 4.5 MoE: This pull request adds support for the Ernie 4.5 MoE architecture to the project by implementing new conversion logic to generate a GGUF file with all necessary layers, addressing various code review suggestions, and fixing issues such as Flake errors, tensor mappings, and non-MoE regression, ultimately closing issue #14465.
- URL: pull/14658
- Merged: Yes
- Associated Commits: 8501c, 4a231, 056ab, 07a5c, bb23d, bd27e, 992d4, dde77, a387e, 950b4, 76748, 8d6ac, 35114, 542f3, 87b18, 075ff, e9f96
3. scripts: benchmark for HTTP server throughput: This pull request introduces a straightforward Python script designed to benchmark the throughput of the llama.cpp HTTP server, offering a simpler alternative to the existing `tools/server/bench` tool by reducing complexity in both installation and implementation, and includes example output and performance metrics such as request throughput and average prompt latency.
- URL: pull/14668
- Merged: Yes
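The merged pull request ships its own benchmark script; as a rough, stand-alone illustration of the same idea (not the code from the PR), the sketch below fires concurrent chat-completion requests at a local llama-server and reports request throughput and average latency. The URL, prompt, concurrency, and request count are placeholders.

```python
# Rough illustration of an HTTP throughput benchmark for llama-server
# (not the script added by the PR). URL, prompt, and request counts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"
N_REQUESTS = 32
CONCURRENCY = 8

def one_request(i: int) -> float:
    start = time.perf_counter()
    requests.post(URL, json={
        "messages": [{"role": "user", "content": f"Write a haiku about request {i}."}],
        "max_tokens": 64,
    }, timeout=300)
    return time.perf_counter() - start

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.perf_counter() - t0

print(f"requests/s: {N_REQUESTS / elapsed:.2f}")
print(f"avg latency: {sum(latencies) / len(latencies):.2f} s")
```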
Other Closed Pull Requests
- LoRA Layer Range Option: This pull request introduces a new `--lora-layer-range START END` option to the server, allowing users to specify the range of layers to which a LoRA (Low-Rank Adaptation) should be applied. It includes changes to the function signature of `llama_adapter_lora_init` and updates to the `README.md` for documentation.
- Vulkan Crossbuilds and Backend Fixes: This pull request addresses the issue of failing Vulkan crossbuilds by attempting multiple fixes before ultimately disabling the problematic builds. Additionally, it fixes a bug in the Vulkan backend by correcting the non-contiguous check for matrix multiplication ID splitting and other related issues.
- Logits Retrieval and Model Execution Enhancements: This pull request adds the ability to retrieve logits in the llama-context and addresses an issue with the Gemma3n model not being executed as a CUDA_GRAPH on NVIDIA GPUs. It includes performance improvements and suggestions for better handling of batched inference with CUDA Graphs.
- Documentation and Community Guidelines Updates: This pull request updates the `CONTRIBUTING.md` file to promote a more inclusive environment and reorganizes the Vulkan section in the `build.md` documentation. It reflects LunarG's discontinuation of the Ubuntu `vulkan-sdk` package and improves clarity for users.
- Architecture and Model Support Additions: This pull request adds support for the EXAONE4 architecture and introduces support for the Cosyvoice2-0.5B text-to-speech model. It includes integrating a fork from LiquidAI and enhancing the model's performance in llama.cpp.
- Script and Template Enhancements: This pull request adds a Jinja template for rwkv-world and makes the Hugging Face (HF) token optional for a script. It allows the script to function without the token unless accessing gated repositories.
- Quality of Life Improvements and Assertions: This pull request introduces assertions to improve the quality of life by helping developers notice floating-point range issues. It includes commits for adding asserts and fixing a constant type.
- MoE Model Optimization: This pull request optimizes MoE (Mixture of Experts) models by eliminating the need for large warm-up graphs. It relies on hot loading the experts for matrix multiplication to efficiently heat up the caches.
- Code Refactoring and Performance Improvements: This pull request refactors the `llamafile_sgemm` code by removing unnecessary templates and reducing deeply nested conditionals. It results in a performance improvement of approximately 2-7% in the Q8 model and 15-50% in the Q4 model.
- Build Warnings and Hotfixes: This pull request resolves build warnings related to unused variables in the CUDA implementation and addresses a hotfix for the non-DNNL codepath. It ensures correct behavior when not running DNNL.
- Server Benchmarks and Model Order Handling: This pull request extends `scripts/server-bench.py` for consistent server performance testing and addresses potential order mishaps when adding new models. It ensures pre-computed hashes are added first.
- Script Modifications and Tokenizer Checks: This pull request modifies the `gguf_dump.py` script to display bits per weight and addresses unnecessary checks for the tokenizer folder. It resolves problems related to `chkhsh` removals during download failures (a short sketch of the bits-per-weight calculation follows this list).
- Parallel Processing and Feature Logging: This pull request updates the shortconv component to support parallel processing and adds functionality to log the support status of the VK_KHR_bfloat16 feature. It addresses Issue #13274 and has been tested on a Windows system.
- Runtime Error Fixes and Server Bug Resolution: This pull request addresses a runtime error in the llama-cpp-python library on Windows and fixes a bug in the server by correcting the handling of the `ignore_eos` flag. It ensures compatibility and prevents the AttributeError.
- CUDA Function Refactoring: This pull request involves refactoring the CUDA set_rows and cpy.cu functions to support additional quantized data types. It moves cpy functions to a common header for reuse.
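For the bits-per-weight figure mentioned in the `gguf_dump.py` item above, a common way to estimate the metric is to divide the size of the tensor data in bits by the total parameter count. This is an assumption for illustration only and may not match the script's exact implementation.

```python
# Hedged sketch: a common way to estimate bits per weight (bpw) for a quantized
# model file. tensor_bytes and n_params are placeholders; the PR's exact
# implementation may differ (e.g. in how metadata is excluded).
def bits_per_weight(tensor_bytes: int, n_params: int) -> float:
    return tensor_bytes * 8 / n_params

# Example: roughly 4.1 GiB of tensor data for a 7.24e9-parameter model.
print(f"{bits_per_weight(int(4.1 * 1024**3), 7_240_000_000):.2f} bpw")
```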
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- doc: update CONTRIBUTING.md
- Toxicity Score: 0.55 (Reflective tone, Justification of behavior, Underlying tensions.)
- This GitHub conversation involves a user expressing concern over another user's behavior in the community, highlighting past incidents where the latter was warned for non-technical comments. The tone is reflective and somewhat defensive, as the user attempts to justify the actions of the other user while acknowledging the appropriateness of the punishment. The conversation hints at underlying tensions due to differing views on community guidelines and the severity of enforcement.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 107 | 20 | 0 | 47 |
| CISC | 43 | 9 | 0 | 77 |
| taronaeo | 73 | 1 | 0 | 2 |
| JohannesGaessler | 10 | 3 | 0 | 56 |
| jeffbolznv | 39 | 3 | 0 | 25 |
| chraac | 63 | 0 | 0 | 0 |
| am17an | 40 | 8 | 2 | 5 |
| ryan-mangeno | 33 | 0 | 1 | 1 |
| pwilkin | 18 | 2 | 0 | 9 |
| mitmul | 19 | 2 | 0 | 6 |