Weekly GitHub Report for Llama.cpp: July 21, 2025 - July 28, 2025 (12:05:30)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and user experience, reflecting a continued focus on stability and feature improvements. Notable highlights include optimized system responsiveness and refined interface elements.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: llama-quant "found point XXX not on grid: XXXX": This issue reports a bug encountered when quantizing the Qwen3 Mixture of Experts (MoE) model using llama-quantize, where the process fails with an error indicating a point not found on the quantization grid. The user is uncertain whether the problem originates from the Qwen model files or from the llama.cpp quantization implementation and seeks clarification on the root cause.
- The comments discuss possible workarounds such as using a different quantization type (Q4_K_M), confirm that the issue also affects other Qwen3 MoE GGUF files leading to the deletion of broken files, and note that the problem appears specific to MoE models rather than smaller Qwen3 variants; it is concluded that the bug is inherent to MoE quantization rather than related to recent imatrix changes.
- Number of comments this week: 5
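The Q4_K_M workaround mentioned in the comments boils down to choosing a different quantization type when invoking llama-quantize. As a minimal sketch (the binary path and GGUF file names below are hypothetical and depend on how llama.cpp was built), a scripted re-quantization could look like this:

```python
import subprocess
from pathlib import Path

# Hypothetical paths; adjust to your build directory and model files.
LLAMA_QUANTIZE = Path("./build/bin/llama-quantize")
SRC_GGUF = Path("Qwen3-MoE-F16.gguf")
DST_GGUF = Path("Qwen3-MoE-Q4_K_M.gguf")

# llama-quantize takes: <input.gguf> <output.gguf> <type>
# Q4_K_M is the type reported in the discussion as avoiding the
# "found point not on grid" failure seen with other types.
result = subprocess.run(
    [str(LLAMA_QUANTIZE), str(SRC_GGUF), str(DST_GGUF), "Q4_K_M"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print("quantization failed:\n", result.stderr)
else:
    print("wrote", DST_GGUF)
```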
- ggml_vulkan: device Vulkan0 does not support 16-bit storage.: This issue reports that the Vulkan backend of the ggml library fails to run on an Intel HD Graphics 4400 GPU because the device does not support 16-bit storage, a requirement for the Vulkan backend to function. The user notes that earlier versions of the Vulkan backend worked on this GPU, but after updates, the backend stopped working, resulting in model loading errors and warnings about incomplete Vulkan support on Haswell hardware.
- The discussion confirms that 16-bit storage support has always been a requirement for the Vulkan backend, and the lack of this feature on the Intel HD Graphics 4400 GPU is the root cause of the failure. The user provided detailed vulkaninfo output showing the device capabilities, and it was clarified that the issue likely stems from the GPU driver or hardware limitations rather than changes in the backend itself.
- Number of comments this week: 5
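For readers who want to check their own hardware against this requirement, the relevant capability appears in vulkaninfo output. The following rough sketch greps for the standard Vulkan names (it assumes vulkaninfo is installed and on PATH, and does no structured parsing):

```python
import subprocess

# Run vulkaninfo and look for the 16-bit storage extension and feature bit
# that the Vulkan backend requires. Parsing is deliberately naive.
output = subprocess.run(
    ["vulkaninfo"], capture_output=True, text=True, check=True
).stdout

has_extension = "VK_KHR_16bit_storage" in output

# The feature bit is storageBuffer16BitAccess; vulkaninfo prints it as
# "storageBuffer16BitAccess = true" (older builds may print "= 1").
feature_lines = [
    line.strip()
    for line in output.splitlines()
    if "storageBuffer16BitAccess" in line
]
supported = any("true" in line.lower() or "= 1" in line for line in feature_lines)

print("VK_KHR_16bit_storage extension advertised:", has_extension)
print("storageBuffer16BitAccess reported:", feature_lines or "not found")
print("16-bit storage usable:", supported)
```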
- Misc. bug: Regression in unified KV cache appears after llama.cpp release b5912 in b5913: This issue reports a regression introduced in the llama.cpp project between commits b5912 and b5913 that causes a crash when using the Python bindings (llama-cpp-python) due to an assertion failure related to sequence ID handling in the unified KV cache. The problem does not affect the CLI tool, which continues to work correctly, indicating that the regression likely involves ABI changes or new parameters (such as kv_unified) that the Python bindings have not yet incorporated.
- The discussion in the comments confirms that the Python bindings have not been updated to support the new kv_unified parameter introduced in the C-style API at commit b5913, causing the crash; users report the same issue, and maintainers recommend keeping the issue open until the bindings are updated to resolve the regression.
- Number of comments this week: 5
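The failure mode described here is a generic hazard for ctypes-style bindings: once the C library inserts a new field into a struct that the Python side mirrors, every field after the insertion point is read from the wrong offset. The sketch below illustrates only that general mechanism; the struct layout and field names are hypothetical and are not the actual llama.cpp or llama-cpp-python definitions:

```python
import ctypes

# Hypothetical "stale" Python-side mirror of a C parameter struct.
class ParamsOld(ctypes.Structure):
    _fields_ = [
        ("n_ctx", ctypes.c_int32),
        ("n_seq_max", ctypes.c_int32),
        # ... the C library later inserted a new boolean here ...
        ("flash_attn", ctypes.c_bool),
    ]

# What the C side looks like after the new field was added.
class ParamsNew(ctypes.Structure):
    _fields_ = [
        ("n_ctx", ctypes.c_int32),
        ("n_seq_max", ctypes.c_int32),
        ("kv_unified", ctypes.c_bool),   # newly added field
        ("flash_attn", ctypes.c_bool),
    ]

c_side = ParamsNew(n_ctx=4096, n_seq_max=1, kv_unified=False, flash_attn=True)

# Reinterpret the same bytes through the stale layout, as an out-of-date
# binding effectively does: fields past the insertion point shift.
stale_view = ParamsOld.from_buffer_copy(bytes(c_side))
print(stale_view.n_ctx, stale_view.n_seq_max, stale_view.flash_attn)
```

Reading flash_attn through the stale layout returns the value of the newly inserted field instead, which is the kind of silent mismatch that surfaces downstream as an assertion failure inside the library.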
- Misc. bug: Server cpp no image_data being used: This issue reports that the llama.cpp server does not appear to utilize the image_data parameter when sending prompts with images to the /completions endpoint, causing the image input to be ignored or misinterpreted. The user seeks urgent assistance because the expected image-based completions are inaccurate or missing, and the comments reveal confusion about the correct method to send image data, with a suggestion that the server currently supports image input only via the image_url field in the chat completions endpoint rather than image_data in the completions endpoint.
- Commenters confirm that image data is not processed through the /completions endpoint and that the image_data parameter is absent in the server code; they share examples showing inaccurate image descriptions when using image_data and clarify that the correct approach is to send images as base64-encoded URLs using the image_url field in the /v1/chat/completions endpoint, which works as intended.
- Number of comments this week: 4
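The working path the commenters describe is the OpenAI-compatible chat endpoint, with the image embedded as a base64 data URL in an image_url content part. A minimal sketch follows; the server address, image path, and the assumption that the server was started with multimodal support enabled are all placeholders to adapt:

```python
import base64
import requests

SERVER = "http://localhost:8080"   # assumed llama-server address
IMAGE_PATH = "example.jpg"          # hypothetical local image

with open(IMAGE_PATH, "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    # Image goes in image_url as a base64 data URL,
                    # not in an image_data field.
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
    "max_tokens": 256,
}

resp = requests.post(f"{SERVER}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```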
- Misc. bug: llama-server embedding endpoint returns vectors with just null values after a while: This issue describes a problem where the llama-server embedding endpoint initially returns valid embedding vectors but, after running for some time, starts returning vectors filled entirely with null values without any error messages in the logs. The user reports that restarting the server temporarily resolves the issue, but they are currently unable to reproduce the problem consistently or create a reliable test case to diagnose it further.
- The discussion in the comments focuses on whether out-of-memory (OOM) errors might be causing the issue, with the user confirming no OOM errors were found and ample free memory available. Suggestions were made to reduce the batch size and context window parameters, and the user is experimenting with disabling GPU layers and adjusting these settings to see if it affects the problem.
- Number of comments this week: 4
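Because the failure is silent, one way to catch it early is to poll the embedding endpoint periodically and flag all-null vectors. A rough monitoring sketch, assuming the server is reachable at the address below and exposes the OpenAI-compatible /v1/embeddings route (adjust to whichever embedding endpoint the server is actually configured to serve):

```python
import time
import requests

SERVER = "http://localhost:8080"   # assumed llama-server address
PROBE_TEXT = "health check sentence for the embedding endpoint"

def embedding_is_degenerate(vec) -> bool:
    """True if the vector is empty or every component is null/zero."""
    return not vec or all(v is None or v == 0.0 for v in vec)

while True:
    resp = requests.post(
        f"{SERVER}/v1/embeddings",
        json={"input": PROBE_TEXT},
        timeout=30,
    )
    resp.raise_for_status()
    vec = resp.json()["data"][0]["embedding"]
    if embedding_is_degenerate(vec):
        print("WARNING: embedding endpoint returned a null/zero vector")
    else:
        print(f"ok: {len(vec)}-dim embedding")
    time.sleep(60)
```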
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 483 days and highlights a discrepancy in behavior between different Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is looking for documented or known methods to produce debugger output similar to what is provided by Apple's Metal debugger in Xcode.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status indicators during parallel downloads.
- kubernetes example: This issue discusses the creation of a Kubernetes Helm chart for deploying the llama.cpp server, aiming to facilitate scalable application deployment within the community. The author has begun work on this example but seeks additional contributions and plans to continue development when time permits.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU and CUDA backend. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 22
Summarized Issues:
- Model output and response issues: Several issues report problems with model outputs, including mid-response cutoffs with unclosed tags, corrupted or nonsensical outputs, repeated token sequences, and embedding vectors turning null after extended use. These problems affect different models and backends, indicating instability or bugs in output generation and embedding handling.
- issues/14786, issues/14812, issues/14885, issues/14888
- Backend compatibility and runtime errors: Multiple issues describe failures or crashes related to specific backends such as CUDA, Vulkan, HIP, SYCL, and ROCm, including driver assertion failures, missing kernel errors, and silent exits without error messages. These indicate compatibility problems or bugs in backend implementations affecting model execution on various hardware and drivers.
- issues/14787, issues/14824, issues/14826, issues/14845, issues/14887
- Memory allocation and performance regressions: There are reports of out-of-memory errors despite sufficient VRAM, slow model loading for large models on Vulkan, and speed regressions with specific quantization options on CUDA. These issues highlight inefficiencies and bugs in memory management and performance optimization across different hardware and configurations.
- issues/14836, issues/14854, issues/14881
- Quantization and model conversion problems: Issues include fatal errors during quantization with messages about points not on grid and loss of visual fine-tuning effects when converting LoRA models to GGUF format. These problems suggest bugs or limitations in quantization tools and model conversion scripts.
- issues/14798, issues/14867
- Server and CLI tool functionality bugs: Problems reported include the llama-server silently exiting on Windows, failure to utilize image data in completions endpoint, and cline plugin errors preventing GGUF model loading in VS Code. These indicate bugs affecting usability and integration of server and CLI tools.
- issues/14807, issues/14826, issues/14866
- GPU resource management and multi-GPU issues: One issue describes inefficient GPU utilization where shared GPUs spike to 100% usage while exclusive GPUs remain idle when running multiple MultiGPU models simultaneously. This points to bugs in GPU scheduling or resource allocation.
- issues/14890
- Feature requests and model additions: Requests include adding an upload flag to llama-bench for submitting benchmark results and implementing the phi-3-M3-coder model from a Hugging Face architecture file. These reflect user desires for enhanced functionality and model support.
- issues/14791, issues/14846
- Python bindings regression: A regression causes crashes in Python bindings due to assertion failures related to sequence ID handling in the unified KV cache, while the CLI remains unaffected, indicating ABI or support mismatches.
- issues/14847
- Embedding output discrepancies: A significant difference in embedding outputs between two commits raises questions about expected behavior or potential bugs in embedding generation on Windows with CUDA.
- issues/14848
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 16
Summarized Issues:
- Crash and Evaluation Errors: Several issues describe crashes and errors during model evaluation and quantization, including a core dump caused by an out-of-bounds ring buffer when using reverse prompt without specifying --prompt (issue 14513), and a crash during low-bit quantization of the ERNIE 4.5 300B MoE model due to invalid weight values in F16 precision, resolved by switching to BF16 (issue 14788). These problems highlight stability challenges in handling specific model configurations and precision formats within llama.cpp.
- [issues/14513, issues/14788]
- Feature Requests for Model Support and Output: There are requests to enhance llama.cpp with new features such as built-in log probability output for tokens to aid confidence estimation and calibration (issue 14611), support for the Kimi K2 open-source model including handling expert count limits and tokenizer files (issue 14642), and direct FP8 to q8 gguf conversion in the conversion script to avoid intermediate BF16 steps (issue 14762). These requests aim to improve usability and expand model compatibility.
- [issues/14611, issues/14642, issues/14762]
- Benchmarking Upload Feature: Two issues request adding a --upload flag to the llama-bench tool to allow users to submit benchmarking results with system and model details to a public repository, facilitating a transparent and decentralized hardware-throughput leaderboard for llama.cpp (issues 14792, 14793). This feature would enhance community data sharing and performance comparison.
- [issues/14792, issues/14793]
- GPU and Backend Bugs: Multiple issues report bugs related to GPU usage and backend implementations, including gibberish output caused by LLAMA_SET_ROWS=1 in dual GPU offload due to graph reuse and pipeline parallelism bugs in CUDA (issue 14795), a sporadic IM2COL test failure on CUDA and Vulkan backends when kernel dimensions differ (issue 14777), and a performance regression in multi-GPU token generation caused by a removed backend scheduler reset (issue 14863). These highlight challenges in parallelism and backend stability.
- [issues/14795, issues/14777, issues/14863]
- Server and Environment Failures: Issues report environment-related failures such as the llama-server failing to start in a CUDA Docker image due to a corrupted libcurl shared library (issue 14813), and the server returning HTTP 503 errors on Mac OS despite successful model loading, indicating runtime instability in certain configurations (issue 14829). These problems affect deployment and usability in specific environments.
- [issues/14813, issues/14829]
- Compilation and Codebase Issues: A compilation failure in the CUDA source file convert.cu caused by ambiguous assignment operators between __half and nv_bfloat16 types leads to errors in CUDA's cuda_bf16.hpp header, resolved by reverting to an earlier commit (issue 14834). This points to challenges in maintaining compatibility with CUDA's evolving type system.
- [issues/14834]
- Model Output and Performance Regressions: After a specific commit, a model produces infinite repetitions of the word "and" instead of coherent output on Apple M2 Metal backend (issue 14835), and another issue reports a significant generation speed and power draw regression on RTX 5090 GPUs after updating beyond a certain version, later fixed by a pull request (issue 14876). These indicate regressions affecting output quality and performance.
- [issues/14835, issues/14876]
- Instruction Set and Architecture Issues: Enabling the GGML_NNPA SIMD instruction set on s390x architecture causes inconsistent gibberish token generation at higher thread counts, leading to a proposal to disable this feature by default until fixed (issue 14877). This reflects architecture-specific stability concerns.
- [issues/14877]
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 20
Key Open Pull Requests
1. SvelteKit-based WebUI: This pull request introduces a new SvelteKit-based web user interface for the project, featuring initial setup with Storybook and ShadCN, theme mode auto-switching, a development script for streamlined workflow, static site generation for improved deployment, and a basic chat UI with various UI and architectural enhancements.
- URL: pull/14839
- Merged: No
- Associated Commits: a9aad, efc05, 6e6c0, 2fe2c, 3150f, e334b, ec28b, b6e7e, fa3dc, 2e747, d1112, 9d496, 896e4, da8fb, 0c80b, 69ef3, 82f1b, db729, a6d85, a555a, 4fb2c, d5ccf, f9f06
2. model: add hunyuan dense: This pull request introduces support for the hunyuan_dense model, fixes the hunyuan_moe chat template, and includes various updates and corrections related to hunyuan model versions and chat functionalities.
- URL: pull/14878
- Merged: No
3. test-backend-ops: enables perf/eval testing of composite ops: This pull request adds support in the test-backend-ops framework for performance and evaluation testing of composite computation graphs, enabling comparison of outputs and performance between fused operations, indirect implementations, and regular ops across different backends, thereby facilitating correctness checks and performance measurements of newly developed or composite operations.
- URL: pull/14833
- Merged: No
Other Open Pull Requests
- Model support and integration: Multiple pull requests add support for new models and templates, including the Granite chat template model, the Voxtral model for mtmd functionality, and the internlm/Intern-S1 model. These additions expand the framework's compatibility and provide necessary build and conversion instructions for seamless integration.
- Quantization and tensor operations in SYCL and Vulkan backends: Several pull requests enhance quantization support by adding GGML_OP_SET_ROWS for quantized tensor types, refactoring 8-bit quantization kernels for SYCL, and fixing Vulkan backend issues related to empty set_rows calls. These changes improve performance, maintainability, and correctness of quantization and tensor operations across backends.
- Performance optimizations and kernel improvements: Pull requests introduce fp16 support in Vulkan conv_2d kernel, a new tiled matrix multiplication variant for OpenCL targeting Adreno 830 GPUs, and graph processing optimizations for recurrent and hybrid models. These updates enhance execution speed and accuracy on various hardware platforms.
- pull/14872, pull/14809, pull/14825
- Model accuracy and consistency fixes: Fixes include correcting the scaling factor in the PLaMo2 model's attention layers and aligning default values between conversion scripts and modeling code. These adjustments significantly improve output quality and maintain consistency across related components.
- Benchmarking and testing enhancements: Updates extend test case filtering to support multiple operations and full variation strings, add functionality to upload benchmark results with timestamped filenames, and synchronize the ggml submodule including formatting improvements. These changes facilitate more precise testing and result management.
- Security improvements: A pull request addresses a security vulnerability in the RPC server by replacing raw memory pointers with opaque random IDs, preventing potential exploits that could bypass ASLR protections (a general sketch of this handle-table technique follows this list). This enhances the overall security posture of the project.
- Data format and analysis improvements: Changes include setting GGUF as the default output format for imatrix to improve MoE model accuracy and introducing methods to calculate activation-based statistics for the GGUF imatrix format. These updates improve data handling and provide deeper insights into tensor transformations.
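As noted in the security item above, here is a general sketch of the handle-table technique: the server hands out opaque random IDs and keeps the mapping to real resources private, so clients never see raw addresses. The names and structure are illustrative only and do not reflect the actual llama.cpp RPC code:

```python
import os
from typing import Dict

class BufferRegistry:
    """Maps opaque random 64-bit handles to server-side buffers, so the
    wire protocol never exposes real memory addresses."""

    def __init__(self) -> None:
        self._buffers: Dict[int, bytearray] = {}

    def register(self, buf: bytearray) -> int:
        # Draw a random 64-bit handle; retry on the (unlikely) collision.
        while True:
            handle = int.from_bytes(os.urandom(8), "little")
            if handle not in self._buffers:
                self._buffers[handle] = buf
                return handle

    def resolve(self, handle: int) -> bytearray:
        # Unknown handles are rejected instead of being dereferenced blindly.
        try:
            return self._buffers[handle]
        except KeyError:
            raise ValueError(f"unknown buffer handle {handle:#x}") from None

    def release(self, handle: int) -> None:
        self._buffers.pop(handle, None)

registry = BufferRegistry()
h = registry.register(bytearray(1024))
print(f"client sees only the opaque handle {h:#x}, never a pointer")
assert len(registry.resolve(h)) == 1024
```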
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 57
Key Closed Pull Requests
1. ggml-cpu: disable GGML_NNPA by default due to instability: This pull request proposes disabling the GGML_NNPA feature by default due to its instability and updates the s390x build documentation accordingly to reflect this change.
- URL: pull/14879
- Merged: No
- Associated Commits: e086c, 8410b, a2cdf, ae77d, 549f9, f0409, 120ad, e77f2, 9e500, 0dd3c, 1e545, 4c94f, 888b7, 45fc0, 10a67, 44d48, 9b512, 1e558, ef619, bd3c2, e0f26, 90916, 7473a, a3ddd, 9db97, 5ad02, bd060, 7234b, e84b9, 63b42, 6286a, 07a49, c1d4f, 7c5ca, 4601f, 79025, 45c2c, caaeb, a1220, 328ed, 092c1, a6357, 2177c, 412f4, c1eea
2. Adding a simple-function-call example - hopefully not doing anything wrong: This pull request proposes adding a simple-function-call example to the project, including a new examples directory, a C++ source file for the function call, updates to CMakeLists.txt, and accompanying README documentation, although it was not merged.
- URL: pull/14682
- Merged: No
- Associated Commits: 9d755, 52767, 25fcd, 3bbe7, 7158e, 65f3c, 0f3e6, d8bd3, a4951, 82915, 53212, 10253, ddbde, d0b04, 52bea, 08e90, 7a915, 72ce7
3. CUDA: add fused rms norm: This pull request introduces a fused RMS normalization operation for CUDA in the ggml project, aiming to improve performance by enabling fusion optimizations similar to those in the Vulkan backend, resulting in measurable speedups on RTX 3090 GPUs across various model tests.
- URL: pull/14800
- Merged: 2025-07-23T01:25:42Z
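For context, RMS normalization scales each row by the reciprocal of its root mean square, y = x / sqrt(mean(x^2) + eps), followed by a per-channel weight multiply; fusing the two means the tensor is traversed once instead of twice. A plain NumPy sketch of the unfused reference math (for clarity only, not the CUDA kernel):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Unfused reference: normalize each row by its root mean square,
    then apply the per-channel weight (the multiply the CUDA PR fuses in)."""
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(4, 4096).astype(np.float32)   # hypothetical activations
w = np.ones(4096, dtype=np.float32)               # hypothetical norm weights
print(rms_norm(x, w).shape)                       # (4, 4096)
```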
Other Closed Pull Requests
- Type safety and build fixes: Multiple pull requests address build correctness and type safety by fixing 32-bit narrowing conversion errors with static_cast and updating build system integration including CMake configuration and submodule synchronization. These changes ensure compatibility across platforms and improve dependency management without affecting functionality.
- GPU kernel and backend performance improvements: Several pull requests enhance GPU kernel performance and compatibility by enabling Matrix cores with MFMA instructions, parallelizing Metal backend kernels using SIMD groups, adding fused operations to OpenCL, and fixing Metal backend operation fusion issues. These updates improve speed and correctness on AMD, Metal, and OpenCL platforms, benefiting models like Granite Four and Qwen2.5.
- CUDA and quantization fixes: Pull requests fix issues in CUDA implementations by extending dequantization kernels to support multiple sequences and non-contiguous inputs, removing unnecessary cublasLt linking, and adding BF16 copy operations and CONT support. These changes improve CUDA backend stability, performance, and expand tensor operation support.
- Documentation and README improvements: Multiple pull requests fix broken links, improve formatting and footnotes, update Vulkan and libcurl installation instructions, and correct the backends table in the README and build documentation. These updates enhance clarity, usability, and accuracy for users and developers.
- Bug fixes in memory and CLI components: Fixes include handling null layers in the recurrent memory key-value cache to prevent crashes in new models, correcting the im2col function for Vulkan and CUDA to handle kernel size properly, and resolving a crash in llama-cli caused by the --reverse-prompt option. These fixes improve stability and correctness in critical components.
- Build and platform support updates: Updates include disabling the unstable GGML_NNPA compile flag by default, lowering the HIP version requirement to support ROCm 5.x, and upgrading the MUSA SDK with mublas API changes. These changes improve build stability and extend platform compatibility.
- Matrix multiplication and weight format enhancements: Pull requests add KleidiAI acceleration for Q4_0 matrix multiplication with shared weight tensors and implement conversion of matrix multiplication weights into an Ascend-friendly nz format for the Ascend 310P. These improvements optimize performance in specific hardware scenarios.
- Lazy output reordering and token parsing features: One pull request implements lazy output reordering to defer swapping logits and embeddings until accessed, improving efficiency, while another adds a parse_special option to the /tokenize endpoint to control special token parsing behavior. These features enhance runtime efficiency and user control.
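Following up on the parse_special item above: the server's /tokenize endpoint accepts a JSON body with the text to tokenize, and the new flag controls whether special tokens in that text are parsed as tokens rather than treated as literal characters. A hedged sketch of exercising both settings (the server address is an assumption, and the exact response shape may vary between versions):

```python
import requests

SERVER = "http://localhost:8080"   # assumed llama-server address
text = "<|im_start|>user\nhello<|im_end|>"

for parse_special in (False, True):
    resp = requests.post(
        f"{SERVER}/tokenize",
        json={"content": text, "parse_special": parse_special},
        timeout=30,
    )
    resp.raise_for_status()
    tokens = resp.json().get("tokens", [])
    # With parse_special=True the chat-control markers should collapse into
    # single special-token IDs; with False they tokenize as plain text.
    print(f"parse_special={parse_special}: {len(tokens)} tokens")
```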
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ggerganov | 77 | 15 | 0 | 58 |
CISC | 35 | 4 | 0 | 93 |
JohannesGaessler | 17 | 4 | 0 | 57 |
am17an | 51 | 7 | 1 | 7 |
jeffbolznv | 29 | 4 | 0 | 10 |
chraac | 43 | 0 | 0 | 0 |
compilade | 13 | 3 | 0 | 20 |
ryan-mangeno | 33 | 0 | 1 | 1 |
ngxson | 23 | 3 | 0 | 8 |
mitmul | 23 | 2 | 0 | 6 |