Weekly GitHub Report for Llama.cpp: June 30, 2025 - July 07, 2025 (12:05:55)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, though specific details are not provided in the data, so notable highlights or trends cannot be identified without additional information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be the issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: convert_hf_to_gguf.py not working on qwen3-embedding and qwen3-embedding lora tuned models: This issue involves a bug in the convert_hf_to_gguf.py script, which fails to convert the Qwen3-embedding and its LoRA-tuned models into the GGUF format. The problem arises due to an inability to map certain tensor names and a missing tokenizer model file, leading to errors during the conversion process.
  - The comments discuss a solution involving changes from a previous pull request to fix the conversion issue, which resolves the problem for the downloaded Qwen3-embedding but not for the swift-tuned version. A further bug related to the assumption of tied word embeddings is identified, and a code modification is suggested to address it, which successfully resolves the issue. Additionally, there is a discussion about implementing normalization for the model, with a note that some information is outdated and that reconversion is necessary for proper functionality.
  - Number of comments this week: 6
- Eval bug: Incoherence in Mistral 7B Q8_0 on Vulkan backend: This issue reports an incoherence in the output generated by the Mistral 7B Q8_0 model when using the Vulkan backend, specifically noting problems such as repetitions, missing spaces, and random letters after extended generation. The problem is linked to a specific commit and involves the use of the llama-cli tool on a Linux system with an RTX 3090 GPU, where the issue is reproducible under certain command-line parameters.
  - The comments discuss the reproducibility of the issue, potential causes related to broken quantizations, and differences from previously reported issues. There is a mention of a specific mode not being considered during fusion support, and a user plans to investigate further. Another user clarifies that the current issue is distinct from a similar one they reported earlier.
  - Number of comments this week: 6
- Feature Request: Support EXAONE 4.0: This issue is a feature request for adding support for the EXAONE 4.0 model architecture in the llama.cpp project, which would enable the provision of .GGUF files for model checkpoints to end users. The requester has provided a link to their implementation on Huggingface Transformers and is seeking the maintainers' consideration to integrate this support into GGUF-compatible libraries.
  - The comments discuss concerns about the restrictive nature of the EXAONE license, highlighting that it is a research-only license with LG maintaining control over the model and its outputs. There is skepticism about supporting companies that do not contribute to open source, and a comparison is made to a legal case involving Disney's arbitration clause, illustrating the potential implications of such licenses.
  - Number of comments this week: 4
- Compile bug: SYCL with OneAPI Toolkit 2025.2 & NixOS: This issue involves a compilation error encountered when using SYCL with the OneAPI Toolkit version 2025.2 on NixOS, specifically when trying to compile the llama.cpp project. The user reports that while they can compile other projects like whisper.cpp without SYCL, they face errors related to macro conflicts between SYCL headers and their glibc version during the compilation of llama.cpp.
  - The comments discuss potential causes and solutions for the compilation error, suggesting that the issue might be related to a macro conflict with the glibc version. A user tested OneAPI 2025.2 on Ubuntu without issues, indicating a possible environment-specific problem. Another user suggests downgrading the glibc version and mentions a related fix in another project, emphasizing the need for the ggml projects to be more version-agnostic to facilitate packaging on NixOS.
  - Number of comments this week: 3
- Eval bug: Gemma 3n on Vulkan on Ryzen APUs produces garbled output: This issue reports a bug where using the Gemma 3n model on Vulkan with Ryzen APUs results in garbled output, specifically when using certain quantization formats like Q8_0 and Q4_K_XL, while the Q4_K_M format does not exhibit the problem. The problem is observed on the integrated GPUs of Ryzen 5700G and 7840U, and it does not occur with other models or different GPUs, suggesting a potential issue with the Vulkan backend or specific operations like GET_ROWS.
  - The comments discuss attempts to bisect the issue without success, a workaround using -ot per_layer_token_embd.weight=CPU that resolves the problem, and detailed comparisons of debug outputs with and without the workaround, indicating a potential issue with the Vulkan implementation of the GET_ROWS operation for certain quantization formats.
  - Number of comments this week: 3
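Several of the reports above involve the GET_ROWS operation, which is how token-embedding lookups (gathering one embedding-table row per token id) are expressed in ggml. As background only, here is a minimal C++ sketch of a row gather over plain float data; it is not taken from the ggml source, and the real kernels additionally dequantize rows for quantized formats, which is where the backend-specific problems appear to lie.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Reference row gather: copy the selected rows of a row-major matrix,
// in the order given by ids. An embedding lookup is this operation with
// ids = token ids and matrix = the (often quantized) embedding table.
static std::vector<float> get_rows(const std::vector<float>& matrix,
                                   size_t n_cols,
                                   const std::vector<int32_t>& ids) {
    std::vector<float> out;
    out.reserve(ids.size() * n_cols);
    for (int32_t id : ids) {
        const float *row = matrix.data() + (size_t) id * n_cols;
        out.insert(out.end(), row, row + n_cols);
    }
    return out;
}

int main() {
    // 4 rows x 3 columns, stored row-major.
    std::vector<float> table = {0,0,0, 1,1,1, 2,2,2, 3,3,3};
    for (float v : get_rows(table, 3, {2, 0, 3})) std::printf("%g ", v);
    std::printf("\n"); // prints: 2 2 2 0 0 0 3 3 3
}
```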
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate and non-conflicting progress updates during the download process (see the sketch after this list).
- kubernetes example: This issue is about the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a widely used platform for deploying applications at scale. The issue has been open for a significant amount of time, and while initial work has been started, the contributor is seeking additional help from the community to advance the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the CUDA backend on a system with an NVIDIA GeForce RTX 3060 GPU, where the process fails due to a tensor type mismatch and an incorrect block size. The error occurs specifically when trying to load the model from a file, resulting in a failure to read tensor information and subsequently preventing the model from being loaded successfully.
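As context for the CURLOPT_NOPROGRESS item above, the following is a minimal, self-contained libcurl sketch showing how per-transfer progress callbacks are enabled by setting CURLOPT_NOPROGRESS to 0 and installing an xferinfo callback on each easy handle; it is not the llama.cpp downloader itself, and the URL and shard label are placeholders.

```cpp
#include <curl/curl.h>
#include <cstdio>

// Progress callback; libcurl only calls it while CURLOPT_NOPROGRESS is 0
// for the handle it is attached to. Returning non-zero aborts the transfer.
static int xferinfo(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                    curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char *label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::printf("\r[%s] %3.0f%%", label,
                    100.0 * (double) dlnow / (double) dltotal);
        std::fflush(stdout);
    }
    return 0;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h) return 1;

    const char *label = "shard 1/4"; // placeholder label for one parallel download
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/model-00001-of-00004.gguf");
    curl_easy_setopt(h, CURLOPT_NOPROGRESS, 0L);               // enable the progress callback
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, xferinfo);   // per-handle callback
    curl_easy_setopt(h, CURLOPT_XFERINFODATA, (void *) label); // lets each handle report separately

    CURLcode rc = curl_easy_perform(h);
    std::printf("\ndone: %s\n", curl_easy_strerror(rc));
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```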
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 20
Summarized Issues:
- CUDA Out-of-Memory Errors: This issue describes a bug where a process fails to exit due to an out-of-memory (OOM) error occurring on a CUDA device while running a command with specific parameters on a Linux system using the llama-box tool. The error prevents the process from completing successfully, indicating a need for better memory management or error handling in the tool.
- Script Conversion Failures: This issue involves a bug in the convert_hf_to_gguf.py script, which fails to convert Qwen3-embedding and Qwen3-embedding LoRA-tuned models to GGUF format due to unrecognized BPE pre-tokenizer and tensor mapping errors. Updates to the script and model files are required for successful conversion, highlighting the need for compatibility improvements.
- Compilation Errors: Several issues involve compilation errors in the llama.cpp project, including a zero-size array error in gemm_gemv_kernels, an invalid feature modifier 'sme' when creating a portable arm64 build, and undefined references to the std::filesystem library on Linux systems. These errors indicate challenges in maintaining cross-platform compatibility and proper linking configurations.
- Feature Requests for Model Support: Multiple feature requests highlight the need for supporting new model architectures like ERNIE 4.5 MoE, EXAONE 4.0, Huawei Pangu Pro 72B MoE, and GLM-4.1V-9B-Thinking in the llama.cpp project. These requests emphasize the community's interest in leveraging advanced models for improved performance and capabilities.
- Vulkan Backend Bugs: Several issues involve bugs in the Vulkan backend, such as incorrect outputs on Intel N150 GPUs and garbled output on Ryzen APUs when using specific models and configurations. These problems suggest potential issues with Vulkan/SYCL capabilities and quantization configurations not being properly supported.
- Feature Requests for Performance Enhancements: Feature requests for performance enhancements include implementing per-chat prompt caching and real batch processing for multiple image inputs, as well as improving image encoding speed on Mac M2 devices using Metal. These enhancements aim to optimize performance and efficiency in various use cases.
- Compilation and Build Process Issues: Compilation errors during the build process, such as missing "rocwmma/rocwmma.hpp" and undeclared variables in Vulkan support, highlight challenges in maintaining build stability and resolving dependencies. These issues require careful attention to build configurations and dependency management.
- Bugs in Model Implementations: Bugs in model implementations, such as the RWKV model crashing without a prompt option and the Llama 3.2 vocabulary missing a newline token, indicate areas where model handling and token recognition need improvement. These issues affect the reliability and accuracy of model outputs.
- Functionality and Output Issues: Issues with functionality and output, such as the mtmd_get_output_embd() function not returning the length of the embedding vector and the llama-server module outputting multiple vectors instead of a single pooled vector, highlight the need for better output management and function refactoring.
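As background for the pooled-vector item above, here is a minimal sketch of mean pooling a matrix of per-token embeddings into a single vector; it is a generic illustration only, not the llama-server code, whose pooling behavior depends on the configured pooling type.

```cpp
#include <cstdio>
#include <vector>

// Mean-pool per-token embeddings (n_tokens x n_embd, row-major)
// into a single n_embd-dimensional vector.
static std::vector<float> mean_pool(const std::vector<float>& token_embd,
                                    size_t n_tokens, size_t n_embd) {
    std::vector<float> pooled(n_embd, 0.0f);
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t i = 0; i < n_embd; ++i) {
            pooled[i] += token_embd[t * n_embd + i];
        }
    }
    for (float& v : pooled) {
        v /= (float) n_tokens;
    }
    return pooled;
}

int main() {
    // 3 tokens with 4-dimensional embeddings.
    std::vector<float> embd = {1,2,3,4,  3,2,1,0,  2,2,2,2};
    for (float v : mean_pool(embd, 3, 4)) std::printf("%g ", v); // prints: 2 2 2 2
    std::printf("\n");
}
```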
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 7
Summarized Issues:
- Bugs in llama.cpp related to device-specific issues: Several issues have been reported regarding bugs in llama.cpp when used with specific devices or configurations. One issue involves a bug with the aclnnMatmul function on 910B4 devices, causing incorrect output due to data range issues. Another issue describes a "Floating point exception" error on the OpenCL backend with MoE models, leading to crashes when processing long prompts.
- Conversion and format issues in llama.cpp: There are issues related to the conversion of models to different formats using llama.cpp scripts. A ValueError occurs during the conversion of a Qwen3-8B model to GGUF format due to the script's inability to map certain tensors. This is linked to unknown 'scales' and 'biases' tensors that affect the quantized tensors' range.
- Crashes and errors in llama applications: Various llama applications have been reported to crash or encounter errors under specific conditions. The llama-simple-chat application crashes with a "failed to decode" error after several interactions, likely due to exceeding the default context size. Similarly, the llama-server software randomly crashes during inference on AMD ROCm devices due to a memory status assertion failure.
- Discrepancies in output between llama.cpp and the transformers library: An issue has been identified where the llama.cpp server produces different values compared to the transformers library during reranking tasks. This discrepancy affects the consistency and reliability of the results generated by these tools.
- Assertion failures in llama-eval-callback tool: The llama-eval-callback tool has a bug where an assertion failure occurs when loading a model on a Mac with an M3 processor using the Metal backend. This issue was resolved by passing a prompt to the command line execution, which allowed the model to load successfully.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Feature Request: Support EXAONE 4.0
- Toxicity Score: 0.55 (Critical tone, Distrust towards licensing, External references for support.)
- This GitHub conversation involves multiple users discussing the implications of the EXAONE 4.0 license, with one user expressing skepticism about supporting a company perceived as not contributing to open source. The tone is critical and somewhat sarcastic, with references to external sources to support their points. The conversation includes a comparison to a legal case involving Disney, which adds a layer of skepticism and distrust towards the licensing terms.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 16
Key Open Pull Requests
1. llama: add initial support for Falcon-H1 model family: This pull request introduces initial support for the Falcon-H1 model family by implementing model loading, basic inference, and tokenizer integration, while also updating build scripts and documentation to ensure compatibility, and includes new test cases to verify the functionality; it supersedes a previous attempt with a cleaner and more modular implementation.
- URL: pull/14534
- Merged: No
- Associated Commits: 991de, f897e, 71a68, 03568, 0c93e, fdd5c, 14c37, 8bea9, 071f4, 50ead, a39a8, 1415c, 243e4, cce35, 22de6, 2fe05, d22b4, 6c7d9, 15138, a6d00, 1fd05, 250b4, 3ee79, 2aa48, 9760c, 7a254
2. ggml: Add initial WebGPU backend: This pull request introduces an initial implementation of a WebGPU backend for the ggml project, aimed at passing continuous integration tests and running basic matrix multiplication examples, while acknowledging current limitations and expressing a commitment to further development and collaboration.
- URL: pull/14521
- Merged: No
- Associated Commits: 63eb6, c0a81, e5033, b17b1, e7071, c9a53, 8b860, 9e0c6, 520f5, 2d24a, d0480, f8a53, 39d95, 3d924, d036f, b8a22, c09bf, aec34, daa58, 1c396, ecb94, 0f054, 2eb76, 949b8
3. Chore: batch prompts, extract tensors specific layer: This pull request involves a series of updates and improvements to the llama.cpp project, including batching prompts and extracting tensors for specific layers, adjusting the README, adding a feature to list possible layers for parsing and setting a maximum number of layers to offload to the GPU, as well as fixing issues related to includes for Ubuntu and saving tensors and prompts/outputs to different files.
- URL: pull/14463
- Merged: No
Other Open Pull Requests
- Narrowing Conversion Errors on 32-bit Platforms: This topic addresses the issue of narrowing conversion errors in the export-lora.cpp and clip.cpp files when building on 32-bit platforms. The pull request adds static_cast<size_t> to ensure type safety and silence warnings, improving build correctness without affecting functionality.
- Input Token Truncation in Llama-Server: The pull request introduces an option to allow input token truncation during the embedding task in the llama-server. It addresses the issue where the server stops if the input token length exceeds the available context slots and includes updates for documentation and linting.
- Server Web UI Presets: A new feature is introduced to the server web UI's settings dialog, enabling users to create and manage presets. This allows for quick and easy changes to settings configurations.
- OpenCL Kernel Performance Improvement: The introduction of a new mul_mat_f16_f32 kernel for OpenCL significantly improves performance through tiling and vectorization. The throughput increased from 19.24 to 168.17 t/s in the pp512 test on the Adreno 830.
- Server Prefix Mounting: This pull request introduces the capability to mount the server at a specified prefix, useful for scenarios where the server operates behind a reverse proxy on a non-root path. It includes commits that add a server prefix and correct the server path environment.
- Command-line Argument for Default WebUI Settings: A command-line argument is introduced to allow the llama-server to send locally-defined default JSON-encoded client-side webui settings. This provides users with flexibility to customize server deployments without modifying source code.
- Reuse of Computation Graphs: A feature to reuse computation graphs from previous micro-batches is introduced, enhancing performance by maintaining buffer allocations and updating graph parameters. It supports CPU and Metal backends and requires the LLAMA_SET_ROWS environment variable for activation.
- Matrix Multiplication Performance in Vulkan: Enhancements in matrix multiplication operations for a Vulkan-based project result in a 10-15% speed improvement on an RX 470 GPU. This is achieved by unpacking more values at a time for integer quantized matrices.
- MUSA SDK Upgrade: This pull request involves upgrading the MUSA SDK to a new version, incorporating changes to the mublas API. The commit is signed by Xiaodong Ye.
- Separation of K and V Buffers in KV Cache: A preparatory step is taken to support the separation of K and V buffers in the unified KV cache. This aims to enhance throughput for parallel decoding use cases without introducing functional changes.
- Loading Tokenized Data from Parquet Dataset: A feature is proposed to load already tokenized data from a Parquet dataset into the training process. Further enhancements for streaming and batching are needed but are considered more complex tasks.
- Support for bf16 and i32 in CUDA 'getrows' Function: Support for the bf16 and i32 data types is added to the 'getrows' function in the CUDA implementation. This is achieved by including the necessary case statements.
- CPU Detection Logic for AArch64 Platforms: Enhancements are made to the CPU detection logic for Linux on AArch64 platforms. The features identify and prefer high-performance "big" cores in hybrid big.LITTLE architectures to optimize computational performance.
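The CPU-detection item above lends itself to a small illustration. The sketch below takes one possible, Linux-only approach: it reads the kernel's per-core cpu_capacity attribute from sysfs (exposed on many asymmetric ARM systems) and treats the cores with the highest capacity as the "big" cores. This is an assumption-laden sketch, not the code from the pull request, whose actual detection logic may differ.

```cpp
#include <unistd.h>

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Read one integer sysfs attribute for a given CPU index; returns -1 if it is absent.
static long read_cpu_attr(int cpu, const std::string& attr) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/" + attr);
    long value = -1;
    f >> value;
    return value;
}

int main() {
    const long n_cpu = sysconf(_SC_NPROCESSORS_CONF);
    std::vector<int> big_cores;
    long max_capacity = 0;
    for (int cpu = 0; cpu < n_cpu; ++cpu) {
        long cap = read_cpu_attr(cpu, "cpu_capacity");
        if (cap > max_capacity) {      // found a faster core class; restart the list
            max_capacity = cap;
            big_cores.clear();
        }
        if (cap > 0 && cap == max_capacity) {
            big_cores.push_back(cpu);
        }
    }
    std::printf("detected %zu big core(s) with relative capacity %ld\n",
                big_cores.size(), max_capacity);
    return 0;
}
```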
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 42
Key Closed Pull Requests
1. ggml : implement GEGLU_ERF and GEGLU_QUICK ops: This pull request implements the GEGLU_ERF and GEGLU_QUICK operations, which are complementary to other GLU operations and are used in the mtmd project, for all currently GLU-supported backends, with an initial exception for GEGLU_ERF in Vulkan due to a missing erf function, which was later added; it includes several commits addressing implementation, error fixes, and integration with OpenCL. A short reference sketch of these activations follows the key pull requests below.
- URL: pull/14445
- Merged: Yes
2. model : add support for apple/DiffuCoder-7B-cpGRPO: This pull request introduces initial support for Apple's new DiffuCoder model, including multiple commits to support the DiffuCoder/Dream architecture and a quick fix, with plans to upload the F16 gguf to a specified Hugging Face repository, although it was not merged.
- URL: pull/14502
- Merged: No
3. Remove redundant include path in CMakeLists.txt: This pull request involves the removal of a redundant include path from the CMakeLists.txt file for the ggml-cpu-feats target, specifically eliminating the parent directory '..' to streamline the include directories and avoid unnecessary paths.
- URL: pull/14452
- Merged: Yes
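For readers unfamiliar with the GEGLU_ERF and GEGLU_QUICK ops in item 1 above, the sketch below is a plain C++ reference for the usual GEGLU formulation: a GELU activation is applied to a "gate" half of each row and multiplied elementwise by the other half, with the ERF variant using the exact erf-based GELU and the QUICK variant using the common x * sigmoid(1.702 * x) approximation. It is an illustration only, not the ggml kernels, and ggml's exact tensor layout and split convention may differ.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Exact GELU via the error function (the basis of GEGLU_ERF).
static float gelu_erf(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// "Quick" GELU approximation (the basis of GEGLU_QUICK): x * sigmoid(1.702 * x).
static float gelu_quick(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}

// Reference GEGLU: split each row into a gate half and a value half,
// apply the activation to the gate, and multiply elementwise.
static std::vector<float> geglu(const std::vector<float>& row, float (*act)(float)) {
    const size_t half = row.size() / 2;
    std::vector<float> out(half);
    for (size_t i = 0; i < half; ++i) {
        out[i] = act(row[i]) * row[half + i];
    }
    return out;
}

int main() {
    std::vector<float> row = {-1.0f, 0.5f, 2.0f,   0.3f, 1.5f, -0.7f};
    for (float v : geglu(row, gelu_erf))   std::printf("erf:   %f\n", v);
    for (float v : geglu(row, gelu_quick)) std::printf("quick: %f\n", v);
}
```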
Other Closed Pull Requests
- ggml Synchronization and Enhancements: This topic covers the synchronization of the 'ggml' component, including renaming variables for clarity and adding new functions. The pull requests also address bug fixes and improvements in the ggml library, such as fixing mask dimensions and adding a version retrieval function.
- CUDA and Vulkan Backend Improvements: These pull requests introduce features like softmax broadcast and performance enhancements in the CUDA and Vulkan backends. They also address issues like packing constants and supporting new head sizes in Vulkan.
- Metal and SYCL Backend Fixes: These pull requests focus on disabling fast-math optimizations in the Metal backend and re-enabling the fp16 exponential function in the SYCL backend. They aim to resolve issues caused by compiler optimizations and ensure correct computation.
- OpenCL Backend Enhancements: The OpenCL backend sees improvements with the addition of new functions and mechanisms to prevent crashes. These pull requests also address test failures and ensure consistency with other backends.
- Vulkan Component Updates: These pull requests address various updates to the Vulkan component, including splitting large matrices and adding missing functionalities. They ensure the Vulkan backend operates efficiently within memory constraints.
- Project Infrastructure and Documentation: This topic includes updates to project infrastructure and documentation, such as adding Vulkan images to documentation and addressing buffer overflow prevention. These changes aim to improve project accessibility and stability.
- Callback and Error Handling Mechanisms: These pull requests introduce a callback mechanism for handling aborts and address error handling in memory contexts. They enhance the project's ability to manage errors and shutdown processes gracefully.
- Chat and Template Support: The project sees enhancements in chat functionality with the addition of Jinja-based templates and fixes for context management. These changes improve the flexibility and reliability of chat features.
- Miscellaneous Bug Fixes: Various bug fixes are addressed, including issues with Gemma 3n conversion and disabling specific tests. These pull requests ensure the project runs smoothly by resolving reported problems.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 151 | 19 | 0 | 63 |
| CISC | 66 | 10 | 0 | 51 |
| taronaeo | 108 | 1 | 0 | 2 |
| gabe-l-hart | 60 | 1 | 1 | 1 |
| am17an | 45 | 8 | 1 | 8 |
| jeffbolznv | 33 | 6 | 0 | 23 |
| ngxson | 15 | 3 | 0 | 26 |
| JohannesGaessler | 6 | 4 | 0 | 24 |
| slaren | 10 | 1 | 0 | 19 |
| compilade | 12 | 1 | 0 | 13 |