Weekly GitHub Report for Llama.cpp: June 30, 2025 - July 07, 2025 (12:05:55)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, though specific details are not provided in the data, so notable highlights or trends cannot be identified without additional information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be the issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: convert_hf_to_gguf.py not working on qwen3-embedding and qwen3-embedding lora tuned models: This issue involves a bug in the convert_hf_to_gguf.py script, which fails to convert the Qwen3-embedding and its LoRA-tuned models into the GGUF format. The problem arises due to an inability to map certain tensor names and a missing tokenizer model file, leading to errors during the conversion process.
  - The comments discuss a solution involving changes from a previous pull request to fix the conversion issue, which resolves the problem for the downloaded Qwen3-embedding but not for the swift-tuned version. A further bug related to the assumption of tied word embeddings is identified, and a code modification is suggested to address it, which successfully resolves the issue. Additionally, there is a discussion about implementing normalization for the model, with a note that some information is outdated and that reconversion is necessary for proper functionality.
  - Number of comments this week: 6
- Eval bug: Incoherence in Mistral 7B Q8_0 on Vulkan backend: This issue reports an incoherence in the output generated by the Mistral 7B Q8_0 model when using the Vulkan backend, specifically noting problems such as repetitions, missing spaces, and random letters after extended generation. The problem is linked to a specific commit and involves the use of the llama-cli tool on a Linux system with an RTX 3090 GPU, where the issue is reproducible under certain command-line parameters.
  - The comments discuss the reproducibility of the issue, potential causes related to broken quantizations, and differences from previously reported issues. There is a mention of a specific mode not being considered during fusion support, and a user plans to investigate further. Another user clarifies that the current issue is distinct from a similar one they reported earlier.
  - Number of comments this week: 6
- Feature Request: Support EXAONE 4.0: This issue is a feature request for adding support for the EXAONE 4.0 model architecture in the llama.cpp project, which would enable the provision of .GGUF files for model checkpoints to end users. The requester has provided a link to their implementation on Huggingface Transformers and is seeking the maintainers' consideration to integrate this support into GGUF-compatible libraries.
  - The comments discuss concerns about the restrictive nature of the EXAONE license, highlighting that it is a research-only license with LG maintaining control over the model and its outputs. There is skepticism about supporting companies that do not contribute to open source, and a comparison is made to a legal case involving Disney's arbitration clause, illustrating the potential implications of such licenses.
  - Number of comments this week: 4
- Compile bug: SYCL with OneAPI Toolkit 2025.2 & NixOS: This issue involves a compilation error encountered when using SYCL with the OneAPI Toolkit version 2025.2 on NixOS, specifically when trying to compile the llama.cpp project. The user reports that while they can compile other projects like whisper.cpp without SYCL, they face errors related to macro conflicts between SYCL headers and their glibc version during the compilation of llama.cpp.
  - The comments discuss potential causes and solutions for the compilation error, suggesting that the issue might be related to a macro conflict with the glibc version. A user tested OneAPI 2025.2 on Ubuntu without issues, indicating a possible environment-specific problem. Another user suggests downgrading the glibc version and mentions a related fix in another project, emphasizing the need for the ggml projects to be more version-agnostic to facilitate packaging on NixOS.
  - Number of comments this week: 3
- Eval bug: Gemma 3n on Vulkan on Ryzen APUs produces garbled output: This issue reports a bug where using the Gemma 3n model on Vulkan with Ryzen APUs results in garbled output, specifically when using certain quantization formats like Q8_0 and Q4_K_XL, while the Q4_K_M format does not exhibit the problem. The problem is observed on the integrated GPUs of Ryzen 5700G and 7840U, and it does not occur with other models or different GPUs, suggesting a potential issue with the Vulkan backend or specific operations like GET_ROWS.
  - The comments discuss attempts to bisect the issue without success, a workaround using -ot per_layer_token_embd.weight=CPU that resolves the problem, and detailed comparisons of debug outputs with and without the workaround, indicating a potential issue with the Vulkan implementation of the GET_ROWS operation for certain quantization formats.
  - Number of comments this week: 3
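Several of the reports above involve the GET_ROWS operation, which is how token-embedding lookups (gathering one embedding-table row per token id) are expressed in ggml. As background only, here is a minimal C++ sketch of a row gather over plain float data; it is not taken from the ggml source, and the real kernels additionally dequantize rows for quantized formats, which is where the backend-specific problems appear to lie.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Reference row gather: copy the selected rows of a row-major matrix,
// in the order given by ids. An embedding lookup is this operation with
// ids = token ids and matrix = the (often quantized) embedding table.
static std::vector<float> get_rows(const std::vector<float>& matrix,
                                   size_t n_cols,
                                   const std::vector<int32_t>& ids) {
    std::vector<float> out;
    out.reserve(ids.size() * n_cols);
    for (int32_t id : ids) {
        const float *row = matrix.data() + (size_t) id * n_cols;
        out.insert(out.end(), row, row + n_cols);
    }
    return out;
}

int main() {
    // 4 rows x 3 columns, stored row-major.
    std::vector<float> table = {0,0,0, 1,1,1, 2,2,2, 3,3,3};
    for (float v : get_rows(table, 3, {2, 0, 3})) std::printf("%g ", v);
    std::printf("\n"); // prints: 2 2 2 0 0 0 3 3 3
}
```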
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate and non-conflicting progress updates during the download process (see the sketch after this list).
- kubernetes example: This issue is about the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a widely used platform for deploying applications at scale. The issue has been open for a significant amount of time, and while initial work has been started, the contributor is seeking additional help from the community to advance the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the CUDA backend on a system with an NVIDIA GeForce RTX 3060 GPU, where the process fails due to a tensor type mismatch and an incorrect block size. The error occurs specifically when trying to load the model from a file, resulting in a failure to read tensor information and subsequently preventing the model from being loaded successfully.
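As context for the CURLOPT_NOPROGRESS item above, the following is a minimal, self-contained libcurl sketch showing how per-transfer progress callbacks are enabled by setting CURLOPT_NOPROGRESS to 0 and installing an xferinfo callback on each easy handle; it is not the llama.cpp downloader itself, and the URL and shard label are placeholders.

```cpp
#include <curl/curl.h>
#include <cstdio>

// Progress callback; libcurl only calls it while CURLOPT_NOPROGRESS is 0
// for the handle it is attached to. Returning non-zero aborts the transfer.
static int xferinfo(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                    curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char *label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::printf("\r[%s] %3.0f%%", label,
                    100.0 * (double) dlnow / (double) dltotal);
        std::fflush(stdout);
    }
    return 0;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h) return 1;

    const char *label = "shard 1/4"; // placeholder label for one parallel download
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/model-00001-of-00004.gguf");
    curl_easy_setopt(h, CURLOPT_NOPROGRESS, 0L);               // enable the progress callback
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, xferinfo);   // per-handle callback
    curl_easy_setopt(h, CURLOPT_XFERINFODATA, (void *) label); // lets each handle report separately

    CURLcode rc = curl_easy_perform(h);
    std::printf("\ndone: %s\n", curl_easy_strerror(rc));
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```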
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 20
Summarized Issues:
- CUDA Out-of-Memory Errors: This issue describes a bug where a process fails to exit due to an out-of-memory (OOM) error occurring on a CUDA device while running a command with specific parameters on a Linux system using the llama-box tool. The error prevents the process from completing successfully, indicating a need for better memory management or error handling in the tool.
- Script Conversion Failures: This issue involves a bug in the convert_hf_to_gguf.py script, which fails to convert Qwen3-embedding and Qwen3-embedding LoRA-tuned models to GGUF format due to unrecognized BPE pre-tokenizer and tensor mapping errors. Updates to the script and model files are required for successful conversion, highlighting the need for compatibility improvements.
- Compilation Errors: Several issues involve compilation errors in the llama.cpp project, including a zero-size array error in gemm_gemv_kernels, an invalid feature modifier 'sme' when creating a portable arm64 build, and undefined references to the std::filesystem library on Linux systems. These errors indicate challenges in maintaining cross-platform compatibility and proper linking configurations.
- Feature Requests for Model Support: Multiple feature requests highlight the need for supporting new model architectures like ERNIE 4.5 MoE, EXAONE 4.0, Huawei Pangu Pro 72B MoE, and GLM-4.1V-9B-Thinking in the llama.cpp project. These requests emphasize the community's interest in leveraging advanced models for improved performance and capabilities.
- Vulkan Backend Bugs: Several issues involve bugs in the Vulkan backend, such as incorrect outputs on Intel N150 GPUs and garbled output on Ryzen APUs when using specific models and configurations. These problems suggest potential issues with Vulkan/SYCL capabilities and quantization configurations not being properly supported.
- Feature Requests for Performance Enhancements: Feature requests for performance enhancements include implementing per-chat prompt caching and real batch processing for multiple image inputs, as well as improving image encoding speed on Mac M2 devices using Metal. These enhancements aim to optimize performance and efficiency in various use cases.
- Compilation and Build Process Issues: Compilation errors during the build process, such as missing "rocwmma/rocwmma.hpp" and undeclared variables in Vulkan support, highlight challenges in maintaining build stability and resolving dependencies. These issues require careful attention to build configurations and dependency management.
- Bugs in Model Implementations: Bugs in model implementations, such as the RWKV model crashing without a prompt option and the Llama 3.2 vocabulary missing a newline token, indicate areas where model handling and token recognition need improvement. These issues affect the reliability and accuracy of model outputs.
- Functionality and Output Issues: Issues with functionality and output, such as the mtmd_get_output_embd() function not returning the length of the embedding vector and the llama-server module outputting multiple vectors instead of a single pooled vector, highlight the need for better output management and function refactoring.
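As background for the pooled-vector item above, here is a minimal sketch of mean pooling a matrix of per-token embeddings into a single vector; it is a generic illustration only, not the llama-server code, whose pooling behavior depends on the configured pooling type.

```cpp
#include <cstdio>
#include <vector>

// Mean-pool per-token embeddings (n_tokens x n_embd, row-major)
// into a single n_embd-dimensional vector.
static std::vector<float> mean_pool(const std::vector<float>& token_embd,
                                    size_t n_tokens, size_t n_embd) {
    std::vector<float> pooled(n_embd, 0.0f);
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t i = 0; i < n_embd; ++i) {
            pooled[i] += token_embd[t * n_embd + i];
        }
    }
    for (float& v : pooled) {
        v /= (float) n_tokens;
    }
    return pooled;
}

int main() {
    // 3 tokens with 4-dimensional embeddings.
    std::vector<float> embd = {1,2,3,4,  3,2,1,0,  2,2,2,2};
    for (float v : mean_pool(embd, 3, 4)) std::printf("%g ", v); // prints: 2 2 2 2
    std::printf("\n");
}
```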
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 7
Summarized Issues:
- Bugs in llama.cpp related to device-specific issues: Several issues have been reported regarding bugs in llama.cpp when used with specific devices or configurations. One issue involves a bug with the aclnnMatmul function on 910B4 devices, causing incorrect output due to data range issues. Another issue describes a "Floating point exception" error on the OpenCL backend with MoE models, leading to crashes when processing long prompts.
- Conversion and format issues in llama.cpp: There are issues related to the conversion of models to different formats using llama.cpp scripts. A ValueError occurs during the conversion of a Qwen3-8B model to GGUF format due to the script's inability to map certain tensors. This is linked to unknown 'scales' and 'biases' tensors that affect the quantized tensors' range.
- Crashes and errors in llama applications: Various llama applications have been reported to crash or encounter errors under specific conditions. The llama-simple-chat application crashes with a "failed to decode" error after several interactions, likely due to exceeding the default context size. Similarly, the llama-server software randomly crashes during inference on AMD ROCm devices due to a memory status assertion failure.
- Discrepancies in output between llama.cpp and the transformers library: An issue has been identified where the llama.cpp server produces different values compared to the transformers library during reranking tasks. This discrepancy affects the consistency and reliability of the results generated by these tools.
- Assertion failures in llama-eval-callback tool: The llama-eval-callback tool has a bug where an assertion failure occurs when loading a model on a Mac with an M3 processor using the Metal backend. This issue was resolved by passing a prompt to the command line execution, which allowed the model to load successfully.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Feature Request: Support EXAONE 4.0
- Toxicity Score: 0.55 (Critical tone, Distrust towards licensing, External references for support.)
- This GitHub conversation involves multiple users discussing the implications of the EXAONE 4.0 license, with one user expressing skepticism about supporting a company perceived as not contributing to open source. The tone is critical and somewhat sarcastic, with references to external sources to support their points. The conversation includes a comparison to a legal case involving Disney, which adds a layer of skepticism and distrust towards the licensing terms.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 16
Key Open Pull Requests
1. llama: add initial support for Falcon-H1 model family: This pull request introduces initial support for the Falcon-H1 model family by implementing model loading, basic inference, and tokenizer integration, while also updating build scripts and documentation to ensure compatibility, and includes new test cases to verify the functionality; it supersedes a previous attempt with a cleaner and more modular implementation.
- URL: pull/14534
- Merged: No
- Associated Commits: 991de, f897e, 71a68, 03568, 0c93e, fdd5c, 14c37, 8bea9, 071f4, 50ead, a39a8, 1415c, 243e4, cce35, 22de6, 2fe05, d22b4, 6c7d9, 15138, a6d00, 1fd05, 250b4, 3ee79, 2aa48, 9760c, 7a254
2. ggml: Add initial WebGPU backend: This pull request introduces an initial implementation of a WebGPU backend for the ggml project, aimed at passing continuous integration tests and running basic matrix multiplication examples, while acknowledging current limitations and expressing a commitment to further development and collaboration.
- URL: pull/14521
- Merged: No
- Associated Commits: 63eb6, c0a81, e5033, b17b1, e7071, c9a53, 8b860, 9e0c6, 520f5, 2d24a, d0480, f8a53, 39d95, 3d924, d036f, b8a22, c09bf, aec34, daa58, 1c396, ecb94, 0f054, 2eb76, 949b8
3. Chore: batch prompts, extract tensors specific layer: This pull request involves a series of updates and improvements to the llama.cpp project, including batching prompts and extracting tensors for specific layers, adjusting the README, adding a feature to list possible layers for parsing and setting a maximum number of layers to offload to the GPU, as well as fixing issues related to includes for Ubuntu and saving tensors and prompts/outputs to different files.
- URL: pull/14463
- Merged: No
Other Open Pull Requests
- Narrowing Conversion Errors on 32-bit Platforms: This topic addresses the issue of narrowing conversion errors in the export-lora.cpp and clip.cpp files when building on 32-bit platforms. The pull request adds static_cast<size_t> to ensure type safety and silence warnings, improving build correctness without affecting functionality.
- Input Token Truncation in Llama-Server: The pull request introduces an option to allow input token truncation during the embedding task in the llama-server. It addresses the issue where the server stops if the input token length exceeds the available context slots and includes updates for documentation and linting.
- Server Web UI Presets: A new feature is introduced to the server web UI's settings dialog, enabling users to create and manage presets. This allows for quick and easy changes to settings configurations.
- OpenCL Kernel Performance Improvement: The introduction of a new mul_mat_f16_f32 kernel for OpenCL significantly improves performance through tiling and vectorization. The throughput increased from 19.24 to 168.17 t/s in the pp512 test on the Adreno 830.
- Server Prefix Mounting: This pull request introduces the capability to mount the server at a specified prefix, useful for scenarios where the server operates behind a reverse proxy on a non-root path. It includes commits that add a server prefix and correct the server path environment.
- Command-line Argument for Default WebUI Settings: A command-line argument is introduced to allow the llama-server to send locally-defined default JSON-encoded client-side webui settings. This provides users with flexibility to customize server deployments without modifying source code.
- Reuse of Computation Graphs: A feature to reuse computation graphs from previous micro-batches is introduced, enhancing performance by maintaining buffer allocations and updating graph parameters. It supports CPU and Metal backends and requires the LLAMA_SET_ROWS environment variable for activation.
- Matrix Multiplication Performance in Vulkan: Enhancements in matrix multiplication operations for a Vulkan-based project result in a 10-15% speed improvement on an RX 470 GPU. This is achieved by unpacking more values at a time for integer quantized matrices.
- MUSA SDK Upgrade: This pull request involves upgrading the MUSA SDK to a new version, incorporating changes to the mublas API. The commit is signed by Xiaodong Ye.
- Separation of K and V Buffers in KV Cache: A preparatory step is taken to support the separation of K and V buffers in the unified KV cache. This aims to enhance throughput for parallel decoding use cases without introducing functional changes.
- Loading Tokenized Data from Parquet Dataset: A feature is proposed to load already tokenized data from a Parquet dataset into the training process. Further enhancements for streaming and batching are needed but are considered more complex tasks.
- Support for bf16 and i32 in CUDA 'getrows' Function: Support for the bf16 and i32 data types is added to the 'getrows' function in the CUDA implementation. This is achieved by including the necessary case statements.
- CPU Detection Logic for AArch64 Platforms: Enhancements are made to the CPU detection logic for Linux on AArch64 platforms. The features identify and prefer high-performance "big" cores in hybrid big.LITTLE architectures to optimize computational performance.
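The CPU-detection item above lends itself to a small illustration. The sketch below takes one possible, Linux-only approach: it reads the kernel's per-core cpu_capacity attribute from sysfs (exposed on many asymmetric ARM systems) and treats the cores with the highest capacity as the "big" cores. This is an assumption-laden sketch, not the code from the pull request, whose actual detection logic may differ.

```cpp
#include <unistd.h>

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Read one integer sysfs attribute for a given CPU index; returns -1 if it is absent.
static long read_cpu_attr(int cpu, const std::string& attr) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/" + attr);
    long value = -1;
    f >> value;
    return value;
}

int main() {
    const long n_cpu = sysconf(_SC_NPROCESSORS_CONF);
    std::vector<int> big_cores;
    long max_capacity = 0;
    for (int cpu = 0; cpu < n_cpu; ++cpu) {
        long cap = read_cpu_attr(cpu, "cpu_capacity");
        if (cap > max_capacity) {      // found a faster core class; restart the list
            max_capacity = cap;
            big_cores.clear();
        }
        if (cap > 0 && cap == max_capacity) {
            big_cores.push_back(cpu);
        }
    }
    std::printf("detected %zu big core(s) with relative capacity %ld\n",
                big_cores.size(), max_capacity);
    return 0;
}
```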
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 42
Key Closed Pull Requests
1. ggml : implement GEGLU_ERF and GEGLU_QUICK ops: This pull request implements the GEGLU_ERF and GEGLU_QUICK operations, which are complementary to other GLU operations and are used in the mtmd project, for all currently GLU-supported backends, with an initial exception for GEGLU_ERF in Vulkan due to a missing erf function, which was later added; it includes several commits addressing implementation, error fixes, and integration with OpenCL. A short reference sketch of these activations follows the key pull requests below.
- URL: pull/14445
- Merged: Yes
2. model : add support for apple/DiffuCoder-7B-cpGRPO: This pull request introduces initial support for Apple's new DiffuCoder model, including multiple commits to support the DiffuCoder/Dream architecture and a quick fix, with plans to upload the F16 gguf to a specified Hugging Face repository, although it was not merged.
- URL: pull/14502
- Merged: No
3. Remove redundant include path in CMakeLists.txt: This pull request involves the removal of a redundant include path from the CMakeLists.txt file for the ggml-cpu-feats target, specifically eliminating the parent directory '..' to streamline the include directories and avoid unnecessary paths.
- URL: pull/14452
- Merged: Yes
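For readers unfamiliar with the GEGLU_ERF and GEGLU_QUICK ops in item 1 above, the sketch below is a plain C++ reference for the usual GEGLU formulation: a GELU activation is applied to a "gate" half of each row and multiplied elementwise by the other half, with the ERF variant using the exact erf-based GELU and the QUICK variant using the common x * sigmoid(1.702 * x) approximation. It is an illustration only, not the ggml kernels, and ggml's exact tensor layout and split convention may differ.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Exact GELU via the error function (the basis of GEGLU_ERF).
static float gelu_erf(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// "Quick" GELU approximation (the basis of GEGLU_QUICK): x * sigmoid(1.702 * x).
static float gelu_quick(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}

// Reference GEGLU: split each row into a gate half and a value half,
// apply the activation to the gate, and multiply elementwise.
static std::vector<float> geglu(const std::vector<float>& row, float (*act)(float)) {
    const size_t half = row.size() / 2;
    std::vector<float> out(half);
    for (size_t i = 0; i < half; ++i) {
        out[i] = act(row[i]) * row[half + i];
    }
    return out;
}

int main() {
    std::vector<float> row = {-1.0f, 0.5f, 2.0f,   0.3f, 1.5f, -0.7f};
    for (float v : geglu(row, gelu_erf))   std::printf("erf:   %f\n", v);
    for (float v : geglu(row, gelu_quick)) std::printf("quick: %f\n", v);
}
```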
Other Closed Pull Requests
- ggml Synchronization and Enhancements: This topic covers the synchronization of the 'ggml' component, including renaming variables for clarity and adding new functions. The pull requests also address bug fixes and improvements in the ggml library, such as fixing mask dimensions and adding a version retrieval function.
- CUDA and Vulkan Backend Improvements: These pull requests introduce features like softmax broadcast and performance enhancements in the CUDA and Vulkan backends. They also address issues like packing constants and supporting new head sizes in Vulkan.
- Metal and SYCL Backend Fixes: These pull requests focus on disabling fast-math optimizations in the Metal backend and re-enabling the fp16 exponential function in the SYCL backend. They aim to resolve issues caused by compiler optimizations and ensure correct computation.
- OpenCL Backend Enhancements: The OpenCL backend sees improvements with the addition of new functions and mechanisms to prevent crashes. These pull requests also address test failures and ensure consistency with other backends.
- Vulkan Component Updates: These pull requests address various updates to the Vulkan component, including splitting large matrices and adding missing functionalities. They ensure the Vulkan backend operates efficiently within memory constraints.
- Project Infrastructure and Documentation: This topic includes updates to project infrastructure and documentation, such as adding Vulkan images to documentation and addressing buffer overflow prevention. These changes aim to improve project accessibility and stability.
- Callback and Error Handling Mechanisms: These pull requests introduce a callback mechanism for handling aborts and address error handling in memory contexts. They enhance the project's ability to manage errors and shutdown processes gracefully.
- Chat and Template Support: The project sees enhancements in chat functionality with the addition of Jinja-based templates and fixes for context management. These changes improve the flexibility and reliability of chat features.
- Miscellaneous Bug Fixes: Various bug fixes are addressed, including issues with Gemma 3n conversion and disabling specific tests. These pull requests ensure the project runs smoothly by resolving reported problems.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 151 | 19 | 0 | 63 |
| CISC | 66 | 10 | 0 | 51 |
| taronaeo | 108 | 1 | 0 | 2 |
| gabe-l-hart | 60 | 1 | 1 | 1 |
| am17an | 45 | 8 | 1 | 8 |
| jeffbolznv | 33 | 6 | 0 | 23 |
| ngxson | 15 | 3 | 0 | 26 |
| JohannesGaessler | 6 | 4 | 0 | 24 |
| slaren | 10 | 1 | 0 | 19 |
| compilade | 12 | 1 | 0 | 13 |