Weekly GitHub Report for Llama.cpp: June 09, 2025 - June 16, 2025 (12:04:20)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
This release, created on March 29, 2025, presumably introduces updates and changes, but the release data provides no specific details, so notable highlights or trends cannot be identified without additional information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: Failure to allocate buffer with ROCm 6.4: This issue involves a failure to allocate a buffer when attempting to load a large model using ROCm 6.4 on a Radeon RX 7900 XT GPU, resulting in an out-of-memory error. The problem persists despite attempts to reduce the context size and rebuild the binary with different flags, and it appears to be specific to the user's setup, as others have reported success with similar configurations.
- The comments discuss various troubleshooting steps, including reducing context size, checking ROCm versions, and comparing performance with Vulkan. Some users suggest potential issues with the Docker container or specific system configurations, while others confirm that similar setups work for them, indicating the problem might be isolated to the original poster's environment.
- Number of comments this week: 12
- Misc. bug: Performance regression on aarch64 q4_0: This issue reports a performance regression in the llama-cli tool on the aarch64 architecture, specifically affecting the Q4_0 model after a particular commit. The problem is observed in the llama-bench module, where performance dropped significantly from 138.02 t/s to 15.84 t/s after the update.
- The comments discuss the need for additional logs to diagnose the issue, with users providing detailed build and runtime logs for different commits. A user confirms the regression is linked to the Android NDK toolchain and identifies a linking problem due to a broken build configuration. Suggestions for further investigation and potential fixes are shared, with some users expressing confidence in the original author's ability to resolve the issue.
- Number of comments this week: 11
- Vulkan Runner Frequent Crashing under workload: This issue involves frequent crashes of the Vulkan Runner under heavy workloads when used with Koboldcpp, Llama.cpp, LMStudio, and the Ollama-Vulkan Fork, particularly during model switching and large query processing. The problem results in the machine halting, the screen turning black, and Windows explorer.exe restarting, necessitating a restart of the applications or the entire system.
- The comments discuss similar experiences with Vulkan-related crashes, suggesting a potential driver issue, particularly with AMD's drivers. One commenter notes that the problem might be due to VRAM allocation limits, while another suggests that the issue could be a bug in AMD's driver rather than the Vulkan backend itself, which is considered mature. There is a suggestion to report the issue to AMD, as their Windows driver has known issues.
- Number of comments this week: 4
- Eval bug: (MAC) fail in `GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_FLASH_ATTN_EXT_Q8_0_H96, flash_attn_ext_q8_0_h96, has_simdgroup_mm);`: This issue reports a crash occurring on a Mac Studio when attempting to load a model using the Metal backend, specifically at the `GGML_METAL_ADD_KERNEL` function call related to `flash_attn_ext_q8_0_h96`. The problem seems to be associated with the use of the "flash_attn" feature, although the exact cause is unclear as the crash is not consistently reproducible.
- The comments discuss whether the issue can be reproduced with examples from the repository, with the original poster noting that the crash seems to occur when "flash_attn" is enabled. Further inquiries are made about the cleanliness of the repository copy and the exact command used, but the issue becomes non-reproducible, with observations indicating that the crash might occur during the initial model load on a Mac Studio.
- Number of comments this week: 4
- Misc. bug: --cache-reuse no longer seems to be caching prompt prefixes: This issue involves a bug in the llama.cpp project where the `--cache-reuse` parameter no longer seems to cache prompt prefixes, affecting the llama-server module on Linux systems. The problem appears to have been introduced between specific commits, as reverting to an older version resolves the caching issue.
- The comments discuss the need for reproduction steps using llama-server, with the original poster providing command details and confirming the presence of `--cache-reuse 1`. Another commenter suggests the parameter might not be relevant and points to a potential fix in a different pull request, while the original poster seeks clarification and offers to provide further reproduction steps if needed.
- Number of comments this week: 4
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The issue has been open for over 441 days, indicating a potentially complex problem that has yet to be resolved.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a project at Hugging Face. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is looking for any documented or known methods to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process.
- kubernetes example: This issue highlights the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a popular platform for managing containerized applications at scale. The author has initiated the development of this chart and is seeking community assistance to further progress the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the CUDA backend on a system with an NVIDIA GeForce RTX 3060 GPU, where the process fails due to a tensor type mismatch and an incorrect block size. The error occurs when the system tries to read tensor information from the model file, resulting in a failure to load the model, as indicated by the log output.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 27
Summarized Issues:
- KV Cache Functionality Issues: The KV cache functionality in the llama-server software has a bug starting from version b5554, where it fails to reduce the number of tokens processed in subsequent queries, leading to slower performance. This issue persists across different versions, operating systems, and hardware configurations.
- CUDA Backend Output Errors: A bug in `llama.cpp:server-cuda` causes the model to generate gibberish or repeated output when using the `-sm row` option, with the last GPU being pinned at 100% usage. This suggests a potential problem similar to a previously reported issue.
- Memory Allocation and Usage Problems: Several issues involve memory allocation errors, such as excessive memory usage when creating embeddings with a large context size or when setting certain parameters, leading to failures in buffer allocation. These problems are often linked to unnecessary batch size requirements or discrepancies in default behavior.
- Compilation and Linking Failures: Compilation failures occur on various systems, such as PowerPC and Gentoo, due to issues like unsupported features or undefined reference errors. These problems are often related to specific build configurations or system incompatibilities.
- CPU and Thread Utilization Issues: The Ollama Runner and llama-server modules face issues with CPU utilization, where only a single CPU is used despite multiple being allocated, leading to slower response times. This indicates inefficiencies in thread management.
- Server and Model Loading Bugs: Bugs in server tests and model loading processes cause failures unless manual delays are introduced or specific configurations are used. These issues highlight problems with readiness checks and model compatibility.
- Model Evaluation and Loading Errors: Errors occur during model evaluation and loading, such as mismatches in expected tensor counts or crashes during kernel execution. These issues are often related to backend compatibility or specific model settings.
- Multimodal Model Processing Issues: The Gemma3 model and other multimodal models face issues with memory slot allocation and prompt processing, leading to failures or hangs. These problems occur when running with multiple parallel slots or when specific tags are present in the input.
- Tool Call Parsing and Message Handling Bugs: The llama-server module has bugs in parsing multiple tool calls and handling multi-part content, resulting in only the first tool call being processed correctly and unexpected behavior in message processing.
- Script and Path Conversion Errors: The `convert_hf_to_gguf.py` script fails on Windows due to path separator conversion issues, causing the HF model ID to be unrecognized. This highlights problems with cross-platform compatibility in script execution.
- Vulkan Runner Crashes: The Vulkan Runner frequently crashes under heavy workloads, causing the machine to halt and restart. This is potentially due to driver crashes or memory overloads, particularly with AMD drivers.
- Prompt and Message Order Errors: The llama-server module incorrectly orders prompts for certain models, placing the assistant message before the system message. This results in prompts not being built correctly with all messages in the intended order.
- Metric Naming Convention Concerns: There is a concern regarding the naming convention of Prometheus metrics, which currently use a colon prefix. It is suggested to update them to follow Prometheus guidelines by replacing the colon with an underscore.
- Model Loading and Memory Management Research: Research is being conducted on the effects of memory-mapped file eviction when loading large models into GPU memory. A hypothesis suggests that loading weights in reverse inference order could improve load times.
- Model Support and Template Handling: There is a need to add support for the solar-10.7b-instruct model in llama-cli, as it currently results in a runtime error due to an unsupported custom template. A potential solution involves modifying the template handling code.
- Conversion Process Guidance: Guidance is sought for converting `prismatic-vlms` into `ggufs`, with a request for clarification on the conversion process to `mtmd`, including necessary files and structure.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 10
Summarized Issues:
- llama-server Module Bugs: The llama-server module experiences multiple issues, including a failure to start in a Docker environment on systems with AMD Radeon Graphics using ROCm, resulting in a CrashLoopBackOff state. Additionally, the server's REST API freezes when processing tasks, particularly with a batch of up to 64 strings for embeddings, which was resolved by a fix in a pull request.
- Compilation and Execution Errors: Various compilation and execution errors are reported, such as a ValueError due to a duplicated key during model conversion and a bug in Metal shader compilation on macOS due to character encoding issues. Additionally, the `test-chat` executable fails on x86 Windows builds, and the `hipcc` compiler is incorrectly executed with host C++ compiler flags, causing build errors on Linux.
- Model and Cache Handling Issues: The Qwen models lose their reasoning task capabilities by default, potentially due to recent changes, and the Cohere Command-A model fails to utilize the KV cache effectively, leading to full prompt re-processing. These issues suggest a need for adjustments in server options and cache handling mechanisms.
- Image Processing and Compilation Errors: The llama-server module encounters a bug when processing more than 10 images, resulting in a 500 error, which was initially mitigated by a workaround but requires a proper fix. Additionally, a compilation error occurs on Linux systems with the BLAS backend due to deprecated and incorrect references in the code.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 18
Key Open Pull Requests
1. ggml : implement REGLU/GEGLU/SWIGLU ops: This pull request aims to implement the REGLU, GEGLU, and SWIGLU operations in the ggml project to enhance efficiency by reducing unnecessary tensor duplications and combining operations, with initial support for CPU and CUDA, while seeking assistance to extend compatibility to other backends.
- URL: pull/14158
- Merged: No
- Associated Commits: c7171, 92943, 319c6, 6fe7e, 7e075, 1acd1, 5c581, f4be7, 56486, 4b7d4, d1d3f, e3d2b, 39eba, 98a50, 8dc1d, 95e4b
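To make the gating concrete: REGLU, GEGLU, and SWIGLU all multiply one half of an up-projected row by an activation of the other half, and fusing this into a single op avoids materializing the intermediate activation and product tensors. The following is a minimal standalone C++ sketch of that element-wise math, not the PR's ggml implementation; the split-in-half layout, which half acts as the gate, and the tanh-based GELU approximation are assumptions.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Gated linear units over a row laid out as [x | g] (split in half):
//   REGLU: x * relu(g), GEGLU: x * gelu(g), SWIGLU: x * silu(g).
static float relu_f(float v) { return v > 0.0f ? v : 0.0f; }
static float silu_f(float v) { return v / (1.0f + std::exp(-v)); }
static float gelu_f(float v) { // tanh approximation of GELU
    return 0.5f * v * (1.0f + std::tanh(0.7978845608f * (v + 0.044715f * v * v * v)));
}

enum class glu_op { reglu, geglu, swiglu };

// out[i] = x[i] * act(g[i]); computing this in one pass avoids two temporary
// tensors (the activation of the gate half and the element-wise product).
static std::vector<float> glu_row(const std::vector<float>& row, glu_op op) {
    const size_t n = row.size() / 2;
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float x = row[i];      // linear half
        const float g = row[n + i];  // gate half
        float a = 0.0f;
        switch (op) {
            case glu_op::reglu:  a = relu_f(g); break;
            case glu_op::geglu:  a = gelu_f(g); break;
            case glu_op::swiglu: a = silu_f(g); break;
        }
        out[i] = x * a;
    }
    return out;
}

int main() {
    const std::vector<float> row = {1.0f, -2.0f, 0.5f, 3.0f};  // n = 2: x = {1, -2}, g = {0.5, 3}
    for (float v : glu_row(row, glu_op::swiglu)) std::printf("%f\n", v);
    return 0;
}
```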
2. ggml: aarch64: Implement SVE Kernels for Int 8 Quantization: This pull request introduces SVE kernel support for Int8 quantization specific to the ARM architecture by implementing SVE intrinsics for the functions `quantize_row_q8_0()` and `dequantize_row_q8_0()`, achieving comparable performance to the baseline while maintaining model accuracy.
- URL: pull/14117
- Merged: No
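For context on what those SVE intrinsics vectorize, here is a plain scalar sketch of Q8_0-style quantization: values are grouped into blocks of 32, each block stores one scale d = max|x| / 127 and 32 rounded int8 values. This is a simplified illustration rather than the PR's kernel; in particular, ggml stores the scale as fp16, while a float is used here for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Simplified Q8_0-style block: one scale plus 32 signed 8-bit values.
struct block_q8_0_sketch {
    float  d;       // scale (ggml stores this as fp16; float used here for simplicity)
    int8_t qs[32];  // quantized values
};

// Quantize a row 32 values at a time: d = max|x| / 127, q = round(x / d).
std::vector<block_q8_0_sketch> quantize_row_q8_0_sketch(const float* x, size_t n) {
    std::vector<block_q8_0_sketch> blocks(n / 32);
    for (size_t b = 0; b < blocks.size(); ++b) {
        const float* xb = x + b * 32;
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        blocks[b].d = d;
        for (int i = 0; i < 32; ++i) {
            blocks[b].qs[i] = (int8_t) std::lrintf(xb[i] * id);
        }
    }
    return blocks;
}

// Dequantize back to floats: x = d * q. The SVE kernels perform the same math
// on whole vectors of lanes per instruction instead of one value at a time.
void dequantize_row_q8_0_sketch(const std::vector<block_q8_0_sketch>& blocks, float* y) {
    for (size_t b = 0; b < blocks.size(); ++b) {
        for (int i = 0; i < 32; ++i) y[b * 32 + i] = blocks[b].d * blocks[b].qs[i];
    }
}
```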
3. tests : add test-model-random: This pull request introduces a new testing framework that generates random models to evaluate the consistency of outputs across different batch concurrencies, aiming to detect issues such as broken recurrent caches and prevent accidental regressions by facilitating the testing of numerous edge cases on supported architectures without requiring model downloads. It starts with the Mamba architecture, with plans to expand to others.
- URL: pull/14139
- Merged: No
Other Open Pull Requests
- NeoBERT Model Integration: This topic involves integrating support for the NeoBERT model by converting it to the GGUF format and adding an inference graph. The pull request also addresses code quality issues based on reviewer feedback.
- Batch Processing Enhancement: The enhancement of batch processing is addressed by implementing automatic generation of input positions when missing. It also ensures the sanitization of input batches containing multiple sequences per token.
- Weak Aliasing Rework for Apple Targets: This topic covers the reworking of weak aliasing for Apple targets in the ggml-cpu component. It includes improvements to PowerPC detection on Darwin systems and fixes issue #14138.
- Model Metadata Quantization: The addition of a `uint` type to the `--override-kv` functionality in the llama-quantize tool is addressed. This change resolves errors related to model metadata parameters requiring an unsigned integer type.
- Arcee Foundation Model Support: Support for the upcoming Arcee Foundation Model (AFM) architecture is introduced, incorporating ReLU² activation in the MLP blocks. A draft update for the conversion script is also included, pending further review.
- Model Alias Presets Feature: A new feature is introduced to the server by adding model alias presets, including an option for specifying an `--alias-presets-file`. This addition is supported by a test and a README entry.
- Windows Remote Option Fix: This topic addresses a fix for the remote option in Windows and includes a refactor of the model command line argument. The argument is changed to a string rather than a Path.
- Mistral Model Chat Template: A new chat template for the Mistral-Small-3.1-24B-Instruct-2503 model is introduced, with tool calling support. It also addresses a bug related to a broken chat template for Mistral small models.
- Vulkan Shaders Compilation: The refinement of the ExternalProject_Add logic for vulkan-shaders-gen in CMake is aimed at addressing compilation issues. This is particularly relevant when using multi-configuration generators.
- Gated Linear Unit Implementation: The implementation of the Gated Linear Unit (GLU) for split up/gate functionality is introduced. It includes tests for the new feature and seeks input for adding support to Metal and Vulkan.
- Dots.llm1 Architecture Support: Support for the "dots.llm1" architecture is introduced by adding necessary constants and a new `Dots1Model`. It also includes a chat template for llama-server and references existing models and documentation.
- Vulkan Thread Safety Fix: A mutex is introduced around the `vkQueueSubmit` function in the Vulkan implementation. This change addresses and fixes a crash related to thread safety during testing; a minimal sketch of the pattern appears after this list.
- Server Logging Behavior Change: A change to the logging behavior of a server application is proposed, ensuring accurate logging of the socket file path. This prevents incorrect display of an HTTP URL and port when listening on a Unix domain socket.
- AMDGCN Macro Deprecation Workaround: A workaround for the deprecation of the `__AMDGCN_WAVEFRONT_SIZE__` macro is implemented. It involves guessing the wavefront size based on the GPU generation.
- ROCm Linux Build Process: The ROCm Linux build process is re-enabled while limiting the built targets to those supported by rocBLAS. This is indicated by the commit message and the absence of additional descriptive information.
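Returning to the Vulkan thread-safety fix above: the Vulkan specification requires external synchronization when multiple threads submit to the same VkQueue, so serializing submissions behind a mutex is the standard remedy. The snippet below is a minimal sketch of that pattern, not the pull request's actual code.

```cpp
#include <mutex>
#include <vulkan/vulkan.h>

// vkQueueSubmit on a given VkQueue must be externally synchronized, so all
// submissions from worker threads are funneled through a single mutex.
static std::mutex g_queue_mutex;

VkResult submit_locked(VkQueue queue, uint32_t submit_count,
                       const VkSubmitInfo* submits, VkFence fence) {
    std::lock_guard<std::mutex> lock(g_queue_mutex);
    return vkQueueSubmit(queue, submit_count, submits, fence);
}
```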
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 51
Key Closed Pull Requests
1. compare llama-bench: add option to plot: This pull request introduces a feature to the llama-bench project that allows users to generate graphs by varying one variable (defaulting to `n_depth`) while keeping others constant, and includes new script options `--plot` for specifying the plot location and `--plot_x` for setting the x-axis, along with several commits addressing review comments, adding matplotlib to requirements, fixing tests, and improving comments and test conditions.
- URL: pull/14169
- Merged: Yes
2. server: Experimental new speculative decoding algorithm: This pull request introduces an experimental speculative decoding algorithm for the llama.cpp project, aiming to optimize the marginal cost of adding tokens to a batch by implementing a power-law model for cost estimation, which allows users to adjust parameters for different models and use cases, potentially improving performance without needing to reload models for various tasks.
- URL: pull/14132
- Merged: No
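The pull request describes the idea only at a high level, so the following is a speculative sketch of how a fitted power-law batch-cost model could be used to choose a draft length: model the cost of decoding a batch of n tokens as c(n) = a + b·n^p, then keep adding draft tokens while their expected saving outweighs the marginal cost of growing the verification batch. The parameters, the acceptance model, and the stopping rule are all assumptions, not the PR's algorithm.

```cpp
#include <cmath>
#include <cstdio>

// Speculative sketch: cost of decoding a batch of n tokens modeled as a power law.
struct cost_model_sketch {
    double a;  // fixed per-batch overhead (assumed)
    double b;  // scale factor (assumed)
    double p;  // power-law exponent, typically between 0 and 1 (assumed)
    double cost(double n) const { return a + b * std::pow(n, p); }
};

// Grow the draft while the expected saving from the next draft token (the chance
// that it and all earlier draft tokens are accepted, times the cost of the
// standalone decode step it would replace) exceeds the marginal cost of adding
// one more token to the verification batch.
int choose_draft_length(const cost_model_sketch& m, double p_accept, int max_draft) {
    int n_draft = 0;
    const double saved_per_accept = m.cost(1);  // one avoided single-token decode
    while (n_draft < max_draft) {
        // verification batch holds n_draft + 1 tokens; adding a draft token grows it by one
        const double marginal = m.cost(n_draft + 2) - m.cost(n_draft + 1);
        const double expected_saving = std::pow(p_accept, n_draft + 1) * saved_per_accept;
        if (expected_saving <= marginal) break;
        ++n_draft;
    }
    return n_draft;
}

int main() {
    const cost_model_sketch m{1.0, 0.05, 0.8};
    std::printf("chosen draft length: %d\n", choose_draft_length(m, 0.7, 16));
    return 0;
}
```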
3. llama : support GEGLU for jina-bert-v2: This pull request introduces support for the GEGLU activation function in the `jina-bert-v2` models by cleaning up the conversion process, removing unnecessary components, and ensuring compatibility with both old and newly converted models, as well as testing the changes on CPU and CUDA with different configurations.
- URL: pull/14090
- Merged: Yes
Other Closed Pull Requests
- llama_batch_allocr Optimization: This topic involves reworking the `llama_batch_allocr` by moving input batch validation logic into it and persisting its state within `llama_context` to optimize memory allocation. It also includes fixing integer type inconsistencies, counting outputs, and preparing for a future refactor of `llama_ubatch` indexing, with plans to add multi-sequence validation and a `LLAMA_BATCH_DEBUG` environment variable.
- ARM Architecture Optimization: The GGML_CPU_ALL_VARIANTS feature is implemented for ARM architecture, introducing ARM-specific feature detection and scoring mechanisms within the build system. Multiple backend builds for different ARM versions are included to optimize performance, with testing conducted on a Graviton4 armv9.0-a processor to ensure correct loading and no regressions.
- Web UI Layout Fixes: This topic addresses layout issues in the web UI, such as infinite horizontal scrolling caused by long numbers and main content covering the sidebar. Solutions include implementing a Tailwind CSS class to wrap numbers and removing the `w-screen` class to ensure the main content only occupies the remaining space after the sidebar.
- KV-Cache and Defragmentation: This topic addresses and resolves issues related to the kv-cache, including fixing the shift and defragmentation logic. The pull requests correct the shift operation, reset shift indices, and prevent the defragmentation process from erasing cells that did not move.
- Metal Component Optimization: The FA kernel in the metal component is optimized by reducing stack memory usage and accumulating data into shared memory instead of registers. This is part of ongoing improvements referenced in issue #13975 and includes a fix for the BF16 variant.
- ggml Component Synchronization: Synchronizing the 'ggml' component involves adding an in-build 'ggml::ggml' ALIAS library for uniform linking with subprojects and using 'find_package'. It also includes a fix for a weak alias issue on Win32 systems.
- CMake Configuration Enhancements: Enhancements to the CMake configuration include handling whitespaces in file paths during the Metal build process and introducing the capability to specify LLAMA_BUILD_NUMBER and LLAMA_COMMIT. These changes ensure consistent variable passing to GGML_BUILD_* and address issue #14108.
- ROCm Support and Updates: This topic involves re-enabling ROCm container images and updating ROCm versions in GitHub Actions. It addresses a previous compilation issue and a problem with ROCm memory detection limiting models to approximately 600MiB of video memory.
- GeGLU Activation Function: The GeGLU activation function is added to the project to support models like ModernBert. This addition is part of a fix for the "geglu" component in the graph module, as a continuation of issue #14074.
- Error Messaging Improvements: Improvements to error messaging for RPC server crashes involve replacing the generic `GGML_ASSERT` macro with a new, RPC-server-specific macro. This provides clearer and more informative error messages, indicating a server crash or malformed response; a hypothetical sketch of such a macro appears after this list.
- Server LRU Check Bug Fix: A bug in the server's Least Recently Used (LRU) check mechanism is addressed, which caused deferred tasks to become stuck in the queue. The fix ensures that tasks are processed correctly when `slot.t_last_used` equals `t_last`.
- Build Process Simplification: Simplifying the build process involves moving the generation of `build-info.cpp` to the build directory and cleaning up relative source directory paths. This reduces complexity and dependencies, streamlining the build process.
- Call Stack Restructuring: Restructuring the call stack involves moving the packing routines from inside the MMA kernel to a preceding step. This changes the sequence of operations but results in no significant performance difference.
- Documentation Updates: Updates to the "multimodal.md" document aim to address user concerns about model discovery difficulties. The changes, although seemingly obvious, are intended to provide assistance to users.
- DeepSeek-R1 Conversion Error Fix: A conversion error related to a duplicate key in DeepSeek-R1 is fixed by correcting the `head_dim` value returned by `AutoConfig`. The fix involves checking the `model_type` after DeepSeekV3 support was integrated into `transformers`.
- CUDA Backend Improvements: Improvements to the CUDA backend include allowing it to accept host buffers when using integrated GPUs. This addresses issue #14068 and enhances the backend's functionality.
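On the error-messaging item above, an assertion macro specialized for the RPC path can point the user at the likely cause before aborting. The macro below is a hypothetical illustration of the pattern; its name and wording are not the ones added by the pull request.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical RPC-specific assertion: unlike a generic assert, the failure
// message says that the likely cause is a crashed RPC server or a malformed
// response, which is more actionable than a bare expression dump.
#define RPC_ASSERT_SKETCH(x)                                                    \
    do {                                                                        \
        if (!(x)) {                                                             \
            std::fprintf(stderr,                                                \
                         "RPC error: assertion '%s' failed at %s:%d\n"          \
                         "  The RPC server may have crashed or returned a "     \
                         "malformed response.\n",                               \
                         #x, __FILE__, __LINE__);                               \
            std::abort();                                                       \
        }                                                                       \
    } while (0)

int main() {
    const int bytes_received = 0;           // pretend a response header read failed
    RPC_ASSERT_SKETCH(bytes_received > 0);  // prints the diagnostic and aborts
    return 0;
}
```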
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 66 | 23 | 0 | 52 |
| ngxson | 50 | 1 | 1 | 26 |
| CISC | 38 | 7 | 1 | 21 |
| slaren | 12 | 1 | 1 | 23 |
| JohannesGaessler | 3 | 1 | 0 | 30 |
| qnixsynapse | 15 | 5 | 0 | 10 |
| am17an | 14 | 4 | 0 | 6 |
| shibizhao | 16 | 1 | 0 | 2 |
| Rbiessy | 5 | 1 | 0 | 9 |
| noemotiovon | 2 | 0 | 0 | 11 |