Weekly GitHub Report for Llama.cpp: September 29, 2025 - October 06, 2025 (12:07:21)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, focuses on performance improvements, user-experience refinements, and bug fixes that streamline functionality and increase stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Feature Request: DeepSeek V3.2-Exp support: This issue requests support for the DeepSeek V3.2-Exp model in the llama.cpp project, motivated by the desire to extend functionality and leverage the model's capabilities. The discussion focuses on the challenges of implementing sparse attention, tensor mapping for model conversion, and performance considerations, with ongoing efforts to convert the model format and validate the approach.
- Commenters shared initial tests and code references, identified sparse attention as a key challenge, and discussed potential implementation strategies including matrix multiplication optimizations. Progress was made on converting the model to GGUF format despite upstream tooling issues, with a branch created for tensor mapping fixes and requests for review. Additional kernel code from DeepSeek was noted as helpful, and caution was advised regarding performance comparisons with VLLM’s implementation.
- Number of comments this week: 13
- Misc. bug: GLM 4.6 safetensors fails to convert to a GGUF: This issue reports a problem with converting the GLM 4.6 model from safetensors format to GGUF using the provided conversion script, which fails due to missing tensors specifically from the mtp.safetensors file. The user discovered that the script only processes files named with the pattern "model*.safetensors," requiring a manual rename of mtp.safetensors to model-mtp.safetensors as a workaround, and there is discussion about improving the script to handle such cases more robustly.
- The comments reveal that the root cause is the conversion script’s limitation to only recognize safetensor files starting with "model," which excludes mtp.safetensors unless renamed (a minimal rename sketch appears after this list). Contributors agree to keep the issue open for a proper fix, noting the current behavior likely avoids conflicts with consolidated.safetensor files, and one user shares an imatrix file to assist with testing.
- Number of comments this week: 8
- Misc. bug: server/rerank output result is wrong with most models include qwen3-Rerank: This issue reports that the reranking output from the llama-server is incorrect or produces poor results when using most models, including various sizes of Qwen3-Reranker and others like bge and mxbai, while only the jina-reranker model seems to work reasonably but still differs from the official Jina API results. The user provides detailed test commands, example inputs, and outputs demonstrating the problem and seeks clarification or fixes for the inconsistent and inaccurate reranking scores.
- The comments reveal that the primary cause was likely an improperly converted GGUF model file, with a corrected version provided that yields much better results. It was also clarified that llama-server returns raw scores unlike the normalized scores from the Jina API, and some discussion focused on differences in output when using parallel processing options, highlighting variability in score values and order depending on the number of processing threads.
- Number of comments this week: 4
- Eval bug: Performance in current vulkan binaries is BAD: This issue reports a significant performance regression in the current Vulkan binaries of the llama-server, where token processing speed dropped from around 20 tokens per second in an older July 2025 version to approximately 4 tokens per second in the latest build. The user observes this slowdown on Ubuntu 25.04 with specific hardware and requests investigation into the cause of the degraded performance.
- The commenters request proper benchmark data comparing the current and older versions to verify the performance drop. The user asks for guidance on benchmarking tools, and a recommendation is given to use the `llama-bench` tool for standardized performance measurements (a minimal comparison sketch appears after this list).
- Number of comments this week: 3
- Eval bug: Qwen 2.5 VL-3B subpar OCR performance compared to Transformers implementation: This issue reports that the OCR performance of the Qwen 2.5 VL-3B model using the llama.cpp implementation is significantly worse compared to the same model run via the transformers library, indicating a possible bug or incorrect behavior in the llama.cpp version. The user provides detailed context including hardware, software versions, and example outputs to demonstrate the discrepancy and notes that the llama.cpp output is not correct.
- The commenters suggest that a recent pull request (#15474) might have fixed the problem, and the original poster confirms that applying this PR resolved their issues, expressing hope that it will be merged into the main branch soon.
- Number of comments this week: 2
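For the GLM 4.6 conversion failure above, the workaround discussed in the issue is simply to make the extra shard match the filename pattern the conversion script scans for. A minimal sketch, assuming a local `GLM-4.6/` checkout and the `model*.safetensors` glob reported in the discussion; the directory name and the choice to copy rather than rename are illustrative assumptions:

```python
from pathlib import Path
import shutil

model_dir = Path("GLM-4.6")                    # assumed checkout directory
src = model_dir / "mtp.safetensors"            # shard the converter currently skips
dst = model_dir / "model-mtp.safetensors"      # name matching the reported "model*.safetensors" glob

if src.exists() and not dst.exists():
    shutil.copy2(src, dst)                     # copy (or rename) so the shard is picked up
    print(f"copied {src.name} -> {dst.name}")
```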
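For the Vulkan performance regression above, a like-for-like comparison usually means running the same `llama-bench` invocation against both builds. A minimal sketch, assuming two locally built `llama-bench` binaries and a GGUF model path; all paths and flag values are illustrative, not taken from the issue:

```python
import json
import subprocess

MODEL = "models/model.gguf"                      # assumed model path
BINARIES = {
    "old-july-build": "./old/llama-bench",       # assumed paths to the two builds being compared
    "current-build": "./new/llama-bench",
}

for name, binary in BINARIES.items():
    # -p/-n set prompt and generation token counts; -o json requests machine-readable output
    proc = subprocess.run(
        [binary, "-m", MODEL, "-p", "512", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    for run in json.loads(proc.stdout):
        # field names may vary slightly between llama-bench versions
        print(name, run.get("n_prompt"), run.get("n_gen"), run.get("avg_ts"), "t/s")
```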
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 553 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, which would help in collecting and analyzing GPU traces across different frameworks.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-conflicting progress status indicators during parallel downloads.
- kubernetes example: This issue discusses the creation of a Kubernetes Helm chart for deploying the `llama.cpp` server, aiming to facilitate scalable application deployment within the community. The author has begun work on this example but is seeking additional contributions and plans to continue development when time permits.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the Microsoft BitNet model version b1.58-2B-4T-gguf on a Windows system using CUDA with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 19
Summarized Issues:
- Performance regressions in llama-server Vulkan and logprobs usage: Significant performance drops have been reported in llama-server's Vulkan binaries on Ubuntu 25.04, where token processing speed decreased from about 20 to 4 tokens per second despite unchanged hardware and OS. Additionally, enabling log probabilities (logprobs > 0) causes generation speed to slow down by approximately three times on key endpoints, severely impacting performance.
- issues/16317, issues/16329
- Model support and conversion challenges: There are multiple requests and issues related to adding support for new models and architectures, including DeepSeek V3.2-Exp, geospatial AI models, Jet-Nemotron architecture, and jinaai/jina-reranker-v3. Conversion scripts also fail for some models like GLM 4.6 due to filename conventions, and the GGUF conversion tool lacks support for certain architectures, causing errors.
- issues/16331, issues/16360, issues/16361, issues/16390, issues/16425
- Model evaluation and output correctness issues: The Qwen 2.5 VL-3B model shows significantly worse OCR performance and incorrect outputs in llama.cpp compared to transformers, indicating a regression. Similarly, reranking outputs from llama-server are often incorrect or poor across most models except for a partially effective jina-reranker, highlighting problems in evaluation pipelines.
- issues/16334, issues/16407
- GPU offloading and Vulkan inference optimization problems: Offloading models like Gemma-3n-E4B to GPU on Nvidia Jetson devices results in unrelated or empty token generation, despite correct CPU-only runs. There is also a request to prioritize placing the output layer on GPU during Vulkan inference to reduce high CPU usage and improve performance on limited VRAM GPUs.
- issues/16370, issues/16422
- Web UI bugs and usability issues: The new Web UI has multiple problems including premature stopping of response generation when switching chats, failure to preserve large input blocks from legacy UI, and intermittent dropping or omission of HTML output despite correct API responses. These issues degrade user experience and functionality.
- issues/16374, issues/16398, issues/16417
- Context caching and template detection bugs in hybrid models: Despite fixes, the Nemotron-H-47B-Reasoning model still fully reprocesses prompts due to broken context caching, causing inefficiency. Additionally, the Granite 4 Hybrid model crashes because it fails to detect its own chat template due to overly specific detection strings.
- issues/16415, issues/16416
- Syntax error checking and tool call failures: A bug exists in the implementation of syntax error checking for escaped JSON inside LLM responses, causing tool calls to break and not function as intended.
- issues/16420
- ROCm multi-GPU layer splitting issues: When using llama.cpp with HIP support on ROCm-enabled AMD GPUs, running with `--split-mode layer` produces corrupted or garbage output, while `--split-mode none` works correctly on smaller models, indicating a problem with ROCm multi-GPU layer splitting.
- issues/16424
- Model cache management feature request: There is a request for a tool to list and delete cached models fetched automatically with the `-hf` option, to simplify cache management by providing commands similar to `ls` and `rm` for viewing and removing cached models and their sizes (see the sketch after this list).
- issues/16393
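For the model cache management request above, an `ls`-style view can already be approximated with a few lines of scripting. A minimal sketch, assuming the cache directory used for `-hf` downloads defaults to `~/.cache/llama.cpp` (overridable via `LLAMA_CACHE`); treat the path as an assumption and verify it on your system:

```python
import os
from pathlib import Path

# Assumed default cache location for models fetched with -hf; LLAMA_CACHE overrides it.
cache_dir = Path(os.environ.get("LLAMA_CACHE", str(Path.home() / ".cache" / "llama.cpp")))

files = [p for p in cache_dir.glob("*") if p.is_file()]
for path in sorted(files, key=lambda p: p.stat().st_size, reverse=True):
    print(f"{path.stat().st_size / 2**30:8.2f} GiB  {path.name}")

# Removing an entry is then just path.unlink() -- the `rm` half of the request.
```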
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 16
Summarized Issues:
- Vulkan backend crashes and errors: Multiple issues report crashes and assertion failures in the Vulkan backend of llama.cpp related to buffer allocation, memory assertions, and matrix multiplication compatibility. These problems occur during evaluation with large prompts, default batch sizes, and specific model launches, often requiring workarounds like reducing batch size or reshaping tensors.
- issues/16298, issues/16304, issues/16339, issues/16383, issues/16392
- Compilation and build errors with Vulkan and ROCm: Several issues describe compilation failures due to type conversion problems, missing headers, and duplicate symbols when building with Vulkan or ROCm backends on Linux. These errors were resolved by upgrading Vulkan headers, avoiding conflicting HIP headers, or addressing missing files for AMD GPU targets.
- issues/16311, issues/16320, issues/16395
- Web UI message handling bugs: There are multiple bugs in the WebUI involving message disappearance, order breaking after regeneration, and content loss inside `<think>` blocks with double quotes. These issues cause user confusion and errors during conversation rehydration, branch switching, and message regeneration.
- issues/16299, issues/16385, issues/16406
- Server and proxy configuration regressions: Issues report regressions in server functionality, including broken Apache reverse proxy support and missing SSL support in llama-server builds. These regressions were introduced by recent commits and cause failures in previously working configurations.
- issues/16338, issues/16343
- Performance and model quantization concerns: One issue questions whether the `ffn_down` pattern matching in quantization causes Mixture of Experts (MoE) models to have unexpectedly large sizes due to increased precision. This raises concerns about the efficiency of current quantization methods and a potential need for special handling.
- issues/16379
- Web UI responsiveness and usability problems: The new web UI becomes unresponsive and loses functionality such as sidebar icons and chat responses when handling a large number of conversations. This significantly impacts usability but was later addressed with targeted fixes.
- issues/16347
- Knowledge management system functionality issues: An issue reports that the responsibility-related features on the knowledge management HTML page of Banco Industrial S.A. do not meet expectations and require improvements to align with organizational goals.
- issues/16397
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 22
Key Open Pull Requests
1. Add support to `◁think▷...◁/think▷` format and DRY the thinking processing logic: This pull request adds support for the `◁think▷...◁/think▷` thinking format while refactoring and improving the code architecture in `thinking.ts` to make the thinking processing logic more maintainable and efficient (a minimal extraction sketch appears after this list).
- URL: pull/16364
- Merged: No
2. Enable per-conversation loading states to allow having parallel conversations: This pull request introduces granular loading state management for individual conversations, enabling concurrent message processing to prevent UI lockup and improve user experience when multiple conversations are active.
- URL: pull/16327
- Merged: No
3. refactor: centralize CoT parsing in backend for streaming mode: This pull request refactors the backend to centralize and improve Chain-of-Thought (CoT) parsing for streaming mode by enhancing the incremental parsing of reasoning tags, supporting multiple reasoning segments, preserving whitespace, and ensuring consistent separation of reasoning content and regular content across all reasoning formats. This simplifies client implementations and unifies parsing logic that was previously split between the frontend and backend.
- URL: pull/16394
- Merged: No
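The `◁think▷...◁/think▷` pull request above is, at its core, another reasoning-tag format alongside `<think>` and `[THINK]`. The project's actual handling lives in `thinking.ts`; purely to illustrate the kind of processing involved, here is a hedged Python sketch that splits a model response into reasoning segments and visible content for such a delimiter pair (the delimiters come from the PR title, everything else is illustrative):

```python
import re

# Reasoning delimiters as named in the pull request title.
OPEN, CLOSE = "◁think▷", "◁/think▷"
PATTERN = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE), re.DOTALL)

def split_reasoning(text: str) -> tuple[list[str], str]:
    """Return (reasoning segments, remaining visible content)."""
    reasoning = PATTERN.findall(text)
    content = PATTERN.sub("", text).strip()
    return reasoning, content

thoughts, answer = split_reasoning("◁think▷weigh options◁/think▷Final answer here.")
print(thoughts)   # ['weigh options']
print(answer)     # 'Final answer here.'
```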
Other Open Pull Requests
- Metal backend FA kernel optimization: This pull request improves the Metal backend's FA kernel by pre-indexing blocks containing `-INF` values to skip unnecessary processing and enhances loop unrolling, resulting in measurable speedups, especially for larger head sizes and longer context lengths. It also extends the FA tests by adding `-INF` blocks to the KQ mask to better simulate real-world scenarios involving causal masking or padding.
- Support for SentenceTransformer and Dense layers in model conversion: These pull requests add support for Dense linear projection modules used in EmbeddingGemma and enable conversion, verification, and execution of models with SentenceTransformer layers including pooling and normalization. This expands the model-conversion process to handle additional transformation layers beyond the base Transformer backbone.
- Dynamic model selector and multi-model management: This pull request introduces a dynamic model selector with persistence for llama-swap workflows, featuring a dropdown interface, localStorage-backed model store, integration with the /v1/models API, and display of model capabilities through badges. It also includes auto-refresh of server properties and normalization of model display names to enhance multi-model management.
- Automatic memory offloading with host-memory prompt cache: This pull request implements an initial version of automatic memory offloading to host memory by using a host-memory prompt cache that minimizes prompt reprocessing through prefix similarity calculation and hot-swapping cached prompts into the `llama_context` (a conceptual sketch of prefix matching appears after this list). The feature aims to improve server performance while acknowledging current limitations and ongoing work.
- SYCL backend operator implementations and optimizations: These pull requests implement and optimize SYCL backend operators, including the `ARANGE` operator for efficient sequence generation on SYCL devices and the `SET` operator for `F32` tensors, introducing GPU-accelerated, multi-dimensional element-wise copy operations that significantly improve performance. Both changes maintain compatibility with the existing library and pass all related tests.
- WebGPU backend softmax implementation and bug fix: This pull request adds the missing softmax operation to the ggml WebGPU backend by updating `supports_op` and `encode_node`, and fixes a potential bug in the `rms_norm` function related to an incorrect tensor offset.
- Server ranking system enhancements: This pull request adds functionality to the server's ranking system to enable sorting and management of the `top_n` results, while maintaining backward compatibility to return all results if `top_n` is not specified (a minimal sorting sketch appears after this list). The changes are demonstrated with example outputs and test scripts.
- CI caching mechanism refactor: This pull request refactors the SDK caching mechanism in the CI workflow to minimize storage usage by ensuring the default branch maintains an up-to-date cache that pull request branches can reuse. It also addresses limitations of composite actions that prevent embedding the `actions/cache` post action.
- Removal of broken SVE code paths: This pull request proposes removing the broken and overly complex SVE (Scalable Vector Extension) code paths due to incompatibility with the existing GGML_SIMD pattern and lack of proper CI coverage, with the intention to potentially reintroduce them later after thorough verification and rework.
- AMX backend unaligned memory access fix: This pull request addresses an unaligned memory access issue in the AMX backend code that causes garbage output with Q4_0 and Q4_1 quantizations, although it does not fully resolve the incorrect results problem.
- CPU detection improvement via compiler flags: This pull request proposes inspecting the compiler flags `-march` and `-mcpu` to accurately detect the CPU type in the ggml-cpu project, building upon related previous work.
- CUDA toolkit installation documentation update: This pull request updates the build.md documentation to correct the CUDA toolkit installation instructions, clarifying that official release builds use version 12.4 rather than version 13.
- User preference for Markdown rendering in chat: This pull request introduces a new user preference to render user chat content as Markdown, adding a persistent setting and a checkbox in the chat settings dialog that enables user messages to be formatted with MarkdownContent while maintaining existing card styling.
- Fish shell completions support: This pull request proposes adding support for generating fish shell completions via a new `--completion-fish` option, including initial implementation, testing instructions, and a list of potential improvements to align with existing bash completions and coding guidelines.
- llama-pull tool implementation and documentation: This pull request proposes the implementation of the llama-pull tool along with its complete documentation.
- Jinja template trailing `<think>` tag handling: This pull request introduces a generic fallback mechanism to detect trailing `<think>` tags in Jinja templates, trims whitespace, and either appends the closing tag or marks the reasoning block as forced-open based on the `enable_thinking` setting. It includes a regression test to verify prompt differences and reasoning parsing, ensuring compatibility with models using the default Jinja chat template.
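Regarding the host-memory prompt cache entry above: the core idea of prefix similarity is to reuse whichever cached prompt shares the longest leading run of tokens with the incoming prompt, so only the divergent tail needs reprocessing. A minimal conceptual sketch in Python; the data shapes and selection rule are assumptions for illustration, not the PR's actual implementation:

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_cached_prompt(new_tokens: list[int], cache: dict[str, list[int]]) -> tuple[str | None, int]:
    """Return the cache entry with the longest shared prefix and that prefix length."""
    best_id, best_len = None, 0
    for cache_id, cached_tokens in cache.items():
        plen = common_prefix_len(new_tokens, cached_tokens)
        if plen > best_len:
            best_id, best_len = cache_id, plen
    return best_id, best_len

# Only tokens beyond best_len would need to be re-evaluated after hot-swapping the cached prompt.
```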
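For the server ranking entry above, the described behavior is: sort the rerank results, keep the best `top_n` when it is given, and fall back to returning everything otherwise. A minimal sketch of that contract; the result fields `index` and `relevance_score` mirror common rerank response formats but should be treated as assumptions here:

```python
def select_top_results(results: list[dict], top_n: int | None = None) -> list[dict]:
    """Sort rerank results by score (descending); keep top_n if given, else return all."""
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return ordered if top_n is None else ordered[:top_n]

results = [
    {"index": 0, "relevance_score": 0.12},
    {"index": 1, "relevance_score": 3.40},
    {"index": 2, "relevance_score": -1.75},
]
print(select_top_results(results, top_n=2))   # two highest-scoring documents
print(select_top_results(results))            # backward-compatible: all results
```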
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 56
Key Closed Pull Requests
1. Fix Tether CI CMake cmake pkg: This pull request updates the CI workflow to fix the CMake package configuration for the Tether runners by changing the installation path from `/lib` to `/shared` so that the workflow correctly locates the installed files for the runners.
- URL: pull/16368
- Merged: No
- Associated Commits: 042b5, 4b22b, 45f61, f218f, 3b0e6, e2a5a, 00a36, 1473d, 68cc7, ae1d0, 646fd, c88d6, d961f, f1d7a, 26519, 28571, ff756, dc3dd, cb421, 9009b, 29361, 55bf8, 80d71
2. implement context checkpointing for hybrid and recurrent models: This pull request generalizes the SWA checkpointing logic to implement context checkpointing for hybrid and recurrent models like Mamba and Jamba, renaming related CLI arguments and internal flags to support these models and thereby reducing the need to re-process the entire context in most cases (a conceptual sketch appears after this list).
- URL: pull/16382
- Merged: Yes
- Associated Commits: 6b3d5, 257d4, cfba3, ba574, fa222, 475e8, 4fee0, a3b4c, d304f, bb92d, 126e0, 9f996, e1b68, 829c7, 85d50, 6fc5b
3. Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks: This pull request fixes issues with reasoning blocks containing quotes to prevent truncation and adds support for handling `[THINK]...[/THINK]` blocks, improving the parsing and display of thinking content in the project.
- URL: pull/16326
- Merged: Yes
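The context-checkpointing pull request above extends an idea already used for SWA: periodically snapshot the model state at known token positions, and when a new request arrives, restore the newest snapshot that still matches a prefix of the new prompt instead of recomputing from scratch. A conceptual Python sketch of that bookkeeping; the data structures are illustrative and not the llama.cpp implementation:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    n_tokens: int          # how many prompt tokens this snapshot covers
    tokens: list[int]      # the tokens it covers
    state: bytes           # opaque serialized recurrent/attention state

def best_checkpoint(checkpoints: list[Checkpoint], new_tokens: list[int]) -> Checkpoint | None:
    """Newest checkpoint whose covered tokens are a prefix of the new prompt."""
    usable = [c for c in checkpoints if new_tokens[:c.n_tokens] == c.tokens]
    return max(usable, key=lambda c: c.n_tokens, default=None)

# After restoring a checkpoint's state, only new_tokens[c.n_tokens:] must be evaluated.
```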
Other Closed Pull Requests
- Vulkan shader build and backend improvements: Multiple pull requests enhance the Vulkan shader build process by enabling incremental builds, optimizing thread usage to prevent system freezes, and updating buffer creation to support larger allocations and valid usage. These changes collectively improve shader compilation efficiency, compatibility with different Vulkan header versions, and enable advanced features like video generation in stable-diffusion.cpp.
- Chat UI and message handling enhancements: Several pull requests refactor the chat sidebar for better UI efficiency by rendering dialogs as singletons and conditionally showing dropdowns on hover, fix message payloads to include only active branches, and resolve message disappearance issues when navigating regenerated sibling nodes. Additionally, improvements include capturing the model name accurately during streaming and adding a setting to display the "Model used:" information in messages.
- Backend operation optimizations and new features: Pull requests add support for the SOFT_MAX operation, optimize RMS_NORM with a split-row approach, introduce in-place testing functions, and improve CUDA and Metal backend functionality by adding fallback mechanisms and dynamic scaling of SIMD groups. These updates enhance performance and robustness across different hardware backends.
- Continuous integration and build system fixes: Multiple pull requests improve CI reliability by fixing ccache key configurations, properly installing rocwmma for HIP builds including Windows support, updating ROCm Docker builds, and correcting function argument orders with added tests to ensure correctness. These changes stabilize the build and testing pipelines.
- Model loading and compatibility updates: Updates include synchronizing the ggml submodule with version bumps and fixes like improved MKL detection, marking certain GLM tensors as not required to fix compatibility with GLM 4.6, and removing the '-dev' suffix from release versions. These changes prepare the project for development releases and improve model loading success after conversion and quantization.
- User experience and UI theming improvements: Pull requests improve Markdown code block theming by implementing light and dark modes with reactive adjustments, disable progress bars when no TTY is available to avoid issues in non-interactive environments, and enhance chat API options by excluding empty sampling fields from requests. These changes enhance usability and visual consistency.
- Bug fixes and stability improvements: Fixes include correcting the Hermes and Qwen tool-call parsing to prevent XML wrapper leaks, addressing intermittent test failures by overriding error bounds to allow rounding differences, and fixing argument order in vector math functions with added failing tests on SIMD. These ensure more stable and accurate operation.
- Performance and feature enhancements for NVIDIA GPUs: One pull request enables CUDA Graph usage for the Nemotron Nano v2 model by fixing heuristics, rerouting CUDA copy operations, and reorganizing CPU operations to prevent graph splitting, resulting in significantly improved performance on NVIDIA GPUs.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ggerganov | 186 | 25 | 3 | 48 |
taronaeo | 137 | 5 | 1 | 16 |
allozaur | 81 | 11 | 3 | 29 |
CISC | 48 | 14 | 0 | 41 |
danbev | 74 | 9 | 0 | 3 |
ngxson | 37 | 3 | 0 | 39 |
jeffbolznv | 33 | 8 | 0 | 30 |
ServeurpersoCom | 30 | 11 | 3 | 13 |
0cc4m | 17 | 0 | 0 | 33 |
slaren | 15 | 1 | 0 | 24 |