Weekly GitHub Report for Llama.cpp: September 15, 2025 - September 22, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced functionality and performance optimizations. Notable highlights include streamlined user experience and increased system stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: Mobile web ui hides Copy/Edit/Delete buttons: This issue reports a bug in the mobile web user interface where the Copy, Edit, and Delete buttons are hidden because they are placed under a tooltip element that only becomes visible on hover, which is not possible on mobile devices. The problem affects the llama-server module and prevents users from tapping these buttons when accessing the web UI on mobile devices.
- The comments discuss a previous CSS solution that always showed these buttons on mobile in the old web UI, with agreement to implement a similar fix by always displaying the action icons on mobile. A pull request was proposed to address the issue, and users were invited to test the related update to confirm it resolves the problem.
- Number of comments this week: 4
- Misc. bug: Vulkan backend shows negative scaling at low batch sizes with MOE models: This issue reports a performance anomaly where the Vulkan backend exhibits negative scaling at low batch sizes specifically with MOE (Mixture of Experts) models, using the gpt-oss-120b model as an example. The user observes that while dense models scale well with increasing batch sizes, MOE models show decreased throughput at batch sizes 2 and 3, and suspects the problem may be related to the MXFP4 quantization rather than the MOE architecture or SWA attention.
- In the comments, the user tests other MOE models and quantizations, finding that the negative scaling issue does not appear with some models or with Gemma 3, ruling out SWA attention as the cause. Further tests suggest the MXFP4 quantization is likely responsible for the negative scaling, as models quantized with MXFP4 consistently show this behavior, and a pull request intended to improve speed did not resolve the issue.
- Number of comments this week: 4
- Eval bug: Vulkan backend hangs forever with NVIDIA GPU on FreeBSD: This issue reports that the Vulkan backend of the `llama-cli` program hangs indefinitely on FreeBSD when running on an NVIDIA GeForce RTX 5090 GPU, both during model warmup and when skipping warmup. The user suspects a problem related to Vulkan shaders or driver issues, supported by kernel tracing output showing repeated timeout errors in shader handling, and further testing reveals that certain Vulkan shader operations are unsupported or cause memory allocation failures.
- The discussion involved requests for the user's GPU driver version and running a diagnostic tool (`test-backend-ops`), which revealed multiple unsupported shader operations and a failure to allocate Vulkan device memory, leading to an uncaught exception and program abort; the issue was acknowledged as unusual and escalated internally for further investigation.
- Number of comments this week: 3
- Misc. bug: webui keep conversation actions usable on mobile: This issue addresses a minor design bug in the SvelteKit-based WebUI where conversation actions are not usable on mobile devices because they are only revealed on hover. The reporter provides a proposed fix and demonstrates the problem and target behavior on mobile browsers, specifically tested on a Samsung S25 Ultra.
- The maintainers acknowledge the issue and the proposed fix, generally approving it but suggesting an enhancement to have dialogs slide up from the bottom on mobile for better UX. The original reporter agrees to let the maintainer take over the improvement and continue working on the fix.
- Number of comments this week: 3
- Compile bug: .devops/nix/package.nix still references ./ggml/src/ggml-metal/ggml-metal.m: This issue reports a compile error caused by the `.devops/nix/package.nix` file still referencing the deleted source file `./ggml/src/ggml-metal/ggml-metal.m` after a specific commit removed it. This results in a build failure when using the nix flake, as the build process attempts to substitute a non-existent file, causing an error during the patch phase.
- Multiple users confirmed experiencing the same build problem, and a pull request was opened to remove the outdated file reference. The fix was tested and verified to work on different machines, including successful builds of related packages.
- Number of comments this week: 3
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 539 days and highlights a discrepancy in behavior between different Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the parallel download progress status by properly utilizing the CURLOPT_NOPROGRESS option from libcurl to ensure accurate and non-conflicting progress reporting (see the libcurl sketch after this list).
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model on a Windows system using CUDA on an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
- Feature Request: cli / up arrow key to navigate backward through your request history: This issue requests command-line functionality that lets users navigate backward through their request history with the up arrow key, similar to the behavior found in bash. The motivation is to improve user convenience by enabling quick editing of previous queries without rewriting them entirely, with a suggested implementation supporting both up and down arrow keys (see the readline sketch after this list).
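As context for the parallel-download item above, here is a minimal sketch, assuming libcurl (`-lcurl`), of the mechanism the issue points at: setting `CURLOPT_NOPROGRESS` to 0 with a transfer-info callback replaces curl's built-in meter, so the caller can print one consolidated status line instead of interleaved per-shard meters. The callback name `on_progress` and the URL are illustrative, not llama.cpp's actual downloader code.

```cpp
// Minimal sketch: route libcurl progress through a caller-owned callback so
// several parallel shard transfers can share one non-conflicting display.
#include <curl/curl.h>
#include <cstdio>

static int on_progress(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char *label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        fprintf(stderr, "\r%s: %lld/%lld bytes", label,
                (long long) dlnow, (long long) dltotal);
    }
    return 0; // returning non-zero aborts the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    // 0 enables progress reporting; the callback supersedes curl's own meter.
    curl_easy_setopt(h, CURLOPT_NOPROGRESS, 0L);
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, on_progress);
    curl_easy_setopt(h, CURLOPT_XFERINFODATA, (void *) "shard 1");
    curl_easy_perform(h);
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return 0;
}
```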
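For the arrow-key history request, GNU readline already implements bash-style recall once lines are added to its history. A minimal sketch under that assumption (`-lreadline`), purely illustrative and not llama-cli's implementation:

```cpp
// Minimal sketch: bash-style up/down-arrow history via GNU readline.
#include <cstdio>
#include <cstdlib>
#include <readline/readline.h>
#include <readline/history.h>

int main() {
    while (char *line = readline("> ")) { // arrow keys navigate history here
        if (*line) {
            add_history(line);            // make this line recallable later
            printf("you typed: %s\n", line);
        }
        free(line);                       // readline allocates with malloc
    }
    return 0;
}
```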
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 22
Summarized Issues:
- Vulkan backend issues on FreeBSD and memory reporting limitations: The Vulkan backend on FreeBSD hangs indefinitely when running on an NVIDIA GeForce RTX 5090 GPU, likely due to shader compilation or synchronization failures, causing the `llama-cli` program to never complete model warmup or produce output. Additionally, the Vulkan backend's memory reporting function only returns dedicated GPU memory and excludes shared memory, limiting accurate memory reporting for discrete GPUs that rely on shared memory to run large models (see the device-memory sketch after this list).
- [issues/15996, issues/16092]
- WebUI usability and feature requests: Multiple issues affect the WebUI, including hidden Copy, Edit, and Delete buttons on mobile due to tooltip-trigger visibility, lack of attachment editing/removal after sending messages, inability to continue AI responses after editing, and blocking of conversation viewing and message sending when the server is down. There are also requests for storing conversations server-side with browsing APIs and fixing mobile usability bugs related to conversation actions.
- [issues/16061, issues/16085, issues/16097, issues/16120, issues/16131, issues/16094]
- Build and linking errors: Several build and linking problems occur, such as undefined references to C++17 filesystem functions on Oracle Linux 8 due to missing libstdc++fs linkage, failure to locate `libllama.so` at runtime causing shared object errors, and a compile error caused by a stale reference to a removed Metal source file in the nix package. Attempts to fix these issues via CMake flags or disabling tests have been unsuccessful.
- [issues/16019, issues/16089, issues/16096]
- Command-line and environment variable improvements: There are requests and reports related to command-line argument handling and environment variables, including adding a `--mmproj-device` argument to specify the vision encoder device instead of relying on an environment variable, and inconsistent boolean argument handling in the `LLAMA_ARG_JINJA` environment variable compared to others, causing confusion and configuration difficulties (see the boolean-parsing sketch after this list).
- [issues/16012, issues/16105]
- Model conversion and compatibility: A reported failure in the `convert_hf_to_gguf.py` script occurs when converting the openai-community/gpt2-medium model due to a buffer length mismatch during lazy tensor conversion. Another issue proposes changing the `argsort` function's result type from `I32` to `I64` to align with `ggml_set_rows` and improve compatibility with large index values.
- [issues/16013, issues/16001]
- Server and text generation concurrency bugs: The llama-server experiences a bug where parallel text generation processes become stuck or paused whenever one conversation sends an image for decoding or encoding, halting all other ongoing text generations until image processing completes. Additionally, in version 6503, the model can no longer continue generating answers when switching to another message, causing responses to stop abruptly.
- [issues/16046, issues/16133]
- Performance regression with MOE models on Vulkan backend: The Vulkan backend shows unexpected negative performance scaling at low batch sizes when running MOE models quantized with MXFP4, resulting in slower processing times at batch sizes 2 and 3 compared to dense models or other MOE models without this quantization.
- [issues/16134]
- Web UI content rendering issues: The llama.cpp web interface fails to properly render mathematical expressions, displaying formulas like E=mc² as plain text when using the GPT-OSS-120b model, which affects readability and user experience.
- [issues/16136]
- API and integration enhancements: A feature request proposes exposing common internal utilities of llama.cpp through a stable C API to facilitate easier integration and reuse in external libraries and bindings, avoiding redundant re-implementation.
- [issues/16051]
- Model support feature request: There is a feature request to add support for the newly released Granite Docling model from Hugging Face, motivated by user interest and with implementation already in progress.
- [issues/16110]
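On the memory-reporting limitation mentioned in the first group above, here is a hedged sketch of how a client sees those numbers through ggml's backend device API (assuming the current `ggml-backend.h`; what `free`/`total` actually mean is up to each backend, which is exactly the reported problem):

```cpp
// Minimal sketch: enumerate ggml backend devices and print the free/total
// memory each one reports. Per the Vulkan issue, these figures may cover
// only dedicated VRAM and exclude shared memory.
#include <cstdio>
#include "ggml-backend.h"

int main() {
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free = 0, total = 0;
        ggml_backend_dev_memory(dev, &free, &total);
        printf("%s: %zu MiB free / %zu MiB total\n",
               ggml_backend_dev_name(dev),
               free / (1024 * 1024), total / (1024 * 1024));
    }
    return 0;
}
```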
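And for the boolean environment-variable inconsistency, a minimal sketch of the kind of tolerant, uniform parsing the report argues for; the helper `env_flag` is hypothetical, not llama.cpp's actual parser:

```cpp
// Minimal sketch: accept the common boolean spellings so "1", "true",
// "on", and "yes" (any case) all behave alike across env vars.
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

static bool env_flag(const char *name, bool def) {
    const char *v = std::getenv(name);
    if (!v || !*v) return def;                 // unset or empty: use default
    std::string s(v);
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return (char) std::tolower(c); });
    if (s == "1" || s == "true"  || s == "on"  || s == "yes") return true;
    if (s == "0" || s == "false" || s == "off" || s == "no")  return false;
    return def;                                // unrecognized: use default
}

int main() {
    return env_flag("LLAMA_ARG_JINJA", false) ? 0 : 1;
}
```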
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 15
Summarized Issues:
- Backend and Device Compatibility Issues: Multiple issues report problems related to backend initialization, device support, and multi-GPU utilization. These include Vulkan backend regression causing out-of-memory errors on multi-GPU setups, missing CPU backend initialization symbols, and compilation failures due to Vulkan SDK header issues, all impacting proper hardware utilization and backend functionality.
- issues/15974, issues/16050, issues/16054
- Compilation and Linking Errors: Several issues describe build failures and linking errors across different platforms and toolchains. Problems include undefined references in shared libraries, unresolved external symbols in Unreal Engine projects due to C++ standard mismatches, and CUDA backend compile errors caused by exceeding shared data size limits.
- issues/16055, issues/16078, issues/16081
- Model Loading and Format Support: Issues highlight failures in loading specific model formats or architectures. These include unsupported model errors for Llama4ForCausalLM conversions, GGUF-format model loading failures on Apple M3 Ultra GPUs due to incorrect file paths, and kv cache misuse causing inefficient prompt processing in llama-server.
- issues/16021, issues/16093, issues/16033
- Metal Backend and Runtime Errors: Problems specific to the Metal backend on Apple hardware are reported, including runtime assertion failures during kernel compilation and model loading errors related to device backend usage. These issues cause crashes or failed executions on Mac Studio M4 Max and Apple M3 Ultra systems.
- issues/15977, issues/16093
- Server Behavior and Protocol Compliance: Several issues concern llama-server's behavior in response handling and protocol adherence. These include always sending usage statistics regardless of client requests and non-compliant SSE error message formatting that breaks compatibility with client libraries expecting RFC 8895 compliance.
- issues/16048, issues/16104
- User Interface and Usability Bugs: There is a reported bug where the mobile web UI's Settings Dialog in llama-server is rendered incorrectly, requiring a proper mobile layout to improve usability and user experience on mobile devices.
- issues/16077
- Performance and Benchmarking Concerns: One issue reports unexpectedly long benchmarking times on AMD GPUs using ROCm, which was later attributed to the default behavior of running multiple repetitions per test, affecting perceived throughput and test duration.
- issues/16070
- Development and Testing Process Topics: An issue covers a broad range of development- and testing-related topics, including test execution, usage tracking, VSCode API integration, problem management, code changes, and pull request handling within the project.
- issues/16007
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 44
Key Open Pull Requests
1. `--numa mirror`: mirror model weights to every NUMA node in the system: This pull request introduces a `--numa mirror` option that mirrors model weights to every NUMA node in the system and uses thread-local variables in the OpenMP threadpool to ensure threads access the local mirrored weights at runtime, thereby reducing cross-socket memory traffic and improving inference performance by about 5% (see the NUMA sketch after these key pull requests).
- URL: pull/16000
- Merged: No
- Associated Commits: dccea, 06a46, 435f0, c665d, d357e, 6d309, 48d8d, 4f756, a665a, 4b016, 166b9, c9513, 34a50, b8bf5, 4da24, b41a8, 98135, fa3a5, 23c97, 6ad67, c19cd, 8c00f, 313bf, e227c, d99fb
2. Add bailingmoe-v2 support: This pull request adds support for the bailingmoe-v2 model, including tokenizer and model initialization, updates to tensor handling and graph processing, fixes for conversion and token issues, and new chat templates, building upon previous work for Ling mini 2.0 moe support.
- URL: pull/16036
- Merged: No
- Associated Commits: b4eb5, 7968c, ba8e7, 1dee4, 63a2f, 69177, 3fe76, 94ec7, a2a22, a6b3c, c72e3, b3595, e078a, 09e3d, 44d88, 4b75f, 2eea4, aa8f5
3. [WIP] Rpc split row: This pull request introduces an early-stage implementation of row-splitting mode for RPC clusters in the Metal backend to enable more effective and faster inference across multiple devices, although it is currently a work in progress with performance issues that worsen as more devices are added.
- URL: pull/16020
- Merged: No
- Associated Commits: b53f0, e4e06, c5cc4, 5c161, 2f259, 77b18, 920b5, 0c171, 24d78, 7ea7f, 85ea1, 997e3, 81ef7
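To make the idea behind the `--numa mirror` pull request (item 1 above) concrete, here is a minimal sketch assuming libnuma (`-lnuma`) on Linux: copy the read-only weights to every NUMA node once, then let each worker thread read from the copy local to the node it runs on. `mirrored_buf` and its methods are illustrative names, not the PR's code.

```cpp
// Minimal sketch: per-node weight mirroring with a thread-local lookup,
// so worker threads avoid cross-socket memory traffic on reads.
#include <numa.h>
#include <sched.h>   // sched_getcpu (g++ defines _GNU_SOURCE by default)
#include <cstring>
#include <vector>

struct mirrored_buf {
    std::vector<void *> per_node; // one read-only copy of the weights per node
    size_t size = 0;

    bool init(const void *weights, size_t n) {
        if (numa_available() < 0) return false; // no NUMA support on this host
        size = n;
        per_node.resize(numa_num_configured_nodes());
        for (int node = 0; node < (int) per_node.size(); ++node) {
            per_node[node] = numa_alloc_onnode(n, node); // pinned to `node`
            std::memcpy(per_node[node], weights, n);
        }
        return true;
    }

    // Each worker thread resolves its local mirror once; the thread-local
    // cache assumes a single global mirror, as in this sketch.
    const void *local() const {
        thread_local const void *cached = nullptr;
        if (!cached) {
            int node = numa_node_of_cpu(sched_getcpu());
            cached = per_node[node < 0 ? 0 : node];
        }
        return cached;
    }
};

int main() {
    std::vector<char> weights(1 << 20, 0); // stand-in for model weights
    mirrored_buf buf;
    if (buf.init(weights.data(), weights.size())) {
        const void *w = buf.local(); // worker threads would read from `w`
        (void) w;
    }
    return 0;
}
```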
Other Open Pull Requests
- Model support and implementation enhancements: Multiple pull requests introduce support for new models and improve existing model handling, including the Qwen3 Next hybrid model, BailingMoeV2 with expert group selection, Ling version 2 models, and GraniteDocling conversion to GGUF format. These updates focus on expanding model compatibility, refining conversion processes, and addressing CUDA and CPU compatibility issues.
[pull/16095, pull/16063, pull/16028, pull/16075]
- Backend operator support and improvements: Several pull requests add or enhance support for unary and other operators across CPU and SYCL backends, including TRUNC, ROUND, CEIL, SET, and MEAN operators, with implementations of kernels, API exposure, and comprehensive testing to ensure consistency and correctness. These contributions also include optimizations and fixes for Vulkan backend operations to prevent GPU hangs.
[pull/16032, pull/16006, pull/16011, pull/16004, pull/16005, pull/16009, pull/16075]
- Continuous integration and build system updates: A group of pull requests focus on improving the CI system by migrating to self-hosted runners, clarifying code ownership for CI workflows, adding arm64 Docker builds, and automating git tag creation for Docker releases. These changes enhance testing infrastructure, build coverage, and release management.
[pull/16116, pull/16123, pull/16045, pull/16008]
- Web UI and deployment improvements: Updates to the web UI include configurable base path support for subdirectory deployment, UI enhancements for message display and sizing, and fixes to offline download logic to work without CURL. These changes improve usability and deployment flexibility.
[pull/16079, pull/16076, pull/16124]
- Computation graph and fusion enhancements: Pull requests introduce new functions like `ggml_op_is_empty` to filter operations, extend fusion capabilities to non-adjacent nodes in the computation graph, and optimize rope building functions with additional test coverage. These improvements increase code clarity and computational efficiency.
[pull/16122, pull/16126, pull/16112]
- Server and system integration features: One pull request adds a `--systemd` flag to enable systemd socket activation and readiness notification for llama-server, allowing it to run as a systemd service with on-demand activation and reliable readiness reporting. This feature is conditionally compiled based on build configuration (see the socket-activation sketch after this list).
[pull/15998]
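The `--systemd` item above follows the standard libsystemd pattern; a minimal sketch under that assumption (`-lsystemd`, error handling trimmed, not llama-server's actual code):

```cpp
// Minimal sketch: accept a systemd-provided socket when activated, and
// signal readiness for Type=notify units once startup work is done.
#include <systemd/sd-daemon.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int listen_fd;
    if (sd_listen_fds(0) >= 1) {
        // Socket inherited from systemd; passed fds begin at
        // SD_LISTEN_FDS_START (= 3).
        listen_fd = SD_LISTEN_FDS_START;
    } else {
        // Not socket-activated: create and bind our own listener as usual.
        listen_fd = socket(AF_INET, SOCK_STREAM, 0);
        // ... bind() and listen() here ...
    }

    // With Type=notify, tell systemd we are ready to serve, e.g. once the
    // model has finished loading; dependent units start only after this.
    sd_notify(0, "READY=1");

    for (;;) {
        int client = accept(listen_fd, nullptr, nullptr);
        if (client < 0) break;
        // ... handle the request ...
        close(client);
    }
    return 0;
}
```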
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 45
Key Closed Pull Requests
1. GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators: This pull request adds support for the ADD, MUL, RMS_NORM, and GET_ROWS operators in the GGML WebGPU backend by enhancing shader generation, handling various quantization types, implementing in-place operation variants, addressing platform-specific threadgroup size limitations, and improving efficiency with vectorized operations.
- URL: pull/16018
- Merged: Yes
- Associated Commits: 30ba1, 04d7b, 01c8c, bfff2, b8012, cddda, 1d572, 6a20e, ea390, 96d10, 4c587, ae8ed, 39aa1, 2c577, 6a613, b2dbf, 248f7, 4ad09, ac522, 1b16a, 7f9ee, c1021, efc0c, 7fbe8, dc7bc, b7635, 42935, ff412, a5da4, 77f8b, 102f2, b0bd4, fc915, 45617, 26742, cfa4f, 94228, b877e, be354
2. metal : refactor + optimize v2: This pull request refactors and optimizes the Metal backend of the project by reorganizing device and context code into separate headers, migrating kernels to inference-time compilation, splitting encoding functions into per-operation sources, improving naming conventions, enhancing error handling, and overall making the implementation safer and more manageable.
- URL: pull/15995
- Merged: Yes
- Associated Commits: a6a65, 61194, ee9d0, af86c, 84f11, 16f41, f9d50, 1f6b8, 1ad3a, 00480, ceaad, 45d6c, fe0ff, bef92, 7a238, 831b2, be6f5, 2324b
3. OpenCL: MoE MXFP4 kernel optimizations: This pull request optimizes the MXFP4 MoE and non-MoE kernels for OpenCL by implementing fixes such as the Q4_0 transpose for Adreno GPUs, adding SOA support and bit-wise conversion functions, improving kernel performance, and cleaning up the codebase to enhance compatibility and efficiency across different GPU architectures.
- URL: pull/16037
- Merged: Yes
- Associated Commits: 9710e, 92800, 29b73, 76d3e, 374c3, 464eb, 36676, 71846, 7a15e, 7aa67, fe12b, b7423, a6959, dbe0c, 7eb7e
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 140 | 18 | 0 | 31 |
| taronaeo | 121 | 5 | 1 | 27 |
| CISC | 46 | 7 | 0 | 69 |
| ngxson | 47 | 7 | 0 | 43 |
| pwilkin | 59 | 2 | 3 | 22 |
| danbev | 68 | 9 | 0 | 2 |
| jeffbolznv | 36 | 8 | 0 | 31 |
| JohannesGaessler | 34 | 4 | 0 | 28 |
| slaren | 21 | 4 | 0 | 33 |
| 0cc4m | 20 | 2 | 0 | 33 |