Weekly Project News

Weekly GitHub Report for Llama.cpp: October 20, 2025 - October 27, 2025 (12:01:38)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

This release, published on March 29, 2025, focuses on stability and incremental feature improvements. Notable highlights include optimized system processes and refined interface elements that streamline usability.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Eval bug: IBM Granite Docling goes in loop: This issue reports a problem where the IBM Granite Docling model, when run with the Vulkan GPU backend on an AMD Radeon integrated GPU, produces output in an infinite loop during OCR tasks. The user tested running the model with GPU layers disabled and with Vulkan completely disabled, confirming that the looping only occurs when Vulkan is enabled, suggesting a potential Vulkan-related bug.

    • The discussion involved the user providing detailed logs and test results, including running the model with GPU layers off and with Vulkan disabled, which stopped the looping behavior. Developers acknowledged the issue might be a Vulkan bug and are investigating it, with no confirmed fix yet, and a possible relation to another issue was mentioned.
    • Number of comments this week: 11
  2. Eval bug: Slowdown when using Vulkan Multi-GPU: This issue reports a significant performance slowdown when using the Vulkan backend with multi-GPU split modes (Row or Layer) compared to single GPU mode, despite Vulkan generally outperforming ROCm on single GPU workloads. The user provides detailed benchmark data showing that while ROCm maintains performance across split modes, Vulkan suffers a notable drop, and commenters discuss that split row mode is not yet supported on Vulkan and that split layer mode should only improve performance if the model does not fit on a single GPU, with speculation that tensor copy overhead between devices might be the cause.

    • The discussion confirms that split row mode is unsupported on Vulkan and that split layer mode performance depends on model size relative to GPU memory; users share benchmark results on different hardware showing expected performance patterns, note that Vulkan’s slowdown is surprising, and suggest that overhead in tensor copying between GPUs may be responsible, with a recommendation to use Vulkan’s performance logger to investigate further.
    • Number of comments this week: 11
  3. HTTP API specification: This issue concerns the frequent and sometimes breaking changes in the HTTP API of llama.cpp's llama-server, which complicate the development of typed Dart clients. The user is seeking a formal, stable specification or a recommended strategy to handle these API changes and minimize client breakage.

    • The comments clarify that all HTTP API changes are documented in a specific issue, and comprehensive documentation and tests exist for the HTTP server, though some endpoints change frequently and documentation may lag. Contributors maintain the docs, and while auto-generated documentation is considered for the future, it is currently not implemented; the user acknowledges this and plans to reduce their client’s API surface to mitigate breakage.
    • Number of comments this week: 6
  4. Compile bug: Linker crash with tagged pointer truncation on Android 15 (Termux): This issue reports a linker crash occurring when building certain complex binaries of llama.cpp on Android 15 using Termux, caused by tagged pointer truncation related to Android 15’s Memory Tagging Extension. The problem selectively affects linking of larger or more complex targets like llama-quantize and some test binaries, while simpler binaries link successfully, and several workarounds involving targeted builds and manual compiler flags have been identified to avoid the crash.

    • The comment discussion centers on testing and confirming the effectiveness of manual optimization flags to improve performance and avoid the linker crash, with users sharing benchmark results on different devices and configurations, noting modest performance differences and discussing available RAM and model usage without resolving the underlying linker issue.
    • Number of comments this week: 5
  5. vulkan. bug: Intel Arc iGPU hangs with granite-4.0-h-tiny-UD-Q8_K_XL.gguf: This issue reports a GPU hang occurring on Intel Arc integrated graphics when using the granite-4.0-h-tiny-UD-Q8_K_XL.gguf model with llama.cpp, specifically triggered by sending a second chat message, which causes the GPU and llama.cpp process to abort with fence timeout errors. The user provides detailed device and Vulkan environment information, and commenters discuss Vulkan info output formatting, reproduce the issue on other Intel GPUs, and share a workaround involving a shader modification that prevents the crash but is not a complete fix.

    • The discussion includes a request to hide verbose Vulkan info for readability, confirmation of the issue on another Intel GPU model, and a shared partial workaround patch that disables certain shader data indexing to avoid the GPU hang, indicating ongoing investigation without a definitive resolution yet.
    • Number of comments this week: 4

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 574 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress indicators when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status displays during parallel downloads.
  4. kubernetes example: This issue discusses the creation of a Kubernetes example for deploying the llama.cpp server using a Helm chart, aiming to provide the community with a scalable and standardized deployment method. The original poster has made initial progress but is seeking contributions and assistance to further develop this example.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. Specifically, the error occurs because a tensor named 'blk.0.ffn_down.weight' has a number of elements per row that is not a multiple of the expected block size, causing the model loading process to fail.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 19

Summarized Issues:

  • OCR and Vulkan Backend Issues: Several issues report problems related to OCR functionality and Vulkan GPU backend in llama.cpp. These include infinite loops producing repeated OCR text when using Vulkan, and GPU hangs or crashes on specific hardware like Intel Arc GPUs, indicating potential bugs in Vulkan implementation and hardware compatibility.
  • issues/16676, issues/16678, issues/16684
  • GPU Multi-GPU and Driver Compatibility Problems: Multiple issues describe failures or incorrect outputs when using multi-GPU setups or specific GPU drivers. These include gibberish output with NVIDIA 3090 GPUs in row split mode and hangs on Ascend NPU due to driver errors, highlighting challenges in multi-GPU synchronization and driver support.
  • issues/16680, issues/16695
  • Context Management and Memory Allocation Bugs: Issues report problems with context handling flags and memory management during inference. The --context-shift flag no longer works as expected, causing errors, and there is a feature request for dynamic KV cache allocation per client to optimize memory usage.
  • issues/16693, issues/16694
  • Model Architecture and Feature Support Requests: Several feature requests seek support for new model architectures and functions. These include adding support for Megrez-moe and Eagle2-VL models, as well as new functions to retrieve vision model image and patch sizes to aid preprocessing.
  • issues/16703, issues/16704, issues/16724
  • Parsing and Runtime Errors in Model Execution: Issues describe crashes and parsing failures caused by malformed inputs or unsupported grammar patterns. These include server crashes due to invalid tool calls and grammar parsing failures with regex shorthand \d in the typescript-sdk.
  • issues/16677, issues/16710, issues/16714
  • API Stability and Specification Challenges: There is a concern about the lack of a stable HTTP API specification for llama-server, causing difficulties for typed clients due to frequent breaking changes and the absence of a single source of truth for API updates.
  • issues/16758
  • Vulkan Performance and Memory Management Issues: Reports highlight irregular Vulkan buffer size behavior and significant performance slowdowns in multi-GPU Vulkan modes. Additionally, memory management bugs such as unreleased mmap fragments and incorrect Metal buffer size reporting on Mac contribute to increased memory usage and potential instability.
  • issues/16759, issues/16761, issues/16762, issues/16767
  • Unchecked Return Values Leading to Potential Errors: One issue points out that the boolean return value of llama_memory_seq_rm() is not checked in multiple source files, which may cause unhandled errors or unexpected behavior during memory operations.
  • issues/16768
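
Where call sites need fixing, the pattern is small. Below is an illustrative sketch with a hypothetical helper name, assuming the current llama.h memory API (llama_get_memory / llama_memory_seq_rm); exact signatures may differ across versions, and what a real call site should do on failure depends on its context:

```cpp
// Hypothetical helper: check llama_memory_seq_rm() instead of discarding it.
// Signatures assume a recent llama.h and may differ across versions.
#include "llama.h"
#include <cstdio>

// Remove tokens in [p0, p1) from sequence seq_id, reporting failure
// (e.g. a memory backend that cannot erase a partial range).
static bool erase_range(llama_context * ctx, llama_seq_id seq_id,
                        llama_pos p0, llama_pos p1) {
    llama_memory_t mem = llama_get_memory(ctx);
    if (!llama_memory_seq_rm(mem, seq_id, p0, p1)) {
        fprintf(stderr, "warning: seq_rm failed for seq %d, range [%d, %d)\n",
                (int) seq_id, (int) p0, (int) p1);
        return false;
    }
    return true;
}
```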

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 8

Summarized Issues:

  • Compilation Failures: Multiple issues report compilation failures due to different causes, including incompatibilities between CUDA 13 and GCC on Ubuntu 25.10, and the use of undeclared identifiers in the CPU backend on Termux. These failures prevent successful builds and highlight environment-specific and codebase issues that need addressing.
  • issues/16685, issues/16719
  • Model Output and Performance Regressions: There are reports of output quality regressions on AMD hardware with the HIP backend after a specific commit, as well as significant lag during text generation on Android devices when restarting generation. These issues indicate problems affecting both output fidelity and runtime performance under certain conditions.
  • issues/16709, issues/16721
  • Memory and Resource Management on AMD ROCm GPUs: Loading large models on AMD ROCm GPUs fails due to out-of-memory errors despite sufficient VRAM, likely caused by excessive compute buffer allocation or backend limitations. This points to inefficiencies or bugs in resource allocation when using flash attention or ROCm support.
  • issues/16725
  • Web UI Functionality and Usability Bugs: The llama-server web UI has issues including the lack of support for the q URL parameter to initialize conversations with prompts, and a bug where prematurely stopped response generations do not render properly in the chat interface. These affect user experience and integration capabilities with external tools.
  • issues/16722, issues/16726
  • Empty Placeholder Issue: There is a newly created issue containing only the default placeholder text with no additional information or comments, indicating a need for further details or clarification.
  • issues/16672

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 30

Key Open Pull Requests

1. qwen3-coder tool call parser: This pull request introduces comprehensive support for the Qwen3-Coder model family's XML-based tool-calling format by implementing a new robust XML parser, updating chat template detection logic to reliably identify Qwen3-Coder models, and adding extensive tests to ensure correct function calling and format handling.

  • URL: pull/16755
  • Merged: No
  • Associated Commits: 90dd6, c920d, 5c7c5, 2de36, dda43, 89daf, b5e37, dc6c4, 6e1fb, 9b512, ccad7, e33da, a7f21, 9a2cc, cff13, ca516, f4371, 11f3d, 25200, 1ba03, d1fe9, 0563a, e52c9, 08cc2

2. CUDA: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators: This pull request adds CUDA backend support for the unary operators FLOOR, CEIL, ROUND, and TRUNC in the ggml library by implementing the CUDA kernel logic, registering the new operators in the backend dispatch system, extending the test suite to ensure correctness against CPU results across various tensor shapes and data types, and updating relevant documentation.

  • URL: pull/16683
  • Merged: No
  • Associated Commits: cbefc, c4ce4, a97a0, 7d2a0, 3489b, d8531, afc71, 47aa3, 42a7b, c6291, 022c4, 840ce, 60bdb, 0767a, 0dc6d, fe5ae
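
For context, the operator semantics are those of the standard C rounding functions. The sketch below is a toy CPU reference of the kind the PR's tests compare CUDA output against; the names are hypothetical, not the actual ggml kernels:

```cpp
// Toy CPU reference for the four unary ops; CUDA results are expected
// to match this elementwise behavior across shapes and data types.
#include <cmath>
#include <cstddef>

enum class unary_op { FLOOR, CEIL, ROUND, TRUNC };

static void apply_unary(unary_op op, const float * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        switch (op) {
            case unary_op::FLOOR: dst[i] = std::floor(src[i]); break;
            case unary_op::CEIL:  dst[i] = std::ceil(src[i]);  break;
            case unary_op::ROUND: dst[i] = std::round(src[i]); break; // halfway cases away from zero
            case unary_op::TRUNC: dst[i] = std::trunc(src[i]); break; // toward zero
        }
    }
}
```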

3. ggml-cpu: arm64: q4_K repack gemm and gemv implementations: This pull request improves the q4_k_q8_k GEMM and GEMV implementations for ARM64 by utilizing i8mm and vecdot instructions, resulting in significant performance speedups on Apple M4 devices without affecting model perplexity.

  • URL: pull/16739
  • Merged: No
  • Associated Commits: 1f7f4, 4e5be, f9e15, 28e30, 0b1fe, c14e3, 0b456, f678c, ef019, c4f13
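
As background on where the speedup comes from: ARMv8.2-A adds 8-bit dot-product instructions, and the i8mm extension adds int8 matrix multiply-accumulate (vmmlaq_s32), replacing long chains of widening multiplies. A minimal sketch of the vecdot building block, assuming a +dotprod target; the PR's real kernels also use i8mm and fold in the q4_K/q8_K block scales:

```cpp
// Illustrative int8 dot product with vdotq_s32 (requires +dotprod);
// not the actual repacked q4_K x q8_K kernel from the PR.
#include <arm_neon.h>
#include <cstdint>

// Dot product of two int8 vectors; n must be a multiple of 16.
static int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        const int8x16_t va = vld1q_s8(a + i);
        const int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb); // four lanes, each a 4-element i8 dot
    }
    return vaddvq_s32(acc); // horizontal sum of the four accumulators
}
```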

Other Open Pull Requests

  • OCR Integration and Model Additions: Multiple pull requests introduce OCR capabilities and related model integrations, including early PaddleOCR work and the addition of the LightOnOCR-1B model with Qwen3 and Mistral3 components. These changes enable OCR functionality with specific input requirements and improve vision-language model support.
    • pull/16701, pull/16764
  • Backend and Hardware Support Enhancements: Several pull requests add or improve support for various hardware backends such as CUDA, Metal, Vulkan, SYCL, and RISC-V. These include CUDA GEMV fusion for performance, Metal backend fixes and new operations, Vulkan unary operator support, SYCL REPEAT_BACK operation, and RISC-V test support with noted limitations.
    • pull/16715, pull/16669, pull/16686, pull/16700, pull/16682
  • Key-Value Cache and Context Management: Pull requests address improvements to the key-value cache system, including unified cache support across parallel server slots, fixing context-to-buffer ordering inconsistencies, and adding token insertion position tracking to ensure correct positional encoding. These changes aim to improve cache efficiency and model accuracy.
    • pull/16736, pull/16727, pull/16745
  • Performance and Stability Fixes: Various pull requests focus on performance optimizations and bug fixes, such as fixing CUDA kernel launch configurations, replacing slow hashing algorithms with faster ones, reducing log verbosity, and disabling pipeline parallelism on buffer allocation failure to prevent crashes. These updates enhance stability and efficiency.
    • pull/16689, pull/16748, pull/16727, pull/16746, pull/16700
  • Template and Format Normalization: One pull request normalizes chat templates by replacing vision and audio markers with a unified placeholder to fix intermittent test failures in vision chat functionality. This ensures consistent template handling across the server.
    • pull/16749
  • Functionality Additions and Fixes in Core Operations: Pull requests add or fix core tensor operations such as interpolation fixes to avoid division by zero, weight clamping in CUDA top-k mixture of experts normalization, and implementation of get_rows and dequantize for 6-bit quantized weights. These improve correctness and performance of tensor computations; a toy sketch of the clamped normalization appears after this list.
    • pull/16702, pull/16743, pull/16734, pull/16700
  • Model-Specific Enhancements: Pull requests introduce expert group selection for multiple models and add system prompt formatting for the LFM2 model's tool calling, enabling Python-like code generation and optional JSON schema enforcement. These changes enhance model capabilities and usability.
    • pull/16691, pull/16752
  • Build and Documentation Updates: Some pull requests fix build issues caused by compiler flags and update documentation to reflect current GPU target naming conventions. These maintain build reliability and documentation accuracy.
    • pull/16692, pull/16691
  • Web UI Improvements: One pull request adds HTML and JavaScript preview support to the MarkdownContent component via a sandboxed iframe modal, improving the web UI's preview capabilities with updated styling and state management.
    • pull/16757
  • Device and Driver Compatibility Fixes: A pull request addresses duplicate Vulkan device detection on Windows by prioritizing native Vulkan drivers over Microsoft’s Dzn driver to prevent memory allocation failures and reduce overhead.
    • pull/16689
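
The clamping change mentioned above is easiest to picture on the CPU side. The following is a hypothetical sketch of clamping router weights before renormalizing over the selected experts; it illustrates the idea only and is not the actual CUDA kernel:

```cpp
// Hypothetical sketch: clamp top-k expert weights before normalizing so
// numeric noise cannot produce negative weights or a zero denominator.
#include <algorithm>
#include <vector>

static void normalize_topk(std::vector<float> & w, float eps = 1e-9f) {
    float sum = 0.0f;
    for (float & x : w) {
        x = std::max(x, 0.0f); // clamp stray negatives from numeric error
        sum += x;
    }
    sum = std::max(sum, eps);  // guard the division when everything clamps to 0
    for (float & x : w) {
        x /= sum;
    }
}
```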

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 29

Key Closed Pull Requests

1. ggml: add ggml_can_fuse_subgraph: This pull request introduces the function ggml_can_fuse_subgraph, a less strict extension of ggml_can_fuse that verifies whether all intermediate tensors within a subgraph can be fused based on its inputs and outputs, along with several refinements and usage updates to improve subgraph fusion capabilities in the ggml project.

  • URL: pull/16662
  • Merged: Yes
  • Associated Commits: b8a36, 578d9, ba472, d8530, 977a3, c1054, 3886b, f2cdb
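
The invariant being checked is easy to state: fusing a subgraph is safe only if no intermediate tensor escapes it. A toy illustration of that rule with hypothetical types, not the actual ggml implementation:

```cpp
// Toy model of the fusion check: every tensor produced inside the
// subgraph must either be a declared output or be consumed only by
// nodes that are themselves inside the subgraph.
#include <set>
#include <vector>

struct node {
    std::vector<node *> srcs; // inputs consumed by this op
};

static bool can_fuse_subgraph(const std::vector<node *> & graph,
                              const std::set<node *> & sub,
                              const std::set<node *> & outputs) {
    for (node * n : graph) {
        if (sub.count(n)) {
            continue; // only consumers outside the subgraph can block fusion
        }
        for (node * s : n->srcs) {
            if (sub.count(s) && !outputs.count(s)) {
                return false; // an intermediate escapes to an outside consumer
            }
        }
    }
    return true;
}
```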

2. Prevent premature submission on IME input: This pull request prevents premature submission of input during IME word selection by adding checks for KeyboardEvent.isComposing and KeyboardEvent.keyCode === 229 to ensure requests are only sent after IME composition is complete.

  • URL: pull/16673
  • Merged: Yes
  • Associated Commits: 902ad, c71f3, 54fbd, 09716, 6c583, 3787c, c6939, 7e51a

3. mtmd-cli : allow using --jinja: This pull request enables the use of the --jinja option in the mtmd-cli by adding jinja support through the inclusion of chat_history in mtmd_cli_context, with the changes tested and confirmed to work with Gemma 3.

  • URL: pull/16718
  • Merged: Yes
  • Associated Commits: 4b429, 47d89, dfb84, f5689, 283f7

Other Closed Pull Requests

  • Build and Release Fixes: Several pull requests address build and release issues, including fixing the binaries release process for the s390x architecture and resolving build warnings on Linux and Android platforms. These updates ensure smoother release workflows and cleaner builds across supported environments.
    • pull/16664, pull/16688
  • Web UI Enhancements: Multiple pull requests improve the web user interface by adding support for the "q" URL parameter, handling legacy 'context' attachments, and updating the static build. These changes enhance usability and maintain compatibility with existing features and browsers.
    • pull/16728, pull/16687
  • CUDA and GPU Backend Fixes: Several pull requests fix CUDA-related issues such as grid launch limits, topk-moe softmax bugs, and improve CUDA argsort performance using the CUB library. Additionally, Vulkan pipeline selection heuristics were adjusted to prevent hangs on Intel Arc GPUs, improving stability and performance on various hardware.
    • pull/16742, pull/16711, pull/16751, pull/16681
  • Memory Management Improvements: Pull requests address memory leaks and allocation efficiency by fixing a memory leak in ggml-alloc, optimizing metal backend allocation sizes, and adding a memory breakdown printout on server shutdown. These changes improve resource handling and debugging capabilities.
    • pull/16679, pull/16666, pull/16740
  • Model Conversion and Loading Updates: Updates include rebasing support for pre-quantized models, making the mistral-common dependency optional, preventing dequantization errors for GPT-OSS models, and adding trust_remote_code=True for smoother model loading. These changes enhance model compatibility and user experience during conversion and loading.
    • pull/16737, pull/16756, pull/16754, pull/16751
  • Code Cleanup and Refactoring: Some pull requests focus on codebase maintenance by removing unused Vulkan functions, reverting a vectorization optimization, and updating documentation notes. These efforts help maintain code clarity and correctness.
    • pull/16723, pull/16712, pull/16697
  • New Features and API Additions: New functionality includes adding an /api/version endpoint for version checking and introducing functions to get vision image and patch sizes. These additions support integration and expand the project's capabilities; a minimal client sketch of the version probe follows this list.
    • pull/16733, pull/16705
  • Warning and Error Handling Fixes: Fixes include correcting a pooling_type warning that appeared incorrectly and preventing assertion failures in the PLaMo2 model by setting embedding parameters explicitly. These changes improve user feedback and model stability.
    • pull/16674, pull/16766
  • Output Filename Handling: One pull request improves the handling of the default output filename for the mmproj model by ensuring the generated file is placed in the correct directory with an appropriate extension and suffix, aligning with naming conventions.
    • pull/16760
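
For client authors tracking the HTTP API (see issue 3 in section 2.1), the new version endpoint gives a cheap compatibility probe. A minimal sketch using libcurl, assuming a server at localhost:8080 and the /api/version route described above; the response shape is an assumption and is not guaranteed across versions:

```cpp
// Hedged sketch: fetch llama-server's version before relying on
// endpoints that change between releases. Assumes /api/version exists.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t on_body(char * data, size_t size, size_t nmemb, void * out) {
    static_cast<std::string *>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) return 1;

    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/api/version");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_body);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    const CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK) {
        std::cout << body << std::endl; // e.g. {"version":"b4991"}; shape may vary
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```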

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits  Pull Requests  Issues  Comments
ggerganov              109             11       3        45
allozaur                77              7       0        31
ServeurpersoCom         64             10       3        36
hanishkvc               89              1       0         0
am17an                  52             10       0        15
jeffbolznv              20              5       0        49
CISC                    25              5       0        21
ngxson                  20              3       3        17
JohannesGaessler        12              5       0        22
No author found         37              0       0         0
