Weekly Project News


Weekly GitHub Report for Llama.cpp: October 27, 2025 - November 03, 2025

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991

1.2 Version Information:

The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced performance and user experience. Notable highlights include optimized features and bug fixes that streamline functionality and increase stability.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Eval bug: ROCm illegal memory access with -sm row: This issue reports a crash caused by an illegal memory access error when running a model with the -sm row option on a mixed GPU setup involving ROCm (AMD) and CUDA (NVIDIA) backends, specifically triggered after a certain commit. The problem does not occur when using -sm layer or running solely on the AMD GPU, and appears related to peer-to-peer memory operations or buffer sharing between different vendor devices, with users confirming the issue across various hardware and ROCm versions.

    • The discussion highlights that this is likely a ROCm or Linux kernel bug affecting peer-to-peer DMA on gfx906 devices rather than a llama.cpp bug, though the issue surfaced after a specific commit. Multiple users confirm the crash with mixed vendor setups and different hardware, attempts to disable peer copies do not resolve it, and some note that compiling with certain flags can temporarily avoid the problem; however, the root cause remains unclear and difficult to debug without access to similar mixed hardware environments.
    • Number of comments this week: 15
  2. Misc. bug: llama-server integration with Firefox's AI chatbot feature breaks with overly long queries: This issue describes a bug where the integration of llama-server with Firefox’s AI chatbot feature fails when handling overly long queries, causing the server to hang on the first such request and subsequently return HTTP 414 errors. The problem is linked to a low maximum URI length limit in the underlying HTTP library (cpp-httplib), and discussions include potential fixes by increasing this limit and addressing the delayed 414 error response behavior. A sketch of raising that compile-time limit appears after this list.

    • The comments include detailed reproduction steps, clarifications on context size and model parameters, and exploration of the root cause in cpp-httplib’s URI length limit. Contributors share debugging insights, propose code patches to fix the hanging on long requests, and discuss related Firefox configuration quirks. There is also tangential discussion about model performance, alternative Firefox extensions for AI summarization, and upstream pull requests to improve error handling in the HTTP library.
    • Number of comments this week: 9
  3. Tickets should not close just because they've been around for two weeks: This issue addresses concerns about the automatic closing of GitHub tickets after a period of inactivity, arguing that tickets should only be closed after proper review or when deemed irrelevant, rather than simply due to a 14-day inactivity timer. The original poster suggests that preserving these tickets could be beneficial, especially as AI tools might soon be able to handle them more effectively, and criticizes the current approach as ignoring potentially valuable issues.

    • The comments clarify that the automatic closing process involves a two-step timeline with a stale period before closure and includes exceptions, with the ability to reopen issues if needed. However, several participants highlight problems such as lack of clear warnings before closing, difficulties in reopening, and the impact on feature requests, while others note that in open-source projects, contributors typically focus on issues that interest them, making automatic closure a practical, if imperfect, solution.
    • Number of comments this week: 8
  4. Eval bug: When offloading to CPU after 3cfa9c3 commit using CUDA, PP performance seems to be reduced by ~75% (Seems related to CPU model buffer): This issue reports a significant performance regression (~75% slower prompt processing) when offloading parts of a large DeepSeek model to the CPU using CUDA after a specific commit in the llama.cpp project. The user observes that the CPU model buffer appears earlier in the loading order and suspects that some matrix multiplications are now being computed on the CPU instead of the main CUDA device, leading to degraded performance, and requests help identifying the cause and potential fixes.

    • The discussion includes detailed comparisons of model loading and evaluation timings before and after the problematic commit, suggestions to test disabling CUDA fusion which did not resolve the issue, observations about GPU utilization shifting to slower PCIe lanes, and attempts to gather debug information via scheduler debug flags; contributors request simpler repro cases and debug logs, and the original reporter provides extensive logs and confirms that the issue persists despite various attempts to isolate it.
    • Number of comments this week: 8
  5. Misc. bug: Ryzen 395 AI Max, vulkan is seeing vram from system memory and not dedicated gpu memory: This issue describes a problem with the Vulkan API on a Ryzen 395 AI Max system where the VRAM reported by llama.cpp does not correctly reflect the dedicated GPU memory when UMA is set to 96GB, instead showing only 32GB. The user reports that with UMA set to 64GB, the full 64GB is recognized, but increasing UMA to 96GB causes the memory to be split and only partially detected, leading to a mismatch in available VRAM for the Radeon 8060S GPU.

    • The discussion involved requests for Vulkan system information outputs at different UMA memory settings, which the user provided via gists. Analysis revealed that when UMA is set to 96GB, the GPU memory is split into multiple heaps, but the current implementation only considers the first heap, causing the underreporting of VRAM. The issue is acknowledged and a solution to properly handle multiple memory heaps is being considered. A minimal illustration of summing all device-local heaps appears after this list.
    • Number of comments this week: 5
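
For the Firefox integration issue above (item 2), the root cause described is cpp-httplib's compile-time URI length cap. The snippet below is a minimal sketch, assuming cpp-httplib's CPPHTTPLIB_REQUEST_URI_MAX_LENGTH macro (requests whose URI exceeds it are rejected with HTTP 414); the chosen value and the tiny test server are illustrative only, not the llama-server code.

```cpp
// Illustrative only: raise cpp-httplib's compile-time URI length cap before
// including the header (or pass it via -D when building). Requests whose URI
// exceeds the cap are rejected with HTTP 414, matching the reported behavior.
#define CPPHTTPLIB_REQUEST_URI_MAX_LENGTH 32768
#include "httplib.h"

int main() {
    httplib::Server svr;
    // Trivial endpoint so the example is self-contained.
    svr.Get("/ping", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("pong", "text/plain");
    });
    svr.listen("127.0.0.1", 8080);
    return 0;
}
```

Because the limit is a compile-time constant, any change has to be applied when the server binary is built; it cannot be adjusted at runtime.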
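
For the Ryzen AI Max issue above (item 5), the following is a minimal sketch of the multi-heap accounting discussed in the comments, using standard Vulkan calls: it sums every heap flagged as device-local instead of reading only the first heap. This illustrates the idea only and is not llama.cpp's Vulkan backend code.

```cpp
// Minimal sketch: report device-local memory by summing all device-local heaps.
// Reading only memoryHeaps[0] under-reports VRAM when the driver splits UMA
// memory across several heaps, as described in the issue discussion.
#include <vulkan/vulkan.h>
#include <cstdint>

uint64_t total_device_local_memory(VkPhysicalDevice device) {
    VkPhysicalDeviceMemoryProperties props{};
    vkGetPhysicalDeviceMemoryProperties(device, &props);

    uint64_t total = 0;
    for (uint32_t i = 0; i < props.memoryHeapCount; ++i) {
        if (props.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) {
            total += props.memoryHeaps[i].size; // accumulate across every heap
        }
    }
    return total;
}
```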

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 581 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is a question about how to generate an MPS gputrace for the llama.cpp project during model inference, specifically seeking guidance on producing debugger output similar to that provided by Apple's Metal debugger. The user is working on improving the Metal backend in a related project and is collecting various gputraces from different frameworks, looking for documented or known methods to obtain such traces for llama.cpp.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in libcurl to properly handle and display the progress status of parallel downloads.
  4. kubernetes example: This issue discusses the creation of a Kubernetes Helm chart for deploying the llama.cpp server, aiming to facilitate scalable application deployment within the community. The author has begun work on this example but is seeking contributions and assistance to continue development when time permits.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 18

Summarized Issues:

  • Automated Issue Closure Concerns: One issue argues that automatically closing GitHub tickets after a period of inactivity prematurely dismisses potentially relevant reports without proper review or consideration, and that tickets should only be closed after deliberate evaluation rather than by an automated timer.
  • issues/16797
  • GPU Backend and Hardware Compatibility Bugs: Multiple issues report crashes, incorrect memory detection, or performance problems related to GPU backends such as CUDA, ROCm, Vulkan, and Hexagon on various hardware including NVIDIA, AMD, Snapdragon, and Adreno GPUs. These include illegal memory access errors, VRAM misreporting, garbled output, and partial GPU utilization, indicating significant compatibility and stability challenges across platforms.
  • issues/16799, issues/16832, issues/16854, issues/16881, issues/16895, issues/16911
  • Compilation and Model Loading Failures: There are compilation failures and model loading errors caused by unrecognized compiler flags and unsupported model architectures, particularly on specific platforms like Rocky Linux 9 and Windows with HIP backend on AMD hardware. These issues prevent successful builds or model initialization, hindering usability on certain systems.
  • issues/16809, issues/16908, issues/16909
  • Server and API Stability Issues: The llama-server experiences stability problems including hanging on long queries due to URI length limits and unexpected runtime errors causing server aborts during streaming with certain models. These issues affect reliability and user experience when interacting with the server API.
  • issues/16830, issues/16888
  • Performance Degradation and Optimization Requests: Performance regressions have been observed, such as a ~75% slowdown in prompt processing when offloading to CPU with CUDA, and proposals exist to optimize memory usage in layer stacking and to add detailed timing statistics in the web UI. These highlight ongoing efforts to improve efficiency and transparency in model execution.
  • issues/16863, issues/16902, issues/16912
  • Feature Requests for UI and Multimodal Enhancements: Users have requested multiple usability improvements for the llama-server WebUI, including response continuation, offline caching, multilingual support, and chat summaries, as well as integration of whisper.cpp for better audio processing. These aim to enhance user experience and expand multimodal capabilities.
  • issues/16839, issues/16885
  • Model Format and Chat Format Support: There are requests and issues related to supporting new chat formats like Minimax M2 and handling tool calls, as well as problems with control vector loading due to embedding dimension mismatches in specific models, indicating challenges in model compatibility and format support.
  • issues/16904, issues/16908

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 13

Summarized Issues:

  • Model Information Display Issues: Starting from version b6821, the llama-server WebUI running offline with local GGUF models incorrectly displays the model name used for each response, showing only the currently loaded model instead. This was fixed by improving the WebUI to refresh model information automatically without requiring manual page reloads.
  • issues/16771
  • Image Resizing and Bounding Box Accuracy: Multiple issues address image resizing and bounding box coordinate problems in models like SmolVLM2, QwenVL, and Qwen 3 VL. These include incorrect scaling caps preventing image upscaling, improper handling of bounding box coordinates relative to fixed sizes rather than actual image dimensions, and challenges maintaining bounding box precision especially for small or non-square images, all impacting model output accuracy.
  • issues/16776, issues/16842, issues/16880
  • GPU and Backend Compatibility Problems: Several issues report failures and performance regressions related to GPU backends including AMD ROCm, CUDA on Jetson AGX Orin, SYCL on Intel Arc Graphics, and Vulkan on Ubuntu. Problems include build configuration changes causing model loading failures, memory access faults with multi-GPU setups, VRAM allocation failures, shader compilation errors, and a 10% CUDA performance regression linked to kernel fusion and hardware specifics.
  • issues/16795, issues/16815, issues/16860, issues/16865, issues/16867
  • Model Support and Feature Requests: A request was made to add support for the MiniMax M2 model, noted for its competitive performance; the corresponding implementation was completed and merged.
  • issues/16798
  • Security and Logging Concerns: A request was made to prevent the llama-server from logging user message content at the default log level when reusing the kv-cache, to avoid exposing private or sensitive information. The suggestion is to restrict such detailed logging to debug level or remove it entirely.
  • issues/16870
  • Decoding and Schema Conversion Errors: Issues include decoding failures in QwenVL models caused by inconsistent sequence positions violating positional encoding requirements during multi-batch image processing, and a JSON schema to grammar conversion bug where references to items within arrays are unsupported, leading to errors in schema definitions.
  • issues/16790, issues/16876

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 42

Key Open Pull Requests

1. server/public_simplechat - basic builtin data store related tool calls added - use builtin browser/client side tool calling with minimal setup: This pull request adds basic built-in data store tool calls to the server/public_simplechat alternate web client UI, enabling client-side tool calling using the browser's indexedDB within a web worker context for isolated, minimal-setup data storage, and builds upon previous enhancements including reasoning support, web search integration, and improved tool call handling.

  • URL: pull/16852
  • Merged: No
  • Associated Commits: a2a89, 7dc59, aa1c4, 8a68c, 60b2b, 9b486, e8972, 97af8, a3e42, 94982, 8be6d, 7888a, d667e, f8e80, 38fd7, 4e736, 7f3ac, e00f0, ba0f0, bd3e9, f3a1f, fb36d, d34df, dcbe8, 064c4, 99ae4, fc4d7, a9153, ee2ef, e3bc0, 17938, 250bf, 832d6, a4151, 289e9, 7f007, 05b52, da99c, e42e7, 0cb22, 25b7a, 9bd3b, 94417, 20018, 33f35, fb982, 8b184, a128f, 524aa, a4152, cff1d, 8481a, d0621, 2ab01, 21ac0, 303d1, 251ee, 16834, 3c03e, 6ed43, 945de, 0083e, 926b2, 41a5f, 4bc5d, 067f6, efdce, d86bc, a74ab, 0b024, 61dde, 72e7a, 65abb, 4d79c, 58128, fac18, d445e, 4cda0, b0bf7, 9a4b2, 796d3, 39c75, ac231, 036da, 6fee6, 57523, e8340, fbe83, 5f68b, ab27b, 8bb05, ccdef, 3fda8, 308e3, 3135a, 8186d, eafbc, a52fa, 82243, ab224, acce0, 0401f, 54cb9

2. Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next): This pull request introduces and implements a new SparseK Attention mechanism from scratch as a GGML operator with a complete CPU backend, featuring top-K filtering, local windowing, and global stride sparsity techniques, while laying the groundwork for future GPU (SYCL) backend support and performance optimization.

  • URL: pull/16817
  • Merged: No
  • Associated Commits: 66248, 5d6d3, 46325, a5daf, 39a11, 612fd, b0194, d02d9, 5fa78, b19c2, 49c7e, 939bb, 1983a, 202c5, 97129, 77f40

3. CUDA: Conv2d tensor core: This pull request introduces CUDA-based Conv2d operations optimized for Tensor Cores, incorporating modifications from a previous pull request to achieve enhanced FP16 performance on RTX 2070 GPUs, demonstrated by detailed benchmarking results and multiple commits focused on performance optimization, code cleanup, and integration with Vulkan-like tensor code.

  • URL: pull/16828
  • Merged: No
  • Associated Commits: 19596, 96db6, 2cd9f, d633c, ac5e0, 41017, 4ae58, 51f85, 60495, cc3d3, c7259, 18098, e3f94

Other Open Pull Requests

  • CUDA and GPU Performance Optimizations: Multiple pull requests focus on improving GPU-related performance, including a specialized CUDA kernel for transposed tensor copies that boosts bandwidth by 3x-4x, a fused rope and set_rows CUDA operation yielding a 1% speedup, and enabling CUDA Graphs for the Embedding Gemma 300m model to enhance performance. Additional GPU improvements include support for RDNA4 architecture with significant matrix multiplication gains and Vulkan backend optimizations that reduce GPU idle overhead and improve pipeline management.
    • pull/16841, pull/16884, pull/16835, pull/16824, pull/16840
  • Backend and Kernel Optimizations: Several pull requests improve backend implementations and kernel efficiency, such as vectorizing and optimizing the SYCL REPEAT_BACK kernel for Intel UHD GPUs, refining the ggml WebGPU backend's SET_ROWS operation with better parallelization and debugging labels, and refactoring Vulkan buffer handling to reduce code duplication and improve clarity. These changes collectively enhance execution speed and maintainability across different hardware backends.
    • pull/16869, pull/16810, pull/16826
  • Model and Feature Support Enhancements: New model and feature support includes adding Janus-Pro 1B and 7B models with image understanding capabilities, introducing k-quantized matrix-vector multiplication with Vulkan backend support, and adding cross_entropy_loss operator support to the cann backend. There is also a draft implementation of fused QKV weight multiplication for CUDA that improves qwen3 model performance by 4-5%, and a sample FA implementation for the CLIP model that reduces memory usage and boosts performance.
    • pull/16906, pull/16900, pull/16886, pull/16813, pull/16837
  • Video and Image Handling Improvements: Enhancements to media handling include adding video support in the mtmd module and server with features like buffer loading and uniform sampling, and enabling the server to load images from local file paths with new security options to restrict accessible directories and file sizes. These changes improve multimedia integration and safety in the server environment.
    • pull/16910, pull/16874
  • Code Refactoring and Bug Fixes: Refactoring efforts include converting rope computation functions to templates to remove duplicate code and adding macros to prevent invalid shader casts, which fixes compilation errors. Other changes fix Docker build failures on the s390x architecture by applying compile flags and updating documentation, and attempt to remove embedding mangling hacks in Qwen3VL models to resolve related issues.
    • pull/16805, pull/16922, pull/16925
  • Documentation and Build Improvements: Documentation updates include adding build instructions for TheRock HIP backend and clarifying the conditional functionality of the GGML_CANN_ACL_GRAPH feature. Additional build-related improvements address CUDA 11.7 compilation issues on Manjaro and update Linux/s390x documentation to support new architectures.
    • pull/16915, pull/16844, pull/16824
  • RPC and Communication Optimization: One pull request optimizes RPC communication by combining command and size packets with the main payload in send and receive functions, aiming to improve network performance and speed when used with related changes.
    • pull/16892
  • NUMA and Parallelization Adjustments: Disabling NUMA-specific chunking logic in ggml for high-core-count HPC systems improves parallelization and yields significant speedups at high thread counts by using uniform chunking instead of hardware-specific optimizations.
    • pull/16882
  • OpenCL Feature Addition: A pull request proposes adding support for the imrope feature in the OpenCL implementation, expanding the project's capabilities on this platform.
    • pull/16914

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 60

Key Closed Pull Requests

1. [model] add support for qwen3vl series: This pull request adds support for the Qwen3-VL series, including both dense and MoE variants, fixes several algorithmic implementation details such as deepstack, introduces MRoPE-Interleave support, and performs final code cleanup, building upon contributions from multiple collaborators.

  • URL: pull/16780
  • Merged: Yes
  • Associated Commits: 1e4fd, f84bd, 0443a, 32718, f321b, b0169, 79448, 0b37f, 2be92, 2a669, 52e3d, 96037, 473ee, e9a16, 69e26, 0518b, 6a019, 10ce7, cbca6, 950c7, 19a45, 0bed5, b3380, 7d9c1

2. mtmd: refactor preprocessing + support max/min pixels: This pull request refactors the preprocessing pipeline to implement "smart resize" functionality by supporting maximum and minimum pixel limits, allowing input images to be constrained by token count rather than fixed height or width, thereby improving flexibility and accuracy in image processing. A simplified sketch of this resizing logic appears after the key pull requests below.

  • URL: pull/16878
  • Merged: Yes
  • Associated Commits: c1b18, 13cd2, 66d5c, c53c5, 68b15, 7bd1a, 2892e, 42744, 000d1, bfd03, 2c0d9, 4621d, bae84, 29c72, 00ee5, a834e

3. llama: store mrope data in KV cell: This pull request implements storing M-RoPE (x, y, t) positional data inside key-value cells to enable correct construction of the causal mask based on these positions, thereby fixing a related issue without introducing breaking changes and superseding a previous attempt. An illustrative sketch of per-cell position storage appears after the key pull requests below.

  • URL: pull/16825
  • Merged: Yes
  • Associated Commits: bf7f9, 90353, c3e13, ebac8, 9102a, f7063, 18842, 5ec41, bed0f, 45d60
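
As referenced in key pull request 2 above, the sketch below shows the general "smart resize" idea: choose output dimensions that preserve aspect ratio, snap to a patch-size multiple, and keep the total pixel (and hence token) count within [min_pixels, max_pixels]. The factor and limit defaults are hypothetical, and the function is an illustration, not the mtmd implementation.

```cpp
// Hedged sketch of token-count-bounded image resizing (illustrative values).
#include <algorithm>
#include <cmath>
#include <utility>

std::pair<int, int> smart_resize(int height, int width,
                                 int factor      = 28,              // patch-size multiple (assumed)
                                 long min_pixels = 56 * 56,         // assumed lower bound
                                 long max_pixels = 1280 * 28 * 28)  // assumed upper bound
{
    // Snap a dimension to a multiple of `factor`, rounding up or down as requested.
    auto snap = [factor](double v, bool round_up) {
        double units = round_up ? std::ceil(v / factor) : std::floor(v / factor);
        return std::max(factor, (int) (units * factor));
    };

    int h = snap(height, true);
    int w = snap(width,  true);

    if ((long) h * w > max_pixels) {
        // Too many pixels/tokens: shrink uniformly, rounding down to stay under the cap.
        double s = std::sqrt((double) height * width / max_pixels);
        h = snap(height / s, false);
        w = snap(width  / s, false);
    } else if ((long) h * w < min_pixels) {
        // Too few pixels/tokens: enlarge uniformly, rounding up to reach the floor.
        double s = std::sqrt((double) min_pixels / ((double) height * width));
        h = snap(height * s, true);
        w = snap(width  * s, true);
    }
    return {h, w};
}
```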
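
For key pull request 3 above, the struct below is a purely illustrative sketch of what keeping the full M-RoPE position per KV cell could look like: each cached cell stores its (x, y, t) components so that causal visibility can be decided from the temporal component rather than a single scalar index. The names and the visibility rule are hypothetical, not llama.cpp's actual data structures.

```cpp
// Hypothetical illustration of per-cell multi-dimensional position storage.
#include <cstdint>

struct kv_cell_pos {
    int32_t x; // horizontal patch position (images / video frames)
    int32_t y; // vertical patch position
    int32_t t; // temporal / sequence position used for causal ordering
};

// Assumed rule for the sketch: a query token may attend to a cached cell only
// if the cell's temporal position is not in the query's future.
inline bool is_visible(const kv_cell_pos & query, const kv_cell_pos & cell) {
    return cell.t <= query.t;
}
```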

Other Closed Pull Requests

  • Backend operator support and optimizations: Multiple pull requests enhance backend operator support and optimize performance across CUDA, SYCL, Vulkan, and Hexagon backends. These include adding support for RMS_NORM_BACK and SSM_CONV operators in SYCL, implementing the SET operator on CUDA, fusing MoE multiplication and addition in CUDA, optimizing Vulkan mat-vec multiplication fusion, and improving Hexagon dspqueue processing and tensor data handling.
  • [pull/16808, pull/16800, pull/16804, pull/16843, pull/16858, pull/16820, pull/16836, pull/16857]
  • CUDA performance improvements and bug fixes: Several pull requests focus on CUDA backend performance and correctness, including adding Volta tensor core support for matrix multiplication, enabling fast copy for different data types, fixing topk-moe subgraph fusion bugs, optimizing set-rows with fast division, and improving rocWMMA performance on RDNA3+ GPUs. These changes result in speedups, better fusion correctness, and enhanced stability.
  • [pull/16843, pull/16789, pull/16821, pull/16827, pull/16806, pull/16814]
  • Memory and token tracking improvements: Updates include removing redundant n_past counters in server code to simplify token position tracking and simplifying memory module creation by removing unnecessary KV cache padding. These changes streamline internal logic and prepare for future enhancements in positional representation and compute graph reuse.
  • [pull/16818, pull/16812]
  • Web UI enhancements: Improvements to the web UI include automatic refresh of the /props endpoint to resynchronize model metadata, adding support for AsciiDoc files as valid text files, and displaying detailed message generation statistics in the General settings. These changes improve user experience and interface responsiveness.
  • [pull/16784, pull/16850, pull/16901]
  • Bug fixes and test improvements: Bug fixes address Vulkan backend crashes due to incorrect pipeline checks, idefics3 preprocessing and overview image resizing issues, and ggml-hexagon backend tensor data handling. Additionally, test-backend-ops improvements include printing failed tests at the end of runs and fixing test count reporting.
  • [pull/16796, pull/16806, pull/16836, pull/16785]
  • Grammar and schema processing updates: One pull request enhances JSON schema to grammar conversion by adding support for referencing array items and renaming grammar rules with a new naming convention. This improves clarity and uniqueness in grammar processing.
  • [pull/16792]
  • Model and inference improvements: The Minimax M2 model is implemented with cleanup and updates, though chat template integration is postponed. An attempt to improve M-RoPE positional encoding by adding a linearly increasing token dimension aims to enhance output accuracy but remains a work in progress.
  • [pull/16831, pull/16822]

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor Commits Pull Requests Issues Comments
hanishkvc 192 4 0 0
ggerganov 82 12 1 65
ngxson 54 8 5 75
am17an 67 12 0 40
ServeurpersoCom 52 8 2 37
CISC 14 10 0 61
No author found 70 0 0 0
jeffbolznv 17 11 0 41
JohannesGaessler 15 7 0 41
0cc4m 14 3 0 27
