Weekly Project News


Weekly GitHub Report for Llama.cpp: October 06, 2025 - October 13, 2025 (12:06:29)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991

1.2 Version Information:

The release from March 29, 2025 focuses on functionality and performance improvements, with optimizations and bug fixes that address previously reported issues.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Misc. bug: Slower performance on newer models Apriel-1.5-15b, granite-4.0-h-tiny: This issue reports unexpectedly slow token generation and prompt processing in llama.cpp for the newer models Apriel-1.5-15b and granite-4.0-h-tiny compared to older models such as Qwen3 A3B, despite the newer models having fewer or comparable active parameters. The user asks whether this performance gap is expected behavior or whether implementation improvements are forthcoming, shares detailed benchmark results, and describes attempts to test a Vulkan backend branch to address the issue.

    • The discussion involved attempts to reproduce and diagnose the performance problem, including building and testing a Vulkan backend branch which improved granite-4.0-h-tiny performance but not Apriel-1.5-15b; contributors shared benchmark results on different hardware showing variable performance gains, and a pull request was opened to improve Vulkan support, with ongoing interest in further investigating Apriel-1.5-15b’s slow performance.
    • Number of comments this week: 9
  2. bug: segfault on q4_0 repacking on windows avx2 cpu only: This issue reports a segmentation fault occurring on Windows with AVX2 CPUs when using q4_0 model repacking in llama-cli, specifically triggered by CPU repacking on a particular model and compiler setup. The user provides detailed debugging output and backtraces showing the crash in low-level AVX2-optimized functions, and commenters discuss the reproducibility, suspecting a compiler bug related to the GCC version used, with a suggestion to try a newer w64devkit to resolve the problem.

    • The discussion confirms the segfault is reproducible on a fresh build with the specified GCC version and hardware, with the crash occurring in vectorized repacking code that does not involve pointers, leading to suspicion of a compiler issue. Suggestions include testing with a newer w64devkit version, and the original reporter agrees to try this approach.
    • Number of comments this week: 7
  3. Misc. bug: SvelteKit WebUI blocks prompts that are >~1/3 the max context size: This issue reports a bug in the SvelteKit WebUI where the prompt token count is significantly overestimated, causing the interface to block prompts that are roughly one-third or more of the maximum context size even when the actual prompt length is within limits. The user suggests making the warning dialog less restrictive by allowing the prompt to be sent despite the warning, or improving token-count accuracy by verifying it with a backend tokenize call before blocking.

    • The comments discuss removing frontend token count limitations entirely and relying on backend validation instead, with a contributor sharing a local patch and proposing a PR to simplify client logic and improve resilience. Subsequent discussion confirms alignment with a recent server patch that returns proper HTTP 400 errors for oversized prompts, and collaboration continues with UI improvements and positive feedback on the cooperative development process.
    • Number of comments this week: 6
  4. Misc. bug: llama_context_params.n_seq_max decides the amount of sequences instead of being a max: This issue describes a bug in the llama.cpp API where the parameter llama_context_params.n_seq_max controls the exact number of sequences rather than acting as a maximum limit, causing the total context size to be split evenly among sequences and reducing flexibility. The reporter notes that this behavior breaks previous dynamic batching, forcing users to recreate contexts when batch sizes change, and proposes removing or modifying n_seq_max to restore more flexible context handling.

    • The comments discuss the impact of this behavior on downstream projects, with one user clarifying that exceeding the configured sequence count is not allowed and suggesting that new sequences must wait for existing ones to finish. Another user explains the problem in detail, emphasizing the loss of dynamic batching and fixed context partitioning. A solution is proposed to enable a unified key-value cache mode (kv_unified = true), which shares the total context size across sequences, and the original reporter confirms this approach resolves the issue in early testing; a short sketch of the setting appears after this list.
    • Number of comments this week: 5
  5. Feature Request: Support for Microsoft's Phi-4-mini-flash-reasoning and Nvidia's Nemotron-nano-9b-v2: This issue requests the addition of support for Microsoft's Phi-4-mini-flash-reasoning and Nvidia's Nemotron-nano-9b-v2 models in the Llama.cpp project, highlighting the unique hybrid Transformer/Mamba architectures and the potential benefits of enabling comparisons between different hybrid strategies. The user also seeks guidance on where to focus their efforts within the codebase to implement these features, noting that Nemotron-nano-9b-v2 shares similarities with already supported models, while Phi-4-mini-flash-reasoning presents additional challenges due to its use of Mamba1 layers and gated memory units.

    • The comments clarify that Nemotron is already supported, and the user requests help locating the GGUF files for this model. A responder provides direct links to the GGUF files for both the 9B and 12B versions of Nvidia's Nemotron-nano models, addressing the user's immediate need.
    • Number of comments this week: 3
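
To make the workaround from item 4 concrete, here is a minimal sketch, assuming a recent llama.h that exposes the kv_unified flag mentioned in the discussion (field availability and names may differ between builds); the model path and sizes are placeholders.

```cpp
// Minimal sketch: share one KV-cache pool across sequences instead of
// splitting n_ctx into n_seq_max fixed partitions.
// Assumption: llama_context_params exposes the kv_unified flag referenced
// in the issue discussion; check your llama.h version.
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;  // total context budget for the whole pool
    cparams.n_seq_max  = 4;     // allow up to 4 concurrent sequences
    cparams.kv_unified = true;  // assumption: share the 8192-token pool across
                                // sequences instead of 4 fixed 2048-token slots

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        llama_model_free(model);
        return 1;
    }

    // ... decode batches for several sequence ids against the shared pool ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```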

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 560 days and highlights a discrepancy in behavior between two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, as part of their efforts to improve the Metal backend in a related project. The user is specifically interested in obtaining debugger output similar to that provided by Apple's Metal debugger to aid in performance analysis and debugging.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the use of the CURLOPT_NOPROGRESS option in the download code to ensure accurate, non-conflicting progress indicators during parallel downloads; a hedged sketch of the relevant libcurl options appears after this list.
  4. kubernetes example: This issue discusses the creation of a Kubernetes example for deploying the llama.cpp server using a Helm chart, aiming to facilitate scalable application deployment within the community. The original poster has begun work on this example and is seeking contributions and assistance to continue its development.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA RTX 3060 GPU and CUDA backend. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
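
For readers of stale issue 3, the sketch below illustrates the libcurl option family in question rather than llama.cpp's actual download code: CURLOPT_NOPROGRESS must be cleared for the transfer-info callback to fire, and giving each parallel transfer its own tagged callback is one way to keep progress lines from colliding. The URL and shard label are placeholders.

```cpp
// Sketch: per-transfer progress reporting with libcurl. Each handle gets its
// own tag so parallel shard downloads can print distinguishable progress.
#include <curl/curl.h>

#include <cstdio>

struct transfer_tag {
    const char * label; // e.g. which shard this transfer belongs to
};

static size_t discard(char * /*data*/, size_t size, size_t nmemb, void * /*userp*/) {
    return size * nmemb; // drop the payload; this sketch only demonstrates progress
}

static int on_progress(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const auto * tag = static_cast<const transfer_tag *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "[%s] %3.0f%%\r", tag->label,
                     100.0 * (double) dlnow / (double) dltotal);
    }
    return 0; // returning non-zero would abort the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (curl == nullptr) {
        return 1;
    }

    transfer_tag tag = { "shard-00001-of-00002" }; // placeholder label

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);             // enable progress callbacks
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, on_progress);
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, &tag);

    const CURLcode res = curl_easy_perform(curl);
    std::fprintf(stderr, "\n");

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```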

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 17

Summarized Issues:

  • Tokenization and Prompt Handling Issues: Several issues highlight problems with token counting accuracy, prompt caching, and tokenization features. These include client-side overestimation of prompt tokens causing premature blocking, lack of batched tokenization endpoints leading to concatenated token vectors, and forced prompt cache clearing preventing effective caching in certain models. A sketch of querying the server for an exact token count appears after this list.
  • [issues/16437, issues/16458, issues/16491]
  • Model Support and Feature Requests: Multiple requests focus on adding support for new models and features to improve flexibility and compatibility. These include adding support for Microsoft’s Phi-4-mini-flash-reasoning and Nvidia’s Nemotron-nano-9b-v2 models, raw embedding input support for Gemma 3n, modular tokenizer selection, and Huawei SINQ quantization method integration.
  • [issues/16450, issues/16474, issues/16476, issues/16478]
  • Server Stability and Functionality Problems: There are reports of server unresponsiveness and crashes under specific conditions. Issues include llama-server ceasing to process new requests despite logging them, JSON parsing errors causing crashes with Granite 4, and a multi-GPU server producing garbled output with row-split tensor parallelism.
  • [issues/16448, issues/16465, issues/16517]
  • Performance and Optimization Concerns: Some issues describe slower token generation and prompt processing in newer models compared to older ones, raising questions about expected behavior or bugs affecting implementation and optimization.
  • [issues/16454]
  • Crash and Compile Errors: Several issues report crashes and compile failures on specific platforms or configurations. These include a segmentation fault on Windows with AVX2 CPU and CPU repacking enabled, a LoRA adapter inference crash on Mac M2 Ultra, and undeclared identifiers causing Vulkan debug build failures on Windows.
  • [issues/16475, issues/16479, issues/16502]
  • Server Code Refactoring and Feature Enhancements: Requests for improving server code structure and adding new API features are present. These include proposals to refactor HTTP code into separate modules for easier compilation and customization, and adding a local API for loading/unloading models to improve server management.
  • [issues/16487, issues/16488]
  • Memory Management and Backend Issues: There are reports of memory-related problems with the SYCL backend, including lower token capacity before out-of-memory crashes and repeated unsupported warnings despite correct environment settings.
  • [issues/16516]
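
Following up on the tokenization items above, the sketch below shows one way a client could ask a running llama-server for an exact token count instead of estimating it locally, assuming the server's /tokenize endpoint and a {"tokens": [...]} response as described in the server documentation; the host, port, and prompt text are placeholders.

```cpp
// Sketch: POST a prompt to llama-server's /tokenize endpoint and print the
// raw JSON response; counting the entries in "tokens" gives the real prompt
// length for any UI-side limit check.
#include <curl/curl.h>

#include <cstdio>
#include <string>

static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (curl == nullptr) {
        return 1;
    }

    const std::string body = R"({"content": "Hello from the web UI"})"; // placeholder prompt
    std::string response;

    struct curl_slist * headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/tokenize"); // placeholder host/port
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    const CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK) {
        // Expected shape (assumption based on the server docs): {"tokens": [...]}
        std::printf("%s\n", response.c_str());
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```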

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 5

Summarized Issues:

  • Model integration and output issues: The docling model integration with llama.cpp initially produced nonsensical outputs and failed to load correctly due to configuration errors, but this was resolved in a later version. Similarly, reasoning content extraction failed for GLM-4.5 and GLM-4.6 models because the chat format was misdetected, causing parsing problems that were later fixed.
  • issues/16435, issues/16439
  • Configuration and API compatibility errors: The convenience setting --embd-bge-small-en-default incorrectly sets the pooling type to "none," which is incompatible with OpenAI's API and causes errors at the /v1/embeddings endpoint. This misconfiguration leads to failures when attempting to generate embeddings using this setting.
  • issues/16451
  • Server crashes due to parameter handling: llama-server crashes when specific combinations of the top_p parameter and input_prefix are used during infill requests, resulting in an out-of-range exception and server termination. This bug causes instability in the server under certain input conditions.
  • issues/16498
  • CUDA device memory allocation bug: The -dev CUDA0 device incorrectly allocates 496MiB of memory on device 1 instead of device 0 as expected, while -dev CUDA1 correctly uses device 1. This memory allocation error causes unexpected behavior in device usage.
  • issues/16509

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 22

Key Open Pull Requests

1. webui: remove client-side context pre-check and rely on backend for limits: This pull request removes the client-side context window pre-check in the web UI to rely solely on backend limits, simplifies error handling by displaying a generic server response error when no content is returned, eliminates obsolete state management related to context limits, and improves the SSE client’s robustness to premature termination events in multi-turn agentic proxy chains.

  • URL: pull/16506
  • Merged: No
  • Associated Commits: c6f78, 2f885, deedf, 5ca41, 44747, 06391, bc392, cbdce, ba827, 28bad

2. CUDA: add fp kernel for larger batch size MoE: This pull request introduces a new CUDA kernel in the mmf module designed to optimize larger batch sizes for Mixture of Experts (MoE) models by leveraging mmq_ids_helper and double-buffering techniques, resulting in significant speedups over the cuBLAS fallback for models with token counts up to 512 and particularly benefiting recent Qwen models with ne01 ≤ 1024.

  • URL: pull/16512
  • Merged: No
  • Associated Commits: 8ee40, 996be, 4a238, 3120d, 237ad, 43e04, 56e73, 3183a

3. Add AfmoeForCausalLM support: This pull request adds support for the upcoming AfmoeForCausalLM model by making the AfmoeForCausalLMTokenizer public ahead of the model launch to prevent breaking conversion code, along with updates to parameters, conversion scripts, and minor code cleanups.

  • URL: pull/16477
  • Merged: No
  • Associated Commits: c2893, 05120, a4147, ae836, 83244, 798d0, 592e2

Other Open Pull Requests

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 33

Key Closed Pull Requests

1. ggml webgpu: profiling, CI updates, reworking of command submission: This pull request introduces optional profiling for both CPU and GPU to aid optimization, updates the WebGPU Ubuntu CI testing to version 24.04 to align with Vulkan, and reworks command submission to eliminate global locks by enabling each thread to submit work independently in graph_compute, while also addressing deadlock issues through serialized command submission and preparing the WebGPU backend for initial single-threaded browser execution.

  • URL: pull/16452
  • Merged: Yes
  • Associated Commits: 5ae93, bd3d0, 400a5, ca43f, 98d98, 26c44, eabab, c30b2, 168dd, b926b, d501e, d56e0, bcf8d, 8a848, d3c7d, a5e26

2. metal : various optimizations + refactoring: This pull request introduces various optimizations and refactorings to the Metal backend of the project, including improvements to the 1D get_rows and cpy functions, as well as the cpy_f32_q, ssm_scan, and ssm_conv components to enhance performance and simplify the code.

  • URL: pull/16446
  • Merged: Yes
  • Associated Commits: c11ca, 0600b, 8e535, 99545, fcebb, b9b25

3. Granite Docling stopping: This pull request addresses and fixes several tokenization bugs in the granite chat template implementation, including removing a duplicate fake image token, ensuring a double newline before the global image, and eliminating an incorrect trailing newline in the assistant generation prompt, thereby resolving the issue of the model not terminating correctly.

  • URL: pull/16438
  • Merged: Yes
  • Associated Commits: c112b, 2de4d, 505c0, f7dde

Other Closed Pull Requests

  • README and Documentation Updates: Multiple pull requests update the README and related documentation to reflect new features and fixes, including exposing multiple devices in RPC, adding the n_past_max metric, and introducing a new /v1/health endpoint for server consistency. Some updates also include unmerged proposals for README improvements.
    [pull/16441, pull/16461, pull/16493, pull/16499]
  • Kernel and Performance Optimizations: Several pull requests focus on kernel interface refactoring and performance improvements, including a new CUDA copy kernel optimized for contiguous GGML tensors yielding a 3.7% speedup, and a minor optimization in the set_rows function for more efficient data copying. These changes enhance both CPU and GPU performance aspects of the project.
    [pull/16459, pull/16460, pull/16467, pull/16471]
  • Server Endpoint and Debugging Enhancements: Pull requests add new server endpoints and debugging features, such as the /v1/health endpoint and detailed debug output for the /slots endpoint when enabled via environment variable. Additionally, fixes address task cancellation logic to properly cancel queued tasks, improving server reliability under rapid request scenarios.
    [pull/16461, pull/16462, pull/16482]
  • Model and Tokenizer Fixes: Updates include adding support for the LiquidAI LFM2-8B-A1B mixture-of-experts model, fixing vocabulary construction in the convert_hf_to_gguf.py script for Jamba models, and marking the End Of Text token for Granite models to ensure compatibility with infill samplers. These changes improve model compatibility and token handling.
    [pull/16464, pull/16470, pull/16499]
  • Context and Memory Management Improvements: Improvements to context checkpoint logic and recurrent memory modules ensure better utilization of checkpoints and fix issues with sequential batch processing. Increasing the default number of context checkpoints reduces excessive re-processing in branching conversations, enhancing model efficiency.
    [pull/16436, pull/16440, pull/16442, pull/16444]
  • Bug Fixes and Compatibility Adjustments: Various fixes address issues such as leftover element handling in vector scaling for SVE, removal of unavailable model files to prevent CI errors, fixing pooling parameters for embedding models, and resolving build failures on AIX systems by adjusting CMake configurations. These ensure stability and compatibility across environments.
    [pull/16443, pull/16455, pull/16468]
  • Web UI and Request Handling Updates: The web UI chat service is updated to include the max_tokens parameter only when explicitly provided, correctly interpreting zero or null as infinite tokens. Additionally, server response behavior in streaming mode is fixed to return HTTP 400 when a prompt exceeds the context length, aligning with standard inference engine practices. A sketch illustrating both behaviors appears after this list.
    [pull/16483, pull/16486]
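
To make the last bullet concrete, here is a hedged sketch (not a definitive llama-server client) that adds max_tokens to a /v1/chat/completions request only when the caller explicitly sets it, and treats an HTTP 400 on a streamed request as an up-front rejection such as an oversized prompt; the host, port, model name, and exact payload fields are assumptions based on common OpenAI-compatible conventions.

```cpp
// Sketch: conditionally include max_tokens and check the HTTP status of a
// streamed chat-completion request before treating the body as SSE data.
#include <curl/curl.h>

#include <cstdio>
#include <string>

static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    long max_tokens = 0; // 0 here stands for "not explicitly provided"

    // Placeholder payload following common OpenAI-compatible conventions.
    std::string body = R"({"model": "default", "stream": true, )"
                       R"("messages": [{"role": "user", "content": "Hello"}])";
    if (max_tokens > 0) {
        body += ", \"max_tokens\": " + std::to_string(max_tokens); // only when explicitly set
    }
    body += "}";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (curl == nullptr) {
        return 1;
    }

    std::string response;
    struct curl_slist * headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/v1/chat/completions"); // placeholder
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    const CURLcode res = curl_easy_perform(curl);

    long status = 0;
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);

    if (res == CURLE_OK && status == 400) {
        // The server rejected the request up front (e.g. prompt exceeds the
        // context); the body is a JSON error object rather than SSE chunks.
        std::fprintf(stderr, "request rejected: %s\n", response.c_str());
    } else if (res == CURLE_OK) {
        std::printf("%s\n", response.c_str()); // raw streamed chunks, for illustration
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```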

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor     | Commits | Pull Requests | Issues | Comments
ggerganov       | 169     | 27            | 3      | 50
allozaur        | 84      | 6             | 3      | 37
CISC            | 42      | 5             | 0      | 37
ServeurpersoCom | 47      | 9             | 3      | 22
ngxson          | 19      | 2             | 2      | 41
jeffbolznv      | 21      | 5             | 0      | 31
danbev          | 45      | 7             | 0      | 3
taronaeo        | 40      | 1             | 1      | 10
reeselevine     | 44      | 3             | 0      | 1
0cc4m           | 9       | 0             | 0      | 27
