Weekly Project News

Weekly GitHub Report for Llama.cpp: February 16, 2026 - February 23, 2026 (17:35:47)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2025 introduces performance optimizations, functionality enhancements, and improvements to user experience and system stability.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [BUG-UNCONFIRMED] Eval bug: qwen35moe always forces a full prompt reprocess after each message, 'failed to truncate': This issue reports a bug with the Qwen3.5moe model in llama.cpp where every generation request forces a full prompt reprocessing due to a failure to truncate tokens in the cache, causing inefficient and repeated computation even when continuing a conversation. The problem appears related to the model’s hybrid multi-modal nature and the server’s checkpointing logic, with users confirming the issue across different setups and discussing potential fixes including disabling multi-modal features or improving checkpoint handling.

    • Multiple users confirmed the bug with detailed logs showing repeated full prompt reprocessing and cache clearing due to failed token truncation; discussion highlighted the model’s hybrid multi-modal design as a root cause, noted that disabling multi-modal support mitigates the issue, and shared workarounds and patches that improved stability, while also identifying related problems with cache saving/restoring and server logic requiring at least one token evaluation before generation.
    • Number of comments this week: 22
  2. b8070: qwen35moe long-prompt crash (libcuda segfault) with --op-offload on, multi-GPU: This issue reports a crash occurring in the llama.cpp project when processing very long prompts (around 20,000 tokens) using the Qwen35MoE model with multi-GPU and the --op-offload option enabled, resulting in a libcuda segmentation fault. The problem is partially improved compared to a previous version where an assertion failed, but long prompt handling still leads to server crashes or connection drops, with related token truncation failures causing memory clearing and prompt reprocessing observed in some setups.

    • The comments provide detailed logs confirming the assertion failure in the earlier version and the partial fix in the current one, describe reproducibility steps showing successful completion at 10k tokens but crashes at 20k tokens, and note related truncation failures causing cache clearing even without crashes; multiple users confirm the issue on multi- and single-GPU setups with Ubuntu, and attempts to patch the code have not resolved the instability.
    • Number of comments this week: 6
  3. [NEED MORE INFO] Bugs with Prompt Cache, Stopping, Thread Change: This issue reports multiple bugs related to the prompt cache, stop signal handling, and thread switching in the llama.cpp server, where stop signals do not halt model execution, thread changes cause context mixing and prompt cache is not properly cleared, leading to performance degradation and potential crashes. The user describes how these problems manifest in IDEs like Cline and Roo, especially with large context sizes and specific models like MiniMax 2.5, and suggests that the server needs to be stopped on thread changes to avoid these issues.

    • The comments clarify confusion about the report, recommend splitting the issue into separate ones, and confirm that the prompt cache is only cleared when necessary but can cause memory growth; a separate issue was created for the stop signal problem, and the compile command was provided for context.
    • Number of comments this week: 5
  4. CUDA illegal memory access with Qwen3-Next on multi-GPU using -ot (regression): This issue reports a regression causing a CUDA illegal memory access error when using the Qwen3-Next model with multi-GPU layer splitting via the -ot flag, resulting in server crashes during the first token generation for long prompts. The problem was traced to a specific commit that optimized the Qwen3-Next graph but introduced instability with multi-GPU setups, and a subsequent patch replacing an in-place tensor operation with a non-inplace version resolved the crash.

    • The comments detail the bisecting process identifying the first bad commit, confirm that disabling CUDA graphs does not fix the issue, propose a code patch changing ggml_set_inplace to ggml_set, and verify that this fix resolves the illegal memory access error on long prompts with multi-GPU -ot configurations.
    • Number of comments this week: 5
  5. [BUG-UNCONFIRMED] Eval bug: [Bug] Qwen3-Coder-Next (Hybrid) Prompt Cache forces full re-processing due to 'invalidated context checkpoint' despite --swa-full: This issue reports a bug where the Qwen3-Coder-Next (Hybrid) model's prompt cache forces full re-processing due to an "invalidated context checkpoint" error, even when the --swa-full option is used. The user experiences repeated full prompt re-processing during inference, likely caused by the model's handling of SWA or hybrid/recurrent memory, and suspects that changes in the prompt content during tool usage may be triggering the cache invalidation.

    • The comments include a detailed log excerpt illustrating the problem, a user confirming similar issues with related models, a suggestion to check if the prompt changes every time (such as embedding timestamps), and a follow-up acknowledgment that minimal token reuse in continuous conversations might be causing the issue.
    • Number of comments this week: 4
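The fix described in item 4 above, replacing an in-place tensor operation (ggml_set_inplace) with a non-in-place one (ggml_set), hinges on a general aliasing hazard. A minimal sketch of the distinction, with plain Python lists standing in for tensors (illustrative only; ggml's real functions operate on tensor views, and the multi-GPU details are far more involved):

```python
import copy

def set_inplace(dst, src, offset):
    # In-place variant: writes src directly into dst's own buffer.
    # If that buffer has been split across devices (as with the -ot flag),
    # the aliasing write can touch memory the current device does not own.
    dst[offset:offset + len(src)] = src
    return dst

def set_copy(dst, src, offset):
    # Non-in-place variant: the result is a fresh buffer and the
    # original is never mutated, which sidesteps the aliasing hazard.
    out = copy.copy(dst)
    out[offset:offset + len(src)] = src
    return out

a = [0, 0, 0, 0]
b = set_copy(a, [7, 7], 1)
assert a == [0, 0, 0, 0]  # original buffer untouched
assert b == [0, 7, 7, 0]
```

The trade-off is an extra allocation and copy, which is why in-place variants exist in the first place; the reported patch accepts that cost to restore correctness on multi-GPU splits.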

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 28

Summarized Issues:

  • Multi-GPU and CUDA Backend Crashes: Several issues report crashes and assertion failures related to multi-GPU setups and CUDA backend usage, including segmentation faults with long prompts, illegal memory access errors during token generation, and CUDA assertion failures triggered by specific flags or hardware. These problems often cause server crashes or token generation failures, impacting stability on NVIDIA hardware and multi-GPU configurations.
  • issues/19676, issues/19705, issues/19816
  • Prompt Cache and Token Truncation Bugs: Multiple issues describe bugs where prompt cache handling fails, causing full prompt reprocessing on every new message due to improper token truncation or invalidated context checkpoints. This leads to repeated cache misses, memory clearing, and degraded performance, especially in hybrid multi-modal and Qwen3.5moe models.
  • issues/19690, issues/19794
  • Memory Management and Out-of-Memory Errors: There are reports of out-of-memory errors caused by insufficient VRAM allocation, ignored memory settings, or backend memory recognition failures, including issues with fit-params not accounting for multimodal model requirements, ROCm backend ignoring GTT memory, and loading models larger than available GPU memory. These memory issues cause crashes or failed model loading despite apparently sufficient resources.
  • issues/19678, issues/19764, issues/19818
  • Server and RPC Stability Problems: Several issues highlight server instability including hangs or crashes during large tensor uploads over RPC, improper handling of stop signals causing continued execution, and port reuse problems due to TIME_WAIT states after SSE streaming sessions. These affect server responsiveness and reliability during heavy or complex operations.
  • issues/19745, issues/19758, issues/19760, issues/19775
  • Model and Feature Support Requests: There are requests for new model additions and feature support, such as adding the Voxtral Realtime model for efficient local deployment and implementing Speculative Decoding for multimodal vision-language models, which currently cause server errors and limit performance improvements.
  • issues/19696, issues/19712
  • Backend-Specific Hardware and Performance Issues: Issues report hardware-specific problems including Vulkan backend image encoding failures on AMD Radeon 8060S, slower CUDA backend token generation on GTX1060 compared to Vulkan, and SYCL backend assertion failures on Intel Arc GPUs. These problems affect performance and functionality on particular hardware and backends.
  • issues/19735, issues/19817, issues/19779
  • Installation and Build Configuration Problems: One issue describes the LLAMA_LIB_INSTALL_DIR setting being ignored during installation, causing libraries not to be installed in the specified directories except for a few files, indicating build configuration problems.
  • issues/19748
  • API and UI Regressions: There are regressions reported in the REST API and Web UI, including inconsistent error handling for oversized context requests and loss of attachment creation when pasting content, which degrade user experience and API reliability.
  • issues/19741, issues/19774
  • Model Execution and Threading Issues: Problems include improper thread switching causing context mixing and prompt cache not being cleared, as well as server crashes due to infinite recursion on Apple Silicon Macs when using the --rerank argument, affecting execution correctness and stability.
  • issues/19756, issues/19760
  • Metrics and Monitoring Improvements: There is a request to improve Prometheus metrics in router mode by adding model name labels and replacing colons with underscores to provide clearer, cumulative metrics and avoid mixing values from different models.
  • issues/19811
  • MoE Model Offloading and System Stability: A request highlights the need for managed SSD offloading for Mixture of Experts models on macOS to prevent kernel panics and system crashes caused by memory overflow, suggesting internal management of the SSD-to-RAM pipeline instead of relying on OS swap.
  • issues/19825
  • Logit Bias Feature Not Working: An issue reports that the logit bias feature specified with the -l flag does not affect token generation in the Kimi K2.5 model on CUDA backend, resulting in banned tokens still being generated despite the bias setting.
  • issues/19699
  • JSON Schema Conversion Bug: There is a bug where JSON schema conversion fails with a 400 error when schema nodes have a description but no type specified, which should be valid and treated as unconstrained according to JSON Schema standards.
  • issues/19716
  • Session Length and Flash Attention Crash: A crash occurs in the CUDA backend when running the GLM-4.7-Flash model with flash attention and a quantized key-value cache, triggered by growing a session to around 65k context length due to a CUDA error in the flash attention MMA kernel linked to a known llama.cpp bug.
  • issues/19724
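The JSON Schema rule behind the conversion bug in issues/19716 is simple to state: a schema object carrying only annotation keywords such as "description", with no "type", constrains nothing, so every instance is valid against it. A toy sketch of that contract (hypothetical helper, not llama.cpp's actual converter):

```python
def node_constraint(node: dict) -> str:
    # Per the JSON Schema spec, "description" is an annotation, not a
    # validation keyword. A node without "type" (ignoring other validation
    # keywords in this toy model) must be treated as unconstrained,
    # not rejected with an error such as a 400 response.
    if "type" not in node:
        return "any"
    return node["type"]

assert node_constraint({"description": "free-form note"}) == "any"
assert node_constraint({"type": "string", "description": "x"}) == "string"
```

In the reported bug, the first case is instead treated as invalid input, which the issue argues contradicts the standard.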

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 34

Summarized Issues:

  • Model crashes and assertion failures: Multiple issues report crashes and assertion failures in various models and backends, including Qwen Next 80B Coder crashing due to an empty grammar stack, Vulkan backend crashes on Kimi-Linear-48B-Instruct and Adreno 740 GPU, and CUDA backend assertion failures on NVIDIA DGX Spark hardware. These crashes often occur during inference or tool calls and are linked to backend limitations or unexpected input conditions.
  • issues/19304, issues/19471, issues/19672, issues/19746
  • Qwen3-Coder-Next model bugs and crashes: The Qwen3-Coder-Next model exhibits multiple issues including invalid JSON output with duplicate fields, llama-server crashes on multiple tool calls or simultaneous content and tool calls, and failure to load on Windows builds due to unknown architecture errors. These problems cause parsing failures, server instability, and platform-specific incompatibilities.
  • issues/19382, issues/19430, issues/19579
  • Vulkan backend output and stability problems: Several issues describe Vulkan backend problems such as mode collapse and gibberish output after a specific commit, garbled output on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M on Windows, and incorrect output from the GLM-OCR model compared to CPU backend. These indicate instability and rendering errors in Vulkan implementations across different models and platforms.
  • issues/19710, issues/19734, issues/19736
  • Build and linking failures on Windows ARM64: A build failure occurs during linking with the arm64-windows-snapdragon-release preset due to missing C/C++ runtime and startup files, while the non-release preset builds successfully. This indicates configuration issues specific to the release build preset on Windows ARM64 platforms.
  • issues/19444
  • JSON and Jinja template errors in gpt-oss tool calls: The gpt-oss tool call templates have bugs causing malformed JSON due to double-escaping and improper handling of messages containing both content and thinking fields, leading to HTTP 500 errors during multi-turn requests with tool calls. These template errors break prompt generation and server responses.
  • issues/19520, issues/19701, issues/19703
  • GPU and CUDA related inference issues: Problems include GGML FlashAttn module crashes on NVIDIA Geforce 1060 due to conditional checks, qwen35moe model producing degenerate output with CUDA layers enabled by a specific build flag, and SYCL backend disabling Flash Attention on Intel Iris Xe GPUs due to unsupported node types. These issues highlight hardware and driver compatibility challenges affecting inference correctness and performance.
  • issues/19652, issues/19656, issues/19683
  • Model loading and tensor configuration errors: Crashes and loading failures occur due to assertion failures in tensor dimensions and missing tensors, such as in delta-net-base.cpp and PrimeIntellect 3.1 model missing a tensor after MTP layer removal. These errors cause model initialization failures and incorrect outputs until fixed.
  • issues/19728, issues/19733
  • Web UI and server interface problems: Issues include the Web UI failing to upload images to vision models, file attachments not being sent, intermittent blank white pages with JavaScript errors, and scrollable containers hiding assistant code block outputs. These problems degrade user experience and require UI or server thread adjustments.
  • issues/19717, issues/19719, issues/19723, issues/19742
  • Model conversion and tokenizer support requests: Requests to add support for converting cerebras/MiniMax-M2.5-REAP-139B-A10B and Tri-21B models highlight missing tokenizer files and compatibility issues with conversion scripts. Additionally, improvements to error handling in conversion scripts are requested to provide clearer messages and search additional repositories for missing files.
  • issues/19715, issues/19718, issues/19776
  • Documentation and resource access issues: Documentation errors include incorrect flags in server startup instructions causing model loading failures, and broken download links on Ubuntu x64 with ROCm 7.2 prevent access to necessary resources. These hinder proper setup and usage of the project.
  • issues/19786, issues/19789
  • Docker and container registry access problems: Users are unable to pull Docker container images from the GitHub Container Registry due to permission denied errors, blocking container-based deployment and testing workflows.
  • issues/19739
  • Inference and chat response formatting bugs: The Kimi K2.5 and Minimax M2.5 models omit trailing quotation marks in chat completions, and parallel function calling does not work with the unsloth/gpt-oss-20b-GGUF:F16 model due to design constraints requiring reasoning to be passed back for multiple calls. These issues affect output correctness and multi-call handling.
  • issues/19795, issues/19814
  • Cross-stream sequence copy and attention correctness: An initially reported bug with cross-stream sequence copy producing incorrect attention outputs was later found to be non-reproducible after thorough testing, indicating the original garbled output was due to confounded experimental conditions rather than a code defect.
  • issues/19792
  • Research and project documentation: One issue covers the research stage of a project including hypothesis formation and debriefing, emphasizing avoiding redundant work.
  • issues/19695
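The double-escaping failure mode described for the gpt-oss tool-call templates can be reproduced in a few lines of plain Python (illustrative only; the actual bug lives in Jinja chat templates, but the JSON mechanics are the same):

```python
import json

payload = {"content": 'He said "hi"'}
once = json.dumps(payload)   # correct: one round of escaping
twice = json.dumps(once)     # bug pattern: serializing the already-serialized string

assert json.loads(once) == payload   # round-trips back to the object
assert json.loads(twice) == once     # round-trips to a *string*, not the object
assert json.loads(twice) != payload  # a consumer expecting the object breaks
```

When a template applies a JSON filter to a value that is already a JSON string, consumers parsing the result see escaped quotes where structure was expected, which matches the malformed-JSON and HTTP 500 symptoms in the reports.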

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 40

Key Open Pull Requests

1. feat: Ultra-Low-Bit Quantization Kernels (Q1_5_K, Q2_K_S): This pull request introduces two new ultra-low-bit quantization kernels, Q1_5_K and Q2_K_S, designed to enable running very large models (e.g., 70B parameters) efficiently on memory-constrained consumer GPUs by providing aggressive compression with minimal accuracy loss, featuring optimized CPU implementations including AVX2 SIMD acceleration, rigorous testing for precision and performance, and integration into the existing quantization framework.

  • URL: pull/19750
  • Associated Commits: 8b75b, 7b0f2, 90412, 4dfb8, 6e97c, 25c8f, 56c9e, ff4ca, 793a5, 47b0c, 95be8, 3db03, 70477, 6900f, 2fb65, 8ea1d, 6e295, 08eae, bae75, 118ce, 0649b, 75f88, fa8a3, 77cbc, a6e87, c7c2f, be1f0, fd24b, 4bbe6, e68a1, c6f12, d9dd2, 0889b, 5ab38, cc957, 620f0, 8902f

2. hexagon refactor all Ops to use local context struct: This pull request refactors all Hexagon operations to use a local context structure that enables precomputing and caching of shared state, removes redundant wrappers and boilerplate, implements DMA for input/output handling, rewrites key loops for improved DMA pipelining, and introduces cross-operation optimizations that collectively deliver a substantial token-throughput improvement and allow for larger batch sizes.

  • URL: pull/19819
  • Associated Commits: 28316, 272e4, 3cd81, 1f72f, aca9a, b047d, 61841, 14752, a732d, 50e83, 2cf88, 3cd2a, f9d5f, b1b74, b9e23, 341a0, ccba1

3. WIP: ggml : add NVFP4 quantization type support: This pull request adds support for NVIDIA's NVFP4 quantization format to the ggml library, including new data types, conversion helpers, backend optimizations across CPU, CUDA, Metal, and Vulkan, integration with the gguf format, and comprehensive testing, aiming to enable efficient handling of NVFP4 models produced by NVIDIA ModelOpt.

  • URL: pull/19769
  • Associated Commits: 52754, d45d3, ab01d, 0a85b, a96f4, 87c74, c0839, 32864, e403c, 307ff, 14c51, 86dd3
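The per-block scale idea underlying ultra-low-bit kernels like those proposed in pull/19750 (item 1 above) can be sketched in a few lines. This is a toy symmetric quantizer with one float scale per block; the real Q1_5_K/Q2_K_S layouts are far more elaborate (nested scales, packed bit layouts, SIMD-friendly ordering):

```python
def quantize_blocks(values, bits=2, block=4):
    # Toy quantizer: each block stores one float scale plus `bits`-wide
    # integer codes, trading precision for a much smaller footprint.
    levels = (1 << bits) - 1
    blocks = []
    for i in range(0, len(values), block):
        blk = values[i:i + block]
        scale = max(abs(v) for v in blk) or 1.0
        codes = [round((v / scale + 1.0) / 2.0 * levels) for v in blk]
        blocks.append((scale, codes))
    return blocks

def dequantize_blocks(blocks, bits=2):
    levels = (1 << bits) - 1
    return [(c / levels * 2.0 - 1.0) * scale
            for scale, codes in blocks for c in codes]

vals = [1.0, -1.0, 0.5, 0.0]
restored = dequantize_blocks(quantize_blocks(vals))
# per-element reconstruction error is bounded by scale / levels
assert max(abs(a - b) for a, b in zip(vals, restored)) <= 1.0 / 3.0 + 1e-9
```

The bound shown in the last line is why such schemes quote "minimal accuracy loss": error shrinks with more levels per block, while memory cost grows only with the code width.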

Other Open Pull Requests

  • Multimodal and Vision-Language Model Support: Multiple pull requests enhance support for multimodal inputs and vision-language embedding models by fixing server-side context checkpointing to handle processed images in key-value caches and preventing full prompt reprocessing. Additional fixes address server crashes by skipping speculative decoding checks and adding tensor bounds validation for variable embedding tensor dimensions, ensuring stable operation with models like Qwen3.5 and Qwen3-VL-Embedding.
    • pull/19747, pull/19694, pull/19691
  • Quantization Improvements and Extensions: Several pull requests focus on quantization enhancements including porting IQ*_K quantization types to the main CPU backend, implementing block interleaving for Q5_K and Q6_K quantization optimized for x86 SIMD architectures, and adding support for TQ1_0 and TQ2_0 quantization types in the Vulkan backend. These changes improve model loading, performance, and accuracy while introducing new tests and quality-of-life improvements.
    • pull/19726, pull/19707, pull/19706, pull/19743
  • Performance and Backend Optimizations: Pull requests improve performance by enhancing the ggml-hexagon implementation with parallelized dot product functions and vector reduction utilities, adding CDNA3 MFMA tensor core flash attention support for MI300X GPUs, and introducing a new CUDA option to use 32-bit floating point compute to mitigate numerical overflows. These updates increase throughput, simplify code, and maintain correctness across extensive tests.
    • pull/19780, pull/19806, pull/19697, pull/19812
  • Model and Inference Testing Enhancements: Efforts to improve testing include implementing end-to-end tests with toy models to ensure consistent inference results across backends, adding unit tests for max pooling embeddings, and enabling partial GGUF model metadata loading from Huggingface for realistic unit tests. These changes enhance quality control and facilitate faster, more reliable development.
    • pull/19802, pull/19812, pull/19796
  • Server and Networking Improvements: Updates to server functionality include adding the SO_REUSEADDR socket option to the HTTP port for immediate port reuse after shutdown, fixing grammar root symbol checks to prevent incorrect failures, and cleaning up per-thread parameter buffer pools for dynamic resizing and improved job submission logic. These changes improve stability, usability, and resource management.
    • pull/19763, pull/19749, pull/19772
  • New Model and Architecture Support: Support is added for new models and architectures such as the Mistral Voxtral Mini 4B Realtime 2602 model with audio and text dual-stream inference, token classification models like BertForTokenClassification, and the EuroBERT architecture with GGUF conversion and backend support. These additions expand the range of compatible models and improve integration.
    • pull/19698, pull/19826
  • Documentation and Contribution Guidelines: Contributions include adding explicit guidelines for quantization scheme submissions requiring GGUF, PPL, KLD, and CPU performance data, and clarifying the operation of the gguf-split tool to reduce user confusion. These updates improve contributor experience and project transparency.
    • pull/19762, pull/19749
  • Code Quality and Workflow Automation: A new Pylint workflow is introduced for automated Python code analysis to enhance code quality checks, supporting ongoing maintenance and development efficiency.
    • pull/19671
  • Context Size and Memory Management: The default context size for CPU-only builds is updated to 4096 to prevent crashes on low-memory systems, while allowing manual overrides and preserving GPU build behavior. Additionally, hybrid memory snapshotting is improved by enabling partial success of the seq_rm operation to efficiently save recurrent and normal cache states.
    • pull/19711, pull/19670
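The SO_REUSEADDR change described above follows the standard sockets pattern, shown here with Python's socket module (the llama.cpp server is C++, but the option and its placement before bind() are the same):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Without SO_REUSEADDR, restarting a server can fail with
# "address already in use" while the old socket lingers in TIME_WAIT
# (e.g. after an SSE streaming session). Setting the option before
# bind() allows the port to be reused immediately after shutdown.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen()
assert srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
srv.close()
```

Note that the option must be set before bind(); setting it afterwards has no effect on whether the bind succeeds.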

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 77

Key Closed Pull Requests

1. [WIP] refactor llama-quant.cpp: This pull request is a work-in-progress refactor of the src/llama-quant.cpp file in the llama.cpp project, involving code cleanup, addition of a dry-run option for llama-quantize, improvements in tensor dimension handling, and various fixes and enhancements aimed at improving code clarity and functionality.

  • URL: pull/19616
  • Associated Commits: 844ad, e6b79, 0d222, 56c27, c3f42, b9b32, 150e1, 966b2, 07f88, 2769f, ea8da, 3211a, 55dbe, 22db7, ae786, 1ccd7, 16582, b15bb, 40528, 44f9f, f58de, 75ab2, 0301b, 5d6c9, 67e25, 1f25c, 6734e, d6486, fd378, 053a2, 97aef, bddc6, 7b127, a3bf0, f14fd

2. Add Kimi Linear to unified delta net: This pull request adds the Kimi Linear component to the unified delta net in the llama.cpp project, along with various code simplifications, optimizations, and synchronization updates to improve the model's implementation and performance.

  • URL: pull/19668
  • Associated Commits: cff8f, b0594, 6c765, a93bc, 7b268, 4a639, c0797, 11776, 6dad4, df269, 1cea2, a6fa6, 6432f, de6a8, 23ccc

3. Pre-MCP UI and architecture cleanup: This pull request focuses on pre-MCP user interface and architecture cleanup by refactoring the chat input flow, reworking message editing with shared contexts, cleaning up service and store structures, improving streaming and generation handling, normalizing reasoning and tool-call handling, enhancing model metadata caching, updating settings and type constants, and updating dependencies to prepare the codebase for further modularization.

  • URL: pull/19689
  • Associated Commits: 629cc, 956a0, b02d5, 1305b, 6823a, b78b9, 40f66, 726bd, eddbb, e259b, 28d4d, 3729f, d278b, 560eb

Other Closed Pull Requests

  • WebGPU shader reorganization and unary ops support: These pull requests reorganize the ggml WebGPU shader code by introducing a new shader library structure that centralizes shader preprocessing, JIT compilation, and caching, converting existing shaders to this format for better performance tuning. Additionally, support for unary operations like square, square root, sine, and cosine was added to the WebGPU backend, including a fix for casting source values before sine and cosine computations.
    • pull/19530, pull/19700
  • Model support and tokenizer enhancements: Multiple pull requests add support for new models and tokenizers, including the CohereLabs/tiny-aya family with a custom digit-grouping regex pre-tokenizer, the JAIS-2 Arabic-English bilingual models with specific architecture features and tokenizer fixes, the JoyAI-LLM-Flash tokenizer hash mapping, and the GLM-OCR model integrating text and vision components. These contributions also include fixes to tokenizer hashing and RoPE implementation, as well as enabling flash attention on CUDA backends.
    • pull/19611, pull/19488, pull/19651, pull/19677
  • CUDA graph and performance improvements: These pull requests improve CUDA graph capture by delaying activation until after a warmup period to reduce overhead and enable re-enablement once stable, and enable CUDA graph support for matrix multiplication with batch sizes between 1 and 4, improving parallel sequence generation and simplifying exception handling. They also fix kernel selection logic in CUDA tile FA to prevent host aborts.
    • pull/19754, pull/19645, pull/19686
  • MCP core runtime and UI integration: These pull requests introduce the MCP core runtime, management UI, settings dialogs, and integration of MCP and agentic flow into the chat experience, accompanied by documentation including architecture and generated static output for the web UI. They also clean up the pre-MCP user interface and architecture by extracting unrelated code changes to improve maintainability.
    • pull/19688, pull/19685
  • Vulkan and ROCm backend updates: Updates to the ROCm docker container and CI workflows to version 7.2 include support for the gfx1150 architecture, disabling rocWMMA on gfx908 due to compatibility, and adding a new build target for ROCm 7.2 artifacts. Vulkan improvements address batch dimension limits by splitting mul_mat operations into multiple dispatches with push constants and replace hardcoded strings with Vulkan SDK constants for consistency.
    • pull/19418, pull/19509, pull/19440
  • Tokenizer and quantization tool enhancements: A lightweight audio tokenizer based on the LFM2 architecture was added to enable embedding model use with distinct input/output dimensions and conversion support. The llama-quantize tool gained a --dry-run option to report tensor sizes without quantization, helping users preview quantization impact.
    • pull/19687, pull/19526
  • API and code refactoring improvements: Refactoring efforts include making the ggml_is_view function public to improve backend compatibility, deduplicating delta net graphs in Qwen35 models with a new struct, and refactoring OpenCL EXPM1 and Softplus operators to reduce duplication and improve clarity.
    • pull/19539, pull/19404
  • Chat parser and Responses API fixes: A dangling-reference bug in the chat parser was fixed by changing the constructor to take input by value, preventing input corruption and improving XML tool-call parsing to preserve the full original XML string. The Responses API endpoint was improved by merging contiguous assistant input items into single messages to keep content, reasoning, and tool calls together as expected by chat templates.
    • pull/19801, pull/19773
  • Flash attention toggle and testing: The clip_graph::build_attn function was refactored to allow toggling flash attention via command-line parameters, adding a --flash_off option in tests to validate non-flash code paths and ensure comprehensive testing across models.
    • pull/19729
  • Model configuration and shape fixes: The run-org-model.py script was updated to optionally print model configuration values to avoid errors when missing, and fixes were made to the beta and gate shapes in the qwen3.5 model after refactoring to ensure structural correctness.
    • pull/19681, pull/19730
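The Responses API fix that merges contiguous assistant input items can be sketched as follows (hypothetical helper and message shape; the real implementation works on llama.cpp's internal message types and also carries reasoning and tool-call fields):

```python
def merge_contiguous(items):
    # Merge runs of consecutive items that share a role into a single
    # message, concatenating their content, so chat templates see one
    # assistant turn instead of fragmented pieces.
    merged = []
    for item in items:
        if merged and merged[-1]["role"] == item["role"]:
            merged[-1]["content"] += item["content"]
        else:
            merged.append(dict(item))  # copy so the input is not mutated
    return merged

items = [
    {"role": "assistant", "content": "Reasoning. "},
    {"role": "assistant", "content": "Answer."},
    {"role": "user", "content": "Next question"},
]
out = merge_contiguous(items)
assert len(out) == 2
assert out[0]["content"] == "Reasoning. Answer."
```

Keeping content, reasoning, and tool calls in one merged message matters because chat templates typically render one block per turn, and split turns can produce malformed prompts.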

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor       Commits  Pull Requests  Issues  Comments
allozaur              241              5       0         7
ggerganov             108              5       0        23
pwilkin                56              2       0        14
CISC                   44              4       0        20
ddh0                   49              4       1         5
ymcki                  50              1       0         5
ServeurpersoCom        54              0       0         0
max-krasnyansky        50              1       0         0
No author found        50              0       0         0
0cc4m                  41              2       0         7
