Weekly Project News


Weekly GitHub Report for Llama.cpp: November 10, 2025 - November 17, 2025

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The release published on March 29, 2025 focuses on performance optimizations and incremental functionality improvements aimed at user experience and system efficiency.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Compile bug: Vulkan fails to compile after f117be1: This issue reports a compilation failure in the Vulkan backend caused by a recent commit that introduced a GLSLC check, which breaks builds on portable setups like w64devkit even though the code compiles successfully without the check. The problem centers on how the glslc executable path is resolved and passed to the shader generation tool, with discussion focusing on whether CMake or the shader generator should handle executable resolution to maintain portability and avoid build errors.

    • The comments identify the root cause as the handling of the glslc executable path, suggesting that CMake should resolve the full path rather than the shader generator doing so internally. Users report that CMake-based builds work because they prioritize a global Vulkan SDK path, while portable setups using Makefiles and custom paths hit errors. The discussion weighs portability against build-system complexity, with requests for unmodified build logs and clarification of environment setup, and concludes that the introduced GLSLC check breaks portable setups and that a fix likely involves changing how executable resolution is managed.
    • Number of comments this week: 15
  2. Feature Request: This issue addresses a logic problem in the handling of tool-aware models, specifically with Mistral Nemo Instruct 2407, where tools defined inside a .jinja template are not properly recognized or set in the server’s task parameters, causing the tool invocation logic to be bypassed. The user requests design-level guidance on how to best enable the server to detect tool awareness from the model’s GGUF metadata or from the final prompt after the orchestrator runs, so that tool parameters can be correctly initialized even when tools are embedded within the template rather than passed explicitly.

    • The comments clarify that the shipped Mistral Nemo template does not include tools by default and that tools must be passed explicitly to trigger tool handling; the discussion confirms the issue is about supporting tools hardcoded inside .jinja templates, which currently do not populate the server’s tool parameters. Contributors suggest this is a design decision and recommend seeking guidance from the core team on whether the orchestrator should modify task parameters or if the server should parse the final prompt for tools, with consensus that a proper fix requires understanding the intended flow and design philosophy behind tool handling in llama.cpp.
    • Number of comments this week: 13
  3. Misc. bug: Can you disable a backend in llama-server?: This issue concerns a user running llama.cpp on a laptop with both AMD and NVIDIA GPUs, where the software is built with support for both CUDA and Vulkan backends. The user experiences a delay because the CUDA backend attempts to initialize first even when explicitly requesting the Vulkan backend, and they are seeking a way to disable the CUDA backend to avoid this unnecessary initialization time.

    • The comments discuss potential environment variable solutions like GGML_VK_VISIBLE_DEVICES, which do not prevent CUDA initialization; suggestions include compiling only one backend to avoid the issue and investigating why CUDA initialization takes so long despite no CUDA-capable device being present. The user clarifies their use case involves switching between GPUs depending on availability, and the issue remains unresolved but is considered non-critical.
    • Number of comments this week: 5
  4. Qwen3-VL co-ordinate and bounding box errors (grounding errors): This issue reports that the Qwen3-VL model produces incorrect bounding boxes and coordinates, with no coordinates at all in the 4B version and poor localization in the 8B version, even when using FP16 models, indicating the problem is not related to quantization. The user suspects that the issue may stem from the conversion script removing non-vision layers of the vision tower and notes that the problem occurs across different interfaces, including the Python API and a CLI tool, while the Hugging Face transformers implementation does not exhibit these errors.

    • The comments clarify that Qwen3-VL coordinates are relative to a 1000x1000 grid and require rescaling to the original image size (a minimal rescaling sketch follows this list), but the original poster confirms proper scaling is already applied and points to poor accuracy or missing coordinates as the core issue. Further discussion identifies a potential root cause in the clip.cpp implementation, referencing related issues and pull requests where fixes have been applied, and asks for specific version details to aid troubleshooting.
    • Number of comments this week: 4
  5. Misc. bug: Regression: distribution contains duplicate dylibs (macOS): This issue reports a regression in the macOS distribution of the project where the pre-built archives contain multiple copies of the same dynamic libraries (dylibs), including unversioned, versioned, and symlinked versions, causing the archive size to increase unnecessarily. The problem stems from the way zip archives handle symlinks, converting them into full copies rather than preserving the links, which affects the packaging and linking of the executables.

    • The comments clarify that the duplicates are actually triplicates due to symlinks being archived as full copies in zip files, and discuss the limitations of the current upload-artifact action that does not support preserving symlinks. They consider switching to tar.gz archives to better handle symlinks and reduce archive size, while also noting the benefits of the current versioning scheme for dylibs and the potential impact on user workflows.
    • Number of comments this week: 4
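
As a concrete illustration of the grounding discussion in item 4 above, the sketch below shows the rescaling step the commenters describe, assuming coordinates are reported on a fixed 1000x1000 grid. The BBox type and helper name are hypothetical, not llama.cpp API; since the original poster confirms this scaling is already applied, the remaining inaccuracy points at the clip.cpp vision path rather than post-processing.

```cpp
#include <cstdio>

struct BBox { float x0, y0, x1, y1; };

// Qwen3-VL-style grounding output is reported on a fixed 1000x1000 grid;
// to overlay it on the source image, scale each coordinate by the real
// image dimensions. Hypothetical helper, not llama.cpp API.
BBox rescale_bbox(const BBox &grid_box, int img_w, int img_h) {
    const float sx = img_w / 1000.0f;
    const float sy = img_h / 1000.0f;
    return { grid_box.x0 * sx, grid_box.y0 * sy,
             grid_box.x1 * sx, grid_box.y1 * sy };
}

int main() {
    BBox b = rescale_bbox({250, 100, 750, 900}, 1920, 1080);
    std::printf("(%.0f, %.0f) - (%.0f, %.0f)\n", b.x0, b.y0, b.x1, b.y1);
    // prints: (480, 108) - (1440, 972)
}
```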

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 595 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference. The user is working on improving the Metal backend in a related project and seeks a documented or known method to produce the type of GPU debugger output used by Apple's Metal debugger.
  3. common: download from URL, improve parallel download progress status: This issue addresses a problem with the parallel downloading of sharded model files, where the per-file progress indicators conflict and do not display correctly. It proposes improving how the CURLOPT_NOPROGRESS option is used in the download code so that simultaneous downloads report accurate, non-overlapping progress (a progress-callback sketch follows this list).
  4. kubernetes example: This issue discusses the creation of a Kubernetes example for deploying the llama.cpp server using a Helm chart, aiming to facilitate scalable application deployment within the community. The original poster has begun work on this example and is seeking contributions and assistance to continue its development.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
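
For the parallel download issue above, here is a minimal sketch of wiring a per-transfer progress callback in libcurl, the mechanism the CURLOPT_NOPROGRESS proposal points at; the label plumbing is illustrative and the project's actual implementation may differ.

```cpp
#include <curl/curl.h>
#include <cstdio>

// Per-transfer progress callback; libcurl invokes it regularly once
// CURLOPT_NOPROGRESS is set to 0 for the handle.
static int xferinfo_cb(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char *name = static_cast<const char *>(clientp); // e.g. shard file name
    if (dltotal > 0) {
        std::fprintf(stderr, "\r%s: %3d%%", name, (int)(100 * dlnow / dltotal));
    }
    return 0; // returning non-zero aborts the transfer
}

void enable_progress(CURL *curl, const char *label) {
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, xferinfo_cb);
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, (void *)label);
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L); // 0 = progress reporting on
}
```

With several easy handles active at once, each callback needs its own output line (or a shared renderer) to keep indicators from overwriting each other, which is exactly the conflict the issue describes.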

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 46

Summarized Issues:

  • Model Conversion and Output Accuracy Issues: Several issues report problems with model conversion and output correctness, including incorrect bounding box coordinates in Qwen3-VL models due to clip.cpp implementation errors, failure to generate mmproj files by default in Qwen3-VL conversion, and inconsistent or incorrect outputs in LFM2-VL and Dream 7B diffusion models. These problems cause poor localization, missing visual features, and significant output discrepancies compared to original or Huggingface models.
  • issues/17131, issues/17264, issues/17290, issues/17291
  • Web UI Interaction and Usability Bugs: Multiple issues describe user interface problems in the llama-server WebUI, such as erratic text selection during model output generation, incorrect LaTeX escape rendering, inability to copy text properly, and missing features like "Continue Generation," batch conversation deletion, media attachment editing, token banning, and conversation branching. These bugs and missing features hinder smooth user interaction and content management within the WebUI.
  • issues/17132, issues/17165, issues/17206, issues/17209, issues/17210, issues/17221, issues/17202
  • Backend Initialization and Performance Problems: Several issues highlight backend initialization and performance concerns, including Vulkan backend memory allocation failures, CUDA backend initialization delays when Vulkan is selected, and a consistent performance gap where CUDA outperforms Vulkan on NVIDIA A100 GPUs. These problems affect resource management, startup times, and overall efficiency of model inference.
  • issues/17180, issues/17266, issues/17273
  • Model Loading and Runtime Failures: There are reports of model loading errors and runtime crashes, such as failure to load CLIP models due to unknown projector types, segmentation faults on image uploads with SYCL backend, core dumps with ROCm on AMD GPUs, and Vulkan backend crashes on AMD hardware triggered by CPU offloading options. These issues cause instability and prevent successful model usage on various hardware and configurations.
  • issues/17138, issues/17171, issues/17242, issues/17269
  • Compilation and Build Errors: Multiple issues describe compilation failures across different platforms and configurations, including macOS linker errors due to malformed version numbers, ARM backend build errors with CPU variant flags, riscv64 ISA string mismatches, and Vulkan backend compilation failures on Windows due to GLSLC checks. These build problems hinder cross-platform development and deployment.
  • issues/17164, issues/17187, issues/17258, issues/17282
  • Memory Management and Caching Bugs: Issues report memory management problems such as random purging of active memory slots causing loss of cached content, and taskset and CPU affinity settings being overridden by the --numa distribute option, leaving physical cores idle and behavior uncertain. These bugs degrade performance and resource utilization.
  • issues/17124, issues/17196
  • Feature Requests for Model and Tool Support: Several feature requests seek native support for reasoning and tool calling in Kimi-K2 models, addition of new model support like Jina-reranker-v3, ERNIE-4.5-VL-28B-A3B-Thinking, MobileLLMP1ForCausalLM, and guidance on quantizing Qwen3-VL models. These requests aim to expand functionality and improve model compatibility.
  • issues/17155, issues/17188, issues/17189, issues/17252, issues/17280
  • Server and API Behavior Issues: Problems include the llama.cpp server returning HTTP 400 errors instead of truncating chat history, failure of the /slots/reset API endpoint to clear KV cache causing multimodal request failures, and uncaught exceptions causing server termination when systemd limits CPU and tasks. These issues reduce server robustness and API reliability.
  • issues/17168, issues/17200, issues/17284
  • Testing and Automation Requests: There is a request to implement GitHub Actions workflows to automate building and updating the WebUI static output, aiming to reduce manual work and merge conflicts for contributors.
  • issues/17207
  • Discrepancies in Embeddings and Output: An issue reports significant differences in embeddings generated by llama-cpp-python versus llama-server using the same GGUF model and input, despite matching configurations, indicating potential inconsistencies in implementation or pooling methods.
  • issues/17203
  • Miscellaneous Bugs: Other reported bugs include system messages being saved as empty strings in IndexedDB and incorrect JSON escaping of backslashes in structured output causing parsing errors (a minimal escaping sketch follows this list). These minor bugs affect data integrity and output correctness.
  • issues/17195, issues/17157
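
Regarding the backslash-escaping bug in the last bullet, a minimal, hypothetical escaper shows the invariant structured output must maintain; this is illustrative, not the project's serializer.

```cpp
#include <string>

// Minimal JSON string escaper: a literal backslash must be emitted as "\\",
// otherwise the character after it is misread as an escape sequence.
// (A full escaper would also encode control characters below 0x20 as \uXXXX.)
std::string json_escape(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;      break;
        }
    }
    return out;
}
```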

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 23

Summarized Issues:

  • Performance Degradation on Vision-Language Models: A commit (b6976) caused significant performance degradation on Qwen3 VL benchmarks, reducing accuracy and scores on multiple models and tests like CHARTQA. This issue was resolved by a follow-up fix that restored performance to previous levels.
  • issues/17115
  • Vulkan and AMD GPU Memory Issues: Multiple issues report excessive memory usage, crashes, and performance degradation related to Vulkan backend on AMD GPUs, including Radeon 7900XTX and 780M, with problems such as "Not enough memory for command submission" errors and failure to create compute pipelines. Some issues were resolved by driver updates or fixes, while others remain open.
  • issues/17117, issues/17121, issues/17137, issues/17265
  • Docker Container Library and Startup Failures: Several issues describe Docker container failures due to missing shared libraries like libllama.so.0 and libmtmd.so.0, causing the llama-server or Vulkan-enabled containers to fail to start properly. These problems are linked to build or configuration errors affecting library paths and versioning.
  • issues/17176, issues/17190, issues/17193
  • Compilation and Build Failures: Various build problems occur including compiler crashes with IntelLLVM on Arch Linux, OpenSSL flag issues causing undefined references, ROCm HIP backend linker errors, and naming conflicts with system headers causing multiple errors. These prevent successful compilation or linking of llama.cpp components.
  • issues/17154, issues/17194, issues/17236, issues/17262
  • Model and Server Runtime Crashes: Bugs causing crashes include a divide-by-zero error in Qwen Omni 3B due to an uninitialized patch_size (a defensive-guard sketch follows this list), assertion failures in matrix multiplication tests, kernel panics on AMD Ryzen AI Max+ systems due to GPU faults, and server crashes from invalid input batches or unexpected evaluation states.
  • issues/17125, issues/17129, issues/17218, issues/17253, issues/17260
  • WebUI Usability and Model Recognition Bugs: The WebUI has issues where adding many images prevents scrolling to the prompt area, and it incorrectly fails to recognize Qwen VL models as supporting images, limiting user interaction and causing confusion.
  • issues/17162, issues/17231
  • Multi-GPU and Context Size Performance Issues: Prompt processing with certain large or odd-sized inputs on multi-GPU setups uses only one GPU and runs slower due to context size padding, which was later fixed to improve performance.
  • issues/17163
  • ROCm and GPU Architecture Support Limitations: The llama-server Docker build fails on MI50 GPUs due to missing TensileLibrary files for gfx906 architecture in ROCm 6.4+, requiring older ROCm versions or alternative builds until official support returns in ROCm 8.
  • issues/17166
  • Vulkan Shader and Driver Warnings: A Vulkan out-of-bounds access warning occurs in a shader when robustBufferAccess is disabled, triggered during matrix multiplication tests on RADV drivers. This is expected behavior due to lack of explicit bounds checking to optimize compiler load batching.
  • issues/17281
  • Unexplained or Empty Issue: One issue titled "LIEL" was opened and closed without any description or comments, providing no actionable information.
  • issues/17234
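
The divide-by-zero crash above, traced to an uninitialized patch_size, follows a common failure pattern: a missing hyperparameter silently defaults to zero and later divides something. A hedged sketch of the defensive check, with hypothetical names:

```cpp
#include <stdexcept>

// Illustrative guard: dividing by an unset patch_size (0) is what turns a
// missing hyperparameter into a hard crash. Validating at load time turns
// it into a clear error instead.
int compute_n_patches(int image_size, int patch_size) {
    if (patch_size <= 0) {
        throw std::runtime_error("model hparams: patch_size missing or invalid");
    }
    const int per_side = image_size / patch_size;
    return per_side * per_side;
}
```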

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 30

Key Open Pull Requests

1. llama.android : Rewrite Android binding: This pull request rewrites the Android binding for the llama.cpp project. It implements dynamic native library loading to support advanced CPU acceleration on both Android and ChromeOS, refactors the C++ layer and JNI bridge to add features such as automatic message formatting and system prompt injection, adds utilities for GGUF metadata parsing and ARM CPU feature detection, optimizes performance, and replaces the existing app with a basic sample app, accompanied by architectural diagrams and performance comparisons.

  • URL: pull/17152
  • Merged: No
  • Associated Commits: cbe71, 697d7, 3787f, 3f913, 32608, 4dd75, 46bd6, 5ad65, 7e5c8, ca2b7, a7ae8, 648b9, 65c09, a7ee3, 5868e, 4046c, 5596d, 4848b, 75c98, 55681, 64ebd, 3b499, fddf0, e8b84, 6b341, e47e3, 2a41c, 5e497, af0d6, 65741, 564b0, a1f6e, 0afd0, ea11e, 511df, d60bb, aedf4, 6e82b, eebc0, b6cc8, f5e2e, bc93c, 760d6, a3ebd, 5de0b, 290a6, adfbf, 025e3, 2d6b8, 561fe, 1bebd, 0d41e, 6b48f, 2614f, 4913a, 59f5c, 51b12, 32d77, 23d41, 286ed, 9cfa7, 97694, 43493, c2426, d70b8, 9ba74, 06448, 46859, 9f771, 65d4a, 72e97, 63fc5, 225c5, e269d, 69f2b, 41615, 481ba, c08d0, 8203d, ba40d, e1c77, c5a3a, f61c5, cb508, 9f1d2, 0d65c, 1d508, f3133, 8a682, a9466, 67499, 8ae0c, 7ed79, 7540c, 9056f, b81a0, c12ef, 6b74c, ec47f, 57b50, d7afc, 10ca2, 56a72, d1b01, a8dc8, 2b3ba, 77eda, 73330, ef379, 05c62, d97e2, dd036, 9e4ba, ec907, 32f37, 0dec7, d3011, 0bcb1, 3c539, b1831, 379be, 27edf, 43d9d, 81ad4, fe6ea, 4b2f7, 48fa0, cfbd2, 0c7e1, 49df3, bbf04, a4c66, 512fe, eab50, 75d1a, 3884b, ead41, 98c8f, bff98, 1b79d, 53ac8, b59c5, 72822, 4b3f6, 6a5bc, 130cb, 57c3a, c5058, 21e61, 70ec1, d211c, 8c6e4, 1f41a, e6413, 4ff92, 3370b, fe9ba, 5138c, aa224, 5b3b6, 31077, f085d, 85434, 7c2e2, 33d1e, 6f901, 5b761, 4e07a, e58ad, a5a54, 712bc, 7c3e4, 3da54, dd5b2, 79682, 38199, 2c9b1, 46e82, ca1cd, 173c4, 518d0, 28198, 1e1be, 365a7, ba652, 99d77, 659f5, c8480, b7537, 7313b, 8bd96, 027c6, 5794d, 50cea, 2e9de, baa6b, 6863b, 29f26, df16a, a4881, f23b7, cf306, b92c6, 98016, a9b84, c87ff, f1269, b1bcb, 58adb, 1c73f, 54716, d2793, 6fb4a, e067f, 8268d, 2b708, 36c37, eba09, 5f069, a4459, 0c6ce, 6cde2, 6db4c, 687b8, e0ddc, 480d7, d5220, 2223c, 83abf, ad85b, 930e7, 8f90e, 63e5b, 6dfdc, 96817, 56e83, 8897b, 7c2e6, 42e39, f833c, f10d1, 36440, 266fc, cadaf, f94ef, f10a4, 3fa3c, 33987, e7655, 0bbe3

2. common : implement parser combinators for chat parsing [WIP]: This pull request implements a proof-of-concept parser combinator framework for chat parsing using PEG grammars with packrat parsing and semantic actions. It aims to simplify and unify the parsing of model outputs, including complex XML-based tool calls with typed arguments, by generating specialized parsers on the fly and producing GBNF grammars, and it also introduces SAX-style parsing and JSON parsing improvements, though it remains a work in progress (a toy combinator sketch follows the key pull requests).

  • URL: pull/17136
  • Merged: No
  • Associated Commits: c822e, e6153, 4ced9, 2a9a1, 22865, 3e666, 76cf0, 9c7b3, f02e2, 66cf0, 2b3ca, adac6, 0be2a, 31b38, ffb7a, 08540, 6bd9a, 62656, 18557, f6aa6, d58da, 20f9a, 35b16, bcb1c, 4bed8, c02aa, 8e821, 9685b, d9a62, 117d9, f97ab, 3114a, cc4d5, c119c, 39d10, eabdb, 692ad, 64780, bbdf4, dd069, 94bd7, 599e2, c40b0, 77452, 74921, 843a2, 9f9fd, 7f92b, 87b92, b1aad, dcace, 10700, 4ebdd, 68f00, 9f09c, bee5e, 4228d, 0f0ec, ea519, 15564, 6dd6c, 9ebdd, c8d94, befca

3. server/public_simplechat vision (basic ok), toolcall (done, with 0 setup clientside builtin tools+), reasoing(done): This pull request introduces the initial basic framework for integrating vision model support into the tools/server/public_simplechat alternate web client UI, including features such as image file input, image display within chat messages, client-side built-in tool calls, reasoning capabilities, and IndexedDB-based chat saving and loading, building upon prior work on tool calling and enhancing user interaction with alerts and UI improvements.

  • URL: pull/17142
  • Merged: No
  • Associated Commits: 98423, f935f, c3cb0, f3c7e, 16277, d3a1d, 4152f, d5742, 20e52, 2a662, b443b, f1b36, 7c0a3, 902ae, f74d4, 49de7, ef415, 9269f, f832f, 55d62, 69ed3, ef817, 4fcc5, 7d2ad, 0f292, e5672, 70344, 7fa22, 692a7, ea2ed, fc62b, 27203, 788c4, 0eac6
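
To illustrate the parser combinator approach in key pull request 2, here is a toy C++ sketch of the core idea: small parsers composed into larger ones. The PR's actual framework (PEG grammars, packrat memoization, semantic actions, GBNF generation) is far richer, and none of these names come from the PR.

```cpp
#include <functional>
#include <optional>
#include <string_view>

// A parser consumes a prefix of the input and yields the remainder on success.
using Parser = std::function<std::optional<std::string_view>(std::string_view)>;

// Match a literal token.
Parser lit(std::string_view tok) {
    return [tok](std::string_view in) -> std::optional<std::string_view> {
        if (in.substr(0, tok.size()) == tok) return in.substr(tok.size());
        return std::nullopt;
    };
}

// seq: run a, then b on whatever a left over.
Parser seq(Parser a, Parser b) {
    return [a, b](std::string_view in) -> std::optional<std::string_view> {
        if (auto rest = a(in)) return b(*rest);
        return std::nullopt;
    };
}

// choice: try a; if it fails, fall back to b (PEG-style ordered choice).
Parser choice(Parser a, Parser b) {
    return [a, b](std::string_view in) -> std::optional<std::string_view> {
        auto r = a(in);
        return r ? r : b(in);
    };
}

// e.g. a toy tool-call opener accepting either of two hypothetical markers:
// auto open = choice(lit("<tool_call>"), lit("[TOOL_CALL]"));
```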

Other Open Pull Requests

  • Server refactoring and context management: Multiple pull requests improve server functionality by splitting HTTP handling into a dedicated interface and introducing a server_res_generator class for response modes, alongside enhanced error handling and multi-endpoint support. Additionally, context overflow during decoding is managed by clearing active slots and renaming functions to improve clarity and remove obsolete code.
    • pull/17216, pull/17267
  • Megrez-MoE architecture support and model improvements: The project gains full support for the Megrez-MoE architecture, including GGUF model conversion, inference with 64 routed experts, and extensive refactoring for memory and parameter management. This ensures correct parameter loading and coherent output across 30 MoE layers.
    • pull/17141
  • SYCL backend kernel optimizations: A unified generic unary kernel is introduced for the SYCL backend, consolidating multiple unary operations into a single templated kernel supporting 4-D tensors and multiple data types. This reduces kernel duplication, simplifies maintenance, and ensures compatibility across OpenCL and Level Zero devices.
    • pull/17213, pull/17204
  • GPU kernel enhancements and new architecture support: Several pull requests add and optimize GPU kernels, including WMMA-MMQ kernels for AMD RDNA 4 architecture, a new OpenCL kernel for attention matrix multiplication, and Vulkan backend support for the LOG operation on F32 and F16 types. These changes improve performance and correctness on various GPU platforms.
    • pull/17156, pull/17181, pull/17183
  • WebGPU backend and WebUI build automation: Support for building the ggml WebGPU backend with Emscripten is added, including necessary browser compatibility flags and a GitHub workflow. Additionally, an automated build workflow for the WebUI component is introduced with improvements for graceful fallback and token management.
    • pull/17184, pull/17217
  • CUDA and Vulkan performance improvements: CUDA support for the GGML_OP_CONV_3D operator is implemented with accurate indexing and stride-aware layout, while Vulkan's mul_mmq function gains the ACC_TYPE_VEC2 implementation for q2_K type, resulting in significant performance gains on NVIDIA GPUs.
    • pull/17255, pull/17147
  • Build system fixes and enhancements: Multiple pull requests address build system improvements, including fixing ARM feature verification in CMake, adding BoringSSL build options, proposing OpenSSL usage in CI, and preventing build/install targets for kleidiai dependency to avoid CMake errors.
    • pull/17170, pull/17205, pull/17254
  • Memory management and code maintainability: Refactoring replaces manual ACL object memory management with smart pointers to prevent leaks and clarify ownership (see the smart-pointer sketch after this list), while a new function centralizes buffer support checking in the ggml-hexagon backend to reduce duplication and improve clarity.
    • pull/17238, pull/17212
  • Attention and model input improvements: Attention temperature tuning support is added for the llama architecture, and offloading of the input layer for select models is proposed to potentially improve performance or resource management.
    • pull/17235, pull/17240
  • Bug fixes and compatibility improvements: Issues such as incorrect OpenCL rms_norm_mul kernel results and zero-size array declarations are fixed by modifying reduction methods and adding sentinel elements, ensuring successful builds across platforms and compilers.
    • pull/17250, pull/17239
  • Miscellaneous updates and rebasing efforts: An attempt to recreate and rebase a previous pull request is made to resolve merge issues, and missing AVX512 feature checks are added to ensure correct instruction requirements.
    • pull/17248, pull/17257
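
The ACL smart-pointer refactor mentioned above follows the standard unique_ptr custom-deleter pattern. Below is a generic sketch with hypothetical handle names; the real ACL API differs.

```cpp
#include <memory>

// Hypothetical C-style handle API standing in for the ACL objects
// (only the ownership pattern is the point here).
struct acl_tensor { /* opaque payload */ };
acl_tensor *acl_tensor_create()               { return new acl_tensor{}; }
void        acl_tensor_destroy(acl_tensor *t) { delete t; }

// Binding the destroy call to the pointer type means every exit path,
// including early returns and exceptions, releases the handle exactly once.
struct acl_tensor_deleter {
    void operator()(acl_tensor *t) const { acl_tensor_destroy(t); }
};
using acl_tensor_ptr = std::unique_ptr<acl_tensor, acl_tensor_deleter>;

acl_tensor_ptr make_tensor() {
    return acl_tensor_ptr(acl_tensor_create());
}
```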

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 61

Key Closed Pull Requests

1. Claude/add activation dump load cli 011 c uv j vgmad UI a6mq n8pbaq: This pull request introduces command-line interface (CLI) commands and interactive commands for dumping and loading intermediate layer activations in GGUF format within the llama.cpp project, enabling users to capture, save, load, and analyze model activations during inference and conversation sessions for improved debugging and model behavior understanding.

  • URL: pull/17130
  • Merged: No
  • Associated Commits: b7a42, 2ac5d, efac1, 45387, de61f, e728b, c7624, 21949, 8a709, 5c3af, aa0f2, bbe67, 21db6, 3faa4, 32607, 66669, 0ce2f, cb933, 7d55e, 4f94a, b3c06, a6a76, e9685, a19a0

2. server: (refactor) implement generator-based API for task results: This pull request refactors the server code to use a generator-based API for handling task results, which reduces reliance on callback functions, makes the code flow linear and easier to follow, and enables returning correct HTTP error codes during streaming (a minimal generator sketch follows the key pull requests).

  • URL: pull/17174
  • Merged: Yes
  • Associated Commits: dfa24, 88277, 440ce, 99344, cc2e3, 31b8b, efd73, f3bdd, bfa5a

3. feat(mtmd): add Eagle2-VL multimodal support (mmproj + SigLIP pipeline): This pull request introduces initial support for NVIDIA's Eagle2-VL vision-language models in llama.cpp by enabling GGUF conversion of the Eagle2-VL mmproj architecture and integrating a dedicated Eagle2-VL–specific multimodal inference path in the mtmd CLIP pipeline, while ensuring all changes are fully isolated to Eagle2-VL and do not affect other models or core functionalities.

  • URL: pull/17224
  • Merged: No
  • Associated Commits: 466f9, 9a2e7, c5be2, 14c56, 7bc9c, d6cb4, 0fcfb, f83bb
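
To make the generator-based refactor in key pull request 2 concrete, here is a minimal, hypothetical pull-style result generator; the server's actual types and control flow are more involved.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct task_result { std::string data; bool is_last; };

// Pull-based generator: the HTTP handler asks for the next result when it
// is ready to write, so errors can be mapped to status codes before any
// bytes are streamed. (Hypothetical stand-in for the server's types.)
class result_generator {
public:
    explicit result_generator(std::vector<task_result> results)
        : results_(std::move(results)) {}

    std::optional<task_result> next() {
        if (pos_ >= results_.size()) return std::nullopt;
        return results_[pos_++];
    }

private:
    std::vector<task_result> results_;
    std::size_t pos_ = 0;
};

// Usage: linear control flow instead of nested callbacks.
// while (auto res = gen.next()) { send_chunk(res->data); if (res->is_last) break; }
```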

Other Closed Pull Requests

  • SYCL Backend ABS Operation Support: This topic covers the addition of full SYCL backend support for the ABS unary operation, including kernel implementation for f16 and f32 data types with proper tensor stride handling. The changes integrate ABS into the SYCL unary dispatch logic and fix a CI crash related to a missing break statement in the SYCL unit test.
    • [pull/17126, pull/17169]
  • Vulkan Backend Enhancements: Multiple pull requests improve the Vulkan backend by making graph_compute asynchronous to improve CPU/GPU overlap and reduce latency, implementing ABS and NEG operations, updating tensor handling with ggml_vk_tensor_subbuffer in mul_mat_vec computations, and removing shell invocation in the vulkan-shaders-gen tool for better command execution.
    • [pull/17158, pull/17245, pull/17244, pull/17219]
  • Metal Backend Improvements: These pull requests add support for accelerated 2D convolution operations and implement argsort for large arrays by sorting smaller chunks and merging them hierarchically, significantly boosting performance on Apple hardware.
    • [pull/17175, pull/17247]
  • CPU Performance and Functionality Fixes: This group includes improvements such as skipping NOP operations in graph_compute_thread to speed up token generation, fixing 3D tensor handling in repack matrix multiplication to restore performance, and replacing bubble sort with std::sort in ggml_argsort for better efficiency (an argsort sketch follows this list).
    • [pull/17133, pull/17241, pull/17211]
  • Build and Dependency Management: Pull requests here focus on splitting httplib.h into separate source and header files for better build efficiency, moving OpenSSL linking to the vendor directory with related fixes, updating project dependencies to fix vulnerabilities, and updating editorconfig to ignore the benches directory.
    • [pull/17150, pull/17177, pull/17201, pull/17140]
  • WebUI and User Experience Enhancements: These changes improve the WebUI by enhancing multiple attachment handling with better file upload UI, error dialogs, previews, and automatic scrolling, as well as fixing clickability issues around chat processing statistics by properly handling pointer events.
    • [pull/17246, pull/17278]
  • Hexagon Backend and Model Fixes: This topic includes improvements to the Hexagon backend such as introducing a fast division method, fixing test failures for arithmetic operations, handling zero-row tensors explicitly, and resolving inference issues with Qwen3-VL models that generate empty tensors.
    • [pull/17135]
  • Code Refactoring and Bug Fixes: This covers refactoring to move build_inp_out_ids outside loops to avoid duplicate calls, fixing the patch_size initialization in audio models to ensure consistent behavior, and disabling certain operations on older GPUs to prevent crashes.
    • [pull/17151, pull/17128, pull/17134]
  • Native Tool Calling Format Support: This pull request implements support for the Kimi-K2-Thinking native tool calling format by adapting code from DeepSeek V3.1, refining regex patterns, and addressing template compatibility, though it remains unmerged and untested on lower quantized variants.
    • [pull/17251]
  • Continuous Integration Improvements: A dedicated CI job was added to check vendor files to prevent issues referenced in previous pull requests, improving automated validation.
    • [pull/17179]
  • Model Evaluation Results: This pull request adds evaluation results for the gpt-oss-120b model on the 8xAIME25 benchmark, showing high reasoning performance with 93.75% ± 0.24% accuracy on DGX Spark.
    • [pull/17123]
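
The ggml_argsort change above is an instance of the classic indirect-sort pattern: sort indices by the values they reference, leaving the data in place. A generic sketch, not the ggml kernel itself:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Argsort: returns the permutation of indices that orders v.
// std::sort is O(n log n) versus bubble sort's O(n^2).
std::vector<int> argsort(const std::vector<float> &v, bool ascending = true) {
    std::vector<int> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return ascending ? v[a] < v[b] : v[a] > v[b];
    });
    return idx;
}
```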

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor    Commits    Pull Requests    Issues    Comments
hanishkvc      428        6                1         3
hanyin-arm     250        1                0         1
ngxson         119        13               3         79
ggerganov      74         20               2         87
pwilkin        33         2                3         73
am17an         51         7                1         42
aldehir        63         1                1         27
0cc4m          26         8                0         56
jeffbolznv     28         10               0         51
CISC           23         6                0         55
