Weekly Project News

December 1, 2025

Weekly GitHub Report for Llama.cpp: November 24, 2025 - December 01, 2025 (12:06:02)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991

1.2 Version Information:

The version released on March 29, 2025, introduces key updates that enhance overall performance and user experience, reflecting a continued focus on stability and feature improvements. Notable highlights include optimized system processes and refined interface elements to streamline usability.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Feature Request: Support Vibethinker-1.5B: This issue requests adding support for the VibeThinker-1.5B model, a reasoning-focused AI that outperforms larger models on math and code benchmarks by using a unique training framework called the "Spectrum-to-Signal Principle." The user highlights the model’s specialization in introspective and explanatory tasks rather than casual chatting and seeks integration of its template, particularly in the GGUF format.

    • The comments clarify that the model shares architecture with Qwen2.5, which is already supported, so the main request is for chat template support. Participants discuss the model’s distinct training and specialization, provide multiple external references, and address concerns about the model’s uniqueness and relevance, while also calling out inappropriate behavior in the discussion.
    • Number of comments this week: 8
  2. Eval bug: Qwen Models segfault on Vulkan: This issue reports a segmentation fault occurring when loading various Qwen models using the Vulkan backend on an AMD Radeon 680M GPU under Linux. The problem appears to have started with a specific commit related to Vulkan flash attention subgroups, and debugging efforts, including a backtrace, suggest the crash may be due to a bug in the Radeon Vulkan driver.

    • The discussion includes confirmation of the error message "free(): invalid pointer," a request to bisect the code to identify the first bad commit, identification of the problematic commit, and a detailed backtrace from a debugger indicating a likely driver issue; another user reports no problems on a different Radeon GPU and Mesa driver version, suggesting the fault may be hardware or driver specific.
    • Number of comments this week: 7
  3. Misc. bug: error parsing grammar: number of repetitions exceeds sane defaults, please reduce the number of repetitions: This issue reports a parsing error caused by a newly imposed limit on the number of repetitions in the grammar, which breaks functionality in the latest Docker image of llama-server when using Open WebUI with native tool calls. The user highlights that this problem did not occur in previous versions and requests a configurable way to adjust the repetition limit without recompiling, since the hardcoded limit prevents them from updating the Docker image.

    • The discussion reveals that the limit was introduced to prevent DoS attacks, with a default threshold of 2000 repetitions considered sufficient for normal use cases. The user expresses the need for an environment variable or parameter to adjust this limit without recompiling, and the maintainers acknowledge the concern and plan to add a configuration parameter to address it.
    • Number of comments this week: 6
  4. Compile bug: no matching function for call to 'hipblasGemmEx' for ROCM on Linux: This issue reports a compilation error when building the project with ROCm 7.1 on Linux, specifically a "no matching function for call to 'hipblasGemmEx'" error caused by a mismatch in expected argument types due to changes in the hipBLAS API between ROCm versions. The user upgraded from ROCm 6 to 7 and encountered deprecated types and incompatible function signatures, leading to build failures related to hipblasDatatype_t and hipblasComputeType_t usage in the code.

    • The comments discuss potential causes including leftover ROCm 6.x remnants, differences between native and Docker ROCm installations, and changes in hipBLAS API where hipblasDatatype_t was replaced by hipDataType and hipblasComputeType_t. Contributors share detailed investigations of macro expansions and type definitions, suggest using a clean Docker environment to avoid version conflicts, and provide a working Docker build example for ROCm 7.1, while the original poster confirms careful path and version checks and requests further clarification on how certain macros compile successfully in other setups.
    • Number of comments this week: 5
  5. Eval bug: Qwen 2.5 VL 3B Instruct infinite ? output for certain image dimensions: This issue reports a bug in the Qwen 2.5 VL 3B Instruct model where inference produces an infinite sequence of question marks for certain image dimensions, specifically with an image sized 1024x484. The problem appears to be related to NaN values found in the CLIP model's F16 embeddings, which may be caused by precision overflow, and while scaling the image resolves the issue, a proper fix is still needed.

    • The comments confirm the issue occurs on different hardware and is not tied to image resolution alone, with debugging revealing NaN values in the embeddings, likely due to precision overflow in the CLIP model; a temporary workaround replacing the NaNs with zeros allows inference to proceed (an illustrative sketch of this kind of NaN-clamping workaround appears after this list), but no definitive solution has been found, and the discussion ends with a call for expert assistance.
    • Number of comments this week: 4
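
As a point of reference for the NaN workaround mentioned in the Qwen 2.5 VL item above, the sketch below shows the general shape of such a stopgap: scanning a float embedding buffer and zeroing any NaN entries before they reach the decoder. This is an illustrative example only; the function and variable names are hypothetical and it is not the patch discussed in the issue.

```cpp
// Illustrative sketch only: scan a float embedding buffer and replace NaN
// entries with zeros before handing it to the decoder. The function and
// variable names are hypothetical and not taken from the llama.cpp source.
#include <cmath>
#include <cstddef>
#include <cstdio>

// Returns the number of NaN entries that were zeroed out.
static size_t zero_out_nans(float * embd, size_t n) {
    size_t replaced = 0;
    for (size_t i = 0; i < n; ++i) {
        if (std::isnan(embd[i])) {
            embd[i] = 0.0f;  // crude stopgap; the underlying precision overflow remains
            ++replaced;
        }
    }
    return replaced;
}

int main() {
    float embd[4] = {0.25f, NAN, -1.5f, NAN};
    const size_t n = zero_out_nans(embd, 4);
    std::printf("zeroed %zu NaN value(s)\n", n);  // prints: zeroed 2 NaN value(s)
    return 0;
}
```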

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 600 days, indicating a persistent and unresolved bug affecting this particular implementation.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status indicators during parallel downloads.
  4. kubernetes example: This issue discusses the creation of a Kubernetes example for the llama.cpp project, specifically focusing on developing a Helm chart to facilitate deploying the server in a scalable and industry-standard manner. The author has made initial progress but seeks community contributions to continue and improve the implementation.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 28

Summarized Issues:

  • Configuration and Default Behavior Issues: Several issues report problems with default settings and hardcoded limits that restrict functionality or cause unexpected behavior. These include the kv_unified setting being enabled by default without a way to disable it, a hard cap on model context length preventing longer contexts, and a fixed repetition limit in grammar parsing that cannot be adjusted without recompiling.
  • issues/17450, issues/17459, issues/17473
  • Model Architecture and Conversion Support Requests: Users request support for additional model architectures and conversions, highlighting gaps in current compatibility. These include adding support for LlamaBidirectionalModel, converting the fine-tuned VibeVoice model to GGUF format, and adding new models like HunyuanOCR-1B, VibeThinker-1.5B, INTELLECT-3 chat format, and dots.ocr multilingual OCR.
  • issues/17478, issues/17488, issues/17509, issues/17559, issues/17560, issues/17564
  • Segmentation Faults and Crashes on AMD/ROCm and Vulkan Backends: Multiple issues describe segmentation faults and crashes when running models on AMD GPUs using ROCm or Vulkan backends. Problems include crashes with multiple AMD GPUs, failures on ROCm 7 with hipBLAS incompatibilities, and Vulkan-related invalid pointer errors, indicating instability in GPU acceleration on these platforms.
  • issues/17561, issues/17583, issues/17586
  • Memory and Performance Issues with Large Inputs and Quantization: There are reports of memory allocation failures and performance degradation related to large batch sizes and quantized models. These include memory pool exhaustion when using large --ubatch-size values on AMD ROCm GPUs and unexpected throughput drops with ROCm versions 7.10 and 7.11 compared to 7.9.
  • issues/17578, issues/17596
  • Tokenization and Decoding Errors: Issues arise from incompatible tokenizers and integer overflow in token processing, causing crashes and misleading error messages. Specifically, universal assisted generation fails with incompatible tokenizers causing decoder crashes, and 32-bit integer overflow in template application leads to incorrect error reporting.
  • issues/17463, issues/17480
  • Backend Loading and Environment Variable Handling Problems: Problems with backend library loading and environment variable handling cause failures in model loading and execution. The CPU backend loader ignores LD_LIBRARY_PATH, forcing shared libraries to be placed in the executable's directory (a minimal dlopen() sketch illustrating this search-path behavior follows this list), and there is a request for an environment variable to configure log file output in Docker.
  • issues/17491, issues/17601
  • Model-Specific Bugs and Inference Failures: Specific models exhibit unique bugs such as infinite output loops, key-value cache restoration failures, and segmentation faults triggered by certain options. Examples include infinite question marks from Qwen 2.5 VL 3B Instruct due to NaN embeddings, key-value cache issues in GPT-OSS 120B with parallel processing, and crashes when using --cpu-moe option.
  • issues/17527, issues/17534, issues/17546
  • Feature Requests for Improved Usability and Extensions: Users request enhancements such as more flexible file upload type checks, Vulkan backend memory priority support, and chat format templates for advanced models. These aim to improve user experience and extend functionality to better support tool-calling, reasoning, and performance under memory pressure.
  • issues/17556, issues/17605
  • Compilation and Build Failures: Compilation errors occur due to backend API changes and recent code updates, including ROCm HIP backend incompatibilities and Metal kernel compilation failures. These break builds and require updates to maintain compatibility with evolving dependencies.
  • issues/17489, issues/17603
  • User Interface and Schema Validation Issues: The web UI no longer shows context usage in error messages, and schema validation errors cause request failures in integrations like LibreChat. These reduce transparency and cause retry loops, with suggestions to improve error handling and schema correctness.
  • issues/17499, issues/17574
  • GPU Acceleration Startup Delays: A delay in GPU acceleration activation is reported when running models with ramalama, contrasting with immediate acceleration in ollama, indicating possible initialization inefficiencies or configuration issues.
  • issues/17533
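
For the LD_LIBRARY_PATH item above, one plausible explanation (an assumption on our part, not a confirmed root cause from the issue) is the standard dlopen() behavior on Linux: the search path is only consulted when the requested library name contains no slash, so a loader that builds an explicit path relative to the executable bypasses LD_LIBRARY_PATH entirely. The minimal sketch below, which is not llama.cpp code, demonstrates the difference.

```cpp
// Minimal sketch (not llama.cpp code) of why a backend .so may bypass
// LD_LIBRARY_PATH: dlopen() only searches the library path when the name
// contains no '/'; any name with a directory component is used as given.
// Build with: g++ dlopen_demo.cpp -ldl
#include <dlfcn.h>
#include <cstdio>

int main() {
    // A bare name is searched in LD_LIBRARY_PATH and the default directories.
    void * a = dlopen("libggml-cpu.so", RTLD_NOW | RTLD_LOCAL);
    const char * err_a = a ? nullptr : dlerror();

    // A name containing '/' is NOT searched: the leading "./" pins it to the
    // current directory, mirroring a loader that builds an explicit path
    // next to the executable.
    void * b = dlopen("./libggml-cpu.so", RTLD_NOW | RTLD_LOCAL);
    const char * err_b = b ? nullptr : dlerror();

    std::printf("bare name : %s\n", a ? "loaded via search path" : err_a);
    std::printf("./ path   : %s\n", b ? "loaded from explicit path" : err_b);

    if (a) dlclose(a);
    if (b) dlclose(b);
    return 0;
}
```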

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 17

Summarized Issues:

  • User Interface and Usability Issues: Several issues highlight problems affecting user experience, such as disruptive automatic scrolling in the WebUI and improper rendering of raw HTML tags inside markdown tables. These issues cause inconvenience by forcing manual scrolling or displaying formatting errors, reducing overall usability.
  • [issues/17292, issues/17462]
  • Parsing and Data Handling Bugs: There are bugs related to improper escaping of backslash characters in JSON schema literals and incorrect formatting of Prometheus metrics data as quoted strings. These parsing errors lead to program failures and metric parsing problems, impacting reliability.
  • [issues/17306, issues/17384]
  • Build and Compilation Problems: Multiple issues report long compile times, build failures, and compilation errors across different environments including Windows and Docker. Problems include slow compilation of chat.cpp, inability to compile CUDA test programs, and Docker image build failures due to missing C compilers or outdated tools.
  • [issues/17329, issues/17522, issues/17547]
  • GPU and CUDA Compatibility Issues: Several issues describe problems with GPU detection, memory reporting, and CUDA initialization failures. These include incorrect memory recognition on NVIDIA RTX 5090, failure to detect GPU due to compilation flag typos, unsupported driver combinations causing CUDA init errors, and crashes on Intel GPUs with large VRAM usage.
  • [issues/17536, issues/17539, issues/17542, issues/17545]
  • Feature Regression and Script Bugs: Some issues report regressions and bugs in features such as Vulkan backend integer dot product being disabled in Docker, and the convert_lora_to_gguf script ignoring output type parameters. These regressions cause unexpected behavior and output format issues.
  • [issues/17414, issues/17447]
  • Logging and External Application Interference: One issue describes misleading log messages caused by external applications like OBS hooking into the llama-server process, which creates confusion by reporting errors not originating from llama.cpp itself.
  • [issues/17498]
  • Model Support and Inquiry: There is a request for information about support for the qwen2.5vl model and the location of its mmproj file, indicating user interest in expanding model compatibility.
  • [issues/17518]
  • Unspecified Issues: One issue was closed with no description or comments, providing no additional information about its content or resolution.
  • [issues/17507]
  • Model Offloading and Performance Bugs: A bug causes incoherent output when running a large GLM model fully offloaded on CUDA with flash attention enabled, suggesting issues with layer offloading and flash attention interaction on NVIDIA RTX 40-series GPUs.
  • [issues/17549]

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. Feature Request: Support Vibethinker-1.5B
    • Toxicity Score: 0.65 (Rapid escalation, personal attacks, defensive responses)
    • This GitHub conversation begins with users discussing the support and characteristics of a specific AI model, with initial comments being informative and neutral. As the discussion progresses, one user accuses another of being a "hater" and questions their motives, introducing a confrontational tone. This provokes a measured but firm response from another participant, who calls out the personal attacks and emphasizes constructive contributions. The conversation shifts from technical to personal, with tension triggered by perceived hostility and defensive reactions.

III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 49

Key Open Pull Requests

1. server: introduce API for serving / loading / unloading multiple models: This pull request introduces an experimental multi-model management API for the llama-server that enables listing, loading, unloading, and routing requests to multiple models dynamically using a multi-process architecture designed for resilience, with compatibility to the OpenAI-style /v1/models endpoint and on-demand model loading from local directories.

  • URL: pull/17470
  • Merged: No
  • Associated Commits: fc590, 399f5, abc0c, 54b35, 5423d, 0ef3b, 7c6eb, 919d3, 55d33, b9ebd, 66107, d0ea9, 5805c, 8a885, c35de, 21614, 5369a, 6929c, be25b, cd5c6, a2e91, 4bf82, cc88f, 45bf2, 049f4, c26c3, 032b9, 8b1d9, 6b7c0, 69503, 92585, 62ee8, 7cd92, 72415, b0540, 525e2, 457fb, c274f, f2ca5, d32bb, 4af1b, 076ee, db8ed, dc913, a39ef, 036cc, 62825, f25bf, 7ef63, f927e, 74685, f95f9, 2e355, 5ad59, d65be, 1f0cb, b7ba1, 48dbe, 1c214, ef5f9, 49c80, d5a66, f8ff3, 41764, 219fd, e92ce, fb544, 39fb1, 188d3, 16747, e808f, 76557, 13fe8, b2590, 5ef3f, 6ed19, 2c6b5, 539cb, 399b3, e514b, e40f3, e2731, becc6, 1493e, bdaf4, 7be83, c1dfc, a82db

2. server/public_simplechat tiny (50KB) web client updated with reasoning, vision, builtin clientside tool calls and markdown: This pull request updates the server/public_simplechat tiny (50KB) web client by adding initial support for minimal markdown-to-HTML conversion for basic formatted views of AI-generated markdown text, integrating reasoning and vision model features, and enhancing built-in client-side tool calls with per-chat-session configuration and UI improvements, all implemented internally without external dependencies and designed to fit within a compact pure HTML/CSS/JS package supplemented by a Python proxy for CORS handling.

  • URL: pull/17506
  • Merged: No
  • Associated Commits: 9765c, 1fa94, 3a360, dbe6c, 73fbd, 5c3a5, 4685f, d816d, cfcb5, 3adbb, 32232, d44e0, 489d0, 6af6b, 468b6, 89340, cf085, 5dede, 09ec4, 9ea08, a4b3f, ba5ae, 4cabe, e66e6, b42ae, 1dcbc, 96c5b, 726d7, 8fc91, 46688, 0f01d, 207f7, 23ece, 338bc, c0593, 2e54a, dd260, 18f39, 38449, 25d8f, 5bf00, 9a244, 09f50, 02398, 057ff, 8e193, b7c2d, d1ede, a7a37, 06b03, 6cb6a, f66ad, fcb3c, defb2, f4e78, 14795, 49742, 4f23c, 824c6

3. model : add LLADA 2.0 diffusion support: This pull request introduces support for the LLaDA 2.0 model architecture and its diffusion mechanism into the project, including implementation of conversion scripts, CLI integration, and hybrid diffusion optimizations, while maintaining backward compatibility and addressing known issues such as unexpected tokens and max length constraints.

  • URL: pull/17454
  • Merged: No
  • Associated Commits: ce469, 9716b, a9e81, bfc0b, 85f52, d5a47, 3db37, b763f, 07180, b9a93, d0599, e0714, 985ff, d3839, 0309f, 885ae, 603c8, 2c2a9, 4e5ab, 758c2, 76e16, 66610, fa087, e763d, e81ad, ebe92, eace3, c488c, e84a7, 77d83, dcc5f, a48f4, 876fa, 1c8e5, d2005, 68081, 8e372, 0eaaa, 80cb6, 97dcb, baae3, d7f7d, a8ba6, 679de, 11bd5, 8cf15, 191f1, de641, 6896b

Other Open Pull Requests

  • WMMA and WMMA-MMQ Support for AMD GPUs: These pull requests backport WMMA MMF support and related RDNA4 WMMA-MMQ improvements to RDNA3 GPUs, enabling WMMA-MMQ INT kernels for RDNA 3 architecture and fixing related test failures. The changes result in significant performance speedups on the RX 7900 XT and implement WMMA instructions for most quantizations except q2k.
    • pull/17495, pull/17575
  • Hexagon Backend Improvements: Refactoring efforts in the Hexagon backend introduce generalized, template-based CPU-side operation functions to reduce code duplication and optimize cache usage, alongside explicit buffer type management. Additionally, the rope operation implementation is fixed by correcting scaling calculations and optimizing element copying to pass all tests.
    • pull/17500, pull/17565
  • LFM2-VL Model Fixes: This pull request fixes multiple issues in the LFM2-VL model, including positional embedding resizing to match PyTorch outputs with an antialiasing flag, adjusting image embedding token placement, and implementing smart resize with rounding and stretching. These changes ensure consistent output between llama.cpp and PyTorch implementations.
    • pull/17577
  • CUDA FlashAttention Enhancements: The CUDA FlashAttention implementation is generalized by adding support for non-padded attention masks, Volta tensor core compatibility, and refactoring kernel configuration for flexibility and ROCm compatibility. Additional improvements include out-of-bounds checks and fixing tile shape defects to enhance performance and maintainability.
    • pull/17505
  • Vulkan Backend VRAM Heuristic: A dynamic VRAM-based heuristic is introduced to automatically calculate the optimal number of GPU layers to offload in the Vulkan backend. This enables efficient and stable acceleration on low-VRAM GPUs like the AMD RX 6500 XT by maximizing performance and preventing out-of-memory crashes without manual tuning (a simplified sketch of this kind of VRAM budget calculation appears after this list).
    • pull/17485
  • Tokenization and Chat Template Additions: Tokenization using SentencePiece is introduced along with a chat template for Teuken, complemented by updates to the README and an Inno Setup script to support these features.
    • pull/17529
  • Kimi Linear Model Support: Comprehensive support for the Kimi Linear model is added, including KDA and MLA layers for advanced attention, MoE feed-forward network enhancements with shared experts, TikToken tokenizer integration, CUDA kernel optimization for KDA, and fixes for MoE parameter handling and model conversion.
    • pull/17592
  • RISC-V Vector Extensions Support: Runtime detection of the RISC-V Vector (RVV) extension is added with VLEN-agnostic kernel selection, alongside a RISC-V vector intrinsic implementation for ggml_vec_dot_f16_unroll optimized for Zvfh extension and fixes to avoid RVV vector arrays in the build.
    • pull/17483, pull/17496
  • User Interface and Usability Improvements: Enhancements include applying safe, consistent color schemes to links and table cells for better readability, and improving deletion dialog usability by truncating long conversation names to prevent layout breakage.
    • pull/17555
  • Operator Execution Time Profiling: Operator-level execution time profiling is added to the ggml-cpu backend via a compile option that records operator execution times in milliseconds to a CSV file with thread-safe synchronization, ensuring accurate multi-threaded timing and minimal overhead (a generic sketch of this logging pattern appears after this list).
    • pull/17544
  • CUDA CUMSUM and TRI Operations: Support for CUMSUM and TRI operations is added for CUDA by adapting kernels from a previous pull request, including minor optimizations and a correction to the warp_prefix_inclusive_sum function in the float2 variant.
    • pull/17584
  • SVE Vector Length Detection Update: The ggml library is updated to use the svcntb() function for detecting SVE vector length, enabling support for SVE on FreeBSD and other platforms.
    • pull/17474
  • MSVC Build Documentation: Detailed documentation is provided for building the llama.cpp project using MSVC with Visual Studio 17 2022, including setup instructions for libcurl and ccache, extended Windows CMake build options, and Pack and Run functionality.
    • pull/17510
  • CANN Backend Operator Fusion: Operator fusion support is introduced in the CANN backend to optimize performance by combining ADD and RMS_NORM operations into a single kernel call, with an environment variable to enable fusion, helper functions for compatibility checks, and graph evaluation updates.
    • pull/17512
  • CANN Backend Rotary Position Embedding (RoPE) Support: Support is added for two RoPE variants in the CANN backend: partial rotation of tensor dimensions and a Vision mode rotating the entire tensor by pairing dimensions. Implementation includes handling contiguous and non-contiguous tensors, F32 and F16 data types, and improved cache invalidation.
    • pull/17543, pull/17515
  • dots.ocr Model Support Work in Progress: Efforts to add support for the dots.ocr model include model format conversion and CLIP functionality implementation, but the work remains incomplete due to unresolved issues likely related to vision 2D RoPE.
    • pull/17575
  • Sliding Window Attention Pattern Loading: The sliding window attention (SWA) pattern is updated to load directly from GGUF file metadata when available, allowing newer models to specify their SWA pattern without code changes while maintaining backward compatibility and model-specific overrides.
    • pull/17597
  • CUDA Error Handling Improvement: Error checking is added for the cudaMemcpyAsync call in the argsort_f32_i32_cuda_cub function by introducing a CUDA_CHECK wrapper to properly handle potential CUDA errors, following existing CUDA backend patterns.
    • pull/17599
  • JSON Grammar Typo Fix: A small fix corrects the number rule in the JSON grammar file json.gbnf to disallow exponents with leading zeros and allow trailing zeros, aligning the rule with intended behavior.
    • pull/17460
  • CI Workflow Adjustment: The continuous integration workflow is modified to skip the winget update step for repositories outside the ggml-org organization, preventing forks from generating daily failure notifications.
    • pull/17465
  • Floating Point Precision Test Fix: Floating point precision false positives in SUM tests are addressed by modifying the input distribution to avoid catastrophic cancellation, preventing large numerical errors in Vulkan and CUDA backends during repeated test runs.
    • pull/17471
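
To illustrate the Vulkan VRAM heuristic item above (pull/17485), the sketch below shows one simple way such a calculation could look: reserve headroom for context and driver overhead, then divide the remaining budget by an estimated per-layer size. The formula and all names here are assumptions for illustration, not the heuristic actually implemented in the pull request.

```cpp
// Minimal sketch, not the code from pull/17485: estimate how many layers fit
// in VRAM by reserving headroom and dividing the rest by a per-layer estimate.
#include <algorithm>
#include <cstdint>
#include <cstdio>

// All names and the formula are illustrative assumptions.
static int estimate_gpu_layers(int64_t free_vram_bytes,
                               int64_t per_layer_bytes,
                               int64_t reserve_bytes,   // context, scratch, driver overhead
                               int     n_layers_total) {
    const int64_t budget = free_vram_bytes - reserve_bytes;
    if (budget <= 0 || per_layer_bytes <= 0) {
        return 0;                                    // nothing fits; run fully on CPU
    }
    const int64_t fit = budget / per_layer_bytes;    // whole layers that fit in the budget
    return (int) std::min<int64_t>(fit, n_layers_total);
}

int main() {
    // Example: 4 GiB card, ~120 MiB per layer, 1 GiB reserved, 35-layer model.
    const int n = estimate_gpu_layers(4LL << 30, 120LL << 20, 1LL << 30, 35);
    std::printf("offload %d layers\n", n);           // prints: offload 25 layers
    return 0;
}
```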
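
To illustrate the operator execution time profiling item above (pull/17544), the sketch below shows the general pattern of a mutex-guarded CSV writer fed by a scoped timer running on multiple threads. It is a generic example; none of the names, fields, or file formats are taken from the pull request.

```cpp
// Generic sketch of mutex-guarded per-operator timing written to CSV;
// names and structure are illustrative, not taken from pull/17544.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex g_csv_mutex;
static FILE *     g_csv_file = nullptr;

// Append one row under a lock so concurrent worker threads do not interleave output.
static void record_op_time(const char * op_name, int thread_id, double ms) {
    std::lock_guard<std::mutex> lock(g_csv_mutex);
    std::fprintf(g_csv_file, "%s,%d,%.3f\n", op_name, thread_id, ms);
}

// RAII timer: measures the enclosing scope and records it on destruction.
struct scoped_op_timer {
    const char * name;
    int          tid;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    scoped_op_timer(const char * n, int t) : name(n), tid(t) {}
    ~scoped_op_timer() {
        const auto end = std::chrono::steady_clock::now();
        record_op_time(name, tid, std::chrono::duration<double, std::milli>(end - start).count());
    }
};

int main() {
    g_csv_file = std::fopen("op_times.csv", "w");
    if (!g_csv_file) return 1;
    std::fprintf(g_csv_file, "op,thread,ms\n");

    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([t] {
            scoped_op_timer timer("MUL_MAT", t);   // pretend this thread runs one operator
            std::this_thread::sleep_for(std::chrono::milliseconds(5 + t));
        });
    }
    for (auto & w : workers) w.join();

    std::fclose(g_csv_file);
    return 0;
}
```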

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 74

Key Closed Pull Requests

1. server/public_simplechat alternate web client ui update - uncompressed 300KB - built in client side tool calls with 0 setup - reasoning - vision - ai calling into itself: This pull request proposes an alternate lightweight web client UI for the llama.cpp project, built purely with HTML, CSS, and JavaScript without external dependencies. It features integrated reasoning, vision, and tool-calling capabilities, including support for multiple images per message, while significantly reducing the source size to about 300KB and offering a simpler, more flexible user interface than the default, heavier Svelte-based web UI.

  • URL: pull/17415
  • Merged: No
  • Associated Commits: 9953d, bb6b3, 8f460, 25b0e, 83786, d0156, 946af, 0a4d6, 2c69c, 1df42, 4276f, dfc31, 374af, a4721, 1e43c, b3187, 73fea, eaff5, 0273e, b97b3, 2894e, e29db, f52b9, 94a81, ee4b4, 7b9fd, d4350, d486e, 004b3, 83cfc, a4366, 48142, a40a8, 87726, af0b3, 1fa3c, ddc2e, 1ce5e, 69c6a, 22cca, 1e580, d865b, bbf6e, f9c62, 95f75, af38f, 64ccf, 5354a, 6a44d, d97f2, e15f5, d9b35, ce71a, c0b42, 00c3d, 3cbc9, 1452c, f77b2, 4db7b, 1c5fd, c0d7d, f20ac, 800d3, 48c82, 4ed20, b4a72, 95787, 4d9b9, b6660, a9a3f, 2e5d7, f8bce, af004, e173a, 7e9df, efb63, fe89c, e7167, 52f55, 921f5, 20a38, 27a9f, edaeb, 5fb94, 0f2e0, 308ee, 6d26b, 32b3c, 66804, 4292e, 5a60b, bb4f0, d2281, ce0e7, f1fdc, a3248, 10c72, 16499, 94f37, 64439, 4a071, cfee2, 361d6, 7c4d4, 40829, dfb15, 93057, 136dd, bd85d, 8650a, 6f52e, 21845, 6c50b, ac716, 6e1d0, 0d06d, 772d2, 50fa2, ed391, 9a8ff, bf890, 0ec8f, fc8f1, 714c1, 5ea26, 1cb0d, 22809, 2097c, 45a8c, c30a2, 14aa4, eee88, 619a9, 1efff, 84a83, ba68d, 44dfe, 37ddd, 389ab, 295df, 9d704, 11109, 24ea7, f64d3, 33370, 76fde, 8ef05, 0f876, b14c5, 65e0d, b780a, 06e88, 4f004, 9ee86, 34347, 067a1, e0274, 17748, 98674, e7888, ef326, d1886, 02fd8, e1880, 99f2b, ffb9f, 7987a, 4d0f0, 323c1, 16f43, 2c95f, 7a8cb, 42ce8, 2acc2, bd6fe, 3ada2, 7765c, 60fae, 09220, 7a640, ef5e7, 919ff, 9305e, 5d9a9, 5208e, 512b8, dc637, c8d33, 14e49, 9936c, c71cc, 6f2f7, 70fc4, 716a7, 818e2, 8f2df, 722e5, 6e5ea, 280f6, 705f7, 4b0f3, 317fc, 82187, cbd87, 1d911, 2175b, 4d62e, e518b, 854ab, 0d564, 62bb6, 9319e, c5c25, cc805, 202d4, 4a36e, 667b8, c0fac, 21520, 082a9, 61095, d11b7, 5830a, 9c652, 04be0, e5275, 7df43, 2c3a6, 3d1ee, 2ff76, bb4a3, b4ba7, 0e0ae, 34064, 3eda5, 1a824, f8c50, ea25b, c711b, 9fbd5, a709b, 48111, 06663, 533c8, a9c7f, b90dc, d6537, a04fe, 81106, 12295

2. mobile: ARM64 optimizations for Snapdragon and Dimensity SoCs: This pull request proposes optional ARM64 optimizations for mobile devices, specifically targeting Snapdragon 8+ Gen1 and MediaTek Dimensity 1000+ SoCs, by integrating the BLIS math library for enhanced matrix operations and adding Vulkan backend support for Adreno and Mali GPUs, all while maintaining full backward compatibility.

  • URL: pull/17593
  • Merged: No
  • Associated Commits: e0650, 20e25, 71007, 99fad, 5ffa3, 626c1, 2e0ab, b108b, 18131, 90578, e1466, d52d5, 59270, bb1e0, 20a99, a2355, 172ed, cb67b, ec75e, c1f7b, d6299, 0f87d, 3dc00, 2d6cc, 18280, 0f651, 96c51, 0c147, 1b090, 143ca, 0d414, 49152, d0e0e, 6d9a2, 9b8e7, 068d6, b7e77, 30483, 752ca, 99154, 8bd13

3. SOLVE_TRI CUDA kernel for small matrices: This pull request introduces a CUDA kernel implementation for the SOLVE_TRI operation on small matrices, achieving approximately twice the performance of the optimized CPU version while providing a foundation for further optimization.

  • URL: pull/17457
  • Merged: 2025-11-28T04:15:33Z
  • Associated Commits: 0e6fd, 48369, 42d6d, 084d6, 002d2, e21a0, b2d87, 376d4, 4e852, baa58, c5cd3, 6b117, f19cd, 3a24c, 6bf23, 18fb1, ea4dc

Other Closed Pull Requests

  • Model metadata and sampling parameters: This pull request introduces support for embedding recommended sampling parameters directly into the GGUF model metadata, allowing model creators to specify default sampler settings that simplify configuration for end users. These defaults are automatically applied unless overridden by user CLI flags, enhancing usability.
    pull/17120
  • ARM64 matrix multiplication optimizations: This pull request improves q4_K and q8_K matrix and vector multiplication implementations for ARM64 by optimizing them to use only the dot product instruction set and introducing a new Q8_Kx4 format. It adds missing guards, generic fallbacks, and demonstrates significant performance speedups on various models and hardware including Raspberry Pi 5 without impacting model perplexity.
    pull/17494
  • Quantized matrix-vector multiplication and Vulkan backend enhancements: This pull request adds support for k-quantized matrix-vector multiplication with multiple quantization levels (q2_k to q6_k) and enables the MUL_MAT_ID integer dot vector path in the Vulkan backend. It includes various optimizations but notes ongoing tuning challenges and incomplete performance evaluation.
    pull/16900
  • Diffusion language model support: This pull request adds support for the RND1 Diffusion Language Model, including conversion to GGUF weights and implementation of non-causal diffusion algorithms with configurable steps and temperature. It also provides detailed documentation and performance benchmarks across various hardware platforms.
    pull/17433
  • AMD RDNA 4 GPU WMMA-MMQ kernel support: This pull request enables WMMA-MMQ kernels for the RDNA 4 architecture on AMD GPUs by adding appropriate WMMA instructions and updating layout mappings. Performance benchmarks are provided using specific build configurations to demonstrate improvements in matrix multiplication operations.
    pull/17156
  • Top-k selection algorithm for Vulkan backend: This pull request implements a top-k selection algorithm in the Vulkan backend, introducing efficient sorting passes that progressively reduce elements to the top K, including a fast path for K=1. It also adds related support in Metal and continuous integration tests.
    pull/17418
  • Anthropic Messages API integration: These pull requests add support for the Anthropic Messages API to llama-server by converting Anthropic's message format into an OpenAI-compatible internal format (see the simplified conversion sketch after this list). They enable endpoints for chat completions with streaming, token counting, tool use, vision support, system prompts, and multi-turn conversations, allowing llama.cpp to function as a local alternative to Anthropic's Claude API and support Anthropic-compatible clients.
    pull/17570, pull/17425
  • Server code refactoring and modularization: These pull requests refactor the server code by moving the server-context into dedicated files and splitting the monolithic server.cpp into smaller components like server-common, server-task, and server-queue. These changes improve modularity, maintainability, and prepare for centralized JSON handling while simplifying the public API and fixing build issues.
    pull/17595, pull/17362
  • ggml-hexagon backend fixes and improvements: These pull requests fix the swiglu and silu operations by correcting the hvx_fast_sigmoid_f32 implementation to eliminate NaN values and accuracy issues, and introduce a new hex_supported_buffer function to centralize buffer support logic. These changes enhance accuracy, maintainability, and reduce code duplication in the hexagon backend.
    pull/17449, pull/17212
  • OpenCL operations and precision improvements: This pull request adds new OpenCL operations such as sqr, sqrt, mean, and ssm_conv required by models like gemma3n and lfm2. It also improves precision and adds support for cl_khr_fp16.
    pull/17476
  • Download interface refactoring and bug fixes: This pull request refactors the common download interface to be directly usable by the tools/run component, eliminating duplicated code. It also fixes issues with ollama downloads by manually handling redirects to address problems with cpp-httplib.
    pull/17535
  • CUDA backend precision support for PH1 MUSA device: This pull request enables support for fp16, fast_fp16, and bf16_mma precision modes on the PH1 MUSA device within the ggml CUDA backend. It ensures successful builds and passing of all backend operation tests.
    pull/17551
  • Hexagon DSP v68 and v69 initial support: This pull request introduces initial support for Hexagon DSP versions v68 and v69, enabling model execution on platforms like Snapdragon 8cx Gen 3 and QCM6490. It notes that performance is currently very slow and includes fixes for build errors and hardware-specific constraints.
    pull/17394
  • RISC-V vector intrinsic implementation: This pull request adds a RISC-V vector intrinsic implementation for the ggml_vec_mad_f16 function using the Zvfh extension. It also deduplicates the scalar implementation and updates the related CPU vector header file.
    pull/17448
  • Web UI display settings reorganization: This pull request reorganizes the web UI settings by creating a dedicated "Display" section consolidating all visualization preferences. It adds a new disableAutoScroll option to prevent automatic scrolling during message streaming, improving user control and maintaining backward compatibility.
    pull/17452
  • Vulkan backend SOLVE_TRI function implementation: This pull request implements the SOLVE_TRI function in the Vulkan backend, including optimizations such as loading the B matrix through shared memory and using a FLOAT_TYPE for improved precision.
    pull/17486
  • MiniCPM-V resampler bug fix: This pull request fixes a critical bug in the MiniCPM-V resampler by correcting the attention scaling factor kq_scale calculation, which was previously reused incorrectly from the Vision Encoder’s hyperparameters. This resolves numerical discrepancies and improves model accuracy to align with the official Python implementation.
    pull/17516
  • JSON schema backslash escaping fix: This pull request fixes an issue in the JSON schema where the backslash character ('\') was not properly escaped in literals, ensuring correct parsing and validation.
    pull/17307
  • Unified cache graph update and test improvements: This pull request updates the worst-case graph for the unified cache, addressing issues referenced in #17276. It also improves related test configurations by disabling operation offload in some tests.
    pull/17379
  • Metrics endpoint fix: This pull request fixes the /metrics endpoint by preventing it from returning Prometheus-format text that is incorrectly JSON-escaped and wrapped in double quotes. It adds an ok() method overload accepting a std::string to resolve parsing issues.
    pull/17386
  • s390x architecture convert_hf_to_gguf.py fix: This pull request fixes the convert_hf_to_gguf.py script on s390x by correctly handling endianness through strategic byteswapping of model data before and after transformations. It addresses issues with inplace byteswap calls on lazy tensor wrappers and updates GGUFWriter to accept tensors in native endianness to eliminate unnecessary byteswaps.
    pull/17431
  • RDNA4 GPU mul_mat_f operation and performance optimization: This pull request enables the mul_mat_f operation for RDNA4 GPUs and optimizes performance by moving workloads with n >= 3 from the mmvf backend to the mmf backend. It uses an unusual unreached code branch to prompt the ROCm compiler to generate more efficient code, with performance improvements validated on an RX 9070 XT GPU.
    pull/17437
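
To illustrate the Anthropic Messages API item above (pull/17570, pull/17425), the sketch below shows a simplified translation of an Anthropic-style request into an OpenAI-style chat completion request for plain text messages, using nlohmann::json (the JSON library llama.cpp already uses). It ignores tool use, images, and streaming, and it is not the actual server code; all function names are hypothetical.

```cpp
// Simplified sketch (not the llama-server implementation) of translating an
// Anthropic Messages request into an OpenAI-style chat completion request.
// Only plain text content is handled; tool use, images and streaming are ignored.
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

using json = nlohmann::json;

// Anthropic text content may be a plain string or an array of {"type":"text","text":...} blocks.
static std::string flatten_text(const json & content) {
    if (content.is_string()) {
        return content.get<std::string>();
    }
    std::string out;
    for (const auto & block : content) {
        if (block.value("type", "") == "text") {
            out += block.value("text", "");
        }
    }
    return out;
}

static json anthropic_to_openai(const json & req) {
    json messages = json::array();
    if (req.contains("system")) {
        // Anthropic keeps the system prompt outside the messages array.
        messages.push_back({{"role", "system"}, {"content", flatten_text(req["system"])}});
    }
    for (const auto & msg : req["messages"]) {
        messages.push_back({{"role", msg["role"]}, {"content", flatten_text(msg["content"])}});
    }
    json out = {{"model", req.value("model", "")}, {"messages", messages}};
    if (req.contains("max_tokens")) {
        out["max_tokens"] = req["max_tokens"];   // OpenAI chat completions use the same field name
    }
    return out;
}

int main() {
    const json anthropic_req = json::parse(R"({
        "model": "local-model",
        "max_tokens": 256,
        "system": "You are a helpful assistant.",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "Hello!"}]}
        ]
    })");
    std::cout << anthropic_to_openai(anthropic_req).dump(2) << std::endl;
    return 0;
}
```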

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor   Commits   Pull Requests   Issues   Comments
hanishkvc     637       6               1        3
ngxson        150       10              3        65
pwilkin       56        5               5        94
ggerganov     74        12              2        71
CISC          21        2               0        116
jeffbolznv    44        13              0        62
0cc4m         29        3               0        56
aldehir       59        1               0        26
chraac        68        6               1        10
am17an        10        3               1        67
