Weekly GitHub Report for Llama.cpp: April 06, 2026 - April 13, 2026 (19:28:38)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and stability, with notable improvements in user interface responsiveness and security features. These changes reflect a continued focus on optimizing user experience and safeguarding data integrity.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
Eval bug: Gemma 4 generates <unused> tokens in infinite loop on Vulkan backend: This issue reports that Gemma 4 models generate an infinite stream of <unused> tokens when run on the Vulkan backend in llama.cpp, resulting in no valid text output despite applying all known fixes and patches. The problem occurs in both GPU offloading and CPU-only modes, suggesting a Vulkan-specific numerical precision or compute shader issue that does not appear in other forks such as Ollama's.
- Multiple users confirm the infinite <unused> token generation on various hardware and OS setups using Vulkan and CUDA backends, with some noting that switching models or backends (e.g., Ollama or CPU-only) can produce correct outputs; attempts to reproduce the issue on certain hardware fail, indicating possible hardware- or driver-specific factors, and discussion points to potential bugs in Vulkan fusion or compute shader code paths.
- Number of comments this week: 20
-
[BUG-UNCONFIRMED] Eval bug: Checkpoints on Gemma 4 consume abnormal amounts of RAM, leading to llama-server going OOM: This issue reports that when using Gemma 4 models on llama-server, the context checkpointing feature causes an abnormal and rapidly increasing RAM consumption, eventually leading to out-of-memory (OOM) crashes even on systems with substantial RAM and VRAM. The problem is specific to Gemma 4 and does not occur with other model architectures; a known workaround is to reduce or disable context checkpoints, which stabilizes memory usage but may impact functionality.
- Commenters confirmed the memory leak occurs on RTX 3090 GPUs with CUDA and Vulkan backends, and that reducing the number of checkpoints or disabling them prevents the RAM from ballooning. Multiple users shared logs and test results showing consistent OOM crashes tied to checkpoint memory usage, and it was noted that this behavior has persisted since the initial Gemma 4 implementation; some suspect it may be an inherent characteristic of Gemma 4 rather than a simple bug.
- Number of comments this week: 16
-
[BUG-UNCONFIRMED] Eval bug: SYCL: Qwen3.5 spitting garbage on the second prompt: This issue reports a bug in the SYCL backend of the llama.cpp project where using the Qwen3.5 model with Q8_0 quantization causes the second prompt to produce garbage output due to a missing reorder-aware dequantization path for Q8_0 tensors after weights are reordered during token generation. The problem was traced to a recent PR that added reorder support for some paths but omitted the GEMM dequantization path, and a fix was developed that adds the necessary reorder-aware dequantizers, restoring correct functionality and preserving performance.
- The comments detail the identification of the root cause related to missing reorder-aware dequantization for Q8_0, the submission and testing of a fix that restores correct output and maintains speedups, user confirmations of the fix working on various hardware, and troubleshooting advice including clean rebuilds and environment consistency to resolve performance discrepancies.
- Number of comments this week: 14
-
[BUG-UNCONFIRMED] Eval bug: Very slow inference of Q1_0 Bonsai model: This issue reports a significant slowdown in inference speed for the Bonsai-8B model using Q1_0 quantization on the latest versions of llama.cpp across CPU, Vulkan, and NVIDIA hardware, with speeds drastically lower than the original fork. The user highlights that no hardware acceleration appears to be utilized yet, and the problem persists despite testing on various hardware configurations.
- The comments clarify that a CUDA backend is not yet implemented but is planned, with some users sharing alternative repositories showing better performance; suggestions for CPU optimizations and pull requests aimed at improving speed are also discussed, alongside acknowledgments of ongoing development efforts.
- Number of comments this week: 10
-
[ENHANCEMENT] Feature Request: WebUI response streaming is fragile: This issue requests improving the robustness of the WebUI response streaming to handle connection interruptions gracefully, allowing users to resume interrupted responses without losing progress. It proposes two main approaches: fixing the assistant prefill API to support reasoning models and implementing server-side SSE resumption with session IDs to maintain and resume streaming state.
- The comments discuss the current limitation where the continue button is disabled for reasoning models due to backend restrictions, with contributors confirming the issue and sharing progress on patches to enable resuming within reasoning blocks; ongoing work focuses on backend fixes and prioritizing related merges before finalizing improvements.
- Number of comments this week: 10
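The server-side SSE resumption proposed in the WebUI streaming issue above could work roughly as follows — a minimal sketch, assuming a hypothetical per-session event buffer and the standard SSE `id:`/`Last-Event-ID` mechanism; the class and method names here are illustrative and are not llama.cpp's actual API.

```python
# Hypothetical sketch of resumable SSE streaming state (not llama.cpp code).
# Each streaming session buffers the chunks it has emitted so a reconnecting
# client can replay everything newer than its Last-Event-ID header value.

class SSESession:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.events: list[tuple[int, str]] = []  # (event id, data chunk)
        self.next_id = 0

    def emit(self, data: str) -> str:
        """Record a chunk and format it as an SSE frame with an explicit id."""
        event_id = self.next_id
        self.next_id += 1
        self.events.append((event_id, data))
        return f"id: {event_id}\ndata: {data}\n\n"

    def replay_after(self, last_event_id: int) -> list[str]:
        """Re-send every buffered chunk newer than the client's Last-Event-ID."""
        return [f"id: {i}\ndata: {d}\n\n" for i, d in self.events if i > last_event_id]


# Usage: stream three chunks, drop the connection, then resume after event 0.
session = SSESession("abc123")
for chunk in ["Hel", "lo ", "world"]:
    session.emit(chunk)
resumed = session.replay_after(0)  # carries the chunks with ids 1 and 2
```

A real implementation would also need to expire sessions and cap the buffer, which is exactly the kind of robustness the feature request is asking for.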
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 89
Summarized Issues:
- Model output and inference quality issues: Several models including llama-server with certain prompts, Gemma 4 IT GGUF, Qwen3.5 Q8_0 quantization, and Bonsai-8B exhibit problems such as aborted or mangled output, incoherent tokens, gibberish second responses, and severe performance degradation. These issues affect both CPU and GPU backends and often relate to quantization, backend-specific bugs, or recent regressions impacting inference correctness and speed.
- issues/21523, issues/21538, issues/21574, issues/21578, issues/21589, issues/21715, issues/21721, issues/21734, issues/21834
- Backend-specific crashes and errors: Multiple crashes and errors occur across CUDA, Vulkan, SYCL, HIP, and Hexagon backends, including device lost errors, kernel failures, assertion errors, and timeouts. These backend issues cause failures in model loading, inference, or server operation, often triggered by specific hardware, driver versions, or recent code changes.
- issues/21524, issues/21547, issues/21585, issues/21603, issues/21604, issues/21608, issues/21648, issues/21724, issues/21747, issues/21762, issues/21773, issues/21811, issues/21837, issues/21842
- Memory leaks and excessive resource usage: Issues include VRAM and RAM leaks during model usage, excessive VRAM consumption due to multiple processes per GPU, and out-of-memory crashes caused by context checkpoints or large image processing. These problems degrade system stability and performance, often requiring workarounds or configuration changes to mitigate.
- issues/21550, issues/21552, issues/21690, issues/21692, issues/21742, issues/21759, issues/21780, issues/21784
- Model loading and architecture compatibility problems: Several models such as Gemma4 and Phi-4 Mini Reasoning fail to load or initialize correctly due to unknown architectures, assertion errors, or unsupported features. These issues prevent usage of certain models on specific platforms or backends, limiting functionality.
- issues/21688, issues/21730, issues/21790
- Quantization and KV cache issues: Problems with quantization include missing reorder-aware dequantization variants, asymmetric quantization research, and incompatibility between tensor split modes and KV cache quantization. These affect model accuracy, performance, and backend compatibility, requiring patches or feature requests to resolve.
- issues/21589, issues/21591, issues/21679, issues/21788
- Server and router operational bugs: The llama-server experiences issues such as unloading active models during max load, failing to reuse KV cache causing context loss, remote download failures behind proxies, and router mode launching excessive processes per GPU. These bugs disrupt server stability and user experience during multi-model or multi-user operations.
- issues/21678, issues/21831, issues/21694, issues/21692
- Web UI and usability problems: The Web UI on iOS Safari triggers unwanted zooming, lazy loading with transitions causes response disappearance, and inline LaTeX rendering fails with literal dollar signs. Additionally, multi-user support and response streaming robustness are requested to improve user interaction and session management.
- issues/21541, issues/21754, issues/21758, issues/21795, issues/21649
- Model evaluation and performance regressions: Significant slowdowns in token processing speed and evaluation performance have been reported for Bonsai-8B, Qwen3 models, and others after specific commits, affecting CPU, Vulkan, and CUDA backends. These regressions reduce throughput and responsiveness, impacting practical usability.
- issues/21574, issues/21834
- Build and compilation failures: Compilation errors occur on Windows with CUDA and AMD ROCm SDK, protobuf version conflicts in nix builds, and SYCL backend initialization failures on Intel GPUs. These prevent successful builds or runtime initialization, blocking development and deployment on affected platforms.
- issues/21524, issues/21536, issues/21598, issues/21747
- Model-specific bugs and feature limitations: Issues include embedding models incorrectly shown as chat defaults, tool/function calling response format errors, missing speculative decoding support, and bugs in prompt caching or grammar samplers. These affect model capabilities, output correctness, and feature completeness.
- issues/21545, issues/21596, issues/21840, issues/21571, issues/21780
- Tensor parallelism and multi-GPU coordination issues: Failures in tensor parallelism with certain flags or models cause assertion errors, NCCL communication hangs without proper device visibility settings, and multi-GPU CUDA errors lead to timeouts or corrupted outputs. These problems hinder scaling and parallel inference capabilities.
- issues/21686, issues/21719, issues/21765, issues/21773
- Feature requests for backend and tooling improvements: Requests include adding SYCL backend support for Q1 quantization, Vulkan backend optimizations, TileLang backend acceleration, NVFP4 tensor mapping for GEMMA4, ggml backend for AMD XDNA NPUs, and model management API endpoints. These aim to enhance performance, hardware support, and usability.
- issues/21590, issues/21641, issues/21712, issues/21725, issues/21777, issues/21779
- Tokenizer and parsing bugs: Bugs include Cyrillic text tokenization errors, JSON parser failures with tool parameters, and jinja template issues causing invalid JSON schemas or tokenization failures. These parsing problems lead to corrupted outputs, server errors, and broken chat interactions.
- issues/21675, issues/21771, issues/21600, issues/21634
- Model-specific context and attention bugs: Gemma-4 model exhibits sliding window attention checkpoint restoration bugs causing context loss mid-conversation, and bounding box coordinate inversion issues affect visual output correctness. These bugs degrade model response coherence and output accuracy.
- issues/21769, issues/21631
- Audio and multimodal processing issues: The Qwen3-ASR model fails on longer audio files, and .wav files uploaded via Web UI are skipped by the server, limiting audio processing capabilities.
- issues/21847, issues/21850
- Miscellaneous bugs and slowdowns: Slow command line tool startup due to Metal GPU backend and network requests, Huggingface API rate limit causing server quit, and Docker CUDA image GPU utilization inefficiencies have been reported, impacting user experience and resource usage.
- issues/21677, issues/21708, issues/21740
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 45
Summarized Issues:
- Grammar and Parsing Failures: Multiple issues report grammar compilation failures, silent parser errors, and server crashes related to exceeding repetition thresholds or improper handling of JSON schemas and tool call arguments. These problems cause unconstrained or malformed outputs, 500 errors, and broken tool-calling grammars in various scenarios including streaming, Windows PowerShell, and chat interfaces.
- issues/20867, issues/20879, issues/21013, issues/21228, issues/21642, issues/21680
- Model Output and Token Generation Bugs: Several issues describe models generating infinite or premature stop tokens, token generation loops, or garbled output due to hardcoded tokens, backend-specific bugs, or recent regressions. These cause truncated HTML, infinite token streams, or incoherent text generation across models like Gemma 4 and Qwen 3.5.
- issues/21321, issues/21471, issues/21622, issues/21602, issues/21726, issues/21799
- Backend and Hardware Compatibility Issues: Reports include crashes, endless loops, and performance regressions on specific hardware and backends such as CUDA, ROCm, Metal, and SYCL. Problems range from kernel inefficiencies and illegal memory accesses to silent crashes on Intel Arc iGPUs and AMD hardware, affecting models like Gemma 4 and Qwen.
- issues/21416, issues/21420, issues/21474, issues/21494, issues/21517, issues/21564, issues/21682
- Model and Feature Support Requests and UI Issues: Users report missing or incorrect UI indications of model capabilities such as audio support, issues with thought expansion in the web UI, and requests for new model support like the 1-bit Bonsai 8B. Router mode UI bugs cause model selection inconsistencies during chat sessions.
- issues/21298, issues/21322, issues/21325, issues/21624, issues/21626
- Command Line and API Behavior Bugs: Problems include ignored command line options such as --grammar-file and --reasoning-budget, improper newline handling in CLI inputs, and prompt-cache corruption causing silent output errors. These bugs affect server startup, input processing, and multi-turn conversation consistency.
- issues/21262, issues/21464, issues/21487, issues/21681
- Model Quantization and Tensor Naming Issues: Bugs in tensor naming during model conversion cause loading failures due to exceeding character limits, and tensor formatting regressions cause excessive console spam during quantization. These issues impact model loading and debugging workflows.
- issues/21115, issues/21776
- Performance Regressions and Efficiency Problems: Significant slowdowns and throughput regressions are reported on Apple M3 and M4 hardware, as well as inefficient kernel paths on Intel GPUs, leading to reduced token generation speeds and increased memory bandwidth usage.
- issues/21494, issues/21655, issues/21517
- Crash and Stability Issues: Crashes occur due to CUDA kernel optimizations, audio input processing errors, and GPU tensor parallelism, causing illegal memory accesses, assertion failures, and infinite loops. These affect stability on NVIDIA RTX 5090 GPUs and during audio transcription tasks.
- issues/21564, issues/21816, issues/21703
- Audio Input and Transcription Problems: The Gemma4 model exhibits crashes or skips when processing audio files, and transcription accuracy issues persist despite recommended prompts and updated models, impacting usability for audio-based tasks.
- issues/21820, issues/21825
- Miscellaneous Bugs and Improvements: Issues include emoji copying causing JSON parsing failures, continuous integration labeler failures, spelling corrections, and enhancement requests for router mode preset reloading without server restarts.
- issues/21660, issues/21666, issues/21810, issues/21823
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 82
Key Open Pull Requests
1. Bug-Fix sets an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks: This pull request fixes a CUDA memory leak in the llama.cpp project by implementing an upper VRAM limit and an LRU cache for cached CUDA graphs in the cuda-backend, thereby preventing unbounded growth of CUDA graphs for Gemma3 models and improving overall memory management.
- URL: pull/21673
- Associated Commits: 7c932, 7261a, 22750, ec5f5, 526c9, ebae1, e69d6, ed51f, 72802, f4a5e, b1722, 837c0, 0f965, 0b467, 9d3c8, ecb33, 5c624, e867a, a7304, e5fc4, 29ef1, 9d5b4, 636bc, aabe6, 9c84f
2. Add support for Reka Edge 2603: This pull request adds comprehensive support for the Reka Edge 2603 model to the project, including integration of its Llama-style language model and ConvNeXt V2-based vision encoder, enhancements to the Jinja templating engine for compatibility with the Reka Edge chat format, scripts for model conversion and quantization, and extensive testing to ensure functionality across multiple benchmarks.
- URL: pull/21616
- Associated Commits: c1290, 3731b, e8b49, 34a24, 4867d, 5c8a4, 557d5, 93dbc, 777d0, 63c2d, be97a, 51226, 8f5ab, 69a53, 5ebbc, 92a52, 30395, f1c24
3. optionally enable ccache for use in Dockerfiles: This pull request optionally enables the use of ccache in various Dockerfiles by introducing a build-time argument CCACHE_ENABLED that, when set to true, installs ccache and adds a cache-mount to the build step to speed up recompilation, while preserving existing behavior in the s390x Dockerfile where ccache is always used.
- URL: pull/21563
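The first key pull request above bounds cached CUDA graphs with an upper VRAM limit plus LRU eviction. The core idea can be sketched as a byte-budgeted LRU cache — illustrative Python only, with hypothetical names; the actual fix lives in the C++ CUDA backend, not in any API shown here.

```python
from collections import OrderedDict

# Sketch of a byte-budgeted LRU cache: entries are evicted least-recently-used
# first whenever the total cached size exceeds a fixed budget, so the cache
# can never grow without bound. (Hypothetical illustration, not the PR's code.)

class BoundedGraphCache:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.cache: OrderedDict[str, int] = OrderedDict()  # key -> size in bytes

    def get(self, key: str) -> bool:
        """Mark a cached graph as most recently used; return whether it was cached."""
        if key not in self.cache:
            return False
        self.cache.move_to_end(key)
        return True

    def put(self, key: str, size_bytes: int) -> None:
        """Insert a graph, evicting LRU entries until back under the budget."""
        if key in self.cache:
            self.used_bytes -= self.cache.pop(key)
        self.cache[key] = size_bytes
        self.used_bytes += size_bytes
        # Keep at least the newest entry even if it alone exceeds the budget.
        while self.used_bytes > self.max_bytes and len(self.cache) > 1:
            _, evicted_size = self.cache.popitem(last=False)
            self.used_bytes -= evicted_size


cache = BoundedGraphCache(max_bytes=100)
cache.put("graph-a", 60)
cache.put("graph-b", 60)  # pushes the total to 120, so graph-a is evicted
```

The same pattern (budget plus recency ordering) is what turns an unbounded per-shape graph cache into one with a hard VRAM ceiling.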
Other Open Pull Requests
- CUDA backend and performance optimizations: Multiple pull requests enhance CUDA backend capabilities and performance, including adding a "reused" member variable to enable graph reuse and bypass costly checks, and implementing the initial CUDA backend for Q1_0 quantization with benchmarks and AMD hardware support. Additionally, tile selection logic for CUDA mul_mat_q is optimized for AMD CDNA2 GPUs, improving throughput and reducing resource waste on large models.
- Quantization fixes and optimizations: Several pull requests address quantization-related issues and improvements, such as fixing overflow in q8_1 dequantized activations to prevent NaN results during perplexity calculations, resolving SYCL Q8_0 reorder optimization bugs with a reorder-aware dequantizer and memory fallback, and implementing an optimized x86 SIMD q1_0 dot product for Bonsai LLM models with significant speedups.
- Backend and hardware support enhancements: Updates include adding asynchronous execution and auto-tuning for Hexagon HMX backend matmul operations, adding Metal GPU acceleration for depthwise 2D convolution in MobileNet/EfficientNet, and fixing OpenCL allocation issues in the OpenVINO backend by bounding tensor dimensions. These changes improve performance and compatibility across diverse hardware platforms.
- Memory management and system resource optimizations: Pull requests improve memory usage and system stability by replacing SYCL memory operations with Level Zero API calls to reduce host RAM usage on Intel multi-GPU setups, adding an mmap-backed persistent KV cache for CPU inference to speed up context resumption, and introducing a Linux --hugepages flag to load model weights backed by HugeTLB pages, reducing kernel memory overhead.
- Model input and parsing improvements: Enhancements include adding support for token_type_ids in llama_batch to enable BERT-based Cross-Encoder and Reranker models, fixing parsing edge cases in the Gemma 4 26B A4B model related to prompt formatting and channel tokens, and addressing a conversion failure in the NVIDIA-Nemotron-3-Nano-4B-BF16 model by overriding config loading to handle recent transformers versions.
- Vulkan and WebGPU improvements: Updates to Vulkan include programmatically adding RoundingModeRTE to shaders, optimizing im2col operation to prevent driver timeouts, and fixing WebGPU matrix multiplication to use f32 accumulation for precision and NaN fixes. These changes also address Chrome-specific bugs and improve iOS performance by increasing shader batching.
- Server, CLI, and API usability enhancements: New features include adding an --endpoint option to llama-cli for connecting to remote servers, fixing prefix caching in the llama.cpp server for the Anthropic API to prevent cache misses, and improving WebUI thinking mode request handling by decoupling sampling presets and disabling reasoning parsing when thinking is off.
- Verification and auditing tools: A new verifiable inference example demonstrates a commit-and-open workflow using SHA-256 and Merkle trees for lightweight, auditable verification of model inference traces without invasive core changes, including build integration, server support, and documentation.
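The commit-and-open workflow in the last bullet hinges on hashing trace records into a single SHA-256 Merkle root, so only the root needs to be published as the commitment. A minimal sketch of that construction — the record contents and any duplication rule for odd levels are assumptions for illustration, not the example's actual format:

```python
import hashlib

# Sketch of the commit step: hash each inference-trace record, then fold the
# hashes pairwise into a single SHA-256 Merkle root. Changing any record (or
# its order) changes the root, which is what makes the trace auditable.

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until a single root remains."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Committing to a three-record trace yields one 32-byte root.
trace = [b"prompt tokens", b"logits step 0", b"logits step 1"]
root = merkle_root(trace)
assert root != merkle_root([b"tampered", b"logits step 0", b"logits step 1"])
```

The "open" step would then reveal individual records plus sibling hashes so a verifier can recompute the path to the committed root without seeing the whole trace.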
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 134
Key Closed Pull Requests
1. SYCL: Unified cache alignment — zero-alloc inference + planner-driven arena: This pull request presents a comprehensive review and implementation of a unified cache alignment strategy for the SYCL backend in the llama.cpp project. It introduces zero-allocation inference and a planner-driven arena system that pre-allocates and manages VRAM and host memory buffers for weights, MoE experts, and compute scratch space, replacing legacy allocation and layout management with a streamlined, event-driven, layout-aware unified cache API. The goals are improved memory efficiency, fewer runtime allocations, prevention of GPU hangs and device-lost errors, and scalable multi-GPU and large-context model support.
- URL: pull/21737
- Associated Commits: 774dc, 59ae8, ab34f, 19b7d, bf3fb, 117b4, fef82, 0b20b, 8a53c, 05a26, 2a67d, 08054, 27d7c, dbc92, d9d03, 59e1a, 8db73, 7a736, d36ad, a4005, f267c, f2211, ef672, b3b37, 480a4, 72f63, 49b81, 0c155, 655f2, b37a3, 8b619, 2f860, 548d0, 57f81, 366c3, 18546, 96052, 1c0ea, db261, 9faf2, 40f98, f885e, 0db84, 598f1, 3e9d7, 47799, 61f4f, 3192c, ea812, 0db40, 23024, 449a2, 28197, 84970, c80c8, eb65a, 04dd6, c465a, 0734d, 4cc37, 01bce, f3caf, 46a80, 2b013, c3d06, 356a9, eef1b, 2eca6, b7940, 63e9d, f2533, 10e9f, c53f6, 6451d, 4796c, 1e45f, 020cc, 73471, e0e3b, bd4f7, be5b3, 10819, 1e597, b0c12, 382f9, ff0db, 1004f, 7ae21, 8f50d, b4f09, 2b40c, 5d423, 9685a, 016b2, afad3, 9b0fe, 58bb0, 32973, 10993, badef, 2083c, 68649, f33a1, 8a33b, 5abc7, 7bc6a, 4db7f, ba3cf, 2a993, 0f8dc, c46dc, ccb3e, 077fe, 93458, ab406, 1917a, 74027, 108cf, a22a8, 35daa, 01d19, 77e90, f18d0, 6ff48, bb34a, f2801, 9efef, 923a1, 78207, cbbb2, e83e8, a00ed, 975bc, 41f4b, 909fb, bd38c, 274f4, e995c, 501b5, 7ac67, 10448, e86d1, 37ec0, 04bef, 00ebe, 99b8e, 74c7d, 8b871, 48fcc, 15291, 28d04, 43b0f, 282ca, 8b27c, d5478, f6d49, 1f9d6, d91b9, e6915, 2598b, 3fba5, c9fcf, e28e6, 5820b, 12263, e45bc, 5dd5c, 0a16d, 022a0, 6f018, 9adc8, 1b8d4, 90761, 9273d, 763cc, d1ad9, a167e, 0320e, dbf80, 0d01f, 5292a, 5b14f, 5538a, 7b35b, dc361, 0a370, 65b22, d2381, 2d382, ea6c9, 56ede, 3f271, 90062, 95775, 273f6, 040d9, e00e1, 506d2, 131f5, f02da, 30784, e7f27, b8afd, 6ae8b, 93e89, 2b170, afa4c, f5d80, 06ec3, 90b23, 755e9, e3fa3, 3e01d, be62b, edc87, 68711, 9806e, ccf47, e81dc, 8509e, e17ca, 5eb59, aa2e1, a3f2e, f5d63, ada42, 31665, 71734, 579e7, 15e8c, ea7ee, f9cb6, 5c11b, 96f42, de6d6, 7806f, 29c34, d4f65, d2d95, 2ae3e, 4ddc0, 5c1c3, 4d788, 8b006, 9c5a0, b1aa4, 21b4e, 0ccba, 0b1b7, e5f5a
2. feat: Add Turbo/Iso/PlanarQuant with Windows MSVC compatibility: This pull request adds TurboQuant, IsoQuant, and PlanarQuant quantization methods with Windows MSVC compatibility to the project, integrating features from the planarquant-kv-cache branch, fixing Windows-specific issues such as M_PI definition and linker symbols, and optimizing code for stable Windows builds.
- URL: pull/21722
- Associated Commits: 9b041, 9f377, 70a31, dcd15, 793f1, d4ee5, 8997c, 28344, aede1, 1f1f8, b73d6, b8410, 0c9ba, 45657, 29073, 95499, d27a4, 16596, f9841, 18f42, a1230, aed7a, 29786, c3a7a, 93c3e, 753b8, 3d9bc, cf627, d9b97, ded3e, 810b4, b097f, ea35e, 01fc3, c76b7, 316f8, ccd12, 63c8d, 48e46, 99489, 8bf23, 2157f, bc8ae, 7dd0a, abc6e, 05412, a9ef4, 9087f, 99da3, 02268, 929b8, 5811a, b2a5a, eb9a5, 8b36e, 9f233, 4b091, 65ed3, 830d7, d602c, 80430, 14066, 3dfd5, 78fac, edfff, f29b8, 199d6, 39ba0, bc413, ac9e3, f0c4c, 6063c, 88882, d01a0, fd953, e9d06, 61a03, 7673d, 00a54, 7d1bd, 7b885, 4c914, 065ef, a5258, 0a607, 972c7, 9cdb8, f89c4, f284c, 75e27, eddff, fef28, d0d37, c1680, 66179, 53f12, d2ca3, 3ef4d, ca252, da6b0, 00ecb, 6fb85, 4cf71, a5efe, 4c451, 172fc, 3380d, 43f7d, c1d9b, d46ac, 05b7f, b90b5, 965a6, ae702, 2dd60, 70b35, 89d26, 1b716, 58d51, 64dd3, adac2, aca45, 7b750, b8ecc, abfb7, 406bf, e11d7, b345d, c301c, 2c9d2, 3dee4, bcfd8, 26c90, 25f89, 0971e, 1ed04, b69ae, a75b1, 9d4ec, e7bde, 79da6, b719b, 985fd, b83a0, a7306, 6e5a4, 326f7, 20efe, 86d11, 05355, 700bf, fc60e, 01a97
3. hexagon: improved Op queuing, buffer and cache management: This pull request significantly improves the Hexagon backend: batched operation queuing is dispatched via a single dspqueue request, buffer and cache management is optimized through memory-mapped buffers and fine-grained cache flushing, and larger models are supported on a single device. It also enhances compatibility with QNN-HTP for parallel workloads, simplifies VTCM reservation to reduce latency, and includes various fixes and updates across multiple source files to improve efficiency and maintainability.
- URL: pull/21705
- Associated Commits: 93b7a, bd14a, 698e8, 99ad6, 286a3, 7e8a4, baf7d, ae09b, 0d3dc, 18a5d, bd50e, 9ff82, 33331, 98e8a, e896c, 6df16, cffca, 33e09, 76071, c0b94, 8384c, f820e, 7c0cb, 14d97, 6ab7e, a5beb, 890ac, 11b03, b1436, e22aa, 463be, 9a2c9, 2a7dc, c79d8, b76c9, 47718, 508a6, 3d72c, 23e25, 3a06e, 3c04b, eb1b1, fe936, 87b2f, 23c86, 63246, 3a2f0, 3980c, e5b5d, 954cf, 835a3, 38e3d, 5cffc, 1524d, caa9c, aa0ef, 14da2, 3c666, 334ca, 6ff98
Other Closed Pull Requests
- AMD MI50 (gfx906) GPU support and TurboQuant improvements: Multiple pull requests focus on integrating and optimizing support for the AMD MI50 (gfx906) GPU architecture, including implementing integer dot-product attention kernels and improving GPU loading speed with asynchronous pipelines. These also add TurboQuant backend support, fix critical bugs in TQ3_0 quantization and Walsh-Hadamard Transform implementations, and update documentation and branch tracking for gfx906-specific changes.
- ggml WebGPU backend stability and performance enhancements: Several pull requests improve the ggml WebGPU backend by fixing quantization precision issues such as NaN propagation from packed 4-bit integers, enhancing backend lifecycle management with persistent WebGPU instances, and parameterizing submission sizes and inflight command buffer limits. These changes increase stability and performance, especially for F16 workloads on GPUs like the RTX 3070/3080 and improve stability on iOS devices by throttling operations.
- Q1_0 1-bit quantization format support and Metal backend acceleration: Pull requests add support for the Q1_0 quantization format with group size 128, enabling efficient CPU inference for Bonsai 1-bit models with ARM NEON optimizations and generic fallbacks. Additionally, a Metal backend implementation for Q1_0 is introduced to accelerate Bonsai models on Apple devices, providing performance improvements and validating accuracy through KL divergence tests.
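As a rough illustration of how a 1-bit format with group size 128 can work, the sketch below stores one scale per 128-weight group plus packed sign bits. The function names and block layout here are assumptions for illustration; the actual Q1_0 format in the pull request may differ.

```python
def quantize_1bit(weights, group_size=128):
    """Sketch of 1-bit quantization: each group of 128 weights becomes
    one scale (mean |w|) plus 128 sign bits packed into 16 bytes.
    Illustrative only; the real Q1_0 block layout may differ."""
    assert len(weights) % group_size == 0
    groups = []
    for g in range(0, len(weights), group_size):
        block = weights[g:g + group_size]
        scale = sum(abs(w) for w in block) / group_size
        packed = bytearray(group_size // 8)
        for i, w in enumerate(block):
            if w >= 0:
                packed[i // 8] |= 1 << (i % 8)  # one sign bit per weight
        groups.append((scale, bytes(packed)))
    return groups

def dequantize_1bit(groups, group_size=128):
    """Reconstruct each weight as +scale or -scale from its sign bit."""
    out = []
    for scale, packed in groups:
        for i in range(group_size):
            bit = (packed[i // 8] >> (i % 8)) & 1
            out.append(scale if bit else -scale)
    return out
```

With this layout, a group of 128 fp32 weights (512 bytes) compresses to 16 bytes of sign bits plus one scale, which is the kind of footprint that makes CPU inference of 1-bit models practical.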
- Metal GPU support for TurboQuant KV cache types on Apple Silicon: A pull request introduces Metal GPU compute kernels for TBQ3_0/TBQ4_0 dequantization and a new Metal host buffer type enabling shared tensor access between GPU and CPU backends. This significantly improves memory compression, maintains output quality, and optimizes operation scheduling on Apple Silicon devices.
- Mixture of Experts (MoE) expert cache and offloading features: One pull request introduces an N-slot Least-Frequently/Recently-Used (LFRU) expert cache with Future Anticipated Token Expert (FATE) prefetching for MoE weight offloading, including cache struct and API additions, allocator fixes, eviction policies, and performance optimizations. However, this feature was ultimately withdrawn and not merged.
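The eviction idea behind a least-frequently/recently-used expert cache can be sketched as follows: evict the slot with the lowest hit count, breaking ties toward the least recently used entry. The class and method names are hypothetical and do not reflect the withdrawn PR's actual API.

```python
from collections import OrderedDict

class LFRUCache:
    """Minimal LFRU sketch: evict the entry with the lowest hit count,
    breaking ties by least-recent use (hypothetical names; not the
    PR's actual cache struct or API)."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = OrderedDict()  # expert_id -> hit count; order = recency

    def get(self, expert_id):
        if expert_id in self.slots:
            self.slots[expert_id] += 1
            self.slots.move_to_end(expert_id)  # mark most recently used
            return True
        return False

    def put(self, expert_id):
        if self.get(expert_id):
            return
        if len(self.slots) >= self.n_slots:
            # min() returns the first key with the lowest count in
            # iteration order, which is oldest-first, so frequency ties
            # are broken against the least recently used entry.
            victim = min(self.slots, key=lambda k: self.slots[k])
            del self.slots[victim]
        self.slots[expert_id] = 1
```

A real MoE offloading cache would hold expert weight buffers rather than IDs and would need the prefetch hook (the FATE component) to populate slots ahead of routing decisions.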
- Hexagon backend support for Snapdragon Linux systems: A pull request adds support for running the hexagon backend on Linux systems with Snapdragon processors, updating installation scripts, documentation, onboarding steps, and build configurations to enable vectorization and Debian compatibility on the ex2 platform.
- Web UI usability and text rendering improvements: Multiple pull requests enhance the web UI by adding an opt-in setting to generate conversation titles from only the first non-empty prompt line to prevent overflow, fixing Arabic right-to-left text rendering with `dir="auto"` attributes and CSS logical properties, and introducing a user-configurable toggle for Enter key behavior to improve mobile usability.
- Parser and grammar reliability improvements: A pull request simplifies the autoparser tagged parser rules by moving the string argument parser to a dedicated rule, removing upper limits on optional arguments to prevent repetition rule explosion, fixing uninitialized required parameters, and addressing parsing errors related to grammar limitations and trailing whitespace. These changes improve the reliability and maintainability of the project's grammar.
- Tokenizer and regex stability fixes: One pull request adds a custom Qwen2 regex handler to the unicode tokenizer to prevent stack overflow segfaults caused by std::regex's recursive backtracking on long repeated character sequences. This fixes an issue where Qwen2's digit pattern caused fallback to the vulnerable std::regex path.
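The fix pattern here, replacing a backtracking regex with a linear scan for digit-run splitting, can be sketched as below. This is an illustrative stand-in, not the actual handler: the real pregex operates on Unicode number categories, and the chunking details may differ.

```python
def split_digit_runs(text, max_group=3):
    """Split runs of digits into groups of at most `max_group`,
    left to right, leaving non-digit spans untouched. A linear scan
    like this cannot blow the stack, unlike a recursive backtracking
    regex engine fed a very long repeated-digit sequence."""
    out, i, n = [], 0, len(text)
    while i < n:
        j = i
        if text[i].isdigit():
            while j < n and text[j].isdigit():
                j += 1
            # emit the digit run in fixed-size chunks
            for k in range(i, j, max_group):
                out.append(text[k:min(k + max_group, j)])
        else:
            while j < n and not text[j].isdigit():
                j += 1
            out.append(text[i:j])
        i = j
    return out
```

The key property is that runtime and memory stay linear in the input length, so a pathological input of thousands of repeated digits is handled without recursion depth issues.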
- Gemma model optimizations and tokenizer tests: Pull requests optimize Gemma models by moving per-layer projection computations to reduce graph splits and improve multi-GPU efficiency, and add comprehensive tests for the Gemma 4 tokenizer including a UTF-8 handling edge case fix.
- CUDA graph properties performance improvement: A pull request improves CUDA graph properties checking by replacing expensive STL container usage with a faster hash computation method, resulting in measurable speedups during token processing across various models.
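The general technique of replacing container comparisons with a single combined hash can be sketched as below, using a Boost-style hash_combine fold. The specific hash function used by the PR is an assumption; only the idea of reducing per-token property checks to one integer comparison is taken from the summary above.

```python
def combine_hash(seed, value):
    """Boost-style hash_combine: fold `value` into a 32-bit `seed`.
    (Illustrative; the PR's actual hash computation may differ.)"""
    mixed = hash(value) + 0x9E3779B9 + ((seed << 6) & 0xFFFFFFFF) + (seed >> 2)
    return (seed ^ mixed) & 0xFFFFFFFF

def graph_fingerprint(node_props):
    """Collapse a sequence of graph node properties into one integer,
    so checking whether a captured CUDA graph is still valid becomes
    a single fingerprint comparison instead of walking STL containers."""
    h = 0
    for prop in node_props:
        h = combine_hash(h, prop)
    return h
```

The trade-off is the usual one for fingerprinting: equal property lists always produce equal hashes, while a hash collision between different lists is possible but astronomically unlikely for a well-mixed 32-bit (or wider) combine.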
- Q5_K quantization GPU support in OpenCL backend: One pull request adds basic support for Q5_K quantization on GPU within the OpenCL backend, enabling Q5_K operations to run on GPU instead of CPU fallback, significantly improving performance for models using this quantization format.
- Jinja engine improvements for reka-edge model chat template: A pull request introduces Python-style string repetition, ensure_ascii=true support for the tojson filter, an int() builtin for value_int_t, and fixes handling of invalid UTF-8 bytes when ensure_ascii is enabled, enhancing Jinja engine capabilities for the reka-edge model.
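The features above mirror standard Python and JSON semantics; the snippet below shows the reference behavior in Python for comparison (llama.cpp's Jinja engine is C++, so this only illustrates the expected semantics, not the engine's code).

```python
import json

# Python-style string repetition, the behavior the Jinja engine adopts
banner = "ab" * 3  # "ababab"

# ensure_ascii=True escapes non-ASCII characters, matching the
# described tojson filter option
encoded = json.dumps("café", ensure_ascii=True)  # '"caf\u00e9"'

# an int() builtin converts string values to integers
count = int("42")  # 42
```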
- Metal backend ADD1 operation support and deprecation: Two pull requests add support for the ADD1 operation on the Metal backend with FP32 and FP16 precision and subsequently deprecate the `GGML_OP_ADD1` operation in favor of the simpler `GGML_OP_ADD` subclass to streamline the codebase.
- Transformers dependency update for Gemma4 model conversion: A pull request updates the transformers dependency to version 5.5.0 to enable support for Gemma4 model conversion and resolve related issues affecting gguf-my-repo users.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| kainlan | 250 | 1 | 0 | 0 |
| ngxson | 85 | 8 | 1 | 37 |
| ggerganov | 90 | 9 | 1 | 12 |
| allozaur | 90 | 9 | 0 | 13 |
| max-krasnyansky | 102 | 1 | 0 | 1 |
| TheTom | 100 | 0 | 0 | 0 |
| aldehir | 63 | 8 | 1 | 16 |
| pwilkin | 60 | 8 | 2 | 16 |
| CISC | 53 | 3 | 0 | 26 |
| reeselevine | 46 | 4 | 0 | 25 |