Weekly GitHub Report for Llama.cpp: March 30, 2026 - April 06, 2026 (18:25:13)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced functionality and performance optimizations. Notable highlights include streamlined features aimed at improving user experience and system efficiency.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG] Eval bug: Gemma 4 generates <unused24> tokens: This issue reports that the Gemma 4 model (specifically the 26B-A4B variant) generates an unbounded stream of <unused24> tokens partway through responses, effectively producing unusable output. Despite attempts to disable reasoning mode and various backend configurations (CUDA, ROCm), the problem persists and appears to be a low-level inference bug related to CUDA/ROCm fusion of the GLU operation or other internal tensor computations, causing corrupted logits and repeated unused tokens during generation.
- The comment discussion extensively investigates the root cause, ruling out tokenizer, sampling, and template issues, and identifies the problem as a CUDA/ROCm backend bug involving fusion of GLU operations. Workarounds like disabling CUDA fusion or partial offloading of layers sometimes help, but the issue remains on many setups. A recent PR (#21506) aims to fix the bug, though some users report mixed results, and the problem is reproducible on multiple GPUs and platforms, with ongoing efforts to confirm a definitive fix.
- Number of comments this week: 88
- [BUG-UNCONFIRMED] Eval bug: gemma4 infinite output: This issue describes a problem where the gemma-4 model produces infinite repetitive output when used with llama-server for streaming completions, despite working correctly with llama-cli. The root cause is identified as the missing beginning-of-sequence (BOS) token during model conversion, which leads to malfunctioning in non-chat templated endpoints, and a workaround is to use the chat completions endpoint until the conversion script is fixed.
- Commenters confirmed the infinite repetition issue across different gemma-4 models and backends, noted related segfaults with newer builds, and discussed debugging steps; it was clarified that the problem stems from the missing BOS token in non-chat endpoints, and switching to the chat completions endpoint resolves the issue temporarily.
- Number of comments this week: 19
- [BUG-UNCONFIRMED] Eval bug: Gemma 4 audio support is missing: This issue reports that the Gemma 4 model, which should support audio input, is not recognized as supporting audio in the web UI, causing confusion about missing audio support. The discussion reveals that the problem is complex, involving outdated binaries, build caches, and incomplete or buggy audio conformer implementations, with contributors identifying and fixing multiple bugs to enable proper audio transcription functionality.
- The comments cover troubleshooting version mismatches and build issues, clarifying that the latest builds support the architecture but audio input is still not working due to disabled or incomplete features; contributors share logs, experimental code, and detailed bug fixes addressing optimizer miscompilations, dimension mismatches, and CUDA kernel support, culminating in a working audio transcription setup on specific hardware and software configurations.
- Number of comments this week: 16
- [BUG-UNCONFIRMED] Eval bug: Gemma4 attn_rot_k and v = 0: This issue reports that the attention rotation parameters (attn_rot_k and attn_rot_v) for the KV cache are zero when using the Gemma4 31B model, indicating that attention rotation is disabled, unlike with other models such as Qwen3.5. The user investigates this behavior through multiple tests, including updating to the latest version, verifying quantization settings, and experimenting with disabling the dimension check that disables rotation, ultimately finding that attention rotation is disabled due to varying embedding dimensions across layers in Gemma4, and that enabling rotation may not break functionality but affects perplexity and KL divergence metrics in quantized caches.
- The comments clarify that attention rotation is disabled because Gemma4’s embedding dimensions differ by layer, confirmed by code inspection and testing; users share benchmark results comparing performance with attention rotation on and off across different quantization levels, noting mixed impacts on perplexity and KL divergence, and suggest that the behavior is expected given the model’s architecture and current implementation.
- Number of comments this week: 14
- [BUG-UNCONFIRMED] Eval bug: Qwen 3.5 weird behavior...: This issue reports a non-deterministic bug in the Qwen 3.5 model where it sometimes outputs a string of forward slashes instead of expected text, with the problem occurring more frequently when multiple GPUs are used in a Docker environment. The user also notes subtler looping behavior not seen in other implementations and provides detailed reproduction steps, logs, and testing scripts to help diagnose the issue.
- The comments discuss attempts to reproduce the bug using different commands and GPU configurations, confirm the variability and increased likelihood with more GPUs, explore disabling checkpointing as a potential cause, and share similar experiences with other model variants, concluding that the issue is reproducible but its root cause remains unclear.
- Number of comments this week: 12
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 90
Summarized Issues:
- OpenVINO Integration Bugs: Several issues in the OpenVINO backend cause incorrect registration of VIEW, PERMUTE, and TRANSPOSE nodes as output nodes, leading to KV cache corruption, double-permutation errors, and cache overwrites. These bugs result in incorrect outputs, GPU plugin input validation failures, and shape mismatches during inference, particularly affecting models with GQA attention and interleaved sliding window attention.
- issues/21194, issues/21195, issues/21197, issues/21198, issues/21199
- OpenVINO GPU Plugin and Kernel Crashes: The OpenVINO GPU plugin crashes due to divide-by-zero errors in the JIT kernel compiler when processing tensors with zero elements, and the backend fails to detect unsupported operations on non-contiguous strided VIEW slices. These issues cause inference failures and require early-exit guards and improved operation detection to prevent crashes and incorrect outputs.
- issues/21196, issues/21199
- Build and Environment Compatibility Issues: Hardcoded paths and environment mismatches cause build failures and runtime crashes, including TBB path issues breaking OpenVINO builds, CUDA 13.2 causing incorrect outputs for certain quantizations, and OpenBLAS-related compilation errors on specific clusters. These problems hinder compatibility with newer system libraries and hardware configurations.
- issues/21200, issues/21255, issues/21279
- Model Output and Generation Bugs: Multiple models including Qwen3.5, Gemma 4, and RWKV7 exhibit output corruption, infinite loops, premature stopping, or malformed outputs due to backend bugs, tokenizer regressions, or tool call serialization errors. These issues cause unusable outputs, infinite repetitive tokens, or parsing failures across different backends and platforms.
- issues/21190, issues/21239, issues/21253, issues/21321, issues/21365, issues/21384, issues/21471, issues/21423
- llama-server Crashes and Bugs: The llama-server experiences crashes and assertion failures due to stale ngram map entries, device argument ordering, ignored grammar files, incompatible flag combinations, and pi agent role alternation errors. These bugs cause server instability, invalid device errors, and failure to load or apply configurations properly.
- issues/21233, issues/21236, issues/21262, issues/21256, issues/21373
- Memory Leaks and Resource Management Issues: Memory leaks occur on the RPC CUDA backend when splitting model layers, and prompt caching causes crashes due to excessive memory allocation in constrained environments like Docker. These issues lead to increased memory usage, repeated warmup messages, and out-of-memory errors during model execution.
- issues/21265, issues/21436
- Performance and Optimization Problems: Inefficient GPU kernel parameters and quantization methods cause suboptimal performance, including slower q5_0 quantization compared to q4_0 and q8_0, and a 13% regression in token generation throughput on Apple M3 Ultra with Metal backend. Optimizations for specific architectures like gfx1151 improve performance but highlight existing inefficiencies.
- issues/21284, issues/21295, issues/21494
- Model Loading and Compatibility Failures: Loading certain models such as Gemma 4 variants and Microsoft Phi 4 Q4_K causes crashes or out-of-memory errors on CUDA and ROCm backends due to failed assertions, missing kernels, or memory allocation issues. These failures prevent successful initialization or inference on specific hardware configurations.
- issues/21323, issues/21402, issues/21414, issues/21420, issues/21484
- Backend-Specific Crashes and Errors: Crashes occur in various backends including Vulkan, SYCL, OpenCL, and Metal due to missing kernels, invalid pointers, or unimplemented cases, affecting GPUs from Intel Arc, Adreno, AMD, and Apple Silicon. These backend-specific bugs cause segmentation faults, pipeline compilation failures, and device lost errors during inference or initialization.
- issues/21381, issues/21396, issues/21400, issues/21422, issues/21450, issues/21446, issues/21501
- User Interface and WebUI Issues: The web UI suffers from bugs including disappearing typing boxes on iOS Safari, failure of "Show thought in progress" to expand/collapse thoughts, mouse wheel autoscroll stopping, flickering dropdown menus, and lack of right-to-left text support. These issues degrade user experience and accessibility across platforms and browsers.
- issues/21276, issues/21322, issues/21341, issues/21362, issues/21502
- Feature Requests and Enhancements: Requests include adding ccache support in Dockerfiles to speed builds, speculative decoding for faster inference, per-head adaptive KV cache quantization, a Fast Walsh-Hadamard Transform for KV cache rotation optimization, and generalizing llama-quantize for any GGUF file format. These aim to improve build efficiency, inference speed, quantization quality, and tool flexibility.
- issues/21225, issues/21300, issues/21385, issues/21352, issues/21447
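One of the requests above asks for a Fast Walsh-Hadamard Transform to speed up KV cache rotation. As a generic sketch of why the FWHT is attractive for this (this is not the project's implementation), the transform applies a Hadamard rotation in O(n log n) time without ever materializing the matrix:

```python
def fwht(a):
    """In-place Fast Walsh-Hadamard Transform (unnormalized).

    Applies the Hadamard butterfly in O(n log n); len(a) must be a
    power of two. Applying it twice yields the input scaled by n, so
    the inverse is fwht followed by division by len(a).
    """
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

Because the transform is orthogonal up to a 1/sqrt(n) normalization, rotating KV vectors with it spreads outlier magnitudes across dimensions without changing inner products, which is what makes it useful ahead of quantization.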
- Model Architecture and Support Requests: Users request support for new or unrecognized model architectures such as qwen3omnimoe and the 1-bit Bonsai 8B model, highlighting the need for expanding compatibility and recognition within the project.
- issues/21298, issues/21351
- Inference and Reasoning Bugs: Issues include ignored reasoning budget limits causing no stop messages, infinite loops in tool call parsing with Gemma 4, and premature token generation stops due to hardcoded end-of-generation token checks. These bugs affect model control flow and output stability during inference.
- issues/21487, issues/21375, issues/21471
- Quantization and Cache Accuracy Issues: Problems with F16 KV cache accuracy degrading below native context size and attention rotation parameters remaining zero in certain models cause degraded model quality and disabled features. These issues highlight challenges in quantization and cache handling across models and backends.
- issues/21441, issues/21394
- Segmentation Faults and Assertion Failures: Various segmentation faults and assertion failures occur due to invalid pointer usage, parameter parsing errors, and non-causal attention parameter checks, causing crashes during image decoding, batch processing, or model initialization.
- issues/21461, issues/21427, issues/21404
- Miscellaneous Bugs: Additional issues include parse errors with XML/HTML tags in openclaw tool, repeated identical responses in llama-simple-chat, and requests for GUI chat compaction to enable infinite chats. These affect usability, tool integration, and user experience.
- issues/21495, issues/21493, issues/21466
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 47
Summarized Issues:
- Model Loading and Compatibility Issues: Several issues report failures or errors when loading or running specific models like Gemma4, Qwen3.5, and DeepSeek OCR due to unrecognized architectures, tensor shape mismatches, unsupported operators, or backend incompatibilities. These problems cause crashes, incorrect outputs, or inability to use certain features, highlighting challenges in supporting new or complex models across different hardware and software configurations.
- issues/21022, issues/21314, issues/21320, issues/21316, issues/21318, issues/21434, issues/21457
- Performance Regressions and GPU Backend Problems: Multiple issues describe severe performance drops, high GPU utilization, or crashes related to Vulkan, ROCm, HIP, and CUDA backends, especially when using multi-GPU setups or specific GPUs like AMD Radeon Pro VII and RX 9060 XT. These include throughput reductions, GPU running at full frequency when idle, and crashes due to VRAM exhaustion or unsupported operations, indicating instability and inefficiency in GPU acceleration layers.
- issues/21164, issues/21191, issues/21330, issues/21376, issues/21425, issues/21430
- Docker and Platform Compatibility Issues: There are problems with Docker images being built for incorrect platforms (arm64 vs amd64), missing images, or platform mismatches causing execution errors on certain architectures. These issues prevent proper deployment and usage of llama-server containers on expected hardware platforms.
- issues/21123, issues/21202
- API Key and Authentication Failures: Several issues report authentication problems in the WebUI and server, including CORS proxy requests failing due to missing API keys and sudden "Unauthorized: Invalid API Key" errors blocking access to essential resources, resulting in blank pages or failed requests despite user authentication.
- issues/21167, issues/21229
- Cache Migration and Model Storage Problems: The migration to a new Hugging Face cache system and local model cache restructuring caused incomplete file moves, alias conflicts, broken offline mode, and disrupted workflows due to unexpected changes in cache file layouts. These issues lead to model loading failures and confusion about model availability.
- issues/21302, issues/21303, issues/21364
- Gemma4 Model Specific Bugs and Feature Failures: The Gemma4 model exhibits multiple bugs including crashes with parallel parameters, incorrect token outputs, failure of final logit softcapping, segmentation faults on large prompts, and memory leaks during evaluation. These issues affect stability, output correctness, and resource usage when running Gemma4 models.
- issues/21316, issues/21318, issues/21379, issues/21388, issues/21401, issues/21449
- Threading and Performance Configuration Issues: Problems with default threading settings causing slower performance, and bugs triggered by specific flags like `sequential` in llama-perplexity or `--parallel` in llama-server, indicate that certain configurations negatively impact speed or cause crashes. Manual tuning or flag changes are required to restore expected behavior.
- issues/21042, issues/21171, issues/21318
- WebUI and Frontend Loading Errors: The WebUI experiences issues such as visible unparsed tags in reasoning sections, blank pages due to MIME type errors or blocked scripts, and failures in loading essential bundles, all of which degrade user experience and accessibility.
- issues/20356, issues/21281, issues/21229
- Model Evaluation and Inference Bugs: Bugs causing assertion failures, silent output truncation mid-token, and incorrect output tokens during evaluation or inference disrupt normal model operation and output quality, leading to crashes or unusable results.
- issues/21208, issues/21248, issues/21425
- Build and Compilation Warnings or Errors: Compilation warnings on macOS about missing 'noreturn' attributes and issues with double-zipped DLL archives in Windows releases indicate build process inconsistencies or minor packaging problems that do not necessarily block usage but may affect developer experience.
- issues/21319, issues/21333
- Feature Requests and Enhancements: Requests include adding ARM64 release binaries, Windows OpenVino support, JSONL HTTP trace logging for easier debugging, audio support for Gemma4 multimodal models, and speculative decoding techniques to improve CPU inference throughput and reduce latency.
- issues/21091, issues/21108, issues/21356, issues/21453, issues/21469
- CUDA and Driver Compatibility Issues: The Docker version of llama.cpp no longer supports CUDA 12.8 due to raised minimum CUDA version requirements, causing startup failures on systems unable to upgrade NVIDIA drivers, impacting users with older hardware or software environments.
- issues/21429
- Miscellaneous Bugs and Integration Issues: Other issues include null pointer dereferences in the Jinja parser, HTTPS support failures in Nix flake installations, and integration questions about llama-cpp-python usage with specific scripts, reflecting a range of smaller but impactful problems.
- issues/20911, issues/21353, issues/21409
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 61
Key Open Pull Requests
1. webui: Server tools: This pull request introduces server tools integration into the web user interface, including features such as a /tools endpoint, built-in and JSON schema tools, UI improvements, and reorganized settings sections to enhance server management capabilities.
- URL: pull/21237
- Associated Commits: 8d0eb, 684ed, 155af, c800a, 62c8a, 44193, f4baf, 35076, 3994a, 7fc5b, bbb2b, 79999, 7c520, 94f7d, 5970f, 7eeee, ea5b7, b22ae, 4ddda, 9c922, 8c55e, c3520, 5acfc, cfd5a, 7a13b, ec630, 2d2ef, 8bf19, 156b9, b0749, ad9e9, 5468f, 6ec8a, c12c0, 8e557, 1dafe, c374e, d24e0
2. eagle3: add qwen3.5 4B 9B 35B-A3B support: This pull request extends the EAGLE3 implementation by adding recurrent verification-state support and integrating Qwen3.5 series models, including MoE variants, while maintaining a single-batch target verification flow and updating related APIs, model wiring, and conversion tools to enable efficient hybrid and recurrent model handling.
- URL: pull/21437
- Associated Commits: c0d99, 71ba2, 3da28, 13a9f, 75883, 7b78b, 7d4c2, 5e224, b3537, 9fea2, b8ab2, 07e2c, 5bb2d, 57438
3. CANN: Add suport for Qwen35 ops: This pull request adds comprehensive support for multiple missing operators—including FILL, DIAG, SOLVE_TRI, SOFTPLUS, CUMSUM, TRI, SET, and GATED_DELTA_NET—to the CANN (Ascend NPU) backend to enable full execution of Qwen 3.5 models on this platform, along with improvements such as a memset_tensor interface for correct zero-initialization and a fix to graph cache key comparisons to ensure correctness and performance.
- URL: pull/21204
Other Open Pull Requests
- Parser and Grammar Improvements: Multiple pull requests focus on enhancing the grammar parsing system by simplifying autoparser rules, fixing uninitialized parameters, and addressing parsing errors related to grammar failures and trailing whitespace. These changes improve the reliability and functionality of the parser, preventing rule explosion and ensuring more robust parsing behavior.
- Memory Management Enhancements: A pull request introduces a `--models-memory-margin` parameter to the server router, allowing dynamic unloading of smaller embedding models to free memory for larger models. This enables multiple small models to coexist more efficiently in memory, optimizing resource usage.
- Quantization and Kernel Optimizations: Several pull requests add CPU support for Q1_0 1-bit quantization with group size 128, improve Q4_K/Q4_0 SOA kernel implementations for OpenCL 3.0 devices, and fix critical crashes on Adreno GPUs. These changes include ARM NEON and scalar fallbacks, guarded compilation, and optimized buffer handling to enhance performance and stability across hardware.
- Model Integration and Multimodal Support: A pull request adds support for the STEP3-VL-10B multimodal vision-language model, integrating its PE-lang vision encoder and Qwen3-8B decoder with improvements like fused QKV usage and image processing enhancements. This expands the project's capabilities to handle advanced multimodal AI models.
- Tokenizer and Regex Handling Fixes: To prevent stack overflow segfaults caused by recursive regex backtracking, a custom Qwen2 regex handler was added to the unicode tokenizer. This aligns Qwen2's handling with other models and resolves issue #21113 by improving tokenizer stability.
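The fix above replaces a recursive regex with a hand-written handler. As a hedged illustration of the general technique (not the actual Qwen2 pretokenizer, whose rules are far more involved), a linear scan over character classes keeps state in O(1) and never touches a regex engine's backtracking stack, so arbitrarily long inputs cannot overflow it:

```python
def split_runs(text):
    """Split text into maximal runs of letters, digits, or other chars.

    A single left-to-right pass with constant state; unlike a
    backtracking regex, the depth of work per character is bounded.
    """
    out, cur, kind = [], "", None
    for ch in text:
        k = "L" if ch.isalpha() else "D" if ch.isdigit() else "O"
        if k == kind:
            cur += ch
        else:
            if cur:
                out.append(cur)
            cur, kind = ch, k
    if cur:
        out.append(cur)
    return out
```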
- Benchmarking Tool Enhancements: The llama-bench tool was updated with new `-fitc` and `-fitt` command-line arguments to specify fitting parameters for performance testing, accompanied by README and script updates. These additions provide users with more control over benchmarking configurations.
- Text Rendering and RTL Support: Fixes were made to the llama.cpp web UI to correctly render Arabic (right-to-left) text by adding `dir="auto"` attributes, replacing directional CSS with logical properties, and implementing a comprehensive RTL support plugin. This ensures proper bidirectional text display for mixed Arabic and English content.
- Sparse Attention and New Operations: A new CPU-only GGML_OP_GATHER operation was implemented to enable efficient per-batch sparse attention by extracting tensor elements at top-k indices. This addresses limitations of existing gather/scatter ops and lays foundational infrastructure for sparse attention models like DeepSeek Sparse Attention (DSA).
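To make the gather-at-top-k access pattern above concrete, here is a plain-Python sketch (names are illustrative, not the GGML API): score every position, then read values only at the k highest-scoring indices, which is exactly the read pattern a gather operation serves in sparse attention.

```python
def gather_topk(scores, values, k):
    """Return (indices, gathered values) for the k largest scores.

    Mirrors per-batch sparse attention: instead of attending to every
    position, extract only the entries at the top-k score indices.
    """
    idx = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return idx, [values[i] for i in idx]
```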
- API Compatibility and Request Translation: Support was added in llama-server to translate structured generation request parameters from the OpenAI responses API format to the completions API format, including converting `text.format` to `response_format`. This enables proper structured JSON generation using JSON schemas.
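A minimal sketch of such a translation, assuming the Responses-API request shape `{"text": {"format": {...}}}` (field names taken from OpenAI's public documentation; the pull request's exact mapping may differ):

```python
def translate_structured_params(request):
    """Move a Responses-style text.format block to a completions-style
    response_format field, leaving all other request fields untouched."""
    fmt = request.get("text", {}).get("format")
    if fmt is None:
        return request
    out = {k: v for k, v in request.items() if k != "text"}
    out["response_format"] = fmt
    return out
```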
- Metal Backend Operation Support and Deprecation: The ADD1 operation was added to the Metal backend with FP32 and FP16 precision, including mixed precision support, while a separate pull request proposes deprecating the redundant `GGML_OP_ADD1` operation to simplify the codebase.
- CUDA Graph Performance Optimization: The CUDA graph properties matching check was optimized by replacing expensive STL container usage with a faster hash computation method, resulting in measurable speedups across various models.
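The idea behind that optimization can be sketched in a few lines (a Python analogy of the C++ change, not the actual code): instead of comparing a stored container of graph properties element by element on every lookup, compute one hash up front and compare integers. A real implementation must still guard against hash collisions with a full comparison; the sketch omits that for brevity.

```python
def graph_key(nodes):
    """One integer summarizing a graph's shape-relevant properties.

    nodes is a sequence of (op_name, shape) pairs; hashing the whole
    tuple replaces an O(n) element-wise comparison per cache probe.
    """
    return hash(tuple(nodes))

_graph_cache = {}

def get_or_build(nodes, build):
    # Probe by hash; build and memoize only on a miss.
    key = graph_key(nodes)
    if key not in _graph_cache:
        _graph_cache[key] = build(nodes)
    return _graph_cache[key]
```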
- QKV Tensor Loading Refactoring: Duplicated Q/K/V tensor loading and graph-building code across 77 model files was refactored into two reusable helper functions, `create_tensor_qkv` and `build_qkv`, streamlining handling of fused and separate QKV cases without changing existing logic.
- Context Truncation Bug Fixes: A bug in the ngram-map eviction logic was fixed by correcting the use of `size_begin` instead of `size_last_begin` during context truncation, preventing out-of-bounds errors and GGML_ABORT crashes. Comprehensive tests were added to cover multiple truncation scenarios.
- Agentic Loop and Thinking Logic Improvements: Multiple pull requests ensure `reasoning_content` is stored and sent back in assistant messages during agentic loops, refactor thinking parsing logic to support adaptive thinking mode for the Claude Code client, and implement initial integration tests for multi-turn agentic tool use workflows. These changes improve reasoning state management and testing infrastructure.
- Vulkan and Metal Backend Performance Tweaks: Vulkan Xe2 warptile configuration was tweaked to eliminate register spills in native float matmul shaders, yielding substantial speedups for BF16 models. Additionally, Metal backend's FlashAttention implementation was optimized for the Qwen3-VL image encoder, reducing image encoding time by ~11% on large images.
- Quantization and Tokenizer Fixes: The llama-quantize tool's logic for selecting quantization tensor types was fixed to correctly apply quantization when the requested type matches the base type. The Gemma 4 tokenizer's EOG token list was also fixed by removing the `</s>` token to resolve conflicts with paddleocr.
- Input Handling and Server Bug Fixes: The llama-cli was fixed to preserve newline characters at the end of multiline input lines, ensuring correct model input processing. A server bug was fixed where the `ignore_eos=true` flag was ignored due to premature parameter capturing, with a regression test added to verify correct logit bias application.
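For context, honoring `ignore_eos` amounts to biasing the end-of-sequence logit to negative infinity before sampling, so the EOS token can never win; the sketch below is a generic illustration of that step, not the server's code:

```python
import math

def apply_logit_bias(logits, bias):
    """Apply per-token logit biases before sampling.

    ignore_eos is equivalent to bias = {eos_token_id: -inf}: the EOS
    token then has zero sampling probability, so generation cannot
    stop early regardless of the model's preference.
    """
    out = list(logits)
    for token_id, b in bias.items():
        out[token_id] = -math.inf if math.isinf(b) else out[token_id] + b
    return out
```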
- Syntax Highlighting Restoration: Syntax highlighting was restored for non-common programming languages like Haskell, Elixir, Dart, and Scala after streaming code blocks by ensuring the full language set supported by lowlight is passed to rehype-highlight. This maintains consistent syntax highlighting without adding new dependencies.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 120
Key Closed Pull Requests
1. ci: Add Windows Vulkan backend testing on Intel: This pull request enables partial continuous integration testing for the Windows Vulkan backend specifically on Intel hardware, addressing related issues from previous discussions and implementing various fixes and adjustments to the CI scripts and environment to accommodate Windows-specific challenges such as python virtual environment construction errors.
- URL: pull/21292
- Associated Commits: 7fad5, 2221b, 13c9c, 3c8fa, b6f57, f312d, 445ac, 0f082, f605e, 0bec3, c6771, e5ec1, bc74f, c75f7, d5e46, cc07c, c0f0e, ba0ad, 8647e, 94728, fd583, 70d42, 1cd25, 4ed14, fa3d7, 38417, acdd7, aa83a, d92dc, f847d, e0451, 293c9, 039d7, f9dd5, 65538
2. ggml-webgpu: add vectorized flash attention: This pull request adds a vectorized WebGPU implementation of flash attention (FLASH_ATTN_EXT) to the ggml-webgpu project, featuring a split pipeline with optional mask tile classification, a vectorized attention kernel, and a merge path for multi-split execution to optimize performance.
- URL: pull/20709
- Associated Commits: 976eb, 94abb, 10330, c307a, f8e31, 52709, df6ef, 83830, d61ec, 042a1, b61e6, 36027, 356d6, 1ae04, 33a54, 638c4, 0abac, 83a42, 2595b, 25096, 3d6bf, 68fa2, 5065d, 03d06, 5dd2a, 1e0d8, 88bf3, 5c2fe, 59aa7, cac85
3. ggml-cuda: Add generic NVFP4 MMQ kernel: This pull request introduces a generic NVFP4 matrix multiplication quantization (MMQ) CUDA kernel that leverages and adapts existing kernels to significantly improve prefill speed for various models on NVIDIA GPUs, while adding FP8 availability checks to ensure CI stability and laying groundwork for future Blackwell-specific optimizations.
- URL: pull/21074
- Associated Commits: 94e58, 2761d, cbd9f, 0d929, 0018c, 592e1, 1489e, 31770, ebe28, aa55c, 8af43, d8c5b, cba86, a2f72, 4be4b, 30d7c, 145d8, bf496, e2bab
Other Closed Pull Requests
- Gemma 4 Parser and Template Enhancements: This pull request introduces a specialized parser for Gemma 4 that replaces the previous autoparser implementation and refactors normalization to directly emit JSON from the AST. It also adds utilities for AST walking, reworks tool response handling to align with the Gemma 4 prompt formatting guide, incorporates a new chat template supporting interleaved thinking during function calls, and includes multiple fixes and additional tests to improve parsing accuracy and template functionality.
- CI/CD and Build Workflow Improvements: Multiple pull requests address improvements in the continuous integration pipeline and build workflows by fixing Docker multi-architecture build conflicts, removing macOS, Windows, and iOS build jobs, adjusting permissions and artifact handling, fixing CMake flags, and updating ROCm versions and GPU target support. These changes refine release artifact management, improve build efficiency, and enhance ARM64 support with updated Dockerfiles and accelerated build times.
- Vision-Language Model and Chat Template Additions: Support for the Tencent HunyuanOCR vision-language model is added by implementing converters, fixing token ID issues, introducing a new chat template specific to HunyuanOCR, and integrating a perceiver-based vision projector with dynamic resolution preprocessing. Additionally, a new Granite 4.0 chat template is introduced with correct role mapping and parallel handling alongside the existing Granite 3.x template, ensuring backward compatibility and improved tool calling functionality.
- Web UI and Server Performance Enhancements: The web UI build of the llama-server removes gzip compression to improve Git diff efficiency and reduce disk space usage, adding a flag to optionally exclude the web UI from builds. The server also gains a `--kv-clear-idle` flag that saves idle key-value slots to RAM and clears them from VRAM, reducing attention cost and improving performance, especially in CUDA environments.
- GPU Backend and Quantization Improvements: The ggml WebGPU backend is refactored to improve asynchronous scheduling and efficiency by consolidating GPU submissions and replacing the parameter buffer pool with a single buffer managed via offsets. Quantization quality is enhanced by applying Hadamard transform-based rotations to attention activations, reducing outliers without affecting the dot product, and Q4_K GEMM and GEMV kernels are added to the Adreno OpenCL backend to boost performance for quantized models.
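The claim that a Hadamard rotation reduces outliers "without affecting the dot product" follows from orthogonality, and is easy to check numerically. The sketch below builds a normalized Hadamard matrix by Sylvester's construction (a generic demonstration, unrelated to the actual kernels):

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix (n a power of two),
    built by Sylvester's recursive doubling construction."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[x * s for x in row] for row in H]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Rotating an outlier-heavy activation like [10, 0, 0, 0] yields [5, 5, 5, 5], a much flatter distribution for the quantizer, while dot products between rotated vectors match those of the originals.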
- Tracing, Testing, and Parser Fixes: An optional, dependency-free JSONL HTTP tracing feature is added to the llama-server for detailed request and response observability. Comprehensive unit test coverage is introduced for the `llama_tensor_get_type` function using real model metadata, and fixes are made to the Jinja parser to reject empty computed member expressions, improving error handling and compliance with Jinja2 semantics.
- CUDA and Vulkan Support Enhancements: CUDA Flash Attention support is added for a head dimension of 512 to enable efficient GPU execution and prevent CPU fallback. ARM64 CUDA and Vulkan runners are enabled with updated Dockerfiles, upgraded toolchains, and fixes for multi-architecture container image manifest handling, improving build times and platform compatibility.
- Bug Fixes and Stability Improvements: Several fixes address issues such as MSVC optimizer miscompilation in the Gemma 4 audio conformer encoder, segmentation faults in the sampler when loading non-existent models, exclusion of the CCAN component from builds, and stack overflow caused by std::regex in MSVC with the Tekken Tokenizer. These changes improve stability and correctness across various components.
- Quantization Algorithm and Interface Refactoring: TurboQuant Algorithm 1 is implemented by applying random orthogonal rotations to key and value vectors before quantization and its inverse after dequantization, significantly reducing quantization error and VRAM usage. Additionally, the `llama_model_quantize_params` interface is refactored to a pure C interface, improving accessibility and integration.
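The rotate, quantize, dequantize, inverse-rotate pipeline described above can be sketched with a simple 2-D rotation (illustrative only; TurboQuant uses random orthogonal rotations in higher dimensions). Because the rotation is orthogonal, inverting it after dequantization preserves the norm of the quantization error, so accuracy bounds established in the rotated space carry over to the original coordinates.

```python
import math

def rot2(theta):
    """2x2 rotation matrix, a minimal stand-in for a random orthogonal rotation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    # For an orthogonal matrix, the transpose is the inverse rotation.
    return [list(col) for col in zip(*M)]

def quantize_roundtrip(v, levels=7):
    """Symmetric absmax quantization to integer levels and back."""
    scale = max(abs(x) for x in v) / levels or 1.0
    return [round(x / scale) * scale for x in v]

def rotated_roundtrip(v, theta=0.7):
    R = rot2(theta)
    dq = quantize_roundtrip(matvec(R, v))  # quantize in rotated space
    return matvec(transpose(R), dq)        # rotate back to original space
```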
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 123 | 11 | 1 | 28 |
| ngxson | 89 | 6 | 1 | 56 |
| CISC | 59 | 2 | 0 | 53 |
| pwilkin | 79 | 9 | 0 | 20 |
| aldehir | 66 | 8 | 0 | 26 |
| allozaur | 80 | 5 | 0 | 8 |
| rodgerhubhay | 84 | 0 | 0 | 0 |
| 0cc4m | 53 | 3 | 1 | 16 |
| angt | 45 | 7 | 0 | 16 |
| taronaeo | 52 | 2 | 0 | 8 |