Weekly GitHub Report for Llama.cpp: August 25, 2025 - September 01, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 focuses on performance optimization and feature expansion; notable highlights include streamlined workflows and improved system stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: Performance Downgrade happened from b6188, for llama-bin-win-vulkan-x64 distribution.: This issue reports a significant performance regression in the llama-server.exe Vulkan Windows distribution, where the token inference speed dropped from about 35.35 to 28.08 tokens per second after updating from version b6123 to b6301. The user traced the problem to commit b6188, which introduced larger Vulkan workgroup sizes that appear to negatively impact performance on older Nvidia GPUs like the Quadro P620, and is seeking a fix or a way to disable this change for affected hardware.
- The discussion involved the user performing a git bisect to pinpoint the problematic commit as b6188, with multiple tests confirming the performance drop starting at that commit. Developers suggested the regression is likely due to larger Vulkan workgroup sizes introduced in a recent PR, and provided a code patch to disable these changes by commenting out the workgroup size adjustment, recommending recompilation to verify the fix. Further suggestions included adding GPU capability checks to conditionally apply the optimization only on newer hardware, with the issue awaiting confirmation that the patch resolves the slowdown.
- Number of comments this week: 17
- Feature Request: Repeated Unecessary Activation Quantization Ops: This issue requests an optimization to reduce repeated and unnecessary activation quantization operations during token generation in the model, specifically by quantizing activations once for groups of projections instead of before each linear operation. The motivation is to improve performance by merging weight tensors and updating the model loading and forward pass logic to handle these merged tensors, potentially implemented via an option in the conversion script to maintain compatibility with existing models.
- The discussion explores merging weight tensors to avoid redundant quantization, with suggestions to modify the conversion script and model code for compatibility. Contributors share debugging experiences, propose code improvements such as using `torch.cat` for tensor concatenation, and report benchmark results showing performance gains. Further refinements include removing unnecessary calls and considering a new quantization operation to support diverse model architectures without requiring full reconversion.
- Number of comments this week: 11
- Misc. bug: convert_hf_to_gguf.py runs out of memory: This issue reports a memory exhaustion problem when running the `convert_hf_to_gguf.py` script on Windows 10 to convert a large Hugging Face model into the GGUF format, where the process consumes nearly all available RAM and ultimately fails with allocation errors. The user and commenters discuss potential causes including dtype conversion overhead, 32-bit vs 64-bit Python builds, Windows pagefile settings, and differences in behavior between Windows and Linux, with suggestions such as using the `--use-temp-file` option, matching output dtype to source dtype, increasing pagefile size, and applying a proposed fix that streams dtype casts to reduce peak memory usage.
- The comments include troubleshooting steps like enabling `--use-temp-file`, adjusting output dtype flags to avoid large in-memory casts, verifying 64-bit Python usage, and increasing Windows pagefile size; a Linux user confirms the conversion works with much less RAM, indicating a Windows-specific or environment-related issue; a proposed patch to stream dtype conversion in chunks is shared, but testing it shows memory usage still grows, suggesting the problem persists on Windows despite these efforts.
- Number of comments this week: 10
- Eval bug: model infer input "GGGGGGG": This issue reports a bug encountered during model inference on an AMD Radeon 880M using the HIP backend, where the output becomes a string of "GGGGGG" when the prompt length exceeds the ubatch limit. The user has identified the first bad commit causing this behavior through git bisect and requests a fix, while discussion in the comments explores the underlying buffer management in the CUDA/HIP backends and considers the possibility of a ROCm driver issue.
- The comments confirm this issue is a duplicate of a previous one and suggest using git bisect to pinpoint the problematic commit, which the user completes successfully. Further discussion clarifies how buffer types and host memory registration work in the CUDA backend, with speculation that the root cause may lie in the ROCm driver rather than the llama.cpp codebase, and a request for additional environment details before proceeding.
- Number of comments this week: 9
- Misc. bug: llama-bench json output is too verbose: This issue reports that when using the `llama-bench` tool with JSON output enabled, extraneous Vulkan device information is printed outside the JSON format, making the output too verbose and difficult to parse. Additionally, the user notes a related segmentation fault occurring during shutdown, which appears to be caused by a recent commit and is considered a separate problem.
- Commenters suggest that the informational Vulkan device prints should be sent to stderr while the JSON output goes to stdout, allowing users to separate them easily (a minimal parsing sketch follows this list); the original poster confirms this workaround but requests clearer documentation in the help output. The segmentation fault is acknowledged as a distinct issue, with advice to debug it or trace the responsible commit, which the user eventually identifies and plans to report separately.
- Number of comments this week: 5
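As a concrete illustration of the workaround discussed in the llama-bench issue above, the sketch below captures stdout and stderr separately so that informational device prints never reach the JSON parser. It assumes the benchmark is invoked with a JSON output flag (shown here as `-o json`; adjust the command line to your build) and is a minimal example, not part of the llama.cpp tooling.

```python
# Sketch: keep llama-bench's JSON on stdout separate from informational prints on stderr.
# Assumes the benchmark emits JSON with "-o json"; adjust the command for your build.
import json
import subprocess

result = subprocess.run(
    ["./llama-bench", "-m", "model.gguf", "-o", "json"],
    capture_output=True,
    text=True,
    check=True,
)

data = json.loads(result.stdout)   # clean JSON, parse as usual
print(result.stderr)               # device/driver chatter, logged separately if needed
```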
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 518 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress indicators when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the download progress status by properly utilizing the CURLOPT_NOPROGRESS option to ensure accurate and non-overlapping progress reporting during parallel downloads.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.
- Compile bug: .devops/cuda.Dockerfile fails if BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04: This issue reports a compilation failure when building the CUDA Docker container using a base image derived from Ubuntu 24.04, due to pip requiring the `--break-system-packages` option to install packages in this environment. The user has identified the cause related to changes in Python package management on Ubuntu 24.04 and plans to submit a fix via a pull request.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 33
Summarized Issues:
- Domain and GUI Support Offer: A community member has offered the domain "llamacpp.app" for free to support official GUI development, improve branding, and provide a user-friendly entry point for the llama.cpp project. They also offered help with domain maintenance or GUI contributions if desired.
- issues/15549
- Runtime and Assertion Failures: Multiple issues report runtime crashes and assertion failures caused by backend-specific problems such as empty probability distributions in std::discrete_distribution, asynchronous kernel execution on Intel GPUs, and unexpected empty grammar stacks during tool calling. These failures often lead to program crashes or core dumps and sometimes require synchronization or code fixes to resolve.
- issues/15551, issues/15580, issues/15608
- Backend and Hardware Compatibility Issues: Several issues highlight problems with specific hardware or backend configurations, including CUDA errors on Windows with outdated GPU drivers, HIP backend bugs on AMD GPUs, Vulkan backend performance regressions and crashes on Windows, and Flash Attention causing slowdowns on AMD GPUs under Linux. These issues affect model loading, inference correctness, and performance.
- issues/15556, issues/15609, issues/15618, issues/15624, issues/15678, issues/15659
- Server and UI Bugs: The llama-server component exhibits multiple bugs including JavaScript errors clearing answers and history, crashes caused by tool calling with strict validation, improper handling of thinking tokens in prompts, and fatal crashes on macOS when using the reranking flag. These issues disrupt user interaction and server stability.
- issues/15571, issues/15640, issues/15673, issues/15685
- Model Behavior and Conversion Concerns: Users report unexpected multi-turn chat behavior causing KV cache growth and redundant prompt content, accuracy loss when converting finetuned models between formats, and transcription failures with specific models in llama-server. These issues raise questions about intended behavior and model fidelity after conversion.
- issues/15573, issues/15600, issues/15651
- Feature Requests for Model and CLI Support: Several requests seek to add support for new models such as AllenAI's FlexOLMo, Cohere's Command A Reasoning, and the Ovis2.5 vision model, as well as new CLI features like lookahead algorithm flags and timeout options for automated benchmarking. These enhancements aim to expand functionality and improve usability.
- issues/15581, issues/15585, issues/15603, issues/15612, issues/15654
- Performance Regressions and Optimizations: Reports include a performance regression in llama-server due to a sampling code change, a request to optimize activation quantization to avoid repeated operations, and a significant slowdown caused by enabling Flash Attention on certain GPUs. These issues impact throughput and efficiency (a small sketch of the weight-merging idea behind the quantization request appears after this list).
- issues/15602, issues/15672, issues/15624
- Compilation and Build Failures: There are build failures reported for ARM64 Windows builds with KLEIDIAI backend due to undefined symbols, missing CMake scripts causing llama-server build failures on Windows with CUDA, and general issues related to backend-specific compilation steps. These prevent successful builds and deployment.
- issues/15653, issues/15675
- False Virus Detection and Segmentation Faults: Windows Defender falsely detects Trojan malware in certain binary files, and a segmentation fault occurs due to a modified tokenizer regex when processing very long repeated character strings. These issues affect user trust and stability.
- issues/15596, issues/15594
- Output Formatting and Streaming Issues: Problems include JSON output being polluted with extraneous Vulkan device info, the `--jinja` flag ignoring specified response formats, and inconsistent streaming behavior in the Granite chat parser where some fields stream incrementally and others do not. These issues complicate output parsing and user experience.
- issues/15554, issues/15664, issues/15681
- Prompt Processing Inefficiencies: The Nemotron v2 Nano model reprocesses the entire prompt context instead of using cached data, causing inefficiencies despite attempts to parse reasoning content correctly. Suggestions include adding a SWA snapshot mechanism to improve prompt handling.
- issues/15677
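To make the activation-quantization optimization request above (also the second most active issue in section 2.1) more concrete, here is a minimal sketch of the underlying weight-merging idea using torch.cat. The tensor names and sizes are hypothetical and this is not the proposed llama.cpp change; it only shows why concatenating projection weights lets the shared input activation be processed once by a single matmul.

```python
# Minimal sketch of the weight-merging idea (hypothetical names, not llama.cpp code):
# with concatenated projection weights, the shared input activation only needs to be
# quantized once before a single matmul, instead of once per projection.
import torch

hidden = 64
wq = torch.randn(hidden, hidden)   # stand-ins for separate projection weights
wk = torch.randn(hidden, hidden)
wv = torch.randn(hidden, hidden)

# Merge along the output dimension; rows of the result map back to q, k, v.
w_qkv = torch.cat([wq, wk, wv], dim=0)           # shape: (3 * hidden, hidden)

x = torch.randn(1, hidden)                        # one activation vector
qkv = x @ w_qkv.T                                 # single matmul over the merged weights
q, k, v = qkv.split(hidden, dim=-1)               # recover the individual projections

# Equivalent to computing the projections separately:
assert torch.allclose(q, x @ wq.T, atol=1e-5)
assert torch.allclose(k, x @ wk.T, atol=1e-5)
assert torch.allclose(v, x @ wv.T, atol=1e-5)
```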
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 27
Summarized Issues:
- Abort behavior inconsistency with JavaScript AbortController: Aborting a long-running prompt processing request correctly stops both prompt processing and token generation when using Postman or cURL, but fails to stop prompt processing when using the JavaScript OpenAI library with an AbortController signal. This inconsistency causes the prompt processing phase to continue despite the abort signal, only halting token generation.
- issues/15232
- Embeddings endpoint failure after version 5630: The llama-server's /v1/embeddings endpoint stops working and returns a "not_supported_error" after version 5630, likely due to changes in how embeddings must be enabled via environment variables in Kubernetes deployments. This change breaks existing setups that do not configure the environment variables correctly.
- issues/15406
- Feature request for NVidia Nemotron Nano v2 model support: There is a request to add support for the NVidia Nemotron Nano v2 model, which requires implementing a new model class to handle its hybrid architecture. Challenges include model conversion, token generation, and cache initialization specific to this model.
- issues/15409
- HIP backend token output bug on AMD hardware: When compiling and running a language model using the HIP backend on AMD hardware, output tokens exceeding the batch size parameter cause repeated "GGGGGGGG" characters to be erroneously output. This indicates a bug in handling token output beyond batch size limits.
- issues/15465
- ARGMAX tie-breaking inconsistency in backend tests: Backend tests intermittently fail due to different backends returning different indices for tied maximum values in the ARGMAX function. This inconsistency could be resolved by implementing explicit tie-breaking logic in the reduction process (a small tie-breaking sketch appears after this list).
- issues/15484
- Segmentation fault on multi-GPU CUDA setup: Running llama-server with a multi-GPU setup on NVIDIA RTX 5060 Ti GPUs causes a segmentation fault during the first inference, crashing the server while processing the initial prompt. This issue is specific to the CUDA backend in multi-GPU environments.
- issues/15519
- BF16 precision assertion failure in CUDA backend: Using Mistral's mmproj operation in BF16 precision with the CUDA backend triggers an assertion failure because the im2col implementation does not support BF16. This causes a crash during image processing and suggests the need for BF16 support or fallback mechanisms.
- issues/15536
- Vulkan backend crashes with Qwen3-235B and Seed-OSS models: The llama-cli chat feature crashes on Vulkan backend under Linux with the Qwen3-235B model due to a Jinja template runtime error, while the Seed-OSS model crashes immediately after the first user prompt with repeated assertion failures in ggml CPU code. Both issues cause core dumps and aborted processes.
- issues/15540, issues/15547
- Vulkan backend crash loading Qwen3 30B A3B model: Loading the Qwen3 30B A3B model on Vulkan backend causes a crash, later identified as a duplicate issue likely caused by a corrupt build and resolved in later releases.
- issues/15548
- Lack of clear informational messages on program stop: The project lacks clear informational messages explaining why the program stops running, instead silently stopping without explanation when the verbose flag is not used. This feature request aims to improve user feedback during unexpected stops.
- issues/15553
- Compilation failure due to undefined ggml_graph_compute_with_ctx: Building a simple MNIST model example fails on macOS with the Metal backend due to an undefined identifier error for `ggml_graph_compute_with_ctx` in `ggml.h`. This indicates outdated code references causing build errors.
- issues/15570
- llama-server crash on Mac Metal backend during image processing: The llama-server crashes when processing an image with the Mistral-Small model on Mac Metal backend due to a failed assertion caused by a non-contiguous memory buffer in the GGML Metal backend during image encoding.
- issues/15574
- iOS build failure due to incompatible CMake install step: Building the mtmd tool for iOS fails because the CMake install step is incompatible with iOS packaging, causing the build to crash. A conditional fix is suggested to exclude runtime installation on iOS systems.
- issues/15578
- CUDA backend core dump on startup due to missing architecture support: CUDA-enabled binaries crash on startup on Linux systems with NVIDIA RTX A5000 GPUs because the GGML CUDA backend was not compiled with support for the detected CUDA architecture (compute capability 8.6). This results in a runtime abort triggered by a missing compatible CUDA architecture flag.
- issues/15584
- CUDA architecture version build failures affecting model loading: The ggml library was not built with any CUDA architecture version ≤ 7.5, causing failures when loading gguf models on Linux and Windows with NVIDIA A100 and 3090 GPUs. This compile-time bug prevents proper runtime operation on affected systems.
- issues/15589, issues/15593
- Tensor initialization error with MXFP4 quantization in llama-cli: The gpt-oss-120b model using MXFP4 quantization runs on Ollama but fails to load in llama-cli due to a tensor initialization error related to deprecated tensor types and block size mismatches. This compatibility bug prevents successful model loading.
- issues/15597
- Linking errors due to missing ggml backend CPU functions: Building the `test-opt` target on Linux fails with undefined references to `ggml_backend_is_cpu` and `ggml_backend_cpu_set_n_threads` when specific GGML backend and CPU variant flags are enabled. This causes compile-time linking errors.
- issues/15598
- Segmentation faults with HIP backend on AMD after ROCm 6.4.3 upgrade: Running models with the HIP backend on AMD Radeon RX 7900XTX after upgrading ROCm to 6.4.3 causes segmentation faults, suggesting compatibility problems introduced by the ROCm upgrade during model inference.
- issues/15605
- UTF-8 encoding exception in llama-server on Windows: The string_strip function raises exceptions due to encoding problems when processing accented characters in reasoning text, causing incorrect UTF-8/unicode handling and character misinterpretation on Windows.
- issues/15607
- Segmentation fault on Arch Linux with ROCm 6.4.3-1 during completions: The llama-server crashes with a segmentation fault on Arch Linux using ROCm 6.4.3-1, likely due to a known ROCm HIP runtime issue affecting resource cleanup and kernel unloading, which does not occur with earlier ROCm versions.
- issues/15613
- Linking errors due to missing cublasLt_static symbols with GGML_STATIC=ON: Building `llama-cli` with `GGML_STATIC=ON` fails due to missing references to `cublasLt_static` symbols after removal of linking with `CUDA::cublasLt_static`, causing undefined reference errors related to CUDA's cuBLAS static library.
- issues/15620
- Feature request for multi-tool call support in model API: A request to add support for multiple tool calls within a single interaction aims to improve compatibility with models defining multiple tools and enable native OpenAI REST API usage for parallel tool invocations, similar to qwen-agent's multi-tool handling.
- issues/15644
- Inquiry about Ascend 910B NPU support redirected to Q&A: A question about support for the Ascend 910B NPU was redirected to the project's Q&A discussions for further assistance, indicating no direct issue or feature implementation in the repository.
- issues/15656
- ARM64 Windows MSVC build preset removal and documentation update: Building Llama.cpp for ARM64 Windows using MSVC fails because the preset "build-arm64-windows-MSVC" no longer exists and MSVC is no longer supported for ARM64 builds, with a recommendation to update documentation to use LLVM instead.
- issues/15674
- CUDA backend compilation error due to ambiguous half type conversions: Building the CUDA backend on Linux fails because of ambiguous conversion functions from CUDA's "half" type to built-in types, preventing successful compilation of the conv2d CUDA kernel.
- issues/15680
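For the ARGMAX tie-breaking issue above, the fix amounts to making the reduction deterministic when several elements share the maximum. Below is a minimal sketch of one common convention (lowest index wins) in plain NumPy; it is illustrative only and not the ggml backend code.

```python
# Sketch: deterministic argmax tie-breaking ("lowest index wins"), illustrating the
# convention a backend test could standardize on. Plain NumPy, not ggml backend code.
import numpy as np

def argmax_first(values: np.ndarray) -> int:
    """Return the index of the maximum; on ties, always pick the smallest index."""
    best_idx = 0
    for i in range(1, len(values)):
        # Strict '>' keeps the earlier index when values are equal.
        if values[i] > values[best_idx]:
            best_idx = i
    return best_idx

v = np.array([1.0, 5.0, 5.0, 3.0])
assert argmax_first(v) == 1          # ties resolve to the first occurrence
assert int(np.argmax(v)) == 1        # NumPy documents the same convention
```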
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 32
Key Open Pull Requests
1. quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error: This pull request introduces a new `--target-bpw` option that implements an optimized algorithm to automatically select per-tensor quantization types in order to meet a specified bits-per-weight target while minimizing mean squared error, using imatrix-based error estimation that accounts for activations and applying Pareto frontier filtering to balance quantization precision and size efficiently (a toy sketch of this selection strategy appears after the key pull requests below).
- URL: pull/15550
- Merged: No
- Associated Commits: ba733, 4d949, cfec4, 5e85f, e6d55, 77b81, 0edbf, e8774, 1b3d5, a22a9, c96b8, 9adae, 01794, 92f49, 1187f, 5aceb, ee05d, f22b3, 93629, b33ab, 5cd69, 69586, 29b2d, 43caa, 52da4, 3f011, b0b33, 35ad0, 5ef49, 95b2a, e01da, 88749, 9e11f, 5b6f1, e6eef, ec0af, 35c15, bb0d9, 2f13f, 47cdb, 01c92, 897de, f05c8, fea99, 6d178, 9a4b1, f7526, 73124, 68ae5, decaf, 3856d, 61c0e, d4ac2, ccaab, 42866, 04946, 8df1d, 66aff, 556f6, eab87, 09198
2. ggml: add ops for WAN video model (cuda && cpu): This pull request adds and optimizes several operations, including conv3d, im2col_3d, and padding support for both CUDA and CPU backends, to enhance the WAN video model functionality while also fixing related issues and improving build compatibility.
- URL: pull/15669
- Merged: No
- Associated Commits: c92f9, 93c7e, f7a12, 85c8e, ae47c, dd745, d8377, d30e0, df059, 9d035, d11a7, f6a87, f6278, c9b9f, 131ae, 0d5eb, aafa7, 3f901
3. granite embedding small support (ModernBert arch): This pull request adds initial support for running the granite embedding small model based on the ModernBert architecture, including tensor mappings, model conversion from Hugging Face to GGUF format, and graph building, while addressing issues with pre-tokenizer implementation and ubatch size assertions in llama-graph.cpp.
- URL: pull/15641
- Merged: No
- Associated Commits: 61515, 6643c, ac67f, cc403, 41b68, cc3d7, 4ceb8, 18c0c, bffe3, 8f328, 98056, 40249, 853f3, 2a1c7, c73eb
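To illustrate the selection strategy described in the first key pull request (`--target-bpw`), here is a toy sketch under stated assumptions: every tensor has a few candidate quant types with estimated (bits-per-weight, error) pairs, dominated candidates are dropped Pareto-style, and a greedy pass then spends bits where they reduce error the most until the average bpw reaches the target. The candidate names and error numbers are made up, and this is not the pull request's actual implementation.

```python
# Toy sketch of choosing per-tensor quant types toward a bits-per-weight target
# (hypothetical candidates and error numbers; not the actual --target-bpw code).
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    bpw: float     # bits per weight for this quant type
    err: float     # estimated MSE introduced by this quant type

def pareto_filter(cands):
    """Keep only candidates not dominated by another (lower-or-equal bpw AND error)."""
    kept = []
    for c in cands:
        if not any(o.bpw <= c.bpw and o.err <= c.err and o != c for o in cands):
            kept.append(c)
    return sorted(kept, key=lambda c: c.bpw)

def choose(tensors, target_bpw):
    """Start every tensor at its cheapest type, then spend bits where they cut error most."""
    frontier = {t: pareto_filter(cands) for t, cands in tensors.items()}
    choice = {t: cands[0] for t, cands in frontier.items()}

    def avg_bpw():
        return sum(c.bpw for c in choice.values()) / len(choice)

    # Greedy upgrades; a real implementation would also guard against overshooting the target.
    while avg_bpw() < target_bpw:
        best = None
        for t, cands in frontier.items():
            cur = choice[t]
            nxt = next((c for c in cands if c.bpw > cur.bpw), None)
            if nxt is None:
                continue
            gain = (cur.err - nxt.err) / (nxt.bpw - cur.bpw)   # error saved per extra bit
            if best is None or gain > best[0]:
                best = (gain, t, nxt)
        if best is None:
            break                      # no tensor can be upgraded further
        choice[best[1]] = best[2]
    return choice

tensors = {
    "blk.0.attn_q": [Candidate("q4", 4.5, 0.020), Candidate("q5", 5.5, 0.012), Candidate("q6", 6.5, 0.008)],
    "blk.0.ffn_up": [Candidate("q4", 4.5, 0.050), Candidate("q5", 5.5, 0.022), Candidate("q6", 6.5, 0.015)],
}
print(choose(tensors, target_bpw=5.5))
```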
Other Open Pull Requests
- Nemotron thinking and toolcalling with streaming: This topic covers the introduction of Nemotron thinking and toolcalling with streaming capabilities, enhancing reasoning functionality as a follow-up to a previous update. It also includes a fix for a crash in the Hermes 2 tool calling feature related to newline placement before the `<tool_call>` tag.
- [pull/15676, pull/15639]
- Vulkan backend improvements and optimizations: Multiple pull requests address Vulkan backend enhancements, including adding a memory budget extension for accurate memory tracking, optimizing the mul_mat_id function for NVIDIA GPUs, implementing a fallback to system memory when video memory is full, clamping matrix multiplication results for fp16 precision, and updating validation extension checks. These changes improve performance, stability, and memory management in Vulkan operations.
- [pull/15545, pull/15546, pull/15630, pull/15649, pull/15652, pull/15662]
- Sampling and matrix multiplication optimizations: This includes optimizing the sampling process by reusing bucket sort to reduce memory allocations and updating the API, as well as introducing a function to enforce 32-bit floating point accumulators in matrix multiplications and optimizing the MUL_MAT_ID operation for the CANN backend. These pull requests focus on improving computational efficiency and precision.
- [pull/15665, pull/15614, pull/15655]
- Server and endpoint enhancements: Pull requests in this area enable the `/slots` endpoint by default with improved security, update the server README, extend the `/props` endpoint to list enabled endpoints, and refactor the server codebase into modular files to improve maintainability and readability without changing functionality.
- [pull/15630, pull/15632]
- Memory usage and conversion improvements: These pull requests improve memory usage during model conversion by streaming NumPy dtype casts to reduce peak RAM usage and directly parsing safetensors files to avoid eager memory mapping, significantly lowering memory consumption on multiple platforms (a minimal sketch of the streamed cast appears after this list).
- [pull/15642, pull/15666]
- Bug fixes and assertion handling: This topic includes fixing an assertion failure related to the KleidiAI backend in whisper.cpp and correcting the carry calculation in the ggml_compute_forward_dup_f16 function by using destination dimension indices instead of source indices for fp16 precision.
- [pull/15611, pull/15619, pull/15626]
- Documentation and benchmarking updates: Partial documentation updates for llama.cpp and ggml codebases include asciidoc formatting, minor fixes, and removal of irrelevant references. Additionally, a comprehensive benchmark script was merged from upstream to enhance project testing capabilities.
- [pull/15601, pull/15639]
- New features and parameter additions: This includes adding a new `pad_equal` method for batching sequences to enable multi-stream attention parallelization and introducing a new parameter `t_max_predict_ms` to set a time limit for the prediction phase via CLI, addressing user feature requests.
- [pull/15636, pull/15648]
- Synchronization and race condition prevention: A synchronization call was added before exiting the argsort kernel in the ggml-sycl module to ensure all kernels complete execution and prevent potential race conditions.
- [pull/15582]
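The streamed-cast idea behind the memory-usage pull requests above can be shown in isolation. The sketch below is illustrative only (hypothetical chunk size, array, and file name, not the conversion-script patch): casting a large array to the output dtype in slices keeps only one chunk's worth of converted data in memory instead of materializing a full-size temporary.

```python
# Sketch of streaming a dtype cast in chunks to cap peak memory (illustrative only,
# not the convert_hf_to_gguf.py patch): each slice is converted and written out
# before the next one is produced, so no full-size converted copy ever exists.
import numpy as np

def stream_cast(src: np.ndarray, dst_dtype, out_file, chunk_rows: int = 1024) -> None:
    for start in range(0, src.shape[0], chunk_rows):
        chunk = src[start:start + chunk_rows].astype(dst_dtype)  # small temporary
        chunk.tofile(out_file)                                   # flushed immediately

src = np.random.rand(8192, 4096).astype(np.float32)   # stand-in for a large tensor
with open("tensor.f16.bin", "wb") as f:
    stream_cast(src, np.float16, f, chunk_rows=512)
```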
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 69
Key Closed Pull Requests
1. Feature/txe sqr: This pull request introduces a new kernel called "tsi-sqr" to the ggml-tsi-kernel branch, enabling the TSavorite backend to perform square operations with profiling and testing integrated for the TXE device.
- URL: pull/15567
- Merged: No
- Associated Commits: bb1f9, 69953, 68410, 1a151, ca06b, d9dd8, 9a144, 441fd, 9d65b, f9197, 614da, 9459c, c1858, 47cef, 2ea93, cea50, a7b7e, d4484, bbecb, d7685, 17d09, 96889, c369a, 77a3e, a4b77, 21ba6, 597f9, ce310, 8a5ff, 52ae0, f1dcd, ffe04, 3211f, a411f, ca783, 41d98, 1b474, 2aeae, 52e4a, 66c37, 61915, cd734, f53f2, 1a9ba, 6047d, d7330, f5713, 15e73, ef9f7, d6bba, 08dc5, 70223, 9599a, 744c3, 53889, 1f16b, e0fee, 0130d, c43a9, a9018, a6918, c7e93, 86037, d1bb3, 7dc80, 36ea7, 8d3e8, d5ca4, 44daa, 83de2, fa46e, 87361, ad4c6, 2e9d8, 42459, b1e08, c27a4, 042cf, c40ca, 4400b, 71527, cab6b, 8f8a5, a5441, 2e5b0, fa224, 7d0eb, 97955, dabd7, f620a, 61fb7, 97b51, 1ca88
2. Addresses #15409 - Support for NVIDIA Nemotron-H hybrid architecture models DRAFT: This pull request introduces initial support for the NVIDIA Nemotron-H hybrid architecture, which combines Mamba2 state-space model layers with selective transformer attention layers for efficient inference, addressing tensor dimension issues and optimizing KV cache allocation, though it remains in draft status until full inference functionality is achieved.
- URL: pull/15572
- Merged: No
- Associated Commits: cce8c, 423d8, f1acd, 175d6, 3a99e, 1f55a, 62acc, cc9b9, 36dc3, 65790, 3df06, 15445, ca4c9, a5569, e2b0d, 0d972, 3efbb, 74368, bfc23, 2ebaa, 497d7, 7c668
3. [SERVER] Added documentation for parallel_tool_calls param: This pull request proposes adding documentation for the existing `parallel_tool_calls` parameter in the server component, clarifying its usage and functionality as tested on the Qwen4B model, although it was not merged.
- URL: pull/15646
- Merged: No
- Associated Commits: 08002, e6ad7, c05e7, b745e, 599ea, 6e919, 14b31, 6a30f, a72cd, 05391, 6eb75, c77cd, 2e6ed, 27a3c, d4145
Other Closed Pull Requests
- CUDA and GPU Performance Optimizations: Multiple pull requests focus on enhancing GPU performance through CUDA and Vulkan optimizations. These include fusing add operations and RMS normalization for NVIDIA GPUs, optimizing Metal backend kernels, moving MoE MMQ CUDA kernel code to device, applying MUL_MAT_ID subgroup optimizations to Vulkan GPUs, and accelerating MXFP4 vector dot products on NVIDIA hardware, resulting in significant speedups and reduced memory usage across various models and configurations.
pull/15631, pull/15541, pull/15525, pull/15524, pull/15451, pull/15412, pull/15563
- Flash Attention Improvements: Several pull requests improve Flash Attention by optimizing vector kernels for large sequences and small batch sizes, refactoring mask computation for better performance and memory layout, and setting FlashAttention as the default configuration. These changes lead to significant performance boosts, especially for long sequences and mixture-of-experts models, and improve usability for first-time users.
- pull/15566, pull/15561, pull/15434, pull/15249
- Model Support Additions: New model architectures and variants have been added, including support for the Seed OSS model with reasoning and tool-calling, the nemotronh architecture for Nvidia Nemotron Nano V2, the Kimi VL model with dynamic resolution, and the InternLM Intern-S1-mini model. These additions expand the range of supported models and improve tensor loading diagnostics and conversion processes.
- pull/15552, pull/15507, pull/15458, pull/15412
- Kernel Fusion and Operation Fusion Enhancements: Pull requests introduce fused OpenCL kernels for group normalization, multiplication, and addition, fix bugs in CUDA rms_norm fusion, and improve fusion detection logic for sequences involving group normalization followed by reshape, multiply, and add. These changes address limitations in fusion detection and improve stability and performance of fused operations.
pull/15314, pull/15660
- Memory and Cache Handling Fixes: Fixes include correcting unified key-value cache handling by accounting for split KV cache, removing unnecessary contiguous memory assertions in IM2COL, and adjusting performance profiling timing to exclude KV cache copying. These fixes improve correctness and profiling accuracy in memory and cache management.
- pull/15562, pull/15577, pull/15524
- Vectorization and CPU Support Enhancements: Basic support for RVV vector float32 operations was added by refactoring vectorization logic to handle RVV's flexible vector length and sizeless intrinsic types. This required rewriting code to overcome limitations with traditional loop unrolling techniques, enhancing CPU module capabilities.
- pull/15057
- New Features and Templates: A dedicated model card template for embedding models was introduced to better accommodate their unique server commands and provide relevant additional information not applicable to causal models. This improves documentation and usability for embedding-specific models.
- pull/15557
- Workflow and Release Proposals: A proposal was made to add support for pre-built CUDA-compatible binaries on Ubuntu in the release workflow, similar to existing Windows support, though it has not been merged. This aims to improve release usability across platforms.
- pull/15249
- Testing and Stability Improvements: Tests were added to generate unique input values for the count_equal function to prevent backend-dependent behavior in argmax, avoiding intermittent test failures. This enhances test reliability and stability.
- pull/15436
- ROPE Backend Optimizations: Performance improvements to the ROPE implementation in the CANN backend were made by caching `sin_repeat` and `cos_repeat` values, demonstrated by extensive testing across configurations. This results in more efficient ROPE computations (a generic sketch of sin/cos caching appears after this list).
- pull/15501
- New Operation Implementations: A CUDA implementation of the conv2d operation was added, including initial addition and subsequent formatting and const correctness improvements, addressing a long-standing issue. This expands the operation support in the project.
- pull/15604
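The ROPE caching item above boils down to the observation that the sin/cos tables depend only on position and head dimension, so they can be computed once and reused across tensors and calls. The NumPy sketch below illustrates that general idea (interleaved-pair rotation, made-up sizes); it is not the CANN kernel or llama.cpp's RoPE implementation.

```python
# Sketch: precompute RoPE sin/cos tables once and reuse them, the general idea behind
# caching sin/cos values in a backend (illustrative NumPy, not the CANN kernel).
import numpy as np

def rope_tables(n_pos: int, head_dim: int, base: float = 10000.0):
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)      # (head_dim/2,)
    angles = np.outer(np.arange(n_pos), inv_freq)                    # (n_pos, head_dim/2)
    return np.cos(angles), np.sin(angles)                            # cache these

def apply_rope(x: np.ndarray, cos: np.ndarray, sin: np.ndarray) -> np.ndarray:
    """x: (n_pos, head_dim) with even head_dim; rotate each (even, odd) pair."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

cos, sin = rope_tables(n_pos=128, head_dim=64)     # computed once...
q = np.random.rand(128, 64).astype(np.float32)
k = np.random.rand(128, 64).astype(np.float32)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)   # ...reused per tensor
```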
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 79 | 16 | 0 | 57 |
| CISC | 64 | 6 | 0 | 50 |
| JohannesGaessler | 33 | 7 | 0 | 60 |
| jeffbolznv | 38 | 11 | 1 | 40 |
| slaren | 22 | 2 | 0 | 50 |
| pwilkin | 35 | 2 | 3 | 24 |
| wine99 | 61 | 1 | 0 | 0 |
| EAddario | 61 | 1 | 0 | 0 |
| gabe-l-hart | 24 | 4 | 1 | 29 |
| ngxson | 30 | 1 | 0 | 21 |