Weekly Project News


Weekly GitHub Report for Llama.cpp: April 28, 2025 - May 05, 2025 (12:02:38)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The release published on March 29, 2025 introduces updates and changes, but the available release data does not include detailed notes, so no specific highlights or trends can be summarized here.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Misc. bug: Qwen3 30B A3B Q4_K_M loads on server but quickly dies after requesting inference through Llama.cpp web UI: This issue involves a bug where the Qwen3 30B A3B Q4_K_M model loads successfully on a server but crashes when an inference request is made through the Llama.cpp web UI. The problem appears to be related to the Vulkan backend, as the model works without issues on other backends like ROCm and CPU-only builds, and a temporary workaround involves reducing the batch size to prevent the crash.

    • Multiple users confirm experiencing the same issue, with discussions focusing on reducing the batch size as a temporary fix. Some users report that the problem is specific to the Vulkan backend, while others note that the model works fine with other backends. There are suggestions to increase certain limits in the code, but concerns are raised about shared memory constraints.
    • Number of comments this week: 21
  2. bug: ValueError: Architecture qwen3 not supported: This issue is about a user encountering a ValueError when attempting to load a model with the architecture 'qwen3', which is not supported by the current version of the software they are using. The user is trying to use a wrapper on llama.cpp for quantization but is facing errors despite the presence of GGUF files for the model on the Hugging Face hub.

    • The comments suggest that support for the 'qwen3' architecture was added recently, and users are advised to update to a more recent version of the software. Despite attempts to update, some users continue to face the same error, possibly due to running outdated builds or issues with their environment setup. There is a discussion about ensuring the latest version is used, and some users report similar issues on different hardware setups, indicating a broader problem with the model loading process.
    • Number of comments this week: 11
  3. Eval bug: grammar breaks sampling in Qwen3 MoE: This issue involves a bug in the Qwen3 MoE model where the grammar used in the prompt sometimes causes the model to output incorrect tokens, despite having a fixed seed and low temperature, which should ensure consistent outputs. The problem is particularly evident when using CUDA, as the model occasionally fails to generate the expected token sequence, leading to broken outputs.

    • The comments discuss the limitations of using grammar to enforce specific tokens, noting that grammar can only enforce character sequences, not specific tokens. There is a debate about whether the randomness in output is expected given the fixed seed and zero temperature, with some attributing the issue to CUDA's numerical instability. Further tests are conducted to analyze the log probabilities of the tokens, revealing unexpected variability, and suggestions are made to check for hardware issues that might affect determinism.
    • Number of comments this week: 8
  4. Misc. bug: Qwen 3.0 "enable_thinking" parameter not working: This issue reports a bug in the Qwen 3.0 model where the enable_thinking parameter does not function as expected when used with the llama-server, despite being documented in the official examples. The user highlights that while other platforms like SGLang and VLLM support this feature through "chat_template_kwargs," the parameter seems to have no effect in the current setup.

    • The comments discuss alternative methods to control the "thinking" mode, such as using "/think" or "/no_think" in prompts, which affect the output format. A workaround is suggested by adding specific tags in the assistant message, and there is a request for the enable_thinking parameter to be implemented in llama.cpp (a sketch of the prompt-based workaround appears after this list).
    • Number of comments this week: 6
  5. Compile bug: nvcc fatal : Unsupported gpu architecture 'compute_120': This issue involves a compilation error encountered when attempting to build a project using CUDA on a Windows system after replacing an NVIDIA 3090 GPU with a 5070 GPU. The error message "nvcc fatal: Unsupported gpu architecture 'compute_120'" suggests that the CUDA version being used does not support the architecture of the new GPU, and the user proposes a potential fix by specifying a different CUDA architecture during compilation.

    • The comments discuss troubleshooting steps, including ensuring the build directory is recreated and verifying the CUDA version's compatibility with the new GPU. It is clarified that recompilation is necessary when changing GPUs due to specific compute capabilities. The issue was resolved by updating to a compatible CUDA version, as the initial version was outdated.
    • Number of comments this week: 6
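
As a concrete illustration of the prompt-based workaround mentioned in item 4, the sketch below sends a request with Qwen3's "/no_think" soft switch appended to the user message. It assumes a local llama-server instance with a Qwen3 model loaded, listening on localhost:8080 and exposing the OpenAI-compatible /v1/chat/completions endpoint, plus the Python requests package; the host, port, and prompt are placeholders, not details from the issue.

```python
# A minimal sketch of the prompt-based workaround from the discussion above.
# Assumptions (not part of the original issue): a llama-server instance with a
# Qwen3 model loaded is listening on localhost:8080 and exposes the
# OpenAI-compatible /v1/chat/completions endpoint; `requests` is installed.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            # Qwen3's soft switch: appending "/no_think" asks the model to skip
            # the thinking block, while "/think" requests it explicitly.
            {
                "role": "user",
                "content": "Summarize what llama.cpp does in one sentence. /no_think",
            }
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```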

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue pertains to a problem with the Kompute-based Vulkan backend in a GitHub project, where it is generating a GGML_OP_GET_ROWS error. The error does not occur with the other Vulkan backend, indicating a specific compatibility or implementation issue with the Kompute-based version.
  2. Feature Request: Task Cancellation on Client Disconnection: This issue is about a feature request to enhance the current embedding server setup by implementing task cancellation when a client disconnects, as the existing system continues processing queued tasks even after a client cancels a request, leading to inefficiencies and potential server overload. The issue highlights the need for a mechanism to terminate task processing upon request cancellation to prevent server paralysis and ensure that new requests are processed promptly without delay.
  3. Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
  4. common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate and non-conflicting progress reporting during these parallel download operations.
  5. kubernetes example: This issue is about the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a widely used platform for deploying applications at scale. The author has initiated the work on this project but is seeking assistance from the community to continue its development.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 36

Summarized Issues:

  • Flash Attention and Performance Issues on CDNA3 ROCm 6.4 MI300 platform: The Flash Attention feature in the llama-server module is not functioning correctly on the CDNA3 ROCm 6.4 MI300 platform, leading to performance degradation and excessive CPU usage. This issue is particularly evident when processing large context windows, whereas smaller prompts operate at a higher token rate.
    • issues/13145
  • Shared Libraries and Function Inclusion Problems: Shared libraries like libllama.dylib or .so do not properly include essential functions from the /common directory, even when built with BUILD_SHARED_LIBS=1. This issue suggests modifying the build configuration to ensure these functions are included, potentially by switching from static to shared library creation or introducing a new build flag.
    • issues/13156
  • Model Loading and Architecture Support Issues: Users encounter a ValueError due to unsupported architecture 'qwen3' when loading models using a wrapper on llama.cpp for quantization. Updating to a more recent version of the software might resolve the problem, as suggested by other users who have successfully run the model with newer builds.
    • issues/13157
  • Qwen 3.0 Model and Parameter Functionality Bugs: The "enable_thinking" parameter in the Qwen 3.0 model's llama-server module does not function as intended, despite being documented in examples. Users are discussing alternative methods to control the thinking mode in prompts.
    • issues/13160
  • SIGILL Error on Arm64 Architecture with GGUF Models: Running the llama-server application with specific GGUF models on an Arm64 architecture results in a SIGILL (Illegal Instruction) error. This problem does not occur when the application is built from source.
    • issues/13161
  • Inference and SYCL Issues on Windows: Attempting inference with the Qwen3 Q4_0 model using SYCL on Windows causes the screen to briefly go black and the application to malfunction. The problem does not occur with the Q4_K_M model or when using a CUDA build, and a workaround involves disabling SYCL reorder optimizations.
    • issues/13163
  • Vulkan-Related Crashes with Qwen3 30B A3B Q4_K_M Model: The Qwen3 30B A3B Q4_K_M model crashes upon requesting inference when loaded on a server using the Llama.cpp web UI, potentially due to a Vulkan-related problem. A temporary workaround is reducing the batch size to prevent assertion errors.
    • issues/13164
  • Unreadable Output with Qwen2-VL Model: The output generated by the Qwen2-VL model using the llama-qwen2vl-cli command is unreadable, despite the model being loaded and executed on a system with Intel UHD Graphics 770 and Vulkan backend.
    • issues/13165
  • Assertion Failure with Qwen3 30B A3B Q4_0 Model on Linux: The Qwen3 30B A3B Q4_0 model fails to run on a Linux system with an Intel Core i9-14900K CPU due to an assertion failure in the GGML library during the model's initialization process.
    • issues/13168
  • Segmentation Fault in llama-parallel Module: A segmentation fault occurs in the llama-parallel module when executed on a Linux system with an NVIDIA GeForce RTX 3090 GPU, specifically when attempting to load a GGUF file, resulting in a crash due to an error in the libllama.so library.
    • issues/13172
  • Unsupported List Slicing Syntax in Qwen3 Model: The Qwen3 model in llama.cpp fails to parse a chat template due to an unsupported list slicing syntax in Jinja, specifically the [::-1] reverse slicing, leading to errors in processing chat messages.
    • issues/13178
  • RPC Server Crashes and Security Warnings: The rpc-server crashes when started as a background process without specifying a cache, and includes warnings about the security risks of exposing the RPC server to an open network.
    • issues/13185
  • Persistent Tags in Qwen3-32B Model Output: The Qwen3-32B model continues to emit <think> tags in its output despite settings that should suppress them. The user seeks guidance on how to disable these tags server-side.
    • issues/13189
  • Performance Issues with llama-server and CPU Utilization: The llama-server is unable to utilize all 16 threads or 8 CPU cores for prompt processing, resulting in significantly slower performance compared to llama-cli, making llama-server nearly unusable for large models.
    • issues/13197
  • Performance Discrepancies Between CUDA and Vulkan Backends: The Qwen3 30B A3B model runs significantly slower using the CUDA backend compared to the Vulkan backend on a multi-GPU setup. The user experiences only 40-50 tokens per second (tps) on CUDA versus 80-90 tps on Vulkan.
    • issues/13211
  • Vulkan Backend Performance on AMD 780m GPU: The Vulkan backend on an AMD 780m GPU is approximately 10% slower than the AVX2 backend when running the Qwen3-30B-A3B-Q4_K_M model. Removing the -fa flag is suggested to significantly improve Vulkan's performance.
    • issues/13217
  • Feature Request for XiaomiMiMo/MiMo-7B-RL Model Support: A feature request to add support for the XiaomiMiMo/MiMo-7B-RL model in the ggml-org/llama.cpp project, as the MiMoForCausalLM model is currently unsupported.
    • issues/13218
  • GGML_ASSERT Failure with Llama 4 Scout Model: Using the -sm row option with the Llama 4 Scout model causes a GGML_ASSERT failure, rendering the llama-cli unresponsive. It is noted that -sm row is not supported for MoE models.
    • issues/13240
  • Feature Request to Disable offload_op Function: A feature request to allow users to manually disable the offload_op function in the llama.cpp project without reducing the batch size, as the current implementation negatively impacts prompt processing performance.
    • issues/13241
  • Feature Request for s390x Architecture CI: A feature request to implement continuous integration (CI) for the s390x architecture in the ggml-org/llama.cpp project, seeking guidance on updating the ci/run.sh script to include s390x CI.
    • issues/13243
  • Support for Multimodal LLMs in llama.cpp: A request to add support for multimodal large language models (LLMs) like Qwen2.5-VL in the llama.cpp project to enhance image and text embeddings, highlighting the need for compatibility with models such as nomic-embed-multimodal-3b.
    • issues/13247
  • Vulkan Device Memory Allocation Issues on Laptops: Running the llama.cpp application on laptops using Vulkan results in a "vk::DeviceLostError" due to device memory allocation issues, despite the "llama-bench" command working correctly.
    • issues/13248
  • Vulkan Buffer Allocation Error on AMD Ryzen APU: An error is encountered when using the qwen2.5-vl model on an AMD Ryzen APU under Windows, where the system fails to allocate a Vulkan buffer of 4342230552 bytes (roughly 4 GiB).
    • issues/13250
  • CUDA and ggml_cuda_compute_forward Function Bug: A bug in the ggml_cuda_compute_forward function where a MUL_MAT operation fails when using the -fa flag with the DeepSeek V3 0324 model on a mixed CPU and GPU setup.
    • issues/13252
  • Sentencepiece Tokenizer Bug in llama.cpp: The llama.cpp implementation of the sentencepiece tokenizer generates incorrect token sequences that differ significantly from those produced by HuggingFace tokenizers and the official SentencePiece implementation (a comparison sketch follows this list).
    • issues/13256
  • GGUF Format Issue in llama-quantize Module: The llama-quantize module generates an output file not in the expected GGUF format, leading to errors when attempting to load the model.
    • issues/13258
  • HTTP Request Handling Bug in llama-server: The llama-server continues to generate responses for HTTP requests even after the client has disconnected, leading to unnecessary processing and potential server hang-ups.
    • issues/13262
  • CUDA Compiler Error with Unsupported GPU Architecture: A compilation error occurs when using the NVIDIA CUDA compiler (nvcc) due to an unsupported GPU architecture 'compute_120', resolved by updating the CUDA version and specifying the correct compute capability.
    • issues/13271
  • Feature Request for Per-Request "Reasoning" Options: A feature request for the llama-server to incorporate per-request "reasoning" options, allowing users to specify a reasoning token budget and format.
    • issues/13272
  • Feature Request for bf16:1 Reporting in llama-bench: A feature request to enhance the llama-bench tool by adding a reporting feature for "bf16:1" to indicate support for the VK_KHR_bfloat16 extension.
    • issues/13274
  • Integration of IBM's Granite 4 Model Architecture: Tracking the development and integration of support for IBM's Granite 4 model architecture into the llama.cpp project, involving multiple work streams and ensuring compatibility with various backends.
    • issues/13275
  • Web UI Command Line Parameter Override Issue: The llama-server web user interface (webui) overrides command line parameters with its own saved settings, complicating the process of swapping models.
    • issues/13277
  • Non-Deterministic Outputs in Qwen3 MoE Model: The Qwen3 MoE model's grammar sometimes results in incorrect token generation due to potential numerical instability with CUDA, leading to non-deterministic outputs.
    • issues/13280
  • Indefinite Hang Due to CUDA Error in Completions Process: The completions process hangs indefinitely after encountering a CUDA error related to illegal memory access, with the health endpoint incorrectly reporting the system status as "ok."
    • issues/13281
  • Illegal Memory Access Crash in Llama 4 Scout Model: A bug introduced in commit b5237 causes a crash due to illegal memory access when running a perplexity test on the Llama 4 Scout model using CUDA.
    • issues/13287
  • Compile Bug with Vulkan Backend on Unix Systems: A compile bug on Unix systems using the Vulkan backend occurs when paths containing spaces cause the compilation of Llama.cpp to fail.
    • issues/13288
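
To see the kind of divergence described in the sentencepiece tokenizer item above (issues/13256), one can tokenize the same string with both implementations and compare the results. The sketch below is illustrative only: the model id is a hypothetical placeholder, the /tokenize call assumes a local llama-server instance on localhost:8080, and special-token handling options may need adjusting for a given setup.

```python
# Rough reproduction sketch for the tokenizer discrepancy in issues/13256:
# tokenize the same text with llama-server's /tokenize endpoint and with the
# HuggingFace tokenizer, then compare. The model id below is a hypothetical
# placeholder, and the local server address is an assumption.
import requests
from transformers import AutoTokenizer

TEXT = "Hello, world! How are you?"
MODEL_ID = "some-org/some-sentencepiece-model"  # hypothetical placeholder

# Token ids from llama.cpp's tokenizer, via a running llama-server instance.
llama_tokens = requests.post(
    "http://localhost:8080/tokenize",
    json={"content": TEXT},
    timeout=30,
).json()["tokens"]

# Token ids from the HuggingFace tokenizer for the same model.
hf_tokens = AutoTokenizer.from_pretrained(MODEL_ID).encode(
    TEXT, add_special_tokens=False
)

print("llama.cpp  :", llama_tokens)
print("huggingface:", hf_tokens)
print("identical  :", llama_tokens == hf_tokens)
```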

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 21

Summarized Issues:

  • CUDA and GPU-related bugs: Several issues have been reported regarding CUDA and GPU-related bugs in the llama.cpp project. These include a CUDA error on an NVIDIA GeForce GTX 1080 causing a failure in the ggml_cuda_compute_forward function, and a bug in the 'clip_model_loader' function affecting model loading on a Linux system using CUDA with A100 hardware.
    • issues/12973, issues/13147
  • CPU and performance issues: Performance issues have been identified in the llama.cpp project, such as the LM Studio software utilizing only one compute core on an AMD Ryzen 7 7800 X3D CPU, and a performance bottleneck in the llama-server module when the top_k parameter is set to 0 or -1.
    • issues/12978, issues/13171
  • Server and RPC bugs: The llama.cpp project has encountered server and RPC-related bugs, including a crash in the RPC server when processing a SET_TENSOR command with an invalid ggml_type, and a failure in prompt caching in the llama-server, leading to increased processing time.
    • issues/13067, issues/13126
  • Compilation and build failures: Compilation and build failures have been reported, such as a build failure when compiling a Docker image for the llama-server-cuda project using CUDA, and a compilation error on the ppc64le architecture due to errors in the simd-mappings.h file.
    • issues/13166, issues/13170
  • Model and architecture issues: Various model and architecture issues have been identified, including the malfunctioning of the Plamo architecture resulting in nonsensical output, and a bug in the Qwen3 model where using a JSON schema results in an empty response.
    • issues/13130, issues/13212
  • Docker and workflow issues: The llama.cpp project has faced issues with Docker and workflows, such as the "Publish Docker image" workflow failing since April 2025, causing Docker images to be stuck at an older build, and a compile error related to the integration of curl.
    • issues/13203, issues/13213
  • Model conversion and quantization issues: Problems have been reported with model conversion and quantization, including a failed attempt to quantize the Qwen3-30B-A3B model due to an unrecognized model architecture, and an inability to convert the "nomic-embed-code" model to the GGUF format.
    • issues/13200, issues/13242
  • Backend and hardware compatibility issues: Backend and hardware compatibility issues have been noted, such as the llama-bench tool crashing or failing to produce meaningful output across various configurations, and the Qwen3-30B-A3B model throwing an assertion error when using the Vulkan backend.
    • issues/13169, issues/13233
  • Feature requests and architectural proposals: Feature requests and proposals for architectural changes have been made, including enabling the -hf option in the llama-cli tool to function offline, and a proposal to redesign the architecture to facilitate the integration of new models and algorithms.
    • issues/13128, issues/13227

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 23

Key Open Pull Requests

1. Support start strings, the opposite of stop tokens.: This pull request introduces the ability to set one or more start strings that act as the opposite of stop tokens: output is discarded until a start string is detected, after which it is sent to the client, even in streaming mode. This is intended to support clients that do not handle reasoning models well, such as GitHub Copilot (an illustrative client-side sketch of this behavior follows the key pull requests).

  • URL: pull/13214
  • Merged: No
  • Associated Commits: e57d8, 5c0c0, 513c4, a4b52, e4f48, a7349, b8436, 79260, 124a9, 5f99f, 52186, 8a12c, 89d0c, 0c65d, 3bead, a9432

2. mtmd : (WIP) add vision support for llama 4: This pull request aims to add vision support for Llama 4 by implementing a conversion process that allows the model to run inference with image inputs, although it currently faces limitations such as restricted image resolution and incorrect image perception, and includes critiques of the official implementation's complexity and redundancy.

  • URL: pull/13282
  • Merged: No
  • Associated Commits: c912c, a67a1, 10db5, 55ad3, c50e6, 8775b, 7341e, 893ad, 15605, 32a62, 97a5c, c6c2d, 224cb, 9d1a4, 532c3

3. mtmd : add C public API: This pull request introduces a C public API for the mtmd library by creating a C-only wrapper around C++ types, converting structs containing C++ types into opaque pointers, and adding C++ convenient wrappers to manage memory and prevent manual free() calls, thereby addressing issue #13124 on the GitHub project.

  • URL: pull/13184
  • Merged: No
  • Associated Commits: 4a4f3, f6b65, e0806, 82f42, f8c27, 33579, 92d24, 111d5, 08d0f, a2308, 863db, a0fb7, 6bc7a, 4d842
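
Because the start-string behavior in pull/13214 is easiest to grasp by example, the sketch below approximates the idea on the client side: streamed text is discarded until a start string is seen, and everything after it is passed through. This is not the pull request's server-side implementation, only an illustration of the behavior it describes.

```python
# Illustrative client-side approximation of the "start strings" idea from
# pull/13214: streamed text is discarded until a start string appears, and
# everything after it is passed through. This is NOT the pull request's
# server-side implementation, only a sketch of the behavior it describes.
def filter_after_start(chunks, start_strings):
    """Yield streamed text, suppressing everything before the first start string."""
    buffer = ""
    started = False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        buffer += chunk
        for s in start_strings:
            idx = buffer.find(s)
            if idx != -1:
                started = True
                # Emit only the text that follows the start string.
                yield buffer[idx + len(s):]
                break


# Example: hide a reasoning block and surface only the final answer.
stream = ["<think>working", " it out...</think>", "The answer", " is 42."]
print("".join(filter_after_start(stream, ["</think>"])))  # -> The answer is 42.
```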

Other Open Pull Requests

  • Top-nσ Sampler Integration: This pull request integrates the Top-nσ sampler into the main sampling chain of the llama.cpp project, allowing it to be combined with other sampling methods such as min_p. It makes the sampler available in the llama-server, thereby removing the previous special case handling and enhancing the flexibility of sampling configurations.
    • pull/13264
  • Jinja Template Parameters: This pull request introduces the capability to handle additional Jinja template parameters, specifically enabling the "enable_thinking" feature for Qwen3 models. It allows these parameters to be set from both the command line and the client, addressing issues #13160 and #13189 in the llama.cpp project (an example request appears after this list).
    • pull/13196
  • GraniteMoEShared Architecture: This pull request introduces the GraniteMoEShared architecture to the project, enhancing the existing GraniteMoE model by incorporating a shared expert into each MoE layer. It aligns with the implementation found in the Hugging Face Transformers library and serves as a foundational component for the newly released Granite 4 architecture.
    • pull/13269
  • Naming Conventions in clip.cpp: This pull request addresses the issue of incorrect naming conventions in the clip.cpp file by fixing the reversed ffn_up and ffn_down labels. It aligns them with the new model standards and includes renaming variables to match the llama.cpp style, ensuring compatibility with both new and existing GGUF models.
    • pull/13290
  • Windows Build Process Enhancement: This pull request focuses on enhancing the build process for Windows releases of the llama project by utilizing dynamic loading backends and switching to the clang compiler for improved performance. It consolidates the Windows x64 CPU build into a single release, removes outdated builds and toolchain files, and optimizes caching mechanisms.
    • pull/13220
  • Xiaomi Mimo Support: This pull request introduces a work-in-progress feature to the llama project, adding support for Xiaomi Mimo with a multi-token prediction (MTP) layer. It allows the model to generate either the next N+1 or N+2 tokens from N input tokens, utilizing 36 normal layers and an additional MTP layer.
    • pull/13236
  • CMake Shader Feature Detection: This pull request introduces a CMake function to streamline the logic for detecting shader features. It ensures that necessary preprocessor definitions are automatically shared with the vulkan-shaders-gen project via an auto-generated CMake file, eliminating the need for manual transmission.
    • pull/13263
  • SYCL Reorder Optimization: This pull request addresses a workaround for issue #13163 by disabling the reorder optimization by default in SYCL. It ensures that tensor extras are not set when this optimization is disabled, as detailed in the commits linked to the pull request.
    • pull/13254
  • muBLAS and MMA Support on MUSA Devices: This pull request enables support for muBLAS and MMA on MUSA (QY2) devices within the project, ensuring compatibility and improved performance. It includes a note that the author will rebase the changes once another pull request is merged into the master branch.
    • pull/13149
  • CANN Model Support Update: This pull request aims to update the support status of the CANN model within the project, as indicated by the title and the commit message. It is currently not merged.
    • pull/13162
  • Graph Operand Validation: This pull request introduces a static helper function validate_graph_operands in ggml-rpc.cpp to ensure that each node in a deserialized compute graph has all required non-null input operands before computation. It prevents crashes or undefined behavior in the backend by rejecting malformed graphs early in the RPC server process.
    • pull/13167
  • Unified KV Cache Redesign: This pull request aims to redesign the unified KV cache to support sliding-window attention (SWA) for reducing memory usage in models like Gemma3. It raises concerns about the inability to perform context caching due to the loss of old KV data when the window slides, which could impact features like prefix caching and context reuse.
    • pull/13194
  • TMAC Matrix Multiplication Method: This pull request introduces a new Lookup-Table (LUT)-based matrix multiplication method called TMAC, which refactors previous code to eliminate third-party dependencies. It integrates LUT codes under the ggml directory and introduces a series of TMAC_* data types to support Bitnet-like and GPTQ models, aiming to improve matrix multiplication efficiency and flexibility in the llama.cpp project.
    • pull/13206
  • OpenCL Code Assertion Removal: This pull request proposes the removal of an unnecessary assertion in the OpenCL code for the add function, as it does not require both inputs to be contiguous. It can be reviewed at the following URL: https://github.com/ggml-org/llama.cpp/pull/13257.
    • pull/13257
  • Vulkan Backend Type Support: This pull request introduces additional type support in the Vulkan backend for unary and binary operations, as well as copy operations. It enables functionality such as f16 to f32 copy, f16 to f16 and f32 to f32 unary operations, and supports all combinations of f16/f32 for source and destination in add, subtract, multiply, and divide operations.
    • pull/13266
  • llama_kv_cache_hybrid Implementation: This pull request introduces an initial implementation of the llama_kv_cache_hybrid feature, which integrates the llama_memory_i and llama_kv_cache interfaces. It aims to support the newly released Granite 4 model architecture, although it is acknowledged that the current implementation may not be entirely correct and is open for discussion.
    • pull/13276
  • Infill Example Removal: This pull request proposes the removal of an example labeled "infill" from the project repository, as it is deemed no longer useful. It includes a single commit with the message "examples : remove infill" linked to the provided URL.
    • pull/13283
  • llama_context Logic Simplification: This pull request aims to simplify the logic in llama_context by removing the deprecated logits_all flag. It suggests that users set llama_batch::logits to achieve the same functionality.
    • pull/13284
  • imatrix Code Out-of-Bounds Fix: This pull request addresses an issue in the imatrix code where out-of-bounds writes occur if src1 is not contiguous. It ensures the host buffer matches the exact size of the data being copied and adjusts for non-contiguous src1 when calculating byte addresses of matrix rows.
    • pull/13286
  • Vulkan glslc Invocation Fix: This pull request addresses the issue of improperly handling Vulkan glslc invocation command lines by eliminating the use of a Windows-style string representation on Unix systems. It passes arguments directly as an array to avoid the need for complex escaping and refines the parameter handling to prevent confusion with glslc, thereby fixing issue #13288.
    • pull/13289
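
As a companion to the Jinja template parameters item above, the hypothetical request below shows what per-request control of "enable_thinking" could look like once pull/13196 lands, following the chat_template_kwargs convention that issue #13160 attributes to SGLang and vLLM. The endpoint, port, and exact field name are assumptions rather than confirmed llama-server behavior.

```python
# Hypothetical request showing per-request "enable_thinking" control in the
# style of SGLang and vLLM's chat_template_kwargs, which pull/13196 aims to
# bring to llama-server. Field name, endpoint, and port are assumptions; the
# open pull request may expose this differently.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
        # Extra parameters intended to be forwarded to the Jinja chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```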

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 74

Key Closed Pull Requests

1. Merge mainline: This pull request involves multiple updates and merges to the llama-quant.cpp, imatrix.cpp, ggml-quants.c, and llama-model.cpp files, as well as several merges from the main branch, but it was ultimately not merged into the mainline.

  • URL: pull/13279
  • Merged: No
  • Associated Commits: 015a9, 5d08f, 661c2, a9f06, da493, af6a5, 2e1d9, 0cd61, 47394, 6fb3d, 85fb1, 733dd, c4a2d, 4d871, 32c2d, 63fbb, ae8b7, 11a52, 54eb4, 2f652, 4818a, 901c2, 6c774, 1987e, da5a7, ee86d, 619d1, 55d0d, 30dfb, 7bb36, 67968, 29c4f, 2c181, 78f1c, 55caa, 63273, 6a788, 08062, f5132, 0a759, 7bc47, 78c9d, cd965, fd819, 9099f, 96e26, 6d4b0, f68de, 9ac63, aab99, 8a325, 135f8, f0249, 4ac3d, 09353, bce9c, e5edf, fcc80, a1368, 1a6fd, e45b9, eb995, 90c58, 2b8c9, ed1d9, 8fd78, 91a76, 64933, da8e1, 7399c, b3f6d, 58353, ab57a, c70a5, 4855d, c3f07, 523c0, 103e6, 18492, 55ab1, dc31b, f9d83, 827e4, df1f2, 44fc2, d3333, 3cd3d, 6b7d1, 0dfd8, 72376, b5447, 862a1, b2cc1, fe4ac, db43f, ac831, 42015, 1076c

2. convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf: This pull request involves converting pre-quantized models for Qwen2/2.5VL from the Hugging Face format to the GGUF format using the llama-mtmd-cli tool, with successful tests for most models except the Qwen2.5-VL-32B model, which remains unusable due to output issues.

  • URL: pull/13209
  • Merged: 2025-05-02T15:17:15Z
  • Associated Commits: 79238, f7260, f48f5, b5e72, 4fac7, 47493, 65175, d96ef, 13e4c, 6e31d, ef0bc, 62b7b, c0309

3. mtmd : add qwen2vl and qwen2.5vl: This pull request introduces the addition of the qwen2vl and qwen2.5vl models to the llama-mtmd-cli, addresses issues with the llama-qwen2vl-cli by deprecating it due to incorrect n_past tracking, and confirms the correct functionality of the new models through various test results.

  • URL: pull/13141
  • Merged: 2025-04-29T09:47:04Z
  • Associated Commits: 6f7a5, 8742f, 8646e, 513e9, b3035, 14a6a, d23fd, e9bff, 496f1, db85d

Other Closed Pull Requests

  • Linux Cross-Compile CI Setup Synchronization: This topic addresses synchronization issues with dependencies during the Linux cross-compile CI setup, which were causing unnecessary build failures and delays. The pull requests implement changes such as continuing on error during setup and caching packages in the toolchain, including adjustments to force CI builds and merging updates from the master branch.
    • pull/12804, pull/13204
  • Support for New Models and Features: Several pull requests introduce support for new models and features, such as the Deci model Llama-3_1-Nemotron-Ultra-253B-v1 and Qwen2 embedding models. These changes involve modifying scripts and files to accommodate new layers and embedding features, ensuring compatibility and functionality across different models.
    • pull/12843, pull/13245
  • Error Handling and Input Validation: Pull requests in this category focus on improving error handling and input validation to prevent issues such as "out_of_range" exceptions and Denial of Service attacks. These changes include adding checks to prevent errors and replacing abort() calls with graceful error handling.
    • pull/13244, pull/13069
  • Project Reorganization and Refactoring: This topic covers the reorganization of project components, such as moving examples to a separate tools directory and refactoring code to enhance reliability. These changes ensure that only necessary programs are included in future distributions and improve code structure.
    • pull/13249, pull/13136
  • New Arguments and Features in Tools: Pull requests introduce new arguments and features in tools like llama-bench, allowing users to run tests with a prefilled KV cache context. These updates are documented in the README and demonstrated with sample output showing performance metrics.
    • pull/13096, pull/13174
  • Template and Token Handling Improvements: Several pull requests address issues with template and token handling, such as incorrect legacy templates and redundant token counts. These changes ensure the correct application of templates and improve the handling of tokens in various models.
    • pull/13140, pull/13139, pull/13099
  • Vision Support and Image Processing Enhancements: Pull requests add vision support for models like Mistral Small 3.1 and improve image processing techniques. These changes include implementing a "patch merger" technique and refactoring input handling to enhance reliability.
    • pull/13231, pull/13136
  • Kernel and Performance Improvements: This topic includes pull requests that add missing kernels and enhance performance, such as introducing unary kernels to the SYCL implementation and improving matrix multiplication kernels for specific architectures. These changes result in significant speed improvements and better kernel handling.
    • pull/13074, pull/13146
  • Configuration and Architecture Handling: Pull requests improve the handling of model architectures in configuration files, ensuring correct mapping and providing fallbacks. These changes enhance the flexibility and accuracy of model configuration.
    • pull/13122, pull/13159
  • Compiler Warnings and Synchronization: This topic addresses the suppression of compiler warnings and synchronization of project components. Pull requests disable specific warnings and add new features like Vulkan kernels for improved performance.
    • pull/13234, pull/13268
  • Offline Usage and URL Handling: Pull requests improve offline usage by allowing operations without internet access and address URL handling issues. These changes include introducing a manifest file for cached models and modifying logic to prevent download errors due to URL case mismatches.
    • pull/13202, pull/13219

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor activity (commits, pull requests, issues, and comments over the past month):

  • ngxson: 263 commits, 23 pull requests, 6 issues, 77 comments
  • ggerganov: 81 commits, 15 pull requests, 1 issue, 34 comments
  • danielhanchen: 82 commits, 1 pull request, 0 issues, 0 comments
  • BradHutchings: 73 commits, 0 pull requests, 0 issues, 1 comment
  • matteoserva: 33 commits, 7 pull requests, 2 issues, 19 comments
  • slaren: 18 commits, 3 pull requests, 0 issues, 27 comments
  • CISC: 13 commits, 4 pull requests, 0 issues, 28 comments
  • No author found: 39 commits, 0 pull requests, 0 issues, 0 comments
  • JohannesGaessler: 12 commits, 7 pull requests, 0 issues, 19 comments
  • jukofyork: 28 commits, 0 pull requests, 1 issue, 1 comment
