Weekly Project News

Weekly GitHub Report for Llama.cpp: June 23, 2025 - June 30, 2025 (23:08:28)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

This version was created on March 29, 2025. The release data does not include details of its changes, so no notable highlights or trends can be identified.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Eval bug: Program crashes during long input inference when batch size is set to 16384: This issue involves a program crash during long-input inference when the batch size is set to 16384, because under these conditions the CUDA copy operations require the input to be expanded from int32 to int64. The user applied a temporary workaround by bypassing the GGML_ASSERT and adding checks on the element count, but a fundamental resolution would require representing both the element count and the tensor size as int64 (a brief illustrative sketch follows this list).

    • The comments discuss the need to reduce batch or context size to avoid exceeding CUDA's limits, with suggestions to use smaller batch sizes as the current setting is not intended behavior. The user explains the larger batch size was chosen for faster processing, and they hope for future support of larger configurations. There are requests for benchmarks to support performance claims, and discussions about using the benchmark tool for accurate comparisons. Some users confirm performance improvements with increased batch sizes, while others note the need for proper configuration to optimize GPU utilization.
    • Number of comments this week: 11
  2. Misc. bug: This issue involves an antivirus warning on Windows 11 that is triggered by the llama-server module of the llama.cpp project. The user encountered the problem after downloading the latest pre-built version for Windows and CUDA 12.x from the official repository.

    • The comments discuss whether the binaries were downloaded from the official repository or compiled by the user, suggesting that the issue might be a false positive. The user confirms downloading from the official repository, and another comment notes that changes in the latest version should not have caused any antivirus flags, indicating a possible misunderstanding or error in the antivirus detection.
    • Number of comments this week: 5
  3. Feature Request: Exclude thinking tokens from server cache for reasoning models: This issue is a feature request to modify the server cache behavior for reasoning models by excluding thinking tokens, which are currently cached even when the web interface excludes them, leading to unnecessary reprocessing and wasted context size. The proposed implementation involves identifying and removing thinking tokens from the cached tokens and key-value cache on the server side.

    • The comments discuss the current support for excluding thinking tokens using a specific command, with clarification sought on whether this affects non-thinking tokens as well. Further explanation is provided, including a visual aid, and a question is raised about a specific parameter in the example.
    • Number of comments this week: 4
  4. Request for Official Support of AMD Ryzen AI Platform NPU: This issue is a feature request for the official support of the AMD Ryzen AI platform NPU in the llama.cpp project, highlighting the potential benefits of improved cross-platform compatibility and performance. The requester notes that while AMD has provided an initial implementation, integrating this functionality into the official repository could enhance usability and address varying quantization precision support across different platforms.

    • The comments discuss related issues and existing preliminary implementations by AMD, emphasizing the potential for accelerated development if these resources are integrated. There is mention of deprecated implementations and ongoing work in other repositories, with suggestions for collaboration and contributions to enhance support for Ryzen AI NPUs.
    • Number of comments this week: 3
  5. main: failed to quantize model from 'gemma-3n-E2B-it.f16.gguf': This issue involves a failure to quantize a model file named 'gemma-3n-E2B-it.f16.gguf' to a quantized format 'gemma-3n-E2B-it.q5_k.gguf' due to a problem with the tensor data being out of file bounds, indicating that the model might be corrupted or incomplete. The user is attempting to use a specific build of the software to perform this quantization, but encounters an error suggesting that the model file is not suitable for the operation.

    • The comments discuss the need to update to a newer commit for successful quantization, as the current version lacks support for the 'gemma3n' model. It is suggested to either wait for a new binary release or compile the source code manually. A user shares their successful experience with a different model version, indicating that the issue might be specific to the 'E2B' version.
    • Number of comments this week: 3
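
As a side note on the first issue above, the int32 limit it runs into is easy to demonstrate. The following is a minimal, self-contained sketch, not llama.cpp code: the tensor struct is hypothetical and the 131072-token context is assumed purely for illustration. It only shows how a batch of 16384 can push the element count past INT32_MAX and what an explicit guard looks like.

    // Minimal sketch (not llama.cpp code): why an int32 element count overflows
    // for very large batches, and the kind of guard the reporter describes.
    #include <cstdint>
    #include <cstdio>
    #include <limits>

    // Hypothetical stand-in for a tensor descriptor.
    struct Tensor {
        int64_t ne[4];  // number of elements per dimension
    };

    int64_t num_elements(const Tensor &t) {
        return t.ne[0] * t.ne[1] * t.ne[2] * t.ne[3];
    }

    int main() {
        // batch 16384 x assumed context 131072 = 2147483648, one past INT32_MAX.
        Tensor t{{16384, 131072, 1, 1}};
        const int64_t n = num_elements(t);

        // A copy kernel indexed with int32 would wrap around here; checking the
        // element count up front turns silent corruption into a clear error.
        if (n > std::numeric_limits<int32_t>::max()) {
            std::fprintf(stderr, "element count %lld does not fit in int32; "
                                 "reduce batch or context size\n", (long long) n);
            return 1;
        }
        std::printf("element count %lld fits in int32\n", (long long) n);
        return 0;
    }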

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is producing a GGML_OP_GET_ROWS error that does not occur with other Vulkan backends. The issue has been open for a significant period, indicating a potentially complex problem that has not yet been resolved.
  2. Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a project at Hugging Face. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
  3. common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the CURLOPT_NOPROGRESS option to ensure accurate, non-conflicting progress updates during the download process (see the sketch after this list).
  4. kubernetes example: This issue is about the need for a Helm chart for the llama.cpp server to facilitate its deployment on Kubernetes, which is a popular platform for deploying applications at scale in the industry. The issue has been open for over 448 days, and while initial work has begun, the contributor is seeking additional help from the community to advance the project.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the CUDA backend on a system with an NVIDIA GeForce RTX 3060 GPU, where the process fails due to a tensor type mismatch and an incorrect block size. The error occurs during the initialization of the model, specifically when trying to read tensor information from the file, resulting in the model failing to load properly.
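
The third stale issue above mentions CURLOPT_NOPROGRESS as the basis of its proposed fix. The snippet below is a rough, generic libcurl sketch, not the llama.cpp download code; the URL is a placeholder. It only illustrates the standard pattern of enabling a per-handle progress callback so that each shard could report its own progress without clobbering the others.

    // Generic libcurl sketch (not the llama.cpp implementation): per-download
    // progress reporting, the mechanism the proposed CURLOPT_NOPROGRESS fix uses.
    #include <curl/curl.h>
    #include <cstdio>

    struct ShardProgress {
        int shard_index;  // which shard this handle is downloading
    };

    // Called periodically by libcurl once CURLOPT_NOPROGRESS is set to 0.
    static int on_xferinfo(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                           curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
        const auto *p = static_cast<const ShardProgress *>(clientp);
        if (dltotal > 0) {
            std::fprintf(stderr, "shard %d: %3.0f%%\r", p->shard_index,
                         100.0 * (double) dlnow / (double) dltotal);
        }
        return 0;  // returning non-zero would abort the transfer
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        ShardProgress prog{0};

        // Placeholder URL; a real downloader would use one handle per shard.
        curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
        curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);                // enable progress callbacks
        curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, on_xferinfo); // per-handle callback
        curl_easy_setopt(curl, CURLOPT_XFERINFODATA, &prog);           // passed as clientp

        const CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            std::fprintf(stderr, "download failed: %s\n", curl_easy_strerror(res));
        }
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }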

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 20

Summarized Issues:

  • Bug in llama-server prefill functionality: The llama-server's prefill functionality only works correctly when the "content" field of a message is a string. It fails to continue the assistant message when the "content" is provided as a list of objects.
    • issues/14353
  • Gemma3 transformer library discrepancy: There is a discrepancy in the Gemma3 transformer library implementation within the llama.cpp project. The rope-type for local_emb_rope is hardcoded, causing confusion due to differing assumptions about frequency scaling and base.
    • issues/14367
  • Compile error on Intel x86_64 MacOS: A compile error occurs in the ggml_gemv_q4_K_8x8_q8_K function on Intel x86_64 MacOS systems using AVX2. The error can be resolved by disabling a specific branch in the code, though the exact SIMD instruction causing the incompatibility is unclear.
    • issues/14372
  • AMD Ryzen AI platform NPU support request: There is a request for official support for the AMD Ryzen AI platform NPU in the llama.cpp project. This highlights the need for improved performance and usability over AMD's initial implementation.
    • issues/14377
  • Server cache behavior modification request: A feature request suggests modifying the server cache behavior in reasoning models by excluding "thinking tokens" from being cached. The current setup causes unnecessary reprocessing and context size wastage.
    • issues/14379
  • Idle priority feature request: A feature request proposes enhancing the llama.cpp project by allowing it to run at idle (lowest) priority using a niceness value of 19. This aims to improve system responsiveness during AI model inferencing on CPUs (a minimal sketch of the mechanism follows this list).
    • issues/14382
  • Generic CPU architecture feature request: A feature request suggests adding a generic CPU architecture in the ggml-cpu/arch directory of the llama.cpp project. This would facilitate building cross-CPU and cross-platform binaries using Cosmopolitan.
    • issues/14402
  • Sparse vectors generation feature request: A feature request asks if the embeddings endpoint of the llama.cpp server can be enhanced to generate sparse vectors using models like bge-m3. This would utilize BM25 lexical scores for both dense and sparse embeddings.
    • issues/14404
  • Model quantization failure: There is a failure to quantize the model 'gemma-3n-E2B-it.f16.gguf' to 'gemma-3n-E2B-it.q5_k.gguf'. The issue is due to a problem with the tensor 'per_layer_token_embd.weight' data being out of file bounds.
    • issues/14405
  • DeepSeek model bug: A bug in the llama.cpp project causes tools to crash or fail when using the DeepSeek R1/V3 models with unsloth dynamic quantization. This results in runtime errors and core dumps, particularly when attempting to call tools.
    • issues/14406
  • Empty grammar stack error: A bug in the llama.cpp project results in an unexpected empty grammar stack error after accepting a piece labeled "". This causes a runtime error during model execution on systems with AMD Radeon GPUs using the HIP backend.
    • issues/14413
  • Hunyuan-A13B model support request: A feature request for the llama.cpp project seeks support for the Hunyuan-A13B model. This mixture of experts (MoE) model with 13 billion active parameters is suitable for use on home computers with one or more GPUs.
    • issues/14415
  • Flask application bug in Docker: A bug in the llama.cpp project causes a Flask application running in a Docker container on a Linux VM to encounter a GGML_ASSERT(backend_embd != nullptr) failed error. This occurs when more than 3-4 simultaneous API requests are made.
    • issues/14418
  • rocBLAS error with gemma-3n model: A crash occurs with certain models, such as gemma-3n, due to a rocBLAS error where the system cannot locate the required TensileLibrary.dat file for the GPU architecture gfx1036. A temporary workaround involves using a single GPU with full offload.
    • issues/14421
  • Antivirus message on Windows 11: A bug in the llama.cpp project triggers an antivirus message on Windows 11 when using the llama-server module. This is potentially due to a false positive, as the user downloaded the latest pre-built version from the official repository.
    • issues/14422
  • Finetuning example crash: A crash occurs in the example/finetune.cpp of a GitHub project when attempting to run a finetuning example on a Linux system. The problem seems related to the inability of pre-allocated tensors in a CUDA buffer to execute the required operations.
    • issues/14424
  • Assertion error with Qwen 3-30B model: A bug is encountered when running the llama-server with a specific configuration (ubatch=2048) on the Qwen 3-30B model. An assertion error (GGML_ASSERT(nei0 * nei1 <= 4096)) fails, indicating a problem with the model's compatibility or configuration limits.
    • issues/14426
  • Gemma3n multimodal capabilities request: A feature request for the llama.cpp project seeks the addition of support for Gemma3n multimodal capabilities, such as ASR and vision. This would extend beyond the currently supported text modality.
    • issues/14429
  • PLE offloading inquiry: A user inquires whether the PLE (Per-Layer Embedding) of the Gemma3N model can be offloaded to a GPU using CUDA. Their attempt to do so resulted in a silent crash of the llama.cpp application.
    • issues/14430
  • Segmentation fault in llama-server: A bug in the llama.cpp project causes a segmentation fault when attempting to save and restore an empty slot using the llama-server module. Further complications arise when a restore request is made with a nonexistent file.
    • issues/14434
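
Of the requests above, the idle-priority one (issues/14382) is simple to illustrate. The sketch below is not the proposed llama.cpp change; it only shows the standard POSIX way a process can drop itself to niceness 19 so that CPU-bound inference yields to interactive workloads.

    // Illustrative only (not the llama.cpp change): lower the calling process
    // to the lowest conventional niceness (19) before starting heavy work.
    #include <sys/resource.h>
    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    int main() {
        // PRIO_PROCESS with pid 0 targets the calling process; 19 is the lowest priority.
        if (setpriority(PRIO_PROCESS, 0, 19) == -1) {
            std::fprintf(stderr, "setpriority failed: %s\n", std::strerror(errno));
            return 1;
        }
        std::printf("now running at niceness %d\n", getpriority(PRIO_PROCESS, 0));
        // ... the inference workload would run here ...
        return 0;
    }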

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 4

Summarized Issues:

  • Bug in ggml_cuda_compute_forward function: This issue involves a "MUL failed" error due to an invalid device function on ROCm, which is caused by a misconfiguration in compiling for the wrong AMD GPU architecture. The error specifically arises when the architecture is set to gfx1030 instead of the correct gfx1036.
    • issues/14370
  • Regression bug in llama-cpp with Mistral: A regression bug in recent versions of llama-cpp causes function calls using Mistral to incorrectly reference a character from Tekken 7. This issue is due to a string mistral-v7-tekken in the prompts and affects specific hardware and software configurations, impacting versions 5749 and 5756 but not 5311.
    • issues/14383
  • CI process failure on Windows: The continuous integration process for Windows in the llama.cpp project is failing for all new pull requests. This failure is due to a compiler version mismatch error, specifically requiring Clang 19.0.0 or newer.
    • issues/14412
  • Feature request for Hunyuan-A13B model support: There is a feature request to add support for the Hunyuan-A13B model, which has 80 billion parameters with 13 billion active parameters, in both llama.cpp and gguf-my-repo. The user encountered an error when attempting to convert the model to GGUF due to unsupported model architecture.
    • issues/14433

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 23

Key Open Pull Requests

1. llama : add high-throughput mode: This pull request introduces a high-throughput mode to the llama project, aiming to enhance multi-sequence decoding performance by minimizing cross-sequence attention computation in the unified KV cache, with promising initial results. The functionality is gated behind the LLAMA_HT environment variable, additionally requires LLAMA_SET_ROWS from a related issue, and includes various updates and tests to support the new feature.

  • URL: pull/14363
  • Merged: No
  • Associated Commits: c1a58, f2cd9, 695b6, 313a4, df71c, 630c8, e8970, e7369, 828e5, c0cfc, eba97, 1f647, 79dac, db2bb, f875d, 39d0b, 332f0, 36f8e, 52b90, 13214, 401c1, 7c648, 8c682, 1b74b, 165d8, 0bb1d, 66631, 5eb1a, 61795

2. vulkan: Add fusion support for RMS_NORM+MUL: This pull request introduces fusion support for the RMS_NORM and MUL operations in the Vulkan backend of the ggml library. It adds a use_count to ggml_tensor to track multiple uses of an output, modifies the rms_norm shader to allow multiplication by another tensor, implements detection and basic fusion logic, and enhances testing to compute entire graphs while focusing on individual node results, targeting the rms_norm+mul pattern commonly generated by llm_graph_context::build_norm (a simplified sketch of the detection idea follows this list).

  • URL: pull/14366
  • Merged: No
  • Associated Commits: 8643f, 18f2d, b84cb, 5e13d, e6f3c, da2fc, 8c50a, 9ddd4, 16304, 2dbac, 68949, a75e9

3. model : add hunyuan moe: This pull request aims to add the "hunyuan moe" model to the project, addressing issue #14415. Remaining tasks include making the model convertible to GGUF, ensuring the correct tokenizer and pretok are used, and implementing the computation graph, as evidenced by commits such as 'model : add hunyuan moe', 'tokenizer ok', and 'cgraph init'.

  • URL: pull/14425
  • Merged: No
  • Associated Commits: f5d8a, 38acf, 35591, cb1f9, 51886, cff16, 3920f
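
The RMS_NORM+MUL fusion work in pull/14366 relies on knowing that a norm's output feeds exactly one consumer, which is what the new use_count is for. The sketch below is a deliberately simplified illustration of that detection idea; the node type and op names are hypothetical stand-ins, not ggml's actual structures or the Vulkan backend code.

    // Simplified illustration of the fusion-detection idea behind pull/14366;
    // the types and op names here are hypothetical, not ggml's.
    #include <cstdio>

    enum class Op { RMS_NORM, MUL, ADD };

    struct Node {
        Op op;
        int use_count = 0;           // how many later nodes consume this output
        const Node *src0 = nullptr;  // first input
        const Node *src1 = nullptr;  // second input (for binary ops)
    };

    // RMS_NORM followed by MUL can be fused only if the norm's output is consumed
    // by that MUL alone; otherwise the intermediate result is still needed elsewhere.
    static bool can_fuse_rms_norm_mul(const Node &norm, const Node &mul) {
        return norm.op == Op::RMS_NORM &&
               mul.op  == Op::MUL &&
               (mul.src0 == &norm || mul.src1 == &norm) &&
               norm.use_count == 1;
    }

    int main() {
        Node x{Op::ADD};
        Node norm{Op::RMS_NORM, /*use_count=*/1, &x, nullptr};
        Node weight{Op::ADD};
        Node mul{Op::MUL, /*use_count=*/1, &norm, &weight};

        std::printf("fusable: %s\n", can_fuse_rms_norm_mul(norm, mul) ? "yes" : "no");
        return 0;
    }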

Other Open Pull Requests

  • Conv2d Implementations and Enhancements: Multiple pull requests focus on improving Conv2d operations across different platforms. One introduces a new Conv2d implementation for CPU using a tiled Im2col + GEMM approach, while another adds a direct conv2d kernel for OpenCL, designed for Adreno GPUs, both showing significant performance improvements.
    • pull/14388, pull/14403
  • Performance Optimization for Specific Architectures: Several pull requests target performance enhancements for specific architectures. These include a build variant for the GGML CPU library targeting Neoverse-V2 architecture and performance improvements for the Ascend 310P by optimizing matrix multiplication weights.
    • pull/14380, pull/14407
  • Backend Performance Testing and Comparison: Enhancements in backend performance testing are addressed in multiple pull requests. A new script tests backend operations focusing on GFLOPS and GB/s metrics, and another enhances the compare-commits.sh script to support performance comparisons for llama-bench and test-backend-ops.
    • pull/14354, pull/14392
  • Quantization and SIMD Instructions: A pull request introduces a block interleaving approach for Q2_K quantization on x86/x64 architectures using AVX2 and AVX512 SIMD instructions, resulting in significant performance improvements while maintaining similar perplexity levels.
    • pull/14373
  • API and Functionality Enhancements: Several pull requests focus on enhancing API functionality and usability. These include a new C API function for determining backend device types and the introduction of the ggml_scale_bias function for the Metal backend.
    • pull/14358, pull/14417
  • Bug Fixes and Improvements: Various pull requests address bug fixes and improvements. These include fixing assistant prefilling logic, addressing a broken URL in the README, and enhancing the onChunk callback to handle streaming errors more gracefully.
    • pull/14360, pull/14371, pull/14374
  • Documentation and Configuration Updates: Updates to documentation and configuration files are covered in several pull requests. These include refining the .gitignore file and adding explanations to command-line options to improve user understanding.
    • pull/14355, pull/14399
  • Hybrid Cache System and Broadcasting Support: Enhancements to the hybrid cache system and support for broadcasting in specific functions are introduced. These changes ensure proper cache function calls and allow for generalized broadcasting across dimensions.
    • pull/14428, pull/14435

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 29

Key Closed Pull Requests

1. Changes sync: This pull request involves a series of updates and improvements to the llama.cpp project: adding proper checks for Clang to avoid errors and warnings, introducing CMake presets and toolchain files for Windows ARM64, enabling and fixing matmul-int8 for MSVC and Clang, optimizing Windows ARM64 builds with MSVC and LLVM, and refactoring CI workflow inputs for better readability and dispatch flexibility. It was ultimately not merged.

  • URL: pull/14376
  • Merged: No
  • Associated Commits: ff48f, 7d469, 18d20, e838a, 39780, ece01, f8532, b58b0, 8b5f3, f0338, d1c9f, aa1f4, a43c4

2. Fix compile warnings: This pull request addresses various compile warnings in the llama.cpp project by removing unused variables, correcting type mismatches, and adjusting code to avoid discarding 'const' qualifiers. It was ultimately not merged into the main branch.

  • URL: pull/14393
  • Merged: No
  • Associated Commits: e1ffb, 4f9bb, 455fd, 6de95, e55e9, 4ed3a, c499d, 0e0a9, 39ae0, d61d9

3. CUDA: add bf16 and f32 support to cublas_mul_mat_batched: This pull request adds support for the bf16 and f32 data types to the cublas_mul_mat_batched function in CUDA, resulting in a significant speedup for certain configurations when running llama-bench, as evidenced by performance improvements in specific test scenarios.

  • URL: pull/14361
  • Merged: Yes
  • Associated Commits: 4887f, 526fc, c02cd, 2c4e4

Other Closed Pull Requests

  • Vulkan Performance Improvements: Several pull requests focus on enhancing Vulkan performance by increasing workgroup sizes and addressing concurrency issues. These changes result in improved performance metrics and stability, as demonstrated by benchmark tests and commit evidence.
    • pull/14345, pull/14333, pull/14378, pull/14427
  • Batch Processing Enhancements: Enhancements in batch processing are addressed by introducing methods for batching row copies and ensuring consistent memory batch splitting. These improvements help optimize processing efficiency and maintain consistency across implementations.
    • pull/14384, pull/14414
  • Documentation and Build Updates: Updates to documentation and build processes include clarifying build instructions, correcting README errors, and improving version management. These changes aim to enhance user understanding and flexibility in project management.
    • pull/14389, pull/14352, pull/14362
  • Model and Template Additions: New models and templates are introduced, such as the text-only gemma 3n model and a Jinja template for the Mistral model. These additions expand the project's capabilities and support for various functionalities.
    • pull/14400, pull/14349, pull/14408
  • Bug Fixes and Code Improvements: Various bug fixes and code improvements are implemented, including addressing empty sequence checks, fixing unprintable character issues, and making destructors virtual (see the brief example after this list). These changes enhance the project's robustness and maintainability.
    • pull/14364, pull/14381, pull/14410, pull/14416
  • Feature Enhancements: Enhancements to features include improving the feature scoring mechanism and extending verbose prompt capabilities. These updates provide more precise optimizations and a better user experience during interactive sessions.
    • pull/14332, pull/14350
  • Build and Release Process Improvements: Improvements to the build and release process address issues with Windows builds and stabilize Release builds by pinning runner images and fixing ARM64 builds. These changes ensure a more reliable build process.
    • pull/14419, pull/14431
  • CMake and Shader Dependency Management: Updates to CMake and shader dependency management ensure proper inclusion of C++ sources and correct passing of debug settings. These changes improve build accuracy and shader generation processes.
    • pull/14398, pull/14427
  • Memory Management and Data Handling: Improvements in memory management and data handling include adding user data pointers to tensors and handling non-contiguous data. These updates facilitate better data management and support for node bindings.
    • pull/14365, pull/14378
  • Template and Option Explanations: Clarifications on template behavior and option usage, such as the --no-mmap option, provide users with better understanding and control over data loading and template selection processes.
    • pull/14390, pull/14396
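
One of the bug-fix pull requests grouped above makes destructors virtual. The snippet below is a generic C++ illustration of why that matters when objects are deleted through a base-class pointer; the class names are invented for the example and are not taken from the PR.

    // Generic illustration (not the PR's code): without a virtual destructor,
    // deleting a derived object through a base pointer skips the derived cleanup.
    #include <cstdio>
    #include <memory>

    struct Backend {
        virtual ~Backend() { std::puts("~Backend"); }  // virtual: derived dtor runs first
        virtual void run() = 0;
    };

    struct CudaBackend : Backend {
        ~CudaBackend() override { std::puts("~CudaBackend (releases device resources)"); }
        void run() override { std::puts("running"); }
    };

    int main() {
        std::unique_ptr<Backend> b = std::make_unique<CudaBackend>();
        b->run();
        // When b is destroyed, ~CudaBackend runs, then ~Backend, because the base
        // destructor is virtual; without it, this deletion would be undefined behavior.
        return 0;
    }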

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor         Commits   Pull Requests   Issues   Comments
ggerganov           142       11              0        52
taronaeo            112       2               0        2
CISC                59        11              0        44
gabe-l-hart         60        1               1        1
jeffbolznv          28        8               0        20
am17an              37        6               1        7
ngxson              16        3               0        27
JohannesGaessler    6         2               0        35
kpouget             30        0               0        0
Zijie-Tian          26        0               0        0
