Weekly Project News


Weekly GitHub Report for Llama.cpp: September 08, 2025 - September 15, 2025 (12:00:59)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2025, introduces significant updates enhancing overall performance and user experience, with notable improvements in system stability and feature optimization. This release reflects a continued focus on refining core functionalities and addressing user feedback.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Feature Request: Qwen3-Next support: This issue requests support for the new Qwen3-Next models in the llama.cpp project, highlighting that these state-of-the-art models feature complex architectures such as gated DeltaNet linear attention and a high-sparsity Mixture of Experts (MoE) layer that are not currently supported. The discussion emphasizes the significant engineering challenges involved, including the need for new GPU kernels, extended conversion scripts, and integration of multi-token prediction and long-context handling, making this a substantial development effort likely requiring months of specialized work.

    • Commenters noted the absence of a pull request despite the model’s release and discussed the technical difficulties posed by the novel attention mechanisms and model architecture. Several users shared attempts at converting the model to compatible formats and running it, encountering stalls and quantization issues. It was clarified that simple conversion is insufficient without implementing new GPU kernels and architectural support within llama.cpp. The conversation also covered ongoing efforts to extend support for multi-token prediction and long-context features, with some optimism about future integration but consensus that full support will require extensive development.
    • Number of comments this week: 17
  2. Eval bug: Crash with Mistral Small 2506: This issue reports a crash occurring when running the llama-server with the Mistral Small 2506 model due to an unsupported template syntax used in the chat template, specifically the parsing of 'yesterday' in the Jinja-like template engine. The user highlights that while the template causes a crash, ideally the program should handle such unsupported templates gracefully by providing clear error messages or fallback behavior instead of terminating unexpectedly.

    • The discussion clarifies that the crash is caused by the template engine minja not supporting certain template constructs, and the recommended workaround is to use a supported template file via the --chat-template-file option. A fix was developed and tested that prevents the crash by properly handling unsupported templates, and the conversation emphasizes the importance of avoiding crashes by producing informative error messages and exiting cleanly rather than failing silently or crashing. A minimal illustrative sketch of this fail-fast behaviour appears after this list.
    • Number of comments this week: 15
  3. Misc. bug: llama.cpp always produces malformed output on LoongArch starting from b6353: This issue reports that starting from commit b6353, the llama.cpp project produces malformed output on LoongArch CPU-only Linux systems, whereas earlier commits like b6324 work correctly. The problem appears to be linked to Flash Attention: disabling it (-fa off) resolves the malformed output, suggesting a bug or incompatibility in the Flash Attention implementation on LoongArch.

    • The comments confirm that disabling Flash Attention fixes the issue, with one user noting similar problems on IBM NNPA and suspecting Flash Attention as the root cause. It was also clarified that the problem is not related to endianness since LoongArch uses little-endian. Further investigation showed tensor data corruption (NaN and -inf values) when Flash Attention is enabled, and maintainers are encouraged to look into this for LoongArch-specific code paths.
    • Number of comments this week: 2
  4. Windows Defender flags llama-b6409-bin-win-cpu-x64.zip as Script/Wacapew.A!ml — archive blocked and inaccessible: This issue reports that Windows Defender flags the downloaded Windows build archive of llama.cpp version b6409 as a threat named Script/Wacapew.A!ml, blocking access to the file and preventing extraction unless real-time protection is disabled, which is not an acceptable solution. The user highlights that VirusTotal scans show no active threat, indicating a likely false positive, but stresses the importance of addressing this to maintain user trust and ensure safe, accessible releases.

    • The comments confirm that the issue was resolved in a later version (b6412), with Defender no longer flagging the archive, and acknowledge that false positives can occur occasionally; users are advised to use unaffected versions while the development team works on preventing such antivirus triggers.
    • Number of comments this week: 2
  5. Feature Request: Curious behavior in Q6_K quantization — emotionally resonant responses in mental health context: This issue reports an intriguing observation that the Q6_K quantization variant of the Qwen2-1.5B model produces emotionally resonant and reflective responses in Vietnamese, which appear more aligned with the tone of a mental health care assistant compared to other quantization types and even FP16. The user shares benchmark results from 700 samples demonstrating this unique behavior and expresses interest in exploring further improvements based on these findings.

    • The comments express skepticism about the subjective nature of evaluating "emotionally resonant" responses and suggest conducting blind tests to verify the distinction between quantization types, while also cautioning against using the model for mental health applications.
    • Number of comments this week: 2
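
A note on the fail-fast behaviour discussed in item 2 above: the sketch below illustrates the general pattern of catching a template-engine error, printing an informative message, and exiting cleanly instead of crashing. It is a hedged illustration only; render_chat_template and template_error are hypothetical stand-ins, not the actual llama.cpp or minja symbols.

    // Minimal sketch: fail with a clear message instead of crashing when a chat
    // template uses unsupported syntax. render_chat_template and template_error
    // are hypothetical placeholders, not the real llama.cpp / minja symbols.
    #include <cstdio>
    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    struct template_error : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // Stand-in for the template engine call that may reject unsupported constructs.
    static std::string render_chat_template(const std::string & tmpl, const std::string & user_msg) {
        if (tmpl.find("yesterday") != std::string::npos) {
            throw template_error("unsupported construct in chat template: 'yesterday'");
        }
        return "<|user|>" + user_msg + "<|assistant|>";
    }

    int main() {
        const std::string tmpl     = "... yesterday ...";  // offending template
        const std::string user_msg = "Hello";
        try {
            const std::string prompt = render_chat_template(tmpl, user_msg);
            std::printf("rendered prompt: %s\n", prompt.c_str());
        } catch (const template_error & e) {
            // Informative error plus clean exit instead of an abort or crash.
            std::fprintf(stderr,
                "error: failed to render chat template (%s)\n"
                "hint: pass a supported template with --chat-template-file\n", e.what());
            return EXIT_FAILURE;
        }
        return 0;
    }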

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 532 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, which would help in collecting and analyzing GPU traces across different frameworks.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status for each parallel download.
  4. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.

Since there were fewer than 5 stale issues, all of the stale issues have been listed above.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 17

Summarized Issues:

  • Model output and loading errors: Several issues report problems with model outputs and loading failures, including models producing repeated or nonsensical characters after conversion or offloading, and failures to open or load GGUF model files, possibly due to file corruption or compatibility problems. These problems affect different models and hardware setups, indicating challenges in model conversion, GPU offloading, and file integrity.
  • issues/15870, issues/15920, issues/15923
  • Windows compatibility and security false positives: Multiple issues highlight Windows-specific problems such as Windows Defender falsely flagging the llama.cpp Windows build as a threat and the llama-server failing to start on Windows 7 due to unsupported API usage. These issues cause usability and compatibility problems, prompting requests for fixes to ensure broader Windows support and safe distribution.
  • issues/15874, issues/15876
  • CUDA backend and GPU support issues: Several reports describe CUDA backend problems including crashes on specific NVIDIA hardware, poor token decoding performance compared to Vulkan, and compilation without GPU support causing GPU options to be ignored. These issues reveal instability and performance limitations in GPU acceleration within the project.
  • issues/15946, issues/15955, issues/15969
  • Template engine and configuration errors: Issues related to template rendering include crashes caused by unsupported or invalid Jinja and minja template filters, resulting in runtime errors and program crashes. Users request clearer error messages and better handling of template errors to avoid crashes during model serving.
  • issues/15930, issues/15971
  • Model support and feature requests: There are requests to add support for new and complex model architectures such as Qwen3-Next, BailingMoeV2ForCausalLM, and Gemma3nForConditionalGeneration, highlighting engineering challenges like custom attention mechanisms and multi-token prediction. These requests emphasize the need to extend llama.cpp's compatibility with state-of-the-art and specialized models.
  • issues/15940, issues/15968, issues/15970
  • Quantization and model behavior observations: One issue reports a unique behavior of the Q6_K quantization variant producing emotionally resonant and poetic responses in Vietnamese, differing from other quantization types and FP16 during benchmarking. This highlights interesting qualitative differences in model output depending on quantization methods.
  • issues/15907
  • Platform-specific loading failures: An issue describes the load_tensors operation hanging on the Strix Halo platform with an AMD RYZEN AI MAX+ 395 chipset using the HIP backend, indicating platform-specific compatibility or backend issues affecting model loading.
  • issues/15889
  • Data processing and schema bugs: Bugs include invalid UTF-8 byte errors during concurrent processing on Linux with CUDA and incorrect grammar output generated by the json_schema_to_grammar.py script due to nested allOf constructs in JSON Schema. These issues affect data integrity and tooling correctness. A generic sketch of the usual UTF-8 streaming guard appears after this list.
  • issues/15931, issues/15945
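
The invalid UTF-8 errors grouped under "Data processing and schema bugs" typically arise when a byte stream is cut in the middle of a multi-byte character. The sketch below shows the usual defence of emitting only complete sequences and buffering the rest; it is a generic illustration, not the project's actual validation code.

    // Generic sketch: compute how many leading bytes of a buffer form complete
    // UTF-8 sequences, so a streaming server can hold back a partially received
    // multi-byte character instead of emitting invalid UTF-8.
    #include <cstddef>
    #include <cstdio>
    #include <string>

    static size_t utf8_complete_prefix(const std::string & s) {
        size_t i = 0;
        while (i < s.size()) {
            const unsigned char c = (unsigned char) s[i];
            const size_t len =
                (c & 0x80) == 0x00 ? 1 :    // 0xxxxxxx: ASCII
                (c & 0xE0) == 0xC0 ? 2 :    // 110xxxxx: 2-byte sequence
                (c & 0xF0) == 0xE0 ? 3 :    // 1110xxxx: 3-byte sequence
                (c & 0xF8) == 0xF0 ? 4 : 0; // 11110xxx: 4-byte sequence
            if (len == 0) {
                return i;                   // invalid lead byte: stop before it
            }
            if (i + len > s.size()) {
                return i;                   // sequence not fully received yet
            }
            for (size_t k = 1; k < len; ++k) {
                if (((unsigned char) s[i + k] & 0xC0) != 0x80) {
                    return i;               // missing continuation byte
                }
            }
            i += len;
        }
        return i;
    }

    int main() {
        const std::string chunk = "na\xC3";  // "na" plus only the first byte of a 2-byte character
        const size_t ok = utf8_complete_prefix(chunk);
        std::printf("emit %zu of %zu bytes now, buffer the rest\n", ok, chunk.size());
        return 0;
    }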

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 15

Summarized Issues:

  • Vulkan and AMD GPU Output Corruption: Multiple issues report corrupted or garbled output when using the Vulkan backend on macOS with AMD GPUs, including RDNA1 and specific models like Qwen3-Coder-30B-A3B. These problems are linked to broken subgroup arithmetic support and Vulkan GPU offloading, causing unreadable text or gibberish output during model evaluation.
  • issues/15846, issues/15875
  • Parallel Processing and Context Handling Bugs: There is a bug where running llama-server with parallel processing and embedding flags causes matrix multiplication assertion failures and incorrect context size handling, which was fixed by adjusting context size parameters. Additionally, the gpt-oss model inefficiently reprocesses the entire prompt on subsequent queries, degrading performance, mitigated by lowering the prompt similarity threshold.
  • issues/15849, issues/15894
  • User Interface Regression in Web Interface: A regression introduced in a recent commit caused the llama-server web interface to stop displaying real-time token generation speed and progress during streaming, only showing speed after generation completes, negatively impacting user experience during slow or long generations.
  • issues/15865
  • Model Evaluation Assertion Failures: Several issues report runtime assertion failures during model evaluation, including non-contiguous tensor views triggering GGML_ASSERT checks on Vulkan backend with AMD hardware, and quantization errors causing assertion failures in nearest_int(float) due to corrupted GGUF model weights.
  • issues/15895, issues/15911
  • Compilation and Backend Support Issues: Compilation errors occur in the Vulkan backend due to ambiguous operator overloads on Debian with GCC 12, resolved by changing null checks, and in the hipBLAS backend for gfx803 GPU architecture due to unsupported assembly instructions, which required logic adjustments.
  • issues/15914, issues/15936
  • Model Loading and Conversion Problems: The server fails to load previously working GGUF model files on Ubuntu with CUDA backend, causing startup errors, while attempts to convert the Qwen3-Next-80B-A3B model fail due to unsupported architecture. The convert_lora_to_gguf.py script also ignores the --outtype parameter for LoRA weights, defaulting to F32 format intentionally.
  • issues/15890, issues/15919, issues/15950
  • Memory Leak on Apple M3 with Metal Backend: A potential memory leak is reported when repeatedly initializing and freeing llama contexts on Apple M3 hardware using the Metal backend, with memory usage increasing over many iterations and confirmed by leak detection tools. An outline of this init/free loop is sketched after this list.
  • issues/15954
  • Crash in iOS SwiftUI Sample App: The SwiftUI sample app crashes with an assertion failure during sampling when topK and topP parameters are added to the llama_sampler_chain on iOS devices.
  • issues/15961
  • Feature Requests for New Models: There are feature requests for implementing the Qwen3-Next model, claimed to promise a tenfold speed increase and enhanced intelligence, as well as support for the Qwen3-Next-80B-A3B model.
  • issues/15949, issues/15950
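
For the Metal memory-leak report above, the reproduction is essentially a loop that creates and frees a llama context many times while watching resident memory. The outline below assumes the current llama.h C API names (llama_model_load_from_file, llama_init_from_model, llama_model_free); the exact entry points differ between llama.cpp revisions, so treat it as a sketch rather than a verified reproduction.

    // Outline of the repeated context init/free pattern from the leak report.
    // API names are assumptions based on recent llama.h revisions and may
    // differ in older or newer builds.
    #include <cstdio>
    #include "llama.h"

    int main(int argc, char ** argv) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
            return 1;
        }

        llama_backend_init();

        llama_model * model = llama_model_load_from_file(argv[1], llama_model_default_params());
        if (!model) {
            std::fprintf(stderr, "failed to load model\n");
            return 1;
        }

        // Repeatedly create and destroy contexts; resident memory should stay
        // roughly flat across iterations if nothing leaks in the backend.
        for (int i = 0; i < 1000; ++i) {
            llama_context * ctx = llama_init_from_model(model, llama_context_default_params());
            if (!ctx) {
                std::fprintf(stderr, "context init failed at iteration %d\n", i);
                break;
            }
            llama_free(ctx);
        }

        llama_model_free(model);
        llama_backend_free();
        return 0;
    }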

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. Feature Request: Qwen3-Next-80B-A3B support
    • Toxicity Score: 0.65 (Rapid escalation, confrontational language, accusations)
    • This GitHub conversation begins with the original poster submitting a feature request and providing detailed context. A respondent quickly challenges the poster's claim of having searched for existing issues, labeling them a "liar" and linking to a related issue, which introduces a confrontational tone. The original poster or another participant replies defensively, pointing out the search did not yield results and suggesting the accuser should be more careful. The exchange is brief but marked by direct accusations and defensive responses, indicating tension triggered by perceived dishonesty and misunderstandings about search thoroughness.

III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 28

Key Open Pull Requests

1. devops: add s390x containers: This pull request introduces dedicated s390x architecture Dockerfiles for Llama.cpp, separating them from the original CPU Dockerfile due to additional dependencies, while postponing the inclusion of the zDNN backend Dockerfile and delaying GitHub Actions workflow builds until the required Ubuntu 24.04 s390x image is available.

  • URL: pull/15915
  • Merged: No
  • Associated Commits: bdcbc, 75846, 955c4, ce7bd, e53e1, e172b, 23d34, a0701, 2ff66, 28b41, 3a09c, 74767, 7027c, c3ab7, 451ac, b23e7, 944ef, ab79c, 10714, f6baa, a0b22, 489e0, 17a99, 244d6, 0a766, bff18, 73679, 0084c, 03e64, a1912, 43869, ffcc7, 234ee, bd87c, b2053, ec0f5, 7dd0d, 4da28, 9d01f, 67cf8, 4d380, a7432, 2e78a, 712d7, 4d79a, ff41f, a6d85, fd6ca, 2258d, 3be29, b3d39, 98477, 55732, 4f65e

2. [DRAFT] devops: add s390x CI: This pull request is a draft aimed at adding continuous integration (CI) support for the s390x architecture using IBM Actions on POWER and Z s390x Runner images, including efforts to address build warnings and tokenizer differences related to endianness, while also serving as a testing ground for GitHub Actions due to limited access to the required runner images.

  • URL: pull/15925
  • Merged: No
  • Associated Commits: 8a832, 7a740, 542d5, 92c6a, 17ef8, 8fdbf, 4b4a3, 1e1bc, f6135, b8ae9, 8cb66, 9a9b1, a6ec1, d9699, 3f8db, 00f20, e0634, 55680, 91790, 51881, b67a6, ec199

3. SYCL: add COUNT_EQUAL operator support: This pull request adds full SYCL backend support for the COUNT_EQUAL operator by implementing the operator in the codebase, updating dispatch and registration files, regenerating documentation to reflect support for multiple data types, and verifying functionality through comprehensive testing.

  • URL: pull/15937
  • Merged: No
  • Associated Commits: f773e, 99013, bf95c, 994d7, 6dd96, ea576, b9d45, 53d9b, d35ba, ef875, 8526e, 5010a

Other Open Pull Requests

  • Timestep Embedding Kernel Fixes: This topic covers fixes for zero padding issues in timestep embedding kernels across multiple backends by correcting indexing to properly initialize all padding elements. These changes ensure consistency with previous CPU backend fixes and improve kernel correctness.
    • [pull/15932]
  • Metal Backend Memory Management: Pull requests in this area remove the use of memory pools in the Metal backend by expanding destination tensors to allocate sufficient scratch space, addressing synchronization and memory range logic concerns. This approach replaces MTLHeap buffers to simplify memory handling.
    • [pull/15966]
  • GLM-4.5 Tool Calling Enhancements: These pull requests introduce grammar-constrained tool-call outputs and streaming parsing for GLM-4.5, improving real-time inference responsiveness. The work builds on prior contributions and plans further improvements in template compatibility and unit testing.
    • [pull/15904]
  • Norm Operation Optimization: This topic includes replacing loop-based summation with vectorized functions and using the Accelerate framework for variance computation, along with new intrinsic-based variance functions. Performance tests demonstrate significant speedups on AVX2 Intel CPUs.
    • [pull/15953]
  • ROCM Release Updates: Pull requests here add support for new gfx architectures, update the macOS target version, and include additional hipblaslt.dll assets in the ROCM releases. These changes enhance hardware compatibility and release completeness.
    • [pull/15972]
  • ggml-zdnn Backend Improvements: These changes remove user mapped buffers to allow weight tensors to be processed through .set_tensor, enabling zDNN to create zTensor objects during setup. A missed buffer free for extra buffers like bias is also fixed.
    • [pull/15965]
  • CUDA Kernel Micro-Optimizations: This topic covers micro-optimizations in the CUDA kernel mmf.cuh for the mul_mat_id function, improving memory coalescing and warp utilization. The changes also address register pressure and occupancy issues observed during kernel optimization.
    • [pull/15926]
  • Bug Fixes for Variable Initialization: This pull request fixes an uninitialized variable is_on_grid in the quantize function to prevent potential errors by properly initializing it to true.
    • [pull/15928]
  • CANN Backend Device Switch Optimization: This fix adds a check to skip redundant device switches in the CANN backend, preventing unnecessary context switches and maintaining thread/device consistency.
    • [pull/15935]
  • Memory Usage Reporting Feature: A new feature prints a detailed memory usage breakdown on program exit, showing allocations by device and memory category. This aids efficient model distribution and future automated optimization of model placement.
    • [pull/15860]
  • Code Formatting for ggml-cann: This pull request applies clang-format to the ggml-cann source code to improve code style consistency.
    • [pull/15863]
  • LoongArch Compilation Fixes: These changes fix compilation errors on LoongArch by adding explicit type conversions and reorganizing LSX-implemented functions into correct macros.
    • [pull/15864]
  • OpenCL get_rows Parameter Support: This update adds support for the ne3 parameter in the OpenCL get_rows function, enhancing its functionality as previously tested.
    • [pull/15866]
  • Model Conversion and Testing Enhancements: These changes add support for passing a prompt file to model conversion targets and scripts, update embedding info printing, and facilitate sliding window testing with variable-sized input files.
    • [pull/15871]
  • Mirostat Sampler Documentation Update: This pull request corrects the descriptions of Mirostat sampler function parameters in the documentation to better reflect understanding.
    • [pull/15885]
  • OpenCL pad_ext Support: This proposal extends the OpenCL implementation to support the pad_ext function, enhancing padding capabilities.
    • [pull/15888]
  • iOS Build Process Improvement: This fix prevents tool installation during iOS builds by addressing a CMake configuration issue related to missing BUNDLE artifact kind, ensuring compatibility when common and tools builds are enabled.
    • [pull/15903]
  • Sequence Length Limit Increase: This change increases the maximum sequence length limit from 64 to 256 without significant performance impact, suggesting future refactoring for dynamic configuration.
    • [pull/15916]
  • CI macOS Job Label Update: Continuous integration jobs named macos-latest* are updated to use the macos-latest label instead of explicit macOS versions to resolve job failures caused by version inconsistencies.
    • [pull/15938]
  • Windows ARM64 Concat Crash Fix: This fix resolves a crash on Windows ARM64 with Adreno GPUs by ensuring data type consistency between host code and kernels using cl_long, preventing CL_INVALID_ARG_SIZE errors.
    • [pull/15944]
  • llama-bench Feature and Refactor: This update adds --n-cpu-moe support to llama-bench, fixes markdown table output, and refactors duplicated regular expressions into helper functions.
    • [pull/15952]
  • CUDA im2col_3d Kernel Fix: This pull request fixes the CUDA im2col_3d kernel to handle non-contiguous input views by using true strides from the input tensor, resolving test failures and maintaining CPU backend correctness.
    • [pull/15956]
  • CUDA PAD_REFLECT_1D Kernel Optimization: This optimization eliminates loops in the CUDA PAD_REFLECT_1D kernel by assigning each thread a single value, improving memory bandwidth by 1% to 11% and adding test cases for validation.
    • [pull/15957]
  • Resumable Downloads for llama-server: This feature adds resumable downloads by detecting partial files, verifying HTTP Range support, and handling byte-range requests with append mode file operations. A generic libcurl sketch of this resume pattern appears after this list.
    • [pull/15963]
  • Server Crash Fix for Thinking Feature: This fix prevents crashes by enabling the "thinking" feature only when the --jinja option is used, avoiding failures from minja parsing issues while maintaining expected behavior.
    • [pull/15967]
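
The resumable-download item above follows a standard libcurl pattern: check how many bytes of the partial file already exist, request the remaining byte range, and append to the file. The sketch below is a generic illustration of that pattern, not the pull request's code; the URL and file name are placeholders.

    // Generic libcurl resume sketch: continue a partial download by requesting
    // the remaining byte range and appending to the existing file. Not the
    // llama-server implementation; URL and path are placeholders.
    #include <cstdio>
    #include <sys/stat.h>
    #include <curl/curl.h>

    int main() {
        const char * url  = "https://example.com/model.gguf";  // placeholder
        const char * path = "model.gguf.partial";              // placeholder

        // How much of the file do we already have on disk?
        curl_off_t already = 0;
        struct stat st;
        if (stat(path, &st) == 0) {
            already = (curl_off_t) st.st_size;
        }

        FILE * out = std::fopen(path, "ab");  // append: resumed bytes go after the existing ones
        if (!out) {
            std::perror("fopen");
            return 1;
        }

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL * curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);             // default write callback writes to this FILE*
        curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, already); // sends "Range: bytes=<already>-"

        const CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK) {
            std::fprintf(stderr, "download failed: %s\n", curl_easy_strerror(rc));
        }

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        std::fclose(out);
        return rc == CURLE_OK ? 0 : 1;
    }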

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 48

Key Closed Pull Requests

1. Vulkan perfetto: This pull request introduces support for Perfetto profiling in both CPU and Vulkan backends of the project, enabling in-process tracing that generates Perfetto trace binaries during inference to help visualize and analyze performance, with detailed integration including start/stop tracing calls, event annotations, token rate counters, GPU timeline emission, and periodic GPU statistics printing and trace flushing.

  • URL: pull/15859
  • Merged: No
  • Associated Commits: d968e, 7efc6, b0a9c, 14ff9, 0bcbe, a5d75, 22506, 878fe, d68c6, 42e98, 26f08, e3556, 6164f, f580b, e1602, c369c, 44da5, c86af, 7c540, 47bd9, 69886, 696f1, fad54, a31a3, 883ad, 7c21e, 02ae1, 31ee6, f6b06, a7c19, bf7de, c18fc, b7206, 51bf2, 14a48, 8aa1f, 6300a, 723f7, c2e8b, 436e6, 0cf27, 6779e, 6e598, 15010, 83a41, 8f189, 2cb9b, 9afd4, d0f62, d4617, 5569d, 375e6, 6545c, a62ab, 31850, 30238, e614a, ff764, 34e73, 491d7, ed6f5, 6efef, 8c1c4, f02d0, abbac, 88190, 56e20, 1bf21, ebbb8, 94920, f192a, b883c, a99f4, 64bd6, a6444, 3e927, add79, 077fd, b6e1f, f18fe, 2ed21, d7302, 80840, 958f1, 6b4a4, 53c9c, 43d92

2. Extend the support of T5 models with different encoder-decoder layers: This pull request extends the llama.cpp implementation to support T5 models with differing numbers of encoder and decoder layers, enabling compatibility with T5 variants that previously failed to run, similar to the support provided by Hugging Face transformers.

  • URL: pull/15909
  • Merged: Yes
  • Associated Commits: 33163, 219ea, 2161c, 284ce, 77f0f, 7efe5, 12a90, 634e5, ebef5, 0acda, 19281, 51530, 60821, de463, 804a9, 92150, 11672, 678aa, d145e, ce90f, 01002, 3ee21, 6cb51, 42f1f, 69406, f16d8, 84e5d

3. metal : make the backend async v2: This pull request implements an asynchronous version of the Metal backend for the project, introducing support for both shared and private GPU buffer types with a global command queue per device to reduce CPU-GPU synchronization overhead, thereby improving performance on discrete GPUs while maintaining host memory allocation for Apple Silicon devices to minimize text generation slowdowns.

  • URL: pull/15906
  • Merged: Yes
  • Associated Commits: 97b96, c5637, bdff7, d91ba, 85aaf, 7fc2b, f2882, 0926c, 7b59f, 04084, afd95, e65c5, 52375, 9248a, 5fdce, c9a5b, e796f

Other Closed Pull Requests

  • Metal backend optimizations and concurrency improvements: Multiple pull requests enhance the Metal backend by introducing concurrent execution of graph operations through memory interval tracking and graph reordering, dynamic inference-time compilation of Metal kernels for better specialization, and refactoring kernel binary loading with memory leak detection improvements. These changes collectively improve performance and stability across various models and tests.
    • pull/15929, pull/15857, pull/15964
  • CUDA kernel fixes and performance enhancements: Several pull requests address CUDA kernel issues by fixing unsupported conditions in supports_op, integrating fastdiv and fastmodulo optimizations for better performance on Ada Lovelace and Blackwell GPUs, extending FlashAttention kernel support for AMD GPUs with FP32 to FP16 conversion, and fixing segmentation faults related to PAD implementation. These updates improve reliability and speed on CUDA and AMD platforms.
    • pull/15868, pull/15872, pull/15927, pull/15869, pull/15912, pull/15861
  • Vulkan backend fixes and improvements: Pull requests fix failing dequantization shaders, address out-of-bounds memory accesses in soft_max_back, and overhaul the Vulkan backend to detect integrated and discrete GPUs by adding PCI ID support and fixing compiler warnings. These changes enhance Vulkan backend stability and device management.
    • pull/15862, pull/15858, pull/15947
  • Graph execution reordering and concurrency in Vulkan and Metal: Two pull requests introduce graph node reordering to group independent nodes for parallel execution, reducing synchronization overhead and improving performance in both Vulkan and Metal backends. This approach enables more efficient utilization of hardware resources.
    • pull/15951, pull/15929
  • EmbeddingGemma model support and configuration fixes: Pull requests add support for the GEMMA_EMBEDDING architecture to run Google's embedding models and modify parameter setting to ensure the sliding_window parameter is read directly from the original model config, maintaining consistency during model conversion. These changes enable proper embedding model usage and configuration accuracy.
    • pull/15918, pull/15867
  • Code modernization and cleanup: A pull request replaces deprecated NULL with nullptr to improve code clarity and safety, while another cleans up s390x SIMD Vector Intrinsics syntax and improves horizontal summation performance without adding new features. These updates modernize and optimize the codebase.
    • pull/15851, pull/15855
  • Model conversion and debugging enhancements: One pull request adds BF16 support and enables dumping intermediate steps during model conversion for better debugging, while another improves test coverage reporting by filtering no-op operations and refining sub-operation handling. These improvements aid development and testing accuracy.
    • pull/15877, pull/15900
  • Bug fixes in tensor handling and context module: Pull requests fix a non-contiguous tensor issue by changing reshape_4d to view_4d to prevent assertion errors across multiple models and address an issue with the n_outputs parameter during the reserve operation in the context module. These fixes improve stability and correctness.
    • pull/15908, pull/15858
  • ROPE implementation optimization: A pull request introduces a sin/cos caching mechanism in the ROPE implementation to avoid redundant computations across layers by caching on the first layer per device and reusing it when parameters match, accelerating multi-layer performance while ensuring correctness. A generic sketch of this caching idea appears after this list.
    • pull/15912
  • CI workflow and contributing guidelines improvements: One pull request enhances the continuous integration workflow by adding caching and symbolic link management to prevent ROCm installation hangs and errors, while another extends contributing guidelines with notes on merging others' pull requests to improve collaboration.
    • pull/15887, pull/15881
  • SYCL extension revert due to memory leak: A pull request reverts the addition of the SYCL enqueue_functions extension because of a significant memory leak observed during inference and workload tests on oneAPI versions 2025.0 to 2025.2, with plans to restore it once fixed in a future SYCL release.
    • pull/15910
  • New media assets added: A pull request adds new SVG and PNG icons based on the llama1-icon.svg to the project repository, enhancing the project's visual resources.
    • pull/15878
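
The RoPE caching idea summarised above, computing the sin/cos tables once and reusing them while the rotation parameters stay the same, can be sketched generically. The cache key and layout below are illustrative assumptions and not the pull request's actual data structures.

    // Illustrative sketch of a RoPE sin/cos cache: tables are recomputed only
    // when the (n_pos, n_dims, freq_base) parameters change, and are reused
    // across layers otherwise. Key and layout are assumptions, not the PR code.
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct rope_cache {
        const std::vector<float> & sins(int n_pos, int n_dims, float freq_base) {
            ensure(n_pos, n_dims, freq_base);
            return sin_v;
        }
        const std::vector<float> & coss(int n_pos, int n_dims, float freq_base) {
            ensure(n_pos, n_dims, freq_base);
            return cos_v;
        }

    private:
        int   c_pos  = 0;
        int   c_dims = 0;
        float c_base = 0.0f;
        std::vector<float> sin_v, cos_v;  // n_pos * (n_dims / 2) entries each

        void ensure(int n_pos, int n_dims, float freq_base) {
            if (n_pos == c_pos && n_dims == c_dims && freq_base == c_base) {
                return;  // cache hit: same parameters as the previous layer
            }
            c_pos = n_pos; c_dims = n_dims; c_base = freq_base;
            const int half = n_dims / 2;
            sin_v.assign((size_t) n_pos * half, 0.0f);
            cos_v.assign((size_t) n_pos * half, 0.0f);
            for (int p = 0; p < n_pos; ++p) {
                for (int i = 0; i < half; ++i) {
                    // Standard RoPE angle: theta = pos * base^(-2i / n_dims)
                    const float theta = p * std::pow(freq_base, -2.0f * i / n_dims);
                    sin_v[(size_t) p * half + i] = std::sin(theta);
                    cos_v[(size_t) p * half + i] = std::cos(theta);
                }
            }
        }
    };

    int main() {
        rope_cache cache;
        // The first layer fills the tables; later layers with identical parameters reuse them.
        for (int layer = 0; layer < 4; ++layer) {
            const std::vector<float> & s = cache.sins(/*n_pos=*/8, /*n_dims=*/64, /*freq_base=*/10000.0f);
            std::printf("layer %d: %zu cached sin entries\n", layer, s.size());
        }
        return 0;
    }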

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits   Pull Requests   Issues   Comments
ggerganov          122       15              0        33
taronaeo           133       6               2        26
CISC               44        6               0        86
danbev             71        13              1        8
pwilkin            58        2               4        29
jeffbolznv         46        8               1        34
JohannesGaessler   36        6               0        39
EAddario           61        1               0        0
ngxson             39        3               0        13
slaren             20        3               0        26
