Weekly GitHub Report for Llama.cpp: August 11, 2025 - August 18, 2025 (12:03:33)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced functionality and performance optimizations. Notable highlights include streamlined features aimed at improving user experience and system efficiency.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: Jinja fails on gpt-oss-120b when using Vulkan: This issue reports a deterministic failure when running the gpt-oss-120b model with the Vulkan backend and the --jinja flag enabled, resulting in the model outputting repeated characters like "GGGGGG...". The problem appears to be hardware and backend specific, primarily affecting AMD Strix Halo (gfx1151) GPUs, and is linked to numerical instability or precision issues in Vulkan's Flash Attention and matrix multiplication implementations, with some users also observing repeated "dissolution" tokens and NaNs during evaluation.
- The discussion explores various reproduction attempts, hardware configurations, and backend settings, revealing that disabling certain Vulkan CoopMat extensions or forcing f32 accumulators can mitigate but not fully resolve the issue; users share debugging insights including NaN detection, differences between GPU models, and the impact of offloading computations to CPU. Multiple participants confirm the problem is not model-specific but tied to Vulkan backend interactions on AMD hardware, and ongoing patches and tests aim to address precision and stability problems in the attention and matrix multiplication code paths.
- Number of comments this week: 59
- Eval bug: Possible CUDA syncronization bug between devices.: This issue reports a suspected CUDA synchronization bug occurring when running large-context language models on multiple NVIDIA RTX 3090 GPUs using the CUDA backend. The user experiences two main problems: repeated gibberish output (notably long sequences of "GGGG...") that eventually leads to CUDA illegal memory access errors, and intermittent freezes during prompt processing where one GPU is fully utilized while the other remains idle, suggesting a synchronization issue between devices.
- In the comments, the user and others discuss various troubleshooting steps including disabling flash attention, changing batch sizes, and running with CUDA debug tools like compute-sanitizer memcheck. Logs reveal invalid memory accesses and CUDA launch failures primarily linked to flash attention kernels. Attempts to reproduce the issue without flash attention reduce crashes but do not eliminate gibberish output. The problem appears random and difficult to reproduce consistently, with some indication it may be related to CUDA driver, compiler, or hardware environment rather than the code itself. Testing with an earlier software version showed fewer freezes but the gibberish issue persisted. Overall, the discussion centers on isolating the root cause through detailed logs, debug builds, and configuration changes.
- Number of comments this week: 32
- Feature Request: Support for Activated LoRA: This issue requests the addition of support for Activated LoRA (aLoRA) adapters in llama.cpp, which enable more efficient multi-turn inference by allowing the model to reuse the base model's KV cache up until the point the adapter is activated, significantly reducing the time to first token (TTFT). The feature aims to complement ongoing work in vLLM and relies on recognizing activation prompts within the input to selectively apply adapter weights only after the invocation, improving performance in agentic and retrieval-augmented generation (RAG) scenarios.
- The discussion clarified that while llama.cpp already supports hot-swapping LoRA adapters and retains KV caches, aLoRA requires more nuanced handling to activate adapter weights only after a specific prompt, which is not trivial in a stateless server context. Participants explored potential implementation strategies, including modifying server logic to track activation tokens and possibly performing dummy token generations to prefill caches, but agreed that full support depends on upstream changes in the Huggingface PEFT repository and deeper integration for positional masking; an early PR has been opened to begin addressing these challenges.
- Number of comments this week: 12
- Feature Request: --n-cpu-moe option for multi GPU?: This issue requests an enhancement to the --n-cpu-moe option to better support multi-GPU setups by allowing offloading of MoE layers from each GPU individually or specifying multiple --n-cpu-moe parameters for different layer ranges. The motivation is that the current implementation only offloads the first layers to CPU, which is not optimal for multi-GPU configurations where offloading from each GPU separately would improve performance and flexibility.
- The comments provide detailed explanations and examples on how to currently use --tensor_split combined with --n-cpu-moe to manually distribute layers and MoE experts across multiple GPUs, highlighting the complexity and trial-and-error involved (a hypothetical invocation sketch follows this list). Users discuss the challenges of predicting layer distribution, share strategies for optimizing GPU memory usage, and express interest in a more intuitive option like --n-gpu-moe to specify MoE layers per GPU, while also touching on related issues such as CUDA backend bugs and context device assignment.
- Number of comments this week: 11
- Misc. bug: Massive generation slowdown when disabling top-k sampling (--top-k 0): This issue reports a significant performance degradation in token generation speed when disabling top-k sampling (setting --top-k to 0) in llama.cpp, observed both via the command line interface and Python bindings. The user demonstrates that sampling time increases by roughly 100 times because without top-k, the sampler processes the entire vocabulary, causing a massive slowdown during generation.
- Commenters explain that disabling top-k sampling causes the next sampler to run on the full vocabulary, which is inherently slow, and suggest that top-k is generally necessary to maintain performance. The original poster clarifies their use case involving audio codec vocabularies where top-p or min-p sampling without top-k is preferred for diversity, and a workaround is found by applying min-p before top-p and disabling top-p to avoid sorting overhead, significantly improving speed (a sketch of this configuration follows this list). Additional discussion notes that skipping top-k is not recommended as it bypasses efficient sampling steps, and some users reference external recommendations that conflict with this behavior.
- Number of comments this week: 7
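As noted in the --n-cpu-moe item above, the commenters' current approach is to combine a tensor split with CPU-side MoE offload. The sketch below is a hypothetical invocation, wrapped in Python only to keep it self-contained; the model path, the even 1,1 split across two assumed GPUs, the layer count of 10, and the --tensor-split spelling of the flag are illustrative assumptions, not values taken from the discussion.

```python
# Hypothetical sketch for the multi-GPU MoE offload discussion above.
# Model path, split ratios, and layer count are placeholders; the issue notes
# that finding a good distribution is currently a trial-and-error process.
import subprocess

cmd = [
    "llama-server",
    "-m", "path/to/model.gguf",  # placeholder model path
    "-ngl", "99",                # offload as many layers as possible to the GPUs
    "--tensor-split", "1,1",     # split layers evenly across two (assumed) GPUs
    "--n-cpu-moe", "10",         # keep the MoE experts of the first N layers on the CPU
]
subprocess.run(cmd, check=True)
```

Because --n-cpu-moe only affects the first layers, the CPU-held experts come out of whichever GPU holds those layers, which appears to be the imbalance the issue asks a per-GPU option to address.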
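For the top-k slowdown item, the sketch below illustrates the reported workaround of relying on min-p alone, with top-k and top-p both disabled. It assumes the llama-cpp-python bindings mentioned in the issue; the model path, prompt, and threshold values are placeholders rather than settings from the report.

```python
# Minimal sketch of the sampling workaround discussed above, assuming the
# llama-cpp-python bindings; model path, prompt, and thresholds are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")  # placeholder path

out = llm.create_completion(
    "Hello",
    max_tokens=32,
    top_k=0,          # top-k disabled, as in the original report
    top_p=1.0,        # top-p effectively disabled, avoiding the full-vocabulary sort
    min_p=0.05,       # illustrative min-p threshold (filters relative to the top token)
    temperature=0.8,
)
print(out["choices"][0]["text"])
```

The speedup reported in the discussion comes from min-p filtering against the single most probable token, which does not require sorting the whole vocabulary the way top-p does.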
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 500 days, indicating a persistent and unresolved bug affecting this particular implementation.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in debugging and performance analysis. The user is working on improving the Metal backend in a related project and seeks a documented or known method to produce the type of GPU trace output similar to what is provided by Apple's Metal debugger.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress indicators when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status displays during parallel downloads.
- kubernetes example: This issue discusses the creation of a Kubernetes example for deploying the llama.cpp server using a Helm chart, aiming to facilitate scalable application deployment within the community. The original poster has begun work on this example and invites contributions from others to help advance the project when they have more time.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 38
Summarized Issues:
- Model conversion errors: Several issues report failures during model conversion processes, including errors mapping tensor names, unsupported model architectures, and tensor dimension mismatches that cause crashes or prevent successful conversion. These problems affect GPTQ quantized models, GLM-4.5V vision models, and GPT-OSS to GGUF format conversions, indicating challenges in handling diverse model formats and architectures.
- CUDA and GPU runtime errors: Multiple issues describe CUDA-related crashes, illegal memory accesses, and driver shutdowns occurring during inference, cleanup, or multi-GPU usage, often involving specific hardware like RTX 3090 or RTX A6000 GPUs. These errors include synchronization problems, kernel crashes, and failures in CUDA context destruction, highlighting instability in GPU resource management and kernel execution.
- Model output and reasoning bugs: Several issues report problems with model output formatting, reasoning budget parameters not working, and repeated or garbled output characters, especially when using flags like --jinja or running on specific backends like Vulkan. These bugs affect the correctness and usability of model responses and reasoning capabilities.
- Performance and optimization requests: There are requests and reports related to improving performance, including Metal3 support for Mac GPUs, multi-GPU MoE layer offloading enhancements, AVX512 vectorized loading optimization, and addressing slowdowns caused by disabling top-k sampling. These highlight ongoing efforts to optimize hardware utilization and inference speed.
- Code quality and build issues: Several issues describe compilation failures, warnings, and code style inconsistencies, including a sudden GCC compilation error on Ubuntu 24.04, unreachable code warnings, inconsistent macro indentation, and linker errors with Intel oneAPI builds. These indicate challenges in maintaining cross-platform build stability and codebase cleanliness.
- Feature requests for usability and functionality: Users request new features such as specifying download locations to manage disk space, adding keyboard shortcuts in the web interface, supporting new model architectures like EAGLE3 and GLM-4.5V, adding local speech-to-text endpoints, and integrated model file authenticity verification. These requests aim to enhance user experience and expand functionality.
- Server and API usage bugs: Issues report bugs in the llama-server module related to improper message formatting with channel tags and failures when combining tool choices with reasoning, causing errors or reasoning failures. These problems affect the correct operation of chat completions and message handling in the server environment.
- Security and false positive alerts: One issue reports a false positive detection by Microsoft Defender flagging the CUDA build of the llama binary as a Trojan, which may cause user concern despite being a benign detection.
- Shader and backend compatibility problems: There is a reported shader compilation failure with shaderc version 2025.2 for Vulkan backend due to bfloat16 support issues, requiring workarounds like disabling optimization or downgrading shaderc. This indicates compatibility challenges with newer shader compiler versions.
- Prompt cache and abort handling bugs: An assertion failure crash occurs when loading cached prompts due to improper handling of sequence ID values, and aborting prompt processing behaves inconsistently depending on the client used, with JavaScript AbortController failing to stop prompt processing properly. These bugs affect prompt cache reliability and request cancellation behavior.
- Model loading and evaluation performance discrepancies: The OpenAI gpt-oss 120b model loads significantly slower on RTX 3080 Ti compared to similar models, and evaluation outputs sometimes produce repeated or garbled characters across multiple models and backends, indicating performance and stability issues during model loading and evaluation.
- CI and Docker build failures: The continuous integration process fails due to an invalid pip option in the CUDA Docker image build, causing the build to exit with an error and blocking automated testing and deployment.
- Model evaluation and fine-tuning crashes: Crashes occur during fine-tuning with assertion failures in graph construction and runtime errors during evaluation with unexpected content exceptions, indicating bugs in model training and evaluation pipelines.
- Model compatibility inquiries: There is a question about whether llama.cpp supports running the CosyVoice2 model, reflecting user interest in expanding supported models.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 12
Summarized Issues:
- GPU Compatibility and ROCm Issues: Multiple issues highlight problems with GPU compatibility and ROCm support, including vector addition failures on AMD GPUs due to invalid device functions, build system failures to detect ROCm installations causing CPU-only binaries, and containerized applications failing to detect GPUs due to missing CUDA libraries. These problems emphasize challenges in configuring ROCm versions, GPU architecture overrides, and proper environment setups for GPU acceleration.
- issues/15202, issues/15245, issues/15299
- Performance Limitations and Optimization: Several reports focus on performance issues such as slow prompt processing on GPUs with FP16 computations, suboptimal sampling settings causing low GPU utilization and slower throughput in server evaluations, and Vulkan backend memory allocation failures impacting performance. These issues reveal the need for better backend support, tuning of parameters like top-k, and improved documentation to optimize GPU usage.
- issues/15233, issues/15256, issues/15297
- Segmentation Faults and Crashes: There are multiple segmentation fault issues including crashes during long GPT-OSS:20B model evaluations caused by problematic regex handling, and crashes triggered by calling chat template functions immediately after model loading with certain models. These faults point to bugs in code handling and timing of function calls that require careful fixes to prevent runtime failures.
- issues/15283, issues/15345
- Build and Environment Configuration Failures: Issues include failures in the Xcode build step due to missing iOS and visionOS SDKs after macOS image updates, and module import errors during model conversion caused by missing Python dependencies. These highlight the importance of maintaining up-to-date build environments and dependency management to ensure successful builds and conversions.
- issues/15322, issues/15268
- API Usage and Input Validation Errors: One issue describes server errors caused by invalid image URL values in multimodal API requests, resulting from incorrect overriding of URL fields. This underscores the need for proper input validation and error handling in API implementations to avoid server-side failures.
- issues/15349
- User Input and Response Quality Issues: A reported problem involves AI responses containing progressively repeated words due to a user mistake, illustrating how input errors can degrade output quality and cause confusion before resolution.
- issues/15311
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 29
Key Open Pull Requests
1. Add OpenVINO backend: This pull request introduces the OpenVINO backend to the Llama.cpp project, adding support for OpenVINO operators, integrating OpenVINO frontend conversion and inference capabilities, and enhancing the GGML backend with OpenVINO-based implementations to enable model compilation and execution on OpenVINO-supported devices.
- URL: pull/15307
- Merged: No
- Associated Commits: 355da, 4453f, e5035, 1f060, 22aac, 75e55, f1821, 29024, 46a56, 6bf6b, 3c12e, cf295, 870cd, 69422, 4fa11, 77c74, 11edf, 98c93, e6729, d48cb, d8c05, 8715f, e8cf0, 9711f, d88e9, 30235, a59a1, d9e83, 5e336, 9667b, 3e53c, 568cb, 7e1f1, 57c7e, 6c17b, d6f72, e0e19, 4c377, ed366, 980e1, 6c897, 9b21a, f660c, 6deef, bf7a0, 55ba7, 64790, 7fef0, 04d54, 11da0, 8b1b7, 4ca42, dfec8, 500c0, 34831, 8fc4d, 243a5, 6cef8, 832e0, c2178, e81fc, 93279, f844e, 7141a, ef9bb, 30c95, 91503, aa276, e0399, e87de, f6611, 56bce, 61fda, 1dd47, 89b89, db205, ea202, 63871, ae8ea, 23102, 23c35, d0aa8, 83c0e, 84825, fb8a3, d3be3, 45ac2, 4e3d5, 033db, 3ed5d, 731d9, 46962, b8333, 49afa, 04030, 7e1ad, 8e16c, 30f51, 6d05d, 1aad0, ed2ef, 263d4, 1f843, b0277, 3d8d0, f15d5, 3eba1, c7d41, aa9e0, 93b5c, fef37, ae847, 4292a, dcf7c
2. Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2): This pull request introduces a block interleaving approach for Q6_K quantization optimized for x86 SIMD architectures (AVX512/AVX2), including implementations of GEMM and GEMV functions, weight rearrangement to a Q6_Kx8 format, and demonstrates significant performance improvements on the llama2 7B model with detailed benchmarking on supported CPU features.
- URL: pull/15275
- Merged: No
- Associated Commits: 58d53, a6ac7, f1983, 8938a, 3afe0, 33f64, a75e0, 161d7, 1f784, 9b755, e3baf, d2c9e, d0806, e7f30, ff335, ecf01, 3b3d5, 6daa6, 6e541, 0afe1, 50d33
3. Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example.: This pull request integrates Apple Neural Engine (ANE) acceleration into the llama.cpp project using MiniCPM-V 4.0 as an example, adding a build option to enable CoreML support, introducing a new command-line interface for ANE model usage, and providing benchmark results demonstrating improved performance on Apple M2 and M4 devices.
- URL: pull/15262
- Merged: No
- Associated Commits: f37f8, 220ad, 88988, 999a8, 4d32f, 8775b, 2e7bc, d4f0c, 13efc, 93f40, ea691, e4688, 2fd1e, 864d0, 54258, 629b6, 701af, 0042f, 9eee5, fd64e
Other Open Pull Requests
- Backend performance optimizations: Multiple pull requests enhance backend performance across various hardware and frameworks. These include Vulkan backend improvements with loop unrolling, larger workgroups, and branchless code for quantization, ARM64 SVE instruction optimizations for GEMM kernels, CANN backend rope operator throughput improvements, and disabling MMA on Turing GPUs to avoid inefficient tensor core usage.
pull/15281, pull/15355, pull/15360, pull/15335, pull/15357, pull/15363
- CUDA and GPU support enhancements: Several pull requests add or improve CUDA and GPU-related features. These include adding a compile-time flag for 64-bit tensor support in CUDA, introducing bfloat16 CUDA support in Flash Attention, adding pre-built CUDA-compatible binaries for Ubuntu, and optimizing prompt offloading to GPU by reducing VRAM data transfer.
pull/15298, pull/15261, pull/15249, pull/15346, pull/15331, pull/15217
- Vulkan backend improvements: Multiple pull requests focus on Vulkan backend updates to improve performance and maintainability. These include updating the Vulkan SDK installation method, optimizing rms_norm with loop unrolling and fusion, improving mul_mat_vec with larger workgroups and subgroup instructions, disabling spirv-opt for bfloat16 shaders, and optimizing argsort with reduced branching and shared memory usage.
pull/15282, pull/15281, pull/15258, pull/15346, pull/15352
- Normalization and fused kernel support: This topic covers adding fused kernels for group normalization and related operations, as well as fixing allocation size issues in the CANN backend's rms_norm function. These changes address limitations in fused operation detection and ensure compliance with backend documentation.
pull/15314, pull/15312
- LoRA and adapter support: One pull request introduces Activated LoRA (aLoRA) support in the llama-server and GGUF format for LoRA adapters, enabling efficient hot-swapping and multi-adapter model execution with dynamic add-on features.
pull/15327
- Speculative decoding and multi-token prediction: A draft pull request explores a GLM-style Multi-Token Prediction (MTP) approach for the server, aiming to improve speculative decoding by reusing and adapting existing functionality, with testing on the GLM-4.5 model and addressing KV cache and throughput challenges.
pull/15225
- Code quality and build fixes: Several pull requests address build warnings and errors to ensure smooth CI execution and clean builds. These include fixing a -Werror=return-type warning, resolving unused parameter and cast-qual warnings in the MUSA backend, and updating pip install commands to handle the --break-system-packages flag.
pull/15221, pull/15258, pull/15357
- Model support and reasoning improvements: One pull request updates chat.cpp to fix an issue enabling the qwen3 model to use reasoning with tool_choice set to required, allowing the model to think before tool invocation as per revised grammar.
pull/15248
- New hardware backend support: A pull request adds a new spacemit backend optimized for the SpacemiT X60 CPU, leveraging extended instructions and RVV optimizations to accelerate matrix calculations for specific quantized models.
pull/15288
- Download and file integrity enhancements: One pull request introduces partial download resumption using range headers, sha256 checksum verification, file size validation via head requests, and exponential backoff for retries, and adds test coverage for these features; a generic sketch of these techniques appears after this list.
pull/15217
- Attention output accumulator improvements: A pull request proposes adding F32 accumulators for attention output multiplication in the llm_graph_context to improve consistency and prepare for merging code paths between GLM and other models.
pull/15312
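For the download and file integrity item above, here is a generic sketch of those techniques (a HEAD request for the expected size, a Range header to resume, exponential backoff between retries, and a sha256 check at the end). It is not the pull request's implementation; the URL, destination path, and checksum are placeholders, and the requests library is used purely for illustration.

```python
# Generic sketch of resumable, verified downloads; not the code from the PR.
import hashlib
import os
import time

import requests


def download_with_resume(url: str, dest: str, expected_sha256: str, max_retries: int = 5) -> None:
    for attempt in range(max_retries):
        try:
            # HEAD request to learn the expected file size.
            total = int(requests.head(url, allow_redirects=True, timeout=30).headers["Content-Length"])
            done = os.path.getsize(dest) if os.path.exists(dest) else 0
            if done < total:
                # Resume from the current offset using a Range header.
                with requests.get(url, headers={"Range": f"bytes={done}-"}, stream=True, timeout=30) as r:
                    r.raise_for_status()
                    with open(dest, "ab") as f:
                        for chunk in r.iter_content(chunk_size=1 << 20):
                            f.write(chunk)
            break
        except (requests.RequestException, KeyError, ValueError):
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    else:
        raise RuntimeError("download failed after retries")

    # Verify integrity with a sha256 checksum.
    h = hashlib.sha256()
    with open(dest, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    if h.hexdigest() != expected_sha256:
        raise RuntimeError("sha256 mismatch")
```

A real implementation would also confirm that the server honors the Range request (HTTP 206) before appending, which is omitted here for brevity.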
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 47
Key Closed Pull Requests
1. Replace Jenkins Cloud-V pipeline with GitHub Actions RISC-V native build: This pull request proposes replacing the legacy Jenkins Cloud-V pipeline with a new GitHub Actions workflow specifically designed for native RISC-V builds by removing the .devops/cloud-v-pipeline Jenkins configuration and adding the .github/workflows/build-riscv-native.yml file.
- URL: pull/15287
- Merged: No
- Associated Commits: c24dc, c465e, b5997, 9342a, bb198, b4024, e608c, 58572, 48fc7, d8c92, fd500, 95f4d, fc453, 96d1a, c1d61
2. server : add SWA checkpoints: This pull request adds functionality to the server to create and manage checkpoints of the SWA (sliding window attention) memory cache, allowing for reduced context reprocessing by storing a compact representation of the SWA state after processing prompts, configurable via command-line arguments, and includes updates to the libllama API to support extended state saving and restoration.
- URL: pull/15293
- Merged: Yes
3. Fix HIP warp synchronization function conflicts for ROCm 7.0+: This pull request aims to resolve conflicts in HIP warp synchronization functions caused by the inclusion of rocWMMA headers in ROCm 7.0+ environments, which introduce native 64-bit mask-based functions that clash with existing 32-bit mask compatibility shims, leading to compilation failures.
- URL: pull/15241
- Merged: No
Other Closed Pull Requests
- LiquidAI LFM2-VL vision model support: This pull request adds support for the LiquidAI LFM2-VL family of vision models, implementing dynamic image resolution handling, positional embedding interpolation, and smart image preprocessing. It includes testing across multiple backends, image resolutions, and quantization settings to ensure robustness.
pull/15347
- HIP backend updates and optimizations: These pull requests clean up the hipification header by switching to hip_bf16, simplify RDNA3 defines, lower the switchover to the new hipblas API to ROCm 6.5, and update the HIP backend requirement to ROCm 6.1 for compatibility with Debian stable. The changes ensure better support and compatibility without affecting hardware support.
pull/15285, pull/15296
- Perplexity error handling improvements: This pull request improves error messages related to the -np parameter by clarifying constraints and correcting misleading messages, especially for TruthfulQA, which requires a higher -np due to num_answers. It also updates comments and error messages in perplexity.cpp to replace "eval" with "decode" for consistency.
pull/15303, pull/15227
- GPT-OSS Jinja template fixes and reasoning format support: These pull requests fix exceptions caused by channel tags in GPT-OSS outputs by parsing reasoning content and optionally wrapping it in think tags or separate fields. Additionally, the server is enhanced to accept a reasoning_format parameter in HTTP requests, allowing independent web UI control without CLI arguments.
pull/15230, pull/15238, pull/15243
- Vulkan backend optimizations and fixes: These pull requests implement a Vulkan optimization that fuses multiple add operations common in MoE models, improve the Vulkan performance logger by better accounting for batch dimensions and operand types, and add support for mul_mat_id operations with f32 accumulators including missing bounds checks. They prepare the codebase for future explicit f32 precision requests in shaders.
pull/15252, pull/15246, pull/15337
- OpenCL backend initial mxfp4 support: This pull request introduces initial support for the mxfp4 format in the OpenCL backend by adding reference implementations for matrix-vector multiplication based on Metal kernels. It serves as a baseline for future performance improvements in this backend.
pull/15270
- CUDA graph feature fix: This pull request fixes the issue where the CUDA graph feature was disabled by adjusting capture status checks and completely disabling the capturing check to ensure proper functionality when CUDA graphs are turned off.
pull/15300
- Synchronization with ggml repository: This pull request synchronizes updates from the ggml repository into llama.cpp, including a bug fix for the ggml_conv_1d_dw function and removal of unused includes in tests.
pull/15308
- iOS Xcode build fixes and CI improvements: These pull requests fix the iOS Xcode build by switching the GitHub Actions runner to macos-15 to use Xcode 16.4 and ensure the latest stable Xcode version is selected. They also propose CI workflow changes to download SDKs during the iOS build process without altering the Xcode version or relying on external actions.
pull/15324, pull/15329
- KV cache and mul_mat_id operation fixes: These pull requests fix the removal of tokens from all KV cache buffers when seq_id is -1 and correct the mul_mat_id operation implementation on MUSA devices to pass all tests without errors.
pull/15226, pull/15236
- Continuous integration enhancements: This pull request adds additional Python requirements such as flake8 and pyright to the copilot-setup-steps.yml file to support improved linting and testing in the CI environment.
pull/15289
- Chat example format update: This pull request updates the chat example format to include parameters specified with --chat-template-kwargs in the startup output, allowing users to see the effect of their custom template arguments.
pull/15309
- Optimizer backend support check fix: This pull request fixes the backend support check in tests/test-opt by allocating simple test tensors for optimizer operations to conditionally skip tests when AdamW/SGD optimizer steps are not supported, ensuring correct test execution without complicating ggml_opt_fit.
pull/15317
- NaN detection improvement in evaluation callback: This pull request updates the llama-eval-callback to immediately stop processing upon encountering the first tensor containing a NaN value, enhancing debugging by preventing further computation after such an error is detected.
pull/15320
- Miscellaneous fixes and updates: These pull requests address merge conflicts with a quick fixup, update the Paddler link and description in the README.md, and include a work-in-progress submission titled "Error can be ignored" with a single commit message "WIP."
pull/15229, pull/15222, pull/15234
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 65 | 11 | 1 | 80 |
| wine99 | 115 | 1 | 0 | 0 |
| CISC | 43 | 6 | 0 | 61 |
| taronaeo | 90 | 1 | 1 | 1 |
| jeffbolznv | 32 | 10 | 0 | 49 |
| slaren | 22 | 3 | 0 | 54 |
| zhanmyz | 68 | 0 | 0 | 0 |
| JohannesGaessler | 32 | 2 | 0 | 25 |
| 0cc4m | 9 | 1 | 0 | 42 |
| dbsanfte | 46 | 1 | 0 | 0 |