Weekly GitHub Report for Llama.cpp: November 03, 2025 - November 10, 2025 (12:08:19)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and stability, with notable improvements in user interface responsiveness and security features. These changes reflect a continued focus on optimizing user experience and safeguarding data.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: performance regression in llama-server (ggml-vulkan): This issue reports a performance regression in the llama-server component when using the ggml-vulkan backend, introduced by a recent pull request that enabled unified KV cache slots. The regression manifests as a significant increase in time-to-first-token (TTFT) and inter-token latency (ITL) on Mac systems, while the ggml-metal backend and llama-bench tests remain unaffected.
- The discussion clarifies how TTFT is measured and confirms the regression is linked to the unified KV cache feature in Vulkan, which lacks an optimization present in Metal for skipping fully masked (-INF) blocks (a minimal sketch of that optimization follows this list). Workarounds using command flags restore performance, and contributors share detailed reproduction steps, benchmark results on different GPUs, and ongoing efforts to implement and test shader optimizations to mitigate the slowdown.
- Number of comments this week: 19
- Eval bug: Vulkan not working on Intel GPU: This issue reports a bug where Vulkan backend inference on Intel GPUs using the Mesa driver crashes or produces garbage output when running llama.cpp models. The problem appears after a specific commit and is linked to integer dot product operations in Vulkan, with a workaround involving disabling certain Vulkan fusion or integer dot product features via environment variables.
- The comments confirm the issue affects multiple Intel GPUs on Linux with Mesa Vulkan drivers, showing consistent crashes or incorrect outputs; benchmarking with llama-bench is unaffected. Users share driver versions and test outputs, identify the problem as related to Vulkan integer dot product fusion, and confirm that disabling this feature via environment variables resolves the issue. A related pull request is suggested as a potential fix, and others report the same problem with newer Mesa versions.
- Number of comments this week: 14
- Eval bug: Qwen3-VL-8B freezes on image processing tasks: This issue reports that the Qwen3-VL-8B model freezes during image processing tasks when using recent builds of llama.cpp, specifically after version b6910, while text processing remains unaffected. The user observes significant slowdowns and system lockups on Windows with the Vulkan backend, and experiments indicate that changes introduced around build b6915 and related pull requests have degraded performance and caused instability during image handling.
- The comments identify recent pull requests as the cause of the slowdown and freezing, with suggestions to reduce image token limits to mitigate the issue; however, even with adjustments, performance remains much worse than earlier versions. Multiple users confirm the problem across different platforms and backends, noting that the issue is not widely reported yet but clearly linked to recent code changes.
- Number of comments this week: 9
- Eval bug: [ROCm/HIP] Severe Performance and Stability Regression for MoE Models since 4146d6a1a: This issue reports a severe performance and stability regression on the ROCm/HIP backend for Mixture-of-Experts (MoE) models, introduced by a specific commit that added a CUDA expert reduce kernel. The regression causes a 15-20% drop in prompt processing speed for a 30B MoE model and leads to complete system freezes requiring hard reboots when running larger 120B MoE models on AMD Radeon hardware.
- The discussion confirms the regression is isolated to MoE models and the new expert reduce kernel, with dense models unaffected. Attempts to reproduce the issue on other hardware and models showed no performance loss, suggesting a hardware-specific or kernel interaction problem. A proposed workaround is to disable the kernel fusion for the HIP backend, and the original reporter is willing to test patches. The system crash is suspected to be a bug in the AMD GPU driver, and further investigation is ongoing.
- Number of comments this week: 9
- Misc. bug: Vulkan output is gibberish: This issue reports that when using the Vulkan backend on an Intel Iris Xe GPU, the output generated by llama-cli is gibberish, whereas using the CPU backend produces correct output. The problem appears to be specific to Vulkan on Intel hardware, and disabling 16-bit float support in Vulkan (GGML_VK_DISABLE_F16) seems to mitigate the issue, though the root cause remains unclear and may be related to a driver or hardware-specific bug.
- The comments confirm the issue is distinct and reproducible on Intel GPUs, with tests passing outside the model context, suggesting a subtle Vulkan or driver problem. Multiple users report the same behavior on different models, and disabling Vulkan FP16 support is a common workaround, but no definitive fix or minimal test case has been identified yet.
- Number of comments this week: 8
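The Metal optimization referenced in the first issue above is conceptually simple: with unified KV cache slots, many attention blocks are fully masked out, and a kernel can detect this and skip them before doing any math. The C++ sketch below only illustrates the general idea and is not code from either backend; the block size, mask layout, and function name are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative sketch only (not the Metal or Vulkan implementation):
// walk one row of attention scores in fixed-size blocks and skip blocks
// whose mask is entirely -INF, since they cannot contribute to the softmax.
constexpr std::size_t BLOCK = 32;

float masked_softmax_denominator(const float * scores, const float * mask, std::size_t n_kv) {
    float denom = 0.0f;
    for (std::size_t start = 0; start < n_kv; start += BLOCK) {
        const std::size_t len = std::min(BLOCK, n_kv - start);

        // Fully masked block: every mask entry is -INF, so exp(score + mask) == 0
        // for the whole block and it can be skipped without reading the scores.
        bool all_masked = true;
        for (std::size_t i = 0; i < len && all_masked; ++i) {
            all_masked = std::isinf(mask[start + i]) && mask[start + i] < 0.0f;
        }
        if (all_masked) {
            continue; // the shortcut Metal applies and Vulkan initially lacked
        }
        for (std::size_t i = 0; i < len; ++i) {
            denom += std::exp(scores[start + i] + mask[start + i]);
        }
    }
    return denom;
}
```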
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has remained open for over 588 days, indicating a persistent and unresolved discrepancy between the two Vulkan backends in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger in Xcode.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status for each parallel download.
- kubernetes example: This issue discusses the creation of a Kubernetes Helm chart for deploying the llama.cpp server, aiming to facilitate scalable application deployment within the community. The original poster has begun work on this example but is seeking contributions and assistance to continue development when time permits.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. Specifically, the error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 35
Summarized Issues:
- Vulkan Backend Output and Stability Issues: Multiple issues report that the Vulkan backend produces incorrect, garbled, or repetitive output and causes crashes on various platforms including Linux, Windows, and Intel GPUs. These problems include Vulkan-specific initialization bugs, shared memory overruns, assertion failures, and compatibility issues with certain quantized models, severely affecting model inference reliability.
- [issues/16961, issues/17012, issues/17039, issues/17055, issues/17056, issues/17106]
- CUDA Backend Crashes and Memory Errors: Several issues describe crashes and illegal memory access errors in the CUDA backend during evaluation or specific operations like K-Shift, often linked to particular commits or GPU hardware such as NVIDIA RTX 3090. These errors cause fatal failures and require detailed debugging to resolve memory management and kernel execution problems.
- [issues/16996, issues/17109]
- Performance Regressions and Slowdowns: There are reports of significant performance degradation and system crashes on ROCm/HIP and Vulkan backends, including slow prompt processing and system freezes with Mixture-of-Experts models and unified KV cache changes. These regressions impact large models and specific hardware configurations, necessitating optimizations and fixes.
- [issues/17014, issues/17026, issues/17033, issues/17092]
- Model Output and Decoding Bugs: Issues include outputting strings of question marks, gibberish output with certain device options, and decoding problems on Jetson Orin NX with CUDA, indicating numeric stability or decoding errors affecting text generation quality. These bugs reduce the usability of models under specific configurations.
- [issues/17023, issues/17067]
- llama-server Stability and Crash Issues: The llama-server experiences crashes and unexpected exits on Windows and Docker environments, including assertion failures related to prompt tokens and exit code 139 under load. These stability problems affect server reliability across different platforms and models.
- [issues/17060, issues/17071]
- Web UI Usability and Feature Requests: Several issues request improvements to the llama-server web UI, such as adding scrollable views for attachments, a parsing progress indicator, persistent KV cache storage for faster conversation reloads, and fixing copy functionality bugs in the Svelte WebUI. These enhancements aim to improve user experience and interface responsiveness.
- [issues/17003, issues/17010, issues/17018, issues/17079, issues/17107]
- Quantization and Model Support Requests: Users request support for new model series like LLaDA2.0, multimodal projection with model drafts, and quantization of Stable Audio Open models to GGUF format, highlighting interest in expanding model compatibility and inference efficiency.
- [issues/16973, issues/17050, issues/17066]
- Exception Handling and Code Refactoring: One issue highlights unnecessary exception handling in chat template detection, proposing to replace exception-based control flow with sentinel values to reduce output pollution and improve code clarity (a minimal illustration of this pattern follows this list).
- [issues/16964]
- Multi-GPU Testing and Compatibility: There is a request to add multi-GPU tests to the CI system to detect regressions early, focusing on specific multi-GPU execution paths to improve robustness.
- [issues/16959]
- Build and Compilation Problems: Compilation failures occur on Kubuntu 25.10 with CUDA 12.8 and GCC 14 due to incompatibilities between CUDA math headers and system headers, as well as unsupported GCC versions by nvcc, causing multiple compiler errors.
- [issues/17041]
- Backend Device Support Issues: The OpenCL backend incorrectly drops Rusticl devices on Mesa due to misidentification as unsupported, preventing their use and limiting hardware compatibility.
- [issues/17112]
- GPU Memory and Permission Limitations: A bug causes GPU backend crashes on iOS when the app is minimized due to insufficient background GPU permissions, leading to fatal Metal command buffer errors.
- [issues/16998]
- Cross-GPU Compatibility Crashes: Running models split across GPUs with ECC and non-ECC memory on gfx906 devices causes crashes in the ROCm backend due to memory incompatibilities.
- [issues/17086]
- UI Feature Enhancements for AI Tooling: The tools/server/public_simplechat web client UI was enhanced to support zero-setup builtin tool calls, real-time AI reasoning display, and flexible future extensions, enabling seamless AI task execution in-browser.
- [issues/17040]
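For the exception-handling item above, the proposed direction is a common refactor: report "not found" through an ordinary return value instead of an exception. The snippet below is a generic C++ illustration of that pattern, using hypothetical names and template markers; it is not taken from the llama.cpp code base.

```cpp
#include <optional>
#include <string>

// Hypothetical detector used only to illustrate the refactor direction.
enum class chat_template_kind { CHATML, LLAMA3 };

// Exception-based control flow (declaration only, shown for contrast):
// a failed probe throws, so every caller needs try/catch and failed
// detections pollute logs with error output.
chat_template_kind detect_or_throw(const std::string & tmpl);

// Sentinel-based control flow: a failed probe is an ordinary return value
// (std::nullopt) that callers can branch on without exceptions.
std::optional<chat_template_kind> detect(const std::string & tmpl) {
    if (tmpl.find("<|im_start|>") != std::string::npos) {
        return chat_template_kind::CHATML;
    }
    if (tmpl.find("<|start_header_id|>") != std::string::npos) {
        return chat_template_kind::LLAMA3;
    }
    return std::nullopt; // sentinel: detection failed, nothing thrown or logged
}
```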
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 26
Summarized Issues:
- CUDA and GPU Backend Crashes and Memory Issues: Multiple issues report crashes and memory errors when using CUDA or GPU backends, including illegal memory access errors, assertion failures, and out-of-memory crashes during evaluation or server runs. These problems often relate to specific commits or driver incompatibilities and have been resolved by patches, driver updates, or workarounds such as disabling certain features or flags.
- [issues/16945, issues/16953, issues/16976, issues/16980, issues/16967, issues/16975, issues/17076]
- Model Output Corruption and Garbage Generation: Several models produce corrupted, repetitive, or nonsensical outputs under certain conditions such as specific thread counts, long contexts, or CUDA fusion bugs. These issues have been linked to model-specific problems, backend bugs, or driver issues and were addressed by fixes, kernel reverts, or model investigations.
- [issues/16942, issues/16950, issues/16951, issues/16960, issues/17016]
- Tokenization and Conversion Script Failures: Problems with tokenizer recognition and conversion scripts occur due to missing tokenizer files, inaccessible repositories, or unsupported tokenizer formats, causing errors during model conversion to GGUF format. These issues required updates to conversion scripts or repository access fixes to resolve.
- [issues/16970, issues/16972]
- Web UI and Server Interaction Bugs: The web UI and server exhibit bugs such as incorrect PDF previewing, chat history model name display errors, server errors preventing new conversations, and crashes triggered by specific user actions like clicking "Regenerate." These issues affect user experience and require UI or server-side fixes.
- [issues/16947, issues/16983, issues/16986, issues/17043]
- Performance Regressions and Slowdowns: Performance regressions have been reported due to specific commits causing slower prompt processing, token generation, and increased memory usage, especially on AMD ROCm GPUs and NVIDIA hardware. These regressions were identified and mitigated through patches, context padding, or disabling problematic features.
- [issues/17037, issues/17058, issues/17065]
- Model Loading Failures and Format Support Issues: Some models fail to load due to unsupported GGUF formats or VRAM limitations, with error messages suggesting workarounds like reducing GPU layers. These loading issues prevent usage of certain models until format support or resource management is improved.
- [issues/16955, issues/17105]
- Backend Detection and Compatibility Problems: Running llama.cpp in certain environments like WSL2 or with specific hardware configurations leads to backend detection failures or incompatibilities, such as the CPU backend not being found or Vulkan driver bugs causing endless output. These require environment-specific fixes or driver updates.
- [issues/17102, issues/17013]
- Multimodal and Feature Support Limitations: Features like flash attention or multimodal projector files are disabled or missing due to backend limitations or conversion script issues, impacting model performance and feature availability. Requests have been made to address these limitations in future updates.
- [issues/16950, issues/17015]
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 47
Key Open Pull Requests
1. server/public_simplechat alternate web client ui with 0 setup builtin tool calling++, reasoning - refactored, SysDateTime, rename pdftext: This pull request introduces an alternate zero-setup web client UI for the llama-server at tools/server/public_simplechat; refactors core tooling classes into modular files for better developer and static-tool awareness; adds system date-time stamping to prevent AI-hallucinated timestamps; renames the pdftext tool call for clearer semantics; and enhances tool-calling capabilities, including proxy-based web and PDF fetching with improved UI controls for user verification and asynchronous handling of tool-call responses.
- URL: pull/17038
- Merged: No
- Associated Commits: 733d4, bac82, ddfbd, c8952, 6041f, bcdf4, 76748, ff342, 68341, 966b8, 547e0, beeb8, 07418, a766f, 6dbbc, f669b, 6fa44, c3670, a5bf4, 7bd3e, 75f6e, 05a24, 7b7c5, 672e1, 06a54, 2899c, 6e7c9, 7119b, 0d8ce, 4ab44, d59fa, 5c4ee, 8b63a, 68421, 6cc11, e707d, 67431, df5cf, f8749, 9f234, 42b3f, 29a6d, 8a28b, e4334, d0c39, 84fd5, ada6f, ae62f, 1f685, 32bb2, 80758, 27bb6, dc58a, d829e, 5a369, 4770e, 3e057, d297c, a4d10, bf116, 702c1, 874ae, 00cb9, adff8, cf324, 926d8, 54e03, a2c1d, 724b3, e1e4f, 8f9f0, 750bf, 35f68, b78db, 84708, 7cd77, 8abbf, cbc0e, 218dc, abcf5, 4f165, ac192, 0853c, 2001b, 231fd, 82804, 12248, 913de, 2f388, 1ed31, d24b5, abd82, 8294b, b902c, ea7b7, c61b7, 6328a, 7d3d5, f78f4, d517e, d09e2, f994c, 4f943, f9ef8, 412b6, 9a3ba, ee3c8, 9b5ee, 013e4, 2e1ff, 3d8a2, ce6a7, 5e265, ee56e, 7ee75, b87e3, 797b5, e4c74, 436d1, a20a7, 33537, 86f7f, 7a156, aa69c, 00678, a5d7f, 6bff6, 0407c, 41add, ba304, 5352c, 76246, 7e6a1, f7591, 2bfc0, 8f788, 2e202, d13ba, 66d8b, 3d3e3, 2aa9d, 6fc28, 99239, ca715, cec31, 1e3c0, 1a8b2, 32774, 3328c, 643d4, c77d8, 82928, c13a2, e05fd, 84d30, 1d547, 25635, a95ab, 3665d, e8595, dfc06, a4433, 6a489, 54422, 219de, e584c, e36e7, 8d65c, 4c980, 07153, 7bcdb, 2441f, e70b8, 88bb9, 16f2e, f8503, 7463b, 4c339, e8f0e, c7a22, 57874, d9b30, 90999, 3be01, cff87, 6d742, 827d9, bd407, 0dc5c, 0e95e, d1a6c, 78ff9, da8c1, ed042, a798d, 46016, a7db3, f1800, ad229, 366c1, 5b9e8, e9741, ab5e9, ebe89, d2544, 58d63, 45eac, 10829, 64a0f, 8e2c3, fe244, 34fe9, babfb, 58a59, 9b9b9, c39af, 7105f, 87a76, 72193, 3ff68, 2618d, 20610, 0a28f, a6593, 3894f, cc12d, 32429, 2c995, f7897, 802fc, b7ec6, 04c8c, 4938d, cb5ca, 42a6b, 323d0, b7cb9, 624ec, 91d71, 4d394, d9b17, abc1e, 05031, ae1ef, 1b3ba, cdbb9, dcb29, 39618, 8d139, 6f526
2. Mamba2 SSD: This pull request is a draft implementation of the Structured State Space Duality (SSD) approach from the Mamba2 paper. It introduces new primitive operations and an alternate multi-token update path in the ggml library that reframe the inefficient recurrent SSM_SCAN operation as a chunked pseudo-attention mechanism, enabling parallelization over sequence chunks (a schematic of the duality follows the key pull requests), while noting outstanding issues related to performance, chunking strategy, and handling of repeat operations in certain models.
- URL: pull/16982
- Merged: No
- Associated Commits: 245f3, 638e2, be23a, 71e22, f6d60, 778e8, 42280, 97bd1, ffd88, 7409d, ba740, 29b30, fb689, cd73f, 52be1, f57da, 8a870, 79bce, 1ceb1, ef120, 3da5c, 3336f, d1e15, ee13a, 3b405, 3963a, 188ae, 0441c, aba30, 62ac8, 36244, 5ff37, 8b6f3, 82bba, 7ad0f, 6256f, 204cd, 86788, de43d, 426a9, 6733b, 44356
3. CUDA: add implicit conv3d: This pull request adds an implicit 3D convolution operation to the CUDA backend as a more efficient and memory-saving alternative to the existing IM2COL_3D+GEMM kernel used for video models, leveraging tensor cores to achieve better performance while following the design of the conv2d_implicit implementation.
- URL: pull/16948
- Merged: No
- Associated Commits: ab15f, 52455, 0a64e, e8020, a5b68, 3f5c5, 3308c, 23579, 5aa4a, 91650, f0ced, f9212, 36c0d
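For orientation on the Mamba2 SSD draft (item 2 above), the duality it builds on can be written compactly for a single channel with scalar per-token decay a_t. This is a schematic of the published SSD formulation, not a description of the pull request's new ggml primitives:

```latex
\[
h_t = a_t\, h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t
\;\Longrightarrow\;
y_t = \sum_{s \le t} \Bigl(\prod_{r=s+1}^{t} a_r\Bigr)\, C_t^{\top} B_s \, x_s ,
\]
\[
\text{i.e. } Y = M X \quad\text{with}\quad
M_{ts} =
\begin{cases}
\bigl(\prod_{r=s+1}^{t} a_r\bigr)\, C_t^{\top} B_s , & s \le t, \\
0, & s > t .
\end{cases}
\]
```

Because M is lower triangular with this multiplicative structure, the sequence can be split into chunks: within-chunk terms form a small masked, attention-like matrix product, and cross-chunk terms are carried through a per-chunk state, which is what allows parallelization over sequence chunks instead of a strictly sequential scan.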
Other Open Pull Requests
- GPU Sampling Support: This pull request adds support for performing sampling operations directly on the GPU within the computation graph, enabling partial or full GPU-based token sampling, filtering of logits, and probability distribution computations to improve efficiency by minimizing data transfer between device and host memory. It includes initial integrations for llama-cli and llama-server, configurable GPU sampler chains per sequence or slot, and accompanying tests and usage examples.
- Backend Test Fixes and Enhancements: Multiple pull requests improve backend testing by fixing failures in test-backend-ops for the ggml-hexagon backend through corrected index calculations, and by adding verbosity flags to the test-backend-ops testing binary and the llama-eval-callback tool to enable higher precision printing and detailed tensor outputs. These changes enhance test reliability and output clarity without being directly related to the SSD algorithm implementation.
- New Operations for Hybrid Models: This pull request adds necessary operations such as SOFTPLUS, EXPM1, TRI, SOLVE_TRI, and CUMSUM to support new hybrid models like Qwen3 Next and Kimi Linear (the standard definitions are sketched after this list). These additions serve as prerequisites for merging related pull requests that extend model capabilities.
- Flash Attention for SYCL Backend: A basic implementation of Flash Attention is introduced for the SYCL backend, enabling more efficient attention computation on Intel GPUs. This includes a Flash Attention kernel, a forward pass with block-wise computation, and integration with existing SYCL infrastructure supporting F32 data types.
- Compressed Tensors Quantization Enhancements: Support is added for multiple formats in the compressed-tensors quantization method, including pack-quantized (symmetric and asymmetric), int-quantized, float-quantized with channel and block strategies, and naive-quantized models. Additionally, a bug related to metadata handling in lazy tensors is fixed to ensure correct broadcast shift shapes during unpacking.
- Compressed Tensor Repacking for Kimi-K2: This pull request demonstrates repacking the compressed_tensor format specifically for the kimi-k2 model by mapping int4 data to GGML's Q4_0 format with conversion of the original BF16 scale to F16. It requires deletion or renaming of the "quantization_config" section in config.json and may break models other than kimi-k2.
- Circular Tiling in Conv2D: Circular tiling support is introduced to convolution and padding operations in conv2d for the Vulkan, CUDA, and CPU backends. New functions treat input data as if wrapped on a torus to enable seamless texture generation without breaking existing code (see the index-wrapping sketch after this list).
- Hybrid-Recurrent Model Context Shift Fix: This pull request modifies the llama_memory_recurrent::seq_rm function to reject only partial erasures that include the final token, allowing safe no-op partial erasures that do not affect the final token. This enables successful continuation of generation when the context limit is reached.
- Megrez-MoE Architecture Support: Comprehensive support for the Megrez-MoE (Mixture of Experts) architecture is added, including architecture registration, MoE-specific feed-forward network implementation with sigmoid gating and top-k expert selection, model loading enhancements, and memory management fixes. These enable stable warmup and inference on Megrez2-3x7B and similar models.
- BoringSSL Integration via CMake: A CMake option is introduced to fetch, build, and statically link the BoringSSL library within the project when the LLAMA_BUILD_BORINGSSL flag is enabled. The version can be specified via the LLAMA_BORINGSSL_VERSION cache variable.
- Fast Division and Modulo in OpenCL Backend: A fast division and modulo implementation, ported from the CUDA backend, is added to the OpenCL backend for use in the set_rows function. This aims to maintain current performance while enabling future optimizations across more operations.
- Embedding CLI End-to-End Tests: A new GitHub Actions CI workflow adds end-to-end tests for the embedding CLI using small cached GGUF models to verify output dimensions and deterministic behavior. This establishes a fast and reproducible baseline for embedding validation.
- Web UI Assistant Message Enhancements: The web UI gains a "Continue" action for assistant messages, allowing users to extend responses seamlessly by clicking an arrow button. A "Save" feature for user message edits is also added to preserve conversations without regenerating responses.
- Bicubic Interpolation Backend Support: Backend support for bicubic interpolation in the GGML_OP_UPSCALE operation is added for CUDA and Vulkan, while the OpenCL backend disables bicubic support to prevent test failures.
- 3D Tensor Batch Support in CPU Repack: Basic support for handling 3D tensors with multiple batches is added in the ggml CPU repack matrix multiplication function. This addresses incorrect results in models like LFM2 with layers having two batches by implementing a naive chunking approach.
- Profile Guided Speculative Decoding Draft: A draft implementation of "Profile Guided Speculative Decoding" uses empirically measured batch cost profiles to guide speculative drafting of token sequences based on expected value. It includes support for look-ahead parameters to improve prediction accuracy and decoding efficiency.
- Vulkan MMQ Out-of-Bounds Fixes: Out-of-bounds read errors in the Vulkan MMQ implementation are fixed by correcting indexing and buffer size issues in matrix multiplication functions. Legacy code cleanup and attempts to resolve a regression in the mul_mat_id quantization call are also included.
- Improved iGPU Memory Reporting: The issue of inaccurate iGPU memory reporting is addressed by utilizing all device-local heaps for memory availability and accommodating devices with split heaps, such as Windows iGPUs with more than 64GB VRAM dedicated by BIOS, when creating buffers.
- Dangling Pointer Fix in Sampler Grammar: A dangling pointer issue related to non-empty trigger words in lazy grammar construction is fixed by initially adding the static keyword and later replacing it with a method that keeps the variable in scope longer, improving memory safety (see the lifetime sketch after this list).
- MMF CUDA Parameter Extension: The MMF_ROWS_PER_BLOCK parameter in the mmf CUDA implementation is extended to values greater than the warp size, while retaining the original value due to lack of performance tuning and shared memory limitations observed on an RTX 3080 when set to 128.
- API Remoting Backend Proposal: A new backend for the GGML API called API Remoting is proposed, enabling escaping VM isolation through virt-gpu paravirtualization and the virglrenderer library. This allows llama.cpp to run in VM containers with near-native performance by forwarding GGML API calls between a remoting frontend and backend.
- RDNA4 Tensor Core Support in MMF: Support for RDNA4 tensor cores is added in the MMF implementation, though performance improvements are lower than expected. Adjustments to padding alignment and observations about compatibility issues with recent changes affecting native MMF execution on RDNA4 hardware are included.
- Host-Memory Cache Prompt Restoration Fix: Failures when restoring an old prompt from the host-memory cache in the server are addressed by gracefully handling cases with insufficient context memory, ensuring the prompt is reprocessed from scratch instead.
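For the hybrid-model operations item above, the names map onto standard numerical primitives. The definitions below are the textbook forms and are meant only as orientation; ggml's exact naming, parameters, and broadcasting behavior may differ.

```latex
\[
\mathrm{softplus}(x) = \log\!\bigl(1 + e^{x}\bigr), \qquad
\mathrm{expm1}(x) = e^{x} - 1, \qquad
\mathrm{cumsum}(x)_i = \sum_{j \le i} x_j .
\]
```

TRI builds or applies a triangular (lower/upper) mask over a matrix, and SOLVE_TRI solves a triangular linear system L·y = b by substitution; such operations typically appear when linear-attention recurrences are expressed in block form, as in the hybrid models named above.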
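For the circular-tiling item, the core of "wrapping on a torus" is index arithmetic: out-of-range coordinates are folded back with a positive modulo in each dimension. The following is a minimal CPU-side C++ sketch of that wrapping; the function names and layout are illustrative, not the pull request's API.

```cpp
#include <cstdint>
#include <vector>

// Wrap an index onto [0, n) so that -1 maps to n-1 and n maps to 0,
// i.e. the data behaves as if it lived on a torus.
static inline int64_t wrap_index(int64_t i, int64_t n) {
    return ((i % n) + n) % n;
}

// Sample a 2D single-channel image with circular (toroidal) padding.
// Row-major layout; the sampling API is illustrative only.
float sample_circular(const std::vector<float> & img, int64_t w, int64_t h,
                      int64_t x, int64_t y) {
    const int64_t xw = wrap_index(x, w);
    const int64_t yw = wrap_index(y, h);
    return img[static_cast<std::size_t>(yw * w + xw)];
}
```

A convolution that samples its input through such a wrapper produces seamless, tileable output, which is why the feature targets texture generation.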
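The dangling-pointer item describes a classic lifetime bug: non-owning pointers into trigger-word strings that outlive the strings themselves. The sketch below reproduces the pattern generically; the types, names, and trigger words are illustrative and not the actual sampler API.

```cpp
#include <string>
#include <vector>

// Non-owning view of trigger words, as a C-style API might require.
struct trigger_view {
    std::vector<const char *> words;
};

// BUG pattern (intentionally wrong): the owning strings are a local, so every
// stored pointer dangles as soon as this function returns.
trigger_view build_view_dangling() {
    std::vector<std::string> owned = {"<tool_call>", "<function"};
    trigger_view view;
    for (const auto & w : owned) {
        view.words.push_back(w.c_str()); // points into soon-destroyed strings
    }
    return view; // dangling pointers escape
}

// FIX pattern: keep the owning storage alive at least as long as the pointers
// are used (the same effect the `static` keyword gave, without global state).
struct trigger_config {
    std::vector<std::string> owned; // owns the characters for the config's lifetime

    std::vector<const char *> as_c_strs() const {
        std::vector<const char *> out;
        out.reserve(owned.size());
        for (const auto & w : owned) {
            out.push_back(w.c_str()); // valid while *this is alive and unmodified
        }
        return out;
    }
};
```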
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 57
Key Closed Pull Requests
1. Temperature branch: This pull request introduces temperature management features, including a temperature throttler, thermal checks, GPU frequency adjustments, and a PID controller to monitor and control device temperature and throughput, although it was not merged (a generic PID sketch follows the key pull requests below).
- URL: pull/17097
- Merged: No
- Associated Commits: b867c, 5b5b8, 64d60, e2158, a79a3, 93517, ae845, 41764, b450a, ae0fb, e01da, 28466, 5cd50, 761a9
2. CUDA: update ops.md: This pull request updates the CUDA operations documentation to include the new operations added by the contributor, ensuring the ops.md file accurately reflects recent changes.
- URL: pull/17005
- Merged: Yes
3. Model: Minimax M2 - chat support: This pull request proposes adding chat support to the Minimax M2 model, incorporating tool calling and simple non-interleaved reasoning using a fixed Unsloth template, along with including an upstream fix from the minja project.
- URL: pull/16946
- Merged: No
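The temperature-management pull request above centers on a PID controller. Since the branch was not merged, the following is only a generic PID sketch for throttling work toward a temperature setpoint; the gains, units, and names are made up for illustration and are not the PR's implementation.

```cpp
#include <algorithm>

// Generic PID controller: given the current temperature and a setpoint,
// produce a throttle factor in [0, 1] used to scale work (e.g. batch size
// or GPU frequency). Gains, setpoint, and units are illustrative only.
struct pid_throttler {
    float kp, ki, kd;       // proportional, integral, derivative gains
    float setpoint_c;       // target temperature in degrees Celsius
    float integral   = 0.0f;
    float prev_error = 0.0f;

    // dt_s is the elapsed time since the last update, in seconds (must be > 0).
    float update(float temp_c, float dt_s) {
        const float error = temp_c - setpoint_c;         // positive when too hot
        integral += error * dt_s;
        const float derivative = (error - prev_error) / dt_s;
        prev_error = error;

        const float correction = kp * error + ki * integral + kd * derivative;
        // Map the correction to a throttle factor: 1.0 = full speed, 0.0 = idle.
        return std::clamp(1.0f - correction, 0.0f, 1.0f);
    }
};

// Usage sketch (read_gpu_temp_c() is hypothetical):
//   pid_throttler pid{0.05f, 0.01f, 0.0f, 80.0f};
//   float throttle = pid.update(read_gpu_temp_c(), 1.0f);
```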
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| hanishkvc | 481 | 6 | 1 | 3 |
| ggerganov | 79 | 19 | 2 | 82 |
| ngxson | 73 | 12 | 4 | 58 |
| am17an | 72 | 12 | 1 | 51 |
| CISC | 18 | 6 | 0 | 80 |
| gabe-l-hart | 79 | 6 | 0 | 18 |
| pwilkin | 23 | 2 | 3 | 64 |
| ServeurpersoCom | 45 | 7 | 2 | 33 |
| No author found | 73 | 0 | 0 | 0 |
| jeffbolznv | 22 | 9 | 0 | 40 |