Weekly Project News

Weekly GitHub Report for Llama.cpp: October 13, 2025 - October 20, 2025 (12:06:24)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced performance and user experience. Notable highlights include optimized features and bug fixes that streamline functionality and increase stability.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Feature Request: Add a debug option to display OpenAI-Compatible toolcall chunks in the WebUI: This issue requests the addition of a debug option in the WebUI to display raw OpenAI-Compatible toolcall chunks, allowing users to inspect model behavior through a read-only visualization of these chunks alongside an optional input for custom Harmony-formatted tool documentation. The feature aims to enhance transparency, debugging, and educational value without affecting runtime execution or backend performance, inspired by prior exploratory work on tool calling in the WebUI.

    • The comments express strong support for the idea, sharing examples of existing server-side debugging practices and offering assistance with implementation. A working proposal with JSON examples and demo videos was provided, demonstrating successful integration and practical benefits for debugging and development workflows.
    • Number of comments this week: 6
  2. Compile bug: OpenBLAS / cmake include path detection: This issue describes a compilation problem on Android 15 using Termux where the CMake configuration for OpenBLAS incorrectly detects the include path, resulting in a malformed path that causes the compiler to fail to find the cblas.h header file. The user provides a workaround by manually fixing the paths in the generated build files and discusses attempts to resolve the issue by specifying different CMake flags and verifying the OpenBLAS installation details.

    • The comments include suggestions to use specific compile flags to set the OpenBLAS root directory, which did not resolve the path misdetection. Further discussion reveals that the OpenBLAS package was installed via Termux’s package manager, and the user confirms the presence of cblas.h in the expected include directory, indicating the problem likely lies in how CMake constructs the include path rather than missing files.
    • Number of comments this week: 4
  3. Feature Request: support PaddleOCR-VL: This issue is a feature request to support PaddleOCR-VL, a state-of-the-art and resource-efficient vision-language model designed for document parsing that integrates a dynamic resolution visual encoder with a powerful language model to recognize complex elements across 109 languages. The requester highlights the model’s superior performance, fast inference speed, and practical deployment potential, seeking its integration into the project to enhance OCR capabilities.

    • The comments include multiple users expressing agreement with the feature request, emphasizing the model’s impressive capabilities and efficiency, thereby showing community interest and support for adding PaddleOCR-VL support.
    • Number of comments this week: 4
  4. Eval bug: LoRA inference crashes GGML_ASSERT((int)sched->hash_set.size >= graph->n_nodes + graph->n_leafs) failed: This issue reports a crash occurring during LoRA inference with the Meta-Llama-3.2-3B-Instruct model when using llama.cpp, triggered by a failed assertion related to the GGML scheduler's hash set size. The user notes that running the base model without the LoRA adapter works fine, but including the LoRA model consistently causes the error, and they seek insight into the cause or a resolution.

    • The comments request and provide links to the base model and LoRA adapter files to help reproduce and diagnose the issue, referencing a similar previously reported problem and sharing download instructions and Google Drive links for the relevant files.
    • Number of comments this week: 3
  5. Eval bug: Vulkan llama.cpp > 64GB Graphics card load bug.: This issue describes a problem with loading large language models (LLMs) using Vulkan on Windows, where the user encounters a 64GB VRAM limit despite having a 96GB VRAM setup. The user suspects a bug in the memory loading process, specifically in how RAM is transferred to VRAM in 16GB blocks, causing failures when attempting to load models or contexts exceeding 64GB.

    • The comments suggest troubleshooting steps including providing detailed Vulkan system information and enabling the Flash Attention feature, which has resolved similar issues for others; additionally, using ROCm was mentioned as a potential fix despite its instability with GPU drivers.
    • Number of comments this week: 2

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 567 days and was last updated in early June 2024, indicating ongoing concerns or investigation regarding this backend discrepancy.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, which would help in collecting and analyzing GPU traces across different frameworks.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-conflicting progress status indicators during parallel downloads.
  4. kubernetes example: This issue discusses the creation of a Kubernetes example for deploying the llama.cpp server using a Helm chart, aiming to facilitate scalable application deployment within the community. The original poster has begun work on this example and is seeking contributions and assistance to continue its development.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the Microsoft BitNet model version b1.58-2B-4T-gguf on a Windows system using CUDA backend with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 19

Summarized Issues:

  • Model performance and optimization challenges: Several issues highlight difficulties in improving embedding performance, reducing latency, and enabling efficient model sharding and backend coordination in llama.cpp and related systems. These challenges include optimizing memory and threading usage, handling distributed backends, and improving inference stability across different hardware and platforms.
  • [issues/16550]
  • Crashes and assertion failures during inference and evaluation: Multiple reports describe crashes and assertion failures occurring during model inference or evaluation, often related to KV cache handling, scheduler hash set sizes, or kernel compilation issues on various platforms including Windows, Linux, and iOS. These failures disrupt normal operation and are linked to specific model configurations or backend initializations.
  • [issues/16553, issues/16555, issues/16564]
  • Feature requests for model and platform support: There are requests to add support for new models like Ring-1T, Ling-1T, and PaddleOCR-VL, as well as enhancements such as CUDA 12.9 binary packages and HIP kernel support. These requests aim to expand the range of compatible models and improve hardware acceleration capabilities across different architectures.
  • [issues/16567, issues/16570, issues/16583, issues/16627, issues/16631]
  • Memory and hardware limitations: Issues report hardware-related constraints such as a 64GB VRAM limit in Vulkan on Windows and runtime errors on specific devices like Atlas 300I DUO due to null context pointers. These limitations affect the ability to fully utilize available GPU resources and cause program aborts during execution.
  • [issues/16575, issues/16628]
  • Build and compilation problems: Several issues describe build failures caused by incorrect detection of vector instruction support on RISC-V, CMake path duplication on Android 15, and OpenCL context reference counting bugs. These problems prevent successful compilation or cause runtime errors, requiring manual fixes or improved detection logic.
  • [issues/16593, issues/16612, issues/16615]
  • User interface and debugging enhancements: A request was made to add a debug option in the WebUI for displaying raw OpenAI-compatible toolcall chunks and injecting custom tool documentation, facilitating transparent inspection of model behavior without execution risk.
  • [issues/16597]
  • Output consistency and formatting bugs: Bugs were reported where generated token IDs do not match re-tokenized text, causing KV cache invalidation and inefficiencies, and where the granite docling model produces malformed outputs through certain endpoints due to missing formatting steps. These issues degrade output quality and performance, especially on mobile devices.
  • [issues/16601, issues/16632]
  • Functionality concerns in example applications: There is concern that the example Android app supports only completion mode for large models and lacks chat functionality, which users consider important.
  • [issues/16556]

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 16

Summarized Issues:

  • Model compatibility and loading errors: Several issues report failures to load or convert specific model architectures such as Qwen3-VL-8B-Instruct and Qwen3VLForConditionalGeneration due to lack of support or outdated Transformers library versions. These problems prevent proper model usage and require updates or fixes to support new model types.
  • [issues/16596, issues/16605]
  • Server crashes and runtime errors: The llama-server experiences crashes and runtime errors in scenarios including restoring context checkpoints from image inputs and GPU backend initialization failures inside Docker containers. These errors disrupt normal operation and indicate underlying bugs in checkpoint handling and backend compatibility.
  • [issues/16590, issues/16617]
  • GPU and backend initialization issues: Multiple issues describe GPU-related failures such as ROCm invalid device function errors on AMD GPUs, Vulkan backend logging limitations, and memory errors on Mac Studio M4 Max after memory pool removal. These problems highlight challenges in GPU target configuration, driver identification, and memory management.
  • [issues/16524, issues/16637, issues/16646]
  • API and UI behavior inconsistencies: Problems with API endpoint handling (trailing slash causing 404 errors) and UI features like LaTeX rendering and conversation persistence were reported. These issues affect user experience by causing unexpected errors or loss of functionality, prompting requests for improved handling and feature additions.
  • [issues/16525, issues/16598, issues/16604]
  • Model embedding and conversion bugs: Incorrect embeddings generated by llama-server due to missing dense modules and conversion script failures mapping specific tensor names indicate issues in model compatibility and tooling. These bugs lead to inaccurate results and failed model conversions.
  • [issues/16538, issues/16566]
  • Multi-GPU and offloading problems: The --n-cpu-moe option causes improper tensor offloading by sending all tensors to a single GPU instead of distributing them, requiring complex manual configurations to work correctly. This limits efficient multi-GPU usage and complicates deployment.
  • [issues/16579]
  • False antivirus detections: Microsoft antivirus falsely flags several executable files from a Windows CUDA installation as malware, causing installation and scanning issues. This false positive can hinder user trust and software deployment on affected systems.
  • [issues/16527]
  • Feature requests for multi-modal support: There is a request to add multi-modal capabilities for the Qwen3-VL-30B-A3B-Thinking model, indicating user interest in expanding model functionality to handle diverse input types.
  • [issues/16582]
  • Parameter truncation bug in serving: Integer parameters defined with zod4 are truncated incorrectly when serving models, causing tool calls to receive wrong inputs, while older zod versions or other serving methods do not exhibit this issue. This bug affects parameter integrity during model serving.
  • [issues/16622]

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 37

Key Open Pull Requests

1. extend server/public_simplechat with simple minded interactive browser-client side based toolcalling - base logic: This pull request extends the server/public_simplechat web UI for llama.cpp with a basic interactive tool-calling feature that lets the browser client execute simple tools, such as a calculator and JavaScript functions, inside a web worker context. Users can verify and trigger AI-requested tool calls with editable arguments and responses, giving the chat interface exploratory and practical tool integration capabilities.

  • URL: pull/16563
  • Merged: No
  • Associated Commits: fa23e, 75ce9, 68fc2, 85845, bbaae, f0915, 9d8be, 2e469, 27161, 788d5, 4cbe1, 174b0, 10b10, e4e29, d7f61, 2a276, 92b82, d8b1b, 7a2bc, f10ab, a1f17, 4ac6f, 37963, 0ed83, aa81f, 5ed2b, 619d6, 226aa, 2aabc, 90b24, a8ead, 70bc1, f8ebe, cc606, 7ea9b, 46647, 5933b, 50be1, 44cfe, 6f137, dbf05, 39c1c, e9a78, 340ae, 0629f, bb25a, ae00c, e1e1d, 3b73b, aa80b, 75550, 7fb55, 69cbc, f379e, 4efa2, a644c, 61c23, 75a63, 99b04, 7dc99, 5d764, 788a9, c155b, 1e1bb, 9b1ed, 1b361, 9ec3d, 620da, bdd05, 5ac60, ac448, 2ca2b, 04a5f, 29589, 46945, 696d5, 2e6a3, 0a3b4, 61b93, 378bd, e7277, 9db2f, c5a60, 6efbd, 8f070, 514ce, e1243, 91099, 4ffed

2. ggml: CUMSUM and TRI (CPU, Metal, CUDA): This pull request extends the CPU implementations of the CUMSUM and TRI operations to support Metal and CUDA backends, adds type support for F16 and BF16, and aims to facilitate improvements in the DELTA_NET op for Qwen3-Next as well as optimize the State Space Duality form of SSM_SCAN for faster prefill, while inviting feedback on kernel performance and potential optimizations.

  • URL: pull/16623
  • Merged: No
  • Associated Commits: 245f3, 638e2, 1f02d, 00f11, 2744d, ab3f3, 8c23c, 2a2e7, 092f7, 6949c, f8fba, 05816, 86ce3, 3a895, cbaed, e5964, 3011a, 112d3, 78e13, e5587, 0468b, c71e3, 5f0d2, 42658, ba3b8, dfae9, d1f86, 5071f

3. Add experimental ggml-hexagon backend for the Hexagon NPU: This pull request introduces an experimental ggml-hexagon backend for the Hexagon NPU, adding support for multiple Hexagon versions and Snapdragon-based Android devices, implementing core LLM operations with various data types, providing minimal build dependencies, and including initial optimizations and tooling for early testing and feedback within the llama.cpp/ggml community.

  • URL: pull/16547
  • Merged: No
  • Associated Commits: c75da, dc767, 4894e, 8c5e5, 036c9, 28e08, eccfd, eeac0, f5bd5, f5d88, 017d9, 0642f, 55ebd, b690a, 73097, beb50

Other Open Pull Requests

  • Vision model integration: This pull request adds support for the GLM-4.5V vision model to llama.cpp by implementing the multimodal architecture Glm4vMoeForConditionalGeneration, which combines a GLM-4.5-Air-based language model with a ViT-based vision adapter. It enables processing of image and text inputs with specialized rotary positional embeddings and dynamic 2D positional embedding adaptation, excluding video input support in this initial integration.
    • pull/16600
  • CUDA performance improvements and fusion: These pull requests introduce CUDA support for fusing mmvq and mmvf operations with optional gate and SWIGLU activation, resulting in 4-9% performance gains on quantized models and smaller gains on floating-point models. Additionally, CUDA graph plan APIs are implemented to optimize graph execution, improving hybrid inference performance by up to 30% through refined graph update and reuse logic.
    • pull/16630, pull/16548
  • Windows non-ASCII path handling fixes: Multiple pull requests fix issues related to handling non-ASCII file paths on Windows by improving conversion of command-line arguments and file paths to UTF-8. These changes enable correct loading of .mmproj, image files, and models with non-ASCII paths, and also address missing console initialization in mtmd-cli to resolve related path parsing issues.
    • pull/16609, pull/16611
  • User interface and experience enhancements: These pull requests improve the user experience by relocating import/export conversation buttons to the Settings Dialog with a dedicated UI for selecting conversations, and by adding an OpenAI-compatible model selector in the WebUI sidebar that persists selections and integrates with chat request payloads. They also add a purely visual and diagnostic enhancement to the web UI for streaming and displaying OpenAI-compatible Harmony tool-call deltas with new Svelte components and configurable toggles.
    • pull/16619, pull/16562, pull/16618
  • SYCL backend unary operator support: This pull request adds support for the unary operators FLOOR, CEIL, ROUND, and TRUNC in the SYCL backend, including implementation, header updates, registration, documentation, and test coverage, following their prior addition to the CPU backend.
    • pull/16613
  • Metal API and kernel stability improvements: These pull requests introduce initial support for the Metal4 tensor API by reworking matrix-matrix multiplication to use the Tensor API when available, and fix a crash caused by missing Metal function constant values by adding a resilient wrapper and hardening pipeline compilation on iOS.
    • pull/16634, pull/16639
  • MMQ shader refactoring and Vulkan integer dot support: This pull request refactors the MMQ shader caching structure for modularity and efficiency by copying quant structs through shared memory into registers before reshaping them for integer dot operations. It also adds support for Vulkan MMQ integer dot with Q2_K quantization while addressing performance challenges in mapping quant structures.
    • pull/16536
  • Embedding output format addition: This pull request introduces a new --embd-output-format raw option that outputs embeddings as plain space-separated floats without JSON formatting or prefixes, facilitating easier integration with downstream vector processing tools while keeping existing formats unchanged.
    • pull/16541
  • CUDA kernel optimization for gpt-oss model: This pull request adds an optional parameter to the CUDA topk-moe kernel for the gpt-oss model, enabling softmax after top-k selection, which results in measurable performance improvements on an NVIDIA 4090 GPU.
    • pull/16649
  • GitHub action removal proposal: This pull request proposes removing the close-issue.yml GitHub action, which the author criticizes as rude automation, stating that they refuse to contribute to projects that include such actions except to eliminate them.
    • pull/16535
  • Granite chat initializer and parser compatibility: This pull request adds a defensive mechanism to dynamically detect and support both <tool_call> and <|tool_call|> sentinel formats in IBM Granite Jinja templates, ensuring compatibility without affecting existing Hermes 2 Pro or Qwen templates and maintaining stable cross-model behavior.
    • pull/16537
  • Quantization test threshold adjustment: This pull request increases the NMSE error threshold for q5_1 quantization MUL_MAT tests from 5e-4 to 7e-4 to reduce intermittent CI failures caused by slightly higher numerical errors in CUDA Release mode, while preserving stricter thresholds for other quantization types.
    • pull/16544
  • Embedding output log suppression fix: This pull request fixes the issue where the --log-disable flag was incorrectly suppressing embedding outputs, enabling llama-embedding to produce clean outputs without extraneous messages for easier script integration.
    • pull/16561
  • JinaCLIP v2 and GGUF format support: This pull request introduces the JinaCLIP v2 vision projector and adds GGUF format support for the jina-bert-v3 model, enabling conversion of merged-LoRA and adapter-based checkpoints and providing a cross-platform CLI tool for validating text and image embeddings with performance and correctness checks.
    • pull/16574
  • Windows file path character conversion fix: This pull request addresses an unsafe character conversion issue in ggml_fopen on Windows by replacing a simple cast with a proper UTF-8 to wide character conversion using a helper function, improving robustness in handling file paths (a hedged sketch of this pattern appears after this list).
    • pull/16589
  • AIX linking failure fix: This pull request fixes linking failures of llama-imatrix and llama-run on AIX systems by manually adding the -lbsd flag to resolve the missing flock() symbol, as AIX provides flock() via libbsd.a.
    • pull/16610
  • Compiler warning fix for std::array initialization: This pull request fixes build failures caused by strict compiler flags by correcting the initialization of a std::array in llama-batch.h from single to double braces, enabling clean compilation under stricter warning settings (illustrated in a sketch after this list).
    • pull/16614
  • Granite Hybrid model addition: This pull request adds Granite Hybrid models by mapping their embedding dimensions to the number of parameters, based on information from the IBM Granite 4.0 model on Hugging Face.
    • pull/16635
  • Vulkan GPU matrix multiplication optimization: This pull request proposes increasing the block size parameter BK to 32 and modifying the non-CM mul_mm.comp kernel to use BK/4, adding additional vec2 loads for cache_a and cache_b, aiming to improve matrix multiplication performance on Vulkan-supported GPUs with benchmark results on NVIDIA and AMD hardware.
    • pull/16636
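
The Windows path fixes above share a common pattern: convert the UTF-8 path to UTF-16 and use the wide-character file APIs rather than casting narrow bytes. The sketch below illustrates that pattern only; the helper name fopen_utf8 is a placeholder and is not the actual code merged into ggml_fopen.

    #ifdef _WIN32
    #include <windows.h>
    #include <cstdio>
    #include <vector>

    // Open a file whose path is UTF-8 encoded by converting it to UTF-16
    // and calling _wfopen, instead of casting the narrow bytes directly
    // (which silently breaks non-ASCII paths).
    static FILE * fopen_utf8(const char * path, const wchar_t * mode) {
        // First call: query the required buffer length in wide characters
        // (includes the terminating null because cbMultiByte is -1).
        int len = MultiByteToWideChar(CP_UTF8, 0, path, -1, nullptr, 0);
        if (len <= 0) {
            return nullptr;
        }
        std::vector<wchar_t> wpath(len);
        // Second call: perform the actual conversion.
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath.data(), len);
        return _wfopen(wpath.data(), mode);
    }
    #endif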
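
The std::array brace change above reflects a small but recurring C++ detail: std::array is an aggregate that wraps a built-in array, so single-brace member initializers can trigger -Wmissing-braces on some compilers under strict warning flags. A minimal sketch follows; the struct and field names are illustrative, not the actual llama-batch.h members.

    #include <array>

    struct batch_info {
        // Single braces may emit -Wmissing-braces with -Wall/-Werror on some
        // compilers, because the braces for the nested C array are elided.
        std::array<int, 4> seq = {0, 1, 2, 3};
        // Double braces are unambiguous: the outer pair initializes the
        // std::array aggregate, the inner pair its wrapped C array.
        std::array<int, 4> pos = {{0, 1, 2, 3}};
    };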

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 42

Key Closed Pull Requests

1. Add CONV_TRANSPOSE_2D for Metal: This pull request adds a Metal-based implementation of the CONV_TRANSPOSE_2D operation to the ggml library in the llama.cpp project, including updates to headers, device code, type fixes, additional tests, optimizations for threading, and dynamic memory allocation.

  • URL: pull/16542
  • Merged: Yes
  • Associated Commits: 1dc94, 09661, 2f77e, a190a, aa4b2, 86f6d, 73124, 93bea, 2f1ed, 7b6f6, 9f3e1

2. Metal: add opt_step_adamw and op_sum: This pull request adds support for the OPT_STEP_ADAMW operator to the Metal backend by implementing a new kernel and argument struct, and also introduces an initial, unoptimized Metal implementation of the GGML_OP_SUM operator to address test failures.

  • URL: pull/16529
  • Merged: Yes
  • Associated Commits: 101b8, 61cb2, 45a2d, 8c8cd, bf8be, 9b957, e850a, 32a43, 01e8f

3. CUDA: Changing the CUDA scheduling strategy to spin: This pull request changes the CUDA scheduling strategy to cudaDeviceScheduleSpin for devices with compute capability 12.1 to eliminate latency caused by multiple cudaStreamSynchronize calls, thereby restoring performance impacted by disabling host buffers.

  • URL: pull/16585
  • Merged: Yes
  • Associated Commits: 02fc2, a33e3, 4a707, f7ada, 14652

Other Closed Pull Requests

  • Metal backend optimizations and fixes: Multiple pull requests improve the Metal backend by optimizing operations such as GGML_OP_SUM with multi-threaded SIMD partial sums, adding support for the OPT_STEP_SGD operator with new kernels, replacing unreliable gpuAddress usage with atomic counters, and enabling builds on the osx_arm64 platform. These changes collectively enhance performance, compatibility, and platform support for Metal.
    • pull/16559, pull/16539, pull/16576, pull/16565
  • Flash Attention (FA) and cacheless embeddings support: Several pull requests add and improve support for cacheless embeddings with Flash Attention and iterative sliding-window attention (SWA) across different backends, including the graph module, Metal, and Vulkan. These enable FA with F32 key/value tensors and improve functionality while maintaining the necessary casts for cacheless contexts.
    • pull/16528, pull/16531, pull/16543
  • CUDA backend enhancements and bug fixes: Pull requests introduce fast integer division and ggml_cuda_mad for matrix multiplication vector fusion, fix numerical issues in CUDA tile FA kernel using FP32 arithmetic, enable CUDA FlashAttention support for FP32 key-value cache, and fix bugs related to fusion in CUDA implementations. These changes improve speed, correctness, and feature support on NVIDIA GPUs.
    • pull/16557, pull/16540, pull/16546, pull/16545, pull/16577
  • Performance improvements via vectorization and build optimizations: One pull request proposes vectorizing row conversion functions using F16C, AVX2, and AVX512F intrinsics to boost x86 performance, while another improves MSVC build times by enabling parallel custom build steps and concurrent compilation. These changes target faster execution and development workflows.
    • pull/16587, pull/16545
  • Security vulnerability fixes in benchmarking scripts: Two pull requests address high-severity SQL injection vulnerabilities in scripts/compare-llama-bench.py by replacing unsafe formatted SQL queries and string concatenations with parameterized queries and prepared statements using SQLAlchemy's TextualSQL. These ensure safer database interactions.
    • pull/16571, pull/16572
  • Web UI improvements and LaTeX rendering fixes: One pull request reorganizes the web UI settings layout and cleans up unused variables to improve code quality, while another adds normalization converting MathJax delimiters to KaTeX-compatible dollar signs for proper inline and block LaTeX rendering in the WebUI. These enhance user experience and interface clarity.
    • pull/16607, pull/16599
  • Memory management and reporting enhancements: Pull requests improve memory usage by fixing a CPU-side memory leak in the CANN backend, adding dynamic token limits for prompt cache based on memory constraints, and enhancing RPC functionality to report actual free memory on devices for detailed memory breakdowns. These changes optimize resource handling and diagnostics.
    • pull/16549, pull/16560, pull/16616
  • Kernel and operator bug fixes: Several pull requests fix bugs, including a scalar-path norm computation error introduced earlier, build failures under strict compiler flags (resolved by replacing putenv with setenv; see the sketch after this list), and a bug in the CUDA and OpenCL implementations related to accessing rms_norm->src during fusion. These fixes improve stability and correctness.
    • pull/16558, pull/16573, pull/16554
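
For the putenv-to-setenv change noted above, the sketch below shows why the swap satisfies strict compiler flags. The environment variable name is made up for illustration; setenv is POSIX, so MSVC builds would typically use _putenv_s instead.

    #include <cstdlib>

    void enable_example_flag() {
        // putenv() takes a mutable char* and retains the pointer it is given,
        // so passing a string literal is rejected or warned about under strict
        // flags:
        //     putenv("EXAMPLE_DEBUG=1");   // char* from a string literal
        //
        // setenv() accepts const char* and copies its arguments, so the same
        // intent compiles cleanly under -Wall -Werror:
        setenv("EXAMPLE_DEBUG", "1", /*overwrite=*/1);
    }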

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. Update close-issue.yml
    • Toxicity Score: 0.75 (Rapid escalation, explicit language, confrontational tone)
    • This GitHub conversation involves a user expressing strong negative sentiment towards a GitHub action, using explicit language to describe it as rude and demanding its removal, which likely triggered defensive or critical responses from other contributors. The tone is confrontational and dismissive, with the original poster indicating a refusal to contribute further unless the issue is addressed. This sets a tense atmosphere that may provoke further conflict or defensive replies.

IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor         Commits   Pull Requests   Issues   Comments
ggerganov           137       22              3        56
allozaur            77        6               1        30
ServeurpersoCom     59        11              4        36
hanishkvc           89        1               0        0
CISC                34        4               0        35
jeffbolznv          25        10              0        32
taronaeo            39        2               1        9
gabe-l-hart         41        2               0        7
danbev              40        4               0        5
JohannesGaessler    12        6               0        26
