Weekly GitHub Report for Llama.cpp: August 04, 2025 - August 11, 2025 (22:42:38)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced performance and user experience. Notable highlights include optimized features and bug fixes that streamline functionality and increase stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
Misc. bug: gpt-oss-20b perplexity broken: This issue reports that the perplexity metric for the gpt-oss-20b model is abnormally high when evaluated using llama.cpp, indicating broken or nonsensical probability outputs that do not align with expected model behavior. The user suspects a fundamental problem with the model weights or their conversion to the gguf format, as this issue persists across different quantization versions and is linked to generation instability and flawed token probability distributions.
- Commenters discuss that high perplexity can be expected in instruction-tuned models but emphasize that the extreme values here are unusual and likely indicate a real problem. Some users confirm similar instability and broken perplexity results on llama.cpp, while others note that generation quality can vary with parameters and that perplexity comparisons across platforms are not straightforward. There is a consensus that the issue may stem from model weight conversion or internal normalization problems, and requests are made for testing on other runtimes like transformers or vLLM to isolate the cause. (A brief sketch of how perplexity is computed from token probabilities appears after this list.)
- Number of comments this week: 17
-
Eval bug: GPT-OSS-120B: Vulkan backend fails to allocate KV cache with OOM error, despite enough free memory: This issue reports a Vulkan backend memory allocation failure when loading the GPT-OSS-120B model with llama.cpp, where the Vulkan driver returns an out-of-memory error despite ample available memory. The problem appears specific to this model and is related to Vulkan's per-allocation size limits, with users finding that reducing context size or enabling flash attention can mitigate the issue, though running without flash attention still triggers allocation failures.
- Commenters confirmed the issue occurs across multiple AMD Vulkan drivers and shared detailed logs showing large memory allocation attempts failing due to Vulkan's 2-4GB allocation limits. Various troubleshooting steps were discussed, including reducing context size and enabling flash attention, which helped bypass the problem; however, enabling flash attention caused crashes for some users. The discussion concluded that the Vulkan API's allocation limits are a fundamental constraint, and while flash attention is a practical workaround, the underlying large allocation request might still represent a bug or inefficiency in the model loading process. (A back-of-the-envelope KV-cache size calculation appears after this list.)
- Number of comments this week: 15
-
Eval bug: Tools In Prompt Crashing On gpt-oss 20b: This issue reports a crash occurring when using the gpt-oss-20b model with tool calls enabled in the prompt, specifically when running llama-server with the `--jinja` and `--reasoning-format none` flags. The problem is traced to incomplete parsing of tool calls in the GPT-OSS chat format, causing a runtime error when the server processes requests involving tools, and several users have proposed a patch to properly consume the remaining input during parsing to prevent the crash.
- The discussion reveals that removing the `--jinja` flag avoids the crash but disables tool functionality, while tool calls without proper parsing cause the server to terminate unexpectedly. Multiple commenters confirm the issue across platforms, identify the root cause in the parsing function, and share a patch that adds a call to consume the rest of the input, which stops the crashing and enables tool calls to work; a pull request with this fix and additional Harmony parsing improvements is also referenced.
- Number of comments this week: 14
-
Eval bug: gpt-oss reasoning_effort does nothing: This issue reports that the `reasoning_effort` parameter in the gpt-oss model running on llama-server does not have any effect on the system prompt or output, despite being set to values like "high" or "low" via command line or API calls. Users discuss various ways to pass this parameter, revealing that it only works when included inside `chat_template_kwargs` in the API request body or as a CLI argument, while setting `reasoning_effort` directly in the API call does not change the reasoning level.
- Multiple users confirm the problem with `reasoning_effort` not affecting output when set directly in the API call, but some report success when passing it inside `chat_template_kwargs` or via CLI arguments. The discussion includes sharing startup commands, model versions, and build details, with troubleshooting narrowing down the correct usage pattern to enable the parameter’s effect (a request-body sketch of that pattern appears after this list).
- Number of comments this week: 9
-
Eval bug: Crash over tool calls in Qwen3 Coder: This issue reports a crash occurring in the llama-server when the Qwen3 Coder 30B-A3 model attempts to perform tool calls, specifically causing a runtime error related to invalid diffs during parsing. The problem appears linked to the model’s non-JSON-based tool call syntax and parsing limitations, with additional reports of similar errors and regressions affecting other models and tool call functionalities, prompting investigation into template handling and parsing robustness.
- The comments include reproduction attempts confirming the crash and related errors with different versions and models, discussion about the parsing challenges due to Qwen3 Coder’s unique syntax, and observations of similar issues in other models like Gemma3 and Llama Scout; contributors share logs, test results, and note that some errors stem from empty or unexpected system message formats, while others suggest the need for improved template support and possibly separate defect tracking for related models.
- Number of comments this week: 7
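A note on the gpt-oss-20b perplexity report above: perplexity is the exponentiated average negative log-probability the model assigns to the reference tokens, so broken token probability distributions inflate it dramatically. The snippet below is a minimal illustrative sketch of that calculation only, not llama.cpp's `llama-perplexity` tool, and the log-probability values are made-up placeholders.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)); higher means the model is more 'surprised'."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical natural-log probabilities for a short reference text.
healthy = [-2.1, -0.7, -1.3, -0.2, -3.0]     # plausible values -> single-digit perplexity
broken  = [-9.5, -12.0, -8.7, -11.2, -10.4]  # near-zero probabilities -> perplexity in the tens of thousands

print(f"healthy PPL: {perplexity(healthy):.1f}")
print(f"broken  PPL: {perplexity(broken):.1f}")
```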
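On the GPT-OSS-120B Vulkan report above: the failure mode described is a single KV-cache buffer request that exceeds the driver's per-allocation cap (roughly 2-4 GiB per the discussion). A back-of-the-envelope sketch, using placeholder layer and head counts rather than the model's real configuration, shows how quickly one contiguous buffer can blow past such a cap:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V tensor per layer, each n_ctx * n_kv_heads * head_dim elements (f16 = 2 bytes).
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

GiB = 1024 ** 3
# Placeholder shape for illustration; NOT GPT-OSS-120B's actual configuration.
size = kv_cache_bytes(n_layers=36, n_ctx=131072, n_kv_heads=8, head_dim=64)

print(f"single KV-cache buffer: {size / GiB:.1f} GiB")              # ~9.0 GiB for this example
print(f"fits under a 4 GiB per-allocation cap: {size < 4 * GiB}")   # False
```

Reducing the context size shrinks the single request, which matches one of the workarounds reported in the issue.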
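And on the reasoning_effort issue above: the usage pattern the commenters converged on is forwarding the value through `chat_template_kwargs` in the request body rather than as a top-level field. A minimal sketch of such a request, assuming a llama-server instance listening on the default `http://localhost:8080` and its OpenAI-compatible `/v1/chat/completions` endpoint:

```python
import json
import urllib.request

# Per the issue discussion, reasoning_effort only takes effect when forwarded
# to the chat template via chat_template_kwargs, not as a top-level parameter.
payload = {
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "chat_template_kwargs": {"reasoning_effort": "high"},
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed default llama-server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```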
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 498 days and highlights a discrepancy in behavior between different Vulkan backends within the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple files in parallel for sharded models, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in libcurl to properly handle and display the progress status of parallel downloads.
- kubernetes example: This issue is about creating a Kubernetes example for the `llama.cpp` project, specifically proposing the development of a Helm chart to facilitate deploying the server in Kubernetes environments. The original poster has made initial progress on this task and is seeking community contributions to help continue the work.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU and CUDA backend. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loading process to fail with a "failed to read tensor info" message.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 38
Summarized Issues:
- Crashes and Segmentation Faults: Multiple issues report crashes caused by invalid parameters, assertion failures, or segmentation faults in various contexts such as enabling KleidiAI with large thread batch sizes, running finetune.cpp on Linux with CUDA, and using tool calls with GPT OSS 20B models. These crashes often stem from improper handling of inputs, kernel parameters, or sequence IDs, leading to runtime exceptions and termination.
- issues/15079, issues/15090, issues/15102, issues/15215
- GPU Backend and Memory Allocation Issues: Several reports highlight memory allocation failures and out-of-memory errors on AMD GPUs using Vulkan or ROCm backends, including failures to allocate KV caches or large buffers, and problems with device function support or tensor loading across multiple CUDA devices. These issues cause model loading failures, crashes, or inefficient memory usage, often related to driver or backend limitations.
- issues/15105, issues/15106, issues/15107, issues/15120, issues/15125, issues/15139, issues/15128, issues/15196
- Model Loading and Format Compatibility: Users encounter errors loading certain model architectures or multi-part GGUF files, as well as failures converting models from safetensors to GGUF due to unmappable tensor names or unrecognized architectures. These problems prevent successful model initialization and usage in llama.cpp.
- issues/15143, issues/15164, issues/15173
- Quantization and Model Size Anomalies: There are inconsistencies in file sizes resulting from different quantization methods, where smaller quantizations do not always reduce model size and sometimes increase it unexpectedly. This behavior causes confusion about whether it is a bug or related to tensor format handling.
- issues/15117
- Performance and Speed Issues on GPUs: Running large models on certain GPUs, such as Intel Arrow Lake iGPUs, results in extremely slow prefill and token decoding speeds due to inefficient matrix multiplication on hardware lacking matrix core support. This leads to poor overall performance despite functioning execution.
- issues/15163
- Prompt Caching and Tool Description Inefficiencies: The `--cache-reuse` option fails to cache prompt prefixes properly, causing repeated prompt processing, and the tool description caching is ineffective, leading to long computation times and delayed time-to-first-token on limited CPU resources. These inefficiencies degrade user experience and system responsiveness.
- issues/15082, issues/15166
- Feature Requests for Tooling and Backend Support: Users request new features such as a `--log-file` parameter for `llama-bench`, support for RWKV-style kernels in the ggml-cann backend, Activated LoRA adapter inference support, and multi-device RPC server capabilities to improve usability and performance. These requests aim to extend functionality and optimize workflows.
- issues/15084, issues/15085, issues/15210, issues/15212
- Model Behavior and Output Quality Issues: Problems include abnormally high perplexity values indicating flawed model weights or conversions, discrepancies between local and online model outputs, and irrelevant or unsolicited model responses that frustrate users. These issues affect the reliability and quality of generated content.
- issues/15155, issues/15190, issues/15199
- JSON and Parsing Bugs: The JSON schema enforcement incorrectly requires keys in exact order, causing invalid outputs when models produce keys in natural but different orders. Additionally, parsing errors with tool call outputs cause server crashes due to incomplete handling of chat formats.
- issues/15216, issues/15102
- Command Line and Configuration Limitations: The `llama-bench` module lacks support for the `-dev` parameter, limiting GPU configuration benchmarking, and there is confusion about CUDA-enabled binary package usage and multi-model selection in llama-server. These limitations hinder flexible usage and deployment.
- issues/15089, issues/15204, issues/15168
- Compatibility and Integration Issues: Recent updates to external tools like VSCode Copilot Chat break compatibility with llama.cpp, requiring manual fixes, and there are questions about GPU support and backend compatibility for AMD hardware. These issues complicate integration and usage in diverse environments.
- issues/15167, issues/15106
- Miscellaneous Bugs and Questions: Other issues include the inability of `seq_rm` to handle negative sequence IDs despite documentation claims, runtime errors with tool_call features on specific hardware, and user questions about Vulkan usage to resolve memory allocation failures on mobile devices.
- issues/15104, issues/15205, issues/15139
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 20
Summarized Issues:
- Model Loading and Compatibility Issues: Several issues report failures or errors when loading specific models due to incompatible formats, invalid tensor types, or unknown architectures. These problems affect various models including GPT-OSS 20B on Mac M4 hardware, Ollama's model format on Windows Vulkan backend, and hunyuan-7b-Instruct, causing crashes or load failures until updates or workarounds are applied.
- [issues/15099, issues/15122, issues/15138]
- Vulkan Backend Bugs and Crashes: Multiple issues highlight bugs related to running models on the Vulkan backend, such as infinite output repetition, segmentation faults with flash attention enabled, and device memory allocation errors. These Vulkan-specific problems do not occur on CPU or CUDA backends and often require disabling features or using specific options to avoid crashes.
- [issues/14974, issues/15100, issues/15144]
- Mixture of Experts (MoE) Model Support and GPU Memory Distribution: There are requests and bug reports concerning MoE model support, including adding GLM 4.5 MoE architecture and issues with uneven distribution of MoE layers across GPUs. The `--n-cpu-moe` flag does not balance VRAM usage properly, leading to inefficient memory utilization on multi-GPU systems.
- [issues/14921, issues/15136]
- GPT-OSS Model Support and Related Bugs: Several issues focus on GPT-OSS models, including feature requests for support, runtime crashes with tools enabled, and Vulkan backend crashes with flash attention. These problems affect server stability and model usability, requiring fixes or disabling certain features to maintain operation.
- [issues/15096, issues/15100, issues/15170]
- Performance and GPU Offloading Concerns: Reports include perceived performance degradation after CUDA attention changes, later attributed to thermal throttling, and failures in GPU offloading on NVIDIA RTX Pro 6000 with CUDA 13.0 due to deprecated command line options. These issues impact efficient hardware utilization and require updated configurations.
- [issues/15174, issues/15175]
- ATLAS System Integration and Memory Management: There are comprehensive feature requests to integrate the ATLAS system into llama-server, including API development for configuration, memory persistence, real-time stats, and lifecycle management. These enhancements aim to provide robust memory handling, concurrent client support, and detailed monitoring for improved server functionality.
- [issues/15183, issues/15184]
- Quantization and Model Conversion Issues: Problems are reported with models quantized using certain ternary formats like TQ1_0 failing to run properly, and conversion scripts for OpenAI fine-tuned BF16 models failing due to missing expected tensor types. These issues highlight challenges in supporting diverse quantization schemes and model formats.
- [issues/15146, issues/15193]
- Output Formatting and Reasoning Token Display: Issues include incomplete or missing "think" tags in model output when certain flags are not used, and missing thinking tokens in recent Windows CUDA releases, which were resolved by adjusting command-line flags. These affect the clarity and completeness of generated reasoning content.
- [issues/14679, issues/15159]
- Context Reset and Session Management Feature Request: A request was made to add an explicit context reset feature such as a `/reset` command or HTTP endpoint for llama-cli and llama-server, to allow users to start fresh sessions without reloading the model, improving efficiency for chat applications and long-running servers.
- [issues/14652]
- ROCm and AMD GPU Compatibility Issues: A specific issue reports vector addition failures on AMD Radeon RX 6600 XT GPUs using ROCm due to invalid device function errors, while the same operations succeed with other backends or software, indicating compatibility and configuration challenges with ROCm and GPU architecture overrides.
- [issues/15202]
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- super dumb inclusive model, keep bitching about metal health.. without asking.. lose of time..
- Toxicity Score: 0.75 (Rapid escalation, aggressive language, dismissive tone)
- This GitHub conversation involves a user expressing frustration and dissatisfaction with a response they received, using dismissive and confrontational language. The tone is negative and somewhat aggressive, with the user showing impatience and irritation. There are no other participants or replies to moderate the sentiment or de-escalate the tension.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 31
Key Open Pull Requests
1. webui: prettify styling: This pull request aims to enhance the llama.cpp WebUI by reorganizing and styling various interface elements—including reworking the ChatInput layout, updating the favicon to a llama emoji, synchronizing header button styles, relocating the model name to the header, positioning server information at the bottom, improving the greeting message, adjusting the settings dialog width, fixing sidebar and dropdown visibility issues on specific devices, and adding theme previews in the theme selection dropdown—to create a more cohesive and user-friendly chat UI experience.
- URL: pull/15201
- Merged: No
- Associated Commits: f0752, 2141d, 16313, 1f618, 9e293, 6522f, ca97d, 55cbf, 2f4ff, e066e, 181ef, 859e1, 52577, eb76a, 25dc1, 2c624, c0528, d3a95
2. CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n: This pull request optimizes the CUDA `reduce_rows_f32` kernel by implementing manual loop unrolling and dynamically adjusting thread counts based on workload size to significantly reduce memory latency, resulting in up to 25x kernel-level performance improvement and an overall 10% speedup for the Gemma3n model on NVIDIA GPUs.
- URL: pull/15132
- Merged: No
3. `tool-call`: Qwen3 Coder chat format support: This pull request introduces support for the Qwen3 Coder chat format by implementing a constrained tool call syntax that enforces JSON-stringified arguments within XML-like parameter tags to address the unique and non-JSON parameter formatting used by Qwen3 Coder.
- URL: pull/15162
- Merged: No
Other Open Pull Requests
- Harmony and GPT-OSS Parsing Enhancements: Multiple pull requests focus on improving parsing capabilities for GPT-OSS and related chat formats, including support for reasoning format options, tool parsing with grammar, and handling commentary preambles. These changes add extensive tests, fix crashes, and improve structured tool call extraction, while also addressing partial message parsing and multi-part reasoning tag handling in streaming mode.
- Reasoning and Tool Parsing in Llama Models: Several pull requests introduce reasoning and tool parsing capabilities for Llama 3.x Nemotron and GLM-4.5 models by adding new chat formats and parsers for XML-like tool call tags. These updates fix issues with generic parsing, token streaming during reasoning, and ensure proper detection and handling of tool calls, enabling enhanced model interaction features.
- ARM Architecture and SVE Kernel Support: Multiple pull requests add and optimize support for Scalable Vector Extensions (SVE) kernels on ARM architecture, including implementations for f16 data type vector functions and exponential functions. These changes improve performance in image encoding, activation, and softmax computations, achieving significant speedups without accuracy loss.
- Server Endpoint and API Enhancements: Several pull requests enhance server functionality by adding support for multimodal data prompts, echo log probabilities, performance tuning parameters, and a minimal /api/version endpoint for compatibility with external tools. These improvements include updates to input tokenizers, tests, documentation, and parameter inheritance behavior.
- Backend and Kernel Fixes and Improvements: Various pull requests address backend issues such as CUDA/HIP kernel failures, SYCL backend configuration fixes, and CPU backend support for conv3d operations. These fixes improve kernel efficiency, resolve compilation warnings, and add baseline implementations verified against PyTorch for tasks like text-to-video inference.
- Mixture of Experts (MoE) Configuration and CLI Parameters: Pull requests introduce new command-line arguments and parameters to control the number and selection of Mixture of Experts components, as well as flexible tensor buffer type overrides and CPU offloading configurations for draft models. These changes enable experimental control and tuning of expert usage and speculative decoding workflows.
- Memory and Cache Debugging Improvements: Updates include fixes for prompt cache regressions causing coredumps, corrections to sequence ID usage in memory functions, and enhanced debug logging for kv-cache streams. These changes improve stability and observability in memory and caching subsystems.
- Large Tensor Transfer and RPC Enhancements: One pull request implements chunked send and receive operations in the ggml-rpc module to split large tensor transfers into manageable pieces, preventing errors on macOS and enabling successful offloading of large models without aborts. (A generic sketch of this chunking pattern appears after this list.)
- Build and Platform Support Updates: A pull request introduces instructions and configuration changes to enable building the project with Vulkan support on Raspbian OS targeting ARM architecture, including detailed CMake commands and environment detection for Vulkan and CPU backend compatibility.
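On the chunked RPC transfers above: the underlying pattern is simply splitting one oversized payload into bounded-size sends and reassembling it on the receiving end. A generic sketch of that pattern (plain Python sockets, not the actual ggml-rpc code; the 1 GiB chunk size is an arbitrary example):

```python
CHUNK_SIZE = 1 << 30  # 1 GiB per send; an arbitrary example limit

def send_in_chunks(sock, data: bytes, chunk_size: int = CHUNK_SIZE) -> None:
    """Send a large payload as a sequence of bounded-size writes."""
    for offset in range(0, len(data), chunk_size):
        sock.sendall(data[offset:offset + chunk_size])

def recv_exact(sock, total_size: int, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Reassemble exactly total_size bytes from the stream."""
    parts = []
    remaining = total_size
    while remaining > 0:
        part = sock.recv(min(chunk_size, remaining))
        if not part:
            raise ConnectionError("socket closed before the full tensor arrived")
        parts.append(part)
        remaining -= len(part)
    return b"".join(parts)
```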
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 44
Key Closed Pull Requests
1. llama : add gpt-oss: This pull request adds support for the gpt-oss model in the native MXFP4 format to llama.cpp, including a compute graph implementation, attention sinks support across CUDA, Vulkan, Metal, and CPU backends, and a new MXFP4 data type and the `ggml_add_id` operator in ggml, enabling usage of the gpt-oss model collection with improved performance and compatibility. (A rough sketch of MXFP4 block dequantization appears after this list.)
- URL: pull/15091
- Merged: 2025-08-05T19:10:36Z
- Associated Commits: e2c1b, 4431c, fe9b8, 539c2, 039a6, aa240, 32a65, 13f35, e59b2, 832dc, c6806, 423b1, 4dd47, ebc7d, a543d, 6b303, 44bdb, 65b53, 61979, 3c472, 04cfb, 4cf69, ec95c, 3ef6c, cd514, 98c4b, fcb23, 98f34, df841, 60ab0, 256fe, cd8ed, a3b29, 1ea37, 07d78, b236c, d9d89, 81991, 917f9, a4ab8, 3801c, 13f39, b3594, bd571, 089a7, 4d01b, f271c, 106b1
2. model: Add support for GLM 4.5 family of models (#14921): This pull request adds comprehensive support for the newly released GLM 4.5 family of models to llama.cpp, including architecture registration with MoE components, multi-variant model loading, expert weight handling with merged 3D tensor formats, HuggingFace conversion support, and implementation of a new graph class for expert routing, thereby enabling efficient loading and inference of both the 47-layer Air and 93-layer full GLM-4.5 models.
- URL: pull/14939
- Merged: 2025-08-04T18:29:25Z
- Associated Commits: c7550, 0edf7, 6b478, 96528, 07bb0, fae4d, 03fad, b61fc, 999c0, 5baa6, 62447, 58898, ab318, 6f3d9, b25f4, c90f6, bdfe0, 3d15c, dbfad
3. ggml: WebGPU disable SET_ROWS for now: This pull request temporarily disables the use of SET_ROWS in the WebGPU backend of the ggml project to ensure continuous integration tests pass, as the WebGPU backend does not yet support SET_ROWS, with plans to add support in a future update.
- URL: pull/15078
- Merged: 2025-08-05T23:26:38Z
- Associated Commits: 4c587, ae8ed, 75eb9, bfc69, 69965, c773e, 5aeab, d4af0, 320f6, f4229, 0feec, 0512d, 9335a, fc9e9, 7d980, f7745, 4dc40, 3b81c
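For the MXFP4 format mentioned in the gpt-oss pull request above: microscaling FP4 packs elements into blocks of 4-bit E2M1 values that share a single power-of-two scale per block. The sketch below illustrates that general idea only; it follows the public microscaling format description, not ggml's actual MXFP4 data layout, and the block size, scale encoding, and bias used here are assumptions for illustration.

```python
# 4-bit E2M1 code -> magnitude lookup (sign bit + {0, 0.5, 1, 1.5, 2, 3, 4, 6}).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequantize_mxfp4_block(scale_exponent: int, codes: list[int]) -> list[float]:
    """Dequantize one block: each 4-bit code times a shared power-of-two scale.

    scale_exponent: assumed E8M0-style biased exponent byte (bias 127).
    codes: 4-bit values; the low 3 bits index E2M1_VALUES, the high bit is the sign.
    """
    scale = 2.0 ** (scale_exponent - 127)
    out = []
    for code in codes:
        magnitude = E2M1_VALUES[code & 0x7]
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * magnitude * scale)
    return out

# Example: a block of 8 codes with scale 2**-2 = 0.25 (real MX blocks hold 32 elements).
print(dequantize_mxfp4_block(125, [0x1, 0x9, 0x7, 0xF, 0x2, 0x0, 0x4, 0xC]))
# -> [0.125, -0.125, 1.5, -1.5, 0.25, 0.0, 0.5, -0.5]
```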
Other Closed Pull Requests
- GLM-4.5 Mixture-of-Experts (MoE) support: Multiple pull requests enhance support for GLM-4.5 MoE models by implementing model loading, tensor mapping, and integration of MoE-specific components, as well as adding options to control MoE weights placement on CPU for optimization. These changes enable local usage of GLM-4.5 models with their unique architecture and routing mechanisms while simplifying optimization strategies.
- WebGPU backend improvements: Pull requests improve the WebGPU backend by introducing a pool of parameter buffers for concurrent operations, batching command submissions, and adding basic support for the SET_ROWS operation with error handling and debug buffer support. These updates enhance maintainability, performance, and shader development capabilities within the WebGPU context.
- Benchmarking and profiling fixes: Enhancements include extending the server-bench.py script to benchmark external OAI-compatible servers with SQLite result saving, and fixing profiling crashes and synchronization issues in OpenCL implementations to improve stability on specific devices like Adreno 830. These fixes ensure reliable benchmarking and profiling workflows.
- Model and format support updates: Several pull requests add support for new models such as Granite chat template and internlm/Intern-S1, fix model conversion for non-mxfp4 Hugging Face models, and introduce Python-based dequantization/quantization for MXFP4 format with tests. These updates expand model compatibility and improve quantization tooling.
- OpenCL and CUDA backend enhancements: Pull requests add support for new operations like `swiglu_oai` and `add_id` in OpenCL, fix field naming in ggml_backend, and introduce CUDA GEMM kernels optimized for various data types and flash attention improvements for older GPUs. These changes improve backend performance and compatibility across hardware.
- pull/15121, pull/15131, pull/15178, pull/15195
- Chat template and token handling fixes: Updates fix server crashes caused by unconsumed end tokens in GPT-OSS chat parser and modify chat template handling to correctly manage BOS/EOS tokens based on flags, resolving issues with the GLM-4.5 model's special BOS token. These fixes improve chat parsing robustness and template correctness.
- Build and packaging flexibility: A new CMake option GGML_BACKEND_DIR is introduced to allow specifying custom backend directories for dynamic loading, providing flexibility to conform to different packaging conventions. This facilitates easier integration and distribution of backends.
- User experience and UI fixes: Fixes include correcting markdown table display in the web UI for proper rendering across themes and adding warnings when saving imatrix files with unexpected suffixes to prevent confusion. These changes enhance usability and clarity for users.
- Huawei Ascend ACL graph mode support: Optional support is added for executing ggml computational graphs using Huawei's ACL graph mode on Ascend devices, enabled via a compile-time flag with fallback mechanisms and logging to improve performance in repetitive graph executions.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ggerganov | 63 | 5 | 1 | 44 |
CISC | 36 | 6 | 0 | 58 |
taronaeo | 92 | 1 | 1 | 1 |
JohannesGaessler | 33 | 5 | 0 | 36 |
wine99 | 55 | 1 | 0 | 0 |
slaren | 18 | 2 | 0 | 35 |
0cc4m | 9 | 1 | 0 | 37 |
dbsanfte | 46 | 1 | 0 | 0 |
jeffbolznv | 22 | 6 | 0 | 17 |
compilade | 20 | 5 | 0 | 11 |