Weekly GitHub Report for Llama.cpp: August 18, 2025 - August 25, 2025 (12:04:30)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance functionality and improve user experience, reflecting a continued focus on performance optimization and feature expansion. Notable highlights include streamlined workflows and upgraded system stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Feature Request: Support for NVidia Nemotron Nano v2: This issue requests adding support for the NVidia Nemotron Nano v2 model to the project, highlighting its novel hybrid Mamba-transformer architecture that could offer state-of-the-art performance for its size. The discussion focuses on the technical challenges of implementing this support, including differences in layer configurations compared to existing models, handling custom model code, and debugging issues with token generation and model conversion.
- Commenters shared initial attempts to convert and run the model, noting that the current architecture requires a new model class due to unique layer types and no RoPE usage. Progress includes partial implementation with token generation issues, detailed debugging efforts, and plans to refine cache handling and merge contributions, with multiple participants offering support and collaboration despite some initial hurdles.
- Number of comments this week: 16
- Compile bug: Parameter type truncation warning under Clang at `(uint32_t)abs(o.val_i64)`: This issue reports a compile-time warning under Clang when casting the absolute value of a 64-bit integer to a 32-bit unsigned integer using the C-style `abs` function, which may cause truncation. The user suspects that the warning arises because the global `abs` function does not have an overload for 64-bit integers, and commenters discuss the differences between the C and C++ standards regarding `abs` overloads, the namespaces they reside in, and the impact of compiler and standard library versions on this behavior (a minimal code illustration follows after this list).
- The comments focus on clarifying that the 64-bit overload of `abs` is standardized only in the `std::` namespace since C++11, not necessarily in the global namespace, and that older or certain Clang versions might lack this overload, causing the warning. Participants share compiler and OS version details, note that recent Clang versions do not emit this warning, and suggest that the issue likely stems from the standard library version rather than the compiler itself, with references to related issues and pull requests.
- Number of comments this week: 11
- Feature Request: The script convert_hf_to_gguf.py supports conversion of DeepSeek-R1-0528-FP4.: This issue requests adding support to the script convert_hf_to_gguf.py for converting DeepSeek-R1-0528-FP4 safetensor models into the GGUF format, enabling local inference of this advanced FP4-quantized model within the llama.cpp ecosystem. The user highlights the importance of this feature for efficient hardware utilization and broader community adoption, noting that current conversion attempts fail due to architectural and quantization format differences not yet handled by the script.
- The comments discuss the technical challenges and distinctions between FP4 quantization formats, emphasizing that GGUF is not just a container but a quantization spec, and that native NVFP4 support is lacking. Several contributors debate the value of converting existing low-precision quant models versus using native GGUF quantizations, while others share progress on patches to enable conversion and suggest collaboration and related issues that could simplify implementation.
- Number of comments this week: 9
- Misc. bug: prompt processing progress fraction is incorrect with cached prompts: This issue reports a bug where the prompt processing progress fraction displayed by the llama-server is incorrect when cached prompts are involved, as the progress calculation includes tokens already present in the KV cache rather than only the new tokens being processed. The user suggests that the progress indicator should reflect the processing of new tokens exclusively, since cached tokens do not contribute to actual progress, and commenters discuss whether to rename the metric or adjust the display logic to differentiate between overall progress and evaluation progress (a small sketch of the two progress definitions follows after this list).
- The comments clarify that the current progress metric misleadingly includes cached tokens, which do not represent real processing progress. Suggestions include renaming the metric to reflect new tokens only, fixing the progress calculation to exclude cached tokens, and potentially displaying two separate progress indicators: one including cached tokens and one excluding them, with some discussion on whether both should be shown during streaming responses.
- Number of comments this week: 6
- Eval bug: HIP: After compiling the code using HIP, when the length of the output token exceeds u_batch, GGGGGGGG will be output.: This issue reports a bug encountered when compiling code using the HIP backend, where the output produces a string of repeated "G" characters if the length of the output token exceeds the batch size parameter `u_batch`. The user provides detailed environment information including hardware, software versions, and compilation commands, and seeks assistance in resolving this unexpected output behavior with the PHI4 mini model on an AMD Radeon 880M GPU.
- The commenters requested more detailed reproduction steps and configuration details, which the user provided, including the exact model, compilation commands, and environment setup. Suggestions were made to try specific compile-time flags and code modifications related to CUDA compatibility, but some were deemed irrelevant due to the model not being quantized. The discussion ended with a recommendation to modify certain CUDA-related macros to potentially address the issue, acknowledging that this might impact performance.
- Number of comments this week: 6
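As background for the Clang warning discussed in the second issue above, here is a minimal, self-contained sketch (not code from llama.cpp) contrasting the C-style call with the explicit `std::` overload; the variable name `val_i64` only mirrors the expression in the issue title.

```cpp
#include <cstdint>
#include <cstdlib>   // declares the std::abs overloads for long/long long (C++11)

int main() {
    int64_t val_i64 = -123456789012345LL;

    // Problematic pattern from the issue: if only the C global ::abs(int) is visible,
    // the 64-bit argument is implicitly narrowed to int before the call, which Clang
    // can report as a parameter type truncation.
    // uint32_t truncated = (uint32_t)abs(val_i64);

    // Calling the std:: overload selects the 64-bit version; the final narrowing to
    // uint32_t is then a single explicit, intentional cast.
    uint32_t narrowed = (uint32_t)std::abs(val_i64);

    return narrowed != 0 ? 0 : 1;
}
```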
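The cached-prompt progress issue above boils down to which token counts enter the reported fraction. The sketch below is hypothetical: the struct and field names are illustrative and are not llama-server's actual code.

```cpp
// Hypothetical sketch of the two progress definitions discussed in the issue;
// the struct and field names are illustrative, not llama-server's actual code.
struct prompt_progress {
    int n_prompt; // total prompt tokens
    int n_cached; // tokens already present in the KV cache
    int n_done;   // tokens accounted for so far (cached tokens included)
};

// Current behavior: cached tokens count toward progress, so the bar jumps ahead.
static float progress_total(const prompt_progress & p) {
    return p.n_prompt > 0 ? (float) p.n_done / (float) p.n_prompt : 1.0f;
}

// Proposed behavior: report progress only over tokens that actually get evaluated.
static float progress_eval(const prompt_progress & p) {
    const int n_new = p.n_prompt - p.n_cached;
    return n_new > 0 ? (float) (p.n_done - p.n_cached) / (float) n_new : 1.0f;
}
```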
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 511 days and highlights a discrepancy in behavior between the two Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, which would help in collecting and analyzing GPU traces across different frameworks.
- common: download from URL, improve parallel download progress status: This issue addresses a problem with the parallel downloading of sharded model files, where the progress indicators for each file conflict and do not display correctly. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status updates during simultaneous downloads (see the libcurl sketch after this list).
- kubernetes example: This issue is about creating a Kubernetes example for the `llama.cpp` project, specifically proposing the development of a Helm chart to facilitate deploying the server in Kubernetes environments. The original poster has begun work on this example and is seeking community contributions to help continue its development.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the Microsoft Bitnet B1.58-2B-4T GGUF model on a Windows system using CUDA with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
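For background on the CURLOPT_NOPROGRESS item above, the following single-transfer sketch uses the standard libcurl progress hooks; it does not reproduce llama.cpp's parallel-download logic, and the URL is a placeholder.

```cpp
#include <cstdio>
#include <curl/curl.h>

// Standard libcurl transfer-info callback: called periodically with byte counts.
static int xferinfo_cb(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t ultotal, curl_off_t ulnow) {
    (void) clientp; (void) ultotal; (void) ulnow;
    if (dltotal > 0) {
        // In a multi-download setup each handle needs its own output slot to avoid
        // the overlapping progress lines described in the issue.
        std::fprintf(stderr, "\rdownloaded %lld/%lld bytes",
                     (long long) dlnow, (long long) dltotal);
    }
    return 0; // returning non-zero would abort the transfer
}

int main() {
    CURL * h = curl_easy_init();
    if (!h) return 1;
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/model.gguf"); // placeholder URL
    curl_easy_setopt(h, CURLOPT_NOPROGRESS, 0L);               // enable progress reporting
    curl_easy_setopt(h, CURLOPT_XFERINFOFUNCTION, xferinfo_cb);
    const CURLcode rc = curl_easy_perform(h);
    curl_easy_cleanup(h);
    return rc == CURLE_OK ? 0 : 1;
}
```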
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 34
Summarized Issues:
- Finetuning and Training Features: The finetune feature currently lacks options for specifying learning rate or number of epochs and does not support saving progress for later use, limiting flexibility in training workflows. There is also a request for robust checkpointing and training resumption features to enable step-level saving/loading of model weights and optimizer state, addressing challenges in optimizer tensor loading and seeking community input on API design.
- [issues/15376, issues/15442]
- Command-Line Interface Enhancements: Users request new CLI flags to force MMQ on/off at runtime for easier benchmarking without rebuilding, and a feature to navigate backward through request history using the up arrow key, improving usability and convenience.
- [issues/15378, issues/15502]
- Server and CLI Crashes on Input: The llama-server crashes immediately upon receiving user input, and llama-cli segfaults with user input on Linux ROCm AMD GPUs. Similarly, llama-cli segfaults on Windows after loading Vulkan or RPC backends regardless of model or GPU, indicating critical stability issues across platforms.
- [issues/15382, issues/15384]
- Performance and GPU Utilization Issues: Significant slowdowns occur in image processing steps on Jetson Orin NX with CUDA backend, and enabling tool calls reduces inference performance and GPU utilization on Windows 11 with Qwen3 30B Coder model. Additionally, dual-node RPC setups cap GPU utilization at ~50%, unlike single-node runs achieving full utilization.
- [issues/15387, issues/15389, issues/15463]
- Memory Access and Backend Crashes: Illegal memory access errors cause crashes during perplexity evaluation on GTX 1660 Ti CUDA backend and during ROCm backend execution on AMD hardware with HIP backend. A segmentation fault also occurs on first inference with multi-GPU CUDA setup on Linux.
- [issues/15390, issues/15470, issues/15519]
- Model and Backend Bugs: Bugs include a thinking-enabled model failing to use the `/apply-template` endpoint due to inverted logic, incorrect inference results with the zDNN backend when `LLAMA_SET_ROWS` is enabled, and repeated "GGGGGGGG" output tokens when exceeding batch size on the HIP backend with an AMD Radeon 880M GPU.
- [issues/15401, issues/15414, issues/15465]
- Embedding and Conversion Support Regressions: The /v1/embeddings endpoint stops working after version 5630 despite embeddings being enabled, indicating a regression. There is also a request to add support for converting DeepSeek-R1-0528-FP4 safetensor models using the NVFP4 quantization format, which is not currently supported by GGUF.
- [issues/15406, issues/15415]
- Model Support Requests: Requests include adding support for NVIDIA Nemotron Nano v2 model with hybrid architecture, DeepSeek V3.1 model with thinking mode, ERNIE-4.5-VL multimodal model with superior OCR, and Grok-2 large language model comparable to Qwen3-235B-A22B.
- [issues/15409, issues/15496, issues/15512, issues/15534]
- UI and Progress Reporting Issues: The web UI download feature only downloads prompt text instead of full results, and the prompt processing progress fraction does not correctly account for tokens reused from the KV cache, causing inaccurate progress display.
- [issues/15430, issues/15432]
- Shader and Vulkan Backend Crashes: Shader compilation crashes occur on macOS with Vulkan backend and MoltenVK 1.4.0 due to reinterpret_cast errors in multi_add.comp shader, causing pipeline creation failures and segmentation faults with large models like Qwen3-30B-A3B-Instruct-2507-GGUF.
- [issues/15498]
- Backend Test Inconsistencies: The COUNT_EQUAL backend tests intermittently fail due to inconsistent tie-breaking in the ARGMAX function across the CPU and Vulkan backends, suggesting the need for explicit tie-breaking logic to ensure consistent results (a sketch follows after this list).
- [issues/15484]
- Documentation and Example Requests: There is a request for a clear example demonstrating usage of the mtmd C-API, particularly for images, to help users better understand its application.
- [issues/15492]
- Long Prompt and Model Stability Issues: Deepseek models stall on very long prompts (~24k tokens) despite success with shorter prompts, and the GPT-OSS 120B model produces gibberish or gets stuck indefinitely on very long prompts (~10k-15k tokens) using the HIP backend on AMD Radeon PRO W7900 GPUs.
- [issues/15514, issues/15516, issues/15517]
- Feature Requests for Accuracy and Filtering: A feature called "DeepConf" is requested to use model-internal confidence signals to dynamically filter low-quality reasoning traces during or after generation, aiming to improve accuracy and reduce token usage without extra training or tuning (a conceptual sketch follows after this list).
- [issues/15518]
- Vision Model Output Issues: The InternVL3 vision model run with `--mmproj` produces distorted or incorrect outputs on images containing text, failing to properly extract or interpret text, unlike other models and platforms.
- [issues/15528]
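Regarding the ARGMAX tie-breaking item above, here is a minimal illustration of one deterministic rule (keep the lowest index among equal maxima); it is not ggml's kernel code, only a sketch of the kind of tie-breaking the issue asks for.

```cpp
#include <cstddef>
#include <vector>

// Illustrative tie-breaking rule: strict '>' keeps the earliest index when several
// elements share the maximum value, so every backend that follows the same rule
// returns the same argmax for identical inputs.
static std::size_t argmax_lowest_index(const std::vector<float> & row) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < row.size(); ++i) {
        if (row[i] > row[best]) {
            best = i;
        }
    }
    return best;
}
```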
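The "DeepConf" request above is described only conceptually; the snippet below sketches one plausible confidence signal (mean token log-probability with a cutoff) purely for illustration, not the proposal's actual algorithm.

```cpp
#include <numeric>
#include <vector>

// Score a generated trace by its mean token log-probability and drop it if the
// average falls below a threshold. Both the signal and the threshold are
// assumptions for illustration; the issue leaves the exact mechanism open.
static bool keep_trace(const std::vector<float> & token_logprobs, float min_avg_logprob) {
    if (token_logprobs.empty()) {
        return false;
    }
    const float sum = std::accumulate(token_logprobs.begin(), token_logprobs.end(), 0.0f);
    return sum / (float) token_logprobs.size() >= min_avg_logprob;
}
```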
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 24
Summarized Issues:
- Concurrency and Task Queueing Issues: The llama-server module experiences a bug where concurrent embedding requests cause some tasks to become indefinitely stuck in a deferred queue due to the way deferred tasks are re-queued, resulting in requests hanging until new requests stop. A proposed patch changes the task re-queuing order to fix this behavior (a simplified illustration follows after this list).
- issues/15008
- Model Output Discrepancies and Formatting Bugs: Several issues report discrepancies in model outputs and formatting, including the local gpt-oss-20b-F16 model producing inaccurate responses compared to the website, the `--jinja` parameter causing the server to ignore the requested `response_format`, and the `llama-cli` chat template diff logic causing incomplete tokenization due to token inconsistencies. These problems affect the correctness and usability of generated outputs.
- issues/15190, issues/15276, issues/15417
- Compilation and Build Failures: Multiple build-related issues occur, including Vulkan backend compilation failures caused by shaderc's mishandling of bfloat16 support, build failures due to invalid pip options in the CUDA Dockerfile, and Vulkan shader compilation errors on Linux caused by spaces in file paths. These prevent successful builds and require workarounds or fixes.
- issues/15344, issues/15356, issues/15482
- CUDA Backend and Runtime Errors: Several CUDA-related problems are reported, such as matrix multiplication runtime errors on RTX 3090 due to incorrect CUDA architecture flags, crashes during long-prompt decoding on hybrid CPU+GPU setups linked to recent CUDA changes, and a feature request to restore a CUDA backend-only optimization to improve performance on less powerful hardware. These issues impact stability and performance on CUDA devices.
- issues/15407, issues/15452, issues/15481
- Model and Feature Support Limitations: Some models and features are either unsupported or requested, including the Gemma 3 270M model failing for multimodal inputs due to being text-only, a feature request to add DeepSeek-3.1-Base model support, and a request to add Bytedance Seed-OSS 36B Instruct model support based on its promising benchmarks. These highlight gaps in model compatibility and expansion desires.
- issues/15377, issues/15438, issues/15483
- Performance and Speed Issues: Performance problems include slow image processing on Apple M2 Macs using the Metal backend due to missing kernel implementations, and the gpt-oss 20b model serving API being over 50% slower than its GUI counterpart. These issues degrade user experience and efficiency.
- issues/15426, issues/15478
- API and Streaming Compatibility Problems: The llama-server streaming API includes usage statistics in the final chunk with non-empty choices, conflicting with the OpenAI Streaming API specification that requires usage data in a separate final chunk with empty choices. This causes compatibility issues with downstream tools expecting the standard format (the expected chunk shapes are sketched after this list).
- issues/15443
- Memory and Cache Management Bugs: Issues include decoding failures on CPU due to insufficient KV cache size resolved by enabling unified KV cache, and a memory allocation failure causing a crash when fine-tuning the SmolLM2 360M model on a CPU-only system. These bugs affect resource management and model training stability.
- issues/15445, issues/15532
- Vulkan Backend Stability and Compatibility Issues: The Vulkan backend faces stability problems such as test failures on AMD GPUs due to floating point precision differences, and compute pipeline creation failures causing segmentation faults on macOS. These issues limit Vulkan backend usability across platforms.
- issues/15491, issues/15497
- User Interface and Help Command Deficiencies: The `--help` command in llama-server does not display options for controlling the `top_n_sigma` sampler, making it unclear how to configure this feature via the CLI. Additionally, the llama-server WebUI crashes after streaming a response due to an undefined "delta" property in the response object, causing browser errors.
- issues/15423, issues/15461
- Miscellaneous and Deleted Issues: One issue related to the function llama_decode was deleted by the author after acknowledging a mistake, indicating no actionable bug or feature request remains.
- issues/15459
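To illustrate the deferred-task re-queuing problem noted at the top of this list, here is a hypothetical sketch; the types and the specific fix shown are assumptions, not llama-server's actual implementation.

```cpp
#include <deque>

// Hypothetical illustration of the re-queuing order problem described for deferred
// embedding tasks; the types are invented for this sketch.
struct server_task { int id; };

struct task_queues {
    std::deque<server_task> pending;   // tasks waiting for a free slot
    std::deque<server_task> deferred;  // tasks that could not be scheduled yet

    // If deferred tasks are only reconsidered after newly arriving work, a steady
    // stream of new requests can starve them. Draining the deferred queue to the
    // front of the pending queue (order preserved) lets the stuck tasks run again.
    void requeue_deferred_first() {
        while (!deferred.empty()) {
            pending.push_front(deferred.back());
            deferred.pop_back();
        }
    }
};
```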
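For reference on the streaming-format mismatch above, the strings below sketch the tail of an OpenAI-compatible stream as the specification expects it (usage in a separate final chunk with an empty choices array); the token counts are placeholders.

```cpp
// Sketch of the expected tail of an OpenAI-compatible SSE stream: the last content
// chunk carries finish_reason, and usage then arrives in a separate final chunk
// whose "choices" array is empty. Values shown are placeholders.
static const char * final_content_chunk =
    R"({"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]})";
static const char * usage_only_chunk =
    R"({"choices":[],"usage":{"prompt_tokens":10,"completion_tokens":20,"total_tokens":30}})";
```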
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 25
Key Open Pull Requests
1. Deepseek V3.1 thinking mode is the default: This pull request enables DeepSeek V3.1 thinking mode as the default behavior in the project, introducing support for its unique reasoning content parsing format and allowing it to be disabled with the `--reasoning-budget 0` option.
- URL: pull/15533
- Merged: No
- Associated Commits: 0d969, 3912f, bac6c, fe862, 3d00d, c50d8, 3f319, 79f7c, 0d959, 6223c, 0d372, f0da1, 56f7e, f4f0d
2. Thinking model disabled assistant prefill: This pull request improves the handling of the `enable_thinking` flag by auto-detecting when it should be enabled based on template changes, ensuring explicit overrides from `chat_template_kwargs` are respected, and fixing the logic to correctly reject assistant prefill when `enable_thinking` is active, thereby preventing incompatible states and improving error reporting for invalid inputs.
- URL: pull/15404
- Merged: No
3. mtmd : support Kimi VL model: This pull request adds support for the Kimi VL model and its newer "Thinking" variants with dynamic resolution, incorporating code largely adapted from the LFM2 implementation to enhance compatibility and functionality within the project.
- URL: pull/15458
- Merged: No
Other Open Pull Requests
- Architecture Support Additions: Multiple pull requests add support for new model architectures including the `nemotronh` hybrid model used in Nemotron Nano V2, the InternLM Intern-S1-mini model, and the GroveMoE model combining adjugate and ordinary experts. These additions involve updates to enums, layer mappings, conversion processes, and preliminary implementations requiring further refinement.
- Performance Optimizations in GPU Backends: Several pull requests focus on improving GPU performance by optimizing CUDA and Vulkan kernels, including accelerating MXFP4 vector dot products with `__byte_perm`, applying MUL_MAT_ID subgroup optimizations to Vulkan GPUs, and moving MoE MMQ kernel helper code from host to device to reduce synchronization. These changes result in significant performance improvements across various GPU models and batch sizes.
- KV Cache and Layer Reuse Refactoring: The KV cache implementation is improved by refactoring layer reuse logic to be more generic and maintainable, introducing a callback interface and hyperparameter flag, and removing model-specific special cases. Additionally, LLAMA_SET_ROWS checks are removed following universal adoption of `ggml_set_rows()`, simplifying the codebase.
- Bug Fixes and Compatibility Improvements: Fixes include resolving Vulkan multi_add shader compile failures on MoltenVK by removing problematic qualifiers, addressing a Metal initialization regression causing segmentation faults, and adding conditional compilation for OpenCL 2.0 compatibility. These ensure stable builds and backward compatibility across platforms.
- RoPE and Attention Mask Enhancements: Improvements to the RoPE implementation include caching `sin_repeat` and `cos_repeat` values for better performance and fixing incorrect causal attention masks caused by M-RoPE by appending positions to traditional LLM positions. These changes remove previous workarounds and pass all relevant tests.
- Compute Graph Storage and RPC Enhancements: Server-side storage and reuse of compute graphs are introduced using a fixed-size ring buffer to avoid repeated serialization. New RPC commands enable storing and recomputing graphs by ID, with a workaround associating IDs to `ggml_cgraph` via the `tensor->extra` field (a simplified sketch follows after this list).
- Default Configuration and Usability Improvements: The default configuration is updated to enable FlashAttention and set the maximum number of GPU layers, improving performance and usability for most models and hardware, especially benefiting first-time users working with small models.
- Code Simplification and Readability: Code simplifications include removing the `userdata` parameter from the WebGPU request adapter callback and capturing context directly in the lambda, as well as enabling Conv2D on Apple devices by removing a previous workaround after a MoltenVK bug fix. These changes improve maintainability and leverage upstream fixes.
- Versioning and Testing Improvements: Semantic versioning is introduced to replace the build number system in GGML to better track changes and releases, while test reliability is improved by generating unique input values to prevent backend-dependent failures in argmax tests.
- Conversion Script Fixes: Updates to the `convert_hf_to_gguf.py` script fix tensor name handling for multi-modal and vision projector components, resolving errors encountered when converting Gemma 3 12B+27B models from safetensor to gguf format.
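As a rough picture of the compute-graph reuse described above, here is a hypothetical ring-buffer cache keyed by graph ID; all names, the slot count, and the lookup strategy are assumptions, not the pull request's actual code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct ggml_cgraph;  // treated as opaque in this sketch

// Hypothetical fixed-size cache: the server keeps the last N graphs it received so a
// client can ask for recomputation by ID instead of re-sending the serialized graph.
struct graph_cache {
    struct entry {
        uint64_t      id    = 0;
        ggml_cgraph * graph = nullptr;
    };

    std::array<entry, 16> slots{}; // ring buffer; the oldest entry is overwritten when full
    std::size_t next = 0;

    void store(uint64_t id, ggml_cgraph * g) {
        slots[next].id    = id;
        slots[next].graph = g;
        next = (next + 1) % slots.size();
    }

    ggml_cgraph * find(uint64_t id) const {
        for (const auto & e : slots) {
            if (e.graph != nullptr && e.id == id) {
                return e.graph;
            }
        }
        return nullptr; // cache miss: caller falls back to the fully serialized graph
    }
};
```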
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 82
Key Closed Pull Requests
1. ggml WebGPU: add support for quantization types: This pull request adds support for basic matrix multiplication using various quantization types in the ggml WebGPU backend, includes shader generation preprocessing via a Python script to accommodate WGSL compiler limitations, and implements helper functions for tensor buffer alignment and initialization improvements, while noting some quantization types and backend combinations remain unsupported.
- URL: pull/15440
- Merged: 2025-08-22T18:28:03Z
- Associated Commits: 63551, 831ea, 688b5, 1aa40, c3611, de4da, d76e5, e2380, 2a3b9, 57c26, 7a2ae, 51252, 98550, 6552e, 65beb, 16df2, 10bab, 7a323, d1b0f, d6903, 1fcc4
2. Model: Add support for Seed-OSS: This pull request adds initial support for the Seed-OSS model to the project, including various fixes, chat template additions, and code improvements, aiming to address issue #15483 and enable successful model conversion and generation.
- URL: pull/15490
- Merged: 2025-08-23T13:21:52Z
- Associated Commits: fa8da, 7233e, 1bb4e, 7c2b3, cef9b, 300e5, 81f95, 2a82b, c4d67, 56add, 40a9d, 70e69, c858c, 8f643, f6cc0, a81e4, 4c1d4, 78f5c, 92081, f4988
3. Added RVV CI with `build.yaml` tests: This pull request proposes adding RISC-V Vector (RVV) continuous integration tests using a `build.yaml` workflow, including updates to the CI configuration, toolchain setup with gcc-14, system specifications, and build environment improvements, but it was not merged.
- URL: pull/15388
- Merged: No
- Associated Commits: c24dc, c465e, b5997, 9342a, bb198, b4024, e608c, 58572, 48fc7, d8c92, fd500, 95f4d, fc453, d70b2, 4d21e, 1cc99, c9bb3, 767c0, 482c5
Other Closed Pull Requests
- Benchmarking and Performance Automation: This topic includes pull requests that add comprehensive benchmarking scripts and automate performance testing with specific hardware support, such as GFX906. These PRs streamline the process of updating submodules, building the project, and running benchmarks with user-defined parameters.
[pull/15411]
- Documentation and Developer Guidance: This topic covers the addition of detailed documentation aimed at improving AI coding agents' understanding of the project architecture, build system, testing, formatting, and CI/CD workflows. The documentation enhances developer onboarding and code quality through clear instructions.
[pull/15286]
- SIMD and Hardware-Specific Performance Enhancements: Multiple pull requests add SIMD instruction set support and optimize backend operations for various platforms including s390x, PowerPC, Vulkan GPUs, and CANN backend. These improvements result in significant performance gains in token generation, matrix operations, and normalization across different hardware architectures.
[pull/15486, pull/15385, pull/15335, pull/15360, pull/15355, pull/15379, pull/15408, pull/15281, pull/15393, pull/15380, pull/15419]
- Continuous Integration and Build System Improvements: This topic includes PRs that enhance CI workflows by adding triggers for pull requests, fixing build errors, updating pip commands for compatibility, and optimizing build performance with ccache. These changes ensure smoother and more reliable build and test processes.
[pull/15386, pull/15450, pull/15221, pull/15357]
- Vulkan Backend Enhancements: Several pull requests improve Vulkan backend functionality by adding new operations like `GGML_OP_MEAN`, optimizing kernel performance with loop unrolling and fusion, and improving workgroup sizes and subgroup instructions. These changes boost GPU utilization and maintain compatibility across hardware.
[pull/15393, pull/15355, pull/15408, pull/15281]
- Model Support and Conversion Tools: This topic covers PRs that add or attempt to add support for new models such as DeepSeek-V3.1 and LFM2 family improvements, as well as tools for converting models to the GGUF format. These efforts improve model compatibility and conversion reliability.
[pull/15495, pull/15455]
- Codebase Refactoring and Cleanup: This includes merging similar functions to reduce redundancy, fixing include directories in CMake configurations, and removing unnecessary code elements. These changes streamline the codebase and fix configuration issues.
[pull/15380, pull/15450]
- Bug Fixes and Stability Improvements: This topic includes fixes for division by zero in quantization code, return-type warnings causing CI failures, and disabling confusing default features like context shift in the server. These PRs improve stability and developer experience.
[pull/15357, pull/15221, pull/15416]
- Multimodal and API Enhancements: This topic covers adding support for multimodal data prompts in the server API and clarifying enum usage in the chat API to support multiple reasoning formats. These changes expand functionality and improve API clarity.
[pull/15108, pull/15408]
- Optimization of Specific Operators: PRs in this topic optimize operators such as RMS_NORM and RoPE by caching tensors and reusing computations, resulting in improved throughput and reduced overhead in backend implementations (the reference RMS norm formulation is sketched below for context).
[pull/15419, pull/15335]
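For context on the RMS_NORM optimizations above, this is the standard reference formulation of RMS normalization, not the optimized backend kernels those PRs modify.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference formulation: y[i] = x[i] / sqrt(mean(x^2) + eps). The PRs above optimize
// backend kernels for this operator; this plain loop only states what is computed.
static std::vector<float> rms_norm(const std::vector<float> & x, float eps = 1e-6f) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float scale = 1.0f / std::sqrt((float) (sum_sq / (double) x.size()) + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
    return y;
}
```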
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 70 | 18 | 1 | 66 |
| CISC | 46 | 6 | 0 | 80 |
| wine99 | 115 | 1 | 0 | 0 |
| jeffbolznv | 35 | 11 | 1 | 52 |
| JohannesGaessler | 36 | 7 | 0 | 36 |
| taronaeo | 68 | 2 | 1 | 2 |
| slaren | 22 | 2 | 0 | 46 |
| zhanmyz | 68 | 0 | 0 | 0 |
| ngxson | 31 | 5 | 0 | 23 |
| reeselevine | 52 | 1 | 0 | 0 |