Weekly Project News

Weekly GitHub Report for Llama.cpp: September 01, 2025 - September 08, 2025

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2025 brings performance, stability, and feature improvements, continuing the project's focus on refining core functionality and addressing user feedback.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Feature Request: Apertus models support: This issue requests support for the Apertus models, a new family of fully open-source multilingual models released with their training data and code, with the goal of integrating them into the project. The discussion highlights the complexity of the Apertus chat template, unique features such as deliberation and tool calling, and ongoing development work to improve capabilities like thinking and reasoning.

    • The comments cover technical details about the Apertus model’s modifications and chat template complexity, share testing experiences and opinions on the model’s usefulness, clarify compatibility with existing tools and templates, and provide insights into ongoing training and future improvements for thinking and tool-calling features.
    • Number of comments this week: 10
  2. Eval bug: gpt-oss-20b returns only analisys chunk w/o content and tools: This issue reports a problem where the gpt-oss-20b model, when used with llama.cpp version 6360, only returns an analysis chunk indicating it should use a tool (like get_weather) but fails to produce the actual content or execute the tool call, resulting in empty responses. The user notes that earlier versions (e.g., 6115) worked correctly, and the problem persists even when requesting simple answers without tools; additionally, there is confusion about how to properly use the tools field in the API and the role of the inference server in executing built-in functions.

    • The comments clarify that tool calls are only parsed and returned when the tools field is properly supplied in the API request, and that the --reasoning-format none flag disables tool parsing, which is what causes the issue (see the request sketch after this list). It is explained that llama.cpp itself does not execute built-in tools; these are expected to be handled by an inference server or MCP implementation. Users share experiences with various clients and reference implementations, emphasizing the need to enable native function calling and to format requests correctly to get tool outputs.
    • Number of comments this week: 6
  3. Feature Request: Please consider splitting out libggml.so into a separate project: This issue requests that the libggml.so library be split out into a separate project, as it is currently bundled with llama-cpp but is also used by other projects like whisper.cpp. The user suggests that having libggml.so as an independent project would allow multiple projects to depend on it directly, rather than requiring the installation of llama-cpp.

    • The comments include a link to the ggml repository and discuss build errors encountered when using the system ggml option, noting that the ggml repo is slightly behind llama.cpp and synced periodically. Suggestions are made to manage ggml as a git submodule to avoid such issues, but it is also explained that most ggml development happens within the context of llama.cpp, which influences the current project structure.
    • Number of comments this week: 5
  4. Eval bug: Granite 4.0 Invalid diff: '<|tool_call|>["1025202362"]' not found at start of '<|tool_call|>["1350490027"]': This issue reports a runtime crash occurring when running a tool call with the Granite 4.0 model in required or AUTO mode, caused by an invalid string diff error related to tool call identifiers during partial JSON parsing. The problem appears to stem from how partial JSON tool calls are streamed and parsed, leading to a mismatch in expected string segments and ultimately throwing a runtime exception.

    • The comments include a reproducible Node script that triggers the crash consistently, followed by a discussion of the challenges of streaming partial JSON parses for tool calls, with suggestions to either stream the entire JSON tool call at once or wait until the JSON is fully parseable before streaming (illustrated in the sketch after this list). A pull request implementing a fix based on a different JSON parser approach was shared and confirmed by others to resolve the issue.
    • Number of comments this week: 4
  5. Misc. bug: Performance degradation with -ctk / -ctv q8_0 using GPT OSS 20B: This issue reports a significant performance degradation when using quantized KV cache options (-ctk q8_0 / -ctv q8_0) with the GPT OSS 20B model on llama-server, despite VRAM usage remaining similar. The user provides detailed benchmark logs showing slower token processing times with quantized KV cache enabled and seeks insight into this unexpected slowdown.

    • Multiple commenters confirm experiencing the same slowdown on different hardware, including ROCm with Radeon GPUs and NVIDIA RTX 3080. It is noted that enabling quantized KV cache for a head size of 64 requires a special compilation flag, but recompiling with this flag did not resolve the performance issue, indicating a broader problem affecting processing speed rather than generation speed.
    • Number of comments this week: 4
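
For readers unfamiliar with the tools field discussed in issue 2 above, the sketch below shows a minimal OpenAI-compatible chat completion request against a local llama-server. The /v1/chat/completions path and port 8080 are the server's defaults, while the model name and the get_weather schema are illustrative assumptions rather than values taken from the issue.

    import json
    import urllib.request

    # Assumes a local llama-server on its default port; adjust to your setup.
    URL = "http://127.0.0.1:8080/v1/chat/completions"

    payload = {
        "model": "gpt-oss-20b",  # illustrative; llama-server serves whatever model it loaded
        "messages": [
            {"role": "user", "content": "What is the weather in Paris?"},
        ],
        # Declaring tools is what lets the server parse and return structured
        # tool calls instead of (or alongside) plain text content.
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        message = json.loads(response.read())["choices"][0]["message"]
        # With tools declared, tool invocations show up here rather than in "content".
        print(message.get("tool_calls") or message.get("content"))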
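
Issue 4 above turns on how partial JSON for a streamed tool call should be surfaced. The sketch below is not llama.cpp's parser; it only illustrates, in Python, the suggestion from the comments to buffer streamed fragments and emit a tool call only once the accumulated text parses as complete JSON.

    import json

    def emit_tool_calls(fragments):
        """Yield a tool call's arguments only once the accumulated JSON parses cleanly."""
        buffer = ""
        for fragment in fragments:
            buffer += fragment
            try:
                arguments = json.loads(buffer)  # incomplete JSON raises a ValueError
            except ValueError:
                continue                        # keep buffering the partial output
            yield arguments                     # complete object: safe to stream out
            buffer = ""

    # Streamed fragments of a single tool call, deliberately split mid-token.
    stream = ['{"city": "Par', 'is", "uni', 'ts": "metric"}']
    for call in emit_tool_calls(stream):
        print(call)  # -> {'city': 'Paris', 'units': 'metric'}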

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 525 days and highlights a discrepancy in behavior between different Vulkan backends within the project.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in debugging and performance analysis. The user is working on improving the Metal backend in a related project and seeks a documented or known method to produce the type of GPU trace output similar to what is provided by Apple's Metal debugger.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress indicators when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate, non-overlapping progress status for each parallel download (the option's basic behavior is sketched after this list).
  4. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
  5. Eval bug: Load Kimi-K2-Instruct-BF16-384x14B-Q2_K_S-00001-of-00013.gguf fail: This issue reports a failure to load the Kimi-K2-Instruct-BF16-384x14B-Q2_K_S-00001-of-00013.gguf model on a Mac M2 Ultra system using the Metal GGML backend, where the loading process errors out due to an inability to open one of the model's split files. The user provides detailed logs showing that the model loader cannot find or access the second split file in the sequence, resulting in the overall model initialization failing.
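
The CURLOPT_NOPROGRESS option named in item 3 above controls whether libcurl invokes a progress callback at all. The sketch below is not llama.cpp's C++ downloader; it only illustrates the option's behavior from Python via pycurl, using a placeholder URL and giving each transfer its own labelled callback.

    import io
    import pycurl

    def progress_printer(label):
        # Give each transfer its own labelled callback so parallel downloads
        # produce distinguishable progress lines.
        def on_progress(download_total, downloaded, upload_total, uploaded):
            if download_total:
                print(f"{label}: {downloaded * 100 // download_total}%")
        return on_progress

    buffer = io.BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "https://example.com/model-00001-of-00004.gguf")  # placeholder URL
    curl.setopt(pycurl.WRITEFUNCTION, buffer.write)
    curl.setopt(pycurl.NOPROGRESS, False)  # progress callbacks stay disabled unless this is cleared
    curl.setopt(pycurl.XFERINFOFUNCTION, progress_printer("shard 1"))
    curl.perform()
    curl.close()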

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 26

Summarized Issues:

  • Model crashes and runtime errors with specific hardware and backends: Several issues report crashes or runtime failures when running models on particular GPUs or backends. These include crashes during warmup on Intel Arc A770 with Vulkan, assertion failures on NVIDIA RTX 2060 with Vulkan, and runtime errors on Ascend 910B NPU and NVIDIA RTX 3070 Ti with CUDA, indicating hardware and backend compatibility problems.
  • issues/15701, issues/15810, issues/15759, issues/15841
  • Performance issues and discrepancies with drivers and quantization: There are significant performance degradations reported due to driver differences and quantization options. For example, Conv2D operations run much slower on AMDVLK than on RADV, and quantized KV cache options drastically slow down GPT OSS 20B model evaluation despite similar VRAM usage (the cache-type flags involved are shown in the sketch after this list).
  • issues/15725, issues/15766
  • Model conversion and architecture support limitations: Multiple issues highlight failures or missing support in converting models to the gguf format due to unrecognized architectures or tokenizers. This includes deepseek-ai models with unsupported BPE pre-tokenizers, missing support for deepseek_vl_v2 and ApertusForCausalLM architectures, and requests to add support for Apertus models.
  • issues/15734, issues/15741, issues/15751, issues/15844
  • API and server message handling bugs: There are bugs related to server message content and API features, such as the server sending assistant messages with null content causing runtime errors, and a feature request to retrieve chat completions in raw token form to preserve schema markers.
  • issues/15755, issues/15731
  • Tool call argument and parsing errors: Issues report failures in tool call handling, including invalid string diff errors causing runtime crashes and the "edit_file" tool call not correctly transmitting the required "mode" argument, leading to tool call failures.
  • issues/15713, issues/15823
  • GPU resource and assignment issues: Problems with GPU resource management include the separate projector file loading onto GPU 0 regardless of the specified GPU flag, causing unexpected GPU assignment despite available VRAM.
  • issues/15804
  • Assertion failures and DLL loading issues on Windows: An intermittent assertion failure occurs when dynamically loading and unloading ggml.dll on Windows, raising questions about the necessity of the exception given Windows' inability to load the same DLL twice.
  • issues/15752
  • Model output and tool invocation failures: The gpt-oss-20b model fails to produce actual content or execute tool calls in recent llama.cpp versions, only outputting analysis messages referencing tool usage, whereas earlier versions worked correctly.
  • issues/15789
  • Server stability and signal handling problems: The llama-server running on FreeBSD with Vulkan backend loads models but never returns output to the browser and cannot be gracefully stopped with SIGINT, requiring a force kill.
  • issues/15831
  • Library and project structure improvement requests: There is a request to split the libggml.so library into a separate project to allow independent dependencies for projects like llama-cpp and whisper.cpp, improving modularity.
  • issues/15778
  • Training and fine-tuning inconsistencies: Fine-tuning the SmolLM2-135M Base model on CPU yields worse results than on CUDA, suggesting a potential bug in CPU finetuning. Additionally, converting a fine-tuned LLaMA 3.1-8B Instruct model with added domain-specific tokens to GGUF format causes failure to recognize these tokens during inference.
  • issues/15779, issues/15842
  • Inference pipeline and token batch processing bugs: The llama-server crashes with an assertion failure when processing audio input prompts due to token batch handling issues, despite the same model working via CLI, indicating server evaluation pipeline problems.
  • issues/15812
  • Feature enablement and enhancement requests: A request was made to enable the LLGuidance feature by default in official Docker containers, suggesting adding Rust to the Docker build process to improve user accessibility.
  • issues/15833
  • Memory calculation and buffer overflow bugs: An integer overflow bug causes buffer overflow due to incorrect assumptions about tensor overhead sizes during memory size calculation, posing a stability risk.
  • issues/15711
  • Inference performance optimization proposals: A proposal suggests consolidating multiple tensor copies from CPU to GPU into a single copy operation to reduce CUDA API overhead and improve inference speed.
  • issues/15749
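
For context on the quantized KV cache slowdown mentioned in the performance bullet above, the flags in question are llama.cpp's -ctk/--cache-type-k and -ctv/--cache-type-v options. The snippet below is only one way to script an A/B comparison from Python; the llama-bench binary location and model path are placeholders, and it assumes your build of llama-bench accepts the same cache-type flags as llama-server (check --help).

    import subprocess

    MODEL = "gpt-oss-20b.gguf"  # placeholder path

    # Baseline run (default f16 KV cache) followed by a quantized q8_0 KV cache run.
    for extra_flags in ([], ["-ctk", "q8_0", "-ctv", "q8_0"]):
        command = ["./llama-bench", "-m", MODEL] + extra_flags
        print("running:", " ".join(command))
        subprocess.run(command, check=True)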

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 18

Summarized Issues:

  • Performance regressions in inference and sampling: Several issues report performance drops in different components of llama.cpp, including a Vulkan Windows distribution regression reducing token throughput on NVIDIA GPUs and a sampling speed penalty introduced by a specific commit affecting llama-server but not llama-cli. These regressions highlight the impact of recent code changes on efficiency and responsiveness in various environments.
  • issues/15618, issues/15672
  • Crashes and runtime errors in server and tool modules: Multiple crashes occur in llama-server and related modules, including a kernel crash in ggml_cann_rms_norm on Ascend NPU due to tensor shape issues, a runtime error from an empty grammar stack during tool calling, and Vulkan-related crashes on Windows with NVIDIA GPUs after a certain build. These issues cause instability and termination of processes under specific hardware or input conditions.
  • issues/15330, issues/15608, issues/15678
  • Build and compilation problems across platforms and backends: Several issues describe build failures and linker errors, including CUDA runtime errors caused by unclean build states, Vulkan compile errors on Linux due to outdated Vulkan versions, and missing zDNN references on IBM LinuxONE systems requiring manual build and configuration. These problems complicate the build process and require careful environment management.
  • issues/15718, issues/15737, issues/15772
  • Model compatibility and conversion issues: Problems arise when converting or quantizing certain models, such as assertion failures during tencent/Hunyuan-MT-7B conversion due to dynamic RoPE scaling changes and quantization failures for the ByteDance Seed-OSS-36B model caused by unrecognized architecture labels. These issues necessitate updates to conversion tools or manual adjustments to support new or complex models.
  • issues/15726, issues/15757
  • Hardware-specific runtime faults and assertion failures: There are hardware-related runtime errors including GPU crashes on AMD Instinct MI210 devices due to memory access faults with long prompts in HIP backend, and assertion failures on Apple A16 GPUs caused by incompatible tensor shapes during matrix multiplication in the Metal backend. These highlight challenges in ensuring stable operation across diverse hardware.
  • issues/15829, issues/15806
  • Logical bugs and feature incompatibilities in API and model behavior: Bugs include an inverted logical condition causing errors when disabling the "thinking" feature, and incoherent story generation on Apple M3 Pro with Metal backend after a specific commit, which was later fixed. These issues affect user-facing features and model output quality.
  • issues/15401, issues/15808
  • Request for new model support and codebase maintainability improvements: There is a feature request to add support for the Apertus multilingual models requiring specific activations and normalization, alongside concerns about extremely large code files exceeding 10,000 lines that hinder readability and maintainability, calling for refactoring.
  • issues/15753, issues/15723
  • Precision conversion failures on specialized hardware: A bug introduced after a certain commit causes FP32 to FP16 and vice versa conversions to fail on IBM NNPA co-processor with big-endian CPU backend, resulting in incorrect inference outputs and assertion failures during normalization. This affects correctness on specialized hardware platforms.
  • issues/15721
  • Efficiency improvements via adapter activation: A proposal suggests adding support for Activated LoRA (aLoRA) adapters in llama.cpp to enable more efficient multi-turn inference by activating adapter weights only after a specific prompt, allowing reuse of the base model’s KV cache and reducing token fill time in retrieval-augmented generation scenarios.
  • issues/15212

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 34

Key Open Pull Requests

1. ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorrect zTensor free: This pull request addresses issue #15414 by activating FP16 and BF16 acceleration in ggml-zdnn, fixing incorrect zTensor memory freeing to prevent exhaustion during multi-model inference, correcting inference errors caused by LLAMA_SET_ROWS=1, and improving performance by initializing bias zTensors in .init_tensor, with detailed performance benchmarks demonstrating significant speedups on IBM z17 hardware.

  • URL: pull/15839
  • Merged: No
  • Associated Commits: 47509, 6e780, bf285, f4ec7, e0bae, 7de71, 81e20, 5c31f, 1a6d6, 8279a, 0a08b, 9ed59, 99311, d5b32, 53b2a, 4f6be

2. convert : use reflinks for faster conversion: This pull request introduces a --reflink option to the convert_hf_to_gguf.py script that leverages Copy-On-Write filesystem features on supported filesystems such as BTRFS, XFS, and ZFS to significantly speed up model conversion and reduce disk space usage by sharing file extents. It handles alignment constraints and provides fallbacks for unsupported cases, though it remains experimental and may produce incompatible or broken models in some scenarios (a minimal reflink sketch follows this list of key pull requests).

  • URL: pull/15727
  • Merged: No
  • Associated Commits: 6b327, bbc35, 1f000, dd6a4, 5c83b, eecb8, 4b4d6, 1f18f, add6b, 81003

3. ggml: allow casting between f32 and i32: This pull request introduces functionality to allow casting between 32-bit floating point (f32) and 32-bit integer (i32) types across multiple backends (CPU, Metal, CUDA, Vulkan) in the ggml library, enabling further calculations from outputs like ggml_argmax or ggml_top_k that previously lacked this conversion operation.

  • URL: pull/15783
  • Merged: No
  • Associated Commits: f3b48, d4f78, 4f57d, fee65, 01af2, 60e8f, 9f5d5, 0a2b2, 65b19
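
The copy-on-write trick behind the --reflink option above can be illustrated outside the converter. The sketch below is not the pull request's code; it is a minimal, Linux-only illustration of a reflink copy via the FICLONE ioctl (the 0x40049409 request number is an assumed constant that the Python standard library does not expose), falling back to a plain copy when the filesystem cannot share extents.

    import fcntl
    import shutil

    FICLONE = 0x40049409  # Linux FICLONE ioctl request number (assumed; not in the stdlib)

    def reflink_copy(src_path: str, dst_path: str) -> None:
        """Clone src into dst so both files share extents until one of them is modified."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            try:
                # On reflink-capable filesystems (e.g. BTRFS, XFS) this completes
                # almost instantly and consumes no additional space up front.
                fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            except OSError:
                # Unsupported filesystem or platform: fall back to a plain byte copy.
                shutil.copyfileobj(src, dst)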

Other Open Pull Requests

  • CUDA-based conv2d improvements: Multiple pull requests introduce CUDA-optimized conv2d implementations leveraging implicit GEMM and Tensor Cores, significantly boosting performance with up to 10× throughput and 3–10× GFLOPS gains on large kernels, especially in FP16 precision. These changes also include detailed benchmarks and address build stability and memory trade-offs on smaller convolutions.
    • pull/15805, pull/15813
  • OpenCL and embedded GPU optimizations: Pull requests add initial support for q8_0 matrix-vector multiplication using OpenCL and introduce a new matrix multiplication shader variant optimized for embedded GPUs like Mali-G715, improving performance and laying groundwork for future operation enhancements.
    • pull/15732, pull/15800
  • CUDA mmf kernel and batch size optimizations: A pull request adds support for the mul_mat_id operation in the CUDA mmf kernel optimized for batch sizes less than 16, improving efficiency by calculating activations per token per expert and filtering unused experts, resulting in better performance on NVIDIA 4090 GPUs without affecting normal execution.
    • pull/15767
  • Build system and ROCm support enhancements: One pull request fixes build failures in the Nix flake for llama-cpp-rocm by adding missing executables, enabling ROCWMMA_FATTN support, updating architecture targets, and adding UMA support, ensuring successful builds with or without ROCm and MPI. Another addresses AMDGPU_TARGETS deprecation warnings in ROCm 6.4 by updating Docker build configurations.
    • pull/15747, pull/15786
  • Device management and prioritization: A pull request introduces a new device type GGML_BACKEND_DEVICE_TYPE_IGPU, adds device IDs to device properties, modifies device prioritization to prefer iGPU only when no GPU is available, and prevents multiple devices from different backends sharing the same device ID.
    • pull/15797
  • Vulkan backend performance and feature support: Several pull requests improve Vulkan backend performance by using larger vector/matrix loads, add support for pad_ext and im2col_3d operations, and enhance memory allocation to support large graphs exceeding 4 GB by splitting allocations across multiple backend buffers.
    • pull/15729, pull/15794, pull/15795, pull/15815
  • CANN backend improvements: Pull requests propose switching to stream synchronization for better efficiency and implement an LRU cache for ACL graphs to enable on-demand graph loading and eviction, reducing reconstruction overhead (the eviction idea is sketched after this list).
    • pull/15809, pull/15814
  • Quantization and architecture-specific optimizations: A pull request introduces block repacking support for the Q4_K quantization format on AArch64 with NEON-optimized quantize paths and GEMV/GEMM kernels, resulting in significant performance improvements without degrading perplexity.
    • pull/15719
  • Web and UI enhancements: Pull requests add the currently loaded model name with detailed shard and build info to the llama-server WEBUI header, fix the Seed-OSS thinking block issue in the web UI, and improve user-facing warnings for chat template usage and context length, enhancing clarity without changing functionality.
    • pull/15787, pull/15820, pull/15822
  • Tooling and example additions: Pull requests introduce an example demonstrating next-token prediction with LLaMA models, rewrite the llama-run tool to leverage llama-server's advanced functionality, and add support for the docker:// protocol in llama-server to pull and run models directly from Docker Hub.
    • pull/15774, pull/15790, pull/15818
  • WebGPU backend build fixes: A pull request fixes the build process of the webgpu backend on the emscripten platform by applying necessary code changes and debugging enhancements to ensure successful compilation and testing in a single-threaded wasm environment.
    • pull/15826
  • Vulkan shader debugging and performance tools: A pull request initializes vulkan-hpp to enable extension function pointers for querying register counts on NVIDIA GPUs, primarily for performance debugging and potential future heuristic improvements.
    • pull/15705
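
The ACL-graph LRU cache mentioned in the CANN bullet above lives in the backend's C++ code; the sketch below only illustrates the general idea in Python: keep recently used compiled graphs keyed by their configuration and evict the least recently used one when the cache is full, so evicted graphs are simply rebuilt on demand.

    from collections import OrderedDict

    class GraphCache:
        """Tiny LRU cache: build on miss, evict the least recently used entry when full."""

        def __init__(self, capacity, build_graph):
            self.capacity = capacity
            self.build_graph = build_graph   # callable: key -> compiled graph
            self.entries = OrderedDict()

        def get(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)      # mark as most recently used
                return self.entries[key]
            graph = self.build_graph(key)          # cache miss: (re)build the graph
            self.entries[key] = graph
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
            return graph

    # Usage sketch: the key could encode tensor shapes or other graph configuration.
    cache = GraphCache(capacity=2, build_graph=lambda key: f"compiled<{key}>")
    print(cache.get("shape=(1,4096)"), cache.get("shape=(8,4096)"), cache.get("shape=(1,4096)"))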

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 76

Key Closed Pull Requests

1. ggml: add ops for WAN video model (cuda && cpu): This pull request adds essential operations such as conv3d and im2col_3d support for both CUDA and CPU backends to enable the WAN video model, while also fixing related issues and improving performance and compatibility across platforms.

  • URL: pull/15669
  • Merged: 2025-09-04T08:38:49Z
  • Associated Commits: c92f9, 93c7e, f7a12, 85c8e, ae47c, dd745, d8377, d30e0, df059, 9d035, d11a7, f6a87, f6278, c9b9f, 131ae, 0d5eb, aafa7, 3f901

2. ggml-cpu: fixes instability in NNPA Vector Intrinsics: This pull request addresses and fixes the instability issues in the GGML NNPA vector intrinsics that caused gibberish output when compiling with -DGGML_NNPA using more than four threads, and it also implements automatic disabling of Flash Attention in this build configuration, with extensive verification tests confirming correct inference across multiple thread counts and performance benchmarks on an IBM z16 mainframe.

  • URL: pull/15739
  • Merged: No
  • Associated Commits: 14c87, 0cc20, 1edd6, b8e17, a59f3, fde52, ed91e, 4200b, 0b3be, b9ce3, d73c4, 0510f, fac0d, dc84c, f00ec

3. feat: add Jinja tester PySide6 simple app: This pull request introduces a simple Python application using PySide6 for testing Jinja templates that improves upon existing Jinja tester websites by providing detailed debugging information, including the specific line number of template errors (the underlying error reporting is sketched after this list of key pull requests).

  • URL: pull/15756
  • Merged: 2025-09-04T23:05:12Z
  • Associated Commits: 2d189, 0b766, 67c29, dcbcb, 8789a, 1851f, d1ae1, 3225d, a6fbf, de414, ca6e8, 4ee48
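
The error reporting described in pull request 3 above maps onto what the jinja2 library already exposes. The sketch below shows only that underlying mechanism, a TemplateSyntaxError carrying a line number, not the PySide6 application itself.

    from jinja2 import Environment, TemplateSyntaxError

    TEMPLATE = """{% for message in messages %}
    {{ message.role }}: {{ message.content }
    {% endfor %}"""  # note the unclosed expression on line 2

    try:
        Environment().from_string(TEMPLATE).render(messages=[])
    except TemplateSyntaxError as error:
        # jinja2 already carries the offending line number; the tester app surfaces it.
        print(f"line {error.lineno}: {error.message}")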

Other Closed Pull Requests

  • LoRA and Adapter Support: This topic covers the addition of Activated LoRA (aLoRA) support in the llama-server and GGUF format, enabling efficient hot-swapping of LoRA adapters without clearing the cache. These changes allow multi-adapter models to dynamically apply add-on features during model execution.
    • pull/15327
  • Kernel and Backend Optimizations: Multiple pull requests optimize CUDA kernels, RISC-V Vector kernels, and backend implementations to improve performance and hardware support. These include fast integer division for CUDA rms_norm_f32, elimination of pipeline stalls in RVV kernels, and enhancements in the ggml-cpu backend.
    • pull/15715, pull/15720, pull/15690
  • Documentation and Code Refactoring: Improvements include rebasing and enhancing GGML API documentation with new 3D convolution function details, splitting a large server source file into modular components, and updating bash scripts for portability. These changes improve maintainability, readability, and usability of the codebase.
    • pull/15777, pull/15632, pull/15765, pull/15791
  • File and Memory Handling Enhancements: Refactoring efforts include writing gguf data directly to disk to handle large models efficiently and managing per-device workspaces in the Ascend backend to prevent memory conflicts. These changes improve efficiency and fix precision issues in multi-device scenarios.
    • pull/15691, pull/15763
  • WebGPU and Vulkan Backend Updates: Support for TRANSPOSE and RESHAPE operations was added to the ggml WebGPU backend, while Vulkan integration was enhanced by using the memory budget extension to monitor device memory consumption accurately. Additionally, missing Vulkan operators like hardsigmoid and hardswish were implemented.
    • pull/15695, pull/15545, pull/15762
  • Server Features and API Improvements: Enhancements include enabling the /slots endpoint by default with improved security, adding prompt processing progress reporting in stream mode, and implementing logic to handle the enable_thinking flag in assistant prefill templates to prevent incompatible states.
    • pull/15630, pull/15827, pull/15404
  • Sampling and Model Execution Optimizations: Sampling was optimized by reusing bucket sort to reduce memory allocations and allow optional candidate sorting. Support for running operators in eager execution mode while ACL graph compilation is enabled was also added to improve debugging and reduce overhead.
    • pull/15665, pull/15712
  • Bug Fixes and Stability Improvements: Fixes include resolving build errors in CUDA half-precision conv2d kernels by unifying type conversions, correcting the llama_context constructor to prevent assertion failures during embedding mean pooling, and addressing RoPE cache issues in the CANN backend to avoid memory and precision errors.
    • pull/15690, pull/15791, pull/15629
  • Build and Compilation Enhancements: Incremental build times were improved by refactoring shader loading to avoid std::string, and checks for supported FA kernels in the metal backend were updated while removing unused vector kernels. These changes speed up compilation and improve code maintainability.
    • pull/15700
  • Miscellaneous Features: Added support for Nemotron thinking and toolcalling with streaming capabilities, enhancing reasoning functionality as a follow-up to previous updates.
    • pull/15676

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor       Commits  Pull Requests  Issues  Comments
ggerganov         76       10             0       60
CISC              57       5              0       64
JohannesGaessler  33       6              0       72
jeffbolznv        40       12             1       46
pwilkin           47       2              4       27
taronaeo          51       4              2       19
ngxson            39       5              0       18
EAddario          61       1              0       0
wine99            60       1              0       0
danbev            42       14             1       2
