Weekly Project News

Weekly GitHub Report for Llama.cpp: January 25, 2026 - February 01, 2026 (21:34:40)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2025, focuses on performance, stability, and user experience, with optimizations and bug fixes that streamline functionality.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [ENHANCEMENT] Feature Request: Support Kimi K2.5: This issue requests support for the Kimi K2.5 model, which is a new iteration of the Kimi series featuring multimedia input capabilities such as image and video, and includes native INT4 quantization for experts. The user highlights challenges in converting the model due to custom code requirements and shares ongoing efforts and solutions related to quantization, template adjustments for reasoning content, and adding image input support.

    • The comments detail attempts to convert the model with warnings about custom code execution, discussions on quantization strategies, fixes for reasoning content parsing by modifying the chat template, and progress on adding image input support with shared code and model files, reflecting collaborative troubleshooting and development.
    • Number of comments this week: 10
  2. [BUG-UNCONFIRMED] Eval bug: comp freeze when running model: This issue reports a problem where running the Qwen3-1.7b-GGUF model on a Ryzen 3 5400u CPU causes the system to freeze and the SSH connection to drop after partial text generation, with no apparent memory spike observed. The user has tried various troubleshooting steps including building from source, limiting CPU threads, disabling GPU usage, and testing on another machine, but the issue persists specifically on the AMD hardware.

    • Commenters suggested verifying a clean CPU build, using reliable GGUF models, disabling GPU explicitly, and checking for VM usage or CPU core settings; the user confirmed no GPU and running on host with virtual CPUs, and shared verbose logs before the crash; recommendations included testing on different releases, purging caches, and adjusting command-line parameters to isolate the problem, but no definitive solution was reached.
    • Number of comments this week: 9
  3. [BUG-UNCONFIRMED] Eval bug: [Vulkan] [Intel] unsloth/GLM-4.7-Flash-Q4_K_M.gguf and A770 main: error: failed to load model (MMAP off): This issue reports a problem where a specific GLM 4.7 model fails to load using the Vulkan backend on Intel Arc A770 GPUs when memory mapping (mmap) is disabled, resulting in an "ErrorOutOfHostMemory" during model loading. The user observes that enabling mmap allows the model to load successfully, but mmap causes other memory allocation errors, and disabling direct I/O or using multiple GPUs appears to mitigate the problem, suggesting it may be related to Vulkan memory management or driver limitations.

    • The comments discuss the memory requirements and errors encountered with and without mmap and direct I/O, share detailed logs with memory debugging enabled, confirm that the issue is likely related to Vulkan driver or system memory handling, and note that disabling direct I/O by default in a recent change should resolve the problem.
    • Number of comments this week: 9
  4. MoE model decode hangs on Jetson Orin AGX (SM87) since b7309: This issue reports that MoE model decoding hangs indefinitely on the NVIDIA Jetson Orin AGX (SM87) starting from commit b7309, with the CUDA kernel stalling during inference despite successful model loading. The root cause was traced to a commit that scales CUDA launch queues, which causes deadlocks on Jetson's unified memory architecture, and a workaround involving setting an environment variable was found, followed by a fix that conditionally disables this scaling on Tegra devices.

    • The comments detail a refined bisect identifying the problematic commit, discuss environment variable workarounds, share reproduction details and build configurations, and provide deep analysis including strace logs explaining the GPU command buffer deadlock specific to Jetson’s unified memory, culminating in a targeted fix that avoids the issue without reverting all changes (a minimal sketch of the device check follows this list).
    • Number of comments this week: 8
  5. [BUG-UNCONFIRMED] [VULKAN] Eval bug: GLM-4.7-Flash gibberish output with FA = on: This issue reports that when running the GLM-4.7-Flash model with flash attention (FA) enabled, the output becomes gibberish after a few prompts once the context length reaches several thousand tokens, whereas disabling FA produces coherent output. The user seeks clarification on various related parameters such as cache reuse, sliding window attention, MLA toggling, and context size reporting, aiming to optimize usage and understand the model's behavior better.

    • Comments reveal that the bug is reproducible by others and is linked to the --cache-reuse parameter, which when removed, resolves the gibberish output; further discussion clarifies that some parameters like ctx-checkpoints have no effect on this model, MLA is not toggleable, and context size reporting discrepancies are normal, with additional guidance provided on parameter usage and configuration.
    • Number of comments this week: 6
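
For the Jetson Orin deadlock in issue 4 above, the fix reportedly disables CUDA launch-queue scaling on Tegra devices. The sketch below is an illustration only, not the actual patch: it shows how host code can use the CUDA runtime's integrated-device flag to fall back to a single launch queue on unified-memory parts. The choose_launch_queue_count helper and its fallback value are assumptions.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical helper: decide how many CUDA launch queues (streams) to use.
    // On integrated, unified-memory devices such as Jetson Orin (Tegra), scaling
    // the queue count up is what reportedly deadlocks, so fall back to one queue.
    static int choose_launch_queue_count(int device, int requested) {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
            return 1; // be conservative if the query fails
        }
        // prop.integrated is non-zero for GPUs that share physical memory
        // with the CPU, which is the case for Tegra/Jetson parts.
        return prop.integrated ? 1 : requested;
    }

    int main() {
        int n = choose_launch_queue_count(/*device =*/ 0, /*requested =*/ 8);
        std::printf("using %d launch queue(s)\n", n);
        return 0;
    }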

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 34

Summarized Issues:

  • Model Loading and Output Corruption Issues: Several issues report problems with model loading and output corruption across different models and hardware. These include freezing and crashing during generation on AMD CPUs, garbage output on Snapdragon processors, gibberish output with flash attention enabled, and Vulkan backend memory errors on Intel GPUs, all indicating instability in model execution and loading under various conditions.
  • [issues/19099, issues/19118, issues/19128, issues/19143]
  • Backend and Hardware Compatibility Crashes: Multiple crashes occur related to backend implementations and hardware specifics, such as assertion failures in Vulkan kv-cache, Metal backend device destruction on Apple M1 Max, CUDA backend crashes with GLM 4.6V and Kimi K2.5 models, and illegal instruction crashes on riscv64 VMs. These highlight ongoing stability challenges with different compute backends and platforms.
  • [issues/19116, issues/19137, issues/19210, issues/19220, issues/19160]
  • Memory and Performance Regressions: There are reports of severe memory leaks with LoRA adapters, performance regressions on Intel integrated GPUs due to Vulkan device duplication, slow model loading with NUMA options, and intermittent token generation hangs with CUDA GPUs. These issues collectively point to resource management and performance degradation problems affecting usability and efficiency.
  • [issues/19217, issues/19221, issues/19191, issues/19232]
  • Threading and CPU Utilization Problems: Issues include double counting of CPU threads when using --threads -1 due to hyper-threading, and unexpectedly low CPU utilization in the rpc-server despite high thread counts. These problems affect parallelism and resource allocation, potentially limiting performance on multi-core systems (a short illustration of the logical-versus-physical core count follows this list).
  • [issues/19110, issues/19192]
  • Model Conversion and Support Requests: There are requests for adding support for new models like Kimi K2.5 and Geilim-1B-Instruct, as well as a bug report on the failure of the convert_hf_to_gguf.py script to handle tokenizer.json files, indicating gaps in model compatibility and tooling.
  • [issues/19127, issues/19214, issues/19152]
  • Feature Requests for UI and API Enhancements: Users request UI improvements such as collapsible code blocks and mouse-resizable chat windows, as well as server-side support for the OpenAI Responses API and multiple LoRA adapter initialization to optimize VRAM usage. These feature requests aim to improve usability and integration with existing tools.
  • [issues/19135, issues/19244, issues/19138, issues/19153]
  • Docker and Platform Support Limitations: Issues report missing linux/arm64 Docker images despite documentation claims and request Ubuntu ARM64 release binaries to avoid slow source builds, highlighting deployment challenges on ARM platforms.
  • [issues/19177, issues/19240]
  • CI/CD and Regression Detection Needs: There is a request for implementing continuous integration and deployment systems specifically for Vulkan on Intel platforms to catch regressions early and ensure stable releases, reflecting a need for improved development workflows.
  • [issues/19213]
  • Parsing and Output Handling Bugs: One issue describes a failure in JSON parsing of translation output despite successful translation logs, resulting in no visible output to users, indicating a disconnect between model output and user-facing results.
  • [issues/19212]
  • MoE Model Deadlocks on NVIDIA Jetson: A specific issue reports MoE model inference hangs on NVIDIA Jetson Orin AGX devices due to CUDA kernel hangs caused by command buffer scaling changes, requiring environment variable workarounds and ongoing fixes to prevent deadlocks in unified memory environments.
  • [issues/19219]
  • Tooling Crashes and Failures: The llama-imatrix tool crashes with a floating point exception when using the --show-statistics option, even with required parameters, indicating stability issues in auxiliary tools.
  • [issues/19190]
  • Windows Executable Failures: The llama-server and llama-cli executables on Windows exit immediately without output or error messages, preventing basic commands from running and complicating troubleshooting efforts.
  • [issues/19236]
  • Compilation and Build Errors: A compilation error occurs with GCC 15 and CUDA 13.1 on Linux due to exception specification mismatches in CUDA headers, requiring manual patching to resolve, showing build environment fragility.
  • [issues/19100]
  • Model Execution Failures with AMX_INT8: LFM2 and LFM2.5 models fail CPU-only inference when built with AMX_INT8 support enabled, causing assertion failures, while disabling AMX_INT8 allows successful runs, indicating compatibility issues with this instruction set extension.
  • [issues/19184]
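
As a worked illustration of the --threads -1 double-counting entry above: std::thread::hardware_concurrency() reports logical (SMT) processors, so an 8-core hyper-threaded CPU reports 16, and a naive "use everything" setting spawns twice as many workers as there are physical cores. The halving heuristic below is an illustrative assumption, not llama.cpp's actual topology logic.

    #include <cstdio>
    #include <thread>

    int main() {
        // Reports *logical* processors: on an 8-core CPU with SMT enabled this
        // is typically 16, which is how a "-1 = use everything" thread setting
        // ends up with twice as many workers as there are physical cores.
        unsigned logical = std::thread::hardware_concurrency();

        // Illustrative heuristic only (assumes 2 hardware threads per core);
        // real code would query the CPU topology through the OS instead.
        unsigned physical_guess = logical > 1 ? logical / 2 : logical;

        std::printf("logical: %u, assumed physical: %u\n", logical, physical_guess);
        return 0;
    }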

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 20

Summarized Issues:

  • GPU and Flash Attention Issues: Multiple issues report problems related to GPU handling and Flash Attention features, including GPU deduplication on Mac Pro with MoltenVK causing recognition of fewer GPUs, crashes with Flash Attention enabled on ROCm and CUDA backends, and performance regressions or incorrect outputs when using Flash Attention or GPU offloading. These problems affect various hardware setups such as AMD and NVIDIA GPUs and lead to crashes, gibberish output, or slower processing.
  • [issues/19026, issues/19096, issues/19119, issues/19122, issues/19158, issues/19169, issues/19200]
  • Model Compatibility and Conversion Errors: Several issues highlight problems with model compatibility and format conversion, including the Granite model not recognized as compatible, failure to convert Granite 3.1 from safetensors to GGUF due to tensor mapping errors, and the MiniMax M2.1 model producing gibberish after an update. These issues prevent proper model evaluation or cause regressions in output quality.
  • [issues/19076, issues/19112, issues/19201]
  • Crash and Execution Failures: There are reports of crashes and execution failures such as llama.cpp exiting immediately on Windows 10 without prompt, llama-server crashing due to chat template parsing errors, and llama-tts failing to load models due to lost model paths or unrecognized command-line arguments. These failures disrupt normal operation and user interaction.
  • [issues/19083, issues/19104, issues/19130]
  • Quantization and Tokenizer Bugs: Issues include quantization aborts with core dumps caused by unsupported tensor sizes and tokenizer vocabulary test failures on Arm RK3588 devices due to invalid vocab files linked to missing Git LFS. These bugs cause crashes or incorrect processing during model preparation or tokenization.
  • [issues/19038, issues/19185]
  • Backend and Compilation Errors: Problems with backend loading and compilation include ZenDNN backend failing due to undefined symbols caused by missing linkage, and a compilation error fixed by adding a missing include directive for size_t (illustrated after this list). These issues prevent successful builds or runtime backend initialization.
  • [issues/19134, issues/19215]
  • CUDA Backend Operation Failures: The CUDA backend shows failures in specific operations such as 20 CONV_2D test cases failing on an RTX 4090 under Windows, indicating implementation problems affecting CUDA-based computations.
  • [issues/19149]
  • Chat Template Parsing and Speculative Decoding Issues: Bugs in chat template parsing cause incorrect tool usage or crashes, and speculative decoding in the chat completions endpoint only works correctly on the first request after server start, failing on subsequent requests and slowing processing. These issues impact chat functionality and response generation.
  • [issues/19155, issues/19231]
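
The size_t compilation fix grouped above under Backend and Compilation Errors is the familiar missing-include case. The snippet below is a generic illustration, not the actual file touched by the fix: code that names std::size_t should include <cstddef> itself rather than rely on the declaration arriving transitively from another header.

    #include <cstddef>   // declares std::size_t; without it this code may
                         // compile with one toolchain (transitive include)
                         // and fail with another

    struct buffer_view {
        const void * data;
        std::size_t  size;   // needs the include above
    };

    int main() {
        buffer_view v{nullptr, 0};
        return static_cast<int>(v.size);
    }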

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 35

Key Open Pull Requests

1. ggml-virtgpu: make the code thread safe: This pull request improves the ggml-virtgpu backend by making the code thread safe through the use of mutexes for accessing shared memory buffers, pre-caching constant backend values during initialization, deprecating the unused buffer_type_is_host method, and addressing various thread safety and stability issues such as removing static variables and adding cleanup functions (a minimal sketch of the locking pattern follows the key pull requests).

  • URL: pull/19204
  • Associated Commits: e9d13, c6a60, 92390, fcc68, 171ab, 3864b, 07f41, f35b2, f9780, e9e9d

2. Remove pipeline cache mutexes: This pull request removes mutexes from pipeline caches by leveraging the per-thread nature of webgpu_context, while retaining necessary mutexes in other shared buffer pools.

  • URL: pull/19195
  • Associated Commits: 608fc, edee0, a9621, 55920, bb362, 83601, df604, 72369, 3e145

3. llama: Add option to merge gate and exp weights: This pull request adds an option --fuse_gate_up_exps to the convert_hf_to_gguf.py script to enable merging of gate and expert weights in MoE models, improving performance as demonstrated with the deepseek2 model, while noting that further integration is needed for all MoE models and their tensors.

  • URL: pull/19139
  • Associated Commits: 3c264, d3793, d29e8, 93c09, d5278, 56688, 362bf
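
The ggml-virtgpu thread-safety work in key pull request 1 combines two patterns: serializing access to shared memory buffers with a mutex, and caching constant backend values once at initialization instead of re-querying them. The sketch below illustrates those patterns only; shared_buffer_pool and backend_caps are invented names, not the backend's real types.

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    // Illustrative stand-in for a pool of shared-memory buffers that several
    // threads may allocate from and release back to.
    struct shared_buffer_pool {
        std::mutex mtx;  // serializes every access to the map below
        std::unordered_map<uint64_t, std::vector<uint8_t>> buffers;
        uint64_t next_id = 1;

        uint64_t alloc(std::size_t size) {
            std::lock_guard<std::mutex> lock(mtx);  // RAII: unlocked on return
            uint64_t id = next_id++;
            buffers[id].resize(size);
            return id;
        }

        void release(uint64_t id) {
            std::lock_guard<std::mutex> lock(mtx);
            buffers.erase(id);
        }
    };

    // Constant backend properties queried once at initialization and then read
    // without locking, instead of being re-queried on every call.
    struct backend_caps {
        std::size_t max_buffer_size;
        bool        supports_host_buffers;
    };

    static backend_caps query_caps_once() {
        // In the real backend these values would come from the device/driver.
        return {1ull << 30, true};
    }

    int main() {
        const backend_caps caps = query_caps_once();  // cached at init
        shared_buffer_pool pool;
        uint64_t id = pool.alloc(caps.max_buffer_size / 1024);
        pool.release(id);
        return 0;
    }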

Other Open Pull Requests

  • RVV and SVE Kernel Optimizations: Multiple pull requests add and optimize GEMM/GEMV kernels for RVV (with VLEN=256) and ARM SVE extensions targeting various quantization types, significantly improving LLM inference performance on platforms like BananaPI-BPI F3 and Graviton3E processors. These implementations maintain accuracy while boosting speed, with functional testing and benchmarking included.
    • pull/19121, pull/19132, pull/19171, pull/19196, pull/19167, pull/19180
  • Model Support and Vision Features: Pull requests introduce support for new models such as Kimi-K2.5 and Longcat-Flash, including handling compressed INT4 tensors, vision-related configuration keys, and the unique "zero-computing experts" MoE FFN architecture. Additional work includes preliminary LongCat-Flash-Lite support with ngram embeddings and model conversion adaptations to GGUF format.
    • pull/19170, pull/19182, pull/19166, pull/19167
  • Server and API Improvements: Several pull requests enhance server functionality by adding token healing support to the completion endpoint, improving error messages with specific model names, validating context and model pointers before sampling, and clarifying handling of multiple --model options with appropriate warnings and errors (a sketch of the token-healing candidate filter follows this list).
    • pull/19238, pull/19117, pull/19101, pull/19113, pull/19156, pull/19166
  • Bug Fixes and Stability Enhancements: Fixes include addressing build failures on Windows Snapdragon devices by adjusting CMake version limits, preventing pointer invalidation in Vulkan semaphore management, fixing asynchronous tensor synchronization bugs, and resolving CUDA flux attention NaN and overflow issues. These changes improve stability and correctness across platforms.
    • pull/19188, pull/19179, pull/19193, pull/19098, pull/19222
  • Memory and Performance Optimizations: Introductions include hybrid model loading combining mmap and DirectIO for weight streaming, chunking in FA CPU implementation for parallelization, and optimized scale computations using Zvfhmin instructions to reduce overhead. These optimizations target improved throughput and efficiency on supported hardware.
    • pull/19179, pull/19196, pull/19180
  • Template and Content Handling Fixes: A pull request fixes Jinja template failures by detecting templates requiring typed arrays and converting string message contents accordingly, preventing runtime errors and ensuring compatibility with expected input types.
    • pull/19156
  • GPU Backend and Vulkan Improvements: Fixes in GPU deduplication logic remove problematic filtering on Windows Intel iGPU and adjust Vulkan semaphore management to prevent pointer invalidation, ensuring correct GPU reporting and resource cleanup across devices and platforms.
    • pull/19209, pull/19188
  • Experimental Features: An experimental implementation explores the "Expected Attention" method to compress the KV cache by estimating KV pair importance based on predicted future queries, aiming to reduce memory usage during LLM inference, though the code is not yet functional.
    • pull/19183
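
Token healing, mentioned in the Server and API Improvements group above, trims the prompt's trailing fragment and restricts the first sampled token to candidates whose text begins with that fragment. Below is a self-contained sketch of that candidate filter with a toy vocabulary standing in for the real tokenizer; it is not the pull request's implementation.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Toy vocabulary: token id -> token text. In the real server this would
    // come from the model's tokenizer.
    static const std::vector<std::string> vocab = {
        "http", "https", "://", "example", ".com", " the", "ht",
    };

    // Token healing in essence: the prompt's trailing fragment (e.g. "ht") is
    // trimmed off, and the first generated token is restricted to tokens whose
    // text starts with that fragment, so "ht" can heal into "http" or "https"
    // instead of being followed by an unrelated token.
    static std::vector<int> healing_candidates(const std::string & fragment) {
        std::vector<int> allowed;
        for (int id = 0; id < (int) vocab.size(); ++id) {
            if (vocab[id].compare(0, fragment.size(), fragment) == 0) {
                allowed.push_back(id);
            }
        }
        return allowed;
    }

    int main() {
        for (int id : healing_candidates("ht")) {
            std::printf("allowed token %d: %s\n", id, vocab[id].c_str());
        }
        return 0;
    }

After that constrained first step, sampling proceeds normally, and the accepted token's text reproduces or extends the characters that were trimmed from the prompt.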

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 70

Key Closed Pull Requests

1. Merge Upstream: This pull request proposed merging upstream changes into the project, including server improvements, web UI enhancements, new model support, bug fixes, build system adjustments, and optimizations, but it was not merged.

  • URL: pull/19124
  • Associated Commits: cbe3d, 0806d, 87224, a4373, 4396b, 1cc5c, 4a195, 306d5, 74376, 1fb1a, aa281, 41f55, d3b74, e601f, a79b8, c2870, aa3bc, 6f4f9, 3f730, 68ee9, 99bd8, c0260, 36ee8, f677c, c54ad, bbfb2, 4f060, ce639, bceb6, 83851, 7be32, 87454, 91edf, 30101, f26db, 75ee8, ff71b, 35349, 34530, 19be6, b61d1, 05836, d9e7d, 3aff2, 2baa7, 97aef, 79ea2, f6162, bbcf5, 72d46, ae3af, 55312, 6e7b2, 21f34, fd1ee, 85280, 04f06, 07684, daae6, 4838a, 9fb5f, 34e70, 37f78, 0a496, 0ed23, 06c69, 8976a, 27b36, 4b9d9, 05875, 23854

2. hexagon: enable offloading to Hexagon on Windows on Snapdragon: This pull request enables offloading to the Hexagon DSP on Windows devices with Snapdragon processors by updating the GGML Hexagon backend to support building and running on Windows on Snapdragon, introducing a unified interface for Android and Windows DSP drivers, implementing necessary artifact signing for Windows, refactoring dynamic loading code for reuse, and providing detailed documentation and scripts for setup and usage, while noting some intermittent multi-session instability.

  • URL: pull/19150
  • Associated Commits: c1625, ee06f, 5c768, d87fe, bc0bb, 6a6d1, 3163b, 38d93, 50024, b084d, 37c56, 3cb73, 3c9bb, 82baa, aeb6d, 86db4, 06ee1, a3822, 499b9, 0bb90, 1d8f3, 934ca, 48e3f, 8b524, c3c23, 57ed1, 39c74, a2911, dc9fb, 58197, 88779, 62783, b5588, d71e9, e7bb3, 03581, a3b24, bcfb8, 8b978, 82c30, 1041b, eb275

3. Vulkan Flash Attention Coopmat1 Refactor: This pull request refactors the Vulkan Flash Attention Coopmat1 shader for AMD by extending the use of cooperative matrices to the Softmax * V matrix multiplication, optimizing shared memory usage, vectorizing variables, and improving memory loading strategies, resulting in significant performance gains on AMD RX 8060S while noting some regression on Nvidia hardware.

  • URL: pull/19075
  • Associated Commits: e8596, 7c46d, b58ac, 37960, 4cfc4, b4e96, 1ed1f, 0bbee, f435f, 35031, 1226b, 44cd0, 3037c, 4b3b6, 1c40f, 61745, 74d32, e0c41, ed11a, 93841, a875c, 7d75b, a0c9f, 1e576, ef86d, fcd3a

Other Closed Pull Requests

  • Console refactoring: This pull request refactors the console functionality in tools/cli.cpp by replacing the console namespace with a singleton class console_t, moving several static methods into this class as private methods, and removing the use of atexit() for cleanup. It updates all console method calls accordingly to improve code structure and resource management.
    • pull/19198
  • WebGPU backend improvements: These pull requests refactor the WebGPU backend by splitting the shared webgpu_context state into global and per-thread states to improve modularity and thread safety, and implement software pipelining and vectorization optimizations for flash attention. The changes result in significant speedups and higher throughput in the WebGPU backend.
    • pull/18976, pull/19151
  • Flash attention and DSP optimizations: This pull request implements dual row dot product optimizations for flash attention on the Hexagon DSP by adding new vectorized dot product functions and refactoring the main attention kernel. These improvements yield significant performance gains on the 8Gen2 device.
    • pull/19141
  • HIP backend and CDNA support: This pull request adds and refactors matrix multiply-accumulate function (mmf) support for CDNA architectures in the HIP backend, including parameterizing rows_per_block, extending tile size, and improving compilation speed. It also provides performance data and requests further testing on CDNA1 and CDNA2.
    • pull/18896
  • OpenCL kernel enhancements: This pull request adds a flattened version of the q6_K matrix-vector multiplication kernel in OpenCL, renames the existing kernel file for clarity, and includes refactorings to enable future optimizations without affecting current performance.
    • pull/19054
  • Jinja template and filter improvements: These pull requests implement support for mixed type object keys in Jinja templates, replicate Python/Jinja behavior for int, float, and bool keys, fix issues with array/object output and JSON serialization, and update filters and tests to treat undefined values as sequences or iterables.
    • pull/18955, pull/19147
  • Speculative decoding methods: These pull requests introduce self-speculative decoding that accelerates token prediction by reusing draft parameters and a lightweight ngram-based speculative decoding module using rolling hashes for constant memory and complexity. Both methods improve performance in token generation and text processing tasks (a sketch of the rolling-hash lookup follows this list).
    • pull/18471, pull/19164
  • SYCL backend updates: These pull requests fix typos and update SYCL documentation, implement the GGML_UNARY_OP_SOFTPLUS operation with matching CPU backend behavior, and propose (but do not merge) the GGML_OP_TOP_K operation for F32 on SYCL.
    • pull/19162, pull/19114
  • CUDA improvements and fixes: These pull requests fix CUDA graph node property matching by restoring original logic for leaf nodes, increase CUDA command buffer size to reduce CPU stalls and improve pipeline parallelism performance, and refactor CUDA topk-moe implementation to support more models with simplified templates and improved throughput.
    • pull/19165, pull/19042, pull/19126
  • Build and CI updates: These pull requests update the HTTPS error message for clearer instructions, switch to the new 1vCPU GitHub Actions runner for lightweight jobs, and disable Direct IO by default with an override for mmap support to address a specific issue.
    • pull/19103, pull/19107, pull/19109
  • Server and context improvements: These pull requests modify server code to wrap the "id_slot" parameter for task distribution without total slot count knowledge and update the output_reserve function to always allocate sampling buffers for every batch, eliminating conditional branching and reallocations.
    • pull/19207, pull/18811
  • Metal backend and multi-GPU support: This pull request adds support for events in the Metal backend using MTLEvent, enabling multi-GPU workflows on Macs even with a single hardware device.
    • pull/18966
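
The ngram-based speculative decoding entry above hashes the most recent token window and looks up the continuation seen the last time that n-gram occurred. The sketch below shows the idea only: the hash is recomputed per window rather than rolled incrementally, an unordered_map stands in for the PR's constant-memory table, and N, the multiplier, and the toy token ids are arbitrary choices.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Map from a hash of the last N tokens to the token that followed that
    // n-gram the last time it was seen; the entry is overwritten on update.
    struct ngram_drafter {
        static constexpr int      N   = 3;
        static constexpr uint64_t MUL = 6364136223846793005ull; // LCG multiplier

        std::unordered_map<uint64_t, int> next_token;

        static uint64_t hash_window(const std::vector<int> & toks, size_t end) {
            uint64_t h = 0;
            for (size_t i = end - N; i < end; ++i) {
                h = h * MUL + (uint64_t) toks[i] + 1;
            }
            return h;
        }

        // Record the continuation of every n-gram in the context so far.
        void update(const std::vector<int> & toks) {
            for (size_t end = N; end < toks.size(); ++end) {
                next_token[hash_window(toks, end)] = toks[end];
            }
        }

        // Propose a draft token for the current tail, or -1 if unseen.
        int draft(const std::vector<int> & toks) const {
            if (toks.size() < N) return -1;
            auto it = next_token.find(hash_window(toks, toks.size()));
            return it == next_token.end() ? -1 : it->second;
        }
    };

    int main() {
        std::vector<int> ctx = {5, 7, 9, 11, 5, 7, 9};  // toy token ids
        ngram_drafter d;
        d.update(ctx);
        std::printf("draft after (5 7 9): %d\n", d.draft(ctx));  // prints 11
        return 0;
    }

Drafted tokens are then verified by the target model in a single batch, so an incorrect guess only costs the verification pass rather than producing wrong output.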

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor         Commits   Pull Requests   Issues   Comments
ggerganov                80              16        0         26
ngxson                   77               2        0         12
CISC                     55               7        0         14
am17an                   40               4        0          8
max-krasnyansky          42               1        0          4
danbev                   31               7        0          3
JohannesGaessler         18               4        0         16
0cc4m                    34               1        0          2
Alcpz                    29               2        0          4
reeselevine              20               0        0         13
