Weekly Project News


Weekly GitHub Report for Llama.cpp: March 23, 2026 - March 30, 2026 (22:25:53)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is b4991.

1.2 Version Information:

The version released on March 29, 2026 introduces updates that improve overall performance and stability, including faster processing speeds and bug fixes for previously reported issues.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [ENHANCEMENT] Feature Request: TurboQuant support: This issue requests the addition of TurboQuant support, a new quantization method announced by Google that compresses KV cache using polar coordinates to reduce memory requirements, enabling larger models to run on smaller hardware. The original poster shares their experimental implementation and invites discussion on potential improvements, benchmarking, and compatibility with various hardware backends.

    • The comments include inquiries about reference implementations, discussions on performance trade-offs and structured rotations for speed, shared benchmarks on different hardware, reports of partial implementations and bugs, and community contributions with pull requests and alternative branches; overall, the conversation reflects active experimentation, validation, and interest in integrating TurboQuant into the project.
    • Number of comments this week: 40
  2. [BUG-UNCONFIRMED] Misc. bug: --threads -1 (default) works in llama-cli but slows down llama-server by 50%: This issue reports a performance problem where the default setting of --threads -1 causes llama-server to slow down by 50% when running certain large models quantized by AesSedai, while the same setting works fine in llama-cli and other quantized models do not exhibit this slowdown. The user provides detailed system specs and testing results, noting that explicitly setting --threads 14 resolves the slowdown, and suspects the issue is related to the specific quantization method used by AesSedai.

    • The comments explore possible causes such as NUMA configurations and recent bug fixes, confirm the user is on a unified memory system, and identify that the slowdown only occurs with AesSedai's quantized models but not others; further testing with different thread counts and settings is suggested to isolate the problem.
    • Number of comments this week: 13
  3. [ENHANCEMENT] Feature Request: Restore ARM64 Release Binaries: This issue requests the restoration of ARM64 release binaries in the continuous integration workflow following recent fixes that resolved ARM64 build failures, aiming to improve accessibility for ARM64 users by providing pre-built binaries. The proposal includes updating the release configuration to support ARM64 builds on Ubuntu 24.04 and discusses the inclusion of CPU and Vulkan ARM64 releases, while noting the complexity of providing CUDA Linux releases.

    • The comments reference a related pull request and discuss adding ARM64 releases for CPU and Vulkan on Ubuntu 24.04, confirm ongoing work on Docker ARM64 images, clarify the absence of CUDA Linux releases due to portability challenges, and address which workflow files need updating to ensure proper test coverage before release.
    • Number of comments this week: 9
  4. [BUG-UNCONFIRMED] Compile bug: RISC-V fails with GGML_CPU_ALL_VARIANTS=ON: This issue reports a compilation failure when building the RISC-V version of the project with the GGML_CPU_ALL_VARIANTS=ON flag enabled, due to missing definitions related to half-precision floating-point vector operations. The user identifies that the problem stems from the lack of support for certain RISC-V vector extensions and inconsistent guard usage in the code, and a fix involving explicit scalar fallbacks and potentially new backend variants is proposed.

    • The comments discuss the environment details, confirm the root causes involving missing vector extension support and guard inconsistencies, and agree on implementing scalar fallbacks and adding new backend variants with updated CI tests to resolve the issue.
    • Number of comments this week: 5
  5. [BUG-UNCONFIRMED] Eval bug: Parse errors increased with build 8515 (062cca58f): This issue reports an increase in random parse errors occurring after building version 8515 of the software, which were not present in earlier versions. The errors appear related to grammar parsing failures and have been observed across different hardware backends including Metal on Mac, CUDA, and Vulkan, with some users identifying a specific commit suspected to introduce the problem.

    • The comments reveal that multiple users experience similar parsing errors on various platforms, with some successfully resolving the issue by downgrading to earlier versions or reverting specific commits suspected to have introduced problematic grammar changes.
    • Number of comments this week: 4
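
The TurboQuant discussion above centers on compressing KV-cache values by storing them in polar coordinates with a small number of bits. As a rough illustration of that idea only (not the actual TurboQuant algorithm, whose details are not given in this report), the following Python sketch encodes 2-D value pairs as a 6-bit magnitude plus a 4-bit angle; R_BITS, THETA_BITS, and the per-tensor r_max scale are made-up parameters:

```python
import math

R_BITS, THETA_BITS = 6, 4  # hypothetical bit widths, not TurboQuant's real ones

def quantize_pair(x, y, r_max):
    """Encode a 2-D value pair as quantized polar coordinates."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    q_r = round(min(r, r_max) / r_max * (2**R_BITS - 1))
    q_theta = round((theta + math.pi) / (2 * math.pi) * (2**THETA_BITS - 1))
    return q_r, q_theta

def dequantize_pair(q_r, q_theta, r_max):
    """Reconstruct the approximate (x, y) pair from the stored codes."""
    r = q_r / (2**R_BITS - 1) * r_max
    theta = q_theta / (2**THETA_BITS - 1) * 2 * math.pi - math.pi
    return r * math.cos(theta), r * math.sin(theta)

pairs = [(0.5, -0.25), (1.2, 0.9), (-0.7, 0.3)]
r_max = max(math.hypot(x, y) for x, y in pairs)
for x, y in pairs:
    xq, yq = dequantize_pair(*quantize_pair(x, y, r_max), r_max)
    assert abs(x - xq) < 0.2 and abs(y - yq) < 0.2  # coarse but bounded error
```

Spending more bits on magnitude than on angle is one way such a scheme can allocate its error budget; the real method's bit split and scaling may differ.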

2.2 Top 5 Stale Issues:

We consider stale issues to be those that have had no activity within the last 30 days. The team should work together to resolve and close these issues as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 47

Summarized Issues:

  • Parser and Grammar Errors: Multiple issues describe bugs related to parsing and grammar handling in the project, including null pointer dereferences in the jinja parser, empty grammar stacks causing runtime errors, and increased random parse errors after recent grammar changes. These problems lead to runtime crashes and unexpected termination during evaluation or token acceptance.
  • issues/20911, issues/21013, issues/21017
  • Backend Crashes and Segmentation Faults: Several issues report crashes and segmentation faults occurring in different backends such as OpenVINO GPU, BLAS on macOS without METAL, CUDA with NVIDIA A100, and CANN 8.5.0 with Ascend hardware. These crashes are caused by null pointer dereferences, backend-specific bugs, or illegal instructions during model execution or tokenization.
  • issues/20922, issues/20976, issues/21113, issues/21178
  • Performance Regressions and Slowdowns: There are multiple reports of significant performance issues including a tenfold drop in decode throughput when increasing CPU threads beyond 16, slow OpenVINO backend inference speeds, and a 50% slowdown in llama-server with certain quantized models due to thread settings. These regressions affect throughput and inference speed across various hardware and configurations.
  • issues/20938, issues/20972, issues/21042
  • Vulkan and GPU Backend Issues: Issues related to Vulkan backend include non-deterministic crashes on AMD RDNA3 hardware, continuous integration failures due to driver problems, slow quantization paths on AMD GPUs, and vision functionality producing nonsensical output on Intel Iris Xe iGPU. These problems impact stability and correctness of GPU-accelerated operations.
  • issues/20916, issues/21004, issues/21151, issues/21153
  • Model and Quantization Bugs: Several bugs affect model fine-tuning, quantization, and model-specific behaviors, including cascading bugs preventing fine-tuning on Apple Silicon, TurboQuant support requests, and bugs causing improper stop sequence handling or excessive logging during quantization. These issues hinder model training, compression, and inference quality.
  • issues/20977, issues/21037, issues/21063, issues/21115
  • Build and Compilation Failures: Multiple issues describe build failures on various platforms including RISC-V with missing vector extensions, ROCm backend compilation errors on Fedora Linux, and Windows ROCm device detection failures due to SDK changes. These failures block successful compilation and deployment on targeted architectures.
  • issues/20988, issues/21064, issues/21083, issues/21106
  • Server and Network Errors: Problems with llama-server include network errors on Windows requiring system reboot, MCP registration failures over HTTP, missing API keys in CORS proxy requests causing authentication failures, and the server becoming unresponsive during processing. These issues degrade server reliability and user experience.
  • issues/20945, issues/21069, issues/21127, issues/21167
  • Feature Requests for Platform Support and Releases: Requests include adding OpenVINO Windows builds, riscv64 and ARM64 release binaries, Windows OpenVINO versions, and support for new models like nvidia/gpt-oss-puzzle-88B and QWEN 3.5. These aim to improve accessibility and expand hardware and model support.
  • issues/20942, issues/20988, issues/21091, issues/21108, issues/21028
  • Multimodal and Hybrid Model Handling: Issues include bugs with the --mmproj flag causing assertion failures, requests to save and restore slots from SSD for hybrid Qwen 3.5 models, and problems with unified KV cache overflow handling causing interruptions in multi-user workflows. These affect efficient handling of multimodal inputs and large context models.
  • issues/21133, issues/21173, issues/21179
  • WebUI and Tool Integration Bugs: Problems in the WebUI include agentic session handling dropping reasoning content and merging tool-call turns incorrectly, and requests for sandboxed external tool executors to improve security without performance loss. These issues impact user interface functionality and security.
  • issues/21087, issues/21126

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 37

Summarized Issues:

  • Performance and Offloading Issues: Several issues report performance degradation and offloading problems with various models and hardware configurations. These include slower preprocessing and throughput with fused gate+up operations, incorrect CPU offloading of tensors on AMD GPUs, inability to offload asymmetric key/value cache quantization to GPU, and throughput drops when using mixed offload with fused models, complicating efficient model execution.
  • issues/20492, issues/20575, issues/20866, issues/20883
  • GPU Backend and Vulkan Compatibility: Multiple issues highlight problems with GPU backends, including Vulkan backend failures on AMD RX 580 GPUs, lower performance and instability of ROCm backend on AMD RX 7900 XTX compared to Vulkan, and compilation failures related to Vulkan support on Windows due to cpp-httplib limitations. These issues affect model execution stability and performance on AMD hardware.
  • issues/20699, issues/20934, issues/20909
  • Model Loading and Cache Management Bugs: There are several bugs related to model loading and cache handling, such as incorrect offload logic causing excessive VRAM usage, failure to load quantized models due to imatrix file misselection, incomplete migration of split GGUF model shards, and unauthorized model additions from HuggingFace cache causing resource misuse. These issues lead to errors, incomplete model loading, and cache confusion.
  • issues/20703, issues/21014, issues/21015, issues/21005
  • Server Stability and RPC Issues: Several reports describe server instability including segmentation faults on CUDA-enabled Linux, intermittent server freezes with hanging endpoints, rpc-server failures in multi-machine setups, and socket option bugs causing request misrouting. These problems disrupt server operation and multi-tenant usage.
  • issues/20631, issues/20921, issues/21006, issues/20963
  • Configuration and Environment Variable Problems: Issues include llama-server failing to start as a systemd service due to missing environment variables, Windows llama-server ignoring webui-config-file settings causing UI malfunctions, and Huggingface cache directory environment variables not isolating cache contents properly. These cause initialization errors and configuration inconsistencies.
  • issues/20952, issues/20871, issues/20994
  • Authentication and API Key Handling Bugs: Problems arise from conflicts in HTTP Authorization headers when using API keys with proxies, and the web UI CORS proxy failing to include stored API keys automatically, resulting in authentication errors and 401 responses. These issues hinder secure and seamless API access.
  • issues/21012, issues/21166
  • Crash and Memory Errors: Multiple crashes are reported including illegal memory access during prompt cache restore on multi-GPU ROCm setups, memory aperture violations on AMD GPUs with TurboQuant compression, CUDA crashes in ggml_top_k() with large tensors, and fatal errors triggered by kill switches during audio encoding. These cause instability and failures during model execution and audio processing.
  • issues/21140, issues/21096, issues/21162, issues/21104
  • Model Parsing and Tool Invocation Issues: There are parsing bugs in the Qwen3.5-27B model where multiple tool call XML blocks fail to parse correctly, and intermittent failures to emit closing tags during reasoning, causing malformed output and broken structured tool calls. Workarounds involve template changes and proxies to enable sequential tool calls.
  • issues/21158, issues/21118
  • Build and CI Failures: Issues include GitHub CI failing to create or attach build artifacts on Linux and unresolved symbol errors when building with IntelLLVM and SYCL on Linux, causing build and runtime failures. These hinder continuous integration and deployment.
  • issues/21061, issues/21041
  • Model Migration and Deletion Concerns: During model directory migration, outdated models are deleted without user confirmation, risking unwanted loss of local models. This behavior can cause data loss when models are no longer available on external servers.
  • issues/20986
  • Web UI and User Interface Bugs: The WebUI has bugs such as nested button elements causing cursor and event handling issues, and a regression causing the Voxtral-Mini-3B-2507 model to crash during audio encoding in the UI, affecting user experience and stability.
  • issues/20832, issues/21080
  • Research and Feature Exploration: One issue discusses research into the TurboQuant method, which uses advanced quantization algorithms to achieve smaller model sizes and faster speeds without accuracy loss, aiming to improve large language model efficiency and vector search.
  • issues/20979
  • Model Format Conversion and Compatibility: A problem was encountered converting the GLM-4-5-Air model to gguf format after qlora finetuning, resulting in missing tensor errors and server failures, resolved by adjusting configuration parameters to account for omitted layers.
  • issues/21172

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 59

Key Open Pull Requests

1. mtmd: Add DeepSeekOCR 2 Support: This pull request introduces support for DeepSeekOCR version 2 into the llama.cpp project, including implementation of the DeepSeek-OCR language model, vision model processing fixes, dynamic resolution preprocessing, and integration with the llama-cli, along with various improvements and bug fixes to enable effective OCR functionality.

  • URL: pull/20975
  • Associated Commits: 43a13, b6b9f, 85c7c, 578c8, 2aab5, eab28, 76305, 2de34, e8b26, 97e09, 13dc6, b32bb, 790bb, cec9a, 8b3d3, 1e081, 331ce, 6c071, a65dd, 63a04, 89afd, 88032, 1268d, 68b20, 8bce6, 5e6cf, 7e9fb, 0f558, 7b8d7, 86f11, effe6, f8f66, 3fcfc, ee8a1, 4cfa1, 3f711, a5949, 6dfda, 7941f, 206f8, 40e7e, 81533, 88109, a488b, ccb2f, 841a4, ed3b7, 55430, c5f4c, 95239, 6b0e7, 66341, c914e, e2085, 43dfc, b696c, b26b5, 7451b, 386ba, c7374, a661c, 0399d, c8917, 2dd99, fc3f6, 4d7d9, 5381b, 07613, d0c08, f5bd3, 6687b, 5f2ee, 1c886, d981f, 70539, 15f2a, 2d918, 5dfcc, 53273, 48c6c, 5174a, 01614, ed944, aaf2f, 33fab, d70f1, 4cbbe, 47f0f, e0e69, f95a6, f7736, fb3bb, 1b38c, 6c36c, dc206, 3fc61, 7f862, b3bf8, 8ad98, 4a4f8, 51c3d, 512b2, 00d23, 87e4a, f629d, 5a741, 616f0, e5d42, c739c, 9a05e, 4d917, ded92, a94c2, 6978c, 05789, 7e47a, 7ffa2, f41d3, 9b1a1, 52fcb, 0031b, 5f228, 7856e, 50c1e, 3e221, e037b, 0b61c, 7a53e, c2e67, 49f3c, 21243, 6d058

2. examples : add llama-eval: This pull request introduces a minimalistic yet feature-rich Python evaluation tool called llama-eval that supports multiple datasets (AIME, AIME2025, GSM8K, GPQA), various grader types (regex, LLM, custom), real-time result tracking with JSON state checkpointing, and outputs results both to stdout and HTML, enabling stop/resume functionality and parallelized evaluation against local or remote llama-server instances.

  • URL: pull/21152
  • Associated Commits: c05df, c2d83, 89cab, 88390, 07d5e, 23d4e, c87af, 5cc22, a8081, 5a1be, 9453f, 87f89, c2619, 04f68, 37b26, 62b04, a939f, e79e8, 812ae, fb148, 9695e, 8156d, fd907, 68dde, d2b10, 7751a, 1db84, e8a80, cffd2, 73e61, f762a, c6315, 99e3c, 52759, db10d, 350e7, de956, c6d70, ad3a5, e6e77, 60a50, 7b84a, 6c416, e2e99, 01396, 9c29b, 2ffa4, 7f049, c0c3e, a3405, 1c128

3. Add quantization recipes from custom recipe files: This pull request introduces a new system of quantization recipes using a custom, human-readable format to replace the existing llama_tensor_get_type_impl algorithm, aiming to simplify and modularize the addition and modification of quantization schemes while maintaining compatibility with legacy parameters and replicating existing functionality.

  • URL: pull/21070
  • Associated Commits: 99119, a3ff1, 363e6, 0a89c, 86103, 86273, 6e414, 2015d, 54474, 182cb, 506a4, 39482, aa8d5, d2586, 3fe55, 8ebfe, 4a2f6, 64d6c, d576a, 3adf3, 0b3cc, b85a7, c7aa7, 87be6, 9c00a, 04c82, d51eb, 072ac, 263bf, d7cee
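
The quantization-recipe pull request above replaces a hard-coded tensor-type function with a human-readable recipe file. The actual file syntax is not shown in this report, so the sketch below invents a minimal "glob pattern -> quant type" format purely to illustrate the pattern-matching idea; the tensor names follow common GGUF conventions but the rules themselves are hypothetical:

```python
import fnmatch

# Hypothetical recipe format: one "glob -> quant type" rule per line,
# first match wins; "*" provides the fallback. The real PR's syntax
# may differ -- this only illustrates the pattern-to-type idea.
RECIPE = """
output.weight       -> q6_k
blk.*.attn_*.weight -> q5_k
blk.*.ffn_*.weight  -> q4_k
*                   -> q8_0
"""

def parse_recipe(text):
    """Parse recipe lines into an ordered list of (pattern, qtype) rules."""
    rules = []
    for line in text.strip().splitlines():
        pattern, _, qtype = line.partition("->")
        rules.append((pattern.strip(), qtype.strip()))
    return rules

def pick_type(rules, tensor_name):
    """Return the quant type of the first rule whose glob matches."""
    for pattern, qtype in rules:
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype
    raise KeyError(tensor_name)

rules = parse_recipe(RECIPE)
assert pick_type(rules, "blk.0.attn_q.weight") == "q5_k"
assert pick_type(rules, "blk.12.ffn_up.weight") == "q4_k"
assert pick_type(rules, "token_embd.weight") == "q8_0"
```

First-match-wins ordering lets specific tensors (like the output head) override broad defaults, which is the modularity the PR description aims for.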

Other Open Pull Requests

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 133

Key Closed Pull Requests

1. server : fix speculative checkpoint bugs on hybrid models: This pull request addresses multiple speculative checkpoint bugs in the server implementation for hybrid models by refining checkpoint creation and restoration logic, fixing memory leaks in attention key-value caches, improving sampler state handling to prevent invalid outputs, adding diagnostic counters and regression tests for determinism and crash prevention, and disabling speculative checkpoints for standard KV caches to reduce memory usage. It was ultimately not merged because it was submitted to the wrong repository.

  • URL: pull/20925
  • Associated Commits: 47a1f, 95867, da872, 81038, 71e50, 21ae5, e6fa5, 951ac, 9245c, e62e7, 29d84, 0a797, a086c, e4cd6, 801bf, 50245, 61364, 177c1, dff7d, 4a855, 57d5b, 88cad, 4b836, 8b997, 33430, 4c740, 0f8e4, 6b38c

2. webui: Conversation forking + branching improvements: This pull request introduces an initial implementation of git-inspired versioning and tracking for chat completion conversations by adding conversation forking actions, multi-level forking logic and UI enhancements in the chat sidebar, as well as improvements to message editing and branching logic.

  • URL: pull/21021
  • Associated Commits: b3dee, a27d2, 34ab8, 9fedd, 17ffe, b9a69, 5d2be, 777ac, 51150, 4ead6, d4b7e, bd656, 0c815, dbe83, 54bc9, 2b7fd, 54737, 7e9a0, 0ab04

3. hexagon: general DMA and Binary Op fixes for large strides: This pull request addresses and fixes functional issues related to large strides in general DMA operations and binary operations on the Hexagon backend, enabling previously problematic models like gemma-3n-E4B and qwen3.5 to run correctly without producing errors or causing hardware exceptions.

  • URL: pull/20918
  • Associated Commits: 62dc9, a83f6, 9748f, 37b80, 73120, ea99c, eb37f, 5311a, debd3, a06b8, c1f03, 9d445

Other Closed Pull Requests

  • Backend support for built-in tools: This pull request adds backend support for built-in tools in the server, including handling the --tools launch option to enable these tools and returning a 404 response when tools are disabled. It also enforces permission dialogs for tools requiring write access and updates related documentation and UI behavior accordingly.
    • pull/20898
  • WebUI model status and UI fixes: This pull request adds a SLEEPING status indicator with an orange color to the WebUI model selector for the --sleep-idle-seconds router option and fixes the WebUI's handling of sleeping models to prevent "Failed to load model" errors. Additionally, it corrects the spelling of "favourite" to the US English "favorite" throughout the interface.
    • pull/20949
  • Hugging Face cache integration: This pull request adds standard Hugging Face cache support by using the Hugging Face API to locate all files and migrating manifests to the Hugging Face cache at startup. It also improves error handling, API error reporting, and fallback mechanisms for cached files.
    • pull/20775
  • Grammar sampler inhibition during reasoning: This pull request inhibits the lazy grammar sampler during active reasoning phases to prevent grammar constraints from interfering when tool call tags are present. It exposes the reasoning budget sampler state to the common sampler and adjusts token handling and sampler state updates accordingly, with test modifications reflecting these changes.
    • pull/20970
  • Model conversion support for RuGPT3XL: This pull request adds conversion support for the RuGPT3XL model by updating the Python conversion script to handle its custom architecture that concatenates separate query, key, and value projection weights into a single QKV tensor per layer. This enables compatibility with the existing GPT-2 inference framework in llama.cpp.
    • pull/21011
  • Voxtral Mini 4B streaming speech-to-text support: This pull request adds full support for the Voxtral Mini 4B realtime streaming speech-to-text model, including a new OpenAI-compatible /v1/audio/transcriptions API endpoint supporting multipart/form-data and JSON base64 audio input. It implements a dual-stream transcription protocol for low-latency decoding, provides tools for model conversion to GGUF, and includes CLI and server integration with testing and resource usage details.
    • pull/20638
  • MCP proxy header prefixing: This pull request addresses issue #21012 by modifying the server to wrap target headers with an X-Proxy-Header- prefix in the MCP proxy, preventing conflicts with llama.cpp-specific headers.
    • pull/21072
  • VGPR count parsing fix in gcn-cdna-vgpr-check: This pull request fixes parsing of VGPR counts by correctly associating amdclang Remark blocks with the appropriate functions using file and line number matching. This prevents misattribution caused by interleaved compile processes.
    • pull/20987
  • 3.5-bit KV cache quantization (TQ3_0): This pull request proposes a new 3.5-bit KV cache quantization type called TQ3_0 based on Google's TurboQuant algorithm, implementing CPU and Vulkan GPU support. It achieves approximately 4.6x compression over FP16 with minimal quality loss, enabling larger context sizes and more efficient VRAM usage for models like Qwen 3.5, integrating the method into GGML core, Vulkan shaders, and CLI options.
    • pull/21010
  • Refactor of clip_image_preprocess: This pull request refactors clip_image_preprocess by reorganizing it into dedicated classes inheriting from mtmd_image_preprocessor, separating image preprocessing logic from clip.cpp for improved modularity and clarity.
    • pull/21031
  • Bounded key-value cache implementation: This pull request introduces an optional bounded key-value cache that preserves evicted tokens by saving layer-0 residual embeddings to a host-memory ring buffer and recomputes key/value pairs on demand. This enables fixed-size caching with LRU eviction while maintaining output quality and stable memory usage during long multi-turn conversations.
    • pull/21097
  • Qwen3.5 model conversion fixes: This pull request fixes errors in converting Qwen3.5 and Qwen3.5 Moe models with NVFP4 precision by properly handling tensor name prefixes, skipping already repacked tensors, and applying correct reordering to linear attention weights. These changes enable successful creation of valid GGUF files for these models.
    • pull/20505
  • MOE GEMV kernel optimization: This pull request optimizes the MOE GEMV kernel for batch sizes greater than one by redesigning the kernel to reduce thread blocks and improve workload distribution. It introduces a new multi-token kernel using warp-level reductions without shared memory synchronization, simplifies the original kernel, removes the is_multi_token_id specialization, and achieves notable performance improvements across various GPUs without increasing compilation time.
    • pull/20905
  • Step3.5 Multi-Token Prediction (MTP) implementation: This pull request introduces a specialized end-to-end implementation of Step3.5 MTP in llama.cpp, including model conversion, loading, runtime graph execution, speculative decoding, server integration, and quantization compatibility. It enables early experimentation and benchmarking of MTP within the existing architecture without generalizing the MTP framework.
    • pull/20981
  • Bucket-mul feature for CPU matrix multiplication: This pull request introduces a new bucket-mul feature for CPU matrix multiplication with optional effort-based computation paths, dynamic CPU usage adjustments, new CLI flags, and documentation. These changes aim to improve model efficiency and memory compression.
    • pull/20990
  • WebUI chat message improvements: This pull request enhances the WebUI by improving initial and auto-scroll behavior of chat messages, simplifying Assistant Message rendering through a default agentic content component, and introducing lazy loading with fade-in transitions for message content blocks to optimize performance and visual experience.
    • pull/20999
  • WebUI HTML structure fix: This pull request addresses removal and replacement of illegal nested <button> elements in the WebUI to ensure valid HTML structure and improve code quality.
    • pull/21026
  • Socket options customization in server: This pull request adds the ability to customize socket options in the server, specifically allowing users to disable the default use of SO_REUSEPORT in cpp-httplib, providing more control over socket behavior.
    • pull/21056
  • Removal of verbose_prompt parameter: This pull request removes the verbose_prompt command line parameter from llama-server to discontinue printing the prompt, addressing issue #19653.
    • pull/21059
  • Native support for Qwen3-VL models: This pull request introduces native support for Qwen3-VL and Qwen3-VL-Embedding models in llama.cpp's multimodal infrastructure by aligning vision graph processing, preprocessing, and embedding frontends with the official Hugging Face PyTorch implementation to ensure architectural parity and prevent accuracy degradation.
    • pull/21103
  • Multi-backend profiler introduction: This pull request introduces a multi-backend profiler supporting CPU, BLAS, and CUDA for low-overhead, fine-grained profiling of operation executions including fused operations. It delegates event emission to each backend and includes a Python script to process profiling data into an interactive HTML timeline and stats table, though it requires disabling CUDA graphs and does not yet support parallel requests.
    • pull/21138
  • CI build system update to Ninja: This pull request updates the continuous integration build process by replacing make with the Ninja build system for certain configurations, improving portability and achieving approximately 1.7x faster build times in tested environments.
    • pull/20742
  • CANN components update and documentation: This pull request upgrades CANN-related Docker images from release candidate versions to the stable 8.5.0 release for improved reliability and optimization. It also revises CANN.md documentation to clarify device naming and add BF16 support information without affecting functional behavior outside the CANN environment.
    • pull/20801
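
The bounded key-value cache described above (pull/21097) pairs LRU eviction with on-demand recomputation from saved layer-0 residuals. The following Python sketch shows only that control flow, with a stand-in `kv_fn` in place of the real key/value projection and a plain dict in place of the host-memory ring buffer; none of these names come from the actual PR:

```python
from collections import OrderedDict

class BoundedKVCache:
    """Sketch of a fixed-size KV cache with LRU eviction. Evicted
    entries keep only a cheap residual; K/V are recomputed on demand."""

    def __init__(self, capacity, kv_fn):
        self.capacity = capacity
        self.kv_fn = kv_fn          # recomputes (K, V) from a residual
        self.cache = OrderedDict()  # token_pos -> (K, V), in LRU order
        self.residuals = {}         # token_pos -> saved residual ("ring buffer")

    def put(self, pos, residual):
        self.residuals[pos] = residual
        self.cache[pos] = self.kv_fn(residual)
        self.cache.move_to_end(pos)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop LRU entry; residual survives

    def get(self, pos):
        if pos not in self.cache:
            # cache miss: rebuild K/V from the stored residual
            self.put(pos, self.residuals[pos])
        self.cache.move_to_end(pos)
        return self.cache[pos]

kv_fn = lambda r: (r * 2, r * 3)  # stand-in for the real K/V projection
cache = BoundedKVCache(capacity=2, kv_fn=kv_fn)
for pos in range(4):
    cache.put(pos, residual=pos + 1)
assert list(cache.cache) == [2, 3]  # positions 0 and 1 were evicted
assert cache.get(0) == (2, 3)       # recomputed from saved residual 1
```

Keeping only residuals for evicted tokens is what gives the described design its fixed memory footprint at the cost of occasional recompute on access.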

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor     | Commits | Pull Requests | Issues | Comments
ggerganov       | 107     | 7             | 0      | 15
CISC            | 59      | 7             | 0      | 53
ngxson          | 78      | 9             | 1      | 23
allozaur        | 71      | 9             | 0      | 19
pwilkin         | 82      | 8             | 2      | 2
rodgerhubhay    | 84      | 0             | 0      | 0
angt            | 41      | 19            | 0      | 18
taronaeo        | 54      | 3             | 0      | 16
max-krasnyansky | 63      | 3             | 0      | 1
No author found | 57      | 0             | 0      | 0

Don't miss what's next. Subscribe to Weekly Project News: