Weekly GitHub Report for Llama.cpp: February 08, 2026 - February 15, 2026 (15:16:26)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, focuses on performance and stability, with improved processing speed and bug fixes addressing previously reported issues.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG-UNCONFIRMED] Eval bug: Qwen3-Coder-Next generates prematurely EOS instead of tool call(/continued response): This issue reports that the Qwen3-Coder-Next model prematurely generates an end-of-sequence token immediately after a colon instead of producing the expected tool call, causing the agent's turn to end early and requiring multiple "please continue" prompts to complete tasks. The problem appears linked to how trailing newlines after the colon in tool call preambles are handled or trimmed in llama.cpp, differing from other clients such as OpenRouter, and extensive debugging suggests that preserving these newlines prevents premature termination and enables correct tool call generation.
- The comments detail various user experiences confirming the issue across different backends and quantizations, comparisons showing the problem does not occur in other clients, and collaborative debugging efforts that identify newline trimming after colons as the root cause; a proposed quick fix adding two newlines to assistant messages successfully mitigates the problem, and subsequent updates to the autoparser branch reportedly resolve the issue without needing patches.
- Number of comments this week: 29
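The two-newline mitigation mentioned above can be applied client-side while waiting for upstream fixes. Below is a minimal sketch, assuming an OpenAI-compatible llama-server endpoint at a placeholder address; the helper name and message contents are illustrative, not taken from the issue.

```python
import requests

def pad_assistant_messages(messages):
    # Re-append the trailing newlines that get trimmed from assistant
    # tool-call preambles, per the workaround discussed in the issue.
    patched = []
    for m in messages:
        content = m.get("content")
        if m["role"] == "assistant" and isinstance(content, str) and not content.endswith("\n\n"):
            m = {**m, "content": content + "\n\n"}
        patched.append(m)
    return patched

history = [
    {"role": "user", "content": "List the files in the repo."},
    {"role": "assistant", "content": "I'll read the directory with the list_files tool:"},
    {"role": "user", "content": "Please continue."},
]
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder server address
    json={"messages": pad_assistant_messages(history)},
)
print(resp.json()["choices"][0]["message"])
```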
- Why can't I convert the model to the GGUF version?: This issue concerns a user unable to convert a model to the GGUF format due to an unrecognized BPE pre-tokenizer, which leads to multiple errors including missing tokenizer files and type errors related to Llama 3 conversion requirements. The user provides error logs and details about their environment, and the discussion explores potential causes such as differences in tokenizer configurations, the method of downloading model files, and discrepancies introduced by fine-tuning or modifications to the tokenizer.
- The comments clarify that the root cause is the unrecognized pre-tokenizer requiring updates to the conversion script, with requests for the tokenizer.json file to diagnose the issue. Users confirm the conversion works with the base model and the same transformers version, suggesting the problem arises from changes in the fine-tuned model’s tokenizer. Differences in tokenizer files are examined but found not to affect tokenization, and the conversation highlights the need to update the conversion tool to handle the modified pre-tokenizer.
- Number of comments this week: 15
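The "unrecognized BPE pre-tokenizer" error comes from a fingerprint check in convert_hf_to_gguf.py, which hashes the token IDs produced for a fixed probe string. The sketch below mirrors that idea to compare a base model against a fine-tune; the probe text and model paths are placeholders, so the resulting hash will not match the converter's registered values.

```python
from hashlib import sha256
from transformers import AutoTokenizer

def pretokenizer_fingerprint(model_path: str) -> str:
    # Hash the encoding of a fixed probe string, mirroring the chkhsh
    # approach used by convert_hf_to_gguf.py (probe text is illustrative).
    tok = AutoTokenizer.from_pretrained(model_path)
    probe = "Hello, world! \u00e9\u00e8 123 \U0001f600"
    return sha256(str(tok.encode(probe)).encode()).hexdigest()

# If these differ, the fine-tune changed the tokenizer, and the conversion
# script needs a matching pre-tokenizer entry before it can produce a GGUF.
print(pretokenizer_fingerprint("path/to/base-model"))       # placeholder path
print(pretokenizer_fingerprint("path/to/finetuned-model"))  # placeholder path
```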
- [BUG-UNCONFIRMED] Misc. bug: FA performance on macOS with Metal backend (Intel/AMD): This issue reports a performance regression with flash attention enabled on the Metal backend for Intel and AMD Macs, where token output speed becomes significantly slower despite prompt processing remaining unaffected. The user shares detailed benchmark results comparing Metal and Vulkan backends, noting that Metal without flash attention is faster, but enabling flash attention causes a severe slowdown on Metal for certain models and hardware configurations.
- The comments discuss whether the Metal backend works on Intel Macs and with RPC, confirm that Metal does function with recent fixes, and provide a patch adding a scalar flash attention fallback for GPUs lacking certain SIMD features. Users report mixed success with eGPU setups and RPC, with suggestions to prefer Vulkan for multi-GPU and eGPU support due to Metal's limitations in device enumeration and concurrency issues. Multiple fixes addressing concurrency, kernel bugs, and fallback implementations are shared, improving Metal flash attention performance on AMD and Intel GPUs, with detailed testing and benchmarking results provided.
- Number of comments this week: 14
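A regression like this is straightforward to quantify with llama-bench by toggling flash attention per run. A minimal sketch, assuming a llama-bench binary built with the Metal backend and a placeholder model path:

```python
import subprocess

MODEL = "models/model-q4_k_m.gguf"  # placeholder path

# Run the same prompt-processing and generation workload with flash
# attention disabled (0) and enabled (1) to expose the regression.
for fa in ("0", "1"):
    print(f"--- flash attention = {fa} ---")
    subprocess.run(
        ["./llama-bench", "-m", MODEL, "-p", "512", "-n", "128", "-fa", fa],
        check=True,
    )
```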
- [PERFORMANCE] Qwen3-Coder-Next (Qwen3-Next-80B) CPU inference ~5x slower than expected — consumer hardware benchmarks: This issue reports that the Qwen3-Coder-Next (Qwen3-Next-80B) model runs significantly slower on CPU inference than expected based on its active parameter count and memory bandwidth, achieving only about 7.7 tokens per second compared to the anticipated 20-30+ tokens per second on consumer hardware. The problem appears to stem from the llama.cpp backend not efficiently leveraging the sparse activation pattern of the MoE architecture, causing it to read more data than necessary and resulting in a 3-5x performance degradation compared to similar models and GPU runs.
- Comments discuss related GPU performance improvements, ongoing CPU optimization efforts including microkernel and delta-net support, and practical workarounds such as offloading layers to GPU to improve throughput; users share benchmark experiences confirming the issue and highlight potential fixes and experimental settings that can mitigate the slowdown.
- Number of comments this week: 10
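One of the workarounds reported in the comments is partial offload: keep the large MoE expert tensors in system RAM while the rest of the model runs on the GPU. A sketch, assuming the commonly shared `--override-tensor` pattern for expert weights (the regex and model path are illustrative and may need adjusting per model):

```python
import subprocess

subprocess.run([
    "./llama-server",
    "-m", "models/qwen3-coder-next-q4.gguf",  # placeholder path
    "-ngl", "99",                  # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU"  # ...but pin expert FFN tensors to CPU
])
```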
- [BUG] [CHAT PARSER] Eval bug: Coredump when using Kilo Code: This issue reports a coredump occurring when using Kilo Code in "Orchestrator" mode with the Qwen3-Next-Coder model, specifically after processing all tokens of the first request. The problem appears related to grammar parsing errors and is reproducible on Linux with CUDA backend on NVIDIA RTX A5000 hardware, with attempts to resolve it involving different branches and forks of the llama.cpp project.
- The comments discuss that the crash is not model-specific and occurs with different forks and branches, including an autoparser branch suggested as a potential fix, but the issue persists. Developers note differences in failure timing between branches and suspect template or parser generation problems, with ongoing investigation and requests for exact templates and models used.
- Number of comments this week: 8
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 42
Summarized Issues:
- Qwen3-Coder-Next Model Stability and Performance Issues: Multiple issues report crashes, segmentation faults, and unexpected behaviors when using the Qwen3-Coder-Next model in various environments including llama-server, CUDA, ROCm, and Windows builds. Problems include crashes during consecutive tool calls, premature end-of-sequence token generation, and slow CPU inference speed, indicating significant stability and performance challenges with this model.
- [issues/19430, issues/19480, issues/19485, issues/19513, issues/19518, issues/19579, issues/19637]
- GPU Backend Compatibility and Performance Problems: Several issues highlight bugs and performance regressions across different GPU backends such as Metal, Vulkan, CUDA, ROCm, and SYCL on various hardware including AMD, NVIDIA, Intel, and Apple Silicon. These include slow token output with flash attention on Metal, crashes due to workgroup dimension limits on Vulkan, CUDA invalid configuration errors, ROCm hangs on large models, and SYCL fallback to CPU on Intel GPUs.
- [issues/19431, issues/19471, issues/19482, issues/19487, issues/19543, issues/19563, issues/19570, issues/19601]
- Memory Management and Resource Exhaustion: Issues report increased RAM usage with mmap repack on ARM, CUDA out of memory errors after prolonged use, and continuous model loading without eviction in router mode causing RAM overload. These problems suggest inefficiencies and potential leaks in memory handling during model loading and inference.
- [issues/19578, issues/19627, issues/19639]
- Build and Installation Failures: Multiple issues describe build failures and installation problems including linker errors on Windows Snapdragon arm64 builds, missing shared libraries on macOS due to incorrect RPATH, and CMake install target errors related to KleidiAI. These hinder successful compilation and deployment across platforms.
- [issues/19444, issues/19501, issues/19618]
- Feature Requests for UI and Model Interaction Enhancements: Users request new features such as verbose debug info display in the web UI, preservation of reasoning content during chats, runtime expert count overrides for MoE models, and automatic GPU grouping for optimized model splitting. These aim to improve usability and model management.
- [issues/19446, issues/19449, issues/19528, issues/19607]
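For the expert-count request specifically, a rough approximation already exists via the generic `--override-kv` flag, which rewrites GGUF metadata at load time. A sketch, assuming a model whose architecture exposes the usual `<arch>.expert_used_count` key (the key name and model path are placeholders):

```python
import subprocess

# Load the model with fewer active experts than its metadata specifies.
subprocess.run([
    "./llama-server",
    "-m", "models/moe-model.gguf",                        # placeholder path
    "--override-kv", "qwen3moe.expert_used_count=int:4",  # assumed key name
])
```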
- Tokenization and Serialization Research and Bugs: There are proposals and bugs related to token input/output efficiency and token escaping, including research into binary token systems bypassing BPE, UTF-8 to UTF-32 GPU conversion, and double escaping in tool call templates causing prompt errors.
- [issues/19458, issues/19459, issues/19520]
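The double-escaping bug class mentioned above is easy to reproduce: serializing a string that is already JSON-encoded escapes the escapes, so a `\n` arrives in the rendered prompt as a literal `\\n`. A minimal illustration with placeholder tool-call arguments:

```python
import json

tool_args = {"path": "src/main.cpp", "content": "int main() {\n}"}

once = json.dumps(tool_args)  # correct: the newline is escaped exactly once
twice = json.dumps(once)      # bug: the second pass escapes the escapes

print(once)   # {"path": "src/main.cpp", "content": "int main() {\n}"}
print(twice)  # "{\"path\": \"src/main.cpp\", \"content\": \"int main() {\\n}\"}"
```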
- Model Loading and Conversion Issues: Problems include failure to load LoRA adapters due to CUDA assertion errors, inability to convert models to GGUF format due to unrecognized tokenizers, and failures loading models on specific platforms or backends. These issues affect model compatibility and deployment.
- [issues/19436, issues/19626, issues/19579]
- API and Endpoint Limitations: Issues report failures saving key-value caches for vision-enabled models and problems embedding base64-encoded images alongside text, indicating incomplete or unsupported API features for multimodal data handling.
- [issues/19466, issues/19525]
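For context, the request shape at issue is the OpenAI-style mixed content list, where a base64 data URI sits alongside text parts. A sketch, assuming a multimodal-capable llama-server at a placeholder address:

```python
import base64
import requests

with open("cat.png", "rb") as f:  # placeholder image file
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
})
print(resp.json())
```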
- Backend-Specific Crashes and Kernel Bugs: Several issues describe crashes and kernel bugs such as illegal memory access on CUDA, segmentation faults on ROCm, infinite loops in specific models, and kernel dispatch problems causing infinite output loops or assertion failures.
- [issues/19522, issues/19585, issues/19641]
- Hardware-Specific Backend Failures: Problems include Snapdragon Windows arm64 Hexagon backend session failures, Android Hexagon backend device creation errors due to unsigned libraries, and AMD PCI ID retrieval failures on Windows, highlighting hardware and driver integration challenges.
- [issues/19617, issues/19634, issues/19570]
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 15
Summarized Issues:
- GPU Compatibility and Driver Issues: Several issues report crashes and errors related to GPU hardware and drivers, particularly on AMD GCN 1.0 GPUs and ROCm environments. These include device lost errors, invalid device function errors, and Vulkan backend crashes that require specific workarounds or driver flags to stabilize the system.
- [issues/19442, issues/19610, issues/19615, issues/19630]
- Memory Management and VRAM Usage: Problems with GPU memory allocation and VRAM usage are noted, including out-of-memory errors due to overallocation on CUDA devices and disproportionate VRAM consumption for KV caches in certain models. These issues affect model loading and runtime performance on multi-GPU systems.
- [issues/19437, issues/19552]
- Model Support and Compatibility: Requests and regressions related to model support include adding new model support with enhanced features and regressions causing context shifting functions to fail, impacting text generation coherence. These affect compatibility and functionality with specific LLM models.
- [issues/19127, issues/19292]
- Performance and Caching Bugs: Performance degradation and caching inconsistencies are reported, such as slowdowns in token generation due to unified KV cache handling and discrepancies in prompt caching behavior between different API compatibilities. These impact response speed and efficiency.
- [issues/19464, issues/19494, issues/19523]
- Build and Compilation Errors: Compilation failures occur due to ambiguous template specializations in ROCm headers, causing build errors on specific Ubuntu versions. This prevents successful building of the project in certain environments.
- [issues/19269]
- Quantization and Model Repacking Bugs: A bug introduced by a recent pull request causes stack smashing and crashes when repacking Q4_K quantized models, affecting specific model architectures and requiring disabling repack support as a workaround.
- [issues/19561]
- Crash and Stability Issues with Speculative Decoding and Web UI: Crashes occur with speculative decoding enabled in the CUDA backend and repeated incorrect "thinking" behavior in the Web UI linked to chat template configurations, indicating stability and UI integration problems.
- [issues/19613, issues/19629]
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 50
Key Open Pull Requests
1. [WIP] refactor llama-quant.cpp: This pull request is a work-in-progress mild-to-moderate refactor of the src/llama-quant.cpp file, including improvements such as adding a dry-run option to the quantization tool, refining tensor dimension formatting, fixing logic and style issues, enhancing error messages, and simplifying tensor-related functions.
- URL: pull/19616
- Associated Commits: 844ad, e6b79, 0d222, 56c27, c3f42, b9b32, 150e1, 966b2, 07f88, 2769f, ea8da, 3211a, 55dbe, 22db7, ae786, 1ccd7, 16582, b15bb, 40528, 44f9f, f58de, 75ab2, 0301b, 5d6c9, 67e25, 1f25c, 6734e, d6486, fd378, 053a2, 97aef, bddc6, 7b127, a3bf0, f14fd
2. Vulkan Scalar Flash Attention Refactor: This pull request refactors the Vulkan scalar Flash Attention implementation by adding proper float16 support, optimizing core computations with row splitting and shared memory staging, introducing a three-tier row size system, applying vendor-specific performance tuning for AMD and Intel GPUs, enhancing split_k logic for broader workload compatibility, improving device compatibility with conditional FP32 shader variants, and refining shared memory management and shader compilation to achieve significant performance improvements across various hardware configurations.
- URL: pull/19625
- Associated Commits: 015d7, e7a75, 828b7, f92d7, 9b309, c0f41, e3bba, 07afb, 3c208, ca5ec, 50a42, b7b67, 8236c, d8d53, 8fbd3, b626e, 4819f, 356f1, de6db, d6a00, 9f9b7, 3ed91, 28a3c, 3946e, cd54b, 16cb9, 3ae54, 9f9a8, dd92b, 0b4b0, 02ccf, 32d50
3. common : fix filename validation for subfolders: This pull request fixes the filename validation logic in the fs_validate_filename function to correctly handle subfolders when the allow_subdirs option is enabled, adds new Windows path traversal character mappings to the blocked Unicode list, and includes AI-generated and additional test cases to ensure robust validation.
- URL: pull/19472
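The actual fix lives in C++ in common/, but the validation idea generalizes; the Python sketch below illustrates subdirectory-aware filename checking that rejects traversal components and separator look-alikes. The blocked-character set is illustrative, not the PR's exact list.

```python
def validate_filename(name: str, allow_subdirs: bool = False) -> bool:
    # Reject empty names, absolute paths, and Windows/Unicode separator
    # look-alikes (illustrative subset, not the PR's exact blocklist).
    if not name or name.startswith("/"):
        return False
    if any(ch in name for ch in ("\\", "\u2215", "\u29f8")):
        return False
    parts = name.split("/")
    if not allow_subdirs and len(parts) > 1:
        return False
    # Each path component must be a real name, not "", ".", or "..".
    return all(p not in ("", ".", "..") for p in parts)

assert validate_filename("model.gguf")
assert validate_filename("subdir/model.gguf", allow_subdirs=True)
assert not validate_filename("subdir/model.gguf")             # subdirs disabled
assert not validate_filename("../etc/passwd", allow_subdirs=True)
```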
Other Open Pull Requests
- Qwen Delta Net Implementations: Multiple pull requests focus on improving and extending the delta net implementations for Qwen models. These include adding a CPU implementation of the GATED_DELTA_NET operation and introducing a common base function to deduplicate and streamline delta-net graph builds, with plans to support chunked implementations and additional Qwen variants.
- Model Family Support Additions: Several pull requests add support for new model families, including ERNIE 4.5 with its multimodal MoE architecture, CohereLabs/tiny-aya with a custom BPE pre-tokenizer, and JAIS-2 Arabic-English bilingual models with specific architectural features and tokenizer compatibility. These enhancements expand the range of models supported by the project with tailored implementations.
- Performance and Optimization Enhancements: Several pull requests improve performance through various optimizations, such as WebAssembly SIMD relaxed dot product instructions, CUDA dequantization improvements yielding 5-10% FLOPS gains, and an RDNA4-specific MMVQ parameter table that boosts decode performance by over 10%. These changes target both CPU and GPU execution efficiency.
- pull/19590, pull/19624, pull/19455, pull/19572
- Shader and Vulkan Improvements: Pull requests reorganize WebGPU shader code into a centralized JIT-compatible library and address Vulkan dispatch limitations by splitting operations to respect workgroup count limits. Additionally, Vulkan SDK constants replace hardcoded strings to improve maintainability and correctness.
- Speculative Decoding and Checkpointing: One pull request implements speculative checkpointing to enable efficient speculative decoding with recurrent modules, improving performance in repetitive tasks. It also includes fixes and raises questions about batch execution and checkpoint function refactoring.
- Compute Precision Configuration: A pull request introduces the ability to specify intermediate compute precision types in model configurations, supporting formats like fp32 and laying groundwork for future benchmarks with fp16 and bf16. This enables more flexible precision management for model computations.
- API and Tooling Enhancements: Enhancements include adding a `--dry-run` option to the quantization tool for size estimation without actual quantization, exposing `ggml_is_view` as a public API to improve backend compatibility, and adding a new standalone quantization benchmarking tool to profile matrix multiplication kernels. These improve usability and diagnostics.
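If the option lands as described, usage would presumably look like a normal llama-quantize invocation with the extra flag; a sketch, assuming the `--dry-run` spelling from the PR and placeholder paths (the PR is still WIP, so the final syntax may differ):

```python
import subprocess

# Estimate the quantized output size without writing the output file.
subprocess.run([
    "./llama-quantize", "--dry-run",
    "models/model-f16.gguf",     # placeholder input
    "models/model-q4_k_m.gguf",  # placeholder output (not written in dry-run)
    "Q4_K_M",
], check=True)
```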
- Server and API Improvements: Updates to the server include adding prompt caching usage metrics compatible with Anthropic API, fixing GLM 4.5 streaming tool-call parsing and grammar error handling to prevent server hangs, and improving error catching to return proper HTTP responses without aborting.
- Build and Dependency Updates: Pull requests update Python dependencies to require NumPy 2.0+ for better compatibility with newer Python versions and add a ROCm build target for multiple gfx architectures while excluding problematic CDNA targets. These changes improve build robustness and platform support.
- Miscellaneous Features and Fixes: Other contributions include adding Jinja "indent" string filter support without the `first` flag, introducing a new group split-mode for optimized multi-GPU usage, adding OS support checks for x86 SIMD architectures, and a work-in-progress Android/Java build fork providing an app template.
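The `indent` filter's behavior is easy to preview with the reference Python implementation of Jinja2; by default the first line is left unindented, which matches the "without the `first` flag" note above:

```python
from jinja2 import Template

# Indent every line of the value by 4 spaces, except the first line.
out = Template("{{ text | indent(4) }}").render(text="line1\nline2")
print(out)
# line1
#     line2
```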
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 111
Key Closed Pull Requests
1. Mcp mvp: This pull request introduces the MVP (Minimum Viable Product) implementation of the MCP feature, including architectural refactoring, UI improvements for MCP server selection and prompt handling, enhanced attachment and resource management, and various fixes and optimizations to support seamless integration and user experience within the web UI.
- URL: pull/19546
- Associated Commits: 5046a, 22204, f8cf5, aef65, 97e69, 93f57, ea58b, 892f4, ea99f, 33edd, 147cc, 1e6f1, b5de4, bda6d, 3edfa, 740a4, 9f16a, 5cdee, 0e005, 73fcc, 5ec2a, 112ea, 4a716, 63da3, d9671, 94bf0, aa0c8, 43d03, 905d6, 26321, b333b, a3aba, 5cbba, b6455, 90fb5, 89e1e, f1214, b454a, 83307, ab82c, 23398, 07982, f8ebb, c3b73, ddf8f, 57a6c, e2153, 7a12b, cc95a, 7a64d, 57417, bd272, 62082, 9e3a9, 0c26d, 8ad59, 2f97f, d5055, 53617, a259a, 7a529, 018ed, 4048a, 15337, 84c00, e55ce, 63052, 41273, 0d0ce, 871a0, d7a83, 1b104, f82c0, b7140, 24f15, aadf2, 530f1, 08a5d, d6244, 0b54d, 4f870, 45d54, 9f22d, 59956, 78628, 190ee, c7496, bcd5b, 08410, d8a7a, 2556f, 5c831, 19350, 4dc48, aa69e, 138c0, e3189, 7f929, 431ed, d5723, 71ae8, 4f2d9, 9587b, 9c01f, 3ca6e, c925d, c0387, e58db, 23f70, c0a1e, d875b, 00c28, ecb66, 0f6dc, 06e28, ee1a3, 42b42, 5fff2, e1a08, d4c5d, 7d956, 6aa12, a4394, 1e3cd, 168c2, f88e4, c1609, 190a6, ec477, 15ab8, 42673, b13a3, 87c8b, 36b1d, 79b74, 2d765, 77e67, d3917, dc352, c3d63, a780a, b87e0, 3663f, 00574, aa448, 1859b, ea1cf, 708a0, 4a0d6, 60f4a, a81f8, dbbfe, d95ef, b2d55, 8dbbb, 046d7, e29a9, f7cda, 8ea5f, 6192d, 76177, afc14, da788, 45eaf, 925aa, 18c8f, b2c41, 00036, 1833a, daed2, 97540, ba662, 75242, 51c3e, e0a78, dae1d, bbdfe, 2678f, fd504, a85c9, 467fe, c7f39, 5b0ec, 29a5a, eba8a, 29558, dffad, a174c, 9101c, 473a4, e6512, 6d107, 0aad1, 54253, efb0f, c9ee8, bec6a, b68b8, 52c3e, f217f, 5c4c5, c5786, b3484, 22c8a, b4d04, 67e42, 051e1, 156b1, f7e27, 65d73, 7985f, 0102b, 887a1, 3a101, b7d67, f580e, 8a914, 530a3, 97af7, 2414d, d3cef, 3e75f, 7f2e6, 61f92, 7ed7d, a3f92, cd9ab, 86375, 8cb87, 63403, 3755e, 113f3, 328a7, bcf70, 3e161, 42abd, b28e7, bc8df, b4081, 7a0a1, d131d, 218fd, b1986, 4960a, 461b6, 582d2, 6d58a, 79e28, 3d89e, 891ce
2. Kimi Linear fix conv state update: This pull request fixes the incorrect convolution state update in the Kimi Linear model implementation that caused state corruption during parallel execution in llama-server, and also introduces a block implementation that improves performance by approximately 20% and reduces VRAM usage.
- URL: pull/19531
- Associated Commits: 27baa, 84f82, 57cca, 6167f, 26a65, bf42b, d73d3, e3080, 13954, 83d32, 772ca, 9f126, a0269, ef5bc, ae977, f9a11, 77629, f67a4, f85e5, 8bd61, a4020, 66c0c, aba18, cfed1, e3542, 67bee, 30d88, 40f61, 1099c, f9991, 6977d, 6150b, d26fe, dce06, 426a8, b9360, 5f2b8, 10be7, 6ae66, 93afb, 59182, 58d1e, 4f6ef, 719d3, ac85c, 4faf2, 22bc5, 217e7, 6ba78, fe9d2, 18ae7, 28829, c163d, 0aea1, f3d11, c26c1, e87ac, 02987, e55ca, 56019, a8147, ae8d7, 38c6f, 92f49, 7fb54, bb02b, f1525, 0de46, 0444a, a6b2c, 62162, 005c3, aaf05, 2a62d, 2c8cd, 11282, 4bb42, 07f99, efaea, 000fd, 8ec5b, 82215, a8210, 64563, 17cd6, 97f22, cc16e, 06f07, 906ab, 3dfeb, 19cf7, 63a15, b2d02, 62862, a4678
3. 合入最新的官方代码 ("Merge the latest official code"): This pull request attempts to merge the latest official code updates into the project, including multiple commits related to creating and updating Dockerfiles and workflow configurations, but it was ultimately not merged.
- URL: pull/19473
- Associated Commits: 3dd11, 69c72, ec7cc, 02acc, 15614, 52051, 6a75d, a8876, bb58f, cdb67, 85436, 8cd81, 9c940, 87995, 03449, 77fc0, 7de81, 56419, 35530
Other Closed Pull Requests
- Qwen 3.5 Model Support: Multiple pull requests add and improve support for the Qwen 3.5 series models, including dense and Mixture of Experts (MoE) variants without vision capabilities. These PRs cover model conversion based on Hugging Face Transformers, refactoring and optimization of conversion scripts, and fixes for special conversion cases, with thorough testing demonstrating excellent model performance.
- Hexagon Backend Enhancements: Several pull requests optimize and extend the Hexagon backend by implementing in-place float32 to float16 conversion, adding new fused multiply-add HVX intrinsics, supporting new operations like ARGSORT and GEGLU, and introducing parallel 2x2 dot product computations. These improvements significantly boost performance on Gen5 devices and achieve over 10x speedups in prompt processing for various models.
- Web UI Improvements and Fixes: Multiple pull requests address UI and routing fixes, add new features like system message injection and raw LLM output toggling, and reorganize code for better architecture and user experience. These changes include scroll and redirect fixes, message editing enhancements, and UI/UX polish as part of efforts to offload non-MCP-related code.
- Memory Management and Resource Cleanup: A pull request fixes memory leaks in the WebGPU implementation by converting heap-allocated resources to smart pointers, explicitly destroying buffers and buffer pools on shutdown, and implementing destructors for key structs. Another PR fixes Windows handle lifetime issues in mmap by ensuring proper cleanup order and handle validity.
- OpenCL Backend and Matrix Operations: Pull requests introduce basic Q4_1 quantization support in OpenCL to enable GPU execution of models with Q4_1 weights and add general implementations of Q6_K matrix multiplication and Q4_K matrix-vector multiplication. Additional ARM64-specific repack implementations improve performance on Apple M4 Max and Raspberry Pi 5 devices.
- Model Support for Kimi-K2.5: One pull request adds support for the Kimi-K2.5 model, including handling compressed INT4 tensors by relocating dequantization, integrating new vision tower keys, updating configuration parsing, and introducing initial vision capabilities while excluding some experimental quantization changes.
- Broadcast and Permutation Enhancements: A pull request extends binary broadcast functionality to support permuted `src1` tensors by removing CPU asserts, updating CUDA kernels, and adding tests to validate permutation support.
- Documentation for GGML-VirtGPU Backend: One pull request adds comprehensive Markdown documentation for the newly introduced GGML-VirtGPU backend, using AI-assisted content generation refined by the author to improve accuracy and reflect recent changes.
- Server Workflows on Metal Backend: A pull request adds server workflows running on the Metal backend by utilizing Metal virtual devices to simulate multi-GPU workflows, enhancing continuous integration testing for the project.
- Speculative Decoding Statistics Improvements: One pull request removes the `--spec-ngram-check-rate` configuration parameter, renames related statistics variables, and adds new counters `n_call_begin` and `n_call_accept` to improve speculative-decoding statistics tracking.
- CUDA Backend Fixes: A pull request ports a fix from the CPU implementation to the CUDA backend to correct broken non-contiguous rope logic and improve variable naming consistency, addressing issues revealed by recent tests.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| allozaur | 217 | 6 | 0 | 0 |
| CISC | 67 | 3 | 0 | 101 |
| ggerganov | 116 | 18 | 0 | 13 |
| ngxson | 60 | 5 | 2 | 38 |
| pwilkin | 61 | 5 | 0 | 24 |
| am17an | 50 | 4 | 0 | 12 |
| 0cc4m | 59 | 1 | 0 | 6 |
| ServeurpersoCom | 54 | 0 | 0 | 0 |
| ddh0 | 50 | 2 | 0 | 0 |
| ymcki | 47 | 2 | 1 | 2 |