Weekly GitHub Report for Llama.cpp: February 15, 2026 - February 22, 2026 (14:49:42)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates that enhance overall performance and stability, reflecting a continued focus on optimizing user experience and system reliability. Notable highlights include improved processing speed and bug fixes addressing previous issues.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG] [CUDA] Eval bug: [CUDA, cuBLAS] Corrupted output on CUBLAS with moe models like Nemotron-3-nano and gpt-oss-120b with long context preprocessing: This issue reports a bug causing corrupted output when running mixture-of-experts (MoE) models like Nemotron-3-nano and gpt-oss-120b on CUDA devices with long context preprocessing, where memory corruption leads to repeated token spamming and assertion failures related to cuBLAS operations. The problem appears linked to numerical overflows and NaNs in FP16 cuBLAS matrix multiplications on certain GPUs, with partial workarounds involving build flags and environment variables, but the root cause in the cuBLAS MoE pipeline remains under investigation.
- Comments discuss the nature of the corruption as NaN/Inf overflows in CUDA operations, attempts to debug with NaN checks, partial fixes using FP32 computation and disabling fusion, inconsistent reproduction depending on input distribution, and ongoing efforts to isolate the problematic commit and find a definitive solution.
- Number of comments this week: 30
- [BUG-UNCONFIRMED] Eval bug: qwen35moe always forces a full prompt reprocess after each message, 'failed to truncate': This issue reports a bug with the Qwen3.5moe model where every generation request forces a full prompt reprocessing due to a failure to truncate tokens, causing inefficient handling of the prompt cache and repeated full evaluations. The problem appears related to the model's hybrid multi-modal nature and the server's current logic requiring at least one token to be processed before generation, which leads to clearing the memory when truncation fails, and users have found temporary workarounds by disabling multi-modal features or increasing context size.
- Commenters confirmed the issue occurs consistently across different setups and shared detailed logs illustrating the failure to truncate tokens and forced full prompt reprocessing; discussion identified the root cause as the model’s hybrid multi-modal architecture and server logic limitations, with suggestions including disabling multi-modal support, improving checkpointing logic, and a user-contributed patch that mitigates the problem, while also noting related issues with prompt cache handling and timeouts in certain frontends.
- Number of comments this week: 21
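The inefficiency described above comes down to prefix reuse: a server only needs to re-evaluate the tokens that differ from its cached prompt, and it is when that truncation step fails that the entire cache gets cleared and the whole prompt is reprocessed. A minimal sketch of that logic (hypothetical function names; not llama.cpp's actual implementation):

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading tokens shared by the cached and incoming prompts."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def tokens_to_reprocess(cached: list[int], new: list[int]) -> list[int]:
    """Only the suffix after the shared prefix needs evaluation; with no
    shared prefix (the 'failed to truncate' case), the whole prompt is redone."""
    keep = common_prefix_len(cached, new)
    return new[keep:]

# Reusing a 3-token cached prefix leaves only 2 tokens to evaluate.
assert tokens_to_reprocess([1, 2, 3, 4], [1, 2, 3, 9, 10]) == [9, 10]
# No shared prefix -> full reprocess, the behavior this issue reports on every turn.
assert tokens_to_reprocess([7, 8], [1, 2, 3]) == [1, 2, 3]
```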
- [BUG] [BUILD] [CUDA] CUDA compilation fails on Blackwell (sm_120) with MXFP4: "Instruction 'mma with block scale' not supported": This issue reports a failure to compile CUDA code for the NVIDIA RTX PRO 6000 Blackwell GPU (compute capability 12.0) due to unsupported MXFP4 instructions on the sm_120 target, causing hundreds of PTX assembly errors during the build of llama.cpp with CUDA support. The user requests clarification on whether MXFP4 support for Blackwell is planned, if it should work with CUDA 13.1, or if a newer toolkit is required, noting that disabling MXFP4 compilation is currently not possible and that rolling back to a pre-MXFP4 commit is the only workaround.
- Commenters confirm that sm_120 is expected to fail because MXFP4 instructions are only supported on sm_120f or 120a-real, and CMake automatically replaces 120 with 120a-real; issues often arise from outdated or mismatched NVIDIA drivers causing CUDA devices not to be detected properly. Users report similar errors on different platforms, and the problem is resolved for some by updating GPU drivers to versions compatible with CUDA 13.1, enabling proper device recognition and successful compilation.
- Number of comments this week: 11
- [BUG-UNCONFIRMED] Misc. bug: Llama Cpp - Model - Chat Template interactions = Mess (Devstral 2): This issue describes difficulties encountered when using the llama-cpp project with a quantized and fine-tuned version of the Devstral-Small-2 model, specifically related to the chat template's handling of thinking output tags. The user highlights that the default chat template enforces a strict user-assistant interaction pattern that is incompatible with their use case, and that the thinking output tags [THINK] and [/THINK] do not work as expected without patching the code, which is problematic given the diversity of open weight models and their varying token sets.
- The comments discuss testing a related pull request that might address the issue, debate whether Devstral 2 is a reasoning model based on its tokenizer and training, and emphasize that hardcoded heuristics in chat template parsing are limiting; suggestions include forking to create a dedicated parser or using middleware, with the autoparser branch being tested as a promising solution.
- Number of comments this week: 8
- b8070: qwen35moe long-prompt crash (libcuda segfault) with --op-offload on, multi-GPU: This issue reports a crash occurring in the llama.cpp project when processing very long prompts (around 20,000 tokens) using the Qwen35MoE model with multi-GPU and the --op-offload option enabled, resulting in a libcuda segmentation fault. The problem is partially improved compared to a previous version where an assertion failed, but long prompt handling still leads to server crashes or connection drops, with related token truncation failures causing memory clearing and prompt reprocessing issues.
- The comments provide detailed logs confirming the assertion failure in the earlier version and the segfault in the current one, describe reproducibility steps including incremental prompt length testing, and note that while some users do not experience crashes, they do observe truncation failures causing KV cache clearing and full prompt reprocessing, indicating a persistent edge case in long prompt handling with multi-GPU and op-offload enabled.
- Number of comments this week: 6
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 31
Summarized Issues:
- Memory management and allocation issues: Several issues report crashes or failures due to improper handling of memory allocation, including the llama_params_fit function not accounting for system memory on CPU-only builds, the fit-params feature ignoring additional VRAM needs for multimodal models, and out-of-memory errors when loading very large models despite sufficient reported GPU memory. These problems cause crashes, allocation failures, or require manual adjustments to avoid failures.
- issues/19646, issues/19678, issues/19764
- CUDA and GPU backend crashes and errors: Multiple issues describe crashes, assertion failures, or compilation errors related to CUDA and GPU backends, including memory corruption with mixture-of-experts models on CUDA, unsupported instructions on NVIDIA Blackwell GPUs, segmentation faults with multi-GPU setups, CUDA assertion failures during inference, and crashes with flash attention enabled on specific architectures. These errors disrupt GPU inference and require workarounds or updated support.
- issues/19659, issues/19662, issues/19676, issues/19705, issues/19724
- Prompt cache and token truncation bugs: Several issues highlight bugs where prompt caches fail to truncate tokens properly, causing full prompt reprocessing on every message generation or invalidated context checkpoints, which leads to inefficiencies and repeated memory clearing during inference. These bugs affect models like Qwen3.5moe and Qwen3-Coder-Next (Hybrid), degrading performance.
- issues/19690, issues/19794
- Logging and verbose output malfunctions: There are reports that verbose logging options such as --verbose-prompt, --log-verbosity, and --log-timestamps do not work as documented when used with the llama server, preventing users from obtaining detailed logs as expected.
- issues/19653
- Web UI usability and functionality regressions: Issues include the Web UI failing to create attachments when pasting content, causing pasted data to be plain text only, and the default scrollable containers for assistant code blocks hiding output content, which reduces usability and requires manual scrolling or feature requests for improvement.
- issues/19741, issues/19742
- Server signal handling and process management problems: Problems are reported with the server not properly handling stop signals from extensions, causing continued model execution, improper thread change handling leading to mixed contexts, and failure to close reverse ports after SSE streaming sessions, resulting in ports stuck in TIME_WAIT state and delayed reuse.
- issues/19758, issues/19760, issues/19775
- Model-specific inference and generation bugs: Some models exhibit unique bugs such as the Kimi K2.5 model ignoring logit bias settings causing banned tokens to be generated, and chat completion responses from Kimi K2.5 and Minimax M2.5 models omitting trailing quotation marks, indicating parser or generation issues specific to these models.
- issues/19699, issues/19795
- Build and installation path inconsistencies: There is an issue where specifying installation directories for LLAMA_LIB_INSTALL_DIR and GGML_LIB_INSTALL_DIR does not result in libraries being installed in the intended locations, with most files ending up in default paths instead, complicating custom installations.
- issues/19748
- API and protocol inconsistencies: The REST API inconsistently handles requests exceeding model context limits by returning different error codes (500 or 400), and it is suggested to use the standard HTTP 413 status code for clarity and consistency.
- issues/19774
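The suggested behavior is straightforward to sketch: map "request exceeds model context" to a single, well-defined status code instead of sometimes 400 and sometimes 500. A hypothetical handler (illustrative only, not llama-server's actual routing code):

```python
def classify_request(n_tokens: int, n_ctx: int) -> int:
    """Return an HTTP status for a completion request.

    413 Payload Too Large signals 'request exceeds the model's context'
    consistently, instead of 400 on some code paths and 500 on others."""
    if n_tokens > n_ctx:
        return 413
    return 200

# A 9000-token request against a 4096-token context is rejected uniformly.
assert classify_request(9000, 4096) == 413
assert classify_request(1024, 4096) == 200
```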
- Crash and error reports on Apple Silicon and Metal backend: Crashes occur unpredictably on Apple Silicon Macs using the Metal backend with specific models, including stack overflows with the --rerank argument and runtime errors related to empty grammar stacks, causing instability during evaluation or server runs.
- issues/19679, issues/19756
- Feature requests for model additions and decoding support: Requests include adding the Voxtral Realtime model for local deployment due to its popularity and small size, and adding Speculative Decoding support for multimodal models to improve performance on vision-language tasks.
- issues/19696, issues/19712
- Documentation errors: The documentation incorrectly instructs users to start the server in multi-model mode using a non-existent --model-store flag instead of the correct --models-dir flag, causing confusion and failure to load models as intended.
- issues/19786
- Hardware-specific regressions and incompatibilities: Image encoding is broken on Vulkan backend with AMD Radeon 8060S hardware starting from a specific release, causing failures in image slicing and encoding for multimodal models, representing a regression not present in earlier versions.
- issues/19735
- Compute buffer size discrepancies on CUDA devices: A discrepancy is reported where the compute buffer size on an NVIDIA GeForce RTX 3090 does not match expected values during server shutdown debug logs, indicating potential resource management or reporting issues.
- issues/19766
- Backend-specific assertion failures: The SYCL backend crashes with a failed assertion on Windows with Intel Arc A770 GPU when running a specific model after an update, causing immediate failure after the first chat message is sent.
- issues/19779
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 33
Summarized Issues:
- Model crashes and assertion failures: Multiple issues report crashes and assertion failures in various models and backends, including Qwen Next 80B Coder crashing due to an empty grammar stack, Kimi-Linear-48B-Instruct crashing on Vulkan backend from exceeding GPU limits, and Qwen3-Coder-Next crashing on GPU offload due to CUDA backend assertion failures. These crashes often occur during inference or tool calls and are linked to backend-specific or model-specific conditions.
- issues/19304, issues/19471, issues/19672
- Qwen3-Coder-Next model bugs and crashes: The Qwen3-Coder-Next model exhibits multiple issues including invalid JSON output with duplicate fields causing parsing failures, llama-server crashes when multiple tool calls are made or content and tool calls are output simultaneously, and failure to load on Windows builds due to unknown model architecture errors. These problems affect both model output correctness and server stability.
- issues/19382, issues/19430, issues/19579
- Compilation and build errors: Several issues describe build failures such as link errors on Windows with the arm64-windows-snapdragon-release preset due to missing runtime libraries, install target failures when building with GGML_CPU_KLEIDIAI enabled because of missing KleidiAI libraries, and compilation failures on Ubuntu with ROCWMMA_FATTN enabled caused by type mismatches in HIP backend calls. These errors prevent successful builds under specific configurations.
- issues/19444, issues/19501, issues/19580
- Backend-specific Vulkan and CUDA issues: Vulkan backend problems include crashes on Adreno 740 GPUs due to null pointer dereferences when shader compilation fails, Vulkan producing garbled or incorrect output on certain models resolved by shader fixes, and Vulkan backend mode collapse after a commit that was later reverted. CUDA backend issues include FlashAttn module crashes on NVIDIA Geforce 1060 and qwen35moe model producing degenerate output due to numerical overflow with a specific build flag.
- issues/19652, issues/19734, issues/19710, issues/19746, issues/19683
- Tool call and template bugs in gpt-oss: The gpt-oss tool call templates have issues with double JSON escaping causing malformed prompts, and multi-turn requests fail due to Jinja template errors when messages contain both content and thinking fields or reasoning_content and tool_calls simultaneously. These bugs cause server errors and HTTP 500 responses that were fixed by adjusting template logic.
- issues/19520, issues/19701, issues/19703
- Inference context and token count errors: There are reports of inference unexpectedly stopping with errors indicating the request token count exceeds the available context size, despite logs suggesting sufficient context. This issue affects long interactions via llama-server and open-webui, causing premature termination of inference.
- issues/19636
- Model loading and tensor mismatch errors: Crashes during model loading occur due to assertion failures related to tensor dimensions, such as in delta-net-base.cpp affecting Qwen3.5-397B on Mac with Metal backend, and missing tensors in the PrimeIntellect 3.1 model causing load errors due to configuration mismatches. These issues prevent successful model initialization.
- issues/19728, issues/19733
- Web UI and server interface problems: The llama.cpp web UI has issues including failure to upload images to vision models, file attachments not being sent with messages, intermittent blank white pages with JavaScript errors mitigated by reducing server HTTP threads, and permission denied errors when pulling Docker container images from GitHub Container Registry. These problems affect usability and deployment.
- issues/19717, issues/19723, issues/19719, issues/19739
- Model conversion and tokenizer support requests: Requests include adding support for converting cerebras/MiniMax-M2.5-REAP-139B-A10B and Tri-21B models to GGUF format, with challenges due to unrecognized BPE pre-tokenizers and unique tokenizers. Additionally, improvements to error handling in the convert_hf_to_gguf.py script are requested to provide clearer messages and search additional repositories for missing tokenizer files.
- issues/19715, issues/19718, issues/19776
- Miscellaneous issues and documentation: Other issues include a broken download link on Ubuntu x64 with ROCm 7.2, a research phase documentation template for projects, and a confirmation that a previously reported cross-stream sequence copy bug was not reproducible and likely due to experimental confounds. These cover project management and minor infrastructure problems.
- issues/19789, issues/19695, issues/19792
- Pull request template proposal: A detailed pull request description template was proposed including sections for summarizing changes, motivation, dependencies, version bumps, testing, and a checklist to ensure thorough code review and documentation. This aims to improve contribution quality and consistency.
- issues/19664
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 40
Key Open Pull Requests
1. feat: Ultra-Low-Bit Quantization Kernels (Q1_5_K, Q2_K_S): This pull request introduces two new ultra-low-bit quantization kernels, Q1_5_K and Q2_K_S, designed to enable running very large models (e.g., 70B parameters) efficiently on memory-constrained consumer GPUs by providing aggressive yet accurate quantization formats that reduce model size significantly while maintaining decoding speed and accuracy through optimized CPU implementations and AVX2 SIMD acceleration.
- URL: pull/19750
- Associated Commits: 8b75b, 7b0f2, 90412, 4dfb8, 6e97c, 25c8f, 56c9e, ff4ca, 793a5, 47b0c, 95be8, 3db03, 70477, 6900f, 2fb65, 8ea1d, 6e295, 08eae, bae75, 118ce, 0649b, 75f88, fa8a3, 77cbc, a6e87, c7c2f, be1f0, fd24b, 4bbe6, e68a1, c6f12, d9dd2, 0889b, 5ab38, cc957, 620f0, 8902f
2. WIP: ggml : add NVFP4 quantization type support: This pull request adds support for NVIDIA's NVFP4 quantization format to the ggml library, including new data types, conversion helpers, backend optimizations across CPU, CUDA, Metal, and Vulkan, integration with the gguf format, and comprehensive testing, enabling efficient handling of NVFP4 models produced by NVIDIA ModelOpt.
- URL: pull/19769
- Associated Commits: 52754, d45d3, ab01d, 0a85b, a96f4, 87c74, c0839, 32864, e403c, 307ff, 14c51, 86dd3
3. [WIP] ggml-hexagon: convert f32 to f16 - fa opt part4: This pull request improves the ggml-hexagon implementation by replacing the existing dot product function with new highly parallelized versions that compute multiple f16 dot products simultaneously, adds vector reduction utilities to efficiently sum HVX vector results, and updates the main attention kernel to leverage these enhancements for increased throughput and simplified code.
- URL: pull/19780
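For readers unfamiliar with how formats like the ultra-low-bit kernels above work, the common idea is blockwise quantization: store one floating-point scale per block of weights plus a low-bit integer code per weight. A toy sketch of the principle (illustrative only; this is not the actual Q1_5_K or Q2_K_S layout):

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int = 2):
    """Quantize one block to signed low-bit codes with a single fp scale."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 1 for 2-bit signed
    scale = np.abs(x).max() / qmax if x.any() else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.4, -0.1, 0.25, -0.3], dtype=np.float32)
q, s = quantize_block(x, bits=2)
x_hat = dequantize_block(q, s)

# 2-bit signed codes lie in {-2, -1, 0, 1}; per-weight error is bounded by scale/2.
assert q.min() >= -2 and q.max() <= 1
assert np.all(np.abs(x - x_hat) <= s / 2 + 1e-6)
```

Real k-quant formats layer per-super-block scale hierarchies and importance-matrix weighting on top of this basic scheme to recover accuracy at such aggressive bit widths.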
Other Open Pull Requests
- Multimodal Context Checkpointing Fix: This pull request fixes the issue where context checkpointing was disabled for multimodal projects, such as Qwen3.5, ensuring that full prompt reprocessing is avoided on every turn. It also enables proper handling of processed images in the key-value cache for hybrid and recurrent model architectures.
pull/19747
- Quantization Guidelines and Documentation Updates: This pull request adds explicit guidelines requiring contributors to provide GGUF, PPL, KLD data, and pure CPU performance data when submitting new quantization schemes. It also updates contributing documentation and changes the required precision from Q8 to FP16/BF16.
pull/19762
- CUDA FP32 Compute Option for Numerical Stability: This pull request introduces the GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F option to use fp32 as the compute type in cuBLAS, preventing numerical overflows on CUDA devices like Tesla V100. It maintains high performance during prompt processing while improving numerical stability.
pull/19697
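The overflow mechanism behind this option is easy to reproduce outside CUDA: FP16 saturates at 65504, so products of moderately large values become infinite unless the arithmetic is carried out in FP32. A minimal NumPy illustration of the numerics (this assumes nothing about the actual cuBLAS kernels):

```python
import numpy as np

# FP16 overflows: 300 * 300 = 90000 exceeds the float16 maximum of 65504.
prod16 = np.float16(300) * np.float16(300)
assert np.isinf(prod16)

# The same arithmetic with an FP32 compute type stays finite, which is
# the effect GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F has on cuBLAS GEMMs.
prod32 = np.float32(np.float16(300)) * np.float32(np.float16(300))
assert np.isfinite(prod32) and prod32 == 90000.0
```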
- IQ*_K Quantization Types Port to CPU Backend: This pull request ports IQ*_K quantization types from ik_llama.cpp to the mainline llama.cpp CPU backend, enabling loading and production of models quantized with IQ2_K through IQ6_K. It includes quality-of-life improvements to testing and preliminary KLD and perplexity evaluations.
pull/19726
- Per-thread Buffer Pool and Job Submission Cleanup: This pull request cleans up the per-thread parameter buffer pool and job submission logic by allowing dynamic resizing for multiple kernels and removing the inflight_threads mechanism. It uses the number of kernels to determine batch submission timing.
pull/19772
- Mistral Voxtral Mini 4B Model Support: This pull request adds comprehensive support for the Mistral Voxtral Mini 4B Realtime 2602 model, including a causal audio encoder, Voxtral-specific mel spectrogram preprocessor, dual-stream inference, adaptive RMSNorm, streaming CLI tool, GGUF conversion, new tensor types, and tied embeddings handling.
pull/19698
- Token Classification Model Support: This pull request adds support for token classification using BertForTokenClassification and ModernBertForTokenClassification by introducing a new pooling type for token-level embeddings compatible with specific Hugging Face models.
pull/19725
- AVX2 Backend ksigns Computation for IQ2_XXS: This pull request attempts to implement ksigns computation directly in the AVX2 CPU backend for IQ2_XXS quantization, replacing lookup table loading. Initial benchmarks show mixed results, and further review and benchmarking are requested.
pull/19657
- Block Interleaving for Q5_K Quantization on x64/x86 SIMD: This pull request implements block interleaving for Q5_K quantization optimized for AVX512/AVX2, including GEMM and GEMV functions, resulting in significant prompt processing performance improvements without affecting perplexity.
pull/19707
- Quantization Logic Refactor and Error Handling: This pull request refactors quantization logic to enable immediate failure with informative errors when an importance matrix is missing, improves modularity and readability, renames key components, updates default settings, and enhances user experience by preventing wasted quantization runs.
pull/19770
- Responses API Endpoint Improvement: This pull request improves the Responses API by merging contiguous assistant input items into a single message to better support chat templates expecting combined content, reasoning, and tool calls, while preserving reasoning content not linked to tool calls.
pull/19773
- Block Interleaving and Fixes for Q6_K Quantization: This pull request implements block interleaving and fixes inaccuracies in Q6_K quantization for AVX512/AVX2, introducing optimized GEMM and GEMV functions that improve prompt processing speed while maintaining perplexity on the llama 7B model.
pull/19706
- Default Context Size Update for CPU-only Builds: This pull request updates the default context size to 4096 for CPU-only builds to prevent crashes on low-memory systems by bypassing automatic adjustment based on free memory, while allowing manual user specification and leaving GPU builds unchanged.
pull/19711
- SO_REUSEADDR Socket Option Added to HTTP Port: This pull request adds the SO_REUSEADDR socket option to the HTTP port to allow immediate reuse after server stops, aligning behavior with the GRPC port and resolving issue #19758.
pull/19763
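SO_REUSEADDR lets a listener rebind a port even while sockets from a previous run linger in TIME_WAIT. A small self-contained sketch of the option being set, using the standard sockets API (unrelated to the server's actual code):

```python
import socket

def make_listener(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without SO_REUSEADDR, bind() can fail with EADDRINUSE while
    # connections from a previous server run sit in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

srv = make_listener(0)  # port 0: let the OS pick a free port
reuse = srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
srv.close()
assert reuse != 0
```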
- Fixes for Vision-Language Embedding Model Stability: This pull request fixes server crashes with Vision-Language embedding models by skipping speculative decoding checks requiring output logits and adding tensor bounds validation for variable embedding tensor dimensions, ensuring stable support for models like Qwen3-VL-Embedding.
pull/19694
- TQ1_0 and TQ2_0 Quantization Support in Vulkan Backend: This pull request adds support for TQ1_0 and TQ2_0 quantization types in Vulkan backend matrix multiplication operations, validated by tests and performance benchmarks comparing Vulkan and CPU backends using the 1bitLLM model.
pull/19743
- gguf-split Tool Operation Clarification: This pull request clarifies the operation of the gguf-split tool to prevent users from learning its functionality through trial and error.
pull/19749
- Grammar Root Symbol Check Bug Fix: This pull request fixes a bug in the grammar root symbol check by verifying the presence of a rule matching the specified grammar root symbol instead of the literal "root," preventing incorrect failures or crashes. It also improves error logging and adds tests demonstrating the issue.
pull/19761
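The fix described here amounts to looking up whichever rule is bound to the configured root symbol, rather than requiring a rule literally named "root". A hypothetical sketch of that check (all names invented for illustration):

```python
def find_root_rule(rules: dict[str, list[str]], root_symbol: str = "root"):
    """Return the production for the grammar's root symbol, raising a
    descriptive error instead of failing on a missing literal 'root'."""
    if root_symbol not in rules:
        raise ValueError(f"grammar has no rule for root symbol {root_symbol!r}")
    return rules[root_symbol]

grammar = {"start": ["expr"], "expr": ["NUMBER"]}
# Checking the *specified* root succeeds even though no rule is named "root".
assert find_root_rule(grammar, root_symbol="start") == ["expr"]
```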
- Partial Success for seq_rm Operation in Hybrid Memory: This pull request enables partial success of the seq_rm operation by allowing cheap snapshots through saving recurrent memory combined with inexpensive seq_rm rollbacks, restoring both recurrent and normal cache states during rollback.
pull/19670
- Pylint Workflow Addition: This pull request adds a Pylint workflow for automated Python code analysis in the project.
pull/19671
- Vision API Test Cases Added: This pull request adds three new test cases to test_vision_api.py verifying server handling of multiple images, no images, and text-only content parts in vision API requests, addressing a previous TODO and ensuring compatibility with the tinygemma3 model.
pull/19691
- OpenAI Responses API Compliance Enhancements: This pull request updates the server to improve compliance with the OpenAI Responses API by fixing the Response object schema, streaming event schema, and multi-turn conversation input parsing for the /v1/responses endpoint, adding 24 missing fields, correcting event and function call structures, and passing 5 of 6 compliance tests.
pull/19720
- mxfp4 Format Repack Implementation: This pull request adds a repack implementation for the mxfp4 format in the ggml-cpu backend, replicating the iq4_nl quantization method with modified scale loading, demonstrating performance improvements on AVX2 and requesting further testing on ARM and AVX512.
pull/19738
- XML Tool Call Parser Fix in Qwen3-Coder-Next: This pull request fixes the XML tool call parser by replacing operator[] with .emplace_back() to correctly handle duplicate parameter keys, preventing streaming diffs from failing due to overwritten values.
pull/19753
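The underlying pitfall: a map-style container silently overwrites repeated keys, while an append-style container preserves every occurrence, which is what a streaming tool-call parser needs. The actual fix is in C++ (`operator[]` vs `.emplace_back()`); the same contrast illustrated in Python:

```python
pairs = [("file", "a.txt"), ("file", "b.txt"), ("mode", "append")]

# Dict assignment behaves like std::map::operator[]: the second "file" wins.
as_map = {}
for k, v in pairs:
    as_map[k] = v
assert as_map["file"] == "b.txt"  # the first value is silently lost

# Appending behaves like std::vector::emplace_back(): duplicates survive,
# so streaming diffs see every parameter occurrence.
as_list = []
for k, v in pairs:
    as_list.append((k, v))
assert [v for k, v in as_list if k == "file"] == ["a.txt", "b.txt"]
```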
- CMake Build System Fix for Library Installation: This pull request fixes the CMake build system to ensure the GGML_LIB_INSTALL_DIR and LLAMA_LIB_INSTALL_DIR options update config files and correctly install ggml and llama libraries and headers to custom directories, resolving configuration and compilation errors for downstream projects using find_package.
pull/19755
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 76
Key Closed Pull Requests
1. [WIP] refactor llama-quant.cpp: This pull request is a work-in-progress refactor of the src/llama-quant.cpp file in the llama.cpp project, aiming to improve code structure and functionality related to tensor dimension handling, quantization processes, and related utility functions, while also adding features like a dry-run option and enhanced error messaging.
- URL: pull/19616
- Associated Commits: 844ad, e6b79, 0d222, 56c27, c3f42, b9b32, 150e1, 966b2, 07f88, 2769f, ea8da, 3211a, 55dbe, 22db7, ae786, 1ccd7, 16582, b15bb, 40528, 44f9f, f58de, 75ab2, 0301b, 5d6c9, 67e25, 1f25c, 6734e, d6486, fd378, 053a2, 97aef, bddc6, 7b127, a3bf0, f14fd
2. Add Kimi Linear to unified delta net: This pull request adds the Kimi Linear component to the unified delta net in the llama.cpp project, along with various code simplifications, optimizations, and synchronization updates to improve the model's implementation and performance.
- URL: pull/19668
- Associated Commits: cff8f, b0594, 6c765, a93bc, 7b268, 4a639, c0797, 11776, 6dad4, df269, 1cea2, a6fa6, 6432f, de6a8, 23ccc
3. Pre-MCP UI and architecture cleanup: This pull request focuses on pre-MCP user interface and architecture cleanup by refactoring the chat input flow, reworking message editing with shared contexts, cleaning up service and store structures, improving streaming and generation handling, normalizing reasoning and tool-call handling, enhancing model metadata caching, updating settings and type constants, and updating dependencies to prepare the codebase for further modularization.
- URL: pull/19689
- Associated Commits: 629cc, 956a0, b02d5, 1305b, 6823a, b78b9, 40f66, 726bd, eddbb, e259b, 28d4d, 3729f, d278b, 560eb
Other Closed Pull Requests
- Model support and tokenization enhancements: Multiple pull requests add native support for new model families and improve tokenization methods, including CohereLabs/tiny-aya with a custom digit-grouping regex, JAIS-2 Arabic-English bilingual models with specific architecture features, and JoyAI-LLM-Flash tokenizer hash mapping. These updates enhance compatibility and conversion capabilities across various models and tokenizers.
- Performance improvements and CUDA graph enhancements: Several pull requests focus on improving performance through hardware-specific optimizations such as SVE support on aarch64, enabling CUDA graphs for matrix multiplication with microbatching, and optimizing CUDA dequantization operations. Additionally, CUDA graph capture is improved by delaying activation until stable, reducing overhead and fixing related issues.
- Graph and attention mechanism refinements: Updates include fixing reuse conditions for KQ masks and LoRA adapters to improve graph reuse during parallel decoding, refactoring attention build functions to toggle flash attention via command-line options, and deduplicating delta-net graph builds for Qwen family models. These changes enhance modularity, testing, and performance of graph-related components.
- Shader and Vulkan improvements: A pull request reorganizes the ggml WebGPU shader code into a centralized shader library with JIT compilation and caching to improve performance and future GPU specialization. Vulkan-related updates replace hardcoded strings with SDK constants and address batch dimension limits by splitting operations and adding fallback support.
- Build system, Docker, and CI updates: One pull request updates the ROCm docker container and CI workflows to version 7.2, adds support for gfx1150 architecture, disables incompatible features on gfx908, and introduces a new build target for ROCm 7.2 artifacts. Another synchronizes the project with the ggml library version update.
- New features and tooling additions: New functionality includes adding a lightweight audio tokenizer based on LFM2 architecture, a --dry-run option for the llama-quantize tool to preview tensor sizes without quantization, and exposing ggml_is_view as a public API to improve backend compatibility.
- User interface and management core improvements: Pull requests extract and clean up pre-MCP UI and architecture code for better maintainability and add the MCP core runtime, management UI, settings dialogs, and integration into the chat experience with documentation and static web UI output.
- Model integration and architecture additions: Support is added for the GLM-OCR model integrating text and vision components based on GLM4V architecture, addressing a specific issue and incorporating Huggingface transformers code.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| allozaur | 241 | 5 | 0 | 6 |
| ggerganov | 112 | 9 | 0 | 21 |
| pwilkin | 61 | 2 | 0 | 15 |
| CISC | 43 | 3 | 0 | 23 |
| 0cc4m | 60 | 2 | 0 | 7 |
| ddh0 | 49 | 4 | 1 | 5 |
| ymcki | 50 | 1 | 0 | 5 |
| ServeurpersoCom | 54 | 0 | 0 | 0 |
| No author found | 50 | 0 | 0 | 0 |
| am17an | 34 | 4 | 0 | 11 |