Weekly GitHub Report for Llama.cpp: January 16, 2026 - January 23, 2026 (21:06:35)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced functionality and performance optimizations. Notable highlights include streamlined features aimed at improving user experience and system efficiency.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG-UNCONFIRMED] Eval bug: CPU usage is abnormally high when using the CUDA backend to infer GLM-4.7-Flash: This issue reports abnormally high CPU usage and severely degraded performance when using the CUDA backend to run inference on the GLM-4.7-Flash model, particularly with the "-fa on" flag enabled. Users discuss that disabling "-fa" significantly improves speed, and subsequent fixes and pull requests have addressed the problem, including improvements to the quantized KV cache and FA implementation, resulting in restored and stable performance.
- Commenters confirmed the issue affects multiple hardware setups and is linked to the "-fa on" flag causing slowdowns; disabling it or applying a specific PR greatly improves performance. Later comments show that recent commits have fixed the problem, with benchmarks demonstrating restored token-per-second rates and users verifying the fix works as expected.
- Number of comments this week: 15
- [BUG-UNCONFIRMED] Eval bug: llama_model_load: error loading model: vk::Queue::submit: ErrorDeviceLost: This issue reports a problem where loading certain large models on an AMD Radeon RX 5700 XT GPU using Vulkan results in a "vk::Queue::submit: ErrorDeviceLost" error, which appears related to insufficient memory for command submission. The user notes that the error occurs when memory mapping is disabled and direct I/O is enabled, but the model loads successfully when using the `--mmap` flag or without direct I/O.
- The comments discuss attempts to resolve the issue by checking driver versions and testing different command-line flags; it is confirmed that the error persists without `--mmap` and with direct I/O enabled, suggesting a possible internal driver or memory management problem.
- Number of comments this week: 9
- [BUG-UNCONFIRMED] [BUG] Misc. bug: Qwen3-Next fails to run in 2 nodes ( rpcserver / metal ): This issue reports a problem where the Qwen3-Next model fails to run across two Mac nodes using the rpcserver with the Metal backend, crashing at startup while other models work fine. The root cause appears to be that the Metal backend does not support the 'DIAG' operation, which is required but not handled properly in the RPC server's operation support checks, leading to crashes during remote procedure calls.
- The comments discuss that Metal as an RPC backend has been unstable for months, highlight the need for consistent operation support across all backends, and propose caching operation support results to improve performance and avoid repeated RPC calls for operation checks (a minimal sketch of such a cache appears after this list).
- Number of comments this week: 7
- [BUG-UNCONFIRMED] Eval bug: gpt-oss:120b does not load on RTX4090(24G VRAM)+64G RAM: This issue reports that the gpt-oss-120b model fails to load on an RTX 4090 system with 64GB RAM using the latest version of the software, resulting in an out-of-memory (OOM) kill of the llama-server process. The problem appears linked to a change in how model tensors are loaded: the switch from memory mapping (mmap) to direct I/O increases RAM usage and causes the failure unless the user explicitly enables mmap with a command-line flag.
- The comments discuss comparisons between versions showing that disabling mmap leads to OOM kills due to higher RAM usage with direct I/O; users express that the new default behavior degrades the user experience, since large models no longer "just work." A proposal emerges to revert to mmap as the default and make direct I/O opt-in, with explanations of the trade-offs between mmap and direct I/O in terms of speed and memory consumption (the two loading paths are contrasted in a sketch after this list).
- Number of comments this week: 6
- [ENHANCEMENT] Feature Request: Support LightOnOCR-2-1B: This issue requests the addition of support for the newly released LightOnOCR-2-1B model in the llama.cpp project, highlighting that the first generation model is already supported and well-regarded for its speed and accuracy in OCR tasks. The user emphasizes that the newer version appears to be a significant improvement and suggests integrating it to enhance the project's capabilities.
- The comments express strong interest in the feature, with one contributor clarifying that the new model is architecturally similar to the first and can be used by converting weights to GGUF format. Another user shares GGUF files and instructions for the new model, which are well received and tested with positive feedback on the results.
- Number of comments this week: 4
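Regarding the caching proposal in the Qwen3-Next / rpcserver item above: the following is a hypothetical C++ sketch of a client-side cache of operation-support results, so that each operation is queried over RPC at most once. All names are illustrative; this is not the actual ggml-rpc API.

```cpp
// Hypothetical sketch (not the actual ggml-rpc API): cache per-operation
// support results on the client so repeated "does the remote backend
// support op X?" checks do not each cost a network round-trip.
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum class op_kind : int32_t { ADD, MUL_MAT, DIAG /* ... */ };

class remote_op_support_cache {
public:
    // query is assumed to perform the actual RPC; it runs at most once per op.
    template <typename QueryFn>
    bool supports(op_kind op, QueryFn && query) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(static_cast<int32_t>(op));
        if (it != cache_.end()) {
            return it->second;                        // cached answer, no RPC
        }
        const bool ok = query(op);                    // single round-trip
        cache_.emplace(static_cast<int32_t>(op), ok); // remember the result
        return ok;
    }

private:
    std::mutex mutex_;
    std::unordered_map<int32_t, bool> cache_;
};
```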
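For the mmap versus direct I/O trade-off discussed in the gpt-oss-120b item above, here is a minimal C++/POSIX sketch of the two loading strategies (illustrative only, not llama.cpp's actual model loader): mapped pages stay reclaimable by the kernel, whereas an explicit read, like direct I/O into user buffers, keeps the whole file resident in anonymous memory.

```cpp
// Minimal POSIX sketch (not llama.cpp's loader) contrasting the two loading paths.
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// mmap path: pages are faulted in on demand and clean pages can be
// dropped by the kernel under memory pressure, so resident usage stays low.
const void * load_with_mmap(const char * path, size_t & size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st{};
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size = (size_t) st.st_size;
    void * data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor
    return data == MAP_FAILED ? nullptr : data;
}

// read path: the whole file is copied into anonymous memory up front,
// which is the behavior that can trigger OOM kills for very large models.
std::vector<char> load_with_read(const char * path) {
    std::vector<char> buf;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return buf;
    struct stat st{};
    if (fstat(fd, &st) == 0) {
        buf.resize((size_t) st.st_size);
        ssize_t off = 0;
        while (off < st.st_size) {
            ssize_t n = read(fd, buf.data() + off, (size_t) (st.st_size - off));
            if (n <= 0) { buf.clear(); break; }
            off += n;
        }
    }
    close(fd);
    return buf;
}
```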
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 40
Summarized Issues:
- Backend and Hardware Compatibility Issues: Multiple issues report crashes, errors, or performance problems related to specific hardware backends such as Metal, Hexagon, Vulkan, CUDA, and AMD GPUs. These problems include unsupported operations causing startup failures, device lost errors, memory management incompatibilities, and slow or inefficient processing on various GPUs and platforms.
- issues/18882, issues/18910, issues/18932, issues/18940, issues/18946, issues/18948, issues/18962, issues/18984, issues/19026
- Memory Usage and Management Problems: Several issues highlight excessive memory consumption, out-of-memory errors, or memory corruption caused by mmap usage, memory accounting failures, or improper memory allocation during model loading and inference. These lead to crashes, degraded performance, or inability to run large models on certain hardware.
- issues/18889, issues/18949, issues/18946, issues/19035, issues/19003
- Server and RPC Stability Issues: Problems with server crashes, deadlocks, malformed RPC responses, and zombie processes occur due to blocking calls, assertion failures, or invalid argument errors during inference or request handling. These issues cause service interruptions and require restarts or workarounds.
- issues/18882, issues/18912, issues/18996, issues/19034, issues/19041
- Model Loading and Conversion Errors: Several issues describe failures or errors during model loading or conversion, including unsupported models, misconfigured tokenizers, and bugs in conversion scripts. These problems prevent proper model usage or cause incorrect tokenization and output.
- issues/18885, issues/18978, issues/19013, issues/19005
- Performance Degradation and Optimization Requests: Users report significant slowdowns, high CPU usage, or inefficient GPU utilization, and request improvements such as GPU prioritization and UI size reduction to enhance performance on various systems and network conditions.
- issues/18914, issues/18917, issues/18940, issues/19039
- Security Vulnerabilities and Memory Safety Bugs: Multiple issues reveal vulnerabilities including heap buffer overflows, null pointer dereferences, unvalidated input sizes causing unbounded memory allocation, and potential denial-of-service attacks due to missing bounds checks and unsafe parsing logic (a minimal bounds-check sketch appears after this list).
- issues/18988, issues/18990, issues/19003, issues/19005
- Regex and Grammar Parsing Failures: Several bugs cause stack overflows, infinite recursion, or crashes when parsing complex or maliciously crafted regex patterns and grammar rules, leading to unbounded stack growth and server errors.
- issues/18988, issues/19007, issues/19008, issues/19010, issues/19051
- Model-Specific Bugs and Feature Requests: Issues include bugs triggered by specific models or parameters, such as reasoning budget malfunctions, template parsing errors, and requests for new model support or bindings to improve usability and compatibility.
- issues/18943, issues/19001, issues/19004, issues/19047
- Quantization and Inference Crashes: Crashes during quantization or inference occur due to unsupported tensor sizes, invalid CUDA arguments, or errors in key-value cache processing, causing aborted runs and failed model execution.
- issues/19036, issues/19038, issues/18996
- Docker and WebUI Issues: Problems with the WebUI and Docker setups include output freezing, incorrect token speed display, and the need for image updates to support newer hardware and software versions.
- issues/18913, issues/18975
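As referenced in the security summary above, the usual mitigation for unvalidated input sizes is to bound-check length fields against both the remaining input and a hard cap before allocating. The sketch below is a hypothetical C++ illustration of that pattern; the cap, names, and framing are assumptions, not code from llama.cpp.

```cpp
// Hypothetical bounds-check pattern for a length-prefixed field in untrusted input.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

static constexpr uint64_t MAX_REASONABLE_LEN = 1u << 20; // illustrative hard cap

std::optional<std::vector<uint8_t>> read_length_prefixed(const uint8_t * data, size_t size) {
    if (size < sizeof(uint64_t)) {
        return std::nullopt;                 // not enough bytes for the length header
    }
    uint64_t len = 0;
    std::memcpy(&len, data, sizeof(len));    // avoid unaligned direct casts
    if (len > MAX_REASONABLE_LEN || len > size - sizeof(uint64_t)) {
        return std::nullopt;                 // reject rather than allocate unbounded memory
    }
    return std::vector<uint8_t>(data + sizeof(uint64_t),
                                data + sizeof(uint64_t) + len);
}
```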
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 26
Summarized Issues:
- Model output and inference errors: Several issues report problems with model output including infinite loops, nonsensical or empty outputs, and crashes during inference. These problems occur across different models and backends, such as MiniMax-M2 GGUF causing Cyrillic output loops, Qwen2.5 0.5B GGUF conversion producing empty tokens, and Granite/Nemotron models failing due to missing Jinja filters.
- CUDA backend compilation failures: Multiple issues describe build errors in the CUDA backend caused by missing members "make_strided_iterator" and "make_counting_iterator" in the CUDA namespace. These errors prevent successful compilation on Linux systems and have appeared since a specific commit, affecting files like argsort.cu.
- Jinja template rendering and chat template bugs: Several issues involve errors in Jinja template processing causing server errors or crashes. Problems include non-callable values in templates, unknown filters for Boolean or None types, and modulo-by-zero exceptions during chat template initialization, all leading to failures in template rendering or server operation.
- Vulkan backend and OpenAPI interface issues: Bugs in the Vulkan backend cause NaN outputs and crashes on Intel iGPUs due to f16 accumulation overflow, and the OpenAPI interface on Vulkan Windows builds enters infinite loops producing garbage output, while the web GUI remains functional.
- GGUF file format and tensor validation crashes: Issues with GGUF file handling include null pointer dereferences and integer division-by-zero errors caused by insufficient validation of tensor data and dimensions. These bugs lead to crashes when reading or initializing GGUF files (a minimal validation sketch follows this list).
- Model fetching and CI pipeline failures: A recent removal of libcurl in favor of OpenSSL broke model fetching on Ubuntu systems due to missing SSL libraries, and the GitHub Actions CI stopped creating prebuilt binary release assets after a certain commit, causing release workflow errors.
- Performance regressions with Flash Attention: Using Flash Attention on the GLM-4.7-FLASH model running on Pascal GPUs like Tesla P40 causes a significant inference slowdown, halving performance compared to disabling Flash Attention, an issue not observed on smaller mixture-of-experts models.
- Expert ID bounds and model initialization crashes: A GGML_ASSERT failure related to expert ID bounds causes crashes when running the GPT-OSS 20B model regardless of quantization level or flags, preventing successful model initialization or execution.
- New model architecture support requests: There is a request to add support for the new 30B MoE model architecture `Glm4MoeLiteForCausalLM` to enable GGUF compatibility as a smaller alternative to the flagship model.
- Cross-compilation build failures on RISC-V: Compilation errors occur when cross-compiling for RISC-V Vector targets due to undefined macros related to half-precision floating-point vector operations, causing build failures in vector header files.
- Winget job and CI queue management issues: The Winget job fails due to incorrect permissions preventing CreateRef execution, and there is a request to fix the ggml-ci-x64-cpu-amx job to avoid manual queue clearing.
- Runtime parsing errors with CUDA backend: Running llama-cli with a specific model and CUDA backend aborts due to a missing parser definition causing a failure to parse input after a few seconds.
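As referenced in the GGUF item above, crashes of this kind are typically avoided by validating tensor metadata before it is used. The following is a hypothetical C++ sketch of such a check; the struct and function are illustrative and do not reflect the actual gguf reader code.

```cpp
// Hypothetical validation of tensor metadata read from an untrusted GGUF file.
#include <cstdint>
#include <limits>

struct tensor_desc {            // illustrative stand-in, not the gguf API
    int64_t ne[4];              // tensor dimensions
    const void * data;          // pointer into the mapped file, may be null
};

bool validate_tensor(const tensor_desc & t, uint64_t bytes_available, uint64_t type_size) {
    if (t.data == nullptr || type_size == 0) {
        return false;           // catches the null-pointer and divide-by-zero cases up front
    }
    uint64_t n_elements = 1;
    for (int i = 0; i < 4; ++i) {
        if (t.ne[i] <= 0) {
            return false;       // zero or negative dimensions would poison later divisions
        }
        if (n_elements > std::numeric_limits<uint64_t>::max() / (uint64_t) t.ne[i]) {
            return false;       // overflow-safe multiplication
        }
        n_elements *= (uint64_t) t.ne[i];
    }
    return n_elements <= bytes_available / type_size; // tensor must fit in the data section
}
```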
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 29
Key Open Pull Requests
1. Split shared state (webgpu_context) into global state and per-thread state: This pull request refactors the WebGPU backend by splitting the global `webgpu_context` struct into separate global and per-thread states, moving device handles, capabilities, and synchronization primitives to global state while making `webgpu_context` a per-thread struct containing the pipelines and buffers necessary for running a graph (a schematic sketch of this split appears after the key pull requests).
- URL: pull/18976
- Associated Commits: 6b93b, 893e6, 417fa, 5a1b5, c7442, 08f79, c0373, 82b52, 51b39, 5f645, 2e4c9, 1dd56, f9796, 07838, af9c6, c0368, 94101, 8d044, cbad4
2. HIP: add mmf for CDNA: This pull request adds and refactors the matrix multiply function (mmf) for the CDNA GPU architecture in the llama.cpp project, including making rows_per_block an input parameter, passing MUL_MAT and MUL_MAT_ID operations, extending tile size for shared memory loading, and includes performance data while requesting further testing on CDNA2 and CDNA1.
- URL: pull/18896
- Associated Commits: fe248, be5ec, a3122, 8c875, 7b43c, cd8a3, 45024, 7444a, 250ae, 4d2ef, 47a7d, 8375c, ee044, 3bb26, b0656, 8331f
3. ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860: This pull request implements and optimizes the q6_K repack GEMM and GEMV (and generic) matrix multiplication routines for ARM aarch64 CPUs, significantly improving performance on various models and hardware configurations while maintaining accuracy.
- URL: pull/18888
- Associated Commits: 95d63, 8d37d, 2ab02, c996c, 2933c, f99fd, 6973e, 8ad50, 5ba52, 3d6ad, 99366, 236a1, 0f8f7, 5823c, 69be9
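To make the shape of the webgpu_context refactor above concrete, here is a schematic C++ sketch of a global/per-thread state split. The types, fields, and names are illustrative placeholders only, not the actual ggml WebGPU structs.

```cpp
// Schematic sketch of splitting shared backend state from per-thread state.
#include <mutex>
#include <vector>

struct webgpu_global_state {                // one instance, shared by all threads
    void * device = nullptr;                // device handle, capabilities, limits, ...
    void * queue  = nullptr;
    std::mutex submit_mutex;                // synchronization primitives stay global
};

struct webgpu_thread_state {                // one instance per worker thread
    std::vector<void *> pipelines;          // pipelines this thread needs to run its graph
    std::vector<void *> buffers;            // scratch/staging buffers owned by this thread
};

webgpu_global_state g_state;                // long-lived, created once at backend init
thread_local webgpu_thread_state t_state;   // no cross-thread contention on hot paths
```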
Other Open Pull Requests
- CUDA performance optimizations: Multiple pull requests improve CUDA-related performance by addressing specific bottlenecks. These include using the mmvq method for small batch sizes to eliminate performance dips, enabling piece-wise CUDA graphs for multi-GPU setups to achieve speedups, and increasing the CUDA command buffer size to reduce CPU stalls during pipeline parallelism, resulting in significant overall performance gains.
[pull/18958, pull/18934, pull/19042]
- Flash attention and numerical stability improvements: A pull request refactors the `flash_attn_ext_f16_thread` function to optimize vectorized score computation and enhance numerical stability in the softmax calculation. This restructuring ensures proper vector size handling and improves the online softmax update for better performance (a scalar sketch of the online softmax update appears after this list).
[pull/19025]
- Model evaluation and utility tools: New Python-based tools are introduced for model evaluation and inspection. The `llama-eval` tool enables zero-dependency benchmarking of large language models on common tasks via HTTP, while `tensor-info.py` allows quick querying of tensor shapes from safetensors models to facilitate efficient verification during conversion.
[pull/18892, pull/18954]
- Template and JSON handling enhancements: Support for mixed-type object keys in Jinja templates is added, replicating Python/Jinja behavior for various types and fixing related issues with string filters and JSON escaping. Additionally, server-side fixes improve handling and validation of the `json_schema` response format, ensuring proper error reporting and schema extraction.
[pull/18955, pull/18963]
- API and backend support expansions: Several pull requests add new features and backend support, including an initial multi-token prediction API targeting GLM models, cumulative sum operation support for the OpenCL backend, and Metal backend enhancements for virtual devices and event support to enable multi-GPU workflows on Macs.
[pull/18886, pull/18981, pull/18966]
- Type safety and code cleanup: Fixes are made to integer type inconsistencies by replacing `int` with `int32_t` in path-splitting functions to avoid narrowing conversions and improve safety. Additionally, redundant wrapper functions are removed to streamline code and consolidate documentation.
[pull/18894, pull/18968]
- Hardware-specific performance improvements: RISC-V Vector support is added for the SSM scan operation, yielding a 46% speedup on float32 data. CPU performance for long-context prompt processing is improved by replacing vector kernel-based FA with a tiled FA approach optimized for AMD EPYC 64-core machines.
[pull/18923, pull/19012]
- Error handling and stability improvements: Graceful error handling is introduced in the ggml-rpc system to replace server crashes with error logging and response propagation, improving stability in distributed inference. A stack overflow fix in the GBNF grammar parser adds cycle detection and recursion limits to prevent infinite recursion (a sketch of this guard pattern appears after this list).
[pull/18926, pull/18993]
- Model and conversion support: Support is added for the Mistral-Nemotron-Vision-4B-Instruct vision-language model, including instructions for GGUF conversion and quantization, with testing on Windows RTX GPUs. A legacy torch flag is introduced in the conversion script to maintain compatibility with older Intel-based Macs unable to upgrade PyTorch beyond version 2.2.2.
[pull/19024, pull/18908]
- Configuration and usability improvements: A new rerank preset simplifies setup for the BGE Reranker v2 M3 model. Help messages for command-line arguments are updated to display floating-point defaults with two decimal places for accuracy. Prompt cache handling for recurrent models is fixed to avoid unnecessary memory sequence removals.
[pull/18923, pull/19045, pull/19048]
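The online softmax update mentioned in the flash-attention item above is a standard streaming technique: keep a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows. Below is a scalar C++ sketch of the idea; it illustrates the technique only and is not the vectorized kernel changed by the pull request.

```cpp
// Scalar illustration of the online (streaming) softmax update.
#include <algorithm>
#include <cmath>
#include <vector>

struct online_softmax_state {
    float m = -INFINITY; // running maximum of the scores seen so far
    float s = 0.0f;      // running sum of exp(score - m)
};

void online_softmax_update(online_softmax_state & st, const std::vector<float> & scores) {
    for (float x : scores) {
        const float m_new = std::max(st.m, x);
        // rescale the existing sum to the new maximum, then add the new term;
        // subtracting the maximum keeps every exponent <= 0, avoiding overflow
        st.s = st.s * std::exp(st.m - m_new) + std::exp(x - m_new);
        st.m = m_new;
    }
}

// After all score blocks are processed, the softmax weight of a score x
// is exp(x - st.m) / st.s.
```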
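For the GBNF stack-overflow fix mentioned above, the guard pattern it describes (cycle detection plus a recursion limit) can be sketched as follows. This is a hypothetical illustration; the rule type, depth cap, and traversal are assumptions, not the actual grammar parser.

```cpp
// Hypothetical recursion guard: track depth and the rules currently on the
// stack so a self-referential grammar fails cleanly instead of overflowing.
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct rule {                               // illustrative grammar rule
    std::string name;
    std::vector<std::string> refs;          // names of rules this rule references
};

constexpr int MAX_DEPTH = 128;              // illustrative recursion cap

bool check_rule(const std::string & name,
                const std::unordered_map<std::string, rule> & rules,
                std::unordered_set<std::string> & in_progress,
                int depth) {
    if (depth > MAX_DEPTH) {
        return false;                       // recursion limit exceeded -> clean error
    }
    if (!in_progress.insert(name).second) {
        return false;                       // cycle detected: rule already on the stack
    }
    auto it = rules.find(name);
    bool ok = (it != rules.end());
    if (ok) {
        for (const auto & ref : it->second.refs) {
            if (!check_rule(ref, rules, in_progress, depth + 1)) { ok = false; break; }
        }
    }
    in_progress.erase(name);                // pop this rule off the "stack" set
    return ok;
}
```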
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 68
Key Closed Pull Requests
1. ggml webgpu: support for backend sampling: This pull request implements comprehensive support for backend sampling operations in the ggml WebGPU backend, including the addition of multiple unary operators, support for various tensor operations like CLAMP, LOG, ARGMAX, ARGSORT, TOP_K, and CUMSUM, as well as runtime JIT compilation of shaders to improve flexibility and resource usage, thereby fixing broken WebGPU CI tests caused by a prior refactor.
- URL: pull/18880
- Associated Commits: 13ad7, ac51c, e2a00, 267d3, 0e594, 4f358, c4c4f, 0ba2c, e7a0a, 0db82, c212d, 4c13d, be94a, 8fa18, 8eaaf, 5f60b, 23373, b0588, b33df, df04f, 40710, 6c5cf, 9f06f
2. ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm): This pull request implements the REPACK version of the q5_K quantization format for ARM aarch64 CPUs, including optimized GEMM and GEMV operations using integer matrix multiplication (i8mm), improves code structure and performance compared to the previous vec_dot implementation, and provides detailed benchmarking showing significant speedups on multiple models and hardware configurations.
- URL: pull/18860
- Associated Commits: 63fe1, f623c, c3ed6, d729b, 9d038, ee775, 2c1e3, 7e722, 72cdc, f4a7a, d13ca, e858b
3. graph : utilize ggml_build_forward_select() to avoid reallocations: This pull request utilizes the new ggml_build_forward_select() function to avoid memory reallocations when switching between different input types (tokens or embeddings) across most models, improves graph topology consistency for deepstack models, and enables server CI to detect unexpected reallocations during tests.
- URL: pull/18898
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ngxson | 146 | 10 | 0 | 60 |
| ochafik | 97 | 0 | 0 | 0 |
| CISC | 42 | 8 | 1 | 40 |
| ggerganov | 55 | 9 | 0 | 22 |
| am17an | 39 | 5 | 0 | 3 |
| JohannesGaessler | 23 | 1 | 0 | 13 |
| Copilot | 36 | 0 | 0 | 0 |
| jeffbolznv | 24 | 4 | 0 | 5 |
| danbev | 24 | 7 | 0 | 2 |
| pwilkin | 13 | 3 | 1 | 15 |