Weekly GitHub Report for Llama.cpp: September 22, 2025 - September 29, 2025 (12:03:46)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced functionality and performance optimizations. Notable highlights include streamlined features that improve user experience and system efficiency.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: Obvious performance downgrade between Vulkan and CUDA backend: This issue reports a significant performance discrepancy between the Vulkan and CUDA backends when running llama-server on Windows, specifically that Vulkan is much slower than CUDA during the prefill (prompt processing) phase, while CUDA is slower during the decode (token generation) phase. The user asks for an investigation into why this gap exists and for improvements to Vulkan's prefill speed and CUDA's decode speed, providing detailed benchmark results and discussing GPU utilization and quantization formats in the comments.
- The comments reveal that this issue is a known duplicate and may be related to Windows-specific behavior, with suggestions to limit Vulkan to a single GPU to avoid model splitting across devices. Benchmark comparisons show Vulkan’s performance improves with legacy quant models using integer dot product paths, but still lags behind CUDA in some phases. Users discuss hardware differences, backend optimizations, and the difficulty of switching to Linux for testing, concluding that Vulkan’s Pascal GPU support could be further optimized and that clearer device usage reporting is needed.
- Number of comments this week: 15
- Eval bug: Gpt-oss-20b garbage outputs with Vulkan backend: This issue reports that running the GPT-OSS-20B model with the Vulkan backend on an Apple M2 Pro produces garbage outputs, while running the same model on the CPU backend, or with Vulkan at older commits, works correctly. The problem appears linked to the multi_add shader operation in the Vulkan backend, which passes most tests except for specific fused add operations, suggesting a possible driver or hardware bug on the Apple M2 Pro (Honeykrisp).
- The discussion explored Vulkan validation layers and tests, revealing no explicit validation errors but multiple best practice warnings. Running a dedicated test for the multi_add operation showed failures only on the Vulkan backend for certain tensor configurations, implicating the multi_add shader as the root cause. The issue was narrowed down to a probable hardware or driver bug on the Apple M2 Pro (Honeykrisp), with suggestions to report it to the hardware vendor and attempts to rule out codebase issues by comparing with other Vulkan implementations like MoltenVK.
- Number of comments this week: 12
- Misc. bug: GGML_CUDA_ENABLE_UNIFIED_MEMORY Does not work: This issue reports that the `GGML_CUDA_ENABLE_UNIFIED_MEMORY` feature, which is supposed to enable automatic VRAM swapping so that larger models can run, does not work as expected and still produces CUDA out-of-memory errors even though the documented system requirements are met. The user points out that the problem persists despite previous discussions being closed, and asks whether the feature should be removed from the documentation if it is not supported or functional. (A rough sketch of how such an opt-in unified-memory path can work follows at the end of this list.)
- The comments discuss platform-specific limitations of managed memory support, noting that it depends on hardware and OS configuration, with some users confirming it works but with very poor performance, especially on discrete GPUs. It is suggested that the feature may be broken due to NVIDIA driver or kernel module issues, and a proposal is made to warn users when unified memory is not available or effective, along with calls for clearer documentation of its limitations and performance implications.
- Number of comments this week: 11
- Eval bug: Official gpt-oss-120b model output has dropped/missing tokens, can't count to 100: This issue reports that the official gpt-oss-120b model, when run via llama-server with the CUDA backend, produces outputs with dropped or missing tokens, notably failing to count correctly from 1 to 100 in generated sequences. The problem appears in both the server API streaming output and the new web UI, with evidence suggesting that added network latency causes streamed tokens to be lost or not rendered properly in the UI, rather than a fundamental model or sampling bug.
- Commenters tested various hardware setups and builds, with some unable to reproduce the issue and others confirming it; troubleshooting focused on sampling settings and quantization formats. Ultimately, the consensus emerged that the root cause is a bug in the new web UI’s handling of streamed tokens under latency, as server logs show all tokens are generated correctly, and a fix is being prepared.
- Number of comments this week: 11
- Ubuntu Cuda Dedicated Executable Release: This issue is from a user who has built a llama.cpp executable for Ubuntu with CUDA support on an older NVIDIA GPU and is asking how to publish this build as a release on the llama.cpp GitHub repository. The user also expresses interest in contributing built-in tracing and observability support for NVIDIA GPUs on Ubuntu and asks about the preferred contribution workflow.
- The comments include a suggestion to use the `xz` compression format for the release artifact, detailed system and build environment information provided by the user, and instructions on how to add an Ubuntu CUDA release by modifying the repository's release workflow file. Further discussion advises contributing via a fork and separating the CUDA runtime libraries from the llama.cpp binaries for flexibility, which the user acknowledges and agrees to follow.
- Number of comments this week: 7
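To illustrate the GGML_CUDA_ENABLE_UNIFIED_MEMORY item above, here is a minimal sketch of how an opt-in unified-memory allocation path can work with the CUDA runtime API. It is an illustration only, not the actual ggml-cuda implementation; the function name, warning text, and fallback behavior are assumptions made for the example.

```cpp
// Sketch only: an opt-in unified-memory allocation path similar in spirit to
// what GGML_CUDA_ENABLE_UNIFIED_MEMORY enables. Names here are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void * alloc_device_buffer(size_t size) {
    void * ptr = nullptr;
    // Opt in via the environment variable, as the documented feature suggests.
    if (std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY") != nullptr) {
        // Managed memory lets the driver page data between host and device,
        // which can avoid hard OOM errors but may be very slow on discrete GPUs.
        cudaError_t err = cudaMallocManaged(&ptr, size, cudaMemAttachGlobal);
        if (err == cudaSuccess) {
            return ptr;
        }
        // Warning on failure corresponds to the commenters' request for clearer
        // diagnostics when unified memory is unavailable or ineffective.
        std::fprintf(stderr, "warning: cudaMallocManaged failed (%s), falling back to cudaMalloc\n",
                     cudaGetErrorString(err));
    }
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        return nullptr; // caller reports the out-of-memory condition
    }
    return ptr;
}
```

Whether managed memory actually prevents out-of-memory errors depends on the driver, kernel module, and GPU type, which is exactly the platform sensitivity the comments describe.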
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the alternative Vulkan backend. The problem has remained unresolved for over 546 days, indicating a persistent and potentially complex bug affecting this particular backend implementation.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace for the llama.cpp project during model inference, specifically to aid in improving the Metal backend. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, which they have been collecting for other frameworks.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the CURLOPT_NOPROGRESS option in the download process to ensure accurate and non-overlapping progress status for each parallel download.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using the llama-cli on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information and preventing the model from being loaded successfully.
- Misc. bug: CONVERT merged_16bit TO f16_gguf BY MODEL phi-3.5-mini-instruct: This issue describes a problem encountered when converting a fine-tuned microsoft-phi-3.5-mini model from a merged 16-bit format to an f16_gguf format using llama.cpp's conversion script. The user reports that although the fine-tuned model performs accurately in its original merged 16-bit form, the converted f16_gguf model shows a significant drop in accuracy, and they are seeking a solution to preserve model performance after conversion.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 42
Summarized Issues:
- OpenCL and Vulkan Backend Output Issues: Several issues report problems with output quality and performance when using OpenCL or Vulkan backends on various hardware. These include garbled output on Qualcomm Adreno GPUs, garbage outputs on Apple M2 Pro with Vulkan, and significantly slower decoding speeds on Snapdragon Android devices, indicating compatibility and shader execution problems.
- issues/16152, issues/16188, issues/16217
- SvelteKit WebUI Functionality and Usability Bugs: Multiple issues highlight bugs and missing features in the new SvelteKit WebUI, such as failure to display thought processes or reasoning tags, lack of processing statistics, theme inconsistencies, premature query submission on macOS Safari, and UI settings not syncing or reverting unexpectedly. These problems degrade user experience and hinder effective interaction with the model.
- issues/16154, issues/16158, issues/16163, issues/16179, issues/16191, issues/16227, issues/16267
- Model Loading and Runtime Failures on Specific Backends: There are reports of model loading failures and runtime crashes on CUDA, Vulkan, and CANN backends due to missing hyperparameters, assertion errors, or memory allocation issues. These failures prevent successful model initialization or inference, impacting stability and usability on affected platforms.
- issues/16247, issues/16254, issues/16269
- Performance Regressions and Backend Efficiency Problems: Users report significant performance regressions and inefficiencies, including slower Vulkan prompt evaluation compared to CUDA, a 40% inference speed drop on ARM CPUs, and questions about flash-attention effectiveness on older GPUs. These issues highlight challenges in maintaining or improving backend performance across platforms.
- issues/16230, issues/16242, issues/16272
- Compilation and Build Failures on Various Architectures: Several issues describe compilation failures on platforms such as arm64 with GCC, HIP builds targeting CDNA architecture, and Vulkan-enabled Docker images due to missing files or incorrect CPU flag handling. These build problems hinder development and deployment on diverse hardware.
- issues/16153, issues/16237, issues/16248
- Feature Requests for Model and UI Enhancements: There are multiple feature requests including adding support for new models like dots.ocr, Qwen3-Omni-30B-A3B, Qwen3-VL, and openPangu-Embedded-7B-V1.1, as well as UI improvements such as custom endpoints for chat UI, stable semantic versioning releases, and updated README documentation. These requests aim to expand functionality and improve user experience.
- issues/16161, issues/16184, issues/16186, issues/16207, issues/16226, issues/16233, issues/16256
- Server and Streaming Output Issues: Problems with llama-server include UI desynchronization with command-line parameters, Markdown rendering failures, token drops during streaming output on CUDA, and inability to set parameters to zero in the WebUI. These issues affect server reliability and user control over model behavior.
- issues/16201, issues/16228, issues/16263, issues/16267
- Memory Management and CUDA Unified Memory Failures: The `GGML_CUDA_ENABLE_UNIFIED_MEMORY` feature does not work as documented, failing to prevent out-of-memory errors and showing limited support with performance drawbacks. This indicates challenges in managing GPU memory efficiently under pressure.
- issues/16197
- Quantization and Finetuning Failures: Issues report assertion errors during model finetuning and quantization processes, preventing successful training or compression of models on both CPU and CUDA builds, and on Windows with Python 3.12.11. These bugs block advanced model customization workflows.
- issues/16258, issues/16283
- Documentation and Installation Updates: Some issues address outdated documentation and installation instructions, such as the Termux wiki no longer requiring a specific PR wait and corrections to threshold values in the documentation, ensuring users have accurate setup guidance.
- issues/16223, issues/16259
- Miscellaneous Bugs and Research Requests: Additional reports include file loading bugs with certain extensions, Metal backend crashes on older macOS versions, and research into code review assistance tools similar to gemini-code-assist. These highlight ongoing maintenance and exploration needs.
- issues/16218, issues/16266, issues/16293
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 6
Summarized Issues:
- GPU and Vulkan Compatibility Issues: Several issues describe crashes and errors related to GPU support and Vulkan version compatibility. One issue reports a SIGABRT crash when calling `llama_supports_gpu_offload` on devices with Vulkan versions below 1.2 due to improper version handling (see the sketch after this list), while another details a ROCm error causing a crash during benchmarking on AMD GPUs, which was fixed by adjusting build flags to reduce GPU overhead.
- [issues/16142, issues/16175]
- User Interface Usability Enhancements: There is a request to improve the SvelteKit WebUI by always displaying action buttons like Copy, Edit, Delete, and Regenerate by default rather than only on mouse hover. This change aims to enhance usability and accessibility for users interacting with the interface.
- [issues/16155]
- Model Quantization Failures: A quantization process for the MiniCPM model fails due to a missing metadata key, `minicpm.embedding_scale`, causing the tool to error out because it cannot locate this required parameter. This issue highlights the importance of complete and accurate model metadata for successful quantization.
- [issues/16192]
- Thread Safety and False Positives in Testing: A data race was detected by ThreadSanitizer during thread safety tests on x86 and s390x architectures, but it was identified as a false positive caused by OpenMP. The problem was resolved by disabling OpenMP during the tests, ensuring accurate test results.
- [issues/16245]
- Build and Linkage Errors with OpenSSL: A compile error occurs when building with the `-DLLAMA_CURL=OFF` flag due to missing OpenSSL linkage, resulting in undefined references to `SSL_ctrl`. This happens because the build system does not add the OpenSSL libraries to the linker flags even though `cpp-httplib` requires them.
- [issues/16285]
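For the Vulkan version crash mentioned in the first group above, the snippet below sketches one defensive pattern: query the physical device's advertised API version before relying on Vulkan 1.2 functionality and skip unsupported devices instead of aborting. This is an illustrative sketch against the standard Vulkan C API, not the fix that was actually merged, and the helper name is invented.

```cpp
// Illustrative sketch only: reject devices that advertise a Vulkan API version
// below 1.2 up front, rather than crashing later with SIGABRT.
#include <vulkan/vulkan.h>

static bool device_supports_vulkan_1_2(VkPhysicalDevice device) {
    VkPhysicalDeviceProperties props = {};
    vkGetPhysicalDeviceProperties(device, &props);

    // apiVersion packs major/minor/patch; comparing against VK_API_VERSION_1_2
    // filters out devices that only report Vulkan 1.0 or 1.1.
    return props.apiVersion >= VK_API_VERSION_1_2;
}
```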
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 30
Key Open Pull Requests
1. ggml : add repack testing support: This pull request adds support for testing the ggml-cpu repack feature, which repackages quantized data into a more optimal layout for matrix multiplication on specific CPU architectures, enabling validation of CPU backend variants that use repacked data against a reference CPU backend that does not (a simplified sketch of this kind of reference comparison follows the key pull requests below).
- URL: pull/16182
- Merged: No
- Associated Commits: 77452, b6f2f, 922d8, d3016, d9e48, 22ef4, aba90, 5d75e, 8b6a0, 2e2c0, 84ac2, 56f3e, 7f032, 11f64, e3937, 743f7, caa91
2. ggml webgpu: support for rope,div,sub,glu,scale,cont operators: This pull request adds support for the ROPE, DIV, SUB, GLU, SCALE, and CONT operators in the ggml WebGPU backend by introducing new shader code, refactoring shader templates to unify binary operations, updating the cpy shader for the CONT operator, and enhancing tests to support inplace operations required by WebGPU buffer binding constraints.
- URL: pull/16187
- Merged: No
3. Model: Granite docling + Idefics3 preprocessing (SmolVLM): This pull request adds support for the IBM Granite Docling 258M model by enhancing the conversion scripts and implementing a detailed, tile-based image preprocessing pipeline for the idefics3 model in the llama.cpp project, closely aligning with the transformers library's approach to resizing, slicing, and tokenizing images to improve model performance.
- URL: pull/16206
- Merged: No
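As noted for the repack-testing pull request above, validating an optimized backend variant against a reference backend generally amounts to running the same operation on both and comparing the outputs within a tolerance. The sketch below shows that comparison pattern using normalized mean squared error; it is a generic illustration rather than the test harness added by the pull request, and all names in it are invented.

```cpp
// Generic reference-vs-optimized comparison, similar in spirit to validating
// repacked CPU kernels against a non-repacked reference. Not the ggml test code.
#include <cstdio>
#include <vector>

// Normalized mean squared error between a reference buffer and a test buffer.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = ref[i] - out[i];
        err  += d * d;
        norm += static_cast<double>(ref[i]) * ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}

int main() {
    // In a real test these would be the outputs of the reference backend and
    // the repacked backend running the same matrix multiplication.
    std::vector<float> reference = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> optimized = {1.0f, 2.0001f, 2.9999f, 4.0f};

    const double tol = 1e-6; // tolerance allows for reordered float operations
    const double e   = nmse(reference, optimized);
    std::printf("nmse = %g -> %s\n", e, e < tol ? "OK" : "MISMATCH");
    return e < tol ? 0 : 1;
}
```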
Other Open Pull Requests
- Mobile UI improvements: This set of pull requests enhances the mobile user interface by improving interactivity in sidebar conversation item actions and refining Alert Dialog and Dialog designs for mobile. It also adds a confirmation Alert Dialog for resetting settings to default, improving user experience on mobile devices.
- Metal backend matrix multiplication enhancements: These pull requests extend the Metal backend's matrix-matrix multiplication support by enabling operations with GGML_TYPE_F16 and removing the requirement for the first dimension to be a multiple of 32. They also add compile-time bounds checks, reduce shared memory usage, and optimize data loading and output bounds checks for better performance.
- Code organization refactor: This pull request refactors the large llama-model.cpp file by moving all llm_build_* definitions into separate class files within src/models/, improving code organization and maintainability.
- Continuous integration test coverage: This pull request extends CI tests to include scenarios where i8mm kernels are triggered with `nrc == 2`, addressing the previous limitation of only testing `nrc == 1`. This ensures proper validation of these kernels under more conditions.
- ACL graph matching improvements: This pull request improves ACL graph matching by recording the `ne` (shape) and `nb` (stride) information of source tensors and incorporating it into the graph matching check. This prevents incorrect matches when source tensors share the same data address but differ in shape or stride.
- Large matrix multiplication handling: This pull request addresses matrix multiplication operations involving an A matrix larger than 4GB by splitting operations into chunks along the M dimension. It also fixes stride setting order in mul_mm_cm2 to prevent stride clobbering, improving support for large im2col matrices in stable-diffusion use cases.
- llama-cli prompt token bug fix: This pull request fixes a bug where the last token of a user’s formatted prompt was incorrectly appended to the assistant response buffer before model-generated tokens were sampled. The fix updates the code to append tokens only when new, non-end-of-generation tokens are sampled, preventing prompt contamination in chat history.
- RPC server multi-device support: This pull request adds support for the rpc-server to expose multiple devices from a single endpoint by modifying the RPC protocol to include device identifiers and introducing a new API to retrieve device counts.
- Enhanced backend library error messages: This pull request improves the error message generated by `ld_load_library()` to include detailed root-cause information when loading a backend library fails. This aids in diagnosing issues such as missing libraries, missing dependencies, or unresolved symbols (a minimal sketch of this kind of diagnostic appears after this list).
- llama-server download improvements: This pull request proposes implementing a progress bar and multi-connection downloads to enhance the llama-server pulling functionality.
- Vulkan backend ACC_TYPE_VEC2 implementation: This pull request adds `ACC_TYPE_VEC2` support in the Vulkan backend, improving caching and performance for non-`coopmat` shaders by enabling more efficient 32-bit value access. Benchmark results on an NVIDIA GeForce RTX 4060 Ti demonstrate these improvements.
- ROCWMMA 2.0.0 compile-time bug fix: This pull request redesigns selection conditions for the WMMA fattn kernel to disable its compilation on CDNA architectures with ROCWMMA 2.0.0 and on RDNA4 with older ROCWMMA versions. This prevents faulty fp16 accumulation emulation.
- Documentation updates: These pull requests provide clearer wording on the meaning of the -t or --threads parameter and correct documentation for the XTC threshold feature arguments to improve clarity and accuracy.
- Ubuntu arm64 CPU flag detection fix: This pull request addresses an issue on Ubuntu 20.04 arm64 where GCC 9 and 12 fail to detect correct CPU flags using "gcc -mcpu=native -E -v -". It proposes using gcc -march instead to obtain correct CPU flags and avoid compilation failures.
- llama_token_data structure update: This pull request merges the `logit` and `p` fields into a single `score` field, with a new `raw` boolean indicating whether the value is a raw logit or a normalized probability. This resolves issues caused by sequential samplers modifying probabilities and applying softmax multiple times (see the structure sketch after this list).
- kleidiai library fp16 fixes and update: This pull request fixes work size and thread synchronization issues for fp16 operations in the kleidiai library and updates it to version 1.14.0.
- AMD V710 GPU CI integration and benchmarking: This pull request adds CI runners and workflows using AMD V710 GPUs and reports performance benchmarks showing AMD GPU-based CI runs are significantly slower than expected. It seeks advice on potential misconfigurations or additional setup to improve AMD GPU performance.
- Ascend operators FP16 native support: This pull request updates Ascend operators such as `get_rows`, `rms_norm`, and `flash_attn_ext` to natively support the FP16 data format, reducing unnecessary FP32-FP16 casting and improving computational efficiency. It achieves about a 10% performance gain, validated on the Qwen2 0.5b model.
- gpt-oss forced tool call reasoning fix: This pull request fixes gpt-oss models to perform reasoning before making a forced tool call when the `tool_choice` parameter is set to required.
- musa compiler flags update: This pull request updates compiler flags in the musa component to achieve minor performance improvements on MTGPU and resolve build warnings in recently updated files.
- FP16 intermediate results demo for graph inference: This pull request demonstrates using FP16 for intermediate results in graph inference to reduce computation and improve speed. It modifies operators for type inference, adds FP16 support for GET_ROWS, and casts outputs back to FP32, showing 3%–10% performance improvements on several models with the CANN backend.
- convert_hf_to_gguf_update.py script update: This pull request updates the conversion script by adding Stockmark and verifies the correctness of the script before and after changes.
- Svelte web UI download action addition: This pull request adds a download action in the Svelte web UI replicating functionality from a previous React implementation, including a filename prefix derived from the start of the conversation text.
- HIP backend bpermute to swizzle optimization: This pull request replaces bpermute instructions with native swizzle operations in the HIP backend for GFX906 architecture, resulting in an average 20% inference speed improvement without degrading model quality. The implementation and dispatch logic are contained in the common.cuh file.
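Regarding the backend library error message item above, the sketch below shows the general pattern of surfacing the dynamic loader's own diagnostic when a library fails to load, here using the POSIX `dlopen`/`dlerror` API. It is an assumption-laden illustration of the idea, not the code from the pull request, which targets `ld_load_library()`.

```cpp
// Sketch: include the dynamic loader's root-cause message in the failure report,
// using POSIX dlopen/dlerror. Illustrative only.
#include <cstdio>
#include <dlfcn.h>

static void * load_backend_library(const char * path) {
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        // dlerror() returns a human-readable description of the last failure,
        // e.g. a missing dependency or an unresolved symbol.
        const char * reason = dlerror();
        std::fprintf(stderr, "failed to load backend library '%s': %s\n",
                     path, reason != nullptr ? reason : "unknown error");
    }
    return handle;
}
```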
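For the `llama_token_data` change listed above, the sketch below contrasts the current public struct from llama.h with one possible shape of the proposed merged field. The second struct is purely illustrative; the exact field names and layout in the pull request may differ.

```cpp
#include <cstdint>

typedef int32_t llama_token; // as defined in llama.h

// Current public struct (mirrors llama.h):
struct llama_token_data_current {
    llama_token id;    // token id
    float       logit; // log-odds of the token
    float       p;     // probability of the token
};

// One possible shape of the proposed change (illustrative only): a single score
// whose meaning is disambiguated by a flag, so chained samplers cannot leave
// logits and probabilities in disagreement after repeated softmax passes.
struct llama_token_data_proposed {
    llama_token id;    // token id
    float       score; // raw logit or normalized probability, depending on `raw`
    bool        raw;   // true: `score` is a raw logit; false: a normalized probability
};
```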
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 61
Key Closed Pull Requests
1. Master secure ggml rpc: This pull request proposes initial work on securing the ggml RPC mechanism, as indicated by the commit titled "first commit of secure rpc," but it was not merged into the master branch.
- URL: pull/16281
- Merged: No
- Associated Commits: 2c946, 6dbf1, f20a7, 40fdc, c0e18, b22dc, bdae4, 199ae, a0196, 9e622, 1872d, d3c31, bb0c2, 86b13, d10b9, 838e5, 61861, 2c487, cb118, 4d552, 9519b, 1a4ca, def12, 32978, 1b23d, d0a69, 3823c, b4a3a, 4653d, 76985, f092d, 714a0, 223ba, f7ee0, f707b, 2e2c1, 1a3b9, a787b, 2cd96, a3ced, de11e, 8f549, e7e34, 367ca, ba522, c9386, c9906, ca6c6, ad8c5, c5ae5, e7a8f, cc34c, 97ca8, 4daa3, 06943, 390ef, 8b57b, 45898, 64925, e5819, 3e534, a8f42, d9676, 9fdf1, be08d, a1627, a3262, e09bd, 5c89e, 78000, f0887, a4823, 0a8d6, 78b55, 17d4a, b1593, df220, 4aa91, 84844, d67e0, 335b8, d71dc, 1cc47
2. ggml-cpu: implement MXFP4 SIMD for s390x: This pull request implements the MXFP4 SIMD instruction set for the s390x platform in the ggml CPU backend, resulting in significant performance improvements of over 159% for prompt processing and 136% for token generation, validated through benchmarks on an IBM z17 mainframe and extensive testing across multiple models.
- URL: pull/16193
- Merged: Yes
- Associated Commits: 618ef, 6549e, 377d0, 35389, cf927, ae718, f7e75, 5fb1b, 4f85c, 1fe55, 1f99e, 96cba
3. ggml : implement set_rows with i32 index: This pull request implements support for using a 32-bit integer (i32) index in the `set_rows` function across multiple backends including CPU, CUDA, Metal, OpenCL, SYCL, Vulkan, and CANN, while disabling it for WebGPU due to implementation challenges.
- URL: pull/16159
- Merged: Yes
Other Closed Pull Requests
- Continuous Integration Improvements: Several pull requests enhance the continuous integration (CI) workflows by switching to GitHub-hosted machines for x64 and ARM tests, optimizing CPU core usage, updating runner allocations, and reducing test runtimes with smaller models. These changes improve efficiency, reduce runtime, and address hardware availability issues to streamline the CI process.
- Vulkan Backend Fixes and Enhancements: Multiple pull requests improve the Vulkan backend by fixing bugs in index calculations for vector dot matrix multiplications, improving initialization error handling to avoid crashes on older devices, and adding support for arbitrary key-value dimensions in flash attention. These updates increase stability and compatibility across devices and use cases.
- CUDA and Metal Backend Optimizations: Pull requests refactor CUDA FlashAttention kernels for better performance and flexibility, add optimized CUDA matrix multiplication operations for specific batch sizes, and unify RMS_NORM and NORM implementations in the Metal backend with extended input shape support. These improvements deliver significant speedups and enhanced backend capabilities.
- File and Attachment Handling Enhancements: Updates include improved detection logic for text and binary file attachments with configurable heuristics and expanded support for additional text file types like LaTeX and BibTeX. These changes enhance file type recognition and handling within the project.
- Caching and Offline UI Improvements: One pull request implements caching of the `/props` response and adds a user interface that allows interaction with conversations even when the llama server is down, ensuring the chat UI remains functional using cached data and graceful offline handling.
- Codebase Cleanup and Documentation: A pull request updates the CODEOWNERS file and removes obsolete examples and scripts such as `gritlm`, while another refactors the zDNN codebase by organizing operations into individual files, adding backend documentation, and updating the README to list zDNN as an available backend. These efforts improve maintainability and clarity.
- Web UI Routing and Interface Updates: Changes include switching the web UI routing to hash-based routing in SvelteKit to improve subdirectory deployment compatibility and updating the message UI to always display message actions by default, enhancing user accessibility and interaction.
- Model Support and Labeling Fixes: Updates add a correct label for the LiquidAI LFM2-2.6B model by fixing its identification parameter and update the MiniCPM model loader to treat certain GGUF metadata keys as optional with legacy defaults, restoring backward compatibility and preventing quantization failures.
- Build and Vendor File Fixes: Pull requests fix false positive build warnings in the miniaudio.h vendor file triggered by GCC 13.3+ and address s390x Docker build failures by synchronizing parallel builds and resolving related warnings, ensuring smoother build processes.
- Code Quality and Cleanup: Some pull requests focus on code quality by disabling specific clang-tidy warnings about braces in if statements and removing unused local variables overridden in loops, resulting in cleaner and more maintainable code.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ggerganov | 161 | 24 | 0 | 42 |
taronaeo | 151 | 10 | 2 | 28 |
danbev | 90 | 12 | 0 | 3 |
CISC | 36 | 6 | 0 | 62 |
ngxson | 48 | 4 | 0 | 38 |
jeffbolznv | 39 | 13 | 0 | 35 |
pwilkin | 44 | 3 | 2 | 19 |
JohannesGaessler | 30 | 4 | 0 | 31 |
0cc4m | 20 | 1 | 0 | 31 |
allozaur | 30 | 5 | 3 | 13 |