Weekly GitHub Report for Llama.cpp: January 27, 2025 - February 03, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4611
1.2 Version Information:
The version released on February 1, 2025 introduces key updates and changes, but the release data does not include specific details, so no notable highlights or trends can be identified for this version.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit: This issue involves benchmarking the performance of the DeepSeek-R1 IQ1_S 1.58bit model using llama.cpp, focusing on various stages such as background research, hypothesis formation, and analysis of results. The analysis includes detailed performance metrics like token sampling speed, model loading time, prompt evaluation, and generation evaluation, highlighting bottlenecks and overall performance characteristics.
- The comments discuss discrepancies in reported token generation speeds, with clarifications provided about the correct speeds for prompt evaluation and generation. Users share their testing results on different hardware setups, suggesting optimizations and discussing potential improvements in performance. There is also a discussion about the impact of different configurations and hardware on the model's performance, with users sharing their experiences and results.
- Number of comments this week: 32
- Misc. bug: AMD Rcom command error only with cli tools: This issue involves a bug with the AMD ROCm backend that occurs only when using the CLI tools in the llama.cpp project, specifically affecting older AMD hardware like the Vega 56. The problem is that while other tools that use llama.cpp as an engine work correctly, the llama-cli tool itself fails to offload the model to GPU memory, and the user is seeking assistance to resolve this anomaly.
- The comments discuss various troubleshooting steps, including trying different builds and configurations, such as building against an older LLVM version and using specific compiler flags. There are also discussions about potential issues with multi-GPU setups, memory allocation errors, and suggestions to test with different models and settings. The conversation includes technical exchanges about debugging and potential solutions, with some users experiencing similar issues and sharing their findings.
- Number of comments this week: 18
- Compile bug: ios swift xcode build error when upgrade to llama : use cmake for swift build: This issue involves a compilation error encountered when building an iOS Swift project using Xcode after upgrading to a new version of the llama.cpp library, which now requires using CMake for the build process. The error manifests as the inability to find certain types and functions in scope, which were previously accessible before the upgrade.
- The comments discuss the need to use CMake for building the library and the challenges of integrating it with Swift projects, especially for iOS. Users share their experiences and solutions, such as modifying build settings and using CMake to generate frameworks instead of dynamic libraries. There is also a discussion about the limitations of shipping dynamic libraries on iOS and potential workarounds, including creating an XCFramework or reverting changes in the Package.swift file.
- Number of comments this week: 10
- Feature Request: mixed ROCm+CUDA possible?: This issue is a feature request to enable the use of both ROCm and CUDA backends simultaneously in the llama.cpp project, as the current build only lists CUDA devices despite successful compilation with both backends. The motivation behind this request is to allow users to utilize all available GPUs, similar to the existing CUDA+Vulkan mix functionality.
- The comments discuss attempts to resolve memory access faults when using both ROCm and CUDA, with one user sharing a workaround involving renaming functions and changing symbol visibility. Another user suggests using the RPC backend for this feature, while others debate the effectiveness of dynamically loading backends to avoid symbol conflicts.
- Number of comments this week: 9
- Misc. bug: llama-server with rpc oom's allocation even though plenty left on devices: This issue involves a bug in the llama-server module where the system runs out of memory (OOM) during allocation, despite having sufficient VRAM available on the devices. The problem occurs when running RPC servers, leading to inefficient memory usage and eventual crashes, particularly when attempting to load a large model.
- The comments discuss troubleshooting steps, including not running RPC servers for local devices and using the `--tensor-split` option to manage memory allocation across devices. There is a noted discrepancy in device order between `--list-devices` and `--tensor-split`, with suggestions to reorder devices for performance optimization. The conversation also touches on potential documentation updates and the challenge of efficiently utilizing VRAM across local and remote devices (a brief API-level sketch of these split settings follows this list).
- Number of comments this week: 8
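As context for the `--tensor-split` and device-order discussion above, the sketch below shows how the same ideas surface in the llama.cpp C API: devices are enumerated through the ggml backend registry, and per-device split ratios are passed via `llama_model_params`. This is a minimal, hedged illustration assuming a recent llama.h/ggml-backend.h (for example, `llama_model_load_from_file` replaced the older `llama_load_model_from_file` in early 2025 builds); it is not code from the issue thread, and the two-entry split array is just an example for a two-GPU machine.

```cpp
// Minimal sketch, not code from the issue: enumerate backend devices (the
// ordering that split ratios apply to) and load a model with a per-device
// tensor split via the llama.cpp C API.
#include "llama.h"
#include "ggml-backend.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    // Print the devices the backend registry knows about, in registry order.
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i, ggml_backend_dev_name(dev), ggml_backend_dev_description(dev));
    }

    // Example ratios for a two-GPU setup: 3 parts on device 0, 1 part on device 1.
    // The array should provide one entry per device the build can see.
    static const float tensor_split[] = { 3.0f, 1.0f };

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                     // offload as many layers as possible
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split whole layers across devices
    mparams.tensor_split = tensor_split;

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        llama_backend_free();
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```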
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 49
Summarized Issues:
- Quantization and Model Conversion Issues: Users are encountering various issues related to quantization and model conversion in the llama.cpp project. These include requests for 4-bit quantization support to improve mobile performance, difficulties in converting models to GGUF format due to errors in tensor mapping, and a failure to quantize models using specific methods. These issues highlight the need for improved support and documentation for model conversion and quantization processes (a minimal quantization sketch follows this list).
- Bug Reports and Performance Issues: Several users have reported bugs and performance issues in the llama.cpp project. These include segmentation faults, crashes during inference, and performance regressions on specific hardware setups. These issues often involve compatibility problems with different operating systems and hardware configurations, indicating a need for more robust testing and error handling.
- Feature Requests for Model and Tool Enhancements: There are multiple feature requests aimed at enhancing the capabilities of the llama.cpp project. These include requests for integrating new models, improving inference efficiency, and adding new functionalities like dynamic context resizing and prebuilt binaries. These requests reflect the community's desire for a more versatile and user-friendly toolset.
- Compilation and Build Issues: Users are facing challenges with compiling and building the llama.cpp project on various platforms. These issues include errors related to compiler identification, missing files, and outdated dependencies, which hinder the successful compilation of the project. Addressing these issues would improve the build process and accessibility for developers.
- Web Interface and API Issues: The llama.cpp project's web interface and API are experiencing several issues, including bugs in the chat functionality and ignored parameters in API requests. These issues affect user interaction and the reliability of the web interface, suggesting a need for improvements in the user experience and API functionality.
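To make the quantization reports above more concrete, here is a hedged sketch of the C API path that the bundled llama-quantize tool drives; field and enum names assume a recent llama.h, and the Q4_K_M target is just one example of a 4-bit format.

```cpp
// Minimal sketch, assuming a recent llama.h: quantize an input GGUF model to a
// 4-bit Q4_K_M file. A non-zero return is the kind of failure reported when a
// tensor cannot be mapped or a quantization method is unsupported.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <input-f16.gguf> <output-q4_k_m.gguf>\n", argv[0]);
        return 1;
    }

    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M; // 4-bit k-quant target
    qparams.nthread = 8;                         // worker threads for quantization

    if (llama_model_quantize(argv[1], argv[2], &qparams) != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```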
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 24
Summarized Issues:
- Model Loading and Execution Errors: Issues related to model loading and execution errors in llama.cpp highlight various challenges, such as runtime errors and assertion failures. These problems often stem from mismatches in configurations or compatibility issues with specific hardware or software environments.
- Feature Requests and Enhancements: Several issues propose new features or enhancements for llama.cpp, such as support for new models or quantization types. These requests aim to improve the project's functionality and performance, addressing specific user needs and expanding its capabilities.
- Compilation and Build Issues: Compilation and build issues in llama.cpp often arise from compatibility problems with compilers or specific system configurations. These issues can prevent successful builds and require updates or configuration changes to resolve.
- Performance and Optimization Concerns: Performance issues, such as slow execution or inefficient resource usage, are reported in llama.cpp, particularly when using specific backends or hardware. These concerns highlight the need for optimization and potential workarounds to improve performance.
- Template and Configuration Bugs: Bugs related to templates and configurations in llama.cpp can lead to crashes or unexpected behavior. These issues often involve unsupported templates or configuration options that need to be addressed to ensure stability.
- Parameter and Option Changes: Changes in parameters or options, such as the removal or modification of command-line arguments, can lead to confusion or unexpected behavior in llama.cpp. Users often seek clarification or alternatives to adapt to these changes.
- Backend and Hardware Compatibility: Compatibility issues with specific backends or hardware configurations can cause errors or suboptimal performance in llama.cpp. These issues often require updates or configuration changes to ensure proper functionality.
- Docker and Platform Issues: Problems with Docker images or platform compatibility can lead to warnings or errors in llama.cpp. These issues often require updates to Docker manifests or platform-specific configurations to resolve.
- Library and Dependency Bugs: Bugs in libraries or dependencies used by llama.cpp can cause errors during execution. These issues often require updates or patches to the affected libraries to resolve.
- Clustering and Multi-node Support: The need for clustering and multi-node support in llama.cpp is discussed to enhance performance and scalability. These discussions explore potential solutions and related projects to achieve these goals.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 24
Key Open Pull Requests
1. llama: Add support for RWKV v7 architecture: This pull request introduces support for the RWKV v7 architecture in the llama.cpp project, including the implementation of a new `GGML_OP_RWKV_WKV7` operation for the core architecture, L2 normalization, and inference support for RWKV7 and ARWKV7 models across multiple backends such as CPU, CUDA, SYCL, Vulkan, and Metal, while also optimizing performance and addressing various model-specific enhancements and bug fixes.
- URL: pull/11452
- Merged: No
- Associated Commits: a44e9, 666d7, ea20c, 68689, 9b06a, 694b5, 5f4dc, 6c159, d11f4, b4e6c, 3b4ec, e9c63, 6588c, 8cfa1, 01c78, f48c2, e8c4b, 0c43f, 5b372
2. SYCL: Kernel function refactor: This pull request refactors the SYCL backend by removing the `ggml_sycl_op_flatten` function, integrating its responsibilities directly into kernel functions to avoid unnecessary type conversions and improve numerical stability, while also introducing flexibility for additional data types, removing unused variables, and addressing several code organization tasks such as sorting includes and adding exception handling.
- URL: pull/11515
- Merged: No
- Associated Commits: 0cb29, 1ea57, 7d8a4, 57e2d, 1d5ad, c9f41, 98de6, 10ab9, 414e6, 9c894, c07b0, 387c5, 11fe7, 70692, 498db
3. Optimized DeepSeek V2/V3 implementation (MLA): This pull request introduces optimizations to the DeepSeek V2/V3 implementation by caching latent representations, replacing the naive attention mechanism with a more efficient one based on intermediate representations, and splitting model tensors to improve inference performance, while also addressing CUDA performance issues and planning further improvements such as removing unused tensors and supporting older model files.
- URL: pull/11446
- Merged: No
Other Open Pull Requests
- Vulkan Backend Enhancements: Several pull requests focus on improving the Vulkan backend's performance and functionality. These include optimizing cooperative matrix callbacks for iq2 and iq3 types, introducing simpler Kompute MAT_MUL shaders for better compatibility with embedded GPUs, and addressing issues like crashes when Vulkan is unavailable and memory allocation improvements to reduce fragmentation.
- Quantization Support in Vulkan: The Vulkan backend sees enhancements with the introduction of IQ1_S, IQ1_M, and IQ4_XS quantizations. These updates aim to provide performance comparable to existing quantizations and include optimizations for shared memory usage and performance metrics for specific devices.
- Documentation Updates: Updates to documentation include adding information about the IRIS Android app and ChatPDFLocal MacOS application. These updates aim to provide users with more resources and examples of applications using llama.cpp.
- Performance Optimizations: Various pull requests focus on performance improvements across different components. These include optimizing SIMD instructions for WebAssembly, enhancing Flash Attention in Deepseek V3 models, and introducing a NUMA-aware key-value cache buffer for multi-CPU systems.
- Code and Build System Improvements: Enhancements to the codebase and build system include updates to CMakeLists.txt for Windows version detection, using `#define` directives for color naming, and fixing issues in the continuous integration environment for openEuler.
- CUDA and FlashAttention Enhancements: A new CUDA FlashAttention kernel is proposed to replace the existing one, utilizing PTX instructions for better performance. This update aims to improve performance for large batch sizes and maintain compatibility with newer architectures.
- Miscellaneous Enhancements: Other enhancements include introducing a lambda function for slot type handling, support for tool-calls in llama-cli, and loading all experts in MoE models during warmup. These updates aim to reduce code duplication, enhance functionality, and address specific issues.
- Precision and Bug Fixes: Fixes for precision issues and bugs include modifications to the minicpm-v code and addressing shared memory size checks in the Vulkan backend. These updates ensure better accuracy and compatibility across different configurations.
- Warp Size and Performance Improvements: Support for selectable warp sizes in the mmv component is introduced to enhance performance on devices with non-standard warp sizes. This update results in significant performance improvements on specific architectures.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 54
Key Closed Pull Requests
1. Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars: This pull request introduces tool call support for various models, including Llama, Functionary, Hermes, Mistral, Firefunction, and DeepSeek, by implementing a minimal Jinja templating engine and lazy grammars to facilitate native and generic tool call styles, enhancing the functionality and compatibility of the llama.cpp project with different model architectures and tool schemas (a hedged request-shape sketch follows the key pull requests below).
- URL: pull/9639
- Merged: Yes
- Associated Commits: ec9f3, 9a86e, c8809, 7fde6, dd6d0, 168ad, ec547, b51c7, 74d71, b8254, aefac, 64287, fa4c1, 773ff, 92c38, 3ebdb, 35ac1, 52273, e4d54, 61655, be9de, 54285, 7d9c9, e8d9d, c395d, f5b78, c7735, b35aa, 9477c, c4a80, f5f74, fe967, 479c1, bc52c, c059a, 5789f, f9b19, adc67, 1afa3, 30fbc, a469f, cbe39, 1fd5f, 5d003, 1f0b1, 93a52, 05505, 1e211, 7bfcd, 7e3fe, e70ce, f0bd6, f6458, 0e87a, 0a5d5, a2fe8, 523eb, e7ff6, 7a7d6, e183f, 01072, d47f4, 3c778, 138a4, 045ed, 2ceab, 259d9, acf7c, 76893, d6f05, c207f, 0401a, 9bab6, b1103, 7ea6a, 56aa9, ba8dd, c6062, fec02, b49d0, f6e73, 77f40, dbf84, ef61a, 39729, d77fe, 5268e, 9e8b4, 03fe8, 41a61, e2116, 28cac, 2dd09, 01b34, 82b6e, 63387, a4226, cce11, c6a22, 30d33, 9ccc6, d1867, f0231, 5e358, cdfa8, a46de, c2d83, 46415, 36ed1, c479d, 0208b, a6463, 51b7a, 3f3fc, 11594, 43385, 5ec4c, f7078, ca0c8, bddc1, da606, 15ec0, 2efa0, 57f40, 67709, 09971, 92ac3, 118f7, add91, fa065, ad229, 90eff, cafea, b565a, 2d607, ef9ef, 62717, 6d568, 2f992, 0a51e, d274f, 62d45, ec4ae, b5a74, ba10b, cd63b, cad14, 4f257, d603d, 64263, 4cdbb, 47be4, 18d5a, 4a1e8, 923c8, 384f5, 40cc3, 41eec, 76f6a, 77dd6, 0f8af, babde, 68202, 7b5e0, ba27e, 6e676, ed7c6, 36c77, bc8a6, 84bc0, 2b245, 64545, cbecb, a810c, 77c60, d86a1, 77455, 590c9, f8e14, 81547, 18450, b831a, 76359, 9591a, 8ef37, 2d51c, c88f4, 3dcde, 06c4c, 0c171, 96850, 2bb3f, 7d59b, 5a64a, f223d, 82052, 5add2, 1029f, 3bd6a, 729d2, 34f54
2. deprecated: This pull request, which was not merged, aimed to introduce various updates and improvements to the llama.cpp project, including support for ggml, updates to the README after renaming GGML, enhancements for omni-audio and qwen2-audio, removal of unnecessary builds, updates to C++17 for compilation, addition of omni-vlm examples in C++ and Python, and several bug fixes and optimizations related to memory leakage, build processes, and model inference.
- URL: pull/11568
- Merged: No
- Associated Commits: 5f815, 3a355, 4a29b, c7b91, f0d1c, 9e67e, 4bdc7, d277c, 995ba, a4747, 6f1ed, 14196, d42e0, 05853, 91b3c, d6c06, 983b4, b535c, 38c6f, 22da7, 5574b, b24a4, 6a4cf, 5edad, 20b9f, 3dfac, df584, 86c22, 400fc, b1768, 16c22, 3d9c6, eb6d5, 8c417, d5df5, ecfe0, 667a6, d04e3, 21bc8, 6f0e8, 7cf07, 362bd, 5f2d9, 55953, 82dbd, 89bcf, 4e801, 98297, bb334, fc255, b9845, aad01, 8e2e6, e4ca9, 25190, fe792, fd2c5, 75891, bbf1a, 46021, fe8c7, 43f41, 3479f, 809db, a2c53, 661b3, 0b15d, 71b56, 97267, be54c, ca7e8, 07c7f, b86cd, b2958, 64a60, 5962b, 1487d, 9201d, a4ee5, 23649, 37b57, e39e2
3. cmake: add ggml find package: This pull request introduces a CMake find package for the ggml library, enabling users to link specific backends or all backends collectively through per-backend `ggml::` targets and the collective `ggml::all` target, while also requiring explicit backend requests when using the llama find-package.
- URL: pull/11369
- Merged: Yes
- Associated Commits: 530fd, 5b4c1, 314f2, b14e8, ea0a8, 09ab0, 817cf, 1760b, 65b0d, 6388d, bf444, 835e0, 7f3c2, c2332
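As a hedged illustration of the tool-call support introduced in the first key pull request above, the snippet below prints the kind of request body such a feature is designed to accept. It assumes llama-server's OpenAI-compatible /v1/chat/completions endpoint and an OpenAI-style "tools" array; the get_weather function and field values are made up for the example and are not taken from the pull request.

```cpp
// Hypothetical request-shape sketch: the JSON below would be POSTed to a running
// llama-server (e.g. http://localhost:8080/v1/chat/completions) by an HTTP client.
// The tool declaration follows the OpenAI "tools" schema; names are illustrative.
#include <cstdio>

int main() {
    static const char * payload = R"JSON({
  "messages": [
    { "role": "user", "content": "What is the weather in Tokyo?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ]
})JSON";

    puts(payload);
    return 0;
}
```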
Other Closed Pull Requests
- Vulkan Backend Optimization: This topic covers the implementation of initial support for IQ2 and IQ3 quantizations in the Vulkan backend, optimizing performance for various quantization types. Additionally, it addresses the issue of pipeline creation failure in Vulkan by implementing error message logging and fixes for warnings related to a previous on-demand compile change.
- Continuous Integration Improvements: The use of `ccache-action` across all CI workflows significantly reduces build times and addresses cache management issues. Another pull request simplifies the CMake build commands in the CI process, ensuring a more streamlined integration.
- ARM and Vulkan Build Fixes: This topic addresses issues with ARM and Vulkan builds by downgrading to Ubuntu Jammy and includes several commits such as separating ARM64/AMD64 builds. It also fixes the build process for CPU architecture arm64 in the CI setup.
- Metal Backend Optimization: Implementing residency sets in the Metal backend keeps allocated memory wired, reducing overhead and improving request speeds. This pull request also provides an option to disable residency sets for cases where GPU memory collection by the OS is preferred.
- Llama 3.x Compatibility: This topic ensures better integration with the pydantic_ai package, updating the README, and implementing various fixes for Llama 3.x and Functionary 3.2. It also addresses a bug by ensuring the linefeed token for models like Llama-3 is correctly identified.
- Docker and Python Compatibility: This topic enables the installation of pip packages system-wide during the Python libraries installation step in Docker, addressing compatibility issues with Ubuntu 24.04's Python version. It also adds new arguments to the `tools.sh` script to facilitate performance benchmarking and perplexity evaluation for models using the Vulkan backend.
- Windows Build Improvements: This topic addresses the issue with the Windows SYCL build by replacing `ccache` with `sccache` in the CI process. It also involves reverting the Windows HIP build process to use plain ccache due to compatibility issues with sccache.
- Documentation Updates: This topic updates the server's README.md file to include documentation on the response format for the `/apply-template` route. It also updates the README documentation by adding relative links to reference examples.
- Performance Optimization: This topic introduces minor optimizations to improve the loading speed of the Llama model by approximately 20%. It also addresses performance issues caused by excessive use of host-visible video memory in Vulkan, implementing a heuristic to avoid using host-visible vidmem when it becomes "mostly full."
- Bug Fixes and Error Handling: This topic addresses a segmentation fault error by handling null values returned from the `MTLCreateSystemDefaultDevice()` function. It also addresses a bug in the `llama-run` application by adding a check for the required model parameter to prevent crashes.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 138 | 32 | 3 | 148 |
| ochafik | 248 | 16 | 0 | 30 |
| ngxson | 106 | 22 | 1 | 151 |
| slaren | 15 | 8 | 1 | 104 |
| jeffbolznv | 21 | 16 | 0 | 58 |
| JohannesGaessler | 18 | 7 | 0 | 38 |
| ericcurtin | 16 | 16 | 0 | 29 |
| 0cc4m | 4 | 2 | 1 | 48 |
| danbev | 25 | 12 | 1 | 14 |
| qnixsynapse | 32 | 5 | 0 | 12 |