Weekly GitHub Report for Llama.cpp: January 27, 2025 - February 03, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4611
1.2 Version Information:
The version released on February 1, 2025 introduces key updates and changes, but the release data does not include specific details, so no notable highlights or trends can be identified for this version.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit: This issue involves benchmarking the performance of the DeepSeek-R1 IQ1_S 1.58bit model using llama.cpp, focusing on various stages such as background research, hypothesis formation, and analysis of results. The analysis includes detailed performance metrics like token sampling speed, model loading time, prompt evaluation, and generation evaluation, highlighting bottlenecks and overall performance characteristics.
- The comments discuss discrepancies in reported token generation speeds, with clarifications provided about the correct speeds for prompt evaluation and generation. Users share their testing results on different hardware setups, suggesting optimizations and discussing potential improvements in performance. There is also a discussion about the impact of different configurations and hardware on the model's performance, with users sharing their experiences and results.
- Number of comments this week: 32
- Misc. bug: AMD Rcom command error only with cli tools: This issue involves a bug with the AMD ROCm backend that occurs only when using the CLI tools in the llama.cpp project, specifically affecting older AMD hardware like the Vega 56. The problem is that while other tools that use llama.cpp as an engine work correctly, the llama-cli tool itself fails to offload the model to GPU memory, and the user is seeking assistance to resolve this anomaly.
- The comments discuss various troubleshooting steps, including trying different builds and configurations, such as building against an older LLVM version and using specific compiler flags. There are also discussions about potential issues with multi-GPU setups, memory allocation errors, and suggestions to test with different models and settings. The conversation includes technical exchanges about debugging and potential solutions, with some users experiencing similar issues and sharing their findings.
- Number of comments this week: 18
- Compile bug: ios swift xcode build error when upgrade to llama : use cmake for swift build: This issue involves a compilation error encountered when building an iOS Swift project using Xcode after upgrading to a new version of the llama.cpp library, which now requires using CMake for the build process. The error manifests as the inability to find certain types and functions in scope, which were previously accessible before the upgrade.
- The comments discuss the need to use CMake for building the library and the challenges of integrating it with Swift projects, especially for iOS. Users share their experiences and solutions, such as modifying build settings and using CMake to generate frameworks instead of dynamic libraries. There is also a discussion about the limitations of shipping dynamic libraries on iOS and potential workarounds, including creating an XCFramework or reverting changes in the Package.swift file.
- Number of comments this week: 10
- Feature Request: mixed ROCm+CUDA possible?: This issue is a feature request to enable the use of both ROCm and CUDA backends simultaneously in the llama.cpp project, as the current build only lists CUDA devices despite successful compilation with both backends. The motivation behind this request is to allow users to utilize all available GPUs, similar to the existing CUDA+Vulkan mix functionality.
- The comments discuss attempts to resolve memory access faults when using both ROCm and CUDA, with one user sharing a workaround involving renaming functions and changing symbol visibility. Another user suggests using the RPC backend for this feature, while others debate the effectiveness of dynamically loading backends to avoid symbol conflicts.
- Number of comments this week: 9
- Misc. bug: llama-server with rpc oom's allocation even though plenty left on devices: This issue involves a bug in the llama-server module where the system runs out of memory (OOM) during allocation, despite having sufficient VRAM available on the devices. The problem occurs when running RPC servers, leading to inefficient memory usage and eventual crashes, particularly when attempting to load a large model.
- The comments discuss troubleshooting steps, including not running RPC servers for local devices and using the `--tensor-split` option to manage memory allocation across devices. There is a noted discrepancy in device order between `--list-devices` and `--tensor-split`, with suggestions to reorder devices for performance optimization. The conversation also touches on potential documentation updates and the challenge of efficiently utilizing VRAM across local and remote devices (a brief API-level sketch of these split settings follows this list).
- Number of comments this week: 8
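As context for the `--tensor-split` and device-order discussion above, the sketch below shows how the same ideas surface in the llama.cpp C API: devices are enumerated through the ggml backend registry, and per-device split ratios are passed via `llama_model_params`. This is a minimal, hedged illustration assuming a recent llama.h/ggml-backend.h (for example, `llama_model_load_from_file` replaced the older `llama_load_model_from_file` in early 2025 builds); it is not code from the issue thread, and the two-entry split array is just an example for a two-GPU machine.

```cpp
// Minimal sketch, not code from the issue: enumerate backend devices (the
// ordering that split ratios apply to) and load a model with a per-device
// tensor split via the llama.cpp C API.
#include "llama.h"
#include "ggml-backend.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    // Print the devices the backend registry knows about, in registry order.
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i, ggml_backend_dev_name(dev), ggml_backend_dev_description(dev));
    }

    // Example ratios for a two-GPU setup: 3 parts on device 0, 1 part on device 1.
    // The array should provide one entry per device the build can see.
    static const float tensor_split[] = { 3.0f, 1.0f };

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                     // offload as many layers as possible
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split whole layers across devices
    mparams.tensor_split = tensor_split;

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        llama_backend_free();
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```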
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 49
Summarized Issues:
- Quantization and Model Conversion Issues: Users are encountering various issues related to quantization and model conversion in the llama.cpp project. These include requests for 4-bit quantization support to improve mobile performance, difficulties in converting models to GGUF format due to errors in tensor mapping, and a failure to quantize models using specific methods. These issues highlight the need for improved support and documentation for model conversion and quantization processes (a minimal quantization sketch follows this list).
- Bug Reports and Performance Issues: Several users have reported bugs and performance issues in the llama.cpp project. These include segmentation faults, crashes during inference, and performance regressions on specific hardware setups. These issues often involve compatibility problems with different operating systems and hardware configurations, indicating a need for more robust testing and error handling.
- Feature Requests for Model and Tool Enhancements: There are multiple feature requests aimed at enhancing the capabilities of the llama.cpp project. These include requests for integrating new models, improving inference efficiency, and adding new functionalities like dynamic context resizing and prebuilt binaries. These requests reflect the community's desire for a more versatile and user-friendly toolset.
- Compilation and Build Issues: Users are facing challenges with compiling and building the llama.cpp project on various platforms. These issues include errors related to compiler identification, missing files, and outdated dependencies, which hinder the successful compilation of the project. Addressing these issues would improve the build process and accessibility for developers.
- Web Interface and API Issues: The llama.cpp project's web interface and API are experiencing several issues, including bugs in the chat functionality and ignored parameters in API requests. These issues affect user interaction and the reliability of the web interface, suggesting a need for improvements in the user experience and API functionality.
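To make the quantization reports above more concrete, here is a hedged sketch of the C API path that the bundled llama-quantize tool drives; field and enum names assume a recent llama.h, and the Q4_K_M target is just one example of a 4-bit format.

```cpp
// Minimal sketch, assuming a recent llama.h: quantize an input GGUF model to a
// 4-bit Q4_K_M file. A non-zero return is the kind of failure reported when a
// tensor cannot be mapped or a quantization method is unsupported.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <input-f16.gguf> <output-q4_k_m.gguf>\n", argv[0]);
        return 1;
    }

    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M; // 4-bit k-quant target
    qparams.nthread = 8;                         // worker threads for quantization

    if (llama_model_quantize(argv[1], argv[2], &qparams) != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```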
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 24
Summarized Issues:
- Model Loading and Execution Errors: Issues related to model loading and execution errors in llama.cpp highlight various challenges, such as runtime errors and assertion failures. These problems often stem from mismatches in configurations or compatibility issues with specific hardware or software environments.
- Feature Requests and Enhancements: Several issues propose new features or enhancements for llama.cpp, such as support for new models or quantization types. These requests aim to improve the project's functionality and performance, addressing specific user needs and expanding its capabilities.
- Compilation and Build Issues: Compilation and build issues in llama.cpp often arise from compatibility problems with compilers or specific system configurations. These issues can prevent successful builds and require updates or configuration changes to resolve.
- Performance and Optimization Concerns: Performance issues, such as slow execution or inefficient resource usage, are reported in llama.cpp, particularly when using specific backends or hardware. These concerns highlight the need for optimization and potential workarounds to improve performance.
- Template and Configuration Bugs: Bugs related to templates and configurations in llama.cpp can lead to crashes or unexpected behavior. These issues often involve unsupported templates or configuration options that need to be addressed to ensure stability.
- Parameter and Option Changes: Changes in parameters or options, such as the removal or modification of command-line arguments, can lead to confusion or unexpected behavior in llama.cpp. Users often seek clarification or alternatives to adapt to these changes.
- Backend and Hardware Compatibility: Compatibility issues with specific backends or hardware configurations can cause errors or suboptimal performance in llama.cpp. These issues often require updates or configuration changes to ensure proper functionality.
- Docker and Platform Issues: Problems with Docker images or platform compatibility can lead to warnings or errors in llama.cpp. These issues often require updates to Docker manifests or platform-specific configurations to resolve.
- Library and Dependency Bugs: Bugs in libraries or dependencies used by llama.cpp can cause errors during execution. These issues often require updates or patches to the affected libraries to resolve.
- Clustering and Multi-node Support: The need for clustering and multi-node support in llama.cpp is discussed to enhance performance and scalability. These discussions explore potential solutions and related projects to achieve these goals.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 24
Key Open Pull Requests
1. llama: Add support for RWKV v7 architecture: This pull request introduces support for the RWKV v7 architecture in the llama.cpp project, including the implementation of a new `GGML_OP_RWKV_WKV7` operation for the core architecture, L2 normalization, and inference support for RWKV7 and ARWKV7 models across multiple backends such as CPU, CUDA, SYCL, Vulkan, and Metal, while also optimizing performance and addressing various model-specific enhancements and bug fixes.
- URL: pull/11452
- Merged: No
- Associated Commits: a44e9, 666d7, ea20c, 68689, 9b06a, 694b5, 5f4dc, 6c159, d11f4, b4e6c, 3b4ec, e9c63, 6588c, 8cfa1, 01c78, f48c2, e8c4b, 0c43f, 5b372
2. SYCL: Kernel function refactor: This pull request refactors the SYCL backend by removing the `ggml_sycl_op_flatten` function, integrating its responsibilities directly into kernel functions to avoid unnecessary type conversions and improve numerical stability, while also introducing flexibility for additional data types, removing unused variables, and addressing several code organization tasks such as sorting includes and adding exception handling.
- URL: pull/11515
- Merged: No
- Associated Commits: 0cb29, 1ea57, 7d8a4, 57e2d, 1d5ad, c9f41, 98de6, 10ab9, 414e6, 9c894, c07b0, 387c5, 11fe7, 70692, 498db
3. Optimized DeepSeek V2/V3 implementation (MLA): This pull request introduces optimizations to the DeepSeek V2/V3 implementation by caching latent representations, replacing the naive attention mechanism with a more efficient one based on intermediate representations, and splitting model tensors to improve inference performance, while also addressing CUDA performance issues and planning further improvements such as removing unused tensors and supporting older model files.
- URL: pull/11446
- Merged: No
Other Open Pull Requests
- Vulkan Backend Enhancements: Several pull requests focus on improving the Vulkan backend's performance and functionality. These include optimizing cooperative matrix callbacks for iq2 and iq3 types, introducing simpler Kompute MAT_MUL shaders for better compatibility with embedded GPUs, and addressing issues like crashes when Vulkan is unavailable and memory allocation improvements to reduce fragmentation.
- Quantization Support in Vulkan: The Vulkan backend sees enhancements with the introduction of IQ1_S, IQ1_M, and IQ4_XS quantizations. These updates aim to provide performance comparable to existing quantizations and include optimizations for shared memory usage and performance metrics for specific devices.
- Documentation Updates: Updates to documentation include adding information about the IRIS Android app and ChatPDFLocal MacOS application. These updates aim to provide users with more resources and examples of applications using llama.cpp.
- Performance Optimizations: Various pull requests focus on performance improvements across different components. These include optimizing SIMD instructions for WebAssembly, enhancing Flash Attention in Deepseek V3 models, and introducing a NUMA-aware key-value cache buffer for multi-CPU systems.
- Code and Build System Improvements: Enhancements to the codebase and build system include updates to CMakeLists.txt for Windows version detection, using `#define` directives for color naming, and fixing issues in the continuous integration environment for openEuler.
- CUDA and FlashAttention Enhancements: A new CUDA FlashAttention kernel is proposed to replace the existing one, utilizing PTX instructions for better performance. This update aims to improve performance for large batch sizes and maintain compatibility with newer architectures.
- Miscellaneous Enhancements: Other enhancements include introducing a lambda function for slot type handling, support for tool-calls in llama-cli, and loading all experts in MoE models during warmup. These updates aim to reduce code duplication, enhance functionality, and address specific issues.
- Precision and Bug Fixes: Fixes for precision issues and bugs include modifications to the minicpm-v code and addressing shared memory size checks in the Vulkan backend. These updates ensure better accuracy and compatibility across different configurations.
- Warp Size and Performance Improvements: Support for selectable warp sizes in the mmv component is introduced to enhance performance on devices with non-standard warp sizes. This update results in significant performance improvements on specific architectures.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 54
Key Closed Pull Requests
1. Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars: This pull request introduces tool call support for various models, including Llama, Functionary, Hermes, Mistral, Firefunction, and DeepSeek, by implementing a minimal Jinja templating engine and lazy grammars to facilitate native and generic tool call styles, enhancing the functionality and compatibility of the llama.cpp project with different model architectures and tool schemas (a hedged request-shape sketch follows the key pull requests below).
- URL: pull/9639
- Merged: Yes
- Associated Commits: ec9f3, 9a86e, c8809, 7fde6, dd6d0, 168ad, ec547, b51c7, 74d71, b8254, aefac, 64287, fa4c1, 773ff, 92c38, 3ebdb, 35ac1, 52273, e4d54, 61655, be9de, 54285, 7d9c9, e8d9d, c395d, f5b78, c7735, b35aa, 9477c, c4a80, f5f74, fe967, 479c1, bc52c, c059a, 5789f, f9b19, adc67, 1afa3, 30fbc, a469f, cbe39, 1fd5f, 5d003, 1f0b1, 93a52, 05505, 1e211, 7bfcd, 7e3fe, e70ce, f0bd6, f6458, 0e87a, 0a5d5, a2fe8, 523eb, e7ff6, 7a7d6, e183f, 01072, d47f4, 3c778, 138a4, 045ed, 2ceab, 259d9, acf7c, 76893, d6f05, c207f, 0401a, 9bab6, b1103, 7ea6a, 56aa9, ba8dd, c6062, fec02, b49d0, f6e73, 77f40, dbf84, ef61a, 39729, d77fe, 5268e, 9e8b4, 03fe8, 41a61, e2116, 28cac, 2dd09, 01b34, 82b6e, 63387, a4226, cce11, c6a22, 30d33, 9ccc6, d1867, f0231, 5e358, cdfa8, a46de, c2d83, 46415, 36ed1, c479d, 0208b, a6463, 51b7a, 3f3fc, 11594, 43385, 5ec4c, f7078, ca0c8, bddc1, da606, 15ec0, 2efa0, 57f40, 67709, 09971, 92ac3, 118f7, add91, fa065, ad229, 90eff, cafea, b565a, 2d607, ef9ef, 62717, 6d568, 2f992, 0a51e, d274f, 62d45, ec4ae, b5a74, ba10b, cd63b, cad14, 4f257, d603d, 64263, 4cdbb, 47be4, 18d5a, 4a1e8, 923c8, 384f5, 40cc3, 41eec, 76f6a, 77dd6, 0f8af, babde, 68202, 7b5e0, ba27e, 6e676, ed7c6, 36c77, bc8a6, 84bc0, 2b245, 64545, cbecb, a810c, 77c60, d86a1, 77455, 590c9, f8e14, 81547, 18450, b831a, 76359, 9591a, 8ef37, 2d51c, c88f4, 3dcde, 06c4c, 0c171, 96850, 2bb3f, 7d59b, 5a64a, f223d, 82052, 5add2, 1029f, 3bd6a, 729d2, 34f54
2. deprecated: This pull request, which was not merged, aimed to introduce various updates and improvements to the llama.cpp project, including support for ggml, updates to the README after renaming GGML, enhancements for omni-audio and qwen2-audio, removal of unnecessary builds, updates to C++17 for compilation, addition of omni-vlm examples in C++ and Python, and several bug fixes and optimizations related to memory leakage, build processes, and model inference.
- URL: pull/11568
- Merged: No
- Associated Commits: 5f815, 3a355, 4a29b, c7b91, f0d1c, 9e67e, 4bdc7, d277c, 995ba, a4747, 6f1ed, 14196, d42e0, 05853, 91b3c, d6c06, 983b4, b535c, 38c6f, 22da7, 5574b, b24a4, 6a4cf, 5edad, 20b9f, 3dfac, df584, 86c22, 400fc, b1768, 16c22, 3d9c6, eb6d5, 8c417, d5df5, ecfe0, 667a6, d04e3, 21bc8, 6f0e8, 7cf07, 362bd, 5f2d9, 55953, 82dbd, 89bcf, 4e801, 98297, bb334, fc255, b9845, aad01, 8e2e6, e4ca9, 25190, fe792, fd2c5, 75891, bbf1a, 46021, fe8c7, 43f41, 3479f, 809db, a2c53, 661b3, 0b15d, 71b56, 97267, be54c, ca7e8, 07c7f, b86cd, b2958, 64a60, 5962b, 1487d, 9201d, a4ee5, 23649, 37b57, e39e2
3. cmake: add ggml find package: This pull request introduces a CMake find package for the ggml library, enabling users to link specific backends or all backends collectively through per-backend `ggml::` targets and the collective `ggml::all` target, while also requiring explicit backend requests when using the llama find-package.
- URL: pull/11369
- Merged: Yes
- Associated Commits: 530fd, 5b4c1, 314f2, b14e8, ea0a8, 09ab0, 817cf, 1760b, 65b0d, 6388d, bf444, 835e0, 7f3c2, c2332
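As a hedged illustration of the tool-call support introduced in the first key pull request above, the snippet below prints the kind of request body such a feature is designed to accept. It assumes llama-server's OpenAI-compatible /v1/chat/completions endpoint and an OpenAI-style "tools" array; the get_weather function and field values are made up for the example and are not taken from the pull request.

```cpp
// Hypothetical request-shape sketch: the JSON below would be POSTed to a running
// llama-server (e.g. http://localhost:8080/v1/chat/completions) by an HTTP client.
// The tool declaration follows the OpenAI "tools" schema; names are illustrative.
#include <cstdio>

int main() {
    static const char * payload = R"JSON({
  "messages": [
    { "role": "user", "content": "What is the weather in Tokyo?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ]
})JSON";

    puts(payload);
    return 0;
}
```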
Other Closed Pull Requests
- Vulkan Backend Optimization: This topic covers the implementation of initial support for IQ2 and IQ3 quantizations in the Vulkan backend, optimizing performance for various quantization types. Additionally, it addresses the issue of pipeline creation failure in Vulkan by implementing error message logging and fixes for warnings related to a previous on-demand compile change.
- Continuous Integration Improvements: The use of `ccache-action` across all CI workflows significantly reduces build times and addresses cache management issues. Another pull request simplifies the CMake build commands in the CI process, ensuring a more streamlined integration.
- ARM and Vulkan Build Fixes: This topic addresses issues with ARM and Vulkan builds by downgrading to Ubuntu Jammy and includes several commits such as separating ARM64/AMD64 builds. It also fixes the build process for CPU architecture arm64 in the CI setup.
- Metal Backend Optimization: Implementing residency sets in the Metal backend keeps allocated memory wired, reducing overhead and improving request speeds. This pull request also provides an option to disable residency sets for cases where GPU memory collection by the OS is preferred.
- Llama 3.x Compatibility: This topic ensures better integration with the pydantic_ai package, updating the README, and implementing various fixes for Llama 3.x and Functionary 3.2. It also addresses a bug by ensuring the linefeed token for models like Llama-3 is correctly identified.
- Docker and Python Compatibility: This topic enables the installation of pip packages system-wide during the Python libraries installation step in Docker, addressing compatibility issues with Ubuntu 24.04's Python version. It also adds new arguments to the `tools.sh` script to facilitate performance benchmarking and perplexity evaluation for models using the Vulkan backend.
- Windows Build Improvements: This topic addresses the issue with the Windows SYCL build by replacing `ccache` with `sccache` in the CI process. It also involves reverting the Windows HIP build process to use plain ccache due to compatibility issues with sccache.
- Documentation Updates: This topic updates the server's README.md file to include documentation on the response format for the `/apply-template` route. It also updates the README documentation by adding relative links to reference examples.
- Performance Optimization: This topic introduces minor optimizations to improve the loading speed of the Llama model by approximately 20%. It also addresses performance issues caused by excessive use of host-visible video memory in Vulkan, implementing a heuristic to avoid using host-visible vidmem when it becomes "mostly full."
- Bug Fixes and Error Handling: This topic addresses a segmentation fault error by handling null values returned from the `MTLCreateSystemDefaultDevice()` function. It also addresses a bug in the `llama-run` application by adding a check for the required model parameter to prevent crashes.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 138 | 32 | 3 | 148 |
| ochafik | 248 | 16 | 0 | 30 |
| ngxson | 106 | 22 | 1 | 151 |
| slaren | 15 | 8 | 1 | 104 |
| jeffbolznv | 21 | 16 | 0 | 58 |
| JohannesGaessler | 18 | 7 | 0 | 38 |
| ericcurtin | 16 | 16 | 0 | 29 |
| 0cc4m | 4 | 2 | 1 | 48 |
| danbev | 25 | 12 | 1 | 14 |
| qnixsynapse | 32 | 5 | 0 | 12 |