Weekly GitHub Report for Llama.cpp: February 03, 2025 - February 10, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4675
1.2 Version Information:
The latest version, b4675, was released on February 8, 2025. The release data does not include detailed notes, so no specific highlights or trends can be reported for this release.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
Compile bug: How to compile llama.cpp with Vulkan for android device: This issue involves a user attempting to compile the llama.cpp project with Vulkan support for an Android device using NDK 27 and Vulkan-header v1.4.307, but encountering multiple compilation errors related to undeclared identifiers and compatibility issues. The user provides detailed logs and compile commands, seeking assistance to resolve these errors and successfully compile the project.
- The comments discuss similar issues faced by others, with suggestions to provide full logs and check for specific support definitions. There is a discussion about potential issues with cross-compilation and the use of different versions of the `glslc` compiler. A workaround involving hardcoding the `glslc` path is suggested, but the problem persists. The conversation concludes with a proposal to ensure consistent `glslc` usage across CMake files to resolve the issue.
- Number of comments this week: 14
-
Misc. bug: Vulkan premature out of memory exception on AMD Instinct MI60: This issue reports a problem with the Vulkan implementation on AMD Instinct MI60 graphics cards, where the user experiences a premature out-of-memory exception when trying to allocate more than 16GB of VRAM, despite the card having 32GB available. The user describes attempts to run a llama-server instance with a large context length, encountering errors when exceeding a certain threshold, and notes that the problem persists across different Vulkan versions.
- The comments discuss the limitation of Vulkan in handling buffer sizes over 4GB, with suggestions to use environment variables to increase allocation size, though this may not work if the driver does not support it. There is a discussion about using multiple buffers and potential algorithmic changes, as well as a suggestion to use a quantized format to reduce cache size. A user identifies a setting in the "Vulkan Configurator" as the cause of their issue and resolves it by changing the configuration to allow the application to control layers. A sketch of the environment-variable workaround appears after this list.
- Number of comments this week: 8
-
Research: Performance differences between Metal (macOS) and Vulkan (Linux): This issue involves a developer from the Asahi Linux GPU drivers team seeking assistance to improve the performance of llama.cpp on Apple Silicon platforms using the Vulkan backend, as they have observed that macOS performs significantly faster than Linux in their tests. The developer is requesting insights into the performance differences between the Metal and Vulkan backends, as well as guidance on running micro-benchmarks and understanding the scheduling of work on both platforms to identify potential areas for optimization.
- The comments discuss the state of the shader compiler and the performance of Vulkan shaders, with suggestions to run specific tests to compare performance between Metal and Vulkan. There is a consensus that missing cooperative matrix support in Vulkan could be a significant factor in the performance gap. Various contributors share insights on recent optimizations, potential areas for improvement, and the impact of different backend configurations, with some suggesting further testing and comparison using MoltenVK on macOS.
- Number of comments this week: 5
-
Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit: This issue involves benchmarking the performance of the DeepSeek-R1 IQ1_S 1.58bit model using llama.cpp, focusing on various stages of research such as background research, hypothesis formation, and analysis of results. The issue provides detailed performance metrics, including token sampling performance, model loading time, prompt evaluation, and generation evaluation, highlighting bottlenecks and overall performance characteristics on different hardware configurations.
- The comments discuss discrepancies in reported token generation speeds, with clarifications provided on the correct speeds for prompt evaluation and generation. Users share their testing results on different hardware setups, suggesting optimizations and discussing potential improvements in performance through changes in configuration and implementation.
- Number of comments this week: 5
-
Feature Request: Console Compatibility for Llama.cpp (PS5 & Xbox): This issue is a feature request to make the Llama.cpp project compatible with PS5 and Xbox consoles, as it currently cannot be successfully compiled on these platforms. The request highlights the potential for new AI-driven applications in gaming and other fields if support for these consoles is expanded, and it seeks guidance on the recommended approach to achieve this compatibility, noting that compiling for PS5 currently results in 396 errors.
- The comments discuss potential starting points for a minimal core module, inquire about specific errors encountered, and identify missing functions and header files causing compilation issues on both PS5 and Xbox. There is a clarification that some missing elements are from example code rather than the core library, and further details about the compilation process and SDKs used are requested and provided.
- Number of comments this week: 5
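For the Vulkan out-of-memory report above (second item), the thread's first suggestion was an environment-variable override of the allocation cap. Below is a minimal sketch in Python, assuming the `GGML_VK_FORCE_MAX_ALLOCATION_SIZE` variable read by llama.cpp's Vulkan backend and a `llama-server` binary on the PATH; both are assumptions for illustration, not confirmed details from the thread.

```python
import os
import subprocess

# Copy the environment and raise the Vulkan allocation cap (value in bytes).
# GGML_VK_FORCE_MAX_ALLOCATION_SIZE is assumed here; whether the driver
# actually honors allocations above 4 GB varies by hardware and driver.
env = os.environ.copy()
env["GGML_VK_FORCE_MAX_ALLOCATION_SIZE"] = str(4 * 1024**3)

# Launch llama-server with a large context, offloading all layers to the GPU.
subprocess.run(
    ["llama-server", "-m", "model.gguf", "-c", "32768", "-ngl", "99"],
    env=env,
    check=True,
)
```

As the comments note, such an override is a stopgap: if the driver rejects allocations above 4GB, splitting tensors across multiple buffers is the more robust fix.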
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 45
Summarized Issues:
- Compilation Issues: Compilation problems are prevalent across various systems and configurations, affecting the llama.cpp project. Users report errors when compiling with Vulkan on Termux, CUDA on Linux, and SYCL on Arch Linux, often due to missing dependencies, unsupported features, or misconfigurations in the build environment.
- Feature Requests: There are multiple feature requests aimed at enhancing the llama.cpp project. These include adding support for hardware accelerators like TPUs, implementing experimental web UI features, and enabling model unloading after inactivity to conserve resources (sketched after this list).
- Memory and Performance Issues: Users encounter memory allocation failures and performance bottlenecks in the llama.cpp project. These issues include premature out-of-memory exceptions on AMD GPUs and CPU-bound performance during inference with certain models.
- Web UI Enhancements: Several issues focus on improving the web UI of the llama.cpp project. Requests include adding file upload support, import/export functions for conversations, and reading data from specific endpoints to enhance user interaction.
- Bug Reports: Various bugs are reported in the llama.cpp project, affecting different components. These include segmentation faults during quantization, runtime errors in the llama-tts project, and JSON parsing errors causing server termination.
- Cross-Platform Compatibility: Users face challenges in making the llama.cpp project compatible with various platforms. Issues include compilation errors on PS5 and Xbox consoles and discrepancies in output between different interfaces.
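The unload-after-inactivity request above has no built-in support yet, but the intended behavior can be illustrated with an external supervisor. The sketch below is purely hypothetical: `llama-server`, `model.gguf`, and the 300-second threshold are placeholders, and nothing here is part of the llama.cpp API.

```python
import subprocess
import threading
import time

IDLE_LIMIT_S = 300  # hypothetical inactivity threshold


class IdleUnloader:
    """Starts the server on demand and stops it after a period of no use."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None
        self.last_used = time.monotonic()
        threading.Thread(target=self._reaper, daemon=True).start()

    def touch(self):
        # Call on every proxied request: resets the idle clock and
        # (re)starts the server if it is not currently running.
        self.last_used = time.monotonic()
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.cmd)

    def _reaper(self):
        # Background thread: terminate the server once it has been idle
        # longer than IDLE_LIMIT_S, freeing VRAM/RAM until the next touch().
        while True:
            time.sleep(10)
            idle = time.monotonic() - self.last_used
            if self.proc and self.proc.poll() is None and idle > IDLE_LIMIT_S:
                self.proc.terminate()


unloader = IdleUnloader(["llama-server", "-m", "model.gguf"])
```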
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 41
Summarized Issues:
- Server Crashes and Errors: This topic covers issues related to server crashes and errors in the llama.cpp project. One issue describes a server crash caused by sending a string instead of the expected JSON object, where the server should return an HTTP 400 error; a client-side reproduction sketch follows this list. Another issue involves a segmentation fault due to a null pointer in the glm-4-9b-chat model's layers, with a suggested fix provided.
- Feature Requests for Model Support and Enhancements: This topic includes feature requests for adding support for new models and enhancing existing functionalities. Requests include adding support for Pixtral and the Llama-3_1-Nemotron-51 model, as well as implementing a multi-prompt caching system to improve performance.
- Compilation and Build Errors: This topic addresses various compilation and build errors encountered in the project. Issues include a `__device__` variable marked as `constexpr` causing build failures, and a problem with the `get_executable_path()` function lacking a return value on certain platforms.
- Bugs in Model Inference and Output: This topic covers bugs related to model inference and output issues. Problems include incorrect token generation in the Falcon-40b model and repeated output in Q4 quantized models, affecting the quality and accuracy of the generated text.
- User Interface and Experience Issues: This topic involves issues affecting the user interface and experience in the llama.cpp project. Problems include a bug in the web UI where pressing Shift+Enter adds a newline incorrectly, and a request to restore typing functionality during text generation.
- Security and Virus Detection Concerns: This topic addresses security concerns and virus detection warnings related to the project's releases. Users reported a Trojan warning on Windows 11 for specific release files, prompting an investigation into potential false positives.
- Performance and Regression Issues: This topic includes performance regressions and related issues in the llama.cpp project. A significant slowdown was observed in the llama-server module, and a regression in token generation behavior was noted with flash attention in CUDA.
- Compatibility and Configuration Problems: This topic covers compatibility and configuration issues encountered in the project. Problems include a requirement error with `nvidia-container-cli` due to CUDA version constraints and a bug with the `llama-qwen2vl-cli` tool on MacBook Pro M1 Max.
- Documentation and Code Refactoring: This topic involves documentation improvements and code refactoring requests. Suggestions include refactoring the GGUF packing process documentation and transitioning the server web UI from Vue.js to React.js with TypeScript for better manageability.
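The string-versus-object crash described above is straightforward to reproduce from a client, and the agreed-on behavior is an HTTP 400 response rather than a dead server. A small sketch, assuming a local `llama-server` listening on port 8080 and its `/completion` endpoint:

```python
import json

import requests

URL = "http://localhost:8080/completion"  # assumes a local llama-server

# Well-formed request: the body is a JSON object with a prompt.
ok = requests.post(URL, json={"prompt": "Hello", "n_predict": 16})
print(ok.status_code, ok.json().get("content", ""))

# The reported bug: a bare JSON string instead of an object. After the fix,
# the server should answer 400 Bad Request instead of terminating.
bad = requests.post(
    URL,
    data=json.dumps("just a string"),
    headers={"Content-Type": "application/json"},
)
print(bad.status_code)  # expected: 400
```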
2.5 Issue Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 31
Key Open Pull Requests
1. `server`: fix tool-call of DeepSeek R1 Qwen, return `reasoning_content` (Command R7B & DeepSeek R1) unless `--reasoning-format none`: This pull request addresses the enhancement of tool-call support for DeepSeek-R1-Distill-Qwen models by introducing a `--reasoning-format` flag to control the output of `reasoning_content` in API responses, updating the Command R7B parser for better handling of tool plans, and ensuring compliance with native thinking tags, while also providing detailed instructions for building and running the updated server with various models. (A client-side sketch of consuming `reasoning_content` appears after this list of key pull requests.)
- URL: pull/11607
- Merged: No
- Associated Commits: d3b60, 87de8, 130ca, 04d51, 28345, c80cb, 08716, 73d08, 04be7, ae9d5, 19bea, 5e6f2, a7607, 2b3c4, 4cb0e, 08271, df347, c397b, 56961, 0be7f, 7dc27, c6214, 1c302, 108da, bc6d9, 11c1f, 30ea3, bbd45, bff54, ce282, e84ee, 18a11, 9a684, a682d, f0154, 326e7, 78b47, 86994, d43e4, 81254, d44eb, b6e14, 1f5ec, 438ce, b5b11, 0db98, d1b66, 39c1d, b2d17, 933f7, 5d60c, 1f1f0, 9d7c3, d20c2, f3e9f, 3841a, e6d9b, 39b50, 0917e, 09862, 33efc, 99430, d1a06, cc2c7, c0f97, af638, a59fd, b829c
2. Supporting Velvet model: This pull request introduces support for Velvet models in the llama.cpp project, including updates to various scripts and files, such as `convert_hf_to_gguf.py` and `llama.h`, and adds a test case for the Velvet chat template, as detailed in the commits.
- URL: pull/11716
- Merged: No
3. Attempt to add the `mllama` support: This pull request aims to integrate mllama support from the Ollama GitHub repository into the current project by applying patches for mllama implementation and the unpad operator, while addressing issues related to model conversion and tensor dimension mismatches.
- URL: pull/11639
- Merged: No
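The practical effect of the first key pull request is an extra `reasoning_content` field on chat responses unless the server is started with `--reasoning-format none`. A hedged client-side sketch, assuming a local llama-server running a DeepSeek-R1-Distill model behind its OpenAI-compatible endpoint (the port and model name are placeholders):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "deepseek-r1-distill-qwen",  # placeholder model name
        "messages": [{"role": "user", "content": "What is 12 * 7?"}],
    },
)

msg = resp.json()["choices"][0]["message"]
print("reasoning:", msg.get("reasoning_content"))  # thinking tokens, if emitted
print("answer:", msg.get("content"))
```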
Other Open Pull Requests
- Build and Compilation Issues: Several pull requests address build and compilation issues across different platforms and configurations. These include resolving build issues on Windows and Gentoo systems, fixing compilation warnings in the `ggml-cpu-aarch64` module, and modifying CMake configurations for better compatibility and functionality.
- Memory Management Improvements: Two pull requests focus on addressing memory leak issues in the `llava.cpp` and `clip.cpp` files. These changes ensure proper memory management in error handling paths without affecting model or runtime performance.
- Continuous Integration Enhancements: Enhancements to the CI process are proposed in multiple pull requests, including restricting ccache writes to specific jobs and addressing certificate revocation errors on Windows workers. These changes aim to improve the reliability and efficiency of the CI pipeline.
- Documentation and Usability Improvements: Several pull requests introduce documentation updates and usability improvements, such as adding a README for Qwen2VL, clarifying function purposes, and providing build instructions for the OpenCL backend. These changes aim to enhance user understanding and ease of use.
- Feature Additions and Optimizations: New features and optimizations are introduced, such as support for uploading `.pdf` files in the web UI, chunking support in the `mul_mat_id` function, and tensor transmission optimizations. These enhancements aim to improve functionality and performance.
- Error Handling and Signal Processing: Pull requests address issues related to error handling and signal processing, such as ensuring correct processing of the Ctrl+C signal and preventing token mismatches in the server. These changes improve the robustness of the system.
- Script and Test Enhancements: Enhancements to scripts and tests include adding inline script metadata for dependency management, updating test scripts for better documentation, and modifying tests for umlauts. These changes aim to streamline development and testing processes.
- Miscellaneous Improvements: Various other improvements include reverting changes to the Swift package, proposing the use of named colors in examples, and using ANSI escape codes for efficient line clearing (illustrated in the sketch below). These changes contribute to overall project maintenance and enhancement.
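The ANSI line-clearing change in the last bullet rests on two standard escape sequences: `ESC[2K` erases the current terminal line, and `\r` returns the cursor to column 0, so a status line can be redrawn without padding it with trailing spaces. A minimal, self-contained illustration:

```python
import sys
import time

for pct in range(0, 101, 20):
    # "\x1b[2K" clears the whole line; "\r" moves the cursor to column 0.
    sys.stdout.write("\x1b[2K\r" + f"progress: {pct}%")
    sys.stdout.flush()
    time.sleep(0.2)
print()  # finish with a newline so the shell prompt starts on a clean line
```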
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 67
Key Closed Pull Requests
1. SYCL: Kernel function refactor: This pull request refactors the SYCL backend by removing the `ggml_sycl_op_flatten` function, integrating its responsibilities directly into kernel functions to avoid unnecessary type conversions, improve numerical stability, and enhance flexibility for supporting additional data types, while also cleaning up unused variables and adding exception handling for SYCL operations.
- URL: pull/11515
- Merged: No
- Associated Commits: 2d72b, 957c1, 108be, e1326, fa7c4, 95a09, 5288b, a153f, 51bed, 3a346, aaf9e, a16b6, ecacf, 7d8d6, 04d8b, 98f5f, 8e867, 7f2d2, 92792, 0c319, ddc5e, ba792, 4db56, eb466, 5c05a, d31c6, 1ccfa, bba4b, 6dbb7, a6a23, 6eb30, 539b0, 18d70, 0ae9a, 7369e, e5926, 52b06, 0b602, efb57, cfa2c
2. server : (webui) revamp Settings dialog, add Pyodide interpreter: This pull request revamps the Settings dialog in the web UI by organizing it into two columns, introduces an "Experimentals" section featuring a Python interpreter powered by Pyodide (CPython compiled to WebAssembly), and adds an API for a side panel called "Canvas" to support future extensions, while also addressing various UI improvements and bug fixes.
- URL: pull/11759
- Merged: Yes
- Associated Commits: 422e5, 483a3, 115f7, be22b, fbf28, 6f1fc, 19a95, 22e82, e1f03, 84919, 8e092, 69fa9, 84fe6, 475b2, 85da9, 77918
3. server : (webui) migrate project to ReactJS with typescript: This pull request involves migrating the project's web user interface from VueJS to ReactJS with TypeScript, introducing enhancements such as improved markdown rendering using `react-markdown`, enabling text selection during generation, providing unique addresses for each conversation, updating the "Copy" button functionality, and allowing users to switch between conversations while text generation is in progress, all while ensuring the changes remain transparent to end-users and require no migrations.
- URL: pull/11688
- Merged: Yes
- Associated Commits: 0d172, cc277, 699e8, 58499, 82ab8, 518e0, c8dc8, 64c5b, 71235, 124df, d9959, 1dc99
Other Closed Pull Requests
- CUDA and GPU Optimizations: This topic covers several pull requests aimed at enhancing CUDA and GPU performance. The introduction of a new CUDA FlashAttention kernel and support for non-contiguous input tensors are notable improvements. Additionally, optimizations for ROCm and the handling of AMD GPU architectures have been addressed to ensure compatibility and performance across various hardware configurations.
- Codebase Synchronization and Updates: This topic includes pull requests focused on synchronizing the codebase with external repositories and updating project components. The synchronization with the Google repository and updates to the `minja` project are part of ongoing efforts to maintain code consistency and functionality.
- Web UI and User Experience Improvements: Several pull requests address enhancements to the web UI and user experience. These include fixing Shift+Enter handling, improving textarea overflow, and ensuring numeric settings are saved correctly, all contributing to a more intuitive and functional user interface.
- Documentation and Readme Updates: This topic covers updates to documentation and README files to reflect recent changes and improvements. These updates include adding new features, such as the `llm_client` Rust crate, and ensuring documentation is aligned with the current state of the project.
- Error Handling and Exception Management: Pull requests under this topic focus on improving error handling and exception management within the project. Enhancements include adding try-catch blocks in the server and addressing unhandled exceptions to ensure robust error management.
- Performance and Memory Optimization: This topic includes pull requests aimed at optimizing performance and memory usage. Efforts include addressing memory fragmentation in Vulkan and optimizing the Vulkan cooperative matrix callbacks for improved performance.
- Build and Compilation Fixes: Several pull requests address build and compilation issues across different architectures and environments. These include fixing compile errors related to SIMD operations on LoongArch and ensuring compatibility with older compilers.
- Quantization and Model Optimization: This topic covers pull requests focused on model quantization and optimization. Notable changes include the introduction of quantization for the visual projector LLAVA and Qwen2VL, significantly reducing file sizes and improving efficiency.
- Server and API Enhancements: Pull requests in this category focus on server and API improvements, such as introducing a new `--rpc-layers` flag and enhancing the server logging system for better debugging and user experience.
- Bug Fixes and Issue Resolutions: This topic includes pull requests that address various bugs and issues within the project. Fixes include resolving a segmentation fault issue with older models and correcting the installation directory for `llama.pc`.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ochafik | 346 | 24 | 1 | 40 |
| ngxson | 122 | 20 | 7 | 183 |
| ggerganov | 119 | 21 | 3 | 156 |
| slaren | 18 | 7 | 1 | 95 |
| jeffbolznv | 19 | 15 | 0 | 54 |
| JohannesGaessler | 27 | 11 | 1 | 42 |
| ericcurtin | 20 | 18 | 0 | 36 |
| IMbackK | 12 | 8 | 1 | 45 |
| danbev | 30 | 14 | 2 | 18 |
| qnixsynapse | 50 | 4 | 0 | 8 |