Weekly GitHub Report for Llama.cpp - 2024-12-16 12:00:29
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4329
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that that have been commented on most frequently within the last week.
-
Misc. bug: Q4_0 with runtime repacking not working as expected (TYPE_Q4_0_4_4 REMOVED): This issue is about a bug in the llama.cpp project where the Q4_0 model with runtime repacking is not performing as expected after a recent update, resulting in significantly slower performance compared to the previous Q4_0_4_4 model. The user reports that the runtime repacking feature does not seem to be utilized, leading to slower execution times on ARM64 architecture, and they provide detailed logs and comparisons to illustrate the problem.
- The comments discuss potential causes and solutions for the issue, including checking compilation settings and hardware capabilities, such as NEON and dot product support. Users share their experiences with similar issues on different hardware, like Raspberry Pi and Rockchip boards, and suggest possible improvements, such as allowing both online repacking and pre-repacked models. There is also a discussion about the scalability of supporting multiple quantization types and the potential benefits of a build-time repack feature.
- Number of comments this week: None
-
Feature Request: Source code highlight and math formula rendering: This issue is a feature request for the llama.cpp project, proposing the addition of source code highlighting and mathematical formula rendering in the web user interface. The implementation suggestion involves using the highlight.js and markdown-it-katex-gpt libraries to enhance the display of code and formulas, with detailed steps provided for integrating these features into the existing system.
- The comments discuss the feasibility and implications of the proposed feature, with initial support for the idea and suggestions to submit a pull request. Concerns are raised about the size of the katex library, which is considered too large for a non-essential feature, leading to a discussion about alternative solutions such as using a CDN or compression techniques like gzip. The conversation also touches on the potential use of Brotli compression, but it is decided to stick with gzip due to compatibility issues with some reverse proxies.
- Number of comments this week: None
-
Misc. bug: Virus detected: This issue reports that Windows Defender has detected a Trojan:Script/Wacatac.B!ml in the file "llama-b4297-bin-win-cuda-cu12.4-x64.zip" from the llama.cpp project. The detection is suspected to be a false positive, as similar issues have been noted with Windows Defender in the past.
- The comments discuss whether the virus detection is a false positive, with some users confirming that VirusTotal did not flag the file, while others note that the specific version b4297 is still flagged by Windows Defender. It is suggested that the detection might be due to loose rule sets used by Defender, and some users recommend reporting the issue to Microsoft. The issue was closed and reopened multiple times, with users noting that newer versions seem unaffected, and there is a consensus that ignoring false positives is not ideal.
- Number of comments this week: None
-
Eval bug: Q2_K and Q3_K not working on Vulkan anymore on RX 5700XT: This issue is about a bug in the Vulkan backend of a GitHub project, where models using Q2_K and Q3_K tensors produce gibberish output on an AMD Radeon RX 5700 XT GPU with the proprietary driver. The problem is linked to a specific commit, and reverting it resolves the issue, indicating a potential regression in the code.
- The comments discuss various confirmations of the issue on different systems, with some users experiencing similar problems on Arch Linux. There is a debate about whether the issue is related to a Vulkan update or a proprietary driver problem. A potential fix is suggested, and users are asked to test it. Some users report that the fix resolves the build issue, while others still face problems with specific models. The discussion also covers the differences between using AMDVLK and RADV drivers, with recommendations to use RADV for better stability. A user shares detailed steps and experiments to resolve the issue, including using environment variables to disable certain Vulkan features.
- Number of comments this week: None
- Bug: llama.cpp with Vulkan not running on Snapdragon X + Windows (Copilot+PCs): This issue involves a bug where the llama.cpp application, when built with Vulkan, fails to run on devices with Qualcomm Snapdragon X processors running Windows, despite successfully building and running on the CPU. The error occurs during model loading, specifically with the Vulkan compute pipeline, and the user has attempted various troubleshooting steps, including using different drivers and reaching out to Qualcomm for support.
- The comments discuss potential causes and solutions for the issue, including a shader compiler bug in the Adreno driver, attempts to use a Microsoft Vulkan to DX12 driver, and suggestions to manually select the device using environment variables. Users share their experiences and workarounds, such as modifying code to avoid out-of-memory errors and testing different quantization formats, with some success in running models on the Snapdragon X platform. The conversation also touches on performance comparisons and ongoing efforts to resolve the issue, including testing with different models and configurations.
- Number of comments this week: None
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that has had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 21
Summarized Issues:
- Incomplete Version String in Windows CUDA Build: The llama.cpp project has an issue with the Windows CUDA build where the version string is incomplete, missing the 'x64' suffix. This affects the
--version
output for both the llama-cli and llama-server modules. The absence of the expected suffix can lead to confusion and potential misidentification of the build version.
- Server Output Format Bugs: The llama.cpp server has bugs related to output formats and parameters. It fails to provide structured output for
response_format: json_schema
, despite documentation support, and then_probs
parameter does not function correctly in certain Docker images, affecting probability outputs. These issues impact the server's ability to deliver expected results and require fixes to align with documentation and user needs.
- Model Loading and Quantization Errors: Several issues in the llama.cpp project involve errors during model loading and quantization processes. Errors include missing tensors, incorrect tensor counts, and assertion failures related to attention weights. These problems prevent successful model deployment and require updates to scripts and configurations.
- Compilation and Runtime Errors: The llama.cpp project faces compilation and runtime errors across different platforms. These include missing types in iOS Swift projects, crashes on Huawei hardware, and runtime errors due to configuration issues. Addressing these errors is crucial for ensuring compatibility and stability across diverse environments.
- Performance and Feature Requests: Users have requested performance enhancements and new features for the llama.cpp project. These include improving ARM CPU performance, adding support for new models, and enhancing the web UI with code highlighting and formula rendering. Implementing these requests would expand the project's capabilities and user satisfaction.
- Security and Configuration Concerns: The llama.cpp project has encountered security and configuration concerns, such as a false positive Trojan detection and the removal of a configuration option. These issues highlight the need for careful management of security alerts and configuration changes to maintain user trust and functionality.
- Model Performance and Context Limitations: There are issues with model performance and context limitations in the llama.cpp project. Users report slower performance on ARM64 architecture and unexpected context limits with certain models. These issues necessitate optimizations and clarifications to meet user expectations and model specifications.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 39
Summarized Issues:
- Server Code Refactoring and Cleanup: The refactoring and cleanup of server code in the llama.cpp project aim to improve maintainability and ease of contribution. This involves restructuring the current architecture, renaming and organizing code components, and considering the use of existing libraries for efficiency. The focus is on enhancing the codebase without introducing new features.
- Quantization and Performance Enhancements: The integration of the QuaRot quantization scheme and tensor parallelism in the llama.cpp project is proposed to enhance inference speed and memory efficiency. QuaRot utilizes rotations for efficient 4-bit quantization by removing outliers from the hidden state. Tensor parallelism aims to distribute computations across multiple devices, reducing latency through efficient weight splitting.
- Caching and Model Management: Implementing a local caching mechanism for downloaded model files in llama.cpp is discussed, similar to Hugging Face's transformers. This feature would store models in a specified cache directory when the
--model
argument is not provided. The implementation should align with llama.cpp's environment variable naming conventions.
- Feature Requests for Enhanced Functionality: Several feature requests aim to enhance the functionality of the llama.cpp project. These include adding an
echo=True
option for benchmarking, supporting 'tools' and 'tool_choice' parameters for OpenAI compatibility, and implementing ShifTed Rotray position embeddING (STRING) for improved long-context inference. Additionally, requests for integrating Meta's Layer Skip and adding "tokens per second" information in the Web UI are made.
- Bug Fixes and Performance Issues: Various bugs and performance issues are identified in the llama.cpp project, affecting different components. These include text selection reset during inference, incorrect ChatML template formatting, and performance problems on Windows using ROCm. Other issues involve illegal memory access errors with CUDA, Vulkan backend errors, and floating point exceptions leading to Inf or NaN values.
- Compilation and Build Issues: Compilation and build issues are reported in the llama.cpp project, affecting various platforms and configurations. These include errors related to Vulkan shader generation, unsupported GL_KHR_cooperative_matrix extension, and undefined references to
std::filesystem
functions. Solutions involve updating compilers, performing clean builds, and adjusting build configurations.
- Documentation and Usability Improvements: Documentation inconsistencies and usability improvements are addressed in the llama.cpp project. These include updating the README to correct endpoint information, adding documentation for
cache-type-k/v
parameters, and proposing the addition of syntax code coloring for enhanced readability. These changes aim to improve user experience and reduce confusion.