Weekly GitHub Report for Llama.cpp - 2024-12-09 12:00:20
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4285
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Feature Request: Support for Qwen2-VL: This issue is a feature request for adding support for the Qwen2-VL model in the llama.cpp project, highlighting its state-of-the-art capabilities in visual understanding and video analysis. The request emphasizes the model's potential to enhance the project with advanced image and video comprehension features; the model is released under the Apache 2.0 license.
- The comments section shows a mix of enthusiasm and technical discussions. Many users express support for the feature with "+1" comments, while others discuss technical challenges and progress updates. Some users report issues with implementation, such as compatibility with CUDA and CPU, and share solutions or workarounds. There are also discussions about the model's performance, quantization, and potential improvements, with some users offering to contribute to the development.
- Number of comments this week: None
- Performance bug: Speculative Decoding Performance Degradation: This issue reports a performance bug in the llama.cpp project, where speculative decoding is unexpectedly decreasing token generation speed across various model configurations and hardware platforms, including NVIDIA A100 and Apple M2 Pro. The user provides benchmark results showing a performance degradation ranging from 6.9% to 23.0% when speculative decoding is enabled, contrary to the expected behavior of increased speed (a rough cost model of this trade-off is sketched after this list).
- The comments discuss various attempts to resolve the performance issue, including suggestions to use quantized draft models and adjust parameters like top-k and temperature. Users share their experiences with different hardware setups and model configurations, noting that speculative decoding often results in slower performance. Some users report specific scenarios where speculative decoding provides a speed-up, but these are rare and highly dependent on the prompt and quantization settings. The discussion highlights the challenge of finding suitable prompts and configurations to achieve the intended performance benefits of speculative decoding.
- Number of comments this week: None
- Bug: Flash Attention performs worse under ROCM: This issue highlights a performance degradation when using Flash Attention under ROCm, particularly with the 7900 XTX GPU, where enabling Flash Attention results in significantly reduced performance during prompt processing and token generation, especially at larger batch sizes. The user is seeking a solution to this problem, as Flash Attention is necessary for quantization of the KV-cache, but the current performance hit is substantial.
- The comments discuss the known issue of poor performance with the HIP port of the CUDA FlashAttention kernel for large batch sizes, and the lack of current development focus on AMD performance. There is a conversation about the potential for optimizations if AMD hardware becomes more popular, and a user shares a forked branch that improves performance on RDNA3. The discussion also covers the challenges of optimizing for AMD, the potential for new developers to contribute, and the possibility of reaching out to AMD for support. Some users express interest in contributing to the project, and there is mention of existing resources and third-party implementations that could aid in developing a solution.
- Number of comments this week: None
- Misc. bug: Inconsistent Vulkan segfault: This issue reports an inconsistent segmentation fault occurring in a Vulkan-based application on Linux, specifically when using Nvidia drivers, which suggests a potential problem with the ggml-vulkan backend not properly destroying Vulkan instances or devices before the process terminates. The problem is reproducible by running a simple program multiple times, and the segmentation fault is observed in Nvidia driver threads, indicating a possible driver bug or resource cleanup issue.
- The comments discuss potential causes and solutions, including updating drivers, adding functions to properly destroy Vulkan resources, and using static destructors. There is a suggestion to implement a preferred backend selection to avoid Vulkan on Nvidia systems, and a workaround using CUDA is shared. The conversation also touches on dynamically loading backends to improve compatibility and resource usage.
- Number of comments this week: None
- Feature Request: Add "tokens per second" information in the Web UI: This issue is a feature request to add "tokens per second" information in the Web UI of the project, which would help users understand the prompt processing and text generation speeds. The motivation behind this request is to allow users to investigate how different parameters affect performance, although no specific implementation details have been provided (a client-side sketch of reading the server's timing data follows this list).
- A user expressed interest in taking ownership of the issue and asked to be assigned, having reviewed the project's guidelines. Another commenter mentioned that related work is already in progress under a different issue number. A third commenter acknowledged this information, and a fourth suggested that the interested user could focus on implementing the frontend since the backend API was already added. It was also noted that this feature should ideally be added after certain pull requests to avoid conflicts.
- Number of comments this week: None
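For the speculative decoding report above, a rough cost model (not taken from the issue) helps explain why a mismatched draft model can reduce throughput rather than increase it. The sketch below assumes a fixed per-token acceptance probability and measures cost in units of one target-model forward pass; real behavior also depends on batching, prompt content, and hardware.

```python
# Back-of-envelope model of speculative decoding throughput.
#   alpha      - probability the target model accepts each drafted token (assumed fixed)
#   k          - number of tokens drafted per verification step
#   draft_cost - cost of one draft forward pass, in target-forward-pass units

def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    accepted = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected tokens kept per step
    cost = k * draft_cost + 1                        # k draft passes + 1 verification pass
    return accepted / cost

for alpha in (0.5, 0.7, 0.9):
    for draft_cost in (0.05, 0.2, 0.5):
        s = expected_speedup(alpha, k=5, draft_cost=draft_cost)
        print(f"alpha={alpha:.1f} draft_cost={draft_cost:.2f} -> ~{s:.2f}x")
```

Under these assumptions, speculation only pays off when the draft model is both much cheaper than the target and accepted often; otherwise the expected factor drops below 1.0x, which is consistent with the slowdowns reported in the thread.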
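For the tokens-per-second request above, the comments note that the server side already reports timing data. The snippet below is a hypothetical client-side sketch: the endpoint, payload fields, and timing field names are assumptions based on recent llama-server builds, so verify them against your version before relying on them.

```python
import json
import urllib.request

# Assumed local llama-server endpoint and minimal completion payload.
url = "http://127.0.0.1:8080/completion"
payload = {"prompt": "Hello", "n_predict": 32}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

# Prompt-processing and generation speeds, if the server includes a
# `timings` object in the response (field names may vary by version).
timings = data.get("timings", {})
print("prompt t/s:   ", timings.get("prompt_per_second"))
print("generate t/s: ", timings.get("predicted_per_second"))
```

A Web UI frontend would render the same fields next to each response instead of printing them.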
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 25
Summarized Issues:
- Compilation and Build Errors: Several issues in the llama.cpp project relate to compilation and build errors, particularly when using CUDA and Vulkan. One issue involves a syntax error in the `scripts/build-info.sh` file, causing a build failure on Linux with CUDA enabled. Another issue describes a compilation problem with CMake for CUDA due to missing data types, while a separate issue involves a `__device__` variable incorrectly marked as `constexpr`, leading to a build failure. Additionally, there is a compilation error when building the Vulkan backend on Ubuntu 24.04 due to undeclared variables.
- Model Conversion and Inference Issues: The llama.cpp project faces multiple issues with model conversion and inference (a conversion sketch follows this list). One issue involves a failure in converting the Meta-Llama-3.1-8B-Instruct model to GGUF format due to errors like a missing tokenizer file. Another issue describes a bug in the `convert_lora_to_gguf` script, which ignores the specified output type and produces files in FP32 format. Additionally, there is a problem with the `convert_hf_to_gguf.py` script, where converting the `llama-3.2-11B-vision` model results in incorrect inference outputs.
- Performance and Optimization Concerns: Performance issues are a recurring theme in the llama.cpp project, affecting both speed and output quality. One issue reports a performance bug where speculative decoding degrades token generation speed. Another issue describes a lack of performance speedup when running multiple RPC servers, despite increased CPU core usage. Additionally, there is a significant degradation in output quality when using compressed cache types in the llama-cli module.
- Feature Requests and Enhancements: The llama.cpp project includes several feature requests aimed at enhancing functionality and user experience. One request is to merge modifications supporting the Llama-3_1-Nemotron-51 model into the main branch. Another request seeks to enhance the `GGUFWriter` functionality to allow specifying an alternative output directory (see the writer sketch after this list). Additionally, there is a request for the llama-server to support hot swapping and scaling of control vectors via an API.
- Platform-Specific Bugs and Errors: Various platform-specific bugs affect the llama.cpp project, impacting different operating systems and hardware. One issue involves a bug where the `default.metallib` file cannot be located on macOS. Another issue describes a bug where the `llama-embedding` tool crashes on macOS with M1 Max chips due to unsupported instructions. Additionally, there is a problem with the `llama-imatrix.exe` application inconsistently loading computations onto the CPU instead of the GPU.
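For the conversion issues grouped above, the sketch below shows how the two conversion scripts are typically driven with an explicit output type. Paths are placeholders, and the flags (--outfile, --outtype, --base) are assumptions based on the scripts' current help text; check `--help` in your checkout before use.

```python
import subprocess

MODEL_DIR = "models/Meta-Llama-3.1-8B-Instruct"  # placeholder path
LORA_DIR = "loras/my-adapter"                    # placeholder path

# Convert a Hugging Face checkpoint to GGUF with an explicit output type.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", "model-q8_0.gguf", "--outtype", "q8_0"],
    check=True,
)

# Convert a LoRA adapter against the same base model. The issue above
# reports that --outtype is currently ignored here and FP32 is produced.
subprocess.run(
    ["python", "convert_lora_to_gguf.py", LORA_DIR, "--base", MODEL_DIR,
     "--outfile", "adapter-q8_0.gguf", "--outtype", "q8_0"],
    check=True,
)
```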
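For the GGUFWriter request above, the minimal writer sketch below is modeled on the gguf-py example script and illustrates that the output path is currently fixed when the writer is constructed, which is exactly what an alternative-output-directory option would relax. Treat the call sequence as an assumption against the current gguf-py API rather than a reference implementation.

```python
import numpy as np
from gguf import GGUFWriter

# The destination file is committed up front; the feature request asks for
# a way to direct output to a different directory.
writer = GGUFWriter("example.gguf", "llama")

writer.add_block_count(12)                                   # example metadata
writer.add_uint32("answer", 42)                              # arbitrary key/value pair
writer.add_tensor("tensor1", np.ones((32, 32), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```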
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 29
Summarized Issues:
- Feature Requests for Model Support: Several issues involve requests for supporting new models in the llama.cpp project. Users have requested the addition of the "stella_en_400M" model and its conversion to the "gguf" format, as well as support for the Ministral-8B-Instruct-2410 model with its advanced features. Another request is for the OmniGen model, which would enhance the project's capabilities by enabling multimodal processing.
- Bugs in Model Conversion and Execution: Various issues highlight bugs in model conversion and execution processes. A `TypeError` occurs during model conversion to GGUF format due to incorrect handling of the `license` field, while another issue involves a crash in the imatrix tool on Mac due to 'nan' values in computations. Additionally, a bug in the `deserialize_tensor` function causes crashes in the RPC server due to zero dimensions in tensors.
- Compilation and Build Errors: Several issues report errors during the compilation and build processes on different platforms. Users have encountered problems on Windows x64 with the LLAMA_CURL option, on NVIDIA Jetson AGX Xavier due to undefined identifiers, and on macOS with undeclared identifiers and missing symbols. These issues often require specific workarounds or adjustments to the build environment.
- Performance and Output Issues: Users have reported performance drops and incorrect outputs in various scenarios. A performance regression is noted in the Android aarch64 Neon implementation, while another issue involves a 16% performance drop with the `q8_0` key-value cache type (a benchmarking sketch appears at the end of this report). Additionally, users have experienced meaningless output when generating JSON responses and incorrect output with multiple AMD GPUs.
- Crashes and Memory Management Bugs: Several issues involve crashes and memory management bugs in the llama.cpp project. Crashes occur in the Android application when unloading models, and on Mac systems when running the `llama-cli` tool. Memory management issues are also noted in the `llama-server` module, leading to potential segmentation faults.
- GPU Selection and Performance Optimization: Users have requested features and reported issues related to GPU selection and performance optimization. A feature request seeks the ability to select a specific Metal-compatible GPU on macOS, while another issue involves the llama-cli tool selecting a lower-performance GPU by default. These issues highlight the need for better GPU management in the project.
- Documentation and Usability Concerns: Some issues address documentation and usability concerns in the llama.cpp project. Users have reported missing version information when running commands, incorrect documentation for command-line parameters, and confusion about the compilation process. These issues suggest a need for improved documentation and user guidance.
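Several of the performance items above hinge on quantized KV-cache settings, which in turn require Flash Attention. The harness below is a hypothetical timing comparison: the binary path, model path, and prompt are placeholders, and the flags (-fa, --cache-type-k, --cache-type-v) reflect current llama-cli options but should be verified against your build.

```python
import subprocess
import time

def timed_run(extra_args):
    """Time one llama-cli generation with the given extra flags."""
    cmd = ["./llama-cli", "-m", "models/model.gguf",   # placeholder paths
           "-p", "Write a short story.", "-n", "128"]
    start = time.time()
    subprocess.run(cmd + extra_args, check=True, capture_output=True)
    return time.time() - start

baseline = timed_run([])                                        # default f16 KV cache
quantized = timed_run(["-fa", "--cache-type-k", "q8_0",
                       "--cache-type-v", "q8_0"])               # quantized KV cache
print(f"f16 cache:  {baseline:.1f}s")
print(f"q8_0 cache: {quantized:.1f}s")
```

Wall-clock timing of the whole process is crude (it includes model load time); the per-token timings printed by llama-cli itself give a cleaner comparison.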