Weekly GitHub Report for Llama.cpp - 2024-12-09 12:00:20
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4285
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Feature Request: Support for Qwen2-VL: This issue is a feature request for adding support for the Qwen2-VL model in the llama.cpp project, highlighting its state-of-the-art capabilities in visual understanding and video analysis. The request emphasizes the model's potential to enhance the project with advanced image and video comprehension features; the model is released under the Apache 2.0 license.
- The comments section shows a mix of enthusiasm and technical discussions. Many users express support for the feature with "+1" comments, while others discuss technical challenges and progress updates. Some users report issues with implementation, such as compatibility with CUDA and CPU, and share solutions or workarounds. There are also discussions about the model's performance, quantization, and potential improvements, with some users offering to contribute to the development.
- Number of comments this week: None
- Performance bug: Speculative Decoding Performance Degradation: This issue reports a performance bug in the llama.cpp project, where speculative decoding is unexpectedly decreasing token generation speed across various model configurations and hardware platforms, including NVIDIA A100 and Apple M2 Pro. The user provides benchmark results showing a performance degradation ranging from 6.9% to 23.0% when speculative decoding is enabled, contrary to the expected behavior of increased speed (a rough cost model of this trade-off is sketched after this list).
- The comments discuss various attempts to resolve the performance issue, including suggestions to use quantized draft models and adjust parameters like top-k and temperature. Users share their experiences with different hardware setups and model configurations, noting that speculative decoding often results in slower performance. Some users report specific scenarios where speculative decoding provides a speed-up, but these are rare and highly dependent on the prompt and quantization settings. The discussion highlights the challenge of finding suitable prompts and configurations to achieve the intended performance benefits of speculative decoding.
- Number of comments this week: None
- Bug: Flash Attention performs worse under ROCM: This issue highlights a performance degradation when using Flash Attention under ROCm, particularly with the 7900 XTX GPU, where enabling Flash Attention results in significantly reduced performance during prompt processing and token generation, especially at larger batch sizes. The user is seeking a solution to this problem, as Flash Attention is necessary for quantization of the KV-cache, but the current performance hit is substantial.
- The comments discuss the known issue of poor performance with the HIP port of the CUDA FlashAttention kernel for large batch sizes, and the lack of current development focus on AMD performance. There is a conversation about the potential for optimizations if AMD hardware becomes more popular, and a user shares a forked branch that improves performance on RDNA3. The discussion also covers the challenges of optimizing for AMD, the potential for new developers to contribute, and the possibility of reaching out to AMD for support. Some users express interest in contributing to the project, and there is mention of existing resources and third-party implementations that could aid in developing a solution.
- Number of comments this week: None
- Misc. bug: Inconsistent Vulkan segfault: This issue reports an inconsistent segmentation fault occurring in a Vulkan-based application on Linux, specifically when using Nvidia drivers, which suggests a potential problem with the ggml-vulkan backend not properly destroying Vulkan instances or devices before the process terminates. The problem is reproducible by running a simple program multiple times, and the segmentation fault is observed in Nvidia driver threads, indicating a possible driver bug or resource cleanup issue.
- The comments discuss potential causes and solutions, including updating drivers, adding functions to properly destroy Vulkan resources, and using static destructors. There is a suggestion to implement a preferred backend selection to avoid Vulkan on Nvidia systems, and a workaround using CUDA is shared. The conversation also touches on dynamically loading backends to improve compatibility and resource usage.
- Number of comments this week: None
- Feature Request: Add "tokens per second" information in the Web UI: This issue is a feature request to add "tokens per second" information in the Web UI of the project, which would help users understand the prompt processing and text generation speeds. The motivation behind this request is to allow users to investigate how different parameters affect performance, although no specific implementation details have been provided (a client-side sketch of reading the server's timing data follows this list).
- A user expressed interest in taking ownership of the issue and asked to be assigned, having reviewed the project's guidelines. Another commenter mentioned that related work is already in progress under a different issue number. A third commenter acknowledged this information, and a fourth suggested that the interested user could focus on implementing the frontend since the backend API was already added. It was also noted that this feature should ideally be added after certain pull requests to avoid conflicts.
- Number of comments this week: None
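For the speculative decoding report above, a rough cost model (not taken from the issue) helps explain why a mismatched draft model can reduce throughput rather than increase it. The sketch below assumes a fixed per-token acceptance probability and measures cost in units of one target-model forward pass; real behavior also depends on batching, prompt content, and hardware.

```python
# Back-of-envelope model of speculative decoding throughput.
#   alpha      - probability the target model accepts each drafted token (assumed fixed)
#   k          - number of tokens drafted per verification step
#   draft_cost - cost of one draft forward pass, in target-forward-pass units

def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    accepted = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected tokens kept per step
    cost = k * draft_cost + 1                        # k draft passes + 1 verification pass
    return accepted / cost

for alpha in (0.5, 0.7, 0.9):
    for draft_cost in (0.05, 0.2, 0.5):
        s = expected_speedup(alpha, k=5, draft_cost=draft_cost)
        print(f"alpha={alpha:.1f} draft_cost={draft_cost:.2f} -> ~{s:.2f}x")
```

Under these assumptions, speculation only pays off when the draft model is both much cheaper than the target and accepted often; otherwise the expected factor drops below 1.0x, which is consistent with the slowdowns reported in the thread.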
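For the tokens-per-second request above, the comments note that the server side already reports timing data. The snippet below is a hypothetical client-side sketch: the endpoint, payload fields, and timing field names are assumptions based on recent llama-server builds, so verify them against your version before relying on them.

```python
import json
import urllib.request

# Assumed local llama-server endpoint and minimal completion payload.
url = "http://127.0.0.1:8080/completion"
payload = {"prompt": "Hello", "n_predict": 32}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

# Prompt-processing and generation speeds, if the server includes a
# `timings` object in the response (field names may vary by version).
timings = data.get("timings", {})
print("prompt t/s:   ", timings.get("prompt_per_second"))
print("generate t/s: ", timings.get("predicted_per_second"))
```

A Web UI frontend would render the same fields next to each response instead of printing them.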
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 25
Summarized Issues:
- Compilation and Build Errors: Several issues in the llama.cpp project relate to compilation and build errors, particularly when using CUDA and Vulkan. One issue involves a syntax error in the `scripts/build-info.sh` file, causing a build failure on Linux with CUDA enabled. Another issue describes a compilation problem with CMake for CUDA due to missing data types, while a separate issue involves a `__device__` variable incorrectly marked as `constexpr`, leading to a build failure. Additionally, there is a compilation error when building the Vulkan backend on Ubuntu 24.04 due to undeclared variables.
- Model Conversion and Inference Issues: The llama.cpp project faces multiple issues with model conversion and inference (a conversion sketch follows this list). One issue involves a failure in converting the Meta-Llama-3.1-8B-Instruct model to GGUF format due to errors like a missing tokenizer file. Another issue describes a bug in the `convert_lora_to_gguf` script, which ignores the specified output type and produces files in FP32 format. Additionally, there is a problem with the `convert_hf_to_gguf.py` script, where converting the `llama-3.2-11B-vision` model results in incorrect inference outputs.
- Performance and Optimization Concerns: Performance issues are a recurring theme in the llama.cpp project, affecting both speed and output quality. One issue reports a performance bug where speculative decoding degrades token generation speed. Another issue describes a lack of performance speedup when running multiple RPC servers, despite increased CPU core usage. Additionally, there is a significant degradation in output quality when using compressed cache types in the llama-cli module.
- Feature Requests and Enhancements: The llama.cpp project includes several feature requests aimed at enhancing functionality and user experience. One request is to merge modifications supporting the Llama-3_1-Nemotron-51 model into the main branch. Another request seeks to enhance the `GGUFWriter` functionality to allow specifying an alternative output directory (see the writer sketch after this list). Additionally, there is a request for the llama-server to support hot swapping and scaling of control vectors via an API.
- Platform-Specific Bugs and Errors: Various platform-specific bugs affect the llama.cpp project, impacting different operating systems and hardware. One issue involves a bug where the `default.metallib` file cannot be located on macOS. Another issue describes a bug where the `llama-embedding` tool crashes on macOS with M1 Max chips due to unsupported instructions. Additionally, there is a problem with the `llama-imatrix.exe` application inconsistently loading computations onto the CPU instead of the GPU.
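For the conversion issues grouped above, the sketch below shows how the two conversion scripts are typically driven with an explicit output type. Paths are placeholders, and the flags (--outfile, --outtype, --base) are assumptions based on the scripts' current help text; check `--help` in your checkout before use.

```python
import subprocess

MODEL_DIR = "models/Meta-Llama-3.1-8B-Instruct"  # placeholder path
LORA_DIR = "loras/my-adapter"                    # placeholder path

# Convert a Hugging Face checkpoint to GGUF with an explicit output type.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", "model-q8_0.gguf", "--outtype", "q8_0"],
    check=True,
)

# Convert a LoRA adapter against the same base model. The issue above
# reports that --outtype is currently ignored here and FP32 is produced.
subprocess.run(
    ["python", "convert_lora_to_gguf.py", LORA_DIR, "--base", MODEL_DIR,
     "--outfile", "adapter-q8_0.gguf", "--outtype", "q8_0"],
    check=True,
)
```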
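For the GGUFWriter request above, the minimal writer sketch below is modeled on the gguf-py example script and illustrates that the output path is currently fixed when the writer is constructed, which is exactly what an alternative-output-directory option would relax. Treat the call sequence as an assumption against the current gguf-py API rather than a reference implementation.

```python
import numpy as np
from gguf import GGUFWriter

# The destination file is committed up front; the feature request asks for
# a way to direct output to a different directory.
writer = GGUFWriter("example.gguf", "llama")

writer.add_block_count(12)                                   # example metadata
writer.add_uint32("answer", 42)                              # arbitrary key/value pair
writer.add_tensor("tensor1", np.ones((32, 32), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```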
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 29
Summarized Issues:
- Feature Requests for Model Support: Several issues involve requests for supporting new models in the llama.cpp project. Users have requested the addition of the "stella_en_400M" model and its conversion to the "gguf" format, as well as support for the Ministral-8B-Instruct-2410 model with its advanced features. Another request is for the OmniGen model, which would enhance the project's capabilities by enabling multimodal processing.
- Bugs in Model Conversion and Execution: Various issues highlight bugs in model conversion and execution processes. A `TypeError` occurs during model conversion to GGUF format due to incorrect handling of the `license` field, while another issue involves a crash in the imatrix tool on Mac due to 'nan' values in computations. Additionally, a bug in the `deserialize_tensor` function causes crashes in the RPC server due to zero dimensions in tensors.
- Compilation and Build Errors: Several issues report errors during the compilation and build processes on different platforms. Users have encountered problems on Windows x64 with the LLAMA_CURL option, on NVIDIA Jetson AGX Xavier due to undefined identifiers, and on macOS with undeclared identifiers and missing symbols. These issues often require specific workarounds or adjustments to the build environment.
- Performance and Output Issues: Users have reported performance drops and incorrect outputs in various scenarios. A performance regression is noted in the Android aarch64 Neon implementation, while another issue involves a 16% performance drop with the `q8_0` key-value cache type (a benchmarking sketch appears at the end of this report). Additionally, users have experienced meaningless output when generating JSON responses and incorrect output with multiple AMD GPUs.
- Crashes and Memory Management Bugs: Several issues involve crashes and memory management bugs in the llama.cpp project. Crashes occur in the Android application when unloading models, and on Mac systems when running the `llama-cli` tool. Memory management issues are also noted in the `llama-server` module, leading to potential segmentation faults.
- GPU Selection and Performance Optimization: Users have requested features and reported issues related to GPU selection and performance optimization. A feature request seeks the ability to select a specific Metal-compatible GPU on macOS, while another issue involves the llama-cli tool selecting a lower-performance GPU by default. These issues highlight the need for better GPU management in the project.
- Documentation and Usability Concerns: Some issues address documentation and usability concerns in the llama.cpp project. Users have reported missing version information when running commands, incorrect documentation for command-line parameters, and confusion about the compilation process. These issues suggest a need for improved documentation and user guidance.
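Several of the performance items above hinge on quantized KV-cache settings, which in turn require Flash Attention. The harness below is a hypothetical timing comparison: the binary path, model path, and prompt are placeholders, and the flags (-fa, --cache-type-k, --cache-type-v) reflect current llama-cli options but should be verified against your build.

```python
import subprocess
import time

def timed_run(extra_args):
    """Time one llama-cli generation with the given extra flags."""
    cmd = ["./llama-cli", "-m", "models/model.gguf",   # placeholder paths
           "-p", "Write a short story.", "-n", "128"]
    start = time.time()
    subprocess.run(cmd + extra_args, check=True, capture_output=True)
    return time.time() - start

baseline = timed_run([])                                        # default f16 KV cache
quantized = timed_run(["-fa", "--cache-type-k", "q8_0",
                       "--cache-type-v", "q8_0"])               # quantized KV cache
print(f"f16 cache:  {baseline:.1f}s")
print(f"q8_0 cache: {quantized:.1f}s")
```

Wall-clock timing of the whole process is crude (it includes model load time); the per-token timings printed by llama-cli itself give a cleaner comparison.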