Weekly GitHub Report for Llama.cpp - 2024-12-30 12:00:17
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4394
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Misc. bug: Buffer offset is not aligned on macOS / Intel / Vulkan: This issue is about a bug in the llama.cpp project where the buffer offset is not aligned on macOS with Intel processors using Vulkan, causing garbled model output after approximately 400 words. The problem is linked to a validation error indicating that the buffer offset must be a multiple of the device's minimum storage buffer offset alignment, which is not being met (a minimal sketch of this alignment rule appears after this list).
- The comments discuss attempts to reproduce the issue on different systems, with one user suggesting a potential fix that removes validation errors but results in completely garbled output. Further suggestions include building with specific flags to identify the failing node, and tests reveal numerous failures in the MUL_MAT operations, indicating a possible driver or compiler bug. A proposed pull request is tested but does not resolve the issue, leading to speculation about a compiler or MoltenVK bug.
- Number of comments this week: None
- Misc. bug: Vulkan backend with 7900XTX has severe performance dropoff at some batch sizes: This issue reports a significant performance drop when using the Vulkan backend with the AMD Radeon RX 7900 XTX graphics card at certain batch sizes, particularly sizes 2 and 4, which severely impacts the usability of speculative decoding. The user is seeking solutions to improve performance, as the current inefficiencies make the Vulkan backend impractical for their needs.
- The comments discuss the difficulty of optimizing small-batch kernels and suggest increasing the draft size and probability to improve performance. There is a consensus that the performance drop from non-batched to batched processing is unusually high, indicating a potential bug. A contributor explains the backend's matrix multiplication paths and suggests improvements, with plans to address the issue after another update. The user is encouraged to test a proposed fix and provide feedback.
- Number of comments this week: None
- Misc. bug: All llama executables exit immediately without console output: This issue describes a problem where all llama executables exit immediately without providing any console output on Windows, affecting both SYCL and CPU-only builds. The user reports that while some prebuilt binaries work as expected, others do not, and they suspect the issue might be related to OpenMP installations or other build configurations.
- The comments discuss potential solutions and troubleshooting steps, including using the `--log-verbose` flag, which did not yield any results, and sharing a stack trace for further analysis. Suggestions include reinstalling the Microsoft Visual C++ redistributable, checking whether a specific commit resolves the issue, and ensuring the oneAPI runtime is activated. Despite these efforts, the problem persists, and further investigation is needed.
- Number of comments this week: None
- Research: Performance differences between Metal (macOS) and Vulkan (Linux): This issue involves a developer from the Asahi Linux GPU drivers team seeking assistance to improve the performance of llama.cpp on Apple Silicon platforms using the Vulkan backend, as they have observed that macOS performs significantly faster than Linux in their tests. The developer is requesting help to understand the performance differences between the Metal and Vulkan backends, how workloads are scheduled, and how to conduct micro-benchmarks to identify potential driver and shader compiler issues.
- The comments discuss the state of the Honeykrisp driver and shader compiler, with suggestions to run specific tests to identify performance bottlenecks. It is noted that the Vulkan shaders are well-tuned, but there might be room for optimization on the Apple hardware. The conversation highlights the importance of cooperative matrix support for performance and mentions recent improvements in the Vulkan backend, particularly on Nvidia and AMD hardware, while acknowledging the need for further tuning on Apple platforms.
- Number of comments this week: None
- Feature Request: Molmo 72B vision support: This issue is a feature request for the llama.cpp project to support the Molmo 72B vision model, which is a combination of the Qwen2-72B and OpenAI CLIP architectures. The requester has provided links to the model and a gist for reference, and expresses a need for guidance on implementing this feature.
- The comments discuss the architecture of the Molmo models, with some users expressing support and interest in the feature. One user shares a method for listing Molmo tensors and provides a link to a guide for adding custom models. There is a suggestion to contact AllenAI for support in creating a GGUF format for the model, and some users note that llama.cpp is becoming outdated.
- Number of comments this week: None
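As context for the buffer-alignment bug above: Vulkan requires every storage-buffer binding offset to be a multiple of the device's `minStorageBufferOffsetAlignment` limit. The snippet below is a minimal, self-contained sketch of that rounding rule, not llama.cpp's actual allocator code; the 256-byte limit is a hypothetical stand-in for the value a real device reports in `VkPhysicalDeviceLimits`.

```cpp
#include <cstdint>
#include <iostream>

// Round `offset` up to the next multiple of `alignment`.
// Vulkan alignment limits are powers of two, so bit masking works.
static uint64_t align_up(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

int main() {
    // Hypothetical device limit; real code would query
    // VkPhysicalDeviceLimits::minStorageBufferOffsetAlignment instead.
    const uint64_t min_align = 256;

    // An offset a sub-allocator might hand out if it ignores the limit.
    const uint64_t raw_offset = 1000;
    const uint64_t aligned    = align_up(raw_offset, min_align);

    std::cout << "raw=" << raw_offset << " aligned=" << aligned
              << " valid=" << (aligned % min_align == 0) << '\n';
}
```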
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 20
Summarized Issues:
- Feature Requests for Model Support and Conversion: The llama.cpp project has received multiple feature requests to enhance model support and conversion capabilities. One request is to implement IndexedDB for the server web UI to handle more than 5MB of data efficiently. Another request is to support the conversion of quantized models to the gguf format for deployment through ollama. Additionally, there is a request to add support for the DeepSeek-v3 model, which requires offloading to RAM due to its large size.
- Bugs in Model Execution and Output: Several bugs have been reported in the llama.cpp project affecting model execution and output. A missing `#else` clause in a preprocessor directive causes crashes on FreeBSD 14.2 (a sketch of this bug class appears after this list). Another bug involves incorrect text formatting with excessive punctuation marks. Additionally, a bug in the "cache_prompt" parameter results in longer response times when switching prompts.
- Memory and Performance Issues: The llama.cpp project faces several memory and performance-related issues. A memory allocation error occurs when running the qwen2-vl-2B model on a Jetson AGX Orin. There is also a severe performance drop-off in the Vulkan backend with the AMD Radeon RX 7900 XTX graphics card. Additionally, an out-of-memory error is encountered when running the Qwen2-VL model on Windows due to high-resolution input images.
- Server and Module Crashes: Crashes in the llama-server module have been reported under various conditions. Sending a large request with full context length causes a server crash due to a fatal error in the ggml-cpu.c file. A segmentation fault and unsupported operation error occur on a Mac system when using the llama-server module. Additionally, a significant performance regression is observed in the llama-server module on Windows.
- Compilation and Execution Errors: The llama.cpp project encounters various compilation and execution errors. A compile-time linking error occurs on Linux when using gcc8 due to unresolved references to the `std::filesystem` library (a minimal reproduction appears after this list). Users face problems converting a model to Llama.cpp GGUF on a Kaggle Notebook due to deprecated Makefile builds. Additionally, a bug in the Qwen.cpp software results in an "unknown token" error on a Jetson Orin Nano.
- Vulkan and GPU Performance Discrepancies: Discrepancies in Vulkan and GPU performance have been identified in the llama.cpp project. A developer from the Asahi Linux GPU drivers team seeks to improve performance discrepancies between Metal on macOS and Vulkan on Linux. A bug in the Vulkan implementation on macOS with Intel hardware causes garbled model output. Additionally, there is a feature request to split a model over multiple Vulkan GPUs to enhance performance.
- Miscellaneous Tool and Parameter Issues: The llama.cpp project also faces miscellaneous issues with tools and parameters. The `llama-qwen2vl-cli` tool ignores `--log*` options, resulting in unwanted messages being output to stdout. A bug in the Llama-3_1-Nemotron-51B model results in incorrect outputs for prompts close to or exceeding 4,000 tokens.
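Regarding the FreeBSD crash above: a platform `#if`/`#elif` ladder with no `#else` fallback silently leaves symbols undefined on any platform the ladder does not list. The example below is a hypothetical illustration of that bug class, not the actual llama.cpp code:

```cpp
#include <cstdio>

// Hypothetical platform ladder. Without the #else branch, a platform
// matching neither check (FreeBSD, for example) would leave
// BACKEND_NAME undefined, breaking the build or the binary.
#if defined(__linux__)
#define BACKEND_NAME "linux"
#elif defined(__APPLE__)
#define BACKEND_NAME "darwin"
#else
// The fix: a fallback so every platform gets a definition.
#define BACKEND_NAME "generic"
#endif

int main() {
    std::printf("backend: %s\n", BACKEND_NAME);
}
```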
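As for the gcc8 linking error: GCC 8 ships `std::filesystem` in a separate library, so programs using it must link `-lstdc++fs` explicitly (GCC 9 and later link it automatically). A minimal reproduction, with the assumed compile commands in the comments:

```cpp
// fs_check.cpp -- minimal std::filesystem use.
//
// GCC 8:           g++-8 -std=c++17 fs_check.cpp -lstdc++fs
// GCC 9 and later: g++ -std=c++17 fs_check.cpp
//
// Omitting -lstdc++fs on GCC 8 yields undefined references to
// std::filesystem symbols at link time, matching the error above.
#include <filesystem>
#include <iostream>

int main() {
    std::cout << std::filesystem::current_path() << '\n';
}
```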
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 25
Summarized Issues:
- Passkey Feature Malfunction: The passkey feature in the server had been malfunctioning for a week before the issue was closed. It was resolved by the introduction of new models that support long context, which led to the removal of self context extend support. The change aimed to improve the server's functionality by leveraging the capabilities of the new models.
- Model Context Handling Issues: The Phi-3 4K model exhibits incorrect output after processing around 2048 tokens, despite its capability to handle up to 4096 tokens. This suggests a problem with the model's context handling or sliding window attention mechanism. Such issues can affect the model's performance and reliability in processing longer sequences.
- Compatibility and Compilation Problems: The llama3.1 model fails to load on Windows due to an incompatible avx2 version, and users face issues building the `llama.cpp` project with the C++23 standard. These problems highlight the need for better compatibility checks and documentation updates to guide users on supported configurations. Addressing these issues can prevent user frustration and improve the project's usability.
- Server and Build Process Bugs: Long repeated strings cause server crashes due to regex processing errors, and using ccache on Windows disrupts the SYCL backend build process. These bugs indicate underlying issues in memory handling and file management during builds. Solutions involve adjusting regex stack size and ccache settings to ensure stability and successful builds.
- Performance and Efficiency Enhancements: Performance slowdowns occur with certain RPC servers in distributed setups, and feature requests aim to improve RPC model loading and support for Flash Attention 3. These enhancements focus on optimizing data transfer, communication, and inference speed. Implementing these changes can significantly boost the project's performance and user experience.
- Cross-Compilation and GPU Support Issues: Cross-compiling for Android with Vulkan libraries and supporting Airllm for smaller GPUs present challenges. These issues highlight the need for better cross-platform support and compatibility with various hardware configurations. Addressing these can lower entry barriers and expand the project's accessibility.
- Documentation and Feature Requests: Bugs in documentation and vague feature requests like "adderALL" indicate areas for improvement in user guidance and feature clarity. Clear documentation and well-defined feature requests are crucial for effective project development and user satisfaction. Enhancing these aspects can streamline user interactions and project contributions.
- Encoding and Embedding Enhancements: A feature request to support `"encoding_format": "base64"` in embeddings endpoints aims to reduce JSON payload size and improve compatibility with OpenAI's API (a sketch appears at the end of this section). This enhancement can streamline data handling and integration with external services. Implementing such features can enhance the project's interoperability and efficiency.
- Streaming and API Issues: Bugs in streaming generation and API responses affect the project's functionality. These issues involve duplicated text during streaming and empty model identifiers in API responses. Resolving these bugs is essential for maintaining the project's reliability and ensuring seamless user interactions.
- Compilation and Platform-Specific Errors: Compilation failures on various platforms, including Android and BSD, highlight challenges in maintaining cross-platform compatibility. These errors involve issues with ARM features and pointer qualifiers, necessitating updates to build configurations. Addressing these can ensure broader platform support and smoother development processes.
- ROCm and Kernel Support Issues: A bug related to ROCm p2p copy operation failure due to kernel support issues affects certain Linux distributions. This problem underscores the importance of ensuring proper runtime support and kernel compatibility. Addressing these issues can enhance the project's robustness and deployment flexibility.
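On the base64 embeddings request above: the win is payload size, since a float rendered as JSON text costs roughly 8 to 10 characters while base64 packs its 4 raw bytes into about 5.3. The sketch below shows the general technique under the assumption (matching OpenAI's format) that the float32 array is encoded byte-for-byte; the `base64_encode` helper is written here for illustration and is not taken from llama.cpp's server code:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Encode raw bytes as base64 (standard alphabet, '=' padding).
static std::string base64_encode(const uint8_t* data, std::size_t len) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789+/";
    std::string out;
    out.reserve((len + 2) / 3 * 4);
    for (std::size_t i = 0; i < len; i += 3) {
        // Pack up to 3 bytes into a 24-bit group, then emit 4 chars.
        uint32_t n = static_cast<uint32_t>(data[i]) << 16;
        if (i + 1 < len) n |= static_cast<uint32_t>(data[i + 1]) << 8;
        if (i + 2 < len) n |= data[i + 2];
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += (i + 1 < len) ? tbl[(n >> 6) & 63] : '=';
        out += (i + 2 < len) ? tbl[n & 63] : '=';
    }
    return out;
}

int main() {
    // A toy embedding; real vectors have hundreds of dimensions.
    const std::vector<float> emb = {0.125f, -0.5f, 0.75f, 1.0f};

    // Encode the raw float32 bytes (host byte order; little-endian on
    // typical x86/ARM machines), as OpenAI's base64 format does.
    const std::string b64 = base64_encode(
        reinterpret_cast<const uint8_t*>(emb.data()),
        emb.size() * sizeof(float));
    std::cout << "base64 embedding: " << b64 << '\n';
}
```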