Weekly GitHub Report for Llama.cpp - 2024-12-23 12:00:25
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4372
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend: This issue reports a bug in the Qwen2-VL model when using the Vulkan backend, where the model generates descriptions unrelated to the provided image, despite working correctly on the CPU backend. The user notes that while the Vulkan backend is not officially supported for Qwen2-VL, it should only result in slower performance rather than incorrect outputs.
- The comments discuss various troubleshooting steps, including testing with an F16 vision projector and using a surgery script, but these attempts do not resolve the issue. Suggestions are made to enable GGML_VULKAN_CHECK_RESULTS to identify the broken operation, and there are discussions about linker errors and potential fixes, such as adding CPU backend source files to ggml-vulkan or building with -DBUILD_SHARED_LIBS=OFF. The issue persists even with no layers offloaded, and the problem is confirmed to occur on the Vulkan backend but not on the CPU backend.
- Number of comments this week: None
- Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used: This issue describes a performance degradation problem in the llama-server when using multiple slots for processing, where the generation speed decreases progressively after each generation when more than two slots are utilized. The user has attempted various configurations and settings to mitigate the issue, but none have been effective, and they suspect that a specific commit may have revealed an existing bug rather than causing it.
- The comments discuss whether the total throughput increases with more slots, with users sharing test results showing that parallel runs in a multi-slot server are not faster than sequential runs in a single-slot server (a simple way to reproduce this comparison is sketched after this list). A contributor explains that the issue is due to the unified KV cache, which will be addressed in future updates, and suggests not using more than four slots as a temporary measure. Another user suggests clearing the KV cache after slot completion, but it is noted that determining when a slot has finished is challenging.
- Number of comments this week: None
- Feature Request: Support for Qwen2-VL: This issue is a feature request for adding support for the Qwen2-VL model in the llama.cpp project, highlighting its state-of-the-art performance in visual understanding and video comprehension. The request emphasizes the model's capabilities in handling images and videos, suggesting its potential for enhancing the project's functionality.
- The comments section shows a strong interest in the feature, with many users expressing support and anticipation for the implementation. Some users discuss technical aspects, such as compatibility issues and potential solutions, while others share their progress and challenges in integrating the model. There are also discussions about the model's performance, with users sharing their experiences and troubleshooting tips. Overall, the comments reflect a collaborative effort to implement and optimize the feature, with users actively engaging in problem-solving and sharing resources.
- Number of comments this week: None
- Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM: This issue is a feature request for the GitHub project llama.cpp, seeking support for the C4AI Command R7B model by Cohere, which is a 7-billion-parameter language model with enhanced capabilities in math, code, reasoning tasks, and multilingual support. The requester believes that integrating this model, which uses an optimized transformer architecture with sliding window attention and global attention layers, would be a valuable addition to the project.
- The comments discuss the implementation and testing of the C4AI Command R7B model, with users sharing patches and testing results. Some users report issues with context length and output quality, while others provide code snippets and links to converted model weights. There is a discussion about the differences between Command-R/R+ and Command R7B models, particularly regarding their approach to grounded RAG capabilities.
- Number of comments this week: None
- Bug: llama-server web UI resets the text selection during inference on every token update: This issue describes a bug in the llama-server web UI where the text selection resets during inference with every token update, making it difficult to select or copy text until the generation process is complete. The problem seems to stem from the script replacing all DOM nodes of the current generation with each new token output, preventing users from copying text as it is being generated.
- The comments discuss the ongoing issue and potential solutions, including a dedicated copy button in the new UI, but the problem persists due to the re-rendering by markdown-it. Suggestions include exploring alternative markdown renderers like remark or micromark, which handle virtual DOM updates differently, though integrating them with the current setup may be complex. The conversation remains open for further exploration and potential fixes.
- Number of comments this week: None
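To make the multi-slot throughput comparison above concrete, here is a minimal client-side sketch of the kind of test discussed in that issue's comments. It assumes a locally running llama-server on port 8080 that exposes the /completion endpoint; the prompt, token count, and number of requests are placeholders, and timing is measured entirely on the client.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Assumptions: llama-server is listening on localhost:8080 and exposes
# the /completion endpoint; adjust URL, prompt, and n_predict as needed.
URL = "http://localhost:8080/completion"
PROMPT = "Write a short story about a robot."
N_PREDICT = 128
N_REQUESTS = 4  # e.g. match the number of server slots

def generate() -> float:
    """Send one completion request and return its wall-clock duration."""
    start = time.time()
    resp = requests.post(URL, json={"prompt": PROMPT, "n_predict": N_PREDICT})
    resp.raise_for_status()
    return time.time() - start

# Sequential baseline: one request at a time.
seq_start = time.time()
for _ in range(N_REQUESTS):
    generate()
seq_total = time.time() - seq_start

# Parallel run: all requests in flight at once, one per slot.
par_start = time.time()
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    list(pool.map(lambda _: generate(), range(N_REQUESTS)))
par_total = time.time() - par_start

print(f"sequential: {seq_total:.1f}s, parallel: {par_total:.1f}s")
```

If the parallel total is not meaningfully lower than the sequential one, that matches the behavior reported in the issue.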
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 23
Summarized Issues:
- Backend Compatibility Issues: Several issues highlight compatibility problems with different backends when running models. The Qwen2-VL model generates incorrect outputs on the Vulkan backend, while the Metal backend on MacBook Pro M1 Max fails due to unsupported operations. Additionally, a segmentation fault occurs with the SYCL/HIP backend on AMD GPUs, indicating backend-specific challenges.
- Compilation and Build Failures: Various issues report failures during the compilation and build processes across different systems. A Linux system using CUDA 12.7 fails to compile due to CUDA compiler errors, while a macOS Vulkan build encounters argument type mismatches. Additionally, a Linux ARM64 system faces compilation errors related to ARM features.
- Feature Requests for Model and Performance Enhancements: Several feature requests aim to enhance model support and performance. Requests include implementing a Q6_0 quantization type, adding support for the SmolVLM model, and enabling SIMD activations for the s390x platform. These enhancements are expected to improve model efficiency and compatibility.
- Server and Execution Issues: Problems with server execution and model operation are reported in multiple issues. The llama-server experiences performance degradation with multiple slots, and a server error occurs due to unsupported parameters. Additionally, executables exit immediately on Windows without output, indicating execution challenges.
- Model Output and Embedding Issues: Issues with model outputs and embeddings are highlighted, affecting the consistency and accuracy of results. The gte-Qwen2 model produces non-homogeneous embedding vectors (a quick consistency check is sketched at the end of this list), while the llama-server inserts extra newlines during streaming inference. These issues impact the reliability of model outputs.
- Infrastructure and Configuration Improvements: Feature requests focus on improving infrastructure and configuration for better usability. Requests include implementing a multi-prompt caching system, adding configuration presets, and supporting multiple accelerators for faster generation. These improvements aim to enhance user experience and system efficiency.
- Platform and Library Support Issues: Issues related to platform support and library integration are reported. The function get_executable_path() lacks support for FreeBSD and other platforms, while a missing dynamic library causes runtime failures on iOS. Additionally, a request to replace the markdown rendering library aims to improve UI functionality.
- Binary and API Endpoint Issues: Problems with binaries and API endpoints are reported, affecting usability and functionality. The llama-simple.exe binary appears corrupted, while the Open WebUI API endpoint returns empty model identifiers. These issues necessitate verification and fixes to ensure proper operation.
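For the empty model identifier report above, a quick way to see what a llama-server instance actually advertises is to query its OpenAI-compatible model listing. This is only a sketch: the host and port are placeholders, and it assumes the server exposes the /v1/models route.

```python
import requests

# Assumption: llama-server is running locally and exposes the
# OpenAI-compatible /v1/models route.
resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()

for model in resp.json().get("data", []):
    # An empty or missing "id" here is the symptom described in the issue.
    print(repr(model.get("id")))
```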
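The embedding consistency check referenced earlier in this list could look roughly like the following. It assumes llama-server was started with embeddings enabled and an embedding-capable model loaded, and that it serves the OpenAI-compatible /v1/embeddings route; the model name and input text are placeholders.

```python
import math
import requests

# Assumptions: llama-server started with embeddings enabled; the URL,
# model name, and input text below are placeholders.
URL = "http://localhost:8080/v1/embeddings"

def embed(text: str) -> list[float]:
    resp = requests.post(URL, json={"input": text, "model": "gte-Qwen2"})
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Embedding the same text twice should give near-identical vectors;
# large deviations point at the kind of inconsistency reported above.
a = embed("The quick brown fox jumps over the lazy dog.")
b = embed("The quick brown fox jumps over the lazy dog.")
print(f"cosine(a, b): {cosine(a, b):.4f}")
```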
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 21
Summarized Issues:
- Model Support and Tokenization Issues: The llama.cpp project has faced multiple requests and challenges related to model support and tokenization. One issue involves adding support for the Mistral-Nemo-Instruct-2407 128K model, which requires a custom tokenizer and addresses tensor shape mismatches. Another request is for the Aya Expanse models, which need the CohereTokenizer for multilingual capabilities. Additionally, a feature request aims to enhance token counting before API calls to improve user experience (a minimal client-side approach is sketched at the end of this list).
- SYCL and Docker Compatibility Problems: Users have reported issues with running SYCL-based applications on Intel GPUs, particularly with Docker. One user experienced a complete failure on an Intel Arc A380 GPU, potentially due to compatibility issues with the CPU and Intel's compiler. Another issue involved an invalid kernel name error linked to a version mismatch with Intel's oneAPI, requiring Dockerfile updates.
- Conversion and Loading Bugs: Several issues have arisen from model conversion and loading processes in llama.cpp. The CodeLlama-7B-instruction model produced erroneous outputs after conversion, suggesting compatibility issues. Another problem involved a FileNotFoundError during model conversion due to missing tokenizer files. Additionally, a tokenization regression caused incorrect token sequences, requiring script adjustments.
- Server and API Bugs: The llama.cpp server has encountered several bugs affecting API responses and server functionality. The /v1/chat/completions API had an issue with the model parameter defaulting incorrectly, causing inconsistencies. Another bug involved the n_probs parameter not functioning correctly in certain Docker images. Additionally, a commit introduced an error with the Authorization header, causing request failures. A basic request that sets the model field and the Authorization header explicitly is sketched at the end of this list.
- Performance and Compatibility Concerns: Performance regressions and compatibility issues have been reported in the llama.cpp project. The removal of a tensor type led to slower execution on ARM64 architectures. Compatibility problems with Safari were noted, causing JavaScript errors in the web UI. Additionally, a "bad interpreter" error occurred during compilation due to a misconfigured Python environment.
- Feature Requests for New Models: There have been requests to support new models in the llama.cpp project. The Phi-4 14B model requires modifications to the conversion script and configuration for full attention over a 16K context length. Additionally, there are concerns about the context limit of the EXAONE-3.5-2.4B-Instruct model compared to other models.
- Miscellaneous Bugs and Fixes: Various other bugs and fixes have been reported in the llama.cpp project. An ODR violation caused model loading failures on Windows, requiring struct definition unification. A missing CMakeLists.txt file led to a Docker build failure, resolved by cloning the repository. Additionally, a NumPy version conflict caused an AttributeError in the gguf library (a quick environment check is sketched at the end of this list).
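For the token-counting feature request mentioned at the top of this list, a client can already count tokens before submitting a full request by calling llama-server's /tokenize endpoint. The sketch below assumes a local server on port 8080 and uses a placeholder prompt.

```python
import requests

# Assumption: llama-server is running locally; /tokenize returns the
# token ids for the given content using the loaded model's tokenizer.
prompt = "Explain the difference between stack and heap memory."

resp = requests.post("http://localhost:8080/tokenize", json={"content": prompt})
resp.raise_for_status()
n_tokens = len(resp.json()["tokens"])

print(f"prompt uses {n_tokens} tokens")
# A client could compare n_tokens against the model's context size
# before sending the actual completion request.
```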
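The server and API bullet above mentions the model parameter and the Authorization header; a request that sets both explicitly might look like the sketch below. The URL, API key, and model name are placeholders, and the Authorization header only matters when the server was started with an API key configured.

```python
import requests

# Placeholders: adjust host, API key, and model name for your setup.
URL = "http://localhost:8080/v1/chat/completions"
HEADERS = {"Authorization": "Bearer my-secret-key"}  # only if an API key is set

payload = {
    # Passing "model" explicitly avoids relying on the server-side default
    # that the issue above reported as inconsistent.
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
}

resp = requests.post(URL, headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```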
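For the NumPy/gguf version conflict noted above, a small pre-flight check of the installed packages can make mismatches visible before running a conversion; this is only a sketch, and which version combinations are actually compatible is not stated in the issue summary.

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the packages involved in the reported
# AttributeError so mismatches are visible before running a conversion.
for pkg in ("numpy", "gguf"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```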