Weekly GitHub Report for Llama.cpp - 2025-01-06 12:00:31
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4418
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Misc. bug: Inconsistent Vulkan segfault: This issue involves an inconsistent segmentation fault occurring in a Vulkan environment on Linux, specifically when using the ggml-vulkan backend with Nvidia drivers. The problem is reproducible by running a simple program multiple times, which occasionally results in a segmentation fault, potentially due to improper cleanup of Vulkan resources before the process terminates.
- The comments discuss potential causes and solutions, including updating drivers, ensuring proper destruction of Vulkan resources, and considering the order of static destructors and library unloads. There is a consensus that the issue might be Nvidia-specific, with suggestions to test on different hardware and configurations. Some users propose workarounds, such as preferring CUDA over Vulkan or dynamically loading backends, while others attempt to reproduce and debug the issue, noting that it might require a fix from Nvidia's side.
- Number of comments this week: None
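For readers following along, here is a minimal sketch of the explicit-teardown idea discussed in the comments: free the backend before main() returns instead of relying on static destructors that may run after the driver or backend library has been unloaded. It assumes the ggml-vulkan entry points (`ggml_backend_vk_init`, `ggml_backend_free`); exact headers and signatures may differ between llama.cpp versions.

```cpp
// Minimal sketch: free Vulkan-backed resources explicitly before main()
// returns, instead of relying on static/global destructors that may run
// after the driver or backend library has already been torn down.
// Assumes the ggml-vulkan API (ggml_backend_vk_init, ggml_backend_free);
// names and headers may differ between llama.cpp versions.
#include "ggml-backend.h"
#include "ggml-vulkan.h"

int main() {
    ggml_backend_t backend = ggml_backend_vk_init(0);  // device 0
    if (backend == nullptr) {
        return 1;
    }

    // ... build a graph, allocate buffers, run compute here ...

    // Explicit teardown while the Vulkan driver/library is still loaded.
    ggml_backend_free(backend);
    return 0;
}
```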
- Feature Request: Reversed-LORA, Removing Layers/Knowledge During Runtime or Live-Abliteration: This issue is a feature request for implementing a mechanism in the llama.cpp project that allows for the dynamic removal of layers or information from a language model during runtime, potentially referred to as "Reversed-LORA." The motivation behind this request is to enable more effective customization and fine-tuning of the model, allowing users to tailor the model to their specific needs by removing irrelevant layers, which could lead to improved performance and focus on relevant features.
- The comments discuss the feasibility and existing implementations related to the proposed feature, with references to similar projects like Sumandora's work and cvector. There is a technical discussion on the concept of subtracting LoRA from model weights to create a new model, and the conversation includes clarifications and suggestions on how this could be implemented. Participants express interest in exploring these ideas further, acknowledging the complexity and potential of the feature.
- Number of comments this week: None
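As background for the discussion above, the arithmetic behind subtracting a merged LoRA from base weights is simply W' = W - scale * (B A). The sketch below is purely illustrative, uses plain arrays rather than any llama.cpp or GGML API, and every name in it is hypothetical.

```cpp
// Illustrative only: the arithmetic behind "subtracting" a LoRA from a base
// weight matrix W (d_out x d_in), given low-rank factors B (d_out x r) and
// A (r x d_in) and a scaling factor alpha/r. A merged LoRA adds scale*B*A;
// the "reverse" discussed here would subtract it instead. This is not
// llama.cpp API code.
#include <cstddef>
#include <vector>

void subtract_lora(std::vector<float>& W,              // row-major d_out x d_in
                   const std::vector<float>& B,        // row-major d_out x r
                   const std::vector<float>& A,        // row-major r x d_in
                   std::size_t d_out, std::size_t d_in, std::size_t r,
                   float scale /* = alpha / r */) {
    for (std::size_t i = 0; i < d_out; ++i) {
        for (std::size_t j = 0; j < d_in; ++j) {
            float delta = 0.0f;
            for (std::size_t k = 0; k < r; ++k) {
                delta += B[i * r + k] * A[k * d_in + j];
            }
            W[i * d_in + j] -= scale * delta;          // W' = W - scale * (B A)
        }
    }
}
```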
- Compile bug: Error Domain=MTLLibraryErrorDomain Code=3: This issue involves a user attempting to compile the llama.cpp project as a static library for use in custom C++ projects on a Mac, but encountering an error related to the Metal backend during the compilation process. The error message indicates a problem with loading certain instructions, specifically mentioning an unknown type name 'block_q4_0', which prevents the successful initialization of the Metal backend.
- The comments discuss troubleshooting steps, including verifying whether the llama.cpp examples work and ensuring no cached files interfere by deleting the build directory. The user confirms the issue persists even after these steps and shares a file for further inspection. It is suggested that the problem might be related to the `sed` command not working correctly, possibly due to compatibility issues between the GNU and Apple/FreeBSD versions. The user tests `sed` with a script and finds it functioning correctly, but later discovers that checking out a specific tag (`b4409`) resolves the issue, allowing successful compilation and linking against the static libraries.
- Number of comments this week: None
- Feature Request: Support GPTQ (Quotes: GPTQModel 4bit can match BF16): This issue is a feature request for the llama.cpp project to support GPTQ quantized models, which are advantageous because they can be fine-tuned with a dataset, potentially matching the quality of BF16 models. The requester highlights the benefits of supporting GPTQ models, referencing the GPTQModel repository, and suggests that this enhancement could improve the project's capabilities.
- The comments discuss previous pull requests related to GPTQ converters that were merged but later removed from the master branch, replaced by imatrix techniques. There is a conversation about the differences between GPTQ and imatrix quantization methods, with explanations on how imatrix is used in the project. It is clarified that existing GPTQ models from platforms like Hugging Face cannot be directly used due to format differences, and users are advised to look for imatrixed GGUFs instead.
- Number of comments this week: None
- Misc. bug: SYCL out of memory error: This issue involves a memory allocation error encountered when using the SYCL backend in a GitHub project, where the user is unable to allocate 568 MB of memory on a device with 16 GB of shared GPU memory, despite the same setup working without errors when using the VULKAN backend. The problem is not limited to a specific version or interface, as it also occurs with earlier versions and when using Python bindings, indicating a potential inefficiency in the SYCL backend's memory management.
- The comments discuss potential solutions and insights into the memory error, including trying a reduced context length and using the `-nkvo` option, which works but is significantly slower than VULKAN. A suggestion is made to use GPU-Z to monitor memory usage, and it is hypothesized that the issue might be due to the KV cache size and model size exceeding the memory reserved for the iGPU, despite the log indicating sufficient available memory. The user suspects a memory inefficiency in the SYCL backend compared to VULKAN.
- Number of comments this week: None
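To make the KV-cache hypothesis above concrete, here is a rough, self-contained estimator of KV-cache size; the hyperparameters in the example are illustrative and are not taken from the issue report.

```cpp
// Rough, self-contained estimate of KV-cache memory: 2 tensors (K and V)
// per layer, each n_ctx * n_head_kv * head_dim elements. The numbers used
// in main() are illustrative only, not taken from the issue report.
#include <cstddef>
#include <cstdio>

static std::size_t kv_cache_bytes(std::size_t n_layer, std::size_t n_ctx,
                                  std::size_t n_head_kv, std::size_t head_dim,
                                  std::size_t bytes_per_elem /* 2 for f16 */) {
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem;
}

int main() {
    // Example: a 7B-class model without GQA (32 layers, 32 KV heads,
    // head_dim 128) at a 4096-token f16 context needs roughly 2 GiB of
    // KV cache, on top of the model weights themselves.
    std::size_t bytes = kv_cache_bytes(32, 4096, 32, 128, 2);
    std::printf("KV cache: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```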
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 24
Summarized Issues:
- Compile-Time Warnings and Errors: Several issues in the llama.cpp project involve compile-time warnings and errors across different systems. On BSD systems, clang-18 generates warnings due to integer comparison and macro redefinition, while FreeBSD systems face unreachable code warnings. Additionally, OpenBSD 7.6 encounters compilation failures related to the `fileno` macro, and Mac systems using the Metal backend experience errors due to unknown type names.
- Feature Requests for Model Enhancements: The llama.cpp project has received multiple feature requests aimed at enhancing model capabilities. These include support for GPTQ quantized models, a "Reversed-LORA" capability, and a mapping of model names to LoRA configurations. Additionally, requests have been made for a "Top-nσ sampler" (see the sketch after this list) and a chat template for the Llama-Android project.
- Bugs in Token Generation and Output: Several issues report bugs related to token generation and output in the llama.cpp project. These include incorrect BOS and EOS tokens during inference, discrepancies in tokenization results between `llama-tokenize` and `AutoTokenizer`, and the appearance of the "<|end_of_text|>" token in outputs. Additionally, the llama-android module experiences crashes when unloading models.
- Memory and Resource Management Issues: The llama.cpp project faces several issues related to memory and resource management. Users report out-of-memory errors with the SYCL backend despite sufficient GPU memory, and a fatal error in the `ggml_sycl_cpy` function due to unsupported type combinations. Additionally, the llama-cli module on RISC-V architecture produces corrupted output when RVV is enabled.
- Installation and Packaging Problems: The llama.cpp project encounters several installation and packaging issues. The `gguf` module's `scripts` folder is incorrectly installed, and the "llama-b4409-bin-ubuntu-x64.zip" package lacks a necessary shared library. Additionally, a Snap package version is requested to facilitate testing on older systems.
- Test Suite and CI Release Issues: The llama.cpp project faces issues with its test suite and CI releases. The test suite reports a SEGFAULT failure on FreeBSD despite successful independent test runs, and CI releases encounter false positive malware detections, potentially due to dynamic JSON object generation.
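Related to the "Top-nσ sampler" request in the feature-request group above, here is a minimal, self-contained sketch of the idea: keep only tokens whose logits fall within n standard deviations of the maximum logit and mask the rest before softmax. It illustrates the concept only and is not llama.cpp sampler code.

```cpp
// Minimal sketch of top-n-sigma filtering: keep tokens whose logit is within
// n standard deviations of the maximum logit, mask out the rest. This
// illustrates the requested sampler's core idea only; it is not the
// llama.cpp sampler API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

void top_n_sigma_filter(std::vector<float>& logits, float n) {
    const std::size_t k = logits.size();
    if (k == 0) return;

    float max_logit = -std::numeric_limits<float>::infinity();
    double sum = 0.0, sum_sq = 0.0;
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
        sum    += l;
        sum_sq += static_cast<double>(l) * l;
    }
    const double mean  = sum / k;
    const double sigma = std::sqrt(std::max(0.0, sum_sq / k - mean * mean));

    // Tokens below the threshold are masked out (set to -inf) so that a
    // subsequent softmax assigns them zero probability.
    const float threshold = max_logit - n * static_cast<float>(sigma);
    for (float& l : logits) {
        if (l < threshold) l = -std::numeric_limits<float>::infinity();
    }
}
```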
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 26
Summarized Issues:
- GPU Utilization and Performance Issues: Users have reported various challenges related to GPU utilization and performance in the llama.cpp project. One user faced difficulties in utilizing the GPU on an Android device with a Qualcomm Adreno due to the removal of OpenCL support, exploring alternatives like Vulkan but encountering compilation challenges. Another user experienced crashes and failures to load GGUF models on an NVIDIA RTX 4070 Ti Super GPU after software updates, despite previous functionality, leading to speculation about potential bugs or driver issues. Additionally, a performance discrepancy was noted where using "GPU + CUDA + VRAM + shared memory (UMA)" resulted in higher CPU load and worse performance compared to "CPU + RAM," raising questions about shared memory bank conflicts.
- Feature Requests for Model Support and Enhancements: Several feature requests have been made to enhance the llama.cpp project with support for new models and functionalities. Requests include support for the BitNet.cpp quantization format, the Tencent-Hunyuan-Large model, and the C4AI Command R7B model by Cohere, each offering unique capabilities and performance improvements. Additionally, there is a request to implement SIMD activations for the s390x platform to improve performance for LLM inference on IBM mainframes.
- Compilation and Integration Challenges: Users have encountered various compilation and integration issues within the llama.cpp project. These include failures in a Swift Package Manager project due to interoperability problems between Objective-C and C++ code, and Vulkan shader compilation failures on Debian Stable due to unsupported extensions. Additionally, there are issues with the `llama_cli` tool where interactive mode flags are ignored, resulting in errors.
- Performance and Optimization Concerns: Performance issues have been reported in the llama.cpp project, affecting various functionalities. A significant slowdown was noted in the `llama_decode` function when using certain batch methods, and a severe performance drop-off was observed in the Vulkan backend with an AMD Radeon RX 7900 XTX at specific batch sizes. Additionally, a memory allocation error was encountered when running a large model on a Jetson AGX Orin, suggesting the need for model quantization.
- Bug Reports and Troubleshooting: Various bugs have been reported in the llama.cpp project, requiring troubleshooting and potential fixes. These include incorrect parsing of `rope-scale` parameters, a server response bug labeling the model incorrectly, and a problem with the `llama-server` module's model field location. Users have also reported issues with unsupported chat templates leading to suboptimal model responses.
- Research and Development Inquiries: Users have expressed interest in research and development aspects of the llama.cpp project. Inquiries include understanding the design and efficiency of "llama-bench," benchmarking processes, and a request for a tutorial to help users analyze and understand the codebase. These inquiries highlight the need for better documentation and resources to support user engagement with the project.
- Batch Processing and Token Management: Feature requests and issues related to batch processing and token management have been raised in the llama.cpp project. These include a proposal to enhance shared token handling for improved computational efficiency and a request for "hot swap" functionality for LoRA adapters to allow dynamic switching without disrupting ongoing processes. Additionally, there is a request to apply LoRA adapters on a per-request basis in a server environment.
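On the LoRA "hot swap" and per-request adapter items above: llama.cpp already exposes runtime adapter attach/detach calls, so a per-request swap could plausibly be layered on top of them. The sketch below assumes the older adapter entry points (`llama_lora_adapter_init`, `llama_lora_adapter_set`, `llama_lora_adapter_remove`); these identifiers have been renamed across releases, so treat them as assumptions rather than the current API.

```cpp
// Sketch of swapping a LoRA adapter between requests without reloading the
// model. Function names follow the llama.cpp adapter API as it existed in
// earlier releases (llama_lora_adapter_*); they have been renamed across
// versions, so treat the exact identifiers as assumptions.
#include "llama.h"

void serve_with_adapter(llama_context * ctx, llama_model * model,
                        const char * lora_path, float scale) {
    // Load the adapter once; it is tied to the model, not the context.
    llama_lora_adapter * adapter = llama_lora_adapter_init(model, lora_path);
    if (adapter == nullptr) {
        return;
    }

    // Attach the adapter for this request's context.
    llama_lora_adapter_set(ctx, adapter, scale);

    // ... run decoding for the request here ...

    // Detach so the next request can use a different adapter (or none).
    llama_lora_adapter_remove(ctx, adapter);
}
```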