Weekly GitHub Report for Llama.cpp - 2025-01-06 12:00:31
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4418
1.2 Other Noteworthy Updates:
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week.
- Misc. bug: Inconsistent Vulkan segfault: This issue involves an inconsistent segmentation fault occurring in a Vulkan environment on Linux, specifically when using the ggml-vulkan backend with Nvidia drivers. The problem is reproducible by running a simple program multiple times, which occasionally results in a segmentation fault, potentially due to improper cleanup of Vulkan resources before the process terminates.
- The comments discuss potential causes and solutions, including updating drivers, ensuring proper destruction of Vulkan resources, and considering the order of static destructors and library unloads. There is a consensus that the issue might be Nvidia-specific, with suggestions to test on different hardware and configurations. Some users propose workarounds, such as preferring CUDA over Vulkan or dynamically loading backends, while others attempt to reproduce and debug the issue, noting that it might require a fix from Nvidia's side.
- Number of comments this week: None
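For readers following along, here is a minimal sketch of the explicit-teardown idea discussed in the comments: free the backend before main() returns instead of relying on static destructors that may run after the driver or backend library has been unloaded. It assumes the ggml-vulkan entry points (`ggml_backend_vk_init`, `ggml_backend_free`); exact headers and signatures may differ between llama.cpp versions.

```cpp
// Minimal sketch: free Vulkan-backed resources explicitly before main()
// returns, instead of relying on static/global destructors that may run
// after the driver or backend library has already been torn down.
// Assumes the ggml-vulkan API (ggml_backend_vk_init, ggml_backend_free);
// names and headers may differ between llama.cpp versions.
#include "ggml-backend.h"
#include "ggml-vulkan.h"

int main() {
    ggml_backend_t backend = ggml_backend_vk_init(0);  // device 0
    if (backend == nullptr) {
        return 1;
    }

    // ... build a graph, allocate buffers, run compute here ...

    // Explicit teardown while the Vulkan driver/library is still loaded.
    ggml_backend_free(backend);
    return 0;
}
```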
- Feature Request: Reversed-LORA, Removing Layers/Knowledge During Runtime or Live-Abliteration: This issue is a feature request for implementing a mechanism in the llama.cpp project that allows for the dynamic removal of layers or information from a language model during runtime, potentially referred to as "Reversed-LORA." The motivation behind this request is to enable more effective customization and fine-tuning of the model, allowing users to tailor the model to their specific needs by removing irrelevant layers, which could lead to improved performance and focus on relevant features.
- The comments discuss the feasibility and existing implementations related to the proposed feature, with references to similar projects like Sumandora's work and cvector. There is a technical discussion on the concept of subtracting LoRA from model weights to create a new model, and the conversation includes clarifications and suggestions on how this could be implemented. Participants express interest in exploring these ideas further, acknowledging the complexity and potential of the feature.
- Number of comments this week: None
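As background for the discussion above, the arithmetic behind subtracting a merged LoRA from base weights is simply W' = W - scale * (B A). The sketch below is purely illustrative, uses plain arrays rather than any llama.cpp or GGML API, and every name in it is hypothetical.

```cpp
// Illustrative only: the arithmetic behind "subtracting" a LoRA from a base
// weight matrix W (d_out x d_in), given low-rank factors B (d_out x r) and
// A (r x d_in) and a scaling factor alpha/r. A merged LoRA adds scale*B*A;
// the "reverse" discussed here would subtract it instead. This is not
// llama.cpp API code.
#include <cstddef>
#include <vector>

void subtract_lora(std::vector<float>& W,              // row-major d_out x d_in
                   const std::vector<float>& B,        // row-major d_out x r
                   const std::vector<float>& A,        // row-major r x d_in
                   std::size_t d_out, std::size_t d_in, std::size_t r,
                   float scale /* = alpha / r */) {
    for (std::size_t i = 0; i < d_out; ++i) {
        for (std::size_t j = 0; j < d_in; ++j) {
            float delta = 0.0f;
            for (std::size_t k = 0; k < r; ++k) {
                delta += B[i * r + k] * A[k * d_in + j];
            }
            W[i * d_in + j] -= scale * delta;          // W' = W - scale * (B A)
        }
    }
}
```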
- Compile bug: Error Domain=MTLLibraryErrorDomain Code=3: This issue involves a user attempting to compile the llama.cpp project as a static library for use in custom C++ projects on a Mac, but encountering an error related to the Metal backend during the compilation process. The error message indicates a problem with loading certain instructions, specifically mentioning an unknown type name 'block_q4_0', which prevents the successful initialization of the Metal backend.
- The comments discuss troubleshooting steps, including verifying whether the llama.cpp examples work and ensuring no cached files interfere by deleting the build directory. The user confirms the issue persists even after these steps and shares a file for further inspection. It is suggested that the problem might be related to the `sed` command not working correctly, possibly due to compatibility issues between the GNU and Apple/FreeBSD versions. The user tests `sed` with a script and finds it functioning correctly, but later discovers that checking out a specific tag (`b4409`) resolves the issue, allowing successful compilation and linking against the static libraries.
- Number of comments this week: None
- Feature Request: Support GPTQ (Quotes: GPTQModel 4bit can match BF16): This issue is a feature request for the llama.cpp project to support GPTQ quantized models, which are advantageous because they can be fine-tuned with a dataset, potentially matching the quality of BF16 models. The requester highlights the benefits of supporting GPTQ models, referencing the GPTQModel repository, and suggests that this enhancement could improve the project's capabilities.
- The comments discuss previous pull requests related to GPTQ converters that were merged but later removed from the master branch, replaced by imatrix techniques. There is a conversation about the differences between GPTQ and imatrix quantization methods, with explanations on how imatrix is used in the project. It is clarified that existing GPTQ models from platforms like Hugging Face cannot be directly used due to format differences, and users are advised to look for imatrixed GGUFs instead.
- Number of comments this week: None
- Misc. bug: SYCL out of memory error: This issue involves a memory allocation error encountered when using the SYCL backend in a GitHub project, where the user is unable to allocate 568 MB of memory on a device with 16 GB of shared GPU memory, despite the same setup working without errors when using the VULKAN backend. The problem is not limited to a specific version or interface, as it also occurs with earlier versions and when using Python bindings, indicating a potential inefficiency in the SYCL backend's memory management.
- The comments discuss potential solutions and insights into the memory error, including trying a reduced context length and using the `-nkvo` option, which works but is significantly slower than VULKAN. A suggestion is made to use GPU-Z to monitor memory usage, and it is hypothesized that the issue might be due to the KV cache size and model size exceeding the memory reserved for the iGPU, despite the log indicating sufficient available memory. The user suspects a memory inefficiency in the SYCL backend compared to VULKAN.
- Number of comments this week: None
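To make the KV-cache hypothesis above concrete, here is a rough, self-contained estimator of KV-cache size; the hyperparameters in the example are illustrative and are not taken from the issue report.

```cpp
// Rough, self-contained estimate of KV-cache memory: 2 tensors (K and V)
// per layer, each n_ctx * n_head_kv * head_dim elements. The numbers used
// in main() are illustrative only, not taken from the issue report.
#include <cstddef>
#include <cstdio>

static std::size_t kv_cache_bytes(std::size_t n_layer, std::size_t n_ctx,
                                  std::size_t n_head_kv, std::size_t head_dim,
                                  std::size_t bytes_per_elem /* 2 for f16 */) {
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem;
}

int main() {
    // Example: a 7B-class model without GQA (32 layers, 32 KV heads,
    // head_dim 128) at a 4096-token f16 context needs roughly 2 GiB of
    // KV cache, on top of the model weights themselves.
    std::size_t bytes = kv_cache_bytes(32, 4096, 32, 128, 2);
    std::printf("KV cache: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```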
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 24
Summarized Issues:
- Compile-Time Warnings and Errors: Several issues in the llama.cpp project involve compile-time warnings and errors across different systems. On BSD systems, clang-18 generates warnings due to integer comparison and macro redefinition, while FreeBSD systems face unreachable code warnings. Additionally, OpenBSD 7.6 encounters compilation failures related to the `fileno` macro, and Mac systems using the Metal backend experience errors due to unknown type names.
- Feature Requests for Model Enhancements: The llama.cpp project has received multiple feature requests aimed at enhancing model capabilities. These include support for GPTQ quantized models, a "Reversed-LORA" capability, and a mapping of model names to LoRA configurations. Additionally, requests have been made for a "Top-nσ sampler" (see the sketch after this list) and a chat template for the Llama-Android project.
- Bugs in Token Generation and Output: Several issues report bugs related to token generation and output in the llama.cpp project. These include incorrect BOS and EOS tokens during inference, discrepancies in tokenization results between `llama-tokenize` and `AutoTokenizer`, and the appearance of the "<|end_of_text|>" token in outputs. Additionally, the llama-android module experiences crashes when unloading models.
- Memory and Resource Management Issues: The llama.cpp project faces several issues related to memory and resource management. Users report out-of-memory errors with the SYCL backend despite sufficient GPU memory, and a fatal error in the `ggml_sycl_cpy` function due to unsupported type combinations. Additionally, the llama-cli module on RISC-V architecture produces corrupted output when RVV is enabled.
- Installation and Packaging Problems: The llama.cpp project encounters several installation and packaging issues. The `gguf` module's `scripts` folder is incorrectly installed, and the "llama-b4409-bin-ubuntu-x64.zip" package lacks a necessary shared library. Additionally, a Snap package version is requested to facilitate testing on older systems.
- Test Suite and CI Release Issues: The llama.cpp project faces issues with its test suite and CI releases. The test suite reports a SEGFAULT failure on FreeBSD despite successful independent test runs, and CI releases encounter false positive malware detections, potentially due to dynamic JSON object generation.
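Related to the "Top-nσ sampler" request in the feature-request group above, here is a minimal, self-contained sketch of the idea: keep only tokens whose logits fall within n standard deviations of the maximum logit and mask the rest before softmax. It illustrates the concept only and is not llama.cpp sampler code.

```cpp
// Minimal sketch of top-n-sigma filtering: keep tokens whose logit is within
// n standard deviations of the maximum logit, mask out the rest. This
// illustrates the requested sampler's core idea only; it is not the
// llama.cpp sampler API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

void top_n_sigma_filter(std::vector<float>& logits, float n) {
    const std::size_t k = logits.size();
    if (k == 0) return;

    float max_logit = -std::numeric_limits<float>::infinity();
    double sum = 0.0, sum_sq = 0.0;
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
        sum    += l;
        sum_sq += static_cast<double>(l) * l;
    }
    const double mean  = sum / k;
    const double sigma = std::sqrt(std::max(0.0, sum_sq / k - mean * mean));

    // Tokens below the threshold are masked out (set to -inf) so that a
    // subsequent softmax assigns them zero probability.
    const float threshold = max_logit - n * static_cast<float>(sigma);
    for (float& l : logits) {
        if (l < threshold) l = -std::numeric_limits<float>::infinity();
    }
}
```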
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 26
Summarized Issues:
- GPU Utilization and Performance Issues: Users have reported various challenges related to GPU utilization and performance in the llama.cpp project. One user faced difficulties in utilizing the GPU on an Android device with a Qualcomm Adreno due to the removal of OpenCL support, exploring alternatives like Vulkan but encountering compilation challenges. Another user experienced crashes and failures to load GGUF models on an NVIDIA RTX 4070 Ti Super GPU after software updates, despite previous functionality, leading to speculation about potential bugs or driver issues. Additionally, a performance discrepancy was noted where using "GPU + CUDA + VRAM + shared memory (UMA)" resulted in higher CPU load and worse performance compared to "CPU + RAM," raising questions about shared memory bank conflicts.
- Feature Requests for Model Support and Enhancements: Several feature requests have been made to enhance the llama.cpp project with support for new models and functionalities. Requests include support for the BitNet.cpp quantization format, the Tencent-Hunyuan-Large model, and the C4AI Command R7B model by Cohere, each offering unique capabilities and performance improvements. Additionally, there is a request to implement SIMD activations for the s390x platform to improve performance for LLM inference on IBM mainframes.
- Compilation and Integration Challenges: Users have encountered various compilation and integration issues within the llama.cpp project. These include failures in a Swift Package Manager project due to interoperability problems between Objective-C and C++ code, and Vulkan shader compilation failures on Debian Stable due to unsupported extensions. Additionally, there are issues with the `llama_cli` tool where interactive mode flags are ignored, resulting in errors.
- Performance and Optimization Concerns: Performance issues have been reported in the llama.cpp project, affecting various functionalities. A significant slowdown was noted in the `llama_decode` function when using certain batch methods, and a severe performance drop-off was observed in the Vulkan backend with an AMD Radeon RX 7900 XTX at specific batch sizes. Additionally, a memory allocation error was encountered when running a large model on a Jetson AGX Orin, suggesting the need for model quantization.
- Bug Reports and Troubleshooting: Various bugs have been reported in the llama.cpp project, requiring troubleshooting and potential fixes. These include incorrect parsing of `rope-scale` parameters, a server response bug labeling the model incorrectly, and a problem with the `llama-server` module's model field location. Users have also reported issues with unsupported chat templates leading to suboptimal model responses.
- Research and Development Inquiries: Users have expressed interest in research and development aspects of the llama.cpp project. Inquiries include understanding the design and efficiency of "llama-bench," benchmarking processes, and a request for a tutorial to help users analyze and understand the codebase. These inquiries highlight the need for better documentation and resources to support user engagement with the project.
- Batch Processing and Token Management: Feature requests and issues related to batch processing and token management have been raised in the llama.cpp project. These include a proposal to enhance shared token handling for improved computational efficiency and a request for "hot swap" functionality for LoRA adapters to allow dynamic switching without disrupting ongoing processes. Additionally, there is a request to apply LoRA adapters on a per-request basis in a server environment.
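On the LoRA "hot swap" and per-request adapter items above: llama.cpp already exposes runtime adapter attach/detach calls, so a per-request swap could plausibly be layered on top of them. The sketch below assumes the older adapter entry points (`llama_lora_adapter_init`, `llama_lora_adapter_set`, `llama_lora_adapter_remove`); these identifiers have been renamed across releases, so treat them as assumptions rather than the current API.

```cpp
// Sketch of swapping a LoRA adapter between requests without reloading the
// model. Function names follow the llama.cpp adapter API as it existed in
// earlier releases (llama_lora_adapter_*); they have been renamed across
// versions, so treat the exact identifiers as assumptions.
#include "llama.h"

void serve_with_adapter(llama_context * ctx, llama_model * model,
                        const char * lora_path, float scale) {
    // Load the adapter once; it is tied to the model, not the context.
    llama_lora_adapter * adapter = llama_lora_adapter_init(model, lora_path);
    if (adapter == nullptr) {
        return;
    }

    // Attach the adapter for this request's context.
    llama_lora_adapter_set(ctx, adapter, scale);

    // ... run decoding for the request here ...

    // Detach so the next request can use a different adapter (or none).
    llama_lora_adapter_remove(ctx, adapter);
}
```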