Weekly GitHub Report for Llama.cpp: January 20, 2025 - January 27, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4552
1.2 Version Information:
The version released on January 25, 2025 does not include detailed release notes in the available data, so specific updates, highlights, or trends cannot be summarized for this release.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Compile bug: Vulkan can not work on Android (cross-compilation from linux) - Aborted without explaination: This issue involves a compile bug where Vulkan cannot function on Android when cross-compiled from Linux, resulting in the process aborting without explanation. The user has followed all instructions and attempted various solutions, but the problem persists across different operating systems and NDK versions, specifically on a Redmi Note 13 Pro 5G with a Qualcomm CPU and Adreno GPU.
- The comments discuss potential issues with the Vulkan backend on Qualcomm GPUs, suggesting enabling Vulkan Validation Layers and disabling certain shaders. It is noted that Qualcomm GPUs have known issues with Vulkan, and OpenCL is recommended instead. There is ongoing work to optimize Vulkan for embedded GPUs, and a pull request is suggested as a potential fix. The user also reports issues with the OpenCL backend, and a separate issue is recommended for tracking that.
- Number of comments this week: 18
- Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf: This issue is about a user encountering unexpected <think> tags in the output of the DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf model when using llama-cli, which they believe should not be visible in the final response. The user is seeking clarification on whether this is a bug or a feature and is discussing potential solutions for handling these tags in multi-turn conversations to improve model performance and user experience.
- The comments discuss whether the <think> tags are a feature or a bug, with some suggesting they are a feature of the R1 series. There is a debate on whether these tags should be visible to users, with suggestions to separate the "thinking" content from the actual response. Some users propose implementing a feature to filter out these tags (a filtering sketch follows this list), while others discuss the impact on performance and context management. There is also mention of potential solutions like using a proxy or the upcoming support for Jinja templates to handle the tags.
- Number of comments this week: 16
- Eval bug: ggml_gallocr_reserve_n tries to allocate beyond max buffer size: This issue involves a bug in the ggml_gallocr_reserve_n function, which attempts to allocate memory beyond the maximum buffer size allowed by the device, resulting in an out-of-memory error when using the CodeQwen 1.5 7B model with large context sizes on a Vulkan backend. The problem arises because the calculated buffer size exceeds the device's memory allocation limit, and the user suggests that this might be a bug in the llama.cpp implementation since other models do not exhibit the same issue under similar conditions.
- The comments discuss potential workarounds, such as reducing the ubatch size to decrease the compute buffer size (see the sketch after this list), and clarify misunderstandings about the ubatch parameter's role in processing tokens. The conversation also touches on the trade-offs between performance and memory usage, with explanations provided about how the attention mechanism and feed-forward network operate in this context. The user expresses gratitude for the quick responses and seeks further clarification on the technical details.
- Number of comments this week: 9
- Feature Request: MiniMax-Text-01 model: This issue is a feature request to add support for the MiniMax-Text-01 model in the llama.cpp project, highlighting its potential performance benefits and large token context. The requester suggests that the model, which is a Mixture of Experts (MoE) model, could be a valuable addition because its performance is comparable to DeepSeek V3.
- The comments discuss interest in the model and share a partially working implementation with noted issues, such as lack of support for multiple token sequences and potential redesign needs for the KV cache. Users report testing results, including performance metrics and issues with word omissions during text generation, and collaborate on further testing with different setups and configurations.
- Number of comments this week: 8
- Eval bug: segfault on Alpine linux docker image: This issue reports a segmentation fault occurring when running the llama.cpp model compilation in a Docker container on Alpine Linux, affecting both x86 machines and Raspberry Pi devices. The problem seems to be related to shader compilation when using Vulkan on specific hardware configurations, including AMD CPUs and Intel or AMD GPUs.
- Multiple users report similar segmentation faults on Alpine Linux, with discussions focusing on potential causes such as static vs. shared library linking, driver updates, and Vulkan SDK installation. Some users attempt recompilation with different build flags and debug options, but the issue persists, indicating a possible compatibility problem with Alpine's musl library or Vulkan setup.
- Number of comments this week: 7
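For the DeepSeek-R1 <think> tag discussion above, here is a minimal sketch of the client-side filtering some commenters proposed, assuming the reasoning block is delimited by literal <think>...</think> tags; the function name is illustrative and not part of llama.cpp.

```python
import re

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks (and any unterminated opening tag)."""
    # Drop complete reasoning blocks first, then anything after a stray opening tag.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

print(strip_think_tags("<think>Let me reason step by step...</think>The answer is 42."))
# -> The answer is 42.
```

In a multi-turn setting the same filter could be applied before appending the assistant turn back into the conversation history, which is roughly the context-management trade-off debated in the comments.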
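For the ggml_gallocr_reserve_n issue above, the suggested workaround is to shrink the physical micro-batch so the compute buffer fits under the device's allocation limit. Below is a rough sketch using the separate llama-cpp-python bindings (an assumption for illustration only; the issue itself used the native Vulkan backend, where the equivalent knob is the -ub/--ubatch-size flag, and the model path is a placeholder).

```python
from llama_cpp import Llama

# A smaller n_ubatch means smaller intermediate tensors in the compute graph,
# trading some prompt-processing speed for a compute buffer that fits in VRAM.
llm = Llama(
    model_path="codeqwen-1_5-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=32768,   # the large context size that triggered the oversized buffer
    n_batch=2048,  # logical batch size (tokens submitted at once)
    n_ubatch=256,  # physical micro-batch, reduced from the default of 512
)
```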
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 27
Summarized Issues:
- ggml File Conflict: The llama.cpp and whisper.cpp projects both attempt to install different versions of ggml files to the same location, causing conflicts. This suggests that ggml should potentially be separated into a standalone project to avoid such issues.
- Segmentation Faults and Compilation Errors: Segmentation faults occur when running the llama.cpp model compilation within a Docker container on Alpine Linux, particularly with Vulkan for GPU acceleration. Additionally, compilation errors arise when building the project from source using CUDA in a Docker environment due to missing rules for required targets.
- GPU Support and Performance Issues: The removal of GPU support from the clip.cpp file in the llama.cpp project has significantly impacted performance, urging restoration to maintain functionality. Furthermore, there are issues with slow inference and low CPU/GPU usage when running models over a network, despite attempts to offload layers to the GPU.
- Model Conversion and Compatibility Problems: Users face errors when converting models from Hugging Face format to GGUF, such as missing tensors or unsupported models. These issues highlight the need for script modifications and support for specific models to ensure successful conversions.
- Vulkan Backend Bugs: Bugs in the Vulkan backend cause crashes during inference with multiple contexts and prevent layer offloading to the GPU. These issues require mutex-controlled access and indicate a need for better GPU utilization.
- NUMA and Memory Management: Implementing NUMA-aware expert allocation in the llama.cpp project aims to optimize Mixture-of-Experts models by reducing cross-NUMA communication costs. Additionally, memory management issues arise with out-of-memory errors due to buffer size discrepancies.
- Server and Template Bugs: The llama-server module experiences crashes and token generation issues due to changes in the httplib library and Jinja template exceptions. These problems highlight the need for robust error handling and template processing methods.
- Feature Requests for Model Enhancements: Requests include adding a reasoning_effort parameter to control CoT output length, supporting the kosmos-2.5 model for visual text conversion, and integrating SwiftKV for performance boosts. These enhancements aim to improve model functionality and efficiency.
- Compilation and Execution Errors: Compilation errors occur due to undeclared functions and missing shared libraries, affecting the build process and execution of binaries. These issues necessitate alternative function usage and library inclusion for successful builds.
- Model Loading and Performance Issues: Slow model loading on Mac systems with Metal GGML backends affects large models, and performance issues arise with the DeepSeek-R1-Zero-GGUF model due to expert number constraints. These problems require optimizations and parameter adjustments.
- Connection and Request Handling Errors: The llama-server module encounters "connection reset by peer" errors with concurrent requests, leading to incomplete processing. This issue highlights the need for improved request handling and error management.
- Customization and Branding Requests: Users request the ability to customize the name "llama.cpp" on the web UI for better understanding and branding, while also suggesting options to disable such customizations to maintain intended use.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 37
Summarized Issues:
- Log Probability Issues: Users experienced difficulties obtaining log probabilities when using the create_chat_completion function in the llama-cpp library, despite setting the logprobs=True attribute. This issue was resolved by a subsequent update that correctly implemented the log probabilities feature (see the sketch after this list).
- Performance Degradation on AMD GPUs: The llama.cpp server experienced significant performance degradation over time on machines with AMD Radeon RX 7900 XT GPUs, requiring frequent reboots to restore normal operation. The server's response times slowed down by more than four times after handling hundreds of requests, impacting usability without causing crashes or data loss.
- Model Loading and Compilation Issues: Users reported significant slowdowns in model loading times when using memory-mapped files on Apple Silicon M3 Max hardware and encountered various compilation errors on different systems. These issues suggest potential misconfigurations in the CMake setup and conflicts with system headers.
- Output Quality and Cache Type Issues: There was a significant degradation in output quality when using compressed cache types (q8/q4) for cache-type-k and cache-type-v in the llama-cli module. This was evidenced by incorrect calculations of the number of days between two dates, prompting discussions on potential solutions such as finetuning and better evaluation methods.
- Inference and Embedding Process Errors: Users encountered errors such as "could not attach to process" and crashes during the embedding process on specific hardware configurations. These issues were related to unsupported instructions and errors in server UI apps, requiring workarounds and updates for resolution.
- Feature Requests for Enhanced Performance and Support: There were requests for adding "Vulkan enabled" prebuilt Ubuntu binaries and enhancing performance on ARM CPUs, specifically the Kunpeng 920. Users highlighted the absence of such binaries and the slow processing speed of the current implementation.
- Compilation and Execution Errors on Various Platforms: Users faced compile errors on Ubuntu with the Ascend310b1 chip and encountered issues with the llama-cli command on Huawei Ascend 910b devices. These problems were related to unsupported configurations and errors in the CANN library, requiring updates and configuration changes.
- Unexpected Tokens and Missing Libraries: A bug in the llama-cli tool caused an unexpected <|end_of_text|> token to appear at the end of the output, which was resolved with the latest version of llama.cpp. Additionally, release binaries were missing necessary shared libraries, causing execution errors and requiring users to adjust their build processes.
- Vulkan Backend and Kompute Issues: Users requested the implementation of the CPY operation for quantized types in Vulkan and reported problems with the Kompute backend, where models failed to load or performed poorly. These issues were related to memory allocation and performance constraints, prompting discussions on potential solutions.
- Model Loading Failures and Compile Bugs: The llama-server versions b4468 and b4474 were unable to load the Phi-3.5 MoE model due to an "unknown architecture" error. Additionally, compile bugs in the llama-mmap.cpp file caused build failures on macOS systems, requiring updates to resolve these issues.
- Reproducibility and Library Conflicts: There were reproducibility problems with the libggml-vulkan.so library in the llamacpp openSUSE package, and conflicts between ggml files installed by llama.cpp and whisper.cpp. These issues suggested the need for modifications to ensure deterministic builds and separation of libggml into a standalone project.
- Compile Bugs and Missing Libraries on MacOS: Users faced compile bugs where the GGML_NATIVE option caused reproducibility issues, and the llama-cli binary failed to run due to a missing @rpath/libllama.dylib library. These issues required adjustments to the CMake configuration and build settings.
- Model Support and Pre-tokenizer Errors: There were feature requests for supporting the "DeepSeek-R1-Distill-Qwen" model and errors related to unknown pre-tokenizer types. Users encountered discrepancies in model loading and sought assistance to resolve these issues.
- Access Violations and Memory Issues: Users encountered an "Access violation executing location" error while using the gguf_init_from_file function, which was resolved by correcting library file conflicts. Additionally, a regression caused the application to run out of memory when using the ROCm/HIP backend, related to a race condition.
- VRAM and GTT Memory Usage Regression: A specific commit led to increased VRAM and GTT memory usage on Linux systems using Vulkan with amdgpu hardware, resulting in slower processing speeds. This was potentially due to the increased number of shader variants and their compilation, prompting discussions on optimizing shader management.
- Compilation and Execution Errors on Various Platforms: Users faced compilation problems when building the Vulkan backend for Android and encountered issues with the llama-server module, where the server failed to stop a text generation task upon receiving a cancel task message. These issues required updates and configuration changes for resolution.
- Autocomplete and Generation Issues: The autocomplete functionality in the llama-server experienced long delays with no output following the initial completion, potentially due to recent changes in the server's cancellation logic. Additionally, a bug in the llama-server on ROCm/Windows caused the model to generate a single letter repeatedly in a loop, requiring a process kill to stop.
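Regarding the log probability item above, here is a hedged sketch of how the feature is typically requested through the llama-cpp-python bindings after the fix; the model path, prompt, and parameter values are placeholders rather than details from the original report.

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", logits_all=True)  # logits_all keeps per-token logits available
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=8,
    logprobs=True,  # the attribute the reporters set; honored after the fix
)
print(out["choices"][0]["logprobs"])  # per-token log probabilities, OpenAI-style
```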
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 14
Key Open Pull Requests
1. cmake: add ggml find package: This pull request introduces a CMake find package for the ggml library, enabling users to link specific backends or all backends collectively through targets like ggml:: and ggml::all, while also requiring explicit backend requests when using the Llama find-package.
- URL: pull/11369
- Merged: No
- Associated Commits: 530fd, 5b4c1, 314f2, b14e8, ea0a8, 09ab0, 817cf, 1760b, 65b0d, 6388d, bf444, 835e0
2. vulkan: implement initial support for IQ2 and IQ3 quantizations: This pull request implements initial support for IQ2 and IQ3 quantizations in the Vulkan backend, aiming to achieve acceptable performance improvements, while also optimizing the Q3_K implementation, renaming the init_iq4nl_shmem function for simplified logic, and testing on a Radeon 780M iGPU with Mesa 24.3.3, although it lacks testing on coopmat2 hardware.
- URL: pull/11360
- Merged: No
3. ci : allow creating artifacts on PRs on demand: This pull request introduces a feature that allows the creation of artifacts for a pull request commit by applying the artifacts label, as detailed in the commits and described in the pull request body.
- URL: pull/11398
- Merged: No
Other Open Pull Requests
- CUDA and Fedora Updates: The pull request updates the cuda-fedora.md guide to include the latest CUDA 12.8 release and Fedora 41. It enhances clarity for compiling with specific compute compatibility targets while maintaining compatibility with Silverblue and Workstation systems.
- CMake Build Process Refinements: This pull request refines the conditions for linking the math library in the CMake build process on Windows. It ensures compatibility with Intel oneAPI, MSVC, and MinGW, addressing build issues on systems with both MSVC and MinGW installed.
- CPU Power and Performance Strategy: A new feature allows users to specify a CPU power and performance strategy, focusing on efficiency by targeting e-cores on hybrid CPUs. This implementation automatically calculates a core mask and applies affinity to specific cores.
- Byteswapping for Model Conversion: The pull request implements byteswapping for q4_k and q6_k in the gguf_convert_endian.py script, enabling conversion of the llama3.2 model to big-endian format (see the sketch after this list).
- Typographical Error Fix: A typographical error is addressed by adding a missing underscore to the layer_norm_epsilon parameter in the convert_hf_to_gguf function. This correction ensures proper functionality as detailed in a specific commit.
- Code Refactoring for Reusability: The llama_decode_impl function is refactored by extracting parts into new functions llama_prepare_sbatch and llama_prepare_ubatch. This change facilitates code reuse for training without altering existing functionality.
- KleidiAI Library Support: Support for the KleidiAI library is introduced in the ggml-cpu backend, enabling optimized matrix multiplication kernels. This feature leverages hardware features like sme, i8mm, and dot product acceleration, activated via the GGML_CPU_KLEIDIAI build option.
- Model Tensor Allocation Override: A new command line parameter --override-tensor (-ot) is introduced, allowing users to specify buffer types for model tensor allocation. This enables efficient offloading schemes by keeping specific tensors on the CPU while offloading others to the GPU.
- Model Name Check in CLI: An issue in the llama-run application is addressed by implementing a check for the required model parameter. This prevents crashes and ensures errors from resource downloads are properly propagated to avoid JSON parsing errors.
- LoRA Benchmarking Feature: A draft feature is introduced to the Llama-bench tool for benchmarking the impact of LoRA on model performance. The author notes uncertainty about its implementation, especially regarding integration with quantized weights in the lcpp framework.
- Vulkan Docker Image Update: An issue with the Vulkan Docker image is addressed by adding the missing Vulkan library (libvulkan-dev) to the base layer. The Ubuntu version is also updated to 24.04 to ensure compatibility and functionality.
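As a companion to the byteswapping item above, here is a small illustration of the general numpy idiom involved; this is not the actual gguf_convert_endian.py change, and the data is made up.

```python
import numpy as np

# Pretend these are fp16 scale fields read from a little-endian GGUF tensor block.
scales = np.frombuffer(bytes(range(16)), dtype=np.float16).copy()

# An in-place byteswap flips each element's byte order for a big-endian target.
# For quantized blocks like q4_k/q6_k only the multi-byte fields need this treatment;
# the packed 4-/6-bit weight bytes are byte-order agnostic.
scales.byteswap(inplace=True)
print(scales.tobytes().hex())
```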
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 53
Key Closed Pull Requests
1. Automate vocab support and model conversion: This pull request aims to automate the process of downloading supported model architectures from HuggingFace, handling necessary conversions, and addressing fragile implementations related to tokenizers, thereby streamlining the process for users by reducing manual implementations and improving the overall user experience.
- URL: pull/7379
- Merged: No
- Associated Commits: dbdf6, ba13d, 742ab, 98cf7, 4790f, 5c814, 1a286, 3ba01, f7515, 30225, b2ca2, 5eda2, 2ef73, 04fb7, 832b4, b6f70, 006bb, 1a825, 4b373, 0479e, bd322, d02a0, ce777, da5de, 316b4, 5840b, dcc5d, c6f2a, 89a46, a0362, 9a283, 381da, 6fc44, a1951, bdd02, 18bb3, d9ba9, 2fa2c, 5978b, 12537, aed05, a35b7, 62962, a3bda, fb32f, 47686, 2fe28, 83b9f, b2aac, 34e14, 0b43e, 12285, 1957c, cd00b, 78d78, 9814b, 9ba6b, 0ccf5, c92c6, 17492, f6208, ea4fc, b4b55, 77bc7, c91dc, e62e0, 6c9ac, 64096, 6da2b, 16829, 99275, 4438d, fda23, 2ffe6, 63c34, 6c1b0, e9759, f30bd, da725, fcd20, e4275, b3a54, 36bea, 7f48e, b1c92, 0732b, 21539, 0a478, 9dbc9, aa28c, f1d06, 5c928, 6a725, de0f0, c2e48, 47ef6, c4470, 647d2, 250bd, ce852, 5836d
2. Add Jinja template support: This pull request introduces Jinja template support to the llama.cpp project by incorporating files from the Google Minja repository, adding new command-line flags for Jinja and chat template files, and implementing dual testing for legacy and Jinja templating routes, with plans for further enhancements in a subsequent pull request.
- URL: pull/11016
- Merged: Yes
- Associated Commits: abd27, e5113, 80138, 06b51, ce485, 389d7, 238b9, cb72c, 78861, 1aac9, 7c84e, 18f25, 8dd4f, c04c5, a6afb, b4083, b7e21, a57bb, 4daae, 1b3bb, 3ed67, b75d0, 40db7, 81c0d, d5fa3, ee1e1, e6352, 33322, 5074e, fc608, 0e74c, e3c47, cc503, 153e8, db9dd, c9e8f, 8c84a, 154bf, 099f9, 54a66, 8348c, ee475, 8a7c8, 8347d, ff2cc, 9d8eb, cbb9b
3. Add example script for rendering jinja2 templates: This pull request introduces an updated and modified example script for rendering Jinja2 templates, which is designed to help users visualize and debug chat templates by extracting and displaying them, thereby aiding in understanding how the model creator intended the templates to be rendered.
- URL: pull/7246
- Merged: No
- Associated Commits: eac2e, f5722, bf515, 4a018, 8b9ed, 668c7, fa0b0, 6be35, 214e9, f8bb2, da96f, cfe65, b4b6f, 2185e, 8b67a, 3c23d, 174bb, 1b186, 4204c, b7528, 0cb40, f455e, 27070, 5481c, 6875a, 0de43, fe883, a083c, 43eef, 964ee
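To make the Jinja-related pull requests above more concrete, here is a small, self-contained illustration of rendering a chat-template-style Jinja template with the Python jinja2 package; it is neither the script from pull/7246 nor the C++ Minja path from pull/11016, and the template string is invented for the example.

```python
from jinja2 import Environment

# A made-up chat template in the style of those embedded in GGUF metadata.
template_src = (
    "{% for m in messages %}<|{{ m.role }}|>{{ m.content }}<|end|>\n{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

env = Environment(trim_blocks=True, lstrip_blocks=True)
print(env.from_string(template_src).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
))
```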
Other Closed Pull Requests
- Server Functionality Enhancements: This topic covers improvements to server functionality, including the ability to cancel prompt processing and non-streamed requests when the connection is closed. It also addresses issues with task management and queued requests, ensuring proper cleanup and correct timeout states.
- Build and Configuration Updates: Several pull requests focus on packaging directories in release packages, updating build configurations for various systems, and addressing build failures by adding necessary includes. These changes aim to improve compatibility and resolve issues across different environments.
- Web UI and User Experience Improvements: Enhancements to the web UI include the addition of collapsible elements to hide certain tags and suggestions for future conversation features. These changes aim to improve user interaction and interface compactness.
- Model and Feature Integrations: New capabilities are introduced, such as video understanding with FFmpeg integration and image understanding with the MiniCPM-omni model. These integrations expand the project's functionality for multimedia processing.
- Documentation and Readme Updates: Updates to documentation include information on batch size, plugin links, and Docker build instructions. These changes aim to enhance clarity and provide additional resources for users.
- Numerical Stability and Performance Fixes: Pull requests address numerical instability in models and improve performance by optimizing operations and fixing bugs. These changes ensure more reliable and efficient processing.
- Vulkan and Shader Enhancements: Improvements in Vulkan components include shader sorting for deterministic binaries and on-demand shader compilation to reduce startup time. These updates enhance the graphics processing capabilities of the project.
- Bug Fixes and Issue Resolutions: Various pull requests focus on fixing bugs, such as incorrect token additions and out-of-bounds writes, and resolving issues like build warnings and test timeouts. These fixes contribute to the overall stability and functionality of the project.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 158 | 37 | 3 | 141 |
| ngxson | 101 | 20 | 2 | 99 |
| ochafik | 128 | 5 | 0 | 20 |
| slaren | 17 | 8 | 0 | 89 |
| jeffbolznv | 15 | 10 | 0 | 50 |
| JohannesGaessler | 20 | 7 | 0 | 35 |
| 0cc4m | 6 | 2 | 1 | 47 |
| netrunnereve | 45 | 2 | 0 | 9 |
| ericcurtin | 11 | 11 | 0 | 25 |
| qnixsynapse | 17 | 4 | 0 | 14 |