Weekly GitHub Report for Llama.cpp: January 20, 2025 - January 27, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4552
1.2 Version Information:
The version released on January 25, 2025 does not include detailed release notes in the available data, so specific updates, highlights, or trends cannot be summarized for this release.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Compile bug: Vulkan can not work on Android (cross-compilation from linux) - Aborted without explaination: This issue involves a compile bug where Vulkan cannot function on Android when cross-compiled from Linux, resulting in the process aborting without explanation. The user has followed all instructions and attempted various solutions, but the problem persists across different operating systems and NDK versions, specifically on a Redmi Note 13 Pro 5G with a Qualcomm CPU and Adreno GPU.
- The comments discuss potential issues with the Vulkan backend on Qualcomm GPUs, suggesting enabling Vulkan Validation Layers and disabling certain shaders. It is noted that Qualcomm GPUs have known issues with Vulkan, and OpenCL is recommended instead. There is ongoing work to optimize Vulkan for embedded GPUs, and a pull request is suggested as a potential fix. The user also reports issues with the OpenCL backend, and a separate issue is recommended for tracking that.
- Number of comments this week: 18
- Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf: This issue is about a user encountering unexpected <think> tags in the output of the DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf model when using llama-cli, which they believe should not be visible in the final response. The user is seeking clarification on whether this is a bug or a feature and is discussing potential solutions for handling these tags in multi-turn conversations to improve model performance and user experience.
- The comments discuss whether the <think> tags are a feature or a bug, with some suggesting they are a feature of the R1 series. There is a debate on whether these tags should be visible to users, with suggestions to separate the "thinking" content from the actual response. Some users propose implementing a feature to filter out these tags (a filtering sketch follows this list), while others discuss the impact on performance and context management. There is also mention of potential solutions like using a proxy or the upcoming support for Jinja templates to handle the tags.
- Number of comments this week: 16
- Eval bug: ggml_gallocr_reserve_n tries to allocate beyond max buffer size: This issue involves a bug in the ggml_gallocr_reserve_n function, which attempts to allocate memory beyond the maximum buffer size allowed by the device, resulting in an out-of-memory error when using the CodeQwen 1.5 7B model with large context sizes on a Vulkan backend. The problem arises because the calculated buffer size exceeds the device's memory allocation limit, and the user suggests that this might be a bug in the llama.cpp implementation since other models do not exhibit the same issue under similar conditions.
- The comments discuss potential workarounds, such as reducing the ubatch size to decrease the compute buffer size (see the sketch after this list), and clarify misunderstandings about the ubatch parameter's role in processing tokens. The conversation also touches on the trade-offs between performance and memory usage, with explanations provided about how the attention mechanism and feed-forward network operate in this context. The user expresses gratitude for the quick responses and seeks further clarification on the technical details.
- Number of comments this week: 9
- Feature Request: MiniMax-Text-01 model: This issue is a feature request to add support for the MiniMax-Text-01 model in the llama.cpp project, highlighting its potential performance benefits and large token context. The requester suggests that the model, which is a Mixture of Experts (MoE) model, could be a valuable addition because its performance is comparable to DeepSeek V3.
- The comments discuss interest in the model and share a partially working implementation with noted issues, such as lack of support for multiple token sequences and potential redesign needs for the KV cache. Users report testing results, including performance metrics and issues with word omissions during text generation, and collaborate on further testing with different setups and configurations.
- Number of comments this week: 8
- Eval bug: segfault on Alpine linux docker image: This issue reports a segmentation fault occurring when running the llama.cpp model compilation in a Docker container on Alpine Linux, affecting both x86 machines and Raspberry Pi devices. The problem seems to be related to shader compilation when using Vulkan on specific hardware configurations, including AMD CPUs and Intel or AMD GPUs.
- Multiple users report similar segmentation faults on Alpine Linux, with discussions focusing on potential causes such as static vs. shared library linking, driver updates, and Vulkan SDK installation. Some users attempt recompilation with different build flags and debug options, but the issue persists, indicating a possible compatibility problem with Alpine's musl library or Vulkan setup.
- Number of comments this week: 7
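For the DeepSeek-R1 <think> tag discussion above, here is a minimal sketch of the client-side filtering some commenters proposed, assuming the reasoning block is delimited by literal <think>...</think> tags; the function name is illustrative and not part of llama.cpp.

```python
import re

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks (and any unterminated opening tag)."""
    # Drop complete reasoning blocks first, then anything after a stray opening tag.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

print(strip_think_tags("<think>Let me reason step by step...</think>The answer is 42."))
# -> The answer is 42.
```

In a multi-turn setting the same filter could be applied before appending the assistant turn back into the conversation history, which is roughly the context-management trade-off debated in the comments.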
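For the ggml_gallocr_reserve_n issue above, the suggested workaround is to shrink the physical micro-batch so the compute buffer fits under the device's allocation limit. Below is a rough sketch using the separate llama-cpp-python bindings (an assumption for illustration only; the issue itself used the native Vulkan backend, where the equivalent knob is the -ub/--ubatch-size flag, and the model path is a placeholder).

```python
from llama_cpp import Llama

# A smaller n_ubatch means smaller intermediate tensors in the compute graph,
# trading some prompt-processing speed for a compute buffer that fits in VRAM.
llm = Llama(
    model_path="codeqwen-1_5-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=32768,   # the large context size that triggered the oversized buffer
    n_batch=2048,  # logical batch size (tokens submitted at once)
    n_ubatch=256,  # physical micro-batch, reduced from the default of 512
)
```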
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 27
Summarized Issues:
- ggml File Conflict: The llama.cpp and whisper.cpp projects both attempt to install different versions of ggml files to the same location, causing conflicts. This suggests that ggml should potentially be separated into a standalone project to avoid such issues.
- Segmentation Faults and Compilation Errors: Segmentation faults occur when running the llama.cpp model compilation within a Docker container on Alpine Linux, particularly with Vulkan for GPU acceleration. Additionally, compilation errors arise when building the project from source using CUDA in a Docker environment due to missing rules for required targets.
- GPU Support and Performance Issues: The removal of GPU support from the clip.cpp file in the llama.cpp project has significantly impacted performance, urging restoration to maintain functionality. Furthermore, there are issues with slow inference and low CPU/GPU usage when running models over a network, despite attempts to offload layers to the GPU.
- Model Conversion and Compatibility Problems: Users face errors when converting models from Hugging Face format to GGUF, such as missing tensors or unsupported models. These issues highlight the need for script modifications and support for specific models to ensure successful conversions.
- Vulkan Backend Bugs: Bugs in the Vulkan backend cause crashes during inference with multiple contexts and prevent layer offloading to the GPU. These issues require mutex-controlled access and indicate a need for better GPU utilization.
- NUMA and Memory Management: Implementing NUMA-aware expert allocation in the llama.cpp project aims to optimize Mixture-of-Experts models by reducing cross-NUMA communication costs. Additionally, memory management issues arise with out-of-memory errors due to buffer size discrepancies.
- Server and Template Bugs: The llama-server module experiences crashes and token generation issues due to changes in the httplib library and Jinja template exceptions. These problems highlight the need for robust error handling and template processing methods.
- Feature Requests for Model Enhancements: Requests include adding a reasoning_effort parameter to control CoT output length, supporting the kosmos-2.5 model for visual text conversion, and integrating SwiftKV for performance boosts. These enhancements aim to improve model functionality and efficiency.
- Compilation and Execution Errors: Compilation errors occur due to undeclared functions and missing shared libraries, affecting the build process and execution of binaries. These issues necessitate alternative function usage and library inclusion for successful builds.
- Model Loading and Performance Issues: Slow model loading on Mac systems with Metal GGML backends affects large models, and performance issues arise with the DeepSeek-R1-Zero-GGUF model due to expert number constraints. These problems require optimizations and parameter adjustments.
- Connection and Request Handling Errors: The llama-server module encounters "connection reset by peer" errors with concurrent requests, leading to incomplete processing. This issue highlights the need for improved request handling and error management.
- Customization and Branding Requests: Users request the ability to customize the name "llama.cpp" on the web UI for better understanding and branding, while also suggesting options to disable such customizations to maintain intended use.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 37
Summarized Issues:
- Log Probability Issues: Users experienced difficulties obtaining log probabilities when using the create_chat_completion function in the llama-cpp library, despite setting the logprobs=True attribute. This issue was resolved by a subsequent update that correctly implemented the log probabilities feature (see the sketch after this list).
- Performance Degradation on AMD GPUs: The llama.cpp server experienced significant performance degradation over time on machines with AMD Radeon RX 7900 XT GPUs, requiring frequent reboots to restore normal operation. The server's response times slowed down by more than four times after handling hundreds of requests, impacting usability without causing crashes or data loss.
- Model Loading and Compilation Issues: Users reported significant slowdowns in model loading times when using memory-mapped files on Apple Silicon M3 Max hardware and encountered various compilation errors on different systems. These issues suggest potential misconfigurations in the CMake setup and conflicts with system headers.
- Output Quality and Cache Type Issues: There was a significant degradation in output quality when using compressed cache types (q8/q4) for cache-type-k and cache-type-v in the llama-cli module. This was evidenced by incorrect calculations of the number of days between two dates, prompting discussions on potential solutions such as finetuning and better evaluation methods.
- Inference and Embedding Process Errors: Users encountered errors such as "could not attach to process" and crashes during the embedding process on specific hardware configurations. These issues were related to unsupported instructions and errors in server UI apps, requiring workarounds and updates for resolution.
- Feature Requests for Enhanced Performance and Support: There were requests for adding "Vulkan enabled" prebuilt Ubuntu binaries and enhancing performance on ARM CPUs, specifically the Kunpeng 920. Users highlighted the absence of such binaries and the slow processing speed of the current implementation.
- Compilation and Execution Errors on Various Platforms: Users faced compile errors on Ubuntu with the Ascend310b1 chip and encountered issues with the llama-cli command on Huawei Ascend 910b devices. These problems were related to unsupported configurations and errors in the CANN library, requiring updates and configuration changes.
- Unexpected Tokens and Missing Libraries: A bug in the llama-cli tool caused an unexpected <|end_of_text|> token to appear at the end of the output, which was resolved with the latest version of llama.cpp. Additionally, release binaries were missing necessary shared libraries, causing execution errors and requiring users to adjust their build processes.
- Vulkan Backend and Kompute Issues: Users requested the implementation of the CPY operation for quantized types in Vulkan and reported problems with the Kompute backend, where models failed to load or performed poorly. These issues were related to memory allocation and performance constraints, prompting discussions on potential solutions.
- Model Loading Failures and Compile Bugs: The llama-server versions b4468 and b4474 were unable to load the Phi-3.5 MoE model due to an "unknown architecture" error. Additionally, compile bugs in the llama-mmap.cpp file caused build failures on macOS systems, requiring updates to resolve these issues.
- Reproducibility and Library Conflicts: There were reproducibility problems with the libggml-vulkan.so library in the llamacpp openSUSE package, and conflicts between ggml files installed by llama.cpp and whisper.cpp. These issues suggested the need for modifications to ensure deterministic builds and separation of libggml into a standalone project.
- Compile Bugs and Missing Libraries on MacOS: Users faced compile bugs where the GGML_NATIVE option caused reproducibility issues, and the llama-cli binary failed to run due to a missing @rpath/libllama.dylib library. These issues required adjustments to the CMake configuration and build settings.
- Model Support and Pre-tokenizer Errors: There were feature requests for supporting the "DeepSeek-R1-Distill-Qwen" model and errors related to unknown pre-tokenizer types. Users encountered discrepancies in model loading and sought assistance to resolve these issues.
- Access Violations and Memory Issues: Users encountered an "Access violation executing location" error while using the gguf_init_from_file function, which was resolved by correcting library file conflicts. Additionally, a regression caused the application to run out of memory when using the ROCm/HIP backend, related to a race condition.
- VRAM and GTT Memory Usage Regression: A specific commit led to increased VRAM and GTT memory usage on Linux systems using Vulkan with amdgpu hardware, resulting in slower processing speeds. This was potentially due to the increased number of shader variants and their compilation, prompting discussions on optimizing shader management.
- Compilation and Execution Errors on Various Platforms: Users faced compilation problems when building the Vulkan backend for Android and encountered issues with the llama-server module, where the server failed to stop a text generation task upon receiving a cancel task message. These issues required updates and configuration changes for resolution.
- Autocomplete and Generation Issues: The autocomplete functionality in the llama-server experienced long delays with no output following the initial completion, potentially due to recent changes in the server's cancellation logic. Additionally, a bug in the llama-server on ROCm/Windows caused the model to generate a single letter repeatedly in a loop, requiring a process kill to stop.
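Regarding the log probability item above, here is a hedged sketch of how the feature is typically requested through the llama-cpp-python bindings after the fix; the model path, prompt, and parameter values are placeholders rather than details from the original report.

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", logits_all=True)  # logits_all keeps per-token logits available
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=8,
    logprobs=True,  # the attribute the reporters set; honored after the fix
)
print(out["choices"][0]["logprobs"])  # per-token log probabilities, OpenAI-style
```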
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 14
Key Open Pull Requests
1. cmake: add ggml find package: This pull request introduces a CMake find package for the ggml library, enabling users to link specific backends or all backends collectively through targets like ggml:: and ggml::all, while also requiring explicit backend requests when using the Llama find-package.
- URL: pull/11369
- Merged: No
- Associated Commits: 530fd, 5b4c1, 314f2, b14e8, ea0a8, 09ab0, 817cf, 1760b, 65b0d, 6388d, bf444, 835e0
2. vulkan: implement initial support for IQ2 and IQ3 quantizations: This pull request implements initial support for IQ2 and IQ3 quantizations in the Vulkan backend, aiming to achieve acceptable performance improvements, while also optimizing the Q3_K implementation, renaming the init_iq4nl_shmem function for simplified logic, and testing on a Radeon 780M iGPU with Mesa 24.3.3, although it lacks testing on coopmat2 hardware.
- URL: pull/11360
- Merged: No
3. ci : allow creating artifacts on PRs on demand: This pull request introduces a feature that allows the creation of artifacts for a pull request commit by applying the artifacts label, as detailed in the commits and described in the pull request body.
- URL: pull/11398
- Merged: No
Other Open Pull Requests
- CUDA and Fedora Updates: The pull request updates the cuda-fedora.md guide to include the latest CUDA 12.8 release and Fedora 41. It enhances clarity for compiling with specific compute compatibility targets while maintaining compatibility with Silverblue and Workstation systems.
- CMake Build Process Refinements: This pull request refines the conditions for linking the math library in the CMake build process on Windows. It ensures compatibility with Intel oneAPI, MSVC, and MinGW, addressing build issues on systems with both MSVC and MinGW installed.
- CPU Power and Performance Strategy: A new feature allows users to specify a CPU power and performance strategy, focusing on efficiency by targeting e-cores on hybrid CPUs. This implementation automatically calculates a core mask and applies affinity to specific cores.
- Byteswapping for Model Conversion: The pull request implements byteswapping for q4_k and q6_k in the gguf_convert_endian.py script, enabling conversion of the llama3.2 model to big-endian format (see the sketch after this list).
- Typographical Error Fix: A typographical error is addressed by adding a missing underscore to the layer_norm_epsilon parameter in the convert_hf_to_gguf function. This correction ensures proper functionality as detailed in a specific commit.
- Code Refactoring for Reusability: The llama_decode_impl function is refactored by extracting parts into new functions llama_prepare_sbatch and llama_prepare_ubatch. This change facilitates code reuse for training without altering existing functionality.
- KleidiAI Library Support: Support for the KleidiAI library is introduced in the ggml-cpu backend, enabling optimized matrix multiplication kernels. This feature leverages hardware features like sme, i8mm, and dot product acceleration, activated via the GGML_CPU_KLEIDIAI build option.
- Model Tensor Allocation Override: A new command line parameter --override-tensor (-ot) is introduced, allowing users to specify buffer types for model tensor allocation. This enables efficient offloading schemes by keeping specific tensors on the CPU while offloading others to the GPU.
- Model Name Check in CLI: An issue in the llama-run application is addressed by implementing a check for the required model parameter. This prevents crashes and ensures errors from resource downloads are properly propagated to avoid JSON parsing errors.
- LoRA Benchmarking Feature: A draft feature is introduced to the Llama-bench tool for benchmarking the impact of LoRA on model performance. The author notes uncertainty about its implementation, especially regarding integration with quantized weights in the lcpp framework.
- Vulkan Docker Image Update: An issue with the Vulkan Docker image is addressed by adding the missing Vulkan library (libvulkan-dev) to the base layer. The Ubuntu version is also updated to 24.04 to ensure compatibility and functionality.
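As a companion to the byteswapping item above, here is a small illustration of the general numpy idiom involved; this is not the actual gguf_convert_endian.py change, and the data is made up.

```python
import numpy as np

# Pretend these are fp16 scale fields read from a little-endian GGUF tensor block.
scales = np.frombuffer(bytes(range(16)), dtype=np.float16).copy()

# An in-place byteswap flips each element's byte order for a big-endian target.
# For quantized blocks like q4_k/q6_k only the multi-byte fields need this treatment;
# the packed 4-/6-bit weight bytes are byte-order agnostic.
scales.byteswap(inplace=True)
print(scales.tobytes().hex())
```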
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 53
Key Closed Pull Requests
1. Automate vocab support and model conversion: This pull request aims to automate the process of downloading supported model architectures from HuggingFace, handling necessary conversions, and addressing fragile implementations related to tokenizers, thereby streamlining the process for users by reducing manual implementations and improving the overall user experience.
- URL: pull/7379
- Merged: No
- Associated Commits: dbdf6, ba13d, 742ab, 98cf7, 4790f, 5c814, 1a286, 3ba01, f7515, 30225, b2ca2, 5eda2, 2ef73, 04fb7, 832b4, b6f70, 006bb, 1a825, 4b373, 0479e, bd322, d02a0, ce777, da5de, 316b4, 5840b, dcc5d, c6f2a, 89a46, a0362, 9a283, 381da, 6fc44, a1951, bdd02, 18bb3, d9ba9, 2fa2c, 5978b, 12537, aed05, a35b7, 62962, a3bda, fb32f, 47686, 2fe28, 83b9f, b2aac, 34e14, 0b43e, 12285, 1957c, cd00b, 78d78, 9814b, 9ba6b, 0ccf5, c92c6, 17492, f6208, ea4fc, b4b55, 77bc7, c91dc, e62e0, 6c9ac, 64096, 6da2b, 16829, 99275, 4438d, fda23, 2ffe6, 63c34, 6c1b0, e9759, f30bd, da725, fcd20, e4275, b3a54, 36bea, 7f48e, b1c92, 0732b, 21539, 0a478, 9dbc9, aa28c, f1d06, 5c928, 6a725, de0f0, c2e48, 47ef6, c4470, 647d2, 250bd, ce852, 5836d
2. Add Jinja template support: This pull request introduces Jinja template support to the llama.cpp project by incorporating files from the Google Minja repository, adding new command-line flags for Jinja and chat template files, and implementing dual testing for legacy and Jinja templating routes, with plans for further enhancements in a subsequent pull request.
- URL: pull/11016
- Merged: Yes
- Associated Commits: abd27, e5113, 80138, 06b51, ce485, 389d7, 238b9, cb72c, 78861, 1aac9, 7c84e, 18f25, 8dd4f, c04c5, a6afb, b4083, b7e21, a57bb, 4daae, 1b3bb, 3ed67, b75d0, 40db7, 81c0d, d5fa3, ee1e1, e6352, 33322, 5074e, fc608, 0e74c, e3c47, cc503, 153e8, db9dd, c9e8f, 8c84a, 154bf, 099f9, 54a66, 8348c, ee475, 8a7c8, 8347d, ff2cc, 9d8eb, cbb9b
3. Add example script for rendering jinja2 templates: This pull request introduces an updated and modified example script for rendering Jinja2 templates, which is designed to help users visualize and debug chat templates by extracting and displaying them, thereby aiding in understanding how the model creator intended the templates to be rendered.
- URL: pull/7246
- Merged: No
- Associated Commits: eac2e, f5722, bf515, 4a018, 8b9ed, 668c7, fa0b0, 6be35, 214e9, f8bb2, da96f, cfe65, b4b6f, 2185e, 8b67a, 3c23d, 174bb, 1b186, 4204c, b7528, 0cb40, f455e, 27070, 5481c, 6875a, 0de43, fe883, a083c, 43eef, 964ee
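To make the Jinja-related pull requests above more concrete, here is a small, self-contained illustration of rendering a chat-template-style Jinja template with the Python jinja2 package; it is neither the script from pull/7246 nor the C++ Minja path from pull/11016, and the template string is invented for the example.

```python
from jinja2 import Environment

# A made-up chat template in the style of those embedded in GGUF metadata.
template_src = (
    "{% for m in messages %}<|{{ m.role }}|>{{ m.content }}<|end|>\n{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

env = Environment(trim_blocks=True, lstrip_blocks=True)
print(env.from_string(template_src).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
))
```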
Other Closed Pull Requests
- Server Functionality Enhancements: This topic covers improvements to server functionality, including the ability to cancel prompt processing and non-streamed requests when the connection is closed. It also addresses issues with task management and queued requests, ensuring proper cleanup and correct timeout states.
- Build and Configuration Updates: Several pull requests focus on packaging directories in release packages, updating build configurations for various systems, and addressing build failures by adding necessary includes. These changes aim to improve compatibility and resolve issues across different environments.
- Web UI and User Experience Improvements: Enhancements to the web UI include the addition of collapsible elements to hide certain tags and suggestions for future conversation features. These changes aim to improve user interaction and interface compactness.
- Model and Feature Integrations: New capabilities are introduced, such as video understanding with FFmpeg integration and image understanding with the MiniCPM-omni model. These integrations expand the project's functionality for multimedia processing.
- Documentation and Readme Updates: Updates to documentation include information on batch size, plugin links, and Docker build instructions. These changes aim to enhance clarity and provide additional resources for users.
- Numerical Stability and Performance Fixes: Pull requests address numerical instability in models and improve performance by optimizing operations and fixing bugs. These changes ensure more reliable and efficient processing.
- Vulkan and Shader Enhancements: Improvements in Vulkan components include shader sorting for deterministic binaries and on-demand shader compilation to reduce startup time. These updates enhance the graphics processing capabilities of the project.
- Bug Fixes and Issue Resolutions: Various pull requests focus on fixing bugs, such as incorrect token additions and out-of-bounds writes, and resolving issues like build warnings and test timeouts. These fixes contribute to the overall stability and functionality of the project.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 158 | 37 | 3 | 141 |
| ngxson | 101 | 20 | 2 | 99 |
| ochafik | 128 | 5 | 0 | 20 |
| slaren | 17 | 8 | 0 | 89 |
| jeffbolznv | 15 | 10 | 0 | 50 |
| JohannesGaessler | 20 | 7 | 0 | 35 |
| 0cc4m | 6 | 2 | 1 | 47 |
| netrunnereve | 45 | 2 | 0 | 9 |
| ericcurtin | 11 | 11 | 0 | 25 |
| qnixsynapse | 17 | 4 | 0 | 14 |