Weekly GitHub Report for Llama.cpp - 2025-01-20 12:01:08
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4508
1.2 Version Information:
The version released on January 18, 2025, introduces key updates and changes, though specific details are not provided in the given data. Notable highlights or trends cannot be identified without further information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- server: Bring back multimodal support: This issue is about the removal of multimodal support from the llama.cpp project, which was dependent on the refactoring of `llava`, and the need to track progress on reintroducing this feature. The issue is primarily for tracking purposes, and there is currently no plan to address it, but contributors are encouraged to take on the task.
- The comments reflect a strong desire from the community to restore multimodal support, with users expressing frustration over its prolonged absence and discussing potential technical approaches for reimplementation. There is a consensus on the need for contributions to improve the vision code, and some users suggest alternative implementations while others discuss the technical challenges and propose solutions for integrating multimodal capabilities.
- Number of comments this week: 11
- Feature Request: Add support for Kokoro TTS: This issue is a feature request to add support for Kokoro TTS, a text-to-speech model known for its natural tone and accent capabilities, to the llama.cpp project. The requester highlights the model's popularity and performance, particularly its efficiency on CPU/edge devices, and suggests its integration would be beneficial for many users.
- The comment section shows overwhelming support for the feature request, with multiple users expressing agreement by commenting "+1" and one user noting the model's fast performance on Mac.
- Number of comments this week: 11
- Eval bug: Q2_K and Q3_K not working on Vulkan anymore on RX 5700XT: This issue reports a bug where Q2_K and Q3_K models produce gibberish output when running on Vulkan with an AMD Radeon RX 5700 XT GPU, specifically after a certain commit. The problem is reproducible on Windows with the AMD proprietary driver, and reverting the commit fixes the issue.
- The comments discuss various confirmations of the issue on different systems, potential causes related to Vulkan updates, and attempts to resolve the problem. Some users suggest that the issue might be specific to AMD's proprietary drivers on Windows, while others report successful builds using open-source drivers. There are also discussions about potential workarounds and the need for further investigation into the driver compatibility.
- Number of comments this week: 9
- Feature Request: Better chat UX for llama-cli: This issue is a feature request to enhance the chat user experience for `llama-cli` by enabling it to run in chat mode automatically if a built-in chat template is available and by adding commands like `/regen` and `/readfile`. The motivation behind this request is to streamline the chat functionality within `llama-cli`, making it more user-friendly and integrated, similar to the updated UI for `llama-server`, while addressing the complexity of `main.cpp` for chat-based applications.
- The comments discuss the potential repurposing of `llama-cli` for chat, with some suggesting removing extra functionalities and others emphasizing the importance of maintaining backward compatibility. There is debate over whether to integrate `common.cpp` into examples and the implications for user experience, with suggestions to simplify `main.cpp` while considering the historical reasons for its current structure. The conversation also touches on the need for clear communication to users if changes are made, such as migrating to `llama-run`.
- Number of comments this week: 8
- Misc. bug: Kompute models fail and struggles where Vulkan works fine: This issue involves a bug where Kompute models fail to perform as expected on a Raspberry Pi 5 with an AMD Radeon RX 7600 XT GPU, while Vulkan models work fine under similar conditions. The user is experiencing difficulties with memory allocation and performance discrepancies between the two backends, seeking advice on potential solutions and optimizations for Kompute.
- The comments discuss the limitations and current state of the Kompute backend compared to Vulkan, noting that Kompute is less optimized and lacks certain features. Suggestions are made to use Vulkan for discrete GPUs, while discussions also touch on shader optimizations and potential improvements for mobile GPUs. There is an interest in developing alternative shaders for specific hardware, and offers of collaboration and support are extended.
- Number of comments this week: 8
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 20
Summarized Issues:
- Performance and Compatibility Issues: Users are experiencing performance and compatibility problems with various backends and hardware configurations. For instance, the Kompute backend on a Raspberry Pi 5 with an AMD Radeon RX 7600 XT GPU shows poor performance compared to the Vulkan backend, raising questions about memory allocation and PCIe bandwidth constraints. Similarly, increasing the number of threads in the llama.cpp project results in slower execution, indicating inefficiencies in thread management.
- Model Support and Integration Challenges: There are multiple requests and issues related to supporting new models and integrating them into the llama.cpp project. Users are facing difficulties with chaotic outputs and debugging due to lack of documentation, as seen with the iFlytek Spark 13B model. Additionally, there are requests for supporting models like LlamaV-o1 and ModernBert, highlighting the need for enhanced model compatibility.
- Compilation and Backend Bugs: Several issues involve compilation bugs and backend-specific problems. For example, the ggml_vulkan shader compilation fails on a Qualcomm Adreno 750 GPU, and the OpenCL backend encounters errors due to unsupported Double types. These issues suggest a need for better error handling and support for diverse hardware configurations.
- Quantization and Model Loading Errors: Users are encountering errors during model quantization and loading. The `llama-quantize` command fails due to an unrecognized model architecture, and certain models cannot be loaded in newer versions of llama-server, indicating potential regressions or compatibility issues with model formats.
- Feature Requests for Model Enhancements: There are several feature requests aimed at enhancing the capabilities of the llama.cpp project. These include adding support for models like MiniMax-Text-01 and exploring new GitHub Actions runners with Cobalt 100-based processors, which could leverage Arm features for improved performance.
- Code and Build Process Concerns: Issues have been raised about the build process and code design choices. For instance, a compile bug in `llama-mmap.cpp` due to a missing header causes build failures, and there are questions about the rationale behind certain memory size truncations in the code.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 24
Summarized Issues:
- Server Bugs: Several issues highlight bugs in the server components of the llama.cpp project. One issue describes a bug where non-stream completion requests cannot be canceled midway due to the blocking nature of the HTTP library. Another issue involves a bug in the llama-server where enabling the `logprobs` parameter results in garbled text, particularly affecting Kanji characters. Additionally, there is a feature request to implement a mechanism to terminate specific inference tasks that are stuck without shutting down the entire server.
- Model and Backend Compatibility Issues: Various issues report compatibility problems with models and backends. One issue reports a fatal error on the Ascend 310p chip using the CANN backend, while another describes garbled inference results with the Qwen2.5-7b-f16.gg model on Ascend 310P3 devices. Additionally, there is a problem with the OpenCL backend defaulting to Vulkan, which does not function correctly on the user's system.
- Compilation and Syntax Errors: Compilation and syntax errors are reported in several issues. One issue involves a compilation bug where a "No rule to make target" error occurs, while another pertains to a syntax error in the `scripts/build-info.sh` file due to a missing "then" statement. These issues highlight the need for careful code review and testing to ensure compatibility across different systems.
- Feature Requests for Model and Server Enhancements: Multiple feature requests aim to enhance model and server functionalities. Requests include integrating the OLMoE model, adding support for the VideoGameBunny-V1 model in GGUF format, and modifying the llama-server to expose the draft model for faster inferencing. These enhancements are intended to improve model compatibility and server efficiency.
- Logging and Output Issues: Several issues address problems with logging and output in the llama-cli module. One issue reports that user input is not saved to the log file, while another describes incorrect console output due to the use of the `LOG` macro. These issues affect the accuracy and usability of logs, necessitating improvements in logging mechanisms.
- Performance and Functionality Bugs: Performance and functionality bugs are reported in various components. An issue describes a slowdown in speculative decoding performance for quantized models, while another involves a bug in the `llama.android` example where the model generates responses indefinitely. These issues highlight the need for optimization and proper handling of model operations.
- Operator Support and Precision Problems: An issue summarizes the support and limitations of operators in the CANN backend, highlighting a precision problem with matrix transposition tests. This issue details the level of support for each operator, indicating areas where improvements are needed to enhance precision and functionality.
- Vulkan and OpenCL Backend Issues: Problems with the Vulkan and OpenCL backends are reported, including a bug where enabling Coopmat2 Flash Attention results in incoherent output. These issues suggest the need for better handling of input tensors and validation errors to ensure reliable backend performance.
- Miscellaneous Bugs and Requests: Various other issues include a bug in the "phi 4" model where the input is empty upon loading, and a feature request for proxy header support to display the correct external IP address. These issues reflect the diverse challenges faced in maintaining and enhancing the llama.cpp project.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. llama : refactor llama_kv_cache, llama_context and llm_build_context: This pull request aims to refactor the `llama_kv_cache`, `llama_context`, and `llm_build_context` components to create a more generic and flexible implementation that abstracts the underlying KV cache logic, introduces a new `llama_batch_manager` for batch processing, and prepares the codebase for future enhancements by reorganizing the code and updating the API to support different use cases and architectures, without introducing any functional changes at this stage.
- URL: pull/11213
- Merged: No
- Associated Commits: d40bdb0ed013711ed0f3b451c11fe0016858a6f5, 18d4c726b7a4c348f1fb3108cbdb69a8c5d312ed, b206c62a76db6c4c618f70f9c58b26a868c2d36f, fd62d74cb7ca5b0090cc157825e98266638da7e4, 07408f07fa9423a58241fbe4d60e0c6e59e76843, d8946f81781f15360aef184df4bac6cc0165d260, 7f58b0bd7b1560f876b78969a52898891f0fdd22, e05183f4fcadbc526ccbfceb558fc0410e67208f, b0396b5c644e436cde514332ddb513e5e1cdf164, ed98eea250fd95feb77453420e53403af615cbcc, 9027f329803485c96ad0229f4a23174fc1c1ddad, 0bcc2c59e883f8975da376c7ab2084d7abccefdd, 1eb0b12f55fef318344ddbf37d6d5ee6e28dead9, eaf837453c089856b5dd113f445c7df336a63987, 5ca07409c21ba2d0cc7fe42515f3d2b4c677e406, 501c661bfdd9a2a9171d396852f5be2fba81b9a5, a2683953dd6eea57fa242e4ced4f9c5c55fe4429, 60106c62fdff347894700111a011adb07357b788
2. sampling: add Top-nσ sampler: This pull request introduces a new sampling method called Top-nσ, which is designed to maintain a stable sampling space regardless of temperature scaling by distinguishing between a Gaussian-distributed noisy region and an informative region in logits, and implements this method as a stand-alone sampler for the `llama-cli` tool; a brief illustrative sketch of the idea follows this entry.
- URL: pull/11223
- Merged: No
- Associated Commits: ddc3c2208acf0fb5a05f28205f1291486f922822, da038d8715c68fb02f62059dcbf52882b501ad39, bee4c7c9fa0e44a70ef8802e2d0f86a29b8498d9, 8fb681bf9ae94eee631f87abdb4f6175d8951ce9, 54ef105c85b1220908fa40c2b59f768a0392139b, d905a9e9b7339a23f5181dce4f839169e1ecfda2, 66cffa8aff9a433f1bc6cd5faee3440e8826afcc, a590dcb7f6cd8460284723ed30fa721735c7cdf0, 0f7501c913bcf57a25956e91a06591a2ac1158bd, b29deb83cd6cf1a19bce3610107868c43a9a4eda, f08e6f5bdcb1a01f17be1aec32a4f1b95cf28eef, 6664d4709fe66910ac82b07879ef67189fd31bbb, c6123e69b00012d3ed95f8d19ecf10e8115de07f
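Since the summary above only sketches the method, here is a minimal, self-contained illustration assuming the usual Top-nσ formulation (keep only tokens whose logit lies within n standard deviations of the maximum logit); the function name and the masking of rejected tokens are illustrative and not taken from the PR's code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative Top-n-sigma filter: tokens whose logit falls more than
// n standard deviations below the maximum are treated as the Gaussian
// "noise" region and masked out before softmax/sampling.
static void top_n_sigma_filter(std::vector<float> & logits, float n) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    float mean = 0.0f;
    for (float l : logits) mean += l;
    mean /= logits.size();

    float var = 0.0f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / logits.size());

    const float threshold = max_logit - n * sigma;
    for (float & l : logits) {
        if (l < threshold) l = -INFINITY;  // excluded from sampling
    }
}

int main() {
    std::vector<float> logits = {2.0f, 1.5f, 0.1f, -3.0f, -8.0f};
    top_n_sigma_filter(logits, 1.0f);
    for (float l : logits) printf("%.2f\n", l);
}
```

Dividing all logits by a temperature T scales both the distance to the maximum and σ by the same factor, so the set of surviving tokens does not change, which is the temperature-stability property the description refers to.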
3. Allow s390x to load little endian models unmodified: This pull request introduces functionality to allow the s390x architecture to load little-endian models without modification by implementing a byteswap function for ggml data, addressing issues such as disabling mmap on s390x due to the need for byteswapping, and making various code improvements and fixes to support this feature.
- URL: pull/11234
- Merged: No
- Associated Commits: 5000c5757e032317854043dad99564b95ef792ae, e97900f311a3dd3cbce0685bebd66a4be7a677a4, 5d68dce2930975f3457a5e9e9d11aff1852c882e, 4809e7062d0d1579811209897c075adbf0d30382, 6face95bb8ffe16aa38018aa6474eb4d0d0eb668, b66546c77f7bf928be0fed47f652f1489f01d2f9, d9db534ba9f02004c0b5e4c9d6b1d214f80ae215, b92446f448eebc4e51ffd80e698ab82c5dbba0b2, d80e110c6b2c70e638be051b5effd7bc47869e52, 5a42d17d88612722931eb025da21c01ae68c2c2b, ca9e6386832c0c2fdb7e1c48cb882b108486767c, 46b9ec8b0139f7c39137508c64b59362e210aaaa
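The summary above mentions a byteswap pass over ggml data for big-endian hosts. As a rough, generic illustration of what such a pass does (not the PR's actual ggml code), an element-wise byteswap might look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <utility>

// Reverse the byte order of every element in a buffer, as a big-endian host
// would need to do after reading little-endian tensor data. elem_size is the
// width of one element in bytes (e.g. 2 for f16, 4 for f32).
static void byteswap_buffer(void * data, size_t n_elems, size_t elem_size) {
    auto * bytes = static_cast<uint8_t *>(data);
    for (size_t i = 0; i < n_elems; ++i) {
        uint8_t * e = bytes + i * elem_size;
        for (size_t a = 0, b = elem_size - 1; a < b; ++a, --b) {
            std::swap(e[a], e[b]);
        }
    }
}

int main() {
    uint32_t v[2] = {0x11223344u, 0xAABBCCDDu};
    byteswap_buffer(v, 2, sizeof(uint32_t));
    printf("%08x %08x\n", v[0], v[1]); // 44332211 ddccbbaa
}
```

In practice quantized block formats mix fields of different widths, so a real implementation has to swap per field rather than per element, and because the data must be rewritten in memory rather than mapped read-only, the PR also has to deal with disabling mmap on s390x, as noted above.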
Other Open Pull Requests
- MiniCPM-omni Integration: This topic involves integrating MiniCPM-omni's image understanding capabilities into the llama.cpp framework. The pull request enhances functionality for efficient inference on end-side devices like iPads.
- SYCL Backend Memory Optimization: The pull request introduces a host memory pool for the `matrix_info_t` struct in the SYCL backend. This change eliminates the need for `host_task` synchronization during memory freeing, improving performance.
- Linux CUDA Compatibility: This pull request focuses on building Linux CUDA releases compatible with Google Colab and other platforms on version 12.2. It includes tasks like fixing the build on continuous integration and updating a Colab example.
- Vision API Refactor: The pull request is a second attempt to refactor the vision API in the llama project. It addresses issue #8010 and includes testing instructions for processing an image with the llama-vision target.
- Numerical Stability in Models: This pull request addresses numerical instability issues in the Granite and Granitemoe models. It enforces the use of the Q8_0 quantization type for all token embeddings to prevent early stopping.
- Chat Template Support: The pull request introduces support for chat templates in the llama-run CLI. It addresses the issue where executing llama-run on models requiring a chat template would previously fail.
- AMD Architecture Parsing: This pull request enhances AMD architecture parsing by using the value returned by `gcnArchName`. It ensures compatibility with devices such as CDNA3, CDNA, VEGA, and GCN4 (an illustrative sketch of this kind of parsing appears at the end of this list).
- AARCH64 Makefile and CMake Fixes: The pull request addresses issues in the Makefile and CMake logic for AARCH64. It replaces incorrect 'ifndef' statements with 'ifdef' to ensure proper architecture optimization.
- F16 Mask Support in SYCL: This pull request introduces support for an F16 mask in the `ggml_sycl_op_soft_max()` function. It includes code cleanups and requests thorough testing by reviewers.
- llama-simple-chat BOS Token Fix: The pull request addresses an issue in the `llama-simple-chat` feature where a BOS token was incorrectly added. It proposes a sample fix to correct this behavior.
- Vulkan Validation Fixes: This pull request addresses validation failures in Vulkan's coopmat2. It corrects the invalid usage of loading f32 types directly into A/B matrices and ensures compatibility with SPIR-V 1.6 and Vulkan 1.3.
- OuteTTS Model Support: The pull request introduces basic support for OuteTTS v0.3 500m and 1b models. It ensures compatibility with previous versions by dynamically determining token offsets.
- 64-bit System Compatibility: This pull request enhances compatibility and efficiency by aligning data structures for 64-bit systems. It addresses a compilation error specific to Clang 19.
- Optimization of Const References: The pull request proposes a minor optimization by removing const references for simple types and structures smaller than 16 bytes. It also changes `probs_iterator tmp` to a constant iterator.
- Build Failure Fix: This pull request addresses build failures by adding a missing include in `src/llama-mmap.cpp`. It resolves issue #11295.
- gguf_writer Reservation Mechanism: The pull request introduces a reservation mechanism in the `gguf_writer`. It modifies method parameters to use constant pointers, enhancing code clarity.
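To make the AMD architecture parsing item above more concrete: `gcnArchName` typically reports strings such as "gfx1030" or "gfx90a:sramecc+:xnack-", so one common approach is to strip the feature flags after the first colon and match on the gfx prefix. The helper below is a hypothetical sketch of that idea; the prefix-to-family mapping is illustrative and intentionally incomplete, not the project's actual table.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: reduce an arch string such as "gfx90a:sramecc+:xnack-"
// to its "gfxNNN" prefix and map it to a coarse family label. The mapping is
// illustrative only and far from exhaustive.
static std::string amd_arch_family(const std::string & gcn_arch_name) {
    // drop feature flags such as ":sramecc+:xnack-"
    const std::string arch = gcn_arch_name.substr(0, gcn_arch_name.find(':'));

    if (arch.rfind("gfx94",  0) == 0) return "CDNA3";
    if (arch.rfind("gfx90a", 0) == 0) return "CDNA2";
    if (arch.rfind("gfx908", 0) == 0) return "CDNA";
    if (arch.rfind("gfx9",   0) == 0) return "VEGA";
    if (arch.rfind("gfx8",   0) == 0) return "GCN4";
    if (arch.rfind("gfx10",  0) == 0 || arch.rfind("gfx11", 0) == 0) return "RDNA";
    return "unknown";
}

int main() {
    printf("%s\n", amd_arch_family("gfx90a:sramecc+:xnack-").c_str());
    printf("%s\n", amd_arch_family("gfx1100").c_str());
}
```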
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 41
Key Closed Pull Requests
1. vulkan: scale caching for k quants + misc fixes: This pull request optimizes inference in the Vulkan backend by parallelizing the extraction of scales and caching them in shared memory, and bundles miscellaneous fixes and improvements for the various quantization types, except Q4_K and Q5_K due to their complex scale packing.
- URL: pull/11081
- Merged: Yes
- Associated Commits: d122d5c987b8b13483190acf9838535298e585f5, 6b06d1689011196ff3312277530402adefb53fbb, 21c6b805c99d90332250adb969e128171d431525, b0e4ccbeb95f3987052fbc08500459773216a691, 07d0d58bef57366233b52c421632cfb2e54c76d4, d70a731639d9acd0baa588a94bfaa5f928b26b9c, c01ccf8288f1e375ed5be535bab1efc43c213406, bdd98c74e24b38a820aacecd5d0cef1149b66879, 173077180ff5e63ffeda96a7b451303a9af69543, b4ae7005e66cb03d8de601dd271b10a1127970c4, cdf70cf27fb9c8abb0caa2f3104b66616c207f60, 6f5d62b098a45d9c4a0d833d03ec68848b0c06b5, 91f1d9ce991f068a0bc39befecadf0fa52800e24, cc28742ca39b12efea5f9b8d87d44860a3430ccb, fe71a8c4a12540f0ae666e2b6d518f54a057d833, 923e9a8377dfc76a189c0e3f5e06aff4384453e3, c9463641af21791b6b4c9130cedb9e6a9df75c53, 51b5ac507db6c7e4288c57cf2e7955c411ef8490, 973bc4069f0d8da9e11c2612bc4b085e18d975af, 6145fc79e5117959e49a667ea76f72649922e705, 845d572b877a94c91b756e8532787f7b9507458f, d63497b3a34d882b7db31ed246d6d96cfa53b768, 30eacad2905e375540fa800dccb4d62b053d3df5, ed1ad94c8411213d7a22eedfa47fd7e205783e90, 4ae3fc01552cb3a3a2f210fafb7f277c41cb906f
2. contrib : add naming guidelines: This pull request introduces and expands upon naming guidelines for the project, including the addition of a `_t` suffix guideline, moving and rewording coding guidelines, and clarifying the usage of the `_context` suffix, as evidenced by multiple commits addressing these aspects (an illustrative example of the style follows this entry).
- URL: pull/11177
- Merged: Yes
- Associated Commits: 610a03a8c447d1b119ae038f2d89e2dd7c5dfca2, e7bc61bc53af790fe59d7265a560ba60e58b43bd, 7fd17ba7cc15ef9e263e9acd9df6a4e767aad2ee, da47eb0650e27946da02c0858b657348aff9665d, f44939a6eba3ab49ae28beb140b24d7248fcc295, 7637216d3f104ea3900350bb4aa6b674e7eded54, 10ef6c1853f93cde09392f01adf133418a591809, 31a44094ad818f6d1472777bc6bbc02a82eaf295, b6f9640157aa6046e2312f072cf616f7af55cc73, 95d87cbf65d0e94effac01b387f1b5404a58786c, 7e1950d0bcbaf55c784d580c80ba8aa43e4ab6ce, d974cae28612e910632fe23532ff9611811ded76, df65154415db7e25ad07ffd3fa4ad65012e2fdab, 34223a21bc55796e6c71849c9b050324d7c9c06c
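For readers unfamiliar with the conventions being codified, here is a hypothetical snippet in the spirit of what the summary describes (a `_t` suffix on the typedef exposed to callers and a `_context` suffix on a state-carrying struct); the names are invented for illustration, and the authoritative wording lives in the PR itself.

```cpp
#include <cstdint>

// Hypothetical illustration of the naming style: snake_case identifiers,
// a "_context" suffix for a struct that carries state, and a "_t" suffix
// on the typedef used by callers. None of these names come from llama.cpp.
struct example_sampler_context {
    uint64_t rng_state;
    float    temperature;
};

typedef struct example_sampler_context example_sampler_context_t;

static void example_sampler_context_reset(example_sampler_context_t * ctx) {
    ctx->rng_state   = 0;
    ctx->temperature = 1.0f;
}

int main() {
    example_sampler_context_t ctx{};
    example_sampler_context_reset(&ctx);
    return 0;
}
```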
3. fix: ggml: fix vulkan-shaders-gen build: This pull request addresses the issue of incorrect building of the `vulkan-shaders-gen` target during cross-compilation by ensuring it is built for the host, improving the toolchain setup, fixing compile errors, and refining path handling and compiler detection.
- URL: pull/10448
- Merged: Yes
- Associated Commits: 17b80f080d2904431978805f8af24b35b15eece4, 4a17b483c918ca5a56b19af246dbee3a623320c3, 1921b9d39c9e7ae269130ce9d2cade0b74c4b276, b6ebd4fc8c1fb9036609795d74df147e703d1ee0, ce14d9b7cb80ea2de76f783ccdf2f33d8f7eeec1, 481d57f7c76b127bda945edcfb231447cb3fc0f7, 46b4c8da440865668dcf507f8d66f80381e43ae5, efe4b14e602527d049dac06dd857dd3cdbae3719, 37d0cb6e848ac4c96066f41c3e8ac2e629bd8ba9, 6fdbf07181087ef33ea0dce496718bb0bca85b69, f4d1fbc79f0a02e0c91bb120858f9d6d73d692eb
Other Closed Pull Requests
- Refactoring and Code Optimization: Several pull requests focus on refactoring and optimizing code within the llama.cpp project. These include refactoring functions for readability, optimizing dequantization functions, and addressing performance regressions in the Vulkan backend. The changes aim to improve maintainability and performance without altering existing logic.
- Build and Continuous Integration Improvements: Enhancements to the build process and continuous integration are addressed in multiple pull requests. These include adding sanitizer flags, fixing build failures, and updating CI configurations to ensure smooth and error-free builds.
- CUDA and Vulkan Enhancements: Several pull requests introduce improvements to CUDA and Vulkan support, including adding backward pass operations, optimizing dequant functions, and addressing issues with non-contiguous inputs. These changes enhance performance and ensure consistency across implementations.
- Feature Additions and Support: New features and support for various models and functionalities are introduced in several pull requests. These include support for tag-based repository access, RTL text handling, and new models from Hugging Face, enhancing the project's capabilities and user experience.
- Bug Fixes and Issue Resolutions: Multiple pull requests address and resolve various bugs and issues within the project. These include fixing memory leaks, addressing CI issues, and resolving typographical errors, ensuring the project's stability and reliability.
- Testing and Validation: Enhancements to testing and validation processes are covered in several pull requests. These include adding tests for new features, refining test structures, and ensuring consistency between CPU and CUDA implementations, which help maintain code quality and reliability.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 135 | 32 | 4 | 97 |
| ngxson | 80 | 19 | 3 | 59 |
| slaren | 10 | 5 | 0 | 80 |
| jeffbolznv | 15 | 9 | 0 | 42 |
| ochafik | 59 | 3 | 0 | 1 |
| netrunnereve | 46 | 2 | 0 | 9 |
| JohannesGaessler | 16 | 4 | 0 | 34 |
| 0cc4m | 6 | 3 | 1 | 44 |
| ericcurtin | 5 | 5 | 0 | 21 |
| VJHack | 22 | 3 | 0 | 3 |