Weekly GitHub Report for Llama.cpp: January 04, 2025 - January 11, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4458
1.2 Version Information:
The version released on January 10, 2025 shipped without accompanying release notes, so no specific changes, notable highlights, or trends can be summarized for this version.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Misc. bug: Inconsistent Vulkan segfault: This issue reports an inconsistent segmentation fault occurring in a Vulkan-related process when running a specific program multiple times on Linux, particularly when using Nvidia drivers. The problem seems to be related to the improper cleanup of Vulkan resources, such as VkDevice and VkInstance, before the process terminates, leading to crashes in Nvidia-specific driver threads.
- The comments discuss potential causes and solutions, including updating drivers, adding functions to properly destroy Vulkan resources, and considering the order of static destructors and library unloads. Various users share their experiences and attempts to reproduce the issue, with some suggesting workarounds such as preferring CUDA over Vulkan or dynamically loading backends. Despite efforts to clean up resources, the problem persists, particularly with Nvidia drivers, and further investigation is needed. (A minimal teardown sketch appears after this list.)
- Number of comments this week: 14
- Vulkan related question: what's the different between server and cli?: This issue involves a user experiencing core dumps when running llama-server and stable-diffusion.cpp on Vulkan in Termux, while llama-cli functions correctly; the user suspects the absence of glslangValidator during the build process might be related. The user seeks help understanding the differences between server and CLI operation in this context and is exploring potential solutions, including rebuilding with OpenBLAS and addressing GPU-related issues.
- The comments discuss potential causes for the core dumps, including missing dependencies and differences in how server and client processes interact with the GPU. Suggestions include checking GPU compatibility, considering OpenCL as an alternative, and addressing background process handling on Android. The user shares additional logs and details about their setup, and others provide insights into possible solutions and performance considerations.
- Number of comments this week: 9
- Misc. bug: SYCL out of memory error: This issue involves a memory allocation error encountered when using the SYCL backend in a llama.cpp project, where the user is unable to allocate 568 MB of memory on a device with 16 GB of shared GPU memory, despite the same setup working with the Vulkan backend. The problem persists across different versions and affects both llama-cli and the Python bindings, suggesting a potential inefficiency in the SYCL backend's memory management.
- The comments discuss potential solutions and optimizations, including reducing context length and using the -nkvo option, which works but is slower. There is speculation about memory inefficiency in the SYCL backend, with plans for future optimizations. The issue remains open for further discussion, with suggestions to test different configurations and to consider limitations similar to those in OpenCL. (A sketch of the equivalent programmatic setting appears after this list.)
- Number of comments this week: 8
- Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend: This issue reports a bug in the Qwen2-VL model when using the Vulkan backend, where the model generates descriptions unrelated to the provided image, despite working correctly on the CPU backend. The user notes that while the Vulkan backend is not fully supported for Qwen2-VL, it should only result in slowdowns rather than incorrect outputs.
- The comments discuss various attempts to resolve the issue, including testing with an F16 vision projector and enabling GGML_VULKAN_CHECK_RESULTS to identify broken operations. Despite some initial confusion and attempts to fix the problem, it is confirmed that the issue persists when running CLIP on Vulkan, and the discussion suggests keeping the issue open until a proper fix is implemented.
- Number of comments this week: 7
- DeepSeek Models (V2/V3) Hang with ROCm Backend: This issue involves the DeepSeek models (V2 and V3) hanging when using the ROCm backend, where the models load into VRAM but fail to generate output, with one GPU stuck at 100% utilization while the others remain idle. The problem persists across multiple attempts and is consistent across different quantization methods, although the models run as expected when using CPU only.
- The comments discuss attempts to diagnose the issue, including running commands to test CPU-only performance, which works as expected. There are observations of the problem persisting after reboots, with one user noting that the issue resolves temporarily after running certain commands or reverting to an older commit. Another user reports that letting the process run for several minutes eventually produces output, suggesting a potential delay issue, but this is not consistent across all users.
- Number of comments this week: 7
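As referenced in the Vulkan segfault item above, the fix the commenters propose is to tear down Vulkan objects explicitly before the process exits instead of relying on static destructors. The sketch below is a minimal illustration of that idea, not llama.cpp's actual backend code; the function name shutdown_vulkan and the assumption of a single VkDevice and VkInstance created earlier in the program are both hypothetical.

```cpp
// Hedged sketch: explicit Vulkan teardown before process exit, as proposed
// in the segfault discussion above. Not llama.cpp's actual backend code.
#include <vulkan/vulkan.h>

void shutdown_vulkan(VkDevice device, VkInstance instance) {
    // Drain all in-flight GPU work before destroying anything.
    vkDeviceWaitIdle(device);
    // Destroy the logical device first, then the instance. Skipping this
    // and relying on static destructors that run during library unload is
    // what reportedly crashes Nvidia-specific driver threads at exit.
    vkDestroyDevice(device, nullptr);
    vkDestroyInstance(instance, nullptr);
}
```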
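Similarly, for the SYCL out-of-memory item, the -nkvo flag keeps the KV cache in host memory rather than on the device. Below is a hedged sketch of the equivalent programmatic setting; the field name offload_kqv is taken from llama.h as of this period and should be verified against your checkout.

```cpp
// Hedged sketch: the programmatic analogue of the -nkvo CLI flag,
// assuming llama_context_params exposes an offload_kqv field (present in
// llama.cpp builds from this period; verify against your version).
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx       = 4096;  // a shorter context also shrinks the KV cache
    params.offload_kqv = false; // keep the KV cache in host memory (-nkvo)
    // ... load a model and create the context with `params` as usual ...
    return 0;
}
```

Keeping the KV cache on the host trades device memory for transfer overhead, which matches the commenters' observation that -nkvo works but is slower.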
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 31
Summarized Issues:
- Compilation Failures and Errors: Compilation issues are prevalent across various systems and configurations, often due to unsupported features or parameter changes. These problems manifest in different environments, such as Raspberry Pi, BSD operating systems, and when using CUDA on Ubuntu, requiring specific adjustments or updates to resolve.
- Missing Libraries and Shared Objects: Several issues highlight the absence of necessary shared libraries, such as libllama.so, causing execution errors. Users have suggested troubleshooting steps, such as adjusting library paths or compiling statically, to address these missing components.
- Backend and Architecture-Specific Problems: Various backend-related issues arise, including segmentation faults and performance discrepancies on specific architectures like aarch64 and AMD GPUs. These problems often require backend-specific solutions or optimizations.
- Feature Requests for Enhanced Functionality: Users have requested new features to improve usability and functionality, such as implementing CPY operations for Vulkan, slider controls in settings, and navigation through previous messages. These requests aim to enhance user experience and operational efficiency.
- Performance and Efficiency Concerns: Performance issues are reported, such as significant slowdowns in inference times and inefficiencies in model warmup processes. Users seek to understand and mitigate these performance bottlenecks to optimize processing times.
- Bug Reports and Error Handling: Various bugs are reported, including errors in model loading, input handling, and unexpected behavior during execution. These issues often require debugging and code adjustments to ensure proper functionality.
- Security and Malware Detection Concerns: Potential false positive malware detections on CI releases are a concern, with specific Windows binaries being flagged as malicious. This issue may be linked to dynamic JSON object generation in the code or other unrelated factors.
- Runtime and Execution Errors: Users encounter runtime errors due to missing files or incorrect configurations, such as the absence of required templates or libraries. These issues necessitate configuration adjustments or file restorations to resolve execution failures.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 24
Summarized Issues:
- Feature Requests for Model Support: Several issues highlight requests for adding support to the GitHub project for various models, including Zyphra/Zamba2-2.7B, Macro-o1, and DeepSeek-v3. These requests emphasize the need for compatibility with hybrid models, advanced features like Chain-of-Thought fine-tuning, and strategies for handling large models with offloading to RAM.
- Compilation and Installation Issues: Various issues report problems related to compilation and installation, such as errors with CMakeLists.txt, Vulkan shader compilation on Debian, and OpenBSD compilation failures. These issues often involve missing package configurations, unsupported extensions, and macro errors, with discussions on potential solutions and workarounds.
- Bugs in Model Execution and Tokenization: Several issues describe bugs in model execution and tokenization processes, such as Vulkan-supported model failures on Android, discrepancies in token sequences, and errors during inference due to large inputs. These bugs often require adjustments in settings or code to resolve.
- Server and API Issues: Issues related to server and API functionalities include problems with serving static files, incorrect token insertion, and inefficiencies in handling large prompts. These issues often involve authorization errors, unexpected token outputs, and suggestions for improving system performance.
- Performance and Optimization Concerns: Some issues focus on performance problems and optimization needs, such as throughput degradation on macOS, excessive memory usage in CUDA kernels, and uneven GPU memory distribution. These concerns often lead to discussions on temporary solutions and long-term improvements.
- Package and Configuration Conflicts: Issues involving package and configuration conflicts include problems with the gguf package registering a scripts package and incorrect directory placements in conda environments. These conflicts often suggest renaming or restructuring to prevent import issues.
- Documentation and Usability Issues: Some issues highlight documentation and usability problems, such as broken links in model provisioning documentation and the need for prompt processing cancellation features. These issues often suggest improvements in documentation clarity and feature implementation.
- Security and Compatibility Concerns: Issues related to security and compatibility include antivirus software detecting potential backdoors and difficulties in loading converted models. These concerns often require careful handling of files and following specific methods for successful execution.
- Hardware and Performance Enhancements: An issue discusses the discovery of an ARM hardware acceleration library, Arm NN, which aims to enhance machine learning inference performance on ARM CPUs and GPUs. This library leverages architecture-specific optimizations to improve efficiency.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 15
Key Open Pull Requests
1. vulkan: scale caching for k quants + misc fixes: This pull request aims to enhance the performance of inference in the Vulkan backend by implementing parallel extraction and caching of scales for certain quantization types, while also addressing various miscellaneous fixes and optimizations, although it excludes Q4_K and Q5_K due to their complex scale packing.
- URL: pull/11081
- Merged: No
- Associated Commits: d122d5c987b8b13483190acf9838535298e585f5, 6b06d1689011196ff3312277530402adefb53fbb, 21c6b805c99d90332250adb969e128171d431525, b0e4ccbeb95f3987052fbc08500459773216a691, 07d0d58bef57366233b52c421632cfb2e54c76d4, d70a731639d9acd0baa588a94bfaa5f928b26b9c, c01ccf8288f1e375ed5be535bab1efc43c213406, bdd98c74e24b38a820aacecd5d0cef1149b66879, 173077180ff5e63ffeda96a7b451303a9af69543, b4ae7005e66cb03d8de601dd271b10a1127970c4, cdf70cf27fb9c8abb0caa2f3104b66616c207f60, 6f5d62b098a45d9c4a0d833d03ec68848b0c06b5, 91f1d9ce991f068a0bc39befecadf0fa52800e24, cc28742ca39b12efea5f9b8d87d44860a3430ccb, fe71a8c4a12540f0ae666e2b6d518f54a057d833, 923e9a8377dfc76a189c0e3f5e06aff4384453e3, c9463641af21791b6b4c9130cedb9e6a9df75c53, 51b5ac507db6c7e4288c57cf2e7955c411ef8490, 973bc4069f0d8da9e11c2612bc4b085e18d975af, 6145fc79e5117959e49a667ea76f72649922e705, 845d572b877a94c91b756e8532787f7b9507458f
2. llama : functions -> methods: This pull request reorganizes the llama.cpp project by isolating the functionality related to struct llama_model and struct llama_vocab into their respective modules, moving tensor data loading to src/llama-model.cpp, and primarily focusing src/llama.cpp on graph build logic, with plans to make struct llama_vocab public in the API and update calls accordingly.
- URL: pull/11110
- Merged: No
- Associated Commits: 609ec7e0a0961208702e65710e250bf1d67a31c2, c725f691ea291f36cfa52779922cb29d7770915c, 45aab64e93d11aa08ba1c722fc0c65c1ff458364, a857dc50af223cf721f2868f8395a58b12fc4117, aeeb9420a32699f5f55afc5471bcae61bcc8e473
3. llama : update API names to use correct prefix: This pull request aims to improve the consistency of the public API in the llama.cpp project by updating API function names to include the correct prefixes, renaming functions such as llama_n_ctx_train to llama_model_n_ctx_train for better clarity and organization. (An illustrative snippet appears after this entry.)
- URL: pull/11174
- Merged: No
- Associated Commits: 1586ed50611c69de5305d934c8e94b00ef56e34c, 1d9f1f277867e8993d79a6c4faf2efa06a7c24fa, 940139cd29a2b2792d24a52e07b8f5d4daa2df26, 36074d1eb80352304eb3b4f87c11de36d044c9ac
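The renaming pattern in pull request 3 is easiest to see side by side. The snippet below is illustrative only, built from the single example the PR description gives (llama_n_ctx_train becoming llama_model_n_ctx_train); the other renames are assumed to follow the same llama_model_ prefix convention.

```cpp
// Illustrative only: the prefix convention described in the PR above.
#include "llama.h"

int32_t training_context_size(const llama_model * model) {
    // before: llama_n_ctx_train(model)
    // after:  llama_model_n_ctx_train(model) -- the llama_model_ prefix
    // signals that the function operates on a llama_model
    return llama_model_n_ctx_train(model);
}
```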
Other Open Pull Requests
- CUDA and Vulkan Enhancements: Several pull requests focus on enhancing GPU performance and compatibility in the llama.cpp project. These include the introduction of CUDA kernels for ternary quantization, Vulkan shader support for data copying between floating-point and quantized formats, and optimization of Vulkan components for performance improvements.
- Web UI and Usability Improvements: Enhancements to the server Web UI and usability scripts are addressed in these pull requests. They include enabling pre-filling of text areas via URL parameters and sorting supported models in scripts for easier access.
- Build and Testing Enhancements: Pull requests in this category introduce new workflows and tests to improve build processes and performance measurement. This includes a new GitHub Actions workflow for visionOS and a test for measuring token generation rates.
- Code Consistency and Maintenance: These pull requests focus on maintaining code consistency and addressing issues. They include updates to API names for consistency, addition of naming guidelines, and fixing broken package detection paths.
- SYCL and ROCm Adjustments: Adjustments related to SYCL and ROCm are made in these pull requests. They include the introduction of a gated linear attention kernel using SYCL and the removal of an outdated HIP workaround for ROCm.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 53
Key Closed Pull Requests
1. [swift] add module omnivlm: This pull request aims to add a new module called "omnivlm" to the project, which includes support for features such as ggml, omni-audio, and qwen2-audio, an update to C++17 for compilation, and the addition of examples in C++ and Python. It also addresses issues such as memory leaks, build errors, and compatibility with platforms like iOS and Android.
- URL: pull/11171
- Merged: No
- Associated Commits: 5f81588780339fd58c172591aa8273a198a20bca, 3a3552632aa4e9529a9bff1030db37c153fd4a42, 4a29bca867e2601a2e69e007640ac1abb9f3a381, c7b912bdca66ad5cc3edef66117d35837006ff2e, f0d1c4fa1c06c1bc9cd7a42b83e32f73e69e6fbe, 9e67ef75b46b4d267b9df4ac6c1f232681470a4c, 4bdc70aaac8884df987f4b079b3d063f2f31e076, d277c674ae20a3a1277f71a14a2956bf5b3196ca, 995baefeed7407cdbaf4e8aef4debfeb2621a12b, a4747b2edb90b9fbf8cb7c3108ba973fc79d7152, 6f1ed6e5cb1e8003b1b7146bc5aaf1e525bf9096, 141968108994905dc481863b75e0837cb693f5e3, d42e0371f84b413c25511328d75f079962c6fbbb, 05853eb861d522cc51c450efbabdc1470118cf5b, 91b3cafbb5acee59ad5cf94a05c952c5177d2969, d6c0627d31866d865f16e862a5456f3bb8857dfd, 983b4625ef51503853979ddbedd7df14084faadd, b535cd941e657ac1984d8022dd5f0c98f2b9e265, 38c6fa3b8fb6c88075102fd859d04eaea27aa87c, 22da7bc379b491680a7db25600c14f8addfbc93d, 5574bda471c4ddfa263438f5ca978ccad2e85903, b24a409e22dc49daa7f7cb422492281403dfb239, 6a4cf0b983195c7f32251e6d550f3c65b854ca6b, 5edadffd887f2b72ebc93134e7ad76082757b75a, 20b9f02cee483d09d15832c35e6117e5a020f517, 3dfac7817f8f40562b559e877b50b104d697bcf8, df5841b6b8ac0740b1e2310f9e8ae609d6290b3c, 86c2233a38c963b2b9112994a9f9c3890b6522f0, 400fc2a4b09d37d3256c803d8f4292385285dad6, b17684efb3a1da600bbde26cd6554f74e964af2f, 16c22471e88a9c8bd2049be890642ba496ee496f, 3d9c63a3ffc27fc10c910e3b71b89c87008926d7, eb6d54679e518edebf2ee8b5f39c0dcb613811cc, 8c417282d52e9b8931ff3e93ff6382b85be81d87, d5df53658f587fe1bd0de22376b3dadc055eb713, ecfe0b487f49d52e0d9012b89ca40d07b3f38b41, 667a6d9838931f2aaab95ee9d70142dc1ba057bd, d04e354f2f55510e7a4c9dfa4659e4861d7290d3, 21bc833273721000501fd8c742450731db6d4709, 6f0e8c3ee6be4695863545f786bc9159548eed31, 7cf07df5e20c12624376656ce81c06b621cbb3a6, 362bdf32924b8a0c00c8998b9a9f274977e07b80, 5f2d95849269f4b847afc9563de763ff5daf2afe, 55953d35a4aab2ffa96fde090bdd48ff7e385f16, 82dbdbdb40b3b2e4a945d23e9e952d4e463d598e, 89bcf5a6d918c6dd0987432b859680d3b8548929, 4e80184c321b9c04866735ca6f4545eb15919e4a, 98297afbd590d511b90c7e2b6986f6d8788c25a4, bb33473f08db604e1f30334366032f0904e2a722, fc25544867f591e0831dea493675ff0d8775dfc9, b9845b4f63bb50eb16c1d510706ddb885b380975, aad0167bc3accc17ec80db5225576e4130383cc7, 8e2e6304057af44e66c0c3a123ca798dc4d25a55, e4ca946c48ee6e1a848cf88e5f81680179b0fbf5, 25190fefa29d946ba28f92a01599c228f0c66e9d, fe792d62b1bed14c6dbc48421840473eae2a08ae, fd2c58286aaeb4ed51d6b963344a6d2584e25ab5, 7589158595091a88b7844c83569f68c780469d5b, bbf1aaa7eddb2eccd3a955f476a4e07475cae3be, 460212ac2a61cd24f479bba145a9e652f01f31b3, fe8c7b45fd5eca1c38a09c257ebf8cf1ccae3a4a, 43f41a4c00086d163463b79a3bc55d1656b6bf2b, 3479f516ea55c9a278986e9a300a163979be4177, 809db95990cd53c62bce94afb9ad99848d770413, a2c53052bdb74dde139ed61b3f4e724e3848b7d5, 661b3f718c4b31793875f4c1d310ee12076b4ae3, 0b15d2d7452f4cdbb2295bec1979fe9194ae7400, 71b563ec9a4ab9c06fb23d1b72ea3688d8843bf4, 97267e60bdfa06126899bee025a0d52f3b36f2e9, be54cb02ff14354ac78dd8ec8a9efa170475b00d, ca7e8ef19e1e3ca1558d64e184218e83294ebb5d, 07c7ff3e4a0e067a78e61bad11964376aec8c9da, b86cdedb7e5d0b9b2fe61404c39010a149da99be, b2958b33ddd4c8f13c98fb1c1249ac067769df91, 64a6001a1a408129eb510f49840947876220c5fa, 5962b506bab3f46821e0fb74bcbe224cb6b10b68, 1487d32b46a210c5619886af8fe24c93091f7ca0, d3eeeae2185b8f1cb626421ae96fb8af76b2ce82, 61777707ca6aecf077d35c7439dd263342a36226, 91ab9ed858aabb7eea7ddcc0ec6843367404a148
2. Add support for QRWKV6 hybrid models & slight optimization for RWKV6: This pull request introduces support for QRWKV6 hybrid models, which combine the Qwen2.5 architecture with RWKV6 to convert Qwen2.5-32B-Instruct model's QKV attention into RWKV6 linear attention, and includes optimizations for RWKV6 such as graph simplification and concatenated lerp weights to reduce CPU overhead during inference.
- URL: pull/11001
- Merged: Yes
- Associated Commits: f298f0397084dcc50e1a1f2dbdb1ed36daf10687, 385b611d4590c3b761e97d6fd99f710c5b5a7c85, fab0aa7b1a52e4aebe484878a35ba372ad821b5f, bc930cd59a3c245da4c9f47f1458e65bead1e92d, f2c1a5c91892656c3b399fb205017b519e1e94ca, aaa870e80eea3fdda0be7fed4ed28d5c5ec8910a, 00930e6fe5a64baf3faccab9b12ef2638e3c6a60, 08cf56060bc3bbe55e9a40db423b36567bfd6f4b, 331581b2e3d46cac285b34447ec8ad15cb212f95, aed0afb40884d0066ea64046fdd0d70575accdf2, d8a304c2ef86a5449bb075bb707fbde66b3b4ad4, 324afba5ccac7250b251f4cff31c812e2e86a3fc
3. Add support for DeepSeek V3: This pull request adds support for the DeepSeek V3 model by introducing changes such as a new boolean parameter for expert weights normalization, a numerical parameter for an expert gating function that uses sigmoid instead of softmax, a tensor type for expert weights bias, updates to the llm_build_moe_ffn() API, and a new pre-tokenization regex, while omitting the multi-token prediction feature. (A hedged sketch of the gating scheme appears after this entry.)
- URL: pull/11049
- Merged: Yes
- Associated Commits: 0061955a067be69104655f0677d367c680ac5a43, a43d4953ba77dda8ece5f46d21d6675e20f8c696, 93aca64520f907cb1b56ee35e6c485af567e6ecd, d2f784d50d3b64ce247a29f7c449bd255fe6e18a, 140eb292644f201aadc042392419dea0da236ecc, 5b4673b3dd8e65f74b81538f992395a89180e1f9, dfffe676118b3878d8465602ea5bbada7abd2d34, ddb2dddac108f69f319825f522f77a8f82eae913, a48c3df3df2220fc0df3a7038cdbdd2b9ed4eb3b, 4a58b99777d357c1457f6c97d9462bc6aa3e6646
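The gating change in the DeepSeek V3 pull request is its most algorithmically interesting part: experts are scored with a per-expert sigmoid plus a learned bias rather than a softmax over router logits. The sketch below is a self-contained, plain-C++ illustration of that routing scheme, not llama.cpp's llm_build_moe_ffn(); the split between biased selection and unbiased weighting follows the DeepSeek V3 paper and is an assumption about the exact semantics here.

```cpp
// Hedged sketch of MoE routing with sigmoid scores and a per-expert bias,
// as described for DeepSeek V3. Not the llama.cpp implementation.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

std::vector<float> route_experts(const std::vector<float> & logits,
                                 const std::vector<float> & bias,
                                 int top_k, bool normalize) {
    const int n_expert = (int) logits.size();
    std::vector<float> score(n_expert);
    for (int i = 0; i < n_expert; ++i) {
        score[i] = 1.0f / (1.0f + std::exp(-logits[i])); // sigmoid, not softmax
    }
    // Select top_k experts by score + bias (the new expert weights bias tensor).
    std::vector<int> order(n_expert);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + top_k, order.end(),
        [&](int a, int b) { return score[a] + bias[a] > score[b] + bias[b]; });
    // Weight the selected experts by their unbiased scores, optionally
    // renormalized (the new boolean expert-weights-normalization parameter).
    std::vector<float> weight(n_expert, 0.0f);
    float sum = 0.0f;
    for (int k = 0; k < top_k; ++k) sum += score[order[k]];
    for (int k = 0; k < top_k; ++k) {
        weight[order[k]] = normalize ? score[order[k]] / sum : score[order[k]];
    }
    return weight;
}
```

Because each expert is scored independently by a sigmoid, adding or biasing one expert does not redistribute probability mass across all the others the way a softmax does.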
Other Closed Pull Requests
- Compatibility Enhancements: This topic includes pull requests that enhance compatibility across various components of the project. The updates include changes to scripts for better model conversion, addressing hardcoded paths for Vulkan shader generation, and fixing compatibility issues with AMD's Vulkan driver.
- Code Refactoring and Optimization: Several pull requests focus on refactoring and optimizing the codebase. These changes include refactoring GGUF code into C++, optimizing tokenization processes, and introducing new formats for improved performance.
- Bug Fixes and Issue Resolutions: This topic covers pull requests that address various bugs and issues within the project. Fixes include resolving package registration conflicts, addressing Vulkan extension issues, and fixing macro clashes.
- Feature Additions: New features have been introduced through several pull requests. These include adding support for the PhiMoE architecture, introducing tooltips in the web UI, and adding a new "phi 4" template to the llama-chat project.
- Documentation and Template Updates: Pull requests in this category focus on improving documentation and templates. Changes include adding a README.md to the TTS example and updating the bug report template to include command line fields.
- Continuous Integration and Build System Improvements: Enhancements to the CI process and build system are covered in these pull requests. Updates include pinning workflows to specific versions and fixing a CMake option for better integration.
- API and Functionality Changes: This topic includes pull requests that modify APIs and functionalities. Changes involve renaming functions for consistency, updating the llama_model API, and introducing the struct llama_vocab.
- Performance Improvements: Pull requests in this category focus on enhancing performance. Notable improvements include adding BF16 support for CUDA/HIP and introducing an INT8 implementation for matrix multiplication.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| ggerganov | 148 | 34 | 4 | 97 |
| ngxson | 92 | 17 | 3 | 101 |
| slaren | 15 | 8 | 0 | 73 |
| netrunnereve | 52 | 4 | 0 | 16 |
| jeffbolznv | 12 | 7 | 0 | 49 |
| 0cc4m | 7 | 3 | 0 | 39 |
| qnixsynapse | 20 | 4 | 0 | 16 |
| Djip007 | 5 | 0 | 0 | 34 |
| JohannesGaessler | 15 | 4 | 0 | 14 |
| fairydreaming | 0 | 3 | 1 | 27 |