Weekly GitHub Report for Llama.cpp - 2024-07-22 12:00:02
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. Issues
1.1 Open Issues
Open Issues This Week: 41
Summarized Issues:
- Runtime Errors and Parsing Issues: This topic covers issues related to runtime errors and parsing problems in the llama.cpp project. One issue reports a RuntimeError due to the inability to parse the ModelProto from a tokenizer model file. Another issue describes a bug where the process of measuring perplexity gets stuck indefinitely at the "tokenizing the input" stage. Additionally, there is a problem with the `convert.py` script encountering a "No such file or directory" error.
- Feature Requests for Model Support: This topic includes feature requests for adding support for various models in the llama.cpp project. Requests include support for the Qwen VL vision model, the newly released 7B coding model Codestral Mamba, and the Lite-Mistral-Instruct chat template. There are also requests for supporting the "LlavaMistralForCausalLM" architecture and the SmolLM family of models.
- Vulkan Backend Issues: This topic addresses various issues with the Vulkan backend in the llama.cpp project. One issue reports a failure to build on the RISC-V platform due to a missing Vulkan GLSLC executable. Another issue describes a problem where the Vulkan backend build fails on Windows using MSVC cmake. Additionally, there is a bug where the "rpc-server" binary fails to utilize the GPU on a Linux system.
- CUDA and GPU-Related Bugs: This topic covers bugs related to CUDA and GPU usage in the llama.cpp project. Issues include a "CUDA error: invalid device function" on an AMD Radeon RX 6700 XT GPU, crashes due to out-of-memory errors on a Framework Laptop 16, and an assertion failure during the finetuning process using CUDA 12 on Windows 11.
- Inconsistent Outputs and Seed Issues: This topic discusses issues related to inconsistent outputs and problems with random seed handling in the llama.cpp project. One issue describes inconsistent output content despite setting a fixed seed and temperature. Another issue highlights discrepancies in outputs when specifying a random seed manually versus automatically generated seeds.
- Build and Compilation Errors: This topic includes issues related to build and compilation errors in the llama.cpp project. Issues include a warning during the Docker image build process due to a casing mismatch, and a bug where the Vulkan backend build fails on Windows using MSVC cmake. Additionally, there is a problem with the `convert.py` script encountering a "No such file or directory" error.
- Feature Requests for Enhanced Functionality: This topic covers feature requests aimed at enhancing the functionality of the llama.cpp project. Requests include adding a `--silent` flag to the executable, support for reranking API endpoints and models, and the integration of a feature to pull models from the Ollama repository.
- Documentation and Usability Issues: This topic highlights issues related to documentation and usability in the llama.cpp project. One issue points out the lack of descriptive release notes on the GitHub page. Another issue highlights the absence of documentation on how to offload layers to the GPU using the `-ngl` option (an illustrative offloading sketch follows this list).
- Model Conversion and Tokenization Issues: This topic addresses issues related to model conversion and tokenization in the llama.cpp project. Issues include an error encountered while converting the SmolLM-1.7B-Instruct model due to an unrecognized BPE pre-tokenizer, and a request for support for fast tokenizers using a `tokenizer.json` file.
- Miscellaneous Bugs and Issues: This topic covers various other bugs and issues in the llama.cpp project. Issues include a bug where legacy models started issuing an End Of Sequence (EOS) token after updating the docker image, and a problem with the `export-lora` command not supporting GGUF files.
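For the GPU-offloading documentation gap noted above, here is a minimal sketch of how layer offloading is typically configured. It is illustrative only: the issue concerns the native CLI's `-ngl` flag, while this sketch assumes the separate llama-cpp-python bindings, where the same setting is exposed as `n_gpu_layers`; the model path is hypothetical.

```python
# Illustrative sketch, assuming the llama-cpp-python bindings (pip install llama-cpp-python).
# The native CLI expresses the same setting as `-ngl N` / `--n-gpu-layers N`.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=32,  # number of transformer layers to offload to the GPU (-1 = all)
    n_ctx=4096,
)

out = llm("Explain GPU layer offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```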
1.2 Top 5 Active Issues:
We consider active issues to be issues that have generated a significant amount of discussion in their comments.
- Bug: QWEN2 quantization GGML_ASSERT: This issue involves a bug encountered when attempting to quantize the Qwen2 7B Instruct model to IQ2_XS, resulting in an assertion error related to the grid index. The user has provided relevant logs and details about their environment, and is seeking assistance to debug the problem.
- The comments discuss various errors encountered during different quantization attempts, potential causes such as `nan` values in the imatrix, and possible solutions including patches and workarounds like using flash attention. The conversation also covers the impact of hardware and software configurations on the issue, and includes suggestions for further testing and verification.
- Number of comments: 73
- Please compile also clblast version!: This issue is about a user requesting the reintroduction of the CLBlast version for their NVIDIA GTX 970M laptop, as it previously provided performance improvements by offloading tasks to the GPU. The user notes that other versions do not utilize the GPU memory as effectively.
- Multiple users express similar needs for the CLBlast version, citing issues with alternatives like Vulkan and CUDA. Some users offer to help maintain the CLBlast backend, while others discuss technical challenges and potential solutions for reintroducing or maintaining the OpenCL backend. The conversation includes suggestions for forking the project to restore CLBlast functionality and debates on the necessity and feasibility of maintaining it within the main project.
- Number of comments: 41
- Subtle Vulkan shader compilation bug when running on Adreno GPUs (Samsung Galaxy S23 Ultra): This issue describes a subtle bug in Vulkan shader compilation when running on Adreno GPUs, specifically on the Samsung Galaxy S23 Ultra. The problem occurs in the `dequant_q4_K_body` function, where the shader crashes with a compilation error, and a workaround involves modifying the code to avoid branch convergence inside the loop.
- The comments discuss the necessity of compiling shaders on Android, the process of generating and using shader bytecode, and the subsequent issues encountered during inference, including a "device lost" error. Various debugging steps and potential fixes are suggested, such as enabling validation layers, adjusting the use of compute queues, and limiting the number of operations per command buffer. The conversation also touches on broader issues with Vulkan backends on Adreno GPUs, including memory allocation limits and driver-specific quirks, with some users sharing their experiences and modifications to improve performance and stability.
- Number of comments: 32
- [feature] Make space insertion optional in SentencePiece tokenizer: This issue discusses making space insertion optional in the SentencePiece tokenizer, as the current behavior of inserting spaces into non-empty text disrupts various use cases, such as prompt size control and specific prompt formats. The issue proposes three potential solutions: tying space insertion to the beginning-of-sequence (BOS) token, adding a separate argument to control space insertion, or creating a new tokenization function with multiple options (a brief tokenization sketch follows this list).
- The comments section includes a detailed discussion on the pros and cons of the proposed solutions, with contributors testing and debating the behavior of space insertion in different scenarios. There is a consensus that the current implementation is problematic, and various suggestions are made to improve it, including adding a new advanced tokenization function and ensuring compatibility with existing APIs.
- Number of comments: 28
- Bug: CUDA error: out of memory - Phi-3 Mini 128k prompted with 20k+ tokens on 4GB GPU: This issue reports a CUDA out-of-memory error when using the Phi-3 Mini 128k model with a large prompt on a laptop with an Nvidia A2000 4GB GPU. The user suggests that the expected behavior would be to avoid crashing by reallocating memory and seeks guidance on disabling GPU usage for CPU-only inference.
- The comments discuss the inefficiencies in GPU memory allocation, potential bugs, and various suggestions for improving memory management. There are mentions of specific compilation options and the need for more detailed testing to identify the exact cause of increased memory usage. The conversation also touches on the challenges of maintaining open-source projects with limited developer resources.
- Number of comments: 27
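To make the space-insertion behavior from the fourth issue above concrete, here is a minimal sketch using the standalone sentencepiece Python package; the package choice is an assumption for illustration (llama.cpp reimplements the tokenizer natively), and the `tokenizer.model` path and example outputs are hypothetical.

```python
# Minimal sketch of the space-insertion behavior discussed above, using the
# standalone sentencepiece package (an assumption; llama.cpp reimplements this
# tokenizer in C++). The tokenizer.model path is hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# With the default add_dummy_prefix normalization, text is tokenized as if a
# leading space were present, so pieces carry the U+2581 word-boundary marker.
print(sp.encode("testing", out_type=str))   # e.g. ['▁testing']

# A continuation fragment tokenized on its own is treated as a new word, so
# piecewise tokenization cannot reproduce the tokens of the full string:
print(sp.encode("test", out_type=str) + sp.encode("ing", out_type=str))
# e.g. ['▁test', '▁ing'] -- not a valid continuation of ['▁testing']
```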
1.3 Top 5 Quiet Issues:
We consider quiet issues to be issues that have been opened in this project for the longest time. The team should work together to get these issues resolved and closed as soon as possible.
- llama : combined beam search + grammar sampling strategy: This issue proposes a feature that combines beam search and grammar sampling strategies to perform constrained evaluation of logits based on a union of possible text values, aiming to select the most probable outcome overall. The implementation would involve parallel generation and logit evaluation of multiple possible paths to ensure optimal results, especially in cases where multiple choices start with the same logit.
- Open for 325 days, 15 hours, 55 minutes
- [feature] Make space insertion optional in SentencePiece tokenizer: This issue addresses the problem of automatic space insertion in the SentencePiece tokenizer, which disrupts various use cases such as precise token count management and prompt formatting. The proposed solutions include making space insertion optional through different methods, such as conditional insertion based on the presence of a BOS token or introducing a separate argument to control this behavior.
- Open for 277 days, 14 hours, 44 minutes
- Recoverable Error Handling: This issue is about implementing a form of error handling in the llama.cpp project that does not immediately terminate the process when an error is encountered, as the current GGML_ASSERT method does. The motivation behind this request is to enable more robust service development, particularly for projects like LLamaSharp, where handling errors through exceptions rather than abrupt termination is more practical.
- Open for 225 days, 22 hours, 00 minutes
- Subtle Vulkan shader compilation bug when running on Adreno GPUs (Samsung Galaxy S23 Ultra): This issue describes a subtle bug in the Vulkan shader compilation process when running on Adreno GPUs, specifically on the Samsung Galaxy S23 Ultra. The problem causes shader compilation to fail, and a workaround involving code duplication in the if-branches has been identified to prevent the crash, suggesting a potential bug in the Adreno shader compiler.
- Open for 174 days, 15 hours, 16 minutes
- llama : create llamax library: This issue involves the creation of a new library called `llamax`, which aims to wrap the existing `llama` library and provide high-level functionality to simplify its integration into third-party projects. The primary objective is to make it easier for most projects to interface with `llama.cpp` through the `llamax` API, while still allowing access to the low-level `llama` API for more specialized use cases.
- Open for 173 days, 09 hours, 24 minutes
1.4 Closed Issues
Closed Issues This Week: 49
Average Issue Close Time (This Week): 36.06 days
Summarized Issues:
1.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions in this project's open issues over the past week to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open issues from the past week.
II. Pull Requests
2.1 Open Pull Requests
Open Pull Requests This Week: 21
Pull Requests:
- Fallback Type Change: This pull request proposes changing the fallback type from `IQ4_NL` to `Q4_0` in the llama.cpp project, because `IQ4_NL` is not implemented in some backends. The change aims to ensure compatibility across different backends.
- Documentation Updates: This pull request adds AI Studio to the list of user interfaces in the documentation of the project. It also updates the README.md file to include steps for running CMake. Additionally, it clarifies that the `n_keep` parameter excludes the BOS token, addressing a potential source of confusion for users.
- Code Refactoring: This pull request involves refactoring the `llama` code by reorganizing various components and preparing for future API changes. It includes renaming functions, moving implementations to new files, and updating the Makefile dependencies. These changes aim to streamline the codebase and facilitate future development.
- Model and Performance Enhancements: This pull request aims to simplify the Mamba model by introducing advanced batch splits and various optimizations. It also adds FlashAttention support for Gemma 2 on the CPU and CUDA backends, including performance benchmarks. Additionally, it addresses shape issues in Mistral Nemo by updating the loader to handle different embedding sizes.
- Windows and Platform Support: This pull request addresses improvements for running the project on Windows with Snapdragon X by adding documentation for building on Windows. It also fixes issues related to MSVC's lack of support for C in-line assembly for ARM. Additionally, it addresses the issue of incorrect paths for Vulkan shader compilation on Windows.
- New Model and Feature Support: This pull request adds support for the Chameleon model, enabling text-to-text inference and laying the groundwork for future implementations. It also adds support for the SmolLM pre-tokenizer, including updates to various scripts and files. Additionally, it introduces the ability to override specific tokenizer flags for enhanced customization.
- Metal and GPU Enhancements: This pull request introduces a straightforward Metal implementation of `SSM_CONV` and `SSM_SCAN` using single-threaded kernels. It also addresses a multi-GPU crash issue in SYCL by filtering the platforms. Additionally, it adds `IQ4_NL` support to Vulkan and resolves issues with `iq4_nl` fallbacks in k-quants.
- UUID and Tensor Data: This pull request proposes to automatically generate a deterministic UUID based on tensor data when it is missing in the gguf model file. This ensures consistency in the generated file's hash and improves the reliability of the model files (an illustrative sketch of the idea follows this list).
- CodeShell and Tokenizer Tests: This pull request addresses and fixes issues with CodeShell support that arose after a recent update. It ensures compatibility with the latest version of the repository. Additionally, it aims to re-enable tokenizer tests that are now successfully passing.
- Unicode and Export-Lora Enhancements: This pull request aims to enhance the `unicode-data.cpp` file by adding all Unicode subcategories. It also aims to fix the `export-lora` example by accepting the new LoRA format introduced in a previous update. These changes improve the functionality and compatibility of the project.
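Regarding the UUID proposal above, the following sketch shows one way a deterministic identifier could be derived from tensor contents: hash the raw bytes and fold the digest into an RFC 4122 UUID. This is an illustration of the idea only, not the pull request's implementation; the helper name and tensor name are hypothetical.

```python
# Illustrative sketch only: one way a deterministic UUID could be derived from
# tensor contents, so identical weights always yield the same identifier.
# This is not taken from the pull request itself.
import hashlib
import uuid

def tensor_uuid(tensor_blobs):
    """Hash (name, raw bytes) pairs in order and fold the digest into a UUID."""
    h = hashlib.sha256()
    for name, data in tensor_blobs:
        h.update(name.encode("utf-8"))
        h.update(data)
    digest = bytearray(h.digest()[:16])       # keep 128 bits for the UUID
    digest[6] = (digest[6] & 0x0F) | 0x50     # stamp a name-based (v5-style) version
    digest[8] = (digest[8] & 0x3F) | 0x80     # stamp the RFC 4122 variant bits
    return uuid.UUID(bytes=bytes(digest))

# Hypothetical usage: the same tensor bytes always produce the same UUID.
print(tensor_uuid([("blk.0.attn_q.weight", b"\x00" * 64)]))
```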
2.2 Closed Pull Requests
Closed Pull Requests This Week: 51
Summarized Pull Requests:
2.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions in this project's open pull requests over the past week to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.
III. Commits
3.1 Commits
Commits This Week: 42
Summarized Commits:
- Initialization and Error Handling Improvements: Enhancements have been made to handle null names during initialization in the gguf module and to provide more informative error messages when model files fail to load in the ggml project. Additionally, the `--help` option for the `export-lora` command has been fixed to prevent errors.
- Tokenizer and Pre-tokenizer Updates: Support for the Tekken pre-tokenizer has been introduced in the llama project, including updates to the order of pre-tokenizers and removal of unnecessary assignments. The checksum for the Tekken tokenizer has also been updated.
- Bug Fixes in Language Models: A bug in the llama.swiftui project that caused the system to generate blank lines after receiving an EOT or EOS token has been fixed. Additionally, a typo in the chat template for chatglm4 has been corrected.
- Script and Tool Enhancements: The `gguf_dump.py` script has seen multiple improvements, including better string handling and markdown key-value array printing. The `convert_hf` tool has been optimized for performance with the `--dry-run` option, and a memory leak issue has been addressed.
- Quantization and Dot Product Fixes: Multiple fixes related to quantized dot product operations in the ggml library have been implemented, targeting issues with odd numbers of blocks across various quantization formats and platforms.
- Layer and Error Handling Enhancements: The maximum number of layers for the "llama" component has been increased from 256 to 512, and assert statements have been replaced with exceptions for better error handling.
- Documentation and README Updates: The README file has been updated to correct the server badge and to reflect the latest output of the `llama-server --help` command. Broken links within the development documentation have also been corrected.
- Server and UI Improvements: The server has been updated to use relative routes for static files in the new UI and to respect the `--special` command-line argument. The `api_url` on non-index pages has also been fixed.
- Metadata and Filename Refactoring: The naming convention for GGUF output filenames has been refactored, and the metadata structure has been updated with new entries and a metadata class with automatic heuristics.
- Partial Offloading and CUDA Fixes: An issue with partial offloading in CUDA when the value of `ne0` is not a multiple of 256 has been addressed. Additionally, iquant support for CUDA has been added, and the number of parallel jobs for the CI build has been reduced by one.
- Continuous Integration and Build Process: Issues related to the continuous integration (CI) process have been resolved, and warnings during the Docker build process have been addressed. A macro guard for the pragma GCC directive has also been introduced to prevent warnings on Windows systems.
- NPU Backend Integration: The Ascend NPU backend has been introduced, including various modifications such as renaming, adding logging, and making certain headers private to integrate Huawei's CANN for enhanced AI computing capabilities.
- Parameter and Argument Fixes: A bug related to the `n_predict` parameter has been fixed, and a new argument, `--no-cont-batching`, has been added to the common module.
- Normalization and Deduplication: Issues with duplicate entries in the deepseek2 normalization process have been addressed by implementing a de-duplication mechanism.
- Vulkan Module Fixes: The Vulkan module has been updated to add a missing `LOAD_VEC_A` parameter and to resolve a build error in the Vulkan operation result checker.
- Pydantic Library Updates: The project's pydantic-related code has been updated by replacing the use of `__annotations__` with `get_type_hints` to improve compatibility and support for Python versions 3.9 and 3.10 (a short sketch of the difference follows this list).
- Script Refactoring and Enhancements: The `pydantic_models_to_grammar_examples.py` script has been rewritten for better readability, configurability, and error checking. Its output has been made easier to parse, and the code is now easier to run and maintain.
- Metadata Extraction and Finetune Handling: Various edge cases in metadata name extraction for the `gguf-py` module have been addressed, improving robustness and handling multiple finetune versions. Several fixes and enhancements have been made, including using the correct directory for model card paths and adding more tests.
- Conversion Process Fixes: The Gemma v1 conversion in the `convert_hf` function has been fixed, allowing token renaming with a warning and ensuring that the BOS and EOS tokens are properly set.
- File and Element Management: Unused file types have been removed from the project, and elements have been aligned vertically for better organization.
- Checksum and Lock File Updates: The `flake.lock` file has been updated, and the checksum for the Tekken tokenizer has been revised.
- Command-line Argument Handling: The server has been updated to respect the `--special` command-line argument, and a new argument, `--no-cont-batching`, has been added to the common module.
- Concatenation Functionality: Functionality to concatenate through dimensions 1 and 2 has been introduced in the SYCL project.
- Fibonacci Hashing and Crashes: Crashes related to the implementation of Fibonacci hashing have been resolved (a brief sketch of the technique follows this list).
- LoRA Adapter Support: The LoRA adapter support has been comprehensively refactored, including various fixes, updates, and enhancements such as loading to device buffers, adding patch tensor functions, and implementing conversion scripts.
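Regarding the pydantic-related commit above, the following small sketch (not the project's code) illustrates why `typing.get_type_hints` is preferable to reading `__annotations__` directly: with postponed evaluation of annotations, which is common in code that must also run on Python 3.9 and 3.10, annotations are stored as strings and must be resolved before they can be inspected.

```python
# Small sketch (not the project's code): with postponed evaluation, annotations
# are stored as strings, and get_type_hints resolves them into real type objects.
from __future__ import annotations
from typing import Optional, get_type_hints

class Settings:
    name: str
    retries: Optional[int]

print(Settings.__annotations__)
# {'name': 'str', 'retries': 'Optional[int]'}  <- plain strings, awkward to introspect

print(get_type_hints(Settings))
# {'name': <class 'str'>, 'retries': typing.Optional[int]}  <- resolved types
```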
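The Fibonacci hashing mentioned above is a well-known multiplicative hashing scheme; the following minimal sketch shows the general technique only and is not taken from the project's code.

```python
# Illustrative sketch of Fibonacci hashing (not the project's code): multiply the
# key by ~2**64 / golden ratio and keep the top bits, which spreads consecutive
# keys evenly across the table.
FIB_MULT_64 = 11400714819323198485  # floor(2**64 / golden ratio)

def fib_hash(key: int, table_bits: int) -> int:
    """Return a slot index in [0, 2**table_bits) for a 64-bit key."""
    return ((key * FIB_MULT_64) & 0xFFFFFFFFFFFFFFFF) >> (64 - table_bits)

# Consecutive keys land on well-spread slots of a 256-entry table.
print([fib_hash(k, 8) for k in range(8)])
```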
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, or created at least 1 pull request in the past month.
Contributor | Commits | Pull Requests | Issues |
---|---|---|---|
GitHub | 44 | 0 | 0 |
ggerganov | 0 | 13 | 1 |
mofosyne | 0 | 7 | 0 |
compilade | 0 | 6 | 0 |
0wwafa | 0 | 0 | 6 |
JohannesGaessler | 0 | 5 | 0 |
ngxson | 0 | 4 | 0 |
maruel | 0 | 2 | 2 |
AndreasKunar | 0 | 1 | 2 |
sorasoras | 0 | 0 | 3 |
Georgi Gerganov | 2 | 0 | 0 |
jaime-m-p | 0 | 2 | 0 |
iamlemec | 0 | 2 | 0 |
0cc4m | 0 | 2 | 0 |
RunningLeon | 0 | 2 | 0 |
AmgadHasan | 0 | 1 | 1 |
amochkin | 0 | 1 | 1 |
stduhpf | 0 | 1 | 1 |
HanClinto | 0 | 2 | 0 |
Stillerman | 0 | 1 | 1 |
mirek190 | 0 | 0 | 2 |
Arashimu | 0 | 0 | 2 |
yli147 | 0 | 0 | 2 |
MathiasSchindler | 0 | 0 | 2 |
Sokartecnologi | 0 | 0 | 2 |
perpendicularai | 0 | 0 | 2 |
IMbackK | 0 | 1 | 0 |
katsu560 | 0 | 1 | 0 |
jukofyork | 0 | 1 | 0 |
joeatodd | 0 | 1 | 0 |
Zor-X-L | 0 | 1 | 0 |
ho2103 | 0 | 1 | 0 |
Alcpz | 0 | 1 | 0 |
agray3 | 0 | 1 | 0 |
yeahdongcn | 0 | 1 | 0 |
zhipenghan | 0 | 1 | 0 |
kriation | 0 | 1 | 0 |
danbev | 0 | 1 | 0 |
iboB | 0 | 1 | 0 |
teleprint-me | 0 | 1 | 0 |
65a | 0 | 1 | 0 |
NikolaiLyssogor | 0 | 1 | 0 |
airMeng | 0 | 1 | 0 |
sbonds | 0 | 1 | 0 |
SommerEngineering | 0 | 1 | 0 |
msy-kato | 0 | 1 | 0 |
amitj1jan | 0 | 1 | 0 |
nopperl | 0 | 1 | 0 |
slaren | 0 | 1 | 0 |
luoyu-intel | 0 | 1 | 0 |
EZForever | 0 | 1 | 0 |
ClarkChin08 | 0 | 1 | 0 |
m18coppola | 0 | 1 | 0 |
thxCode | 0 | 1 | 0 |
hankeke303 | 0 | 1 | 0 |
kaetemi | 0 | 1 | 0 |
Adriankhl | 0 | 0 | 1 |
tybalex | 0 | 0 | 1 |
ch1y0q | 0 | 0 | 1 |
warren-lei | 0 | 0 | 1 |
apresence | 0 | 0 | 1 |
RakshitAralimatti | 0 | 0 | 1 |
oldmanjk | 0 | 0 | 1 |
LDLINGLINGLING | 0 | 0 | 1 |
Sakura4036 | 0 | 0 | 1 |
markat1 | 0 | 0 | 1 |
sealad886 | 0 | 0 | 1 |
lin72h | 0 | 0 | 1 |
jie80219 | 0 | 0 | 1 |
nne998 | 0 | 0 | 1 |
StatPan | 0 | 0 | 1 |
jeroen-mostert | 0 | 0 | 1 |
1cekrim | 0 | 0 | 1 |
bong-furiosa | 0 | 0 | 1 |
djain-fujitsu | 0 | 0 | 1 |
m828 | 0 | 0 | 1 |
Battlehub0x | 0 | 0 | 1 |
Fulgurance | 0 | 0 | 1 |
criminact | 0 | 0 | 1 |
VelocityRa | 0 | 0 | 1 |
bartowski1182 | 0 | 0 | 1 |
dafei2017 | 0 | 0 | 1 |
metal3d | 0 | 0 | 1 |
Emmanuel97460 | 0 | 0 | 1 |
vmarchenkoff | 0 | 0 | 1 |
jpoly1219 | 0 | 0 | 1 |
ciekawy | 0 | 0 | 1 |
DanielusG | 0 | 0 | 1 |
ericcurtin | 0 | 0 | 1 |
hgftrdw45ud67is8o89 | 0 | 0 | 1 |
qnixsynapse | 0 | 0 | 1 |
rhvall | 0 | 0 | 1 |
zucchini-nlp | 0 | 0 | 1 |
hipudding | 0 | 0 | 1 |
suncloudsmoon | 0 | 0 | 1 |
newsletternewsletter | 0 | 0 | 1 |
yancaoweidaode | 0 | 0 | 1 |
simon-krannig | 0 | 0 | 1 |
RonanKMcGovern | 0 | 0 | 1 |
nicoboss | 0 | 0 | 1 |
MangoTCF | 0 | 0 | 1 |