Weekly GitHub Report for Llama.cpp - 2024-08-05 12:00:09
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. Issues
1.1 Open Issues
Open Issues This Week: 32
Summarized Issues:
- Compilation Issues: Compilation issues have been reported across various environments, including Windows and Linux. Problems include errors related to C++14 extensions, missing pkg-config, and redefinition errors in the type_traits header file. These issues often require changes in configuration files or additional dependencies to resolve.
- Model Conversion and Quantization: Several issues have been reported regarding the conversion and quantization of models. These include missing pre-tokenizers, errors due to unexpected file formats, and guidance requests for converting specific models. These problems often require script updates or detailed instructions to resolve.
- Performance and Efficiency: Performance issues have been noted, including significant slowdowns in token generation and overheating of CPUs with certain quantized models. Requests for support of IGPUs or NPUs and multiple queues for model loading have also been made to enhance performance.
- Vulkan Backend Issues: Multiple issues have been reported with the Vulkan backend, including device errors, crashes, and performance degradation. These problems often require debugging and updates to the Vulkan implementation.
- Memory and Resource Management: Issues related to memory management include out-of-memory errors, excessive memory allocation, and crashes due to large context sizes. These problems often require configuration adjustments or bug fixes in the code.
- Inference and Execution Errors: Various bugs have been reported during inference, including device lost errors, floating point exceptions, and corrupted output in batched inference. These issues often require debugging and code fixes to ensure stable execution.
- Feature Requests: Several feature requests have been made, including support for new models, additional PPL calculation methods, and function support. These requests aim to enhance the functionality and usability of the llama.cpp project.
- Service Deployment Issues: Issues have been reported when deploying services using the llama.cpp library, including compatibility problems with specific processors and service unavailability errors. These issues often require compatibility fixes or configuration changes.
- Script and Code Bugs: Bugs in scripts and code, such as infinite loops, missing hyperparameters, and linker errors, have been reported. These issues often require code fixes or updates to ensure proper functionality.
1.2 Top 5 Active Issues:
We consider active issues to be issues that have generated much discussion in the issue's comments.
- server : improvements and maintenance: This issue is about improving and maintaining the server example in the GitHub project, which has grown in functionality but is currently unstable and missing important features. The issue aims to track these points and draw community attention to them, as some tasks are significant and require considerable effort to complete.
  - The comments discuss various improvements and suggestions for the server, including adding new features like look-ahead decoding, speculative sampling, and function calling. There are also discussions about refactoring the code for better stability and performance, handling multiple user requests efficiently, and supporting different chat templates. Some comments suggest using external libraries or frameworks to simplify the code, while others emphasize keeping the server lightweight and minimalistic.
  - Number of comments: 109
- Investigate gemma 2 generation quality: This issue is about investigating the quality of the Gemma 2 model generation, particularly focusing on potential problems with the tokenizer in the llama.cpp implementation. Initial reports and discussions suggest discrepancies in the model's output, especially when compared to other implementations and quantization levels.
  - The comments discuss various aspects of the issue, including hard-coded window sizes, tokenizer discrepancies, quantization effects, and specific test cases showing incorrect outputs. Users share their findings, suggest fixes, and compare results with other implementations, highlighting the complexity and ongoing nature of the investigation.
  - Number of comments: 89
- Support BitNet b1.58 ternary models: This issue is about implementing support for BitNet b1.58 ternary models in the llama.cpp project. The BitNet b1.58 models use ternary values (1, 0, -1) and are reported to show performance improvements over fp16 models, but they need to be trained in this ternary mode from the start.
  - The comments discuss the potential benefits and challenges of implementing BitNet, including the need for new quantization methods, the feasibility of training ternary models directly, and the potential for hardware optimizations. There is also a debate about the practicality and economic motivation for training such models, with some commenters expressing skepticism and others highlighting the potential for significant memory and power savings. Several implementations and resources are shared, and there is a call for more substantial model releases to validate the approach.
  - Number of comments: 88
- Support for Phi-3 models: This issue is about adding support for Microsoft's newly released Phi-3 models, which come in three variants: mini, small, and medium. The request is to integrate this new family of models into the project.
  - The comments discuss various aspects of integrating Phi-3 models, including successful initial tests, compatibility issues, and the need for new prompt templates. There are also technical challenges related to the "longrope" technique used in the 128K variant, with users sharing errors and potential solutions. The conversation includes references to relevant research papers, code snippets, and ongoing efforts to implement support for these models. The issue is partially resolved with support for 4K context length models, but the 128K context length models still require further work.
  - Number of comments: 83
- Bug: QWEN2 quantization GGML_ASSERT: This issue involves a bug encountered when attempting to quantize the Qwen2 7B Instruct model to IQ2_XS, resulting in an assertion error related to the grid index. The user also reports that the same error occurs when attempting to quantize to IQ2_S and provides relevant logs and system information for debugging.
  - The comments discuss various errors encountered during different quantization attempts, potential causes such as `nan` values in the imatrix, and possible solutions including using flash attention and modifying the code to handle specific architectures. Users share their experiences, suggest patches, and provide feedback on the effectiveness of different approaches, with some noting improvements and others still facing issues.
  - Number of comments: 73
1.3 Top 5 Quiet Issues:
We consider quiet issues to be issues that have been opened in this project for the longest time. The team should work together to get these issues resolved and closed as soon as possible.
- llama : add test for saving/loading sessions to the CI: This issue involves adding a test for saving and loading sessions to the continuous integration (CI) process of the project. It suggests examining the `save-load-state` example and incorporating a simple test into the `ci/run.sh` script.
  - Open for 354 days, 01 hours, 18 minutes
- llama : tool for evaluating quantization results per layer: This issue proposes the development of a tool to evaluate and compare the results of full-precision and quantized models by analyzing intermediate results from `ggml` exported graphs. The tool aims to identify points of significant deviation in inference results and determine which computation nodes require higher precision to minimize quantization differences.
  - Open for 345 days, 05 hours, 57 minutes
- CUDA non-determinism on identical requests: This issue describes a problem where identical requests sent to a server using CUDA for layer offloading return different responses the first time, but consistent responses thereafter until a new prompt is introduced. The expected behavior is that the output should remain the same when parameters and seed are constant, and this non-deterministic behavior is not observed with Metal offload or without CUDA offload.
  - Open for 342 days, 22 hours, 21 minutes
- Windows ROCm Build.: This issue involves a user attempting to compile the llama.cpp project for ROCm on a Windows system, encountering difficulties due to CMake's default paths for the Clang compiler not matching the Windows directory structure. The user reports that CMake is set to use Unix-style paths for Clang, while the correct paths on Windows are located in "C:\Program Files\AMD\ROCm\5.5\bin\clang.exe" and "C:\Program Files\AMD\ROCm\5.5\bin\clang++.exe", and seeks guidance on how to resolve this discrepancy.
  - Open for 342 days, 20 hours, 44 minutes
- Please support the also official Falcon-rw-1b and Falcon-rw-7b model variants: This issue requests support for the Falcon-RW-1B and Falcon-RW-7B model variants, which are official versions of the Falcon model series. The user has encountered errors when attempting to convert and quantize these models using the `convert-falcon-hf-to-gguf.py` script and is seeking assistance or confirmation on whether these models will be supported.
  - Open for 341 days, 08 hours, 09 minutes
1.4 Closed Issues
Closed Issues This Week: 38
Average Issue Close Time (This Week): 35.81 days
Summarized Issues:
- Vulkan Shader Headers Management: This issue addresses the need to generate Vulkan shader headers at build time using `make` or `CMake` targets instead of including them in source control, and proposes organizing shaders into a separate directory to improve clarity and reduce commit conflicts.
- Code Formatting in Server UI: This issue addresses the problem of code snippets in the server UI being incorrectly formatted with `<em>` tags due to underscores being replaced, and suggests either adding an option to prevent this or adjusting the regex to avoid affecting code blocks.
- Flash Attention in SYCL Backend: This issue is about implementing Flash attention, an IO-aware exact attention algorithm, in the SYCL backend to potentially benefit Intel GPUs, as it is already available in CUDA and Metal backends.
- Model Conversion Issues: This issue describes a problem where the `convert.py` script fails to convert the llama3 8B Instruct model downloaded directly from Meta, while the same script works without issues for the model downloaded from Huggingface, likely due to differences in tokenizer and configuration file formats.
- Embedding Output Discrepancies: This issue questions why the embedding outputs differ between GPU and CPU executions for the same input in the llama.cpp project, and seeks clarification on whether this discrepancy is expected and if there are specific instructions or documentation for using the underlying API functions.
- Context Size Bug in OSX: This issue describes a bug in the Meta-Llama-3-70B-Instruct model running on OSX, where the default context size (CTX) of 512 tokens causes incoherent output around the 512-token mark, which can be resolved by setting the context size to 0, thereby utilizing the maximum context for the model.
- Inference Discrepancies: This issue addresses a discrepancy in inference results between the llama.cpp implementation using a MobileVLM model in GGUF format and the official PyTorch project, questioning whether such differences are normal.
- Reintroduction of CLBlast Backend: This issue is about a user requesting the reintroduction of the CLBlast backend in the llama.cpp project to enable GPU offloading and improve performance on older NVIDIA GPUs, as the removal of this support has left them and others with similar hardware without a functional and efficient alternative.
- CUDA Out-of-Memory Error: This issue describes a CUDA out-of-memory error encountered when sending a large prompt (20k+ tokens) to the Phi-3 Mini 128k model on a laptop with an Nvidia A2000 4GB GPU, leading to a crash without returning any tokens, and includes a request for guidance on disabling GPU usage to test CPU inference.
- HTML Quoting Bug in Web UI: This issue describes a bug in the web UI chat of the "server" where the "<" and ">" characters are not properly quoted in the HTML output, leading to incorrect rendering of expected content such as C code include files.
- `llama-cli` Tool Issues: This issue describes a bug in the `llama-cli` tool where, after recent updates, the tool either engages in self-dialogue, outputs random tokens, or stops responding entirely when used in chat mode, affecting both CPU and NVIDIA GPU environments.
- Performance Degradation in New UI: This issue describes a performance degradation in the "New UI" of the examples/server project, where the chat becomes progressively slower with each new user message due to parts of the chat history being re-evaluated, potentially bypassing the KV cache, unlike the old UI mode or `llama-cli`.
- Guidance for Gemma-7b Model: This issue is about seeking guidance on how to properly run and interact with the Gemma-7b model using the "main" and "server" options, as the user experiences erratic behavior when using the chat template.
- Updating GGUF My Repo Tool Scripts: This issue involves updating the GGUF my Repo tool's scripts on Hugging Face to align with the new naming scheme introduced by LlamaCPP, as the current scripts are outdated and causing the tool to malfunction.
- Vulkan Model Instruction Following Issue: This issue describes a problem where the latest Vulkan version of a model fails to follow instructions from a prompt file, causing it to repeatedly ask the user what they want instead of responding appropriately, whereas reverting to an earlier version resolves the issue.
- Discrepancy Between `llama-cli` and `llama-server`: This issue highlights a discrepancy between the outputs of `llama-cli` and `llama-server` when using the same model and configuration, with `llama-cli` providing the desired result while `llama-server` produces confusing outputs.
- Inference Server Model Loading Failure: This issue describes a problem where the user is unable to call the llama.cpp inference server with a llama 3 model due to a failure in loading the model, resulting in a segmentation fault.
- Docker Image Build Issues: This issue describes a problem where the Docker image generated by the provided Dockerfile fails to run due to a missing version of the `libstdc++.so.6` library, which was resolved by adding the `apt install build-essential` command.
- CUDA Build Failure on ALT Linux: This issue describes a build failure for CUDA on ALT Linux due to ambiguous calls to the `std::forward` function in the `ggml-cuda.cu` file, resulting in compilation errors.
- LLava Model Server Bug: This issue describes a bug where the LLava model provides accurate image descriptions when run via the command line interface (`./llama-llava-cli`), but produces hallucinated responses when hosted on a server (`./llama-server`), due to the server not supporting multimodal inputs.
- Redundant `--config Release` Option: This issue questions the necessity of including the `--config Release` option in the project's README, suggesting it might be redundant as it appears to be the default behavior.
- Poor Inference Outputs for CodeShell Model: This issue describes a bug where the latest version of llama.cpp produces poor inference outputs for the CodeShell model, which previously worked well, after updating and converting the model using specific scripts.
- Quantization Error on SYCL Backend: This issue describes a problem encountered when running the Meta-Llama-3.1-8B-Instruct model with Q8 and Q4 quantization on the SYCL backend using an Intel ARC A770 GPU on Windows 11, resulting in a `GGML_ASSERT` error due to a null backend buffer base.
- CUDA Build Failure on NVIDIA GH200: This issue involves a compilation failure when building the llama.cpp project with GPU acceleration on the NVIDIA GH200 platform, specifically due to a 'ptxas' error caused by an invalid memory reference during the CUDA build process.
- Tensor Mismatch Error: This issue describes a problem where the Llama 3.1 model quantized to q4_0 using the "GGUF my repo" space fails to load due to an error indicating a mismatch in the expected number of tensors.
- Build Failure on Alpine Linux: This issue describes a build failure on Alpine Linux systems due to the absence of the `execinfo.h` header file, which is not supported by musl, and suggests that the CMake build scripts should detect this dependency and disable the related functionality on unsupported Linux systems.
- Documentation for `--embeddings` Flag: This issue addresses the confusion and inaccuracies in the documentation for the `--embeddings` flag in `server.cpp`, proposing updates to reflect its evolved usage and ensure it accurately describes its function with dedicated embedding models.
- Metadata Editing Issue: This issue describes a problem with the `gguf_set_metadata.py` script being unable to edit the `tokenizer.chat_template` metadata for Llama 3.1 due to its unsupported type, and suggests using the `gguf_new_metadata.py` script as a workaround.
- GGUF Metadata Value Error: This issue describes a bug where a "ValueError: Invalid GGUF metadata value type or value" occurs due to missing tags in the model card when using the `transformers` library's `model.push_to_hub` function, which is resolved by manually providing the tags.
- Inference Server Usage Confusion: This issue describes a problem where the user is unable to get the llama-server to load the model completely and produce outputs, only to find out that the server runs an HTTP service that needs to be accessed via a web browser or API requests rather than directly in the console.
- Non-Deterministic Compilation Error: This issue describes a non-deterministic compilation error that occurs when using multiple jobs (`make -j8`) to compile the `llama.cpp` source code on Ubuntu 22.04, likely due to concurrent overwriting of the `deprecation-warning.o` object file by legacy binaries compiled from the same source file.
- IndexError During Model Conversion: This issue involves an IndexError occurring when attempting to convert a fine-tuned Meta-Llama-3-8B-Instruct model from `pytorch_model.bin` to `gguf` format using the `convert_hf_to_gguf.py` script.
- Feature Request for PULP CVA6: This issue is a feature request for running the llama software on the PULP CVA6 with the ARA vector extension using Verilator, where the user encounters a core dump error during execution.
- Feature Request for XLMRobertaModel Embeddings: This issue is a feature request for adding support for embeddings based on the XLMRobertaModel architecture to the project, highlighting the potential benefits and improved performance of such models compared to proprietary alternatives.
- Compilation Failure on RISC-V: This issue describes a compilation failure when using the musl toolchain on RISC-V architecture, specifically due to the absence of the `execinfo.h` header file in versions after b3468, which previously allowed successful cross-compilation.
- CUDA Illegal Memory Access in GPT4All: This issue describes a bug in the GPT4All project where a CUDA illegal memory access occurs during token decoding with specific model configurations, likely related to the handling of F16 data types and padding in the `ggml-cuda.cu` file.
- Quantization Error in GGUF My Repo: This issue describes a bug where users encounter a "No such file or directory 'consolidated.00.pth'" error when attempting to quantize the `meta-llama/Meta-Llama-3.1-8B` model through the "GGUF my repo" space on Hugging Face, with the problem persisting intermittently and being linked to specific repository actions.
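The HTML quoting bug summarized in the list above comes down to emitting "<" and ">" without escaping them. As a minimal sketch only (the server web UI is not written in Python, and `render_chat_message` is a hypothetical helper that does not exist in llama.cpp), escaping text before placing it into HTML could look like this:

```python
import html


def render_chat_message(text: str) -> str:
    """Hypothetical helper: escape model output before embedding it in HTML."""
    # html.escape turns &, < and > (and, by default, quotes) into entities,
    # so a C include line is shown literally instead of being parsed as a tag.
    return "<p>" + html.escape(text) + "</p>"


print(render_chat_message("#include <stdio.h>"))
```

Without the escaping step, a browser would treat `<stdio.h>` as an unknown tag and drop it from the rendered chat, which matches the symptom described in the issue.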
1.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open issues within the past week to identify potentially heated exchanges and to maintain a constructive project environment.
- what about functions ?
  - Toxicity Score: 0.55 (Assertive tone, confrontational language, warning to other commenters)
  - This GitHub conversation begins with cercatore submitting a feature request, expressing a strong opinion about the necessity of functions in the project. The tone is assertive and somewhat confrontational, with cercatore emphasizing the importance of their opinion and the effort they have put into their own improvements. The message includes a mix of praise for the project maintainer and a warning to other commenters to be kind, hinting at potential tension. There are no comments yet, so the trajectory of the conversation is currently undetermined, but the initial tone sets a foundation for possible future conflicts.
II. Pull Requests
2.1 Open Pull Requests
Open Pull Requests This Week: 18
Pull Requests:
- Windows Build Improvements: This topic includes pull requests that address various issues related to Windows builds. One pull request adds conditional compilation directives to ensure specific options are applied exclusively for Windows builds. Another pull request adds a function to accurately detect physical and logical CPU cores on Windows, improving the program's ability to set the default number of threads. Additionally, there is a pull request that adds support for retrieving and displaying CPU information on Windows when running the `llama-bench` command with the `-o json` option.
- Parameter Handling and Metadata: This topic covers pull requests that improve parameter handling and metadata management. One pull request addresses an issue where the server was ignoring user-specified parameters and using default values from the OpenAI API. Another pull request aims to add more authorship metadata from the model card and provide users with the option to input such data organically. These changes ensure more accurate and user-specific configurations.
- Documentation Updates: This topic includes pull requests that update the README file with new information. One pull request adds information about "ramalama," a repository-agnostic CLI tool. Another pull request adds GPUStack to the UI list, enhancing the documentation with more tools and options for users.
- New CLI Tools: This topic includes pull requests that introduce new CLI tools to the project. One pull request adds the `dir-assistant` CLI runner, which utilizes `llama-cpp-python` and a RAG system for directory interaction. These additions expand the functionality and usability of the project.
- Model and Quantization Enhancements: This topic covers pull requests that improve model and quantization strategies. One pull request introduces epsilon as a configurable parameter for group normalization operators to support the RWKV implementation. Another pull request adds CPY functionality for Q4_0, and another improves IQ2 and IQ3 model quantization strategies. These changes enhance the performance and flexibility of the models.
- Codebase Refactoring and Simplification: This topic includes pull requests that refactor and simplify the codebase. One pull request removes a duplicate function and updates the return type of related functions. Another pull request replaces a `std::tuple` with a new `llama_init_result` struct. Additionally, there is a pull request that simplifies quantization type support in the `gguf-py` library by introducing an abstract base class.
- Build and Compatibility Fixes: This topic includes pull requests that address build and compatibility issues. One pull request resolves a build failure in the `vulkan-shaders-gen` component by linking it with the pthreads library. Another pull request addresses an issue with the detection of `storageBuffer16BitAccess` on certain Adreno drivers. These fixes ensure smoother builds and better compatibility across different systems.
- Bug Fixes: This topic includes pull requests that fix various bugs in the project. One pull request addresses a segmentation fault in the `llama-batched-bench` tool by adding a check to set the maximum sequence size to 1 when the number of parallel prompts is zero. Another pull request enhances the codebase by removing a redundant inclusion of the vector library, simplifying the project structure.
2.2 Closed Pull Requests
Closed Pull Requests This Week: 27
Summarized Pull Requests:
- bfloat16 (bf16) support for CUDA: This topic covers the introduction and improvement of bfloat16 (bf16) support in the ggml-cuda project. The initial pull request introduces preliminary bf16 support, demonstrating performance metrics on an RTX 4090 and acknowledging limitations compared to fp16. Another pull request addresses issues with GGUF weights by truncating FP32 values directly to BF16 and fixes related errors in the `__compute_fp32_to_bf16` function.
- SYCL backend support: This topic includes pull requests that enhance the SYCL backend for the stablediffusion.cpp project. One pull request adds convolution support as a temporary solution with plans for future improvements using OneDNN. Another introduces a `TIMESTEP_EMBEDDING` operator modeled after the CUDA kernel to support the project.
- Unified memory for CUDA: This pull request introduces an environment variable to enable unified memory for running llama.cpp on CUDA. It helps prevent out-of-memory errors and improves token generation speed when the model nearly exceeds VRAM capacity.
- Runtime SVE configuration: This pull request updates the code to read the runtime SVE configuration of the CPU. The changes are moved from `ggml.c` to `ggml-quants.c` and replace an older pull request that has been closed.
- Nix build configuration: This topic includes pull requests that update the Nix build configuration. One pull request updates the `flake.lock` file by changing the input for 'nixpkgs' to a newer commit. Another modifies the build configuration to rely on `propagatedBuildInputs` and `propagatedBuildOutputs` for CUDA.
- CI pipeline improvements: This pull request proposes using an additional thread for non-ggml tasks in the CI pipeline. It addresses issues observed in a previous pull request and ensures that more parallel jobs can run for all targets except during the CUDA build.
- Vendor-specific headers organization: This pull request refactors the project by creating a new `vendors` directory to organize vendor-specific headers. It confirms that the changes have been successfully tested with the `make GGML_MUSA=1` command.
- RISC-V vector operations bug fix: This pull request addresses a bug fix in the ggml module to ensure that inactive elements retain their previous values when the mask is false. This fix is specifically for RISC-V vector operations.
- Android implementation of `ggml_print_backtrace_symbols`: This pull request introduces an Android implementation of the `ggml_print_backtrace_symbols` function to the project.
- CMake configuration for CANN backend: This pull request updates the CMake configuration for the CANN backend. It incorporates the new `GGML_EXTRA_LIBDIRS` and aligns with the changes introduced in PR 8480.
- Fix for `add_array()` function: This pull request addresses an issue where the `add_array()` function was incorrectly adding empty arrays to the key-value store. This is not permitted in the kv metadata store, as discussed in issue #8769.
- GGUF file generation from Phi weights: This pull request modifies Python scripts to enable the generation of GGUF files from Phi-2, Phi-1.5, and Phi-1 original weights. It utilizes the same tokenizer ("phi-2") for all three and aims to resolve issue #7667.
- Documentation update for embedding flag: This pull request updates the documentation for the embedding flag in the llama-server. It provides context and links to a related issue for further details.
- Race condition fix in build process: This pull request addresses a potential race condition in the build process. It implements a fix suggested by @fairydreaming, as discussed in issue #8776.
- Conditional inclusion of `execinfo.h`: This pull request addresses the inclusion of the `execinfo.h` header file only on Linux systems where it is present. This fixes issue #8762.
- Q scaling and Gemma 2 model sizes update: This pull request involves updating the Q scaling and Gemma 2 model sizes. It aligns with the v2 2B model.
- CMake configuration for external ggml library: This pull request addresses the issue of correctly identifying and using the directory of an external ggml library in the CMake configuration for the llama.cpp project. It ensures that directory properties are accurately retrieved regardless of whether ggml is internal or external.
- CUDA implementation fix for dmmv columns: This pull request addresses a potential issue by fixing the requirement for dmmv columns to be twice the value of GGML_CUDA_DMMV_X in the CUDA implementation. It may resolve issue #8798.
- MPI support for distributed computing: This topic includes pull requests that introduce modifications to the mpi-cli to enable support for running tasks across multiple nodes using MPI. One pull request facilitates distributed computing, while another adds an example of using the mpi-cli to distribute computational loads across multiple nodes in a cluster.
- CANN backend support for `Q8_0 Model`: This pull request fixes the `MulMat_Q8_0` function on the `CANN` backend. It adds support for the `Q8_0 Model` in the `CANN` backend, as demonstrated by testing with the `LLama2-7b-Q8-0` model.
- SYCL device-related refactoring: This pull request refactors the development code by separating SYCL device-related classes and functions. It merges device classes, simplifies device ID retrieval, moves a helper file, and supports a single device mode for setting the main GPU.
- Fix for SYCL backend tests: This pull request addresses the issue of failing tests for the SYCL backend. It sets the IQ4_NL Vec Dot Ratio to the correct value of 2, thereby fixing the MUL_MAT and MUL_MAT_ID tests for all supported device vendors.
- Fix for `ggml_cann_im2col` operation: This pull request addresses a fix for the `ggml_cann_im2col` operation to correctly handle 1D im2col. It includes the addition of related test cases.
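The bf16 item in the list above mentions truncating FP32 values directly to BF16. The following is a minimal sketch of what truncation means at the bit level. It assumes nothing about the project's actual CUDA or C code, and the function names `fp32_to_bf16_truncate` and `bf16_to_fp32` are hypothetical. BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value, so truncation is simply dropping the low 16 bits:

```python
import struct


def fp32_to_bf16_truncate(value: float) -> int:
    """Return the 16-bit BF16 pattern for value by dropping the low 16 FP32 bits."""
    # Reinterpret the float as its 32-bit IEEE 754 pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Truncate (no rounding): keep sign, exponent, and the top 7 mantissa bits.
    return bits >> 16


def bf16_to_fp32(bits16: int) -> float:
    """Expand a BF16 pattern back to FP32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


print(hex(fp32_to_bf16_truncate(3.14159)))   # 0x4049
print(bf16_to_fp32(0x4049))                  # 3.140625
```

Truncation is cheaper than round-to-nearest but always moves the value toward zero, which is why 3.14159 comes back as 3.140625 in this sketch.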
2.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open pull requests within the past week to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.
III. Commits
3.1 Commits
Commits This Week: 22
Summarized Commits:
- CPU Runtime Configuration and Initialization: The ggml library has been updated to read the runtime SVE configuration of the CPU, implementing a one-time initialization to prevent performance drops. Variables have been prefixed to avoid potential conflicts, and a previous xxhash fix was reverted with added brackets.
- BF16 Weight Conversion Fixes: Conversion issues of unnormalized BF16 to BF16 weights have been addressed by adding truncation, fixing masking, removing unnecessary casts, and ensuring consistent handling of subnormal values across different platforms.
- 1D im2col Function and Build Warning Fixes: The `ggml_cann_im2col` function has been fixed for 1D im2col operations, and a build warning in the project has been resolved.
- SYCL VDR iq4nl Value Correction: An issue with the VDR iq4nl value in the SYCL project has been corrected, ensuring accurate value handling.
- Unified Memory and CUDA Refactoring: Support for unified memory in the ggml-cuda project has been introduced, along with code refactoring for better organization, fixing a compilation error with hipblas, and updating the documentation.
- Linux-Specific Backtrace Functionality: The execinfo.h header is now included only on Linux systems that support it, enabling backtrace functionality specifically for GLIBC Linux systems and fixing a missing file from a previous copy.
- CUDA dmmv Column Requirement Fixes: The CUDA implementation has been updated to fix the requirement for dmmv columns to be twice the value of GGML_CUDA_DMMV_X, updating assertions, ensuring dmmv is used only for supported types, and adding a corresponding test.
- Ascend Backend q8_0 Support: Support for the q8_0 data type in the Ascend backend has been introduced.
- Llama-server Embedding Flag Documentation: The documentation for the embedding flag in llama-server has been updated, resolving issue #8763.
- Race Condition and Build Optimization: A potential race condition identified in issue #8776 has been addressed by referencing the .o files to avoid unnecessary rebuilding, adding CXXFLAGS and LDFLAGS, and removing redundant linker flags.
- Gemma 2 2B Model Configurations: Configurations for the Gemma 2 2B model have been introduced, including updates to Q scaling and model sizes to align with the v2 2B model.
- CMake External ggml Library Fixes: Issues related to the use of an external ggml library within the CMake build configuration have been resolved.
- Nix Build Configuration for CUDA: The Nix build configuration for CUDA has been updated to rely on propagatedBuildInputs, eliminating the need to list individual outputs to reduce the runtime closure size.
- Python add_array() Function Update: The add_array() function in gguf_writer.py has been updated to ensure it does not add an empty array to the key-value store, incorporating feedback from a code review.
- Android Backtrace Symbols Implementation: The Android implementation of the ggml_print_backtrace_symbols function in the ggml.c file has been introduced.
- Flake.lock File Update: The flake.lock file has been updated as part of pull request #8729.
- CMake Configuration Update: The CMake configuration in the project has been updated.
- SYCL TIMESTEP_EMBEDDING Operation: The TIMESTEP_EMBEDDING operation has been introduced to the SYCL project.
- RISC-V Vector Operations Bug Fix: A bug in the GGML project has been addressed by ensuring that inactive elements retain their previous values when the mask is false, using the undisturbed policy.
- CUDA Vendor-Specific Headers Reorganization: CUDA-related vendor-specific headers have been moved into a dedicated 'vendors' directory.
- SYCL Convolution Support: Support for convolution operations in the SYCL project has been introduced.
- CI Configuration Update: The continuous integration (CI) configuration has been updated to utilize one additional thread for non-ggml tasks in the CMake build process.
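For readers unfamiliar with the mask policy mentioned in the RISC-V item above, the following is a minimal sketch of the difference between the undisturbed and agnostic policies for inactive elements. It is a plain-Python model, not ggml code: the function name masked_add, the 8-bit lane width, and the sample values are all assumptions made for illustration. The all-ones fill models only one outcome that the agnostic policy permits; an implementation may also leave the old value in place.

```python
from typing import List

ALL_ONES_8BIT = 0xFF  # value a mask-agnostic implementation is allowed to write


def masked_add(dest: List[int], a: List[int], b: List[int],
               mask: List[bool], undisturbed: bool) -> List[int]:
    """Model an 8-bit masked vector add and return the new destination lanes.

    Hypothetical helper for illustration only; it is not part of ggml.
    """
    result = []
    for old, x, y, active in zip(dest, a, b, mask):
        if active:
            result.append((x + y) & 0xFF)  # active lane: wrap-around 8-bit add
        elif undisturbed:
            result.append(old)             # inactive lane keeps its previous value
        else:
            result.append(ALL_ONES_8BIT)   # agnostic: lane may be overwritten with 1s
    return result


dest = [10, 20, 30, 40]
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
mask = [True, False, True, False]

print(masked_add(dest, a, b, mask, undisturbed=True))   # [6, 20, 10, 40]
print(masked_add(dest, a, b, mask, undisturbed=False))  # [6, 255, 10, 255]
```

Code that later reads the inactive lanes, for example to accumulate partial results, only behaves predictably under the undisturbed policy, which is presumably why the fix selects it.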
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, or created at least 1 pull request in the past month.
Contributor | Commits | Pull Requests | Issues |
---|---|---|---|
GitHub | 166 | 0 | 0 |
ggerganov | 0 | 32 | 2 |
Georgi Gerganov | 17 | 0 | 0 |
0wwafa | 0 | 1 | 10 |
ngxson | 0 | 8 | 2 |
compilade | 0 | 10 | 0 |
mofosyne | 0 | 10 | 0 |
JohannesGaessler | 0 | 9 | 0 |
HanClinto | 0 | 7 | 1 |
danbev | 0 | 6 | 0 |
iboB | 0 | 5 | 0 |
maruel | 0 | 2 | 3 |
kaetemi | 0 | 1 | 3 |
wangshuai09 | 0 | 4 | 0 |
oldgithubman | 0 | 0 | 4 |
perpendicularai | 0 | 1 | 2 |
Alcpz | 0 | 3 | 0 |
iamlemec | 0 | 3 | 0 |
ClarkChin08 | 0 | 2 | 1 |
RunningLeon | 0 | 2 | 1 |
AndreasKunar | 0 | 1 | 2 |
slaren | 0 | 3 | 0 |
joeatodd | 0 | 3 | 0 |
ericcurtin | 0 | 1 | 2 |
RakshitAralimatti | 0 | 0 | 3 |
yli147 | 0 | 0 | 3 |
sorasoras | 0 | 0 | 3 |
mtasic85 | 0 | 2 | 0 |
standby24x7 | 0 | 2 | 0 |
b4b4o | 0 | 1 | 1 |
kevmo314 | 0 | 2 | 0 |
jaime-m-p | 0 | 2 | 0 |
jdomke | 0 | 2 | 0 |
yeahdongcn | 0 | 2 | 0 |
fairydreaming | 0 | 1 | 1 |
zhipenghan | 0 | 2 | 0 |
nicholaiTukanov | 0 | 1 | 1 |
msy-kato | 0 | 2 | 0 |
0cc4m | 0 | 2 | 0 |
airMeng | 0 | 2 | 0 |
AmgadHasan | 0 | 1 | 1 |
amochkin | 0 | 1 | 1 |
luoyu-intel | 0 | 2 | 0 |
stduhpf | 0 | 1 | 1 |
Stillerman | 0 | 1 | 1 |
jeroen-mostert | 0 | 1 | 1 |
bdashore3 | 0 | 1 | 1 |
Septa2112 | 0 | 2 | 0 |
kylo5aby | 0 | 2 | 0 |
okigan | 0 | 1 | 1 |
acon96 | 0 | 1 | 1 |
hatgrey2 | 0 | 2 | 0 |
curvedinf | 0 | 1 | 1 |
Patater | 0 | 1 | 1 |
SimplyCorbett | 0 | 0 | 2 |
yancaoweidaode | 0 | 0 | 2 |
Battlehub0x | 0 | 0 | 2 |
Arashimu | 0 | 0 | 2 |
MathiasSchindler | 0 | 0 | 2 |
Sokartecnologi | 0 | 0 | 2 |
bartowski1182 | 0 | 0 | 2 |
mirek190 | 0 | 0 | 2 |
vt-alt | 0 | 0 | 2 |
Azirine | 0 | 0 | 2 |
ElaineWu66 | 0 | 0 | 2 |
ExtReMLapin | 0 | 0 | 2 |
ThomasBaruzier | 0 | 0 | 2 |
grigohas | 0 | 0 | 2 |
hafezmg48 | 0 | 0 | 2 |
bviksoe | 0 | 1 | 0 |
diimdeep | 0 | 1 | 0 |
prfd | 0 | 1 | 0 |
youth123 | 0 | 1 | 0 |
brochure | 0 | 1 | 0 |
agray3 | 0 | 1 | 0 |
daghanerdonmez | 0 | 1 | 0 |
andysalerno | 0 | 1 | 0 |
laik | 0 | 1 | 0 |
monatis | 0 | 1 | 0 |
AragonerUA | 0 | 1 | 0 |
kriation | 0 | 1 | 0 |
danielhanchen | 0 | 1 | 0 |
teleprint-me | 0 | 1 | 0 |
65a | 0 | 1 | 0 |
NikolaiLyssogor | 0 | 1 | 0 |
sbonds | 0 | 1 | 0 |
SommerEngineering | 0 | 1 | 0 |
amitj1jan | 0 | 1 | 0 |
nopperl | 0 | 1 | 0 |
EZForever | 0 | 1 | 0 |
m18coppola | 0 | 1 | 0 |
thxCode | 0 | 1 | 0 |
hankeke303 | 0 | 1 | 0 |
devojony | 0 | 1 | 0 |
zqb-all | 0 | 1 | 0 |
Xarbirus | 0 | 1 | 0 |
FanShupei | 0 | 1 | 0 |
themanyone | 0 | 1 | 0 |
Oliver-Y | 0 | 1 | 0 |
0x4139 | 0 | 1 | 0 |
Ujjawal-K-Panchal | 0 | 1 | 0 |
fmz | 0 | 1 | 0 |
MorganRO8 | 0 | 1 | 0 |
jmorganca | 0 | 1 | 0 |
ElYaiko | 0 | 1 | 0 |
sasha0552 | 0 | 1 | 0 |
DavidKorczynski | 0 | 1 | 0 |
bsquizz | 0 | 1 | 0 |
foldl | 0 | 1 | 0 |
zhentaoyu | 0 | 1 | 0 |
Srihari-mcw | 0 | 1 | 0 |
zihaoccc | 0 | 1 | 0 |
Tianzhengshuyuan | 0 | 1 | 0 |
norgera | 0 | 1 | 0 |
CarterLi999 | 0 | 1 | 0 |
l3utterfly | 0 | 1 | 0 |
ardfork | 0 | 1 | 0 |
SomeoneSerge | 0 | 1 | 0 |
RhinoDevel | 0 | 1 | 0 |
pculliton | 0 | 1 | 0 |
arthw | 0 | 1 | 0 |
OuadiElfarouki | 0 | 1 | 0 |
linyinli | 0 | 1 | 0 |
MollySophia | 0 | 1 | 0 |
MengqingCao | 0 | 1 | 0 |
Nexesenex | 0 | 1 | 0 |
rhjdvsgsgks | 0 | 1 | 0 |
cunnie | 0 | 1 | 0 |
jim-plus | 0 | 0 | 1 |
Yan-Xiangjun | 0 | 0 | 1 |
josharian | 0 | 0 | 1 |
Aridbhdkkj | 0 | 0 | 1 |
AUTOMATIC1111 | 0 | 0 | 1 |
isaac-mcfadyen | 0 | 0 | 1 |
d-kleine | 0 | 0 | 1 |
warren-lei | 0 | 0 | 1 |
dspasyuk | 0 | 0 | 1 |
ch1y0q | 0 | 0 | 1 |
andreys42 | 0 | 0 | 1 |
gpacix | 0 | 0 | 1 |
guinmoon | 0 | 0 | 1 |
bandoti | 0 | 0 | 1 |
apresence | 0 | 0 | 1 |
kasrahabib | 0 | 0 | 1 |
LDLINGLINGLING | 0 | 0 | 1 |
Hardik-Choraria | 0 | 0 | 1 |
99991 | 0 | 0 | 1 |
Sakura4036 | 0 | 0 | 1 |
markat1 | 0 | 0 | 1 |
amakropoulos | 0 | 0 | 1 |
MeemeeLab | 0 | 0 | 1 |
joshknnd1982 | 0 | 0 | 1 |
sealad886 | 0 | 0 | 1 |
lin72h | 0 | 0 | 1 |
jie80219 | 0 | 0 | 1 |
nne998 | 0 | 0 | 1 |
StatPan | 0 | 0 | 1 |
1cekrim | 0 | 0 | 1 |
bong-furiosa | 0 | 0 | 1 |
djain-fujitsu | 0 | 0 | 1 |
m828 | 0 | 0 | 1 |
Fulgurance | 0 | 0 | 1 |
criminact | 0 | 0 | 1 |
VelocityRa | 0 | 0 | 1 |
dafei2017 | 0 | 0 | 1 |
metal3d | 0 | 0 | 1 |
Emmanuel97460 | 0 | 0 | 1 |
vmarchenkoff | 0 | 0 | 1 |
jpoly1219 | 0 | 0 | 1 |
ciekawy | 0 | 0 | 1 |
DanielusG | 0 | 0 | 1 |
hgftrdw45ud67is8o89 | 0 | 0 | 1 |
qnixsynapse | 0 | 0 | 1 |
rhvall | 0 | 0 | 1 |
zucchini-nlp | 0 | 0 | 1 |
hipudding | 0 | 0 | 1 |
suncloudsmoon | 0 | 0 | 1 |
newsletternewsletter | 0 | 0 | 1 |
simon-krannig | 0 | 0 | 1 |
RonanKMcGovern | 0 | 0 | 1 |
nicoboss | 0 | 0 | 1 |
MangoTCF | 0 | 0 | 1 |
TanLam01 | 0 | 0 | 1 |
peter-ch | 0 | 0 | 1 |
auriocus | 0 | 0 | 1 |
cloud11665 | 0 | 0 | 1 |
wencan | 0 | 0 | 1 |
Vaibhavs10 | 0 | 0 | 1 |
Tureti | 0 | 0 | 1 |
tc-wolf | 0 | 0 | 1 |
akashaero | 0 | 0 | 1 |
artiomborovinskii | 0 | 0 | 1 |
mudler | 0 | 0 | 1 |
creeves-anaconda | 0 | 0 | 1 |
hackey | 0 | 0 | 1 |
chigkim | 0 | 0 | 1 |
IcyXi | 0 | 0 | 1 |
8XXD8 | 0 | 0 | 1 |
matteoserva | 0 | 0 | 1 |
Volko61 | 0 | 0 | 1 |
riedgar-ms | 0 | 0 | 1 |
mgroeber9110 | 0 | 0 | 1 |
mgonzs13 | 0 | 0 | 1 |
yuanzhiyong1999 | 0 | 0 | 1 |
windowsagent | 0 | 0 | 1 |
rajesh-s | 0 | 0 | 1 |
RedHeartSecretMan | 0 | 0 | 1 |
AngelaZhang3913 | 0 | 0 | 1 |
Exploder98 | 0 | 0 | 1 |
MeeCreeps | 0 | 0 | 1 |
jarroddavis68 | 0 | 0 | 1 |
cjsdurj | 0 | 0 | 1 |
renbuarl | 0 | 0 | 1 |
vTuanpham | 0 | 0 | 1 |
a3rnj | 0 | 0 | 1 |
yanwun | 0 | 0 | 1 |
IzzyHibbert | 0 | 0 | 1 |
Zant12 | 0 | 0 | 1 |
3058132083 | 0 | 0 | 1 |
tomasmcm | 0 | 0 | 1 |
rankaiyx | 0 | 0 | 1 |
cebtenzzre | 0 | 0 | 1 |
MarioSimou | 0 | 0 | 1 |
m-arbaro | 0 | 0 | 1 |
17Reset | 0 | 0 | 1 |
scalvin1 | 0 | 0 | 1 |
yan-zh | 0 | 0 | 1 |
shuangxiangkan | 0 | 0 | 1 |
LaurentBonnaud | 0 | 0 | 1 |
LSXAxeller | 0 | 0 | 1 |
Yuriy-Paramonov | 0 | 0 | 1 |
thonore75 | 0 | 0 | 1 |
cercatore | 0 | 0 | 1 |
RandUser123sa | 0 | 0 | 1 |
hexbinoct | 0 | 0 | 1 |