Weekly Project News


Weekly GitHub Report for Llama.cpp - 2024-07-15 12:00:12

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


I. Issues

1.1 Open Issues

Open Issues This Week: 21

Summarized Issues:

  • Dynamic NTK Rope Scaling: This feature request asks for support for dynamic NTK rope scaling, as provided by the LlamaDynamicNTKScalingRotaryEmbedding component and commonly used in long-context language models to improve performance. Implementing this feature would enhance the flexibility and efficiency of the model (a brief sketch of the scaling formula follows this list).
    • github.com/...
  • Compilation and Build Issues: Several issues relate to problems encountered while compiling and building the project. These include incorrect program output when building with the -DGGML_VULKAN=ON flag on Ubuntu 20.04 for aarch64, a warning in the ggml.c file, and misplaced documentation files. These issues can lead to build failures or unexpected behavior in the resulting binaries.
    • github.com/...
    • github.com/...
    • github.com/...
  • Segmentation Faults: Multiple issues report segmentation faults under different conditions. These include running specific commands on a Mac with Yi 1.5, using the --mem parameter in the rpc-server command, and using the mlock command with large models on a 32GB Mac. These faults can cause crashes and disrupt the normal operation of the software.
    • github.com/...
    • github.com/...
    • github.com/...
  • Model and API Enhancements: There are requests for enhancements to support new features and improve performance. These include adding support for the "content" field in user messages to accept both string and array formats, and improving prompt evaluation and generation times by inverting quantization to fp16. These enhancements aim to make the models more versatile and efficient.
    • github.com/...
    • github.com/...
  • Inference and Offloading Issues: Issues have been reported with inference and GPU offloading. These include no layers being offloaded to the GPU when using SYCL for the Qwen2 MoE model, and the Meta-Llama-3-8B-Instruct-Q8_0 model not following system prompts properly. These issues affect the performance and accuracy of the models during inference.
    • github.com/...
    • github.com/...
  • Documentation and Usability: Problems with documentation and usability have been highlighted. These include broken tutorials due to renaming of executables, missing port bindings in Docker run commands, and a "No such file or directory" error due to renamed tools. Addressing these issues would improve the user experience and accessibility of the project.
    • github.com/...
    • github.com/...
    • github.com/...
  • Tool Call Formatting: There are issues with the formatting of tool call outputs in the InternLM 2.5 Chat Tool Calls. The outputs are inconsistent and do not adhere to the strict format outlined in the documentation, making it difficult to parse and utilize the tool calls effectively. This affects the usability of the tool calls in practical applications.
    • github.com/...
  • Model Evaluation Guidance: Users are seeking guidance on how to evaluate converted gguf models using various benchmarks such as MMLU, ARC, and Perplexity. Providing clear instructions and support for these benchmarks would help users assess the performance of their models more effectively.
    • github.com/...
  • Android Compatibility Issues: Several issues are related to running the project on Android. These include missing library dependencies like "libllama.so" and execution failures on Android Termux. These compatibility issues hinder the use of the project on Android devices.
    • github.com/...
    • github.com/...
  • Vulkan and GPU Issues: Issues have been reported with Vulkan and GPU usage. These include no output generated on a RISC-V board with an Imagination iGPU and crashes with an std::out_of_range error when loading specific models. These issues affect the performance and stability of the project when using GPU acceleration.
    • github.com/...
    • github.com/...
  • Model Training Precision: A user inquired about training a model from scratch using f16 or q8 precision instead of f32. The response indicated that this feature is not currently available but is a goal for future implementation. This feature would allow for more efficient training of models.
    • github.com/...
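To make the first item above (dynamic NTK rope scaling) more concrete, here is a small self-contained C++ sketch of the base-frequency adjustment used by the transformers-style dynamic NTK implementation; the function and variable names are illustrative and the code is not taken from llama.cpp or the issue:

```cpp
#include <cmath>
#include <cstdio>

// Sketch of the dynamic NTK base adjustment (after the formula used by
// transformers' LlamaDynamicNTKScalingRotaryEmbedding); illustrative only.
static double dynamic_ntk_rope_base(double base, double scaling_factor,
                                    int seq_len, int max_pos, int head_dim) {
    if (seq_len <= max_pos) {
        return base; // within the trained context window: no adjustment
    }
    const double t = scaling_factor * (double) seq_len / max_pos - (scaling_factor - 1.0);
    return base * std::pow(t, (double) head_dim / (head_dim - 2));
}

int main() {
    // Example: RoPE base 10000, head dimension 128, trained context 4096,
    // evaluated at 16384 tokens with scaling_factor = 1.0.
    const double b = dynamic_ntk_rope_base(10000.0, 1.0, 16384, 4096, 128);
    printf("adjusted RoPE base: %.1f\n", b);
    // The per-dimension inverse frequencies would then be 1 / b^(2i/head_dim).
    return 0;
}
```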

1.2 Top 5 Active Issues:

We consider active issues to be those that have generated substantial discussion in their comments.

  1. server : improvements and maintenance: This issue is about improving and maintaining the server example in the GitHub project, which has grown in functionality but is currently unstable and missing important features. The issue aims to track these points and draw community attention to tasks that require significant effort to complete.

    • The comments discuss various improvements and suggestions, including adding new features like look-ahead decoding, contrastive search, speculative sampling, and function calling. There are also discussions about refactoring the code, improving stability, and making the server production-ready. Some comments suggest focusing on specific use cases, like large-scale deployments and hobbyist workflows, while others debate the implementation of chat templates and the use of Jinja2 for templating. The conversation also touches on the need for better error handling, caching, and support for multiple LoRAs with the same base model.
    • Number of comments: 108
  2. Support BitNet b1.58 ternary models: This issue is about implementing support for BitNet b1.58 ternary models in the llama.cpp project. BitNet b1.58 models use ternary weight values (1, 0, -1) and are claimed to offer performance improvements over fp16 models, but they must be trained in this ternary mode from the start (a brief ternary-quantization sketch follows this list).

    • The comments discuss the potential benefits and challenges of implementing BitNet, including the need for new quantization methods, the feasibility of training ternary models directly, and the potential for hardware optimizations. There are also mentions of various implementations and experiments with ternary models, as well as the release of related tools and models by Microsoft and other contributors.
    • Number of comments: 88
  3. Investigate gemma 2 generation quality: This issue is about investigating the quality of the Gemma 2 model generation in the llama.cpp project, with initial reports suggesting potential problems with the tokenizer and quantization. The discussion includes various tests, comparisons with other implementations, and suggestions for potential fixes and improvements.

    • The comments section includes detailed discussions on the hard-coded window size, issues with math questions indicating tokenizer problems, differences in quantization quality, and various tests and benchmarks comparing different implementations and configurations. There are also suggestions for code changes, observations on tokenizer behavior, and discussions on potential fixes and improvements.
    • Number of comments: 88
  4. Support for Phi-3 models: This issue is about adding support for Microsoft's newly released Phi-3 models, which come in three variants: mini, small, and medium. The request is to integrate these models into the project, with a particular focus on addressing compatibility issues and implementing necessary features like long context support.

    • The comments discuss various aspects of integrating Phi-3 models, including initial success with partial functionality, issues with long context support, and specific errors encountered during conversion. There are also references to external resources, ongoing efforts to implement necessary features, and community contributions to resolve these issues. The discussion highlights the complexity of supporting new model architectures and the collaborative effort required to achieve full compatibility.
    • Number of comments: 83
  5. Bug: QWEN2 quantization GGML_ASSERT: This issue is about a bug encountered when attempting to quantize the Qwen2 7B Instruct model to IQ2_XS, resulting in a GGML_ASSERT error. The user also reports that the same error occurs when trying IQ2_S and provides relevant logs and system details for debugging.

    • The comments discuss various errors encountered with different quantization methods, potential causes such as nan values in the imatrix, and suggestions for debugging and fixes. Users share their experiences, including successful and unsuccessful attempts, and discuss potential patches and workarounds, such as using flash attention or modifying precision settings.
    • Number of comments: 71
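For readers unfamiliar with the ternary scheme discussed in the BitNet b1.58 item above, here is a rough C++ sketch of absmean ternary quantization as described in the BitNet b1.58 paper; this is a simplified illustration, not code from llama.cpp or the issue:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize weights to {-1, 0, +1} with a single per-tensor scale, following
// the absmean scheme described for BitNet b1.58 (simplified illustration).
static std::vector<int8_t> ternary_quantize(const std::vector<float> & w, float & scale) {
    double sum_abs = 0.0;
    for (float x : w) sum_abs += std::fabs(x);
    scale = (float) (sum_abs / w.size() + 1e-8);   // gamma = mean(|W|)

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        const float r = std::round(w[i] / scale);
        q[i] = (int8_t) std::max(-1.0f, std::min(1.0f, r));  // clip to {-1, 0, 1}
    }
    return q;
}

int main() {
    std::vector<float> w = {0.9f, -0.1f, 0.4f, -1.2f};
    float scale = 0.0f;
    std::vector<int8_t> q = ternary_quantize(w, scale);
    printf("scale = %.3f, q = [%d, %d, %d, %d]\n", scale, q[0], q[1], q[2], q[3]);
    // Dequantized values are q[i] * scale; matrix multiplies then need only adds and subtracts.
    return 0;
}
```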

1.3 Top 5 Quiet Issues:

We consider quiet issues to be those that have remained open in this project for the longest time. The team should work together to resolve and close these issues as soon as possible.

  1. Study how LM Evaluation Harness works and try to implement it: This issue involves studying and implementing the LM Evaluation Harness to perform quantitative analysis of ggml-based inference models. The goal is to integrate this evaluation tool into the project to estimate the quality of the generated output and ensure the project is progressing correctly.

    • Open for 485 days, 17 hours, 18 minutes
  2. llama : add RWKV models support: This issue is about adding support for RWKV models, which are 100% RNN language models that can match transformers in quality and scaling while being faster and more memory-efficient. The issue highlights the advantages of RWKV models, such as their CPU-friendliness on large context lengths and provides various resources and links for further information and experimental implementations.

    • Open for 463 days, 19 hours, 18 minutes
  3. The procedure entry point PrefetchVirtualMemory could not be located in the dynamic link library KERNEL32.dll: This issue describes a problem where running a specific command with certain versions of the software results in an error message stating that the procedure entry point PrefetchVirtualMemory could not be located in the dynamic link library KERNEL32.dll. The user reports that this error occurs with versions after master-180b693, while the last working version for them is master-f2d1c47.

    • Open for 460 days, 12 hours, 07 minutes
  4. [Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors?: This issue is a feature request inquiring about the potential support for the AMD XDNA AI Engine on AMD Ryzen 7x40 series processors. The user has followed the necessary prerequisites and is seeking information on whether there are any plans to integrate this support into the project.

    • Open for 424 days, 15 hours, 52 minutes
  5. Support CoreML like whisper.cpp?: This issue is about a user requesting support for CoreML in the llama.cpp project, similar to the existing support in whisper.cpp, which they have successfully used on their iPhone with impressive performance. The user is inquiring whether it is feasible to implement CoreML support in llama.cpp to achieve similar results.

    • Open for 404 days, 16 hours, 27 minutes

1.4 Closed Issues

Closed Issues This Week: 24

Average Issue Close Time (This Week): 41.66 days

Summarized Issues:

  • Conversion and Quantization Issues: Several issues arise when converting and quantizing models using the convert-hf-to-gguf.py script. These include assertion failures, unrecognized pre-tokenizers, and errors during inference due to improper token handling. Workarounds are often required, but they are not sustainable solutions.
    • github.com/...
    • github.com/...
    • github.com/...
    • github.com/...
  • Platform-Specific Issues: Users report problems specific to certain platforms, such as Intel MacBook Pro and Windows 11. These issues include executable errors and compilation failures due to incompatible argument types for AVX-512 intrinsic functions.
    • github.com/...
    • github.com/...
  • Performance Optimization Requests: There are multiple requests for performance optimizations, including native support for Intel IPEX-LLM, memory allocation improvements for AMD/HIP GPUs, and support for AVX2 SIMD instructions to enhance inference efficiency.
    • github.com/...
    • github.com/...
    • github.com/...
  • Inference and Sampling Issues: Users encounter issues with model inference, such as unexpected token sampling probabilities and response length limitations. These problems affect the usability and accuracy of the language models.
    • github.com/...
    • github.com/...
    • github.com/...
  • Feature Requests: There are several feature requests, including exporting conversations to text files, implementing multi-task parallel processing, and adding support for NVPL BLAS for NVIDIA Grace CPU.
    • github.com/...
    • github.com/...
    • github.com/...
  • Bug Reports: Various bugs are reported, such as segmentation faults, long model loading times, inconsistent tokenization, and issues with specific model metadata keys. These bugs hinder the functionality and reliability of the llama.cpp project.
    • github.com/...
    • github.com/...
    • github.com/...
    • github.com/...
    • github.com/...
    • github.com/...
    • github.com/...

1.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions in this project's open issues from the past week to identify potentially heated exchanges and to help maintain a constructive project environment.

  1. Can't run the program
    • Toxicity Score: 0.55 (Frustration, condescending response, issue template criticism, suggestion to close issue)
    • This GitHub conversation begins with mike2003 expressing frustration over a technical issue, which is met with a helpful suggestion from another user. However, mike2003's follow-up indicates that the suggestion did not resolve the problem, leading to a slightly condescending response from another user, suggesting that the issue is likely a path problem. The conversation continues with another user sharing a similar experience and offering a potential solution, which mike2003 responds to with a detailed example of the problem. The tone shifts as another user questions mike2003's hardware capabilities, leading to a back-and-forth about technical specifications. The conversation culminates with a user pointing out that mike2003 did not follow the issue template and suggesting that the issue be closed, emphasizing the need for respect and proper issue reporting.

II. Pull Requests

2.1 Open Pull Requests

Open Pull Requests This Week: 13

Pull Requests:

  • KV Cache Management: This topic covers improvements to the KV cache mechanism. One pull request introduces a tweakable offset named n_truncate to handle contexts exceeding n_ctx. Another pull request introduces caching of the GGML graph to avoid unnecessary rebuilds, with an option to disable this feature.
    • github.com/ggerganov/llama.cpp/pull/8359
    • github.com/ggerganov/llama.cpp/pull/8366
  • Tokenizer Issues: This topic addresses various issues with the tokenizer. The pull request focuses on fixing discrepancies in expected and actual token outputs across different vocabulary files.
    • github.com/ggerganov/llama.cpp/pull/8379
  • CPU Configuration: This topic involves correctly reading the runtime Scalable Vector Extension (SVE) configuration of the CPU. The pull request changes the method from using svcntb() to prctl(PR_SVE_GET_VL); a standalone sketch of this query appears after this list.
    • github.com/ggerganov/llama.cpp/pull/8382
  • Moore Threads GPU Support: This topic introduces initial support for Moore Threads GPU (MTGPU). The pull request integrates MUSA to replace CUDA APIs, adds new build options, and enhances LLM inference performance.
    • github.com/ggerganov/llama.cpp/pull/8383
  • Memory-Constrained Device Support: This topic introduces a method to host multiple fine-tuned models on memory-constrained devices. The pull request splits GGUF files into shared and task-specific tensors, allowing dynamic loading and swapping of task-specific tensors.
    • github.com/ggerganov/llama.cpp/pull/8415
  • Batch Management: This topic ensures proper batch management between embedding and completion tasks. The pull request adds checks to prevent crashes in embedding mode and highlights limitations with n_parallel > 1.
    • github.com/ggerganov/llama.cpp/pull/8420
  • Minor Naming Changes: This topic involves minor naming changes in the ggml project. The pull request conveys its scope only through its title and the self-reported review complexity options.
    • github.com/ggerganov/llama.cpp/pull/8433
  • Compile Warnings: This topic addresses compile warnings by replacing sprintf with snprintf in the examples. The pull request adheres to the project's contributing guidelines (a small snprintf example appears after this list).
    • github.com/ggerganov/llama.cpp/pull/8434
  • Ukrainian Tokens: This topic adds Ukrainian tokens into specific files. The pull request updates the strings in convert_hf_to_gguf.py and convert_hf_to_gguf_update.py.
    • github.com/ggerganov/llama.cpp/pull/8435
  • BF16 Support: This topic aims to add BF16 support to the metal component of the project. The pull request includes pending tasks for MoE and Flash Attention.
    • github.com/ggerganov/llama.cpp/pull/8439
  • Script Filename Change: This topic addresses an issue caused by a filename change in a script. The pull request fixes the script's execution within a Docker container by updating the filename.
    • github.com/ggerganov/llama.cpp/pull/8441
  • Loop Counter Assignment: This topic addresses a specific issue reported by SonarQube regarding improper assignment of loop counters. The pull request fixes the issue in the public_simplechat/datautils.mjs file.
    • github.com/ggerganov/llama.cpp/pull/8362
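To illustrate the CPU Configuration item above, here is a minimal standalone C++ sketch of querying the runtime SVE vector length on Linux via prctl(PR_SVE_GET_VL); it is not code from the pull request, and the fallback #defines assume a Linux kernel that supports the SVE prctl calls:

```cpp
#include <cstdio>
#include <sys/prctl.h>

#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL      51       // provided by newer kernel headers
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff   // low bits of the return value hold the length
#endif

int main() {
    // prctl(PR_SVE_GET_VL) returns the current SVE vector length in bytes
    // (plus flag bits), or a negative value if SVE is not supported.
    const int ret = prctl(PR_SVE_GET_VL);
    if (ret < 0) {
        printf("SVE not available on this system\n");
        return 0;
    }
    printf("runtime SVE vector length: %d bytes\n", ret & PR_SVE_VL_LEN_MASK);
    return 0;
}
```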
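And for the Compile Warnings item, a tiny generic example of the sprintf-to-snprintf substitution (not taken from the repository's examples):

```cpp
#include <cstdio>

int main() {
    char buf[32];
    const int n_tokens = 42;

    // sprintf(buf, "tokens: %d", n_tokens);            // no bounds check; warns on some toolchains
    snprintf(buf, sizeof(buf), "tokens: %d", n_tokens);  // bounded write, always NUL-terminated
    printf("%s\n", buf);
    return 0;
}
```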

2.2 Closed Pull Requests

Closed Pull Requests This Week: 41

Summarized Pull Requests:

  • Optimized GEMV and GEMM Kernels: This topic covers the introduction of optimized GEMV and GEMM kernels for the Arm AArch64 architecture, specifically for the q4_0_q8_0 and q8_0_q8_0 quantization methods. These optimizations result in significant performance improvements on AWS Graviton3 processors. The changes are aimed at enhancing computational efficiency and speed.
    • github.com/...
  • Token Management and Truncation: This topic addresses the introduction of a n_truncate flag to manage the shifting of both prompt and cached tokens when the prompt exceeds n_ctx. It also splits the truncate flag into shifted and truncated to differentiate between shifts during and before inference. Further testing is required to ensure the changes work as intended.
    • github.com/...
  • Numpy Types in gguf-py Module: This topic resolves an issue related to the use of internal numpy types in the gguf-py module. The pull request aims to fix issue #7380 and includes pending tasks to test conversions with different versions of Numpy. The changes are intended to improve compatibility and functionality.
    • github.com/...
  • Ukrainian Tokens in Script: This topic involves updating the convert-hf-to-gguf-update.py script to include Ukrainian tokens in its string processing. The update ensures that the script can handle Ukrainian language tokens correctly. This change enhances the script's multilingual capabilities.
    • github.com/...
  • SYCL Backend Fixes: This topic covers multiple fixes and updates for the SYCL backend. It includes a fix for the mul_mat_id function for MOE, re-enabling the mmvq path for the Nvidia backend, and correcting a powf function call. These changes aim to improve performance and ensure consistent behavior across different backends.
    • github.com/...
    • github.com/...
    • github.com/...
  • Deprecated Code Replacement: This topic introduces a get_pointer() helper function to replace deprecated code with the more current get_multi_ptr. This change significantly reduces build warnings and modernizes the codebase. The update ensures better maintainability and compatibility.
    • github.com/...
  • Multilingual Language Models: This topic introduces an option named ignore_english_tokens to prevent multilingual language models from generating text that mixes English with other languages. The option avoids tokens with two or more English characters unless they include angle brackets, improving the quality of multilingual text generation (a rough sketch of this heuristic appears after this list).
    • github.com/...
  • Deprecation Warning System: This topic introduces a temporary deprecation warning system to help users transition to new binary names. It addresses confusion from a previous name change by providing clear messages and replacement binaries for commonly used files. The system aims to improve user experience during the transition period.
    • github.com/...
    • github.com/...
  • Code Simplification: This topic covers the removal of the K_QUANTS_PER_ITERATION variable and always using a value of 2 in the ggml module. It also includes the removal of the loop over h in the llama_set_inputs function. These changes aim to simplify the code and improve readability.
    • github.com/...
    • github.com/...
  • End-of-Sequence Token Fix: This topic addresses a fix for the internlm2 converter to ensure the end-of-sequence (eos) token is correctly output at the end of each conversation. The fix ensures that conversations are properly terminated. This change improves the accuracy of text generation.
    • github.com/...
  • Dependency Updates: This topic involves updating the flake.lock file with new versions of dependencies, specifically flake-parts, flake-parts/nixpkgs-lib, and nixpkgs. The updates are part of an automated process by the update-flake-lock GitHub Action. These changes ensure that the project uses the latest versions of its dependencies.
    • github.com/...
  • Web Link and Documentation Fixes: This topic addresses a web link error and updates the README file to include information about supported Generalized Linear Models (GLM). It also fixes a broken link to the "Performance troubleshooting" documentation. These changes improve the accuracy and completeness of the project's documentation.
    • github.com/...
    • github.com/...
    • github.com/...
  • Performance Optimization: This topic enhances sampling performance by preallocating the sampling token data vector, reducing execution time from approximately 500 microseconds per operation to about 40 microseconds. This optimization provides a notable boost, particularly for the examples/lookahead implementation (an illustrative preallocation sketch appears after this list).
    • github.com/...
  • Synchronization with ggml: This topic involves synchronizing updates or changes related to 'ggml' in the GitHub project 'llama.cpp'. The synchronization ensures that the project remains up-to-date with the latest developments in the ggml module. This change helps maintain consistency and compatibility.
    • github.com/...
  • CMake Configuration Update: This topic introduces a minor change to the CMake configuration, allowing the use of the llama.cpp project as a subdirectory with an externally provided ggml library. The update ensures compatibility when ggml is added from a parent project. This change improves the flexibility of the build system.
    • github.com/...
  • Labeler Action Update: This topic updates the labeler action to align with the new structure of the SYCL backend in the project. The update ensures that the labeler action correctly categorizes pull requests based on the updated project structure. This change improves the accuracy of the labeling process.
    • github.com/...
  • Typo Correction: This topic addresses a small typo in the README file, correcting "Bakus-Naur" to "Backus-Naur". The correction ensures that the documentation is accurate and free of errors. This change improves the readability and professionalism of the documentation.
    • github.com/...
  • Option Renaming: This topic addresses the discrepancy between LLAMA_CCACHE and GGML_CCACHE by renaming the option to GGML_NO_CCACHE. The renaming resolves issue #8380 in the project. This change ensures consistency in the naming of configuration options.
    • github.com/...
  • Source File Reorganization: This topic involves reorganizing the source files by moving the sgemm sources to a subfolder named llamafile within the ggml directory. The reorganization aims to improve the structure and maintainability of the codebase. This change makes it easier to navigate and manage the source files.
    • github.com/...
  • Deprecation Warnings for codecvt_utf8: This topic addresses the C++17 deprecation warnings for codecvt_utf8 by silencing them when compiling with MSVC. The change ensures that the project can be compiled without warnings related to deprecated features, improving compatibility with modern C++ standards (a minimal example of silencing such warnings appears after this list).
    • github.com/...
  • API User Assertion: This topic adds an assertion to the build_t5() function to inform API users about the necessity of calling llama_encode() first. The assertion prevents a cryptic error message when using encoder-decoder models like T5 without prior encoding. This change improves the usability and error handling of the API.
    • github.com/...
  • Default Sampling Parameters: This topic enables the server to set default sampling parameters via the command-line. The change ensures that the specified grammar file is applied to all requests unless explicitly overridden by the user. This update improves the consistency and predictability of the server's behavior.
    • github.com/...
  • Space Errors Fix: This topic addresses and fixes space errors in the convert_hf_to_gguf.py script. The fix ensures that the script handles spaces correctly during conversion. This change improves the accuracy and reliability of the conversion process.
    • github.com/...
  • C++20 Compilation Fix: This topic addresses compilation errors in the llama.cpp project when using C++20. The fix adds a macro to explicitly cast u8"string" literals from const char8_t* to const char*, ensuring compatibility with the C++20 standard (an illustrative macro appears after this list).
    • github.com/...
  • Automatic Release Process Fix: This topic attempts to fix the automatic release process for the gguf-py project using tags. The fix ensures that releases are correctly generated and published. This change improves the reliability of the release process.
    • github.com/...
  • F32 Precision in Qwen2 Attention: This topic addresses the issue of certain models generating "GGGG" when FA is disabled by using F32 precision in Qwen2 attention. The change ensures that the models generate text correctly. This update improves the accuracy of text generation.
    • github.com/...
  • Noreturn Warning Suppression: This topic adds a while(true) loop to the no_device_code function in common.cuh to suppress a 'noreturn' warning. The change reduces the number of warnings when compiling with GGML_HIPBLAS=ON. This update improves the cleanliness of the build process.
    • github.com/...
  • MMQ Code Optimization for CUDA: This topic optimizes the performance of the MMQ (Matrix Multiplication Quantization) code for CUDA. The optimization includes adjusting shared memory tile dimensions, optimizing loop structures, and unifying code paths for different quantization formats. These changes result in various performance improvements and refactorings in preparation for i-quants.
    • github.com/...
  • Server Parameters Endpoint Update: This topic updates the /props endpoint to correctly return the default server parameters set via the command line. The update ensures accurate reflection of values such as n_ctx and grammar. This change improves the reliability of the server's configuration reporting.
    • github.com/...
  • Tokenizer Option: This topic introduces a --no-parse-special option to the tokenizer. The option simplifies the explanation of how the parse_special setting impacts tokenization, particularly in scenarios where parse_special = false causes issues with tokenizing consecutive spaces. This change improves the flexibility and usability of the tokenizer.
    • github.com/...
  • NVPL BLAS Support: This topic adds NVPL BLAS support to the project by introducing GGML_NVPL as a build option in the makefile. The update modifies ggml-blas.cpp to include NVPL BLAS and manage its threads via NVPL_ENABLE_CBLAS. This change enhances the project's support for different BLAS implementations.
    • github.com/...
  • SVE Macro: This topic introduces a GGML_USE_SVE macro to disable SVE by default. The change addresses issues with 256-bit operations on SVE 128-bit CPUs by ensuring SVE is only enabled if explicitly set at compile time. This update improves compatibility with different CPU architectures.
    • github.com/...
  • SYCL Unit Test Fix: This topic addresses and resolves issues related to the mul_mat_id unit test in the SYCL component of the project. The fix ensures that the unit test runs correctly and produces accurate results. This change improves the reliability of the SYCL component.
    • github.com/...
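To illustrate the Multilingual Language Models item above, here is a rough C++ sketch of the described heuristic (two or more ASCII letters and no angle brackets); the function name is hypothetical and the code is not taken from the pull request:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical filter: skip a token piece if it looks "English-like",
// i.e. it has two or more ASCII letters and contains no angle brackets
// (so control tokens such as <|eot_id|> are always kept).
static bool is_english_like(const std::string & piece) {
    if (piece.find('<') != std::string::npos || piece.find('>') != std::string::npos) {
        return false;
    }
    int letters = 0;
    for (unsigned char c : piece) {
        if (std::isalpha(c)) {
            ++letters;
        }
    }
    return letters >= 2;
}

int main() {
    printf("%d %d %d\n",
           (int) is_english_like("hello"),      // 1: would be skipped
           (int) is_english_like("<|eot_id|>"), // 0: kept
           (int) is_english_like("你好"));       // 0: kept (non-ASCII bytes are not ASCII letters)
    return 0;
}
```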
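For the Performance Optimization item (preallocating the sampling token data vector), a minimal sketch of the general technique, using a hypothetical stand-in struct rather than the project's real candidate type:

```cpp
#include <cstdint>
#include <vector>

struct token_data {   // hypothetical stand-in for the real candidate struct
    int32_t id;
    float   logit;
    float   p;
};

int main() {
    const int32_t n_vocab = 32000;          // example vocabulary size
    std::vector<token_data> candidates;
    candidates.reserve(n_vocab);            // allocate once, up front

    for (int step = 0; step < 100; ++step) {
        candidates.clear();                 // clear() keeps the capacity,
        for (int32_t id = 0; id < n_vocab; ++id) {
            candidates.push_back({id, 0.0f, 0.0f});   // so refilling never reallocates
        }
        // ... sampling over `candidates` would happen here ...
    }
    return 0;
}
```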
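For the Deprecation Warnings for codecvt_utf8 item, a minimal example of the usual ways such warnings are silenced under MSVC (via the standard _SILENCE_* macro or a targeted #pragma); this shows the general technique only and is not the pull request's exact change:

```cpp
// Must be defined before including standard headers (often set by the build system).
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main() {
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4996) // 'std::codecvt_utf8<...>': deprecated in C++17
#endif
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes("hello");
#ifdef _MSC_VER
#pragma warning(pop)
#endif
    printf("converted %zu wide characters\n", wide.size());
    return 0;
}
```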
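And for the C++20 Compilation Fix item, an illustrative version of the kind of macro described (the name U8 is hypothetical; the project's actual macro may differ):

```cpp
#include <cstdio>

#if defined(__cplusplus) && __cplusplus >= 202002L
// In C++20, u8"..." literals have type const char8_t *, so they must be
// cast back for APIs that expect const char *.
#define U8(x) reinterpret_cast<const char *>(u8##x)
#else
// Before C++20, u8"..." literals are already const char *.
#define U8(x) u8##x
#endif

int main() {
    const char * s = U8("héllo");  // compiles under both C++17 and C++20
    printf("%s\n", s);
    return 0;
}
```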

2.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions in this project's open pull requests from the past week to identify potentially heated exchanges and to help maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.


III. Commits

3.1 Commits

Commits This Week: 38

Summarized Commits:

  • SYCL Unit Tests and Fixes: The mul_mat_id unit tests for SYCL have been updated to fix part of the tests and skip the bfloat16 SYCL unit tests. Additionally, deprecated warnings in the SYCL codebase have been addressed by utilizing the multi_ptr class, and the powf function call within the device code has been fixed.
  • BLAS and GEMM Enhancements: NVPL BLAS support has been introduced to the ggml library, replacing the <BLASLIB>_ENABLE_CBLAS directive with GGML_BLAS_USE_<BLASLIB>. Optimized GEMV and GEMM kernels for AArch64 architecture have also been added, targeting specific quantization methods.
  • CUDA and HIPBLAS Improvements: The MMQ component in CUDA has been optimized and refactored with explicit q8_1 memory layouts, and the __trap macro in common.cuh has been updated to suppress a 'noreturn' warning when compiling with GGML_HIPBLAS=ON.
  • Build and Configuration Updates: The build process has been modified to ensure the deprecation-warning 'main' binary is built every time, and the macro LLAMA_NO_CCACHE has been replaced with GGML_NO_CCACHE in the CMake files. Additionally, the CMake configuration has been updated to allow the use of an external GGML library.
  • Documentation and README Updates: The README.md file has been updated to fix a broken link to the "Performance troubleshooting" documentation, and a typographical error has been corrected. The gguf-py README file has also been updated, and the patch version incremented for release.
  • Tokenization and Sampling Enhancements: A new --no-parse-special option has been introduced to the tokenize feature, and the sampling performance has been optimized by preallocating the sampling token data vector to the vocabulary size, significantly reducing execution time.
  • Attention Mechanism and Precision Updates: The Qwen2 attention mechanism has been updated to use F32 precision when Flash Attention (FA) is disabled, fixing incorrect output in that configuration. This change is part of ongoing improvements to the llama project.
  • Code Refactoring and Cleanup: The sgemm source files have been relocated to a subfolder named 'llamafile' within the ggml project, and various code refactorings have been performed, including the removal of an unused file and fixing linting issues.
  • Deprecation and Compatibility Fixes: Deprecated warnings related to the codecvt functionality in C++17 when using the MSVC compiler have been addressed, and C++20 compatibility for u8 strings in the llama project has been ensured.
  • Assertion and Error Handling: An assertion has been introduced to ensure that the llama_encode() function is called, and assertions for prefix and suffix tokens have been added to the infill functionality, along with the removal of outdated space handling logic.
  • Synchronization and Integration: The project has been synchronized with the latest changes from the ggml repository, and the SYCL labeler has been updated to ensure consistency with the documentation.
  • Whitespace and Typographical Corrections: Whitespace issues in the test files have been corrected, and an extra space in the convert_hf_to_gguf.py script has been fixed.
  • Server and Slot Sampling Parameters: Default server sampling parameters can now be set through the command-line interface, ensuring they are loaded from the server context by default, with improvements to the comments for clarity.
  • Deprecation Warning Program: A deprecation warning program has been introduced to aid users in transitioning to new binary names, ensuring legacy replacement binaries are built only if they pre-exist and verifying their presence consistently.
  • External Library and Build Configuration: Changes to the CMake configuration now allow the use of an external GGML library, and the build configuration has been updated to replace the macro LLAMA_NO_CCACHE with GGML_NO_CCACHE.
  • Performance Optimization: The code has been optimized by avoiding unnecessary fetching of logits, resulting in performance improvements.
  • Web Link and Documentation Fixes: A web link error in the README file has been corrected, ensuring the correct link is provided, and the README file has been updated to include information about the supported Generalized Linear Models (GLM).
  • Flake.lock and Dependency Updates: The flake.lock file has been updated by refreshing the inputs for 'flake-parts', 'flake-parts/nixpkgs-lib', and 'nixpkgs' to their latest versions.
  • CUDA Implementation for Convolution: A CUDA implementation for the ggml_conv_transpose_1d function has been introduced, ensuring it passes various tests and addressing several bugs and style issues.
  • Synchronization Process for SYCL: The synchronization process for SYCL within the scripts has been addressed to ensure proper functionality and performance.
  • Comment and Documentation Improvements: The verbosity of a comment in the SYCL Nvidia Backend has been reduced, and the gguf-py README file has been updated as part of the release pipeline improvements.

IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, or created at least 1 pull request in the past month.

Contributor Commits Pull Requests Issues
GitHub 202 0 0
ggerganov 0 28 1
ngxson 0 15 3
Georgi Gerganov 15 0 0
slaren 0 14 0
JohannesGaessler 0 13 0
danbev 0 11 0
0wwafa 0 0 10
Someone 8 0 0
HanClinto 0 7 0
fairydreaming 0 5 0
jukofyork 0 4 1
ochafik 0 5 0
compilade 0 5 0
AidanBeltonS 0 4 0
luoyu-intel 0 4 0
OuadiElfarouki 0 4 0
Alcpz 0 4 0
RunningLeon 0 1 3
oldmanjk 0 0 4
joeatodd 0 3 0
mdegans 0 2 1
CISC 0 3 0
HatsuneMikuUwU33 0 2 1
criminact 0 2 1
iboB 0 3 0
RakshitAralimatti 0 0 3
ghchris2021 0 0 3
arthw 0 2 0
0cc4m 0 2 0
Galunid 0 2 0
sasha0552 0 2 0
Adriankhl 0 1 1
youth123 0 2 0
matteoserva 0 1 1
jaime-m-p 0 2 0
hamdoudhakem 0 2 0
airMeng 0 2 0
mofosyne 0 2 0
AragonerUA 0 2 0
daniandtheweb 0 2 0
iamlemec 0 2 0
isaac-mcfadyen 0 1 1
bandoti 0 1 1
ZeusXuan 0 2 0
jpodivin 0 2 0
LDLINGLINGLING 0 1 1
dspasyuk 0 1 1
standby24x7 0 2 0
b4b4o 0 1 1
kevmo314 0 2 0
nicholaiTukanov 0 1 1
duynt575 0 0 2
stduhpf 0 0 2
liuda1980 0 0 2
uwu-420 0 0 2
cmp-nct 0 0 2
Billzhong2022 0 0 2
takosalad 0 0 2
kidoln 0 0 2
Smupk2778 0 0 2
wangzi7654321 0 0 2
jygmysoul 0 0 2
QIANXUNZDL123 0 0 2
ch1y0q 0 0 2
SimplyCorbett 0 0 2
hanishkvc 0 1 0
calvin-laurenson 0 1 0
hopkins385 0 1 0
zkh2016 0 1 0
akx 0 1 0
thxCode 0 1 0
drepper 0 1 0
abhishek-rn 0 1 0
0xspringtime 0 1 0
rgerganov 0 1 0
edude03 0 1 0
NickCrews 0 1 0
netrunnereve 0 1 0
ltoniazzi 0 1 0
ddh0 0 1 0
joecryptotoo 0 1 0
IMbackK 0 1 0
Eddie-Wang1120 0 1 0
fmz 0 1 0
katsu560 0 1 0
kustaaya 0 1 0
contentis 0 1 0
pculliton 0 1 0
zhentaoyu 0 1 0
loonerin 0 1 0
salaxieb 0 1 0
mgroeber9110 0 1 0
abetlen 0 1 0
AlexsCode 0 1 0
iacore 0 1 0
Zor-X-L 0 1 0
crashr 0 1 0
hackingthekernel 0 1 0
andy-tai 0 1 0
mcharytoniuk 0 1 0
Quantaindew 0 1 0
MistApproach 0 1 0
foldl 0 1 0
ho2103 0 1 0
hopto-dot 0 1 0
akemimadoka 0 1 0
NeoZhangJianyu 0 1 0
dwoolworth 0 1 0
pouwerkerk 0 1 0
bviksoe 0 1 0
mtasic85 0 1 0
diimdeep 0 1 0
perpendicularai 0 1 0
prfd 0 1 0
brochure 0 1 0
agray3 0 1 0
jdomke 0 1 0
yeahdongcn 0 1 0
daghanerdonmez 0 1 0
andysalerno 0 1 0
laik 0 1 0
monatis 0 1 0
zhipenghan 0 1 0
msy-kato 0 1 0
ClarkChin08 0 1 0
kriation 0 1 0
Zibri 0 0 1
Nexesenex 0 0 1
INZA111 0 0 1
rankaiyx 0 0 1
xiangyang-95 0 0 1
steampunque 0 0 1
ztrong-forever 0 0 1
chigkim 0 0 1
apar2021 0 0 1
bartowski1182 0 0 1
apcameron 0 0 1
aymane-eljerari 0 0 1
lld1995 0 0 1
vecorro 0 0 1
arch-btw 0 0 1
richardanaya 0 0 1
vt-alt 0 0 1
farnazj 0 0 1
anunknowperson 0 0 1
JMPSequeira 0 0 1
skoulik 0 0 1
zhaoyuchen1128 0 0 1
Deputation 0 0 1
Ther-nullptr 0 0 1
mneedham 0 0 1
Edw590 0 0 1
EverythingForAI 0 0 1
cikkle 0 0 1
marcingomulkiewicz 0 0 1
mirekphd 0 0 1
hnfong 0 0 1
ffroquemartinez 0 0 1
idekel 0 0 1
nivibilla 0 0 1
DerekJuba-NIST 0 0 1
abgulati 0 0 1
perp 0 0 1
moqimoqidea 0 0 1
thesyntaxinator 0 0 1
SteelPh0enix 0 0 1
justinsteven 0 0 1
palindsay 0 0 1
differentprogramming 0 0 1
lcarrere 0 0 1
MarsBlessed 0 0 1
sreenivasraghavan71 0 0 1
Lookforworld 0 0 1
nmandic78 0 0 1
Green-Sky 0 0 1
eliranwong 0 0 1
quarterturn 0 0 1
rudiservo 0 0 1
werruww 0 0 1
unclemusclez 0 0 1
JohnClaw 0 0 1
micsthepick 0 0 1
kherud 0 0 1
tomgm777 0 0 1
chiranko 0 0 1
Gomez12 0 0 1
starP-W 0 0 1
nathanodle 0 0 1
tybalex 0 0 1
akhilkapil 0 0 1
LiquidGunay 0 0 1
mirek190 0 0 1
flatsiedatsie 0 0 1
tihom77 0 0 1
sorasoras 0 0 1
lorihuang 0 0 1
ctb111 0 0 1
aahouzi 0 0 1
jim-plus 0 0 1
Yan-Xiangjun 0 0 1
josharian 0 0 1
Aridbhdkkj 0 0 1
AUTOMATIC1111 0 0 1
d-kleine 0 0 1
warren-lei 0 0 1
yancaoweidaode 0 0 1
andreys42 0 0 1
gpacix 0 0 1
guinmoon 0 0 1
apresence 0 0 1
kasrahabib 0 0 1
Hardik-Choraria 0 0 1
yli147 0 0 1
99991 0 0 1