Weekly GitHub Report for Llama.cpp - 2024-07-29 12:00:11
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. Issues
1.1 Open Issues
Open Issues This Week: 25
Summarized Issues:
- macOS GitHub Actions hosted runners issues: This issue describes a bug where macOS GitHub Actions hosted runners hang or fail to return results when running small models like Qwen2 1.5B and Phi-3 mini. These models work fine on other operating systems and locally on an M3 Max. This suggests a potential problem with GPU virtualization on GitHub's macOS runners.
- Integrated GPU issues on Framework Laptop 16: This issue describes a bug where attempting to use the integrated GPU (iGPU) on a Framework Laptop 16 with an AMD Radeon RX 7700S and AMD Radeon 780M results in crashes. The crashes are due to out-of-memory errors and segmentation faults when running the llama-server application. This indicates a problem with memory management on these specific GPUs.
- OpenAI API max_tokens parameter bug: This issue describes a bug where non-chat completions using the OpenAI API do not respect the max_tokens parameter. This causes the model to generate tokens indefinitely until the context length is reached. The current workaround is to stream the response and manually close the connection upon reaching the desired token count.
- llama.pc file issues: This issue pertains to the llama.pc file containing an incorrect Version: line, resulting in erroneous output from the pkg-config --print-provides command. This affects downstream distributions by failing to auto-detect dependencies due to an empty @PROJECT_VERSION@ variable. Another issue involves the incorrect packaging of the llama.pc file into /usr/lib/pkgconfig/llama.pc instead of the architecture-dependent "libdir" directory.
- Docker image conversion issues: This issue describes a bug encountered when using the llama.cpp Docker image to convert certain vector models to GGUF format. The conversion results in a NotImplementedError due to an unrecognized BPE pre-tokenizer. The issue requests guidance on resolving the problem and identifying compatible older versions of the software.
- Support for Llama 3.1 model: This issue requests the addition of support for the newly released Llama 3.1 model in the llama.cpp project. Necessary updates for RoPE scaling are included to ensure coherent text generation. This would enhance the project's compatibility with the latest models.
- HuggingFace documentation update: This issue highlights that the HuggingFace documentation for the "Use This Model" button refers to outdated binary names for llama.cpp. It suggests updating the instructions to use llama-cli instead of main. This would ensure users follow the correct procedures.
- GGML_ASSERT error with Meta-Llama-3.1-8B-Instruct models: This issue involves a GGML_ASSERT error occurring when attempting to run Meta-Llama-3.1-8B-Instruct models with Q8 and Q4 quantization on the SYCL backend using an Intel ARC A770 GPU on Windows 11. Other models work fine, indicating a specific compatibility issue with these models.
- Quantization issues with Llama 3.1 70B model: This issue involves a bug where attempting to quantize the Llama 3.1 70B model to Q4_K_S using imatrix results in NaN values. The issue occurs specifically at block 48, and similar issues are observed with other quant sizes like Q3_K_L and Q3_K_M. This indicates a problem with the quantization process.
- Llama 3.1 model breaking on macOS: This issue describes a bug introduced in version b3383 that causes the Llama 3.1 model to break when running specific commands. The error results in an "input is empty" message on macOS. A workaround and partial fixes are mentioned in the comments.
- Docker tags issue: This issue is about Docker tags incorrectly starting with build number b1 followed by the commit hash. The problem is due to the action/checkout in docker.yml not setting the build depth, which should be configured to fetch-depth: 0 to match the build.yml workflow.
- Memory usage issues on Mac: This issue describes a problem where the memory usage for running inference on a llama model fluctuates between cached files and wired memory. This causes inefficiencies and crashes when attempting to keep layers in wired memory on a Mac system. The issue highlights the need for better memory management.
- Reintroduction of chat/instruct templates: This issue is a feature request to reintroduce the previously removed chat/instruct templates in the llama.cpp project. The user found them extremely useful and is experiencing difficulties with the current alternatives. This would improve usability for those relying on these templates.
- Conversion issues with fine-tuned Code Llama model: This issue describes a bug encountered when attempting to convert a fine-tuned Code Llama model file into a GGUF file. The conversion results in an "out of range" error due to token IDs exceeding the maximum allowed value. This indicates a problem with the tokenization process.
- Multi-core support for full GPU offload: This issue requests the implementation of multi-core support for full GPU offload in the llama.cpp project. This would improve performance on systems with lower single-core performance by enabling the --threads argument. The feature would enhance the project's efficiency.
- Corrupted outputs with multiple CUDA GPUs: This issue describes a bug where models produce corrupted outputs when offloading to multiple CUDA GPUs. Specific problems include incorrect parsing of prompts and reuse of information from previous prompts. This indicates a problem with the GPU offloading process.
- Offloading specific operations to GPU: This issue is about inquiring whether it is possible to offload specific operations, such as attention calculations, to the GPU while keeping other operations, like layer normalization, on the CPU. This would allow for more efficient use of resources in a GitHub project.
- Quantization process error on mamba architecture: This issue describes a bug where an error occurs during the quantization process to gguf q5_k_m on a mamba architecture. The error is due to a missing 'architectures' key in the model's parameters, resulting in a KeyError. This indicates a problem with the model's metadata.
- LlamaCpp tokenizer bug: This issue describes a bug in the LlamaCpp tokenizer where it fails to correctly tokenize partial UTF-8 byte sequences. This results in an invalid character error and breaks functionality, as demonstrated with the character '歪' being split into two tokens. This indicates a problem with the tokenizer's handling of UTF-8 sequences.
- Addition of chat_example property: This issue requests the addition of a chat_example property to the /props endpoint of the server. This would provide a recommended chat template for models, facilitating easier access and verification of templates through the server UI.
- Compilation issues on Linux: This issue describes a problem where the user is unable to compile the llama.cpp project on a Linux operating system using either the make or cmake methods. Different errors are encountered with each approach despite trying multiple versions of gcc. This indicates a problem with the build process on Linux.
- Missing 'libggml.so' file during installation: This issue describes a bug where attempting to install a project using CMake on a Linux system fails due to a missing 'libggml.so' file. The error message indicates a problem during the installation process. This highlights the need for proper dependency management.
- GPU acceleration on Android: This issue is about seeking guidance on how to utilize the GPU on an Android device to accelerate inference for the llama.cpp demo. The user encountered problems with OpenCL and is looking for alternatives like Vulkan. This would enhance the performance of the demo on Android devices.
- Lightweight tests for LoRA: This issue is about adding lightweight tests for LoRA by training adapters based on specific datasets. The tests would be conducted with a command-line interface and the outputs verified. An optional task includes creating small models with different architectures.
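The tokenizer issue summarized above, in which the character '歪' is split across two tokens, comes down to decoding a multi-byte UTF-8 character that arrives in pieces. The following Python sketch illustrates the general idea using the standard library's incremental decoder. The two-chunk byte split is an assumed example rather than the actual token boundaries llama.cpp produces, and the helper name decode_pieces is hypothetical; this is not the project's code.

```python
import codecs


def decode_pieces(pieces):
    """Decode byte pieces that may end mid-character, buffering incomplete bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = []
    for piece in pieces:
        # Returns "" while a multi-byte character is still incomplete.
        out.append(decoder.decode(piece))
    # final=True raises UnicodeDecodeError if bytes are left dangling.
    out.append(decoder.decode(b"", final=True))
    return out


raw = "歪".encode("utf-8")      # a three-byte UTF-8 sequence
pieces = [raw[:2], raw[2:]]     # hypothetical split across two tokens

# Decoding the first piece on its own fails, mirroring the reported error.
try:
    pieces[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

decoded = decode_pieces(pieces)
print(naive_ok, "".join(decoded))
```

Buffering incomplete bytes until the character is complete is one possible way for a consumer of token pieces to avoid the invalid-character error; whether the eventual fix takes this form is not stated in the issue summary.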
1.2 Top 5 Active Issues:
We consider active issues to be issues that have generated much discussion in the issue's comments.
- Feature Request: Proper Llama 3.1 Support in llama.cpp: This issue is a feature request to add support for Llama 3.1 in the llama.cpp project, which involves updating the model to handle new requirements such as RoPE scaling to ensure coherent text generation. The motivation behind this request is to unlock the full potential of the Llama 3.1 model, as the current implementation without these updates results in suboptimal text generation.
- The comments discuss various aspects of implementing Llama 3.1 support, including function calling, new tokens, and template adjustments. Users share their experiences with different configurations and quantizations, noting issues with accuracy and performance. Some users report success with specific settings, while others highlight ongoing problems, particularly with long context handling and CUDA implementation. The conversation also includes links to relevant code changes and external resources for further testing and validation.
- Number of comments: 111
- server : improvements and maintenance: This issue is about improving and maintaining the server example in the GitHub project, which has grown in functionality but is currently unstable and missing important features. The issue aims to track these points and draw community attention to them, as some tasks are significant and require considerable effort to complete.
- The comments discuss various improvements and suggestions, including adding look-ahead decoding, contrastive search, speculative sampling, and function calling support. They also touch on the need for better error handling, refactoring for stability and performance, and the potential use of templating systems like Jinja. There is a consensus on the importance of making the server more robust and user-friendly, with some debate on the best approaches to achieve these goals.
- Number of comments: 108
- Support BitNet b1.58 ternary models: This issue is about implementing support for BitNet b1.58 ternary models, which use 1.58 bits with ternary values (1, 0, -1) for training, showing performance improvements over fp16 models. The issue highlights the potential for running larger models with less VRAM and discusses the feasibility and benefits of integrating this new training method into llama.cpp.
- The comments discuss the novelty and potential of training models directly in a quantized state, the technical details and challenges of implementing ternary models, the need for further validation and code release from the original authors, and the community's interest in exploring and optimizing this approach for practical use.
- Number of comments: 88
- Investigate gemma 2 generation quality: This issue is about investigating the quality of the Gemma 2 generation in the llama.cpp project, with initial reports suggesting potential problems with the tokenizer. The discussion includes various tests and observations, including discrepancies in tokenization and output quality, especially in handling specific tokens and formatting issues.
- The comments discuss the hard-coded window size of Gemma 2, issues with math questions indicating potential tokenizer problems, differences in quantization quality, and various tests comparing outputs from different implementations. There are also suggestions for fixes, such as changing the vocabulary conversion method and adjusting logit softcapping values, with some users reporting improvements after applying these changes.
- Number of comments: 88
- Support for Phi-3 models: This issue is about adding support for Microsoft's recently released Phi-3 models, which come in three variants: mini, small, and medium. The request is to integrate these new models into the project, ensuring compatibility and functionality.
- The comments discuss various aspects of integrating Phi-3 models, including successful initial tests, issues with long context support, and specific errors encountered during conversion. There are also discussions about the need for new prompt templates, the implementation of longrope techniques, and the eventual merging of a pull request that adds support for Phi-3 4K context length models. However, support for the 128K context length variant remains unresolved, with ongoing efforts and community contributions to address this.
- Number of comments: 83
1.3 Top 5 Quiet Issues:
We consider quiet issues to be issues that have been opened in this project for the longest time. The team should work together to get these issues resolved and closed as soon as possible.
- llama : add test for saving/loading sessions to the CI: This issue involves adding a test for saving and loading sessions to the continuous integration (CI) process of the llama project. The task requires understanding the save-load-state example and incorporating a simple test into the ci/run.sh script.
- Open for 347 days, 01 hours, 18 minutes
- llama : tool for evaluating quantization results per layer: This issue proposes the development of a tool to evaluate quantization results per layer by comparing classical and quantized models using ggml exported graphs. The tool aims to provide detailed statistical information on intermediate results after each graph node to identify where precision is needed to minimize quantization differences.
- Open for 338 days, 05 hours, 57 minutes
- CUDA non-determinism on identical requests: This issue describes a problem where identical requests to a server using CUDA for layer offloading return different responses the first time, but consistent responses thereafter, suggesting a potential caching issue. The expected behavior is that the output should remain the same when parameters and seed are constant, and this non-deterministic behavior is not observed with Metal offload or without CUDA offload.
- Open for 335 days, 22 hours, 22 minutes
- Windows ROCm Build.: This issue involves a user attempting to compile the llama.cpp project for ROCm on a Windows system, encountering difficulties due to CMake's default paths for the clang and clang++ compilers, which differ from their actual locations on Windows. The user reports that attempts to compile using Visual Studio and CMake result in an error message indicating that "CC is not recognized as an internal or external command."
- Open for 335 days, 20 hours, 44 minutes
- Please support the also official Falcon-rw-1b and Falcon-rw-7b model variants: This issue requests support for the Falcon-RW-1B and Falcon-RW-7B model variants, which are official versions of the Falcon model series. The user has encountered errors when attempting to convert and quantize these models using the convert-falcon-hf-to-gguf.py script, and is seeking assistance or confirmation on whether these models will be supported.
- Open for 334 days, 08 hours, 09 minutes
1.4 Closed Issues
Closed Issues This Week: 31
Average Issue Close Time (This Week): 26.31 days
Summarized Issues:
- Log Probabilities in create_chat_completions: Users are experiencing issues with obtaining log probabilities using the create_chat_completions function in the llama-cpp library. Despite setting logprobs=True, the expected log probabilities are not included in the output. This issue affects the functionality of the library for users who rely on log probabilities for their applications.
- Model Conversion Errors: Several issues have been reported regarding errors encountered during model conversion to GGUF format. These include unrecognized rope scaling types, debugging errors in tensor_mapping.py, and problems with configuration files and model architecture parameters.
- Server and Streaming Issues: Users have reported various issues with the server and completion streaming. These include special tokens being returned as empty strings, performance slowdowns, infinite loops with batch requests, and server crashes after specific commits.
- SYCL Backend Bugs: Multiple issues have been identified with the SYCL backend, including device index errors, operation failures, and build errors with Intel OneAPI. These bugs affect the stability and functionality of the backend in various scenarios.
- Embedding and Tokenization Issues: Problems have been reported with the embedding endpoint and tokenization processes. These include crashes in the tokenizer, unwanted spaces in tokenization, and incorrect formatting in chat templates.
- Compilation and Build Errors: Users have encountered various compilation and build errors, including issues with ggml-aarch64.c on Windows ARM64, CUDA build process failures, and symbol lookup errors due to incorrect library linking.
- Feature Requests: There have been requests for new features, such as multi-session chat processing, support for the SmolLM family of models, and support for the Mistral-Large model from Hugging Face. These requests aim to enhance the functionality and versatility of the project.
- Miscellaneous Bugs: Various other bugs have been reported, including issues with the export-lora command, llama_print_system_info function, train-text-from-scratch command, and grammar-related generation differences.
1.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open issues within the past week to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open issues from the past week.
II. Pull Requests
2.1 Open Pull Requests
Open Pull Requests This Week: 18
Pull Requests:
- Graph Nodes Determination: This pull request proposes to determine the maximum number of graph nodes based on the model information, such as architecture and hyperparameters. It addresses issue #8615 and aims to optimize the graph node allocation. This enhancement is crucial for improving the model's performance and scalability.
- CMakePresets Fix: This pull request addresses a fix in the CMakePresets by ensuring that the host value for the MSVC compiler in the toolchain is correctly set to either x86 or x64. This change aligns with the CMake documentation. It ensures proper configuration and compilation of the project.
- Python Binding and Installation: This pull request introduces a pre-compiled Python binding for llama.cpp using CFFI. It supports both CPU and CUDA 12.5 execution and simplifies installation to a single pip install command. This enhancement makes it easier for users to integrate and use the library.
- CLI Template Argument: This pull request introduces an optional --template argument to the llava-cli tool. It allows users to format the output of bulk image descriptions according to a specified template. This feature enhances the utility and customization of the generated descriptions.
- Library Refactoring: This pull request aims to refactor the llama library by moving the llama_sampling_context. It updates the sampling API to utilize it instead of llama_context and removes LLAMA_API_INTERNAL. These changes improve the library's structure and maintainability.
- SHA-256 Tensor Hash: This pull request introduces a SHA-256 tensor hash to the key-value store in the project. It provides a strong cryptographic method for tracking models and ensuring data integrity. This feature is particularly useful for model repository maintainers like Hugging Face.
- Lookup Example Overhaul: This pull request overhauls the lookup example to use a tree of sequences instead of a single sequence. It aims to improve the prediction accuracy of multiple tokens per evaluation. The change potentially increases the token generation rate by 33-50% through a more efficient intermediate data format and cost function prioritization.
- XLMRoberta Embedding Models: This pull request adds support for XLMRoberta embedding models. It modifies tokenization using the new T5 Unigram work and includes necessary adjustments to the position embedding matrix and Unigram tokenizer. These changes enhance the model's capabilities and compatibility.
- Threadpool Management API: This pull request introduces an API for explicit management and fine-grain control of threadpools. It allows for the creation, pausing, resuming, and releasing of multiple threadpools independently. This optimization improves thread scheduling and performance in various execution contexts.
- Rope Scaling Factors: This pull request introduces the generation and integration of rope scaling factors into the Llama 3.1 model during conversion. It enhances inference performance for context windows exceeding 8192 tokens. This improvement is crucial for handling larger context windows efficiently.
- SYCL Backend Convolution Support: This pull request aims to add convolution support to the SYCL backend for the stablediffusion.cpp project. It serves as a temporary solution with plans to introduce OneDNN for improved convolution performance in the future. This addition enhances the backend's capabilities.
- Dockerfile Curl Installation: This pull request aims to install 'curl' in the runtime layer of the llama-server.Dockerfile. It enables Docker health checks for the basic server image. This addition ensures better monitoring and maintenance of the server.
- Hash Table Reset Optimization: This pull request aims to reduce the reset cost of hash tables by using a bit table to indicate slot usage instead of a NULL pointer. It significantly decreases the memory that needs to be cleared during resets. This optimization improves performance, especially in small models.
- Session File Management: This pull request aims to simplify and unify the session file management in the llama project. It consolidates the format for seq_id-specific and whole KV cache session files. These changes reduce the number of places that need updating when changes are made and introduce several improvements and breaking changes to enhance maintainability and performance.
- SYCL Backend TIMESTEP_EMBEDDING Operator: This pull request introduces a TIMESTEP_EMBEDDING operator for the SYCL backend. It is modeled after the corresponding CUDA kernel and serves as a temporary solution to support the stablediffusion.cpp project. This addition enhances the backend's functionality.
- Runtime SVE Configuration: This pull request involves updating the code to read the runtime SVE configuration of the CPU. The changes are moved from ggml.c to ggml-quants.c, and the pull request supersedes a previous one, which will be closed. This update ensures accurate runtime configuration.
- Multi-NPU Execution Fix: This pull request resolves issue #8580 by fixing the Multi-NPU execution error on the CANN backend. It allows users to utilize multiple NPUs with the -sm layer option. This fix enhances the backend's multi-NPU capabilities.
- CLI No-Warmup Option: This pull request introduces a --no-warmup option to llama-cli. It allows users to bypass the warmup llama_decode call. This option can be particularly useful for debugging purposes.
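The hash table reset optimization listed above replaces a per-slot NULL check with a separate bit table, so a reset only has to clear one bit per slot. The Python sketch below illustrates that general technique under stated assumptions: the class name BitsetHashSet, the linear-probing scheme, and the capacity are all hypothetical choices for illustration and are not taken from the ggml implementation.

```python
class BitsetHashSet:
    """Open-addressing set whose 'used' flags live in a compact bit table."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity               # slot storage, never cleared
        self.used = bytearray((capacity + 7) // 8)  # one bit per slot

    def _is_used(self, i):
        return (self.used[i >> 3] >> (i & 7)) & 1

    def _mark_used(self, i):
        self.used[i >> 3] |= 1 << (i & 7)

    def insert(self, key):
        """Insert key with linear probing and return its slot index."""
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            if not self._is_used(i):
                self.keys[i] = key
                self._mark_used(i)
                return i
            if self.keys[i] == key:
                return i
            i = (i + 1) % self.capacity
        raise RuntimeError("hash set is full")

    def contains(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            if not self._is_used(i):
                return False
            if self.keys[i] == key:
                return True
            i = (i + 1) % self.capacity
        return False

    def reset(self):
        # Only the bit table is zeroed: about capacity/8 bytes rather than
        # one entry per slot. Stale keys are ignored because their 'used'
        # bit is now zero.
        self.used[:] = bytes(len(self.used))


table = BitsetHashSet(64)
table.insert(1)
table.insert(65)   # collides with 1 under modulo 64 and probes to the next slot
table.reset()
```

In this sketch a 64-slot table resets by clearing 8 bytes, while the key array is left untouched. That is the trade-off the pull request summary describes: less memory to clear per reset in exchange for an extra bit lookup on each probe.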
2.2 Closed Pull Requests
Closed Pull Requests This Week: 44
Summarized Pull Requests:
- Memory Optimization on 64-bit Platforms: This topic focuses on optimizing memory usage by aligning various structs, resulting in reduced byte sizes for several data structures. The pull requests address the alignment of ggml_type_traits_t, llama_batch, llama_model_params, hash_node, ggml_compute_state, and gguf_tensor_info. These changes aim to improve memory efficiency and performance on 64-bit platforms.
- Docker Container Library Updates: This topic addresses the issue of the missing libgomp.so.1 library in the llama.cpp Docker container. The pull request updates the Dockerfile to include the installation of libgomp1, ensuring the necessary library is present. This prevents related errors and ensures smoother operation of the Docker container.
- Performance State Management with NvAPI: This topic introduces support for changing the performance state using NvAPI in the llama project. The pull request includes implementing performance state switching functions and conditional compilation based on CUDA. It also plans for logging and synchronization across multiple instances.
- Chat Template Adjustments: This topic makes adjustments to the pre-defined chat templates for Llama2, Llama3, and Zephyr in the new server UI. The pull request aligns them with recommended versions and removes redundant start-of-text tokens for the Llama models. These changes aim to improve the user experience and template accuracy.
- Python Script Style Improvements: This topic involves making stylistic adjustments to Python scripts. The pull request removes superfluous parentheses, unused arguments, and variables, renames constants, and prevents variable redefinition. These changes aim to improve code readability without affecting functionality.
- Runtime SVE Configuration Reading: This topic addresses the issue of accurately reading the runtime Scalable Vector Extension (SVE) configuration of the CPU in the ggml library. The pull request uses prctl(PR_SVE_GET_VL) instead of svcntb(). This ensures correct configuration reading and improves compatibility.
- Hosting Multiple Fine-Tuned Models: This topic introduces a method to host multiple fine-tuned derived models on memory-constrained devices. The pull request splits GGUF files into shared and task-specific tensors, allowing dynamic loading and swapping of task-specific tensors. This approach keeps only one copy of the shared tensors in memory.
- Documentation Updates: This topic includes various updates to the documentation. The pull requests add AI Studio to the list of user interfaces, clarify the n_keep parameter, and correct the term "quantum models" to "quantized models". These changes aim to improve clarity and accuracy in the documentation.
- Code Refactoring: This topic involves refactoring the llama code to improve organization and prepare for future API changes. The pull requests move vocabulary, grammar, and sampling implementations into separate files and update Swift and Android bindings. These changes enhance code clarity and maintainability.
- Windows and ARM Support: This topic addresses improvements for running the project on Windows with Snapdragon X. The pull request adds documentation for building on Windows, especially for ARM, and fixes issues related to MSVC's lack of support for C in-line assembly for ARM. These changes ensure better compatibility and support for Windows and ARM platforms.
- Multi-GPU and SYCL Improvements: This topic addresses issues related to multi-GPU crashes and SYCL support. The pull requests fix a multi-GPU crash on SYCL, add the -fsycl flag back to GGML_EXTRA_LIBS, and ensure CI builds both static and dynamic libraries for the GGML_SYCL backend. These changes improve stability and compatibility for multi-GPU and SYCL environments.
- CodeShell Support Fixes: This topic addresses issues with CodeShell support that arose after updating llama.cpp. The pull request syncs with the latest version of the repository and implements necessary fixes. These changes ensure continued compatibility and functionality of CodeShell.
- Model and Embedding Support: This topic includes updates to support different models and embeddings. The pull requests address shape issues in Mistral Nemo, update the llama-export-lora example for the new LoRA format, and add support for the SmolLM pre-tokenizer and XLMRoberta model. These changes enhance the flexibility and compatibility of the project with various models and embeddings.
- Tokenizer and Quantization Enhancements: This topic addresses various enhancements related to tokenizers and quantization. The pull requests re-enable tokenizer tests, add IQ4_NL support to Vulkan, allow overriding specific tokenizer flags, and dequantize tensors from the base model for compatibility with lora adapters. These changes improve the functionality and flexibility of tokenizers and quantization processes.
- SYCL and DPC++ Build Support: This topic enables the llama.cpp project to be built using non-release versions of DPC++ and oneMKL. The pull request uses clang++ instead of icpx, removes some duplicate or unnecessary flags, and slightly rearranges the build logic. These changes enhance the flexibility and compatibility of the build process.
- Browser Compatibility Fixes: This topic addresses compatibility issues in the llama-server UI. The pull request replaces the URL.parse method, which is not supported in Safari, with the more universally supported new URL() constructor. This ensures consistent functionality across all browsers.
- Adapter Management: This topic introduces the llama_lora_adapter_clear function. The pull request allows users to clear loaded adapters in llama_context to facilitate switching adapters without knowing which ones are currently loaded. This enhances the flexibility and usability of adapter management.
- System Message Formatting: This topic addresses the issue of incorrect formatting of system messages in the llama_chat_format_single function for Mistral. The pull request adds logs and test cases and provides an example of the output with the proposed changes. These changes ensure accurate and consistent message formatting.
- System Info Display Fixes: This topic adds a new function ggml_cpu_has_llamafile() to the ggml library. The pull request uses this function to fix the system info display issue when using llamafile, addressing issue #8656. This ensures accurate system information display.
- Example Removal and Corrections: This topic involves removing non-functional examples and making minor corrections. The pull request removes the finetune and train-text-from-scratch examples due to their high maintenance requirements and corrects the export-lora/README file. These changes reduce maintenance overhead and improve documentation accuracy.
- README Updates: This topic updates the README.md file. The pull request adds a link to a game created by the contributor that depends on the llama library. This addition highlights the practical applications of the library.
- Voice Mode in UI: This topic adds a simple voice mode to the UI. The pull request incorporates features such as speech-to-text initiation, automatic message sending after speech recognition, text-to-speech voice options, and play/pause functionality for messages. These features have been tested across multiple browsers and operating systems.
- Compile Warning Fixes: This topic addresses build issues and fixes compile warnings related to the fabs function. The pull request ensures that the code compiles without warnings, improving code quality and maintainability.
- Lifecycle Script Support: This topic introduces support for lifecycle scripts in the common module of the project. The pull request allows specific scripts to be executed at various stages of the application's lifecycle, enhancing the flexibility and control over performance state management.
- NULL Pointer Dereference Prevention: This topic addresses a potential NULL pointer dereference issue in the ggml_init function. The pull request ensures the code bails out if no unused context is found, preventing a segmentation fault during subsequent calls to ggml_set_on_alloc.
- Parameter Order Correction: This topic addresses the issue of parameter order in the usage of the aclrtGetMemInfo function. The pull request ensures it aligns with the correct usage as documented, improving code accuracy and functionality.
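The memory optimization entry above reports smaller structs on 64-bit platforms after alignment changes. One common way such savings arise is by ordering fields so the compiler inserts less padding. The Python sketch below uses ctypes, which follows the platform's native C layout rules, to show the effect. The two structs and their field names are hypothetical and do not mirror any actual ggml or llama.cpp struct, and it is an assumption that the pull requests relied on field reordering specifically.

```python
import ctypes


class Padded(ctypes.Structure):
    # Hypothetical layout: 1-byte flags interleaved with 8-byte integers,
    # so padding is inserted before each 8-byte field.
    _fields_ = [
        ("flag_a", ctypes.c_bool),
        ("size", ctypes.c_int64),
        ("flag_b", ctypes.c_bool),
        ("offset", ctypes.c_int64),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first, so the small fields share one padded tail.
    _fields_ = [
        ("size", ctypes.c_int64),
        ("offset", ctypes.c_int64),
        ("flag_a", ctypes.c_bool),
        ("flag_b", ctypes.c_bool),
    ]


padded_size = ctypes.sizeof(Padded)
reordered_size = ctypes.sizeof(Reordered)
print(padded_size, reordered_size)
```

On common 64-bit ABIs, where 8-byte integers are 8-byte aligned, the interleaved layout typically occupies 32 bytes and the reordered one 24. The exact numbers depend on the platform, so the sketch prints them rather than assuming them.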
2.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open pull requests within the past week to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.
III. Commits
3.1 Commits
Commits This Week: 35
Summarized Commits:
- Function Parameter Fixes: The order of parameters in the `aclrtGetMemInfo` function was corrected to align with the proper implementation as per the documentation.
- Speech Recognition and Synthesis Integration: Speech Recognition and Synthesis functionalities were integrated into the server's user interface, addressing and fixing related issues.
- Quantized Model Issues: Issues related to quantized base models in the 'export-lora' examples were resolved, as indicated by pull request #8687.
- NULL Pointer Dereference Prevention: A potential NULL pointer dereference issue in the `ggml` module was addressed by ensuring the `ggml_init` function handles cases where no unused context is found.
- Build and Compile Warning Fixes: Build issues and compile warnings related to the `fabs` function in the `llama` project were resolved, as indicated by the message "llama : fix build + fix fabs compile warnings (#8683)".
- Windows on ARM Build Improvements: The build process for Windows on ARM, specifically targeting Snapdragon X, was improved, including reverting a previous commit and updating the documentation.
- Printf Statement Fixes: Issues related to printf statements in test files were resolved, as indicated by the message 'tests : fix printfs (#8068)'.
- Multi-GPU Issue Resolution in SYCL: A multi-GPU issue in SYCL was addressed and resolved, with contributions from Intel's Chen Xi and Hengyu Meng.
- New Function in ggml: The `ggml_cpu_has_llamafile()` function was introduced and utilized within the ggml project.
- Example and Build Process Updates: The `finetune` and `train-text-from-scratch` examples were removed, the build process was fixed, the help message was updated, and a small typo related to `export-lora` was corrected.
- Documentation Updates: References to "quantum models" were replaced with "quantized models" in the imatrix and server README files.
- Sliding Window for phi3: Sliding-window support was introduced for the phi3 model, a typo was corrected, and the `convert_hf_to_gguf.py` script was updated to incorporate the phi3 sliding window functionality.
- README File Update: The README file was updated to include a link to a game created by the author that relies on the llama dependency.
- SYCL CI Build Enhancements: The SYCL CI builds were updated to include both static and dynamic libraries for testing purposes in the Llama project.
- User Interface List Update: The user interface list in the README file was updated.
- Function Correction for Mistral: The `llama_chat_format_single` function was corrected for Mistral, a typo was fixed, and the use of `printf` was incorporated.
- Reintroduction of `-fsycl` Flag: The previously removed `-fsycl` flag was reintroduced to the `GGML_EXTRA_LIBS` configuration, addressing issue #8667.
- New Feature Addition: The `llama_lora_adapter_clear` feature was added, as referenced in pull request #8653.
- Example Fixes and Improvements: The `llama-export-lora` example was fixed by adding more logging, rejecting merging subsets, improving checks, and correcting typos.
- URL Parsing Fix: An issue in the server was addressed by fixing the URL.parse function within the user interface, as referenced in issue #8646.
- CMake Configuration Updates: The CMake configuration was updated to support NVIDIA hardware and an open-source compiler, adding support for non-release versions of DPC++ and oneMKL.
- Project Reorganization: The Llama project was reorganized by moving vocabulary, grammar, and sampling code into separate files, deprecating certain functions, updating dependencies, and redirecting external APIs to internal ones with a "_impl" suffix.
- Vulkan Support Enhancements: Multiple issues and enhancements were addressed, including fixing compile errors in Vulkan matmul tests, adding support for Vulkan IQ4_NL, and resolving support issues for Vulkan DeepSeek-Coder-V2-Lite MoE.
- RDNA2 Architecture Support: All RDNA2 architectures were allowed to utilize the `__builtin_amdgcn_sdot4` intrinsic by replacing the specific check for gfx1030 with a more generic RDNA2 define.
- Contribution Guidelines Update: The process of pull request squashing was clarified, a typo was corrected, and a list of modules was added to the contribution guidelines.
- Scratch Size Allocation Fix: The scratch size allocation for the softmax function in the SYCL project was corrected, as referenced in issue #8642.
- Codeshell Support Fix: The issue of codeshell support in the llama project was addressed by fixing its implementation and adjusting the order of codeshell and smollm to align with the enum sequence.
- SmolLm Pre-tokenizer Support: Support for the SmolLm pre-tokenizer was introduced in the llama project, including updates to relevant scripts and files, handling regex, and removing certain `.inp` and `.out` gguf files.
- Python Code Style Adjustments: Various stylistic adjustments were made to Python files, including removing superfluous parentheses, eliminating unused arguments, replacing an unused variable with an underscore, initializing certain attributes, renaming a constant to uppercase, and preventing the redefinition of a variable.
- Tokenizer Flag Overrides: The ability to override tokenizer flags in the llama project was introduced.
- Tokenizer Test Re-enablement: Tokenizer tests for MPT and DeepSeek were re-enabled, duplicated vocabularies were removed, and the CMake configuration was updated.
- Mistral Nemo Inference Support: Mistral Nemo inference support was added to the llama project.
- Server Documentation Update: The server documentation was updated to clarify the usage of the `n_keep` parameter when a beginning-of-sequence (BOS) token is present.
- RISC-V Compilation Error Fix: A compilation error specific to the RISC-V architecture in the ggml project was addressed.
- Android Example Generation Fix: The issue in the Android example generation process was addressed by ensuring the `completion_loop()` function returns NULL instead of an empty string when the generation ends.
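The phi3 sliding-window item in the commit list above refers to sliding-window attention, where each token attends only to a bounded number of recent tokens instead of the full preceding context. The following is a minimal sketch of the masking rule only, not the llama.cpp implementation. The function names are hypothetical, and the convention that the window includes the current token is an assumption made for illustration.

```c
#include <stdbool.h>

/* True when query position q may attend to key position k under a causal
 * sliding window of `window` tokens. Assumed convention: the window
 * includes the current token, so window == 1 means self-attention only. */
bool swa_can_attend(int q, int k, int window) {
    if (k > q) {
        return false; /* causal: never look at future positions */
    }
    return (q - k) < window; /* only the most recent `window` positions */
}

/* Fills a row-major n x n mask: 1 = visible, 0 = masked out. */
void swa_build_mask(int n, int window, unsigned char *mask) {
    for (int q = 0; q < n; q++) {
        for (int k = 0; k < n; k++) {
            mask[q * n + k] = swa_can_attend(q, k, window) ? 1 : 0;
        }
    }
}
```

With `n = 4` and `window = 2`, each row of the mask has at most two visible entries: the token itself and its immediate predecessor.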
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, or created at least 1 pull request in the past month.
Contributor | Commits | Pull Requests | Issues |
---|---|---|---|
GitHub | 184 | 0 | 0 |
ggerganov | 0 | 39 | 2 |
ngxson | 0 | 15 | 2 |
0wwafa | 0 | 0 | 15 |
JohannesGaessler | 0 | 11 | 0 |
compilade | 0 | 10 | 0 |
HanClinto | 0 | 9 | 1 |
Georgi Gerganov | 9 | 0 | 0 |
danbev | 0 | 9 | 0 |
Someone | 8 | 0 | 0 |
mofosyne | 0 | 8 | 0 |
RunningLeon | 0 | 3 | 3 |
luoyu-intel | 0 | 4 | 0 |
slaren | 0 | 4 | 0 |
Alcpz | 0 | 4 | 0 |
iboB | 0 | 4 | 0 |
maruel | 0 | 2 | 2 |
sorasoras | 0 | 0 | 4 |
oldgithubman | 0 | 0 | 4 |
AidanBeltonS | 0 | 3 | 0 |
perpendicularai | 0 | 1 | 2 |
iamlemec | 0 | 3 | 0 |
AndreasKunar | 0 | 1 | 2 |
stduhpf | 0 | 1 | 2 |
joeatodd | 0 | 3 | 0 |
RakshitAralimatti | 0 | 0 | 3 |
yli147 | 0 | 0 | 3 |
mgroeber9110 | 0 | 1 | 1 |
jpodivin | 0 | 2 | 0 |
OuadiElfarouki | 0 | 2 | 0 |
LDLINGLINGLING | 0 | 1 | 1 |
foldl | 0 | 2 | 0 |
dspasyuk | 0 | 1 | 1 |
mtasic85 | 0 | 2 | 0 |
standby24x7 | 0 | 2 | 0 |
b4b4o | 0 | 1 | 1 |
kevmo314 | 0 | 2 | 0 |
jaime-m-p | 0 | 2 | 0 |
jdomke | 0 | 2 | 0 |
zhipenghan | 0 | 2 | 0 |
nicholaiTukanov | 0 | 1 | 1 |
msy-kato | 0 | 2 | 0 |
ClarkChin08 | 0 | 2 | 0 |
0cc4m | 0 | 2 | 0 |
airMeng | 0 | 2 | 0 |
AmgadHasan | 0 | 1 | 1 |
amochkin | 0 | 1 | 1 |
Stillerman | 0 | 1 | 1 |
kaetemi | 0 | 1 | 1 |
jeroen-mostert | 0 | 1 | 1 |
QIANXUNZDL123 | 0 | 0 | 2 |
mirek190 | 0 | 0 | 2 |
ch1y0q | 0 | 0 | 2 |
SimplyCorbett | 0 | 0 | 2 |
yancaoweidaode | 0 | 0 | 2 |
Battlehub0x | 0 | 0 | 2 |
Arashimu | 0 | 0 | 2 |
MathiasSchindler | 0 | 0 | 2 |
Sokartecnologi | 0 | 0 | 2 |
bartowski1182 | 0 | 0 | 2 |
ericcurtin | 0 | 0 | 2 |
vt-alt | 0 | 0 | 2 |
abetlen | 0 | 1 | 0 |
ochafik | 0 | 1 | 0 |
AlexsCode | 0 | 1 | 0 |
iacore | 0 | 1 | 0 |
Zor-X-L | 0 | 1 | 0 |
crashr | 0 | 1 | 0 |
hackingthekernel | 0 | 1 | 0 |
andy-tai | 0 | 1 | 0 |
mcharytoniuk | 0 | 1 | 0 |
Quantaindew | 0 | 1 | 0 |
MistApproach | 0 | 1 | 0 |
ho2103 | 0 | 1 | 0 |
hopto-dot | 0 | 1 | 0 |
akemimadoka | 0 | 1 | 0 |
NeoZhangJianyu | 0 | 1 | 0 |
dwoolworth | 0 | 1 | 0 |
daniandtheweb | 0 | 1 | 0 |
pouwerkerk | 0 | 1 | 0 |
bviksoe | 0 | 1 | 0 |
diimdeep | 0 | 1 | 0 |
prfd | 0 | 1 | 0 |
youth123 | 0 | 1 | 0 |
brochure | 0 | 1 | 0 |
agray3 | 0 | 1 | 0 |
yeahdongcn | 0 | 1 | 0 |
daghanerdonmez | 0 | 1 | 0 |
andysalerno | 0 | 1 | 0 |
fairydreaming | 0 | 1 | 0 |
laik | 0 | 1 | 0 |
monatis | 0 | 1 | 0 |
AragonerUA | 0 | 1 | 0 |
kriation | 0 | 1 | 0 |
danielhanchen | 0 | 1 | 0 |
teleprint-me | 0 | 1 | 0 |
65a | 0 | 1 | 0 |
NikolaiLyssogor | 0 | 1 | 0 |
sbonds | 0 | 1 | 0 |
SommerEngineering | 0 | 1 | 0 |
amitj1jan | 0 | 1 | 0 |
nopperl | 0 | 1 | 0 |
EZForever | 0 | 1 | 0 |
m18coppola | 0 | 1 | 0 |
thxCode | 0 | 1 | 0 |
hankeke303 | 0 | 1 | 0 |
devojony | 0 | 1 | 0 |
zqb-all | 0 | 1 | 0 |
Xarbirus | 0 | 1 | 0 |
FanShupei | 0 | 1 | 0 |
themanyone | 0 | 1 | 0 |
Oliver-Y | 0 | 1 | 0 |
0x4139 | 0 | 1 | 0 |
Ujjawal-K-Panchal | 0 | 1 | 0 |
fmz | 0 | 1 | 0 |
MorganRO8 | 0 | 1 | 0 |
jmorganca | 0 | 1 | 0 |
ElYaiko | 0 | 1 | 0 |
sasha0552 | 0 | 1 | 0 |
DavidKorczynski | 0 | 1 | 0 |
bsquizz | 0 | 1 | 0 |
zhentaoyu | 0 | 1 | 0 |
wangshuai09 | 0 | 1 | 0 |
Smupk2778 | 0 | 0 | 1 |
Green-Sky | 0 | 0 | 1 |
eliranwong | 0 | 0 | 1 |
quarterturn | 0 | 0 | 1 |
rudiservo | 0 | 0 | 1 |
werruww | 0 | 0 | 1 |
unclemusclez | 0 | 0 | 1 |
JohnClaw | 0 | 0 | 1 |
micsthepick | 0 | 0 | 1 |
kherud | 0 | 0 | 1 |
duynt575 | 0 | 0 | 1 |
tomgm777 | 0 | 0 | 1 |
chiranko | 0 | 0 | 1 |
Gomez12 | 0 | 0 | 1 |
starP-W | 0 | 0 | 1 |
nathanodle | 0 | 0 | 1 |
tybalex | 0 | 0 | 1 |
akhilkapil | 0 | 0 | 1 |
LiquidGunay | 0 | 0 | 1 |
flatsiedatsie | 0 | 0 | 1 |
tihom77 | 0 | 0 | 1 |
lorihuang | 0 | 0 | 1 |
ctb111 | 0 | 0 | 1 |
aahouzi | 0 | 0 | 1 |
jim-plus | 0 | 0 | 1 |
Yan-Xiangjun | 0 | 0 | 1 |
josharian | 0 | 0 | 1 |
Aridbhdkkj | 0 | 0 | 1 |
AUTOMATIC1111 | 0 | 0 | 1 |
isaac-mcfadyen | 0 | 0 | 1 |
d-kleine | 0 | 0 | 1 |
warren-lei | 0 | 0 | 1 |
andreys42 | 0 | 0 | 1 |
gpacix | 0 | 0 | 1 |
guinmoon | 0 | 0 | 1 |
bandoti | 0 | 0 | 1 |
apresence | 0 | 0 | 1 |
kasrahabib | 0 | 0 | 1 |
Hardik-Choraria | 0 | 0 | 1 |
99991 | 0 | 0 | 1 |
Sakura4036 | 0 | 0 | 1 |
markat1 | 0 | 0 | 1 |
amakropoulos | 0 | 0 | 1 |
MeemeeLab | 0 | 0 | 1 |
joshknnd1982 | 0 | 0 | 1 |
sealad886 | 0 | 0 | 1 |
lin72h | 0 | 0 | 1 |
jie80219 | 0 | 0 | 1 |
nne998 | 0 | 0 | 1 |
StatPan | 0 | 0 | 1 |
1cekrim | 0 | 0 | 1 |
bong-furiosa | 0 | 0 | 1 |
djain-fujitsu | 0 | 0 | 1 |
m828 | 0 | 0 | 1 |
Fulgurance | 0 | 0 | 1 |
criminact | 0 | 0 | 1 |
VelocityRa | 0 | 0 | 1 |
dafei2017 | 0 | 0 | 1 |
metal3d | 0 | 0 | 1 |
Emmanuel97460 | 0 | 0 | 1 |
vmarchenkoff | 0 | 0 | 1 |
jpoly1219 | 0 | 0 | 1 |
ciekawy | 0 | 0 | 1 |
DanielusG | 0 | 0 | 1 |
hgftrdw45ud67is8o89 | 0 | 0 | 1 |
qnixsynapse | 0 | 0 | 1 |
rhvall | 0 | 0 | 1 |
zucchini-nlp | 0 | 0 | 1 |
hipudding | 0 | 0 | 1 |
suncloudsmoon | 0 | 0 | 1 |
newsletternewsletter | 0 | 0 | 1 |
simon-krannig | 0 | 0 | 1 |
RonanKMcGovern | 0 | 0 | 1 |
nicoboss | 0 | 0 | 1 |
MangoTCF | 0 | 0 | 1 |
TanLam01 | 0 | 0 | 1 |
peter-ch | 0 | 0 | 1 |
auriocus | 0 | 0 | 1 |
cloud11665 | 0 | 0 | 1 |
wencan | 0 | 0 | 1 |
Vaibhavs10 | 0 | 0 | 1 |
Tureti | 0 | 0 | 1 |
tc-wolf | 0 | 0 | 1 |
akashaero | 0 | 0 | 1 |
artiomborovinskii | 0 | 0 | 1 |
mudler | 0 | 0 | 1 |
Azirine | 0 | 0 | 1 |
creeves-anaconda | 0 | 0 | 1 |
hackey | 0 | 0 | 1 |
chigkim | 0 | 0 | 1 |
IcyXi | 0 | 0 | 1 |
8XXD8 | 0 | 0 | 1 |
matteoserva | 0 | 0 | 1 |
Volko61 | 0 | 0 | 1 |
riedgar-ms | 0 | 0 | 1 |
mgonzs13 | 0 | 0 | 1 |
yuanzhiyong1999 | 0 | 0 | 1 |
windowsagent | 0 | 0 | 1 |
ElaineWu66 | 0 | 0 | 1 |
ExtReMLapin | 0 | 0 | 1 |