Weekly GitHub Report for Llama.cpp - 2024-07-29 15:10:06
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. Issues
1.1 Open Issues
Open Issues This Week: 25
Summarized Issues:
- macOS GitHub Actions hosted runners issues: This issue describes a bug where macOS GitHub Actions hosted runners hang or fail to return results when running small models like Qwen2 1.5B and Phi-3 mini. These models work fine on other operating systems and locally on an M3 Max. This suggests a potential problem with GPU virtualization on GitHub's macOS runners.
- Integrated GPU issues on Framework Laptop 16: This issue describes a bug where attempting to use the integrated GPU (iGPU) on a Framework Laptop 16 with an AMD Radeon RX 7700S and AMD Radeon 780M results in crashes. The crashes are due to out-of-memory errors and segmentation faults when running the `llama-server` application. This indicates a problem with memory management on these specific GPUs.
- OpenAI API `max_tokens` parameter bug: This issue describes a bug where non-chat completions using the OpenAI API do not respect the `max_tokens` parameter. This causes the model to generate tokens indefinitely until the context length is reached. The current workaround is to stream the response and manually close the connection upon reaching the desired token count (see the first sketch after this list).
- `llama.pc` file issues: This issue pertains to the `llama.pc` file containing an incorrect `Version:` line, resulting in erroneous output from the `pkg-config --print-provides` command. This affects downstream distributions by failing to auto-detect dependencies due to an empty `@PROJECT_VERSION@` variable. Another issue involves the incorrect packaging of the `llama.pc` file into `/usr/lib/pkgconfig/llama.pc` instead of the architecture-dependent "libdir" directory.
- Docker image conversion issues: This issue describes a bug encountered when using the llama.cpp Docker image to convert certain vector models to gguf format. The conversion results in a `NotImplementedError` due to an unrecognized BPE pre-tokenizer. The issue requests guidance on resolving the problem and identifying compatible older versions of the software.
- Support for Llama 3.1 model: This issue requests the addition of support for the newly released Llama 3.1 model in the `llama.cpp` project. Necessary updates for RoPE scaling are included to ensure coherent text generation. This would enhance the project's compatibility with the latest models.
- HuggingFace documentation update: This issue highlights that the HuggingFace documentation for the "Use This Model" button is referring to outdated binary names for llama.cpp. It suggests updating the instructions to use `llama-cli` instead of `main`. This would ensure users follow the correct procedures.
- GGML_ASSERT error with Meta-Llama-3.1-8B-Instruct models: This issue involves a GGML_ASSERT error occurring when attempting to run Meta-Llama-3.1-8B-Instruct models with Q8 and Q4 quantization on the SYCL backend using an Intel ARC A770 GPU on Windows 11. Other models work fine, indicating a specific compatibility issue with these models.
- Quantization issues with Llama 3.1 70B model: This issue involves a bug where attempting to quantize the Llama 3.1 70B model to Q4_K_S using imatrix results in NaN values. The issue occurs specifically at block 48, and similar issues are observed with other quant sizes like Q3_K_L and Q3_K_M. This indicates a problem with the quantization process.
- Llama 3.1 model breaking on macOS: This issue describes a bug introduced in version b3383 that causes the Llama 3.1 model to break when running specific commands. The error results in an "input is empty" message on macOS. A workaround and partial fixes are mentioned in the comments.
- Docker tags issue: This issue is about Docker tags incorrectly starting with build number b1 followed by the commit hash. The problem is due to the action/checkout in docker.yml not setting the build depth, which should be configured to fetch-depth: 0 to match the build.yml workflow.
- Memory usage issues on Mac: This issue describes a problem where the memory usage for running inference on a llama model fluctuates between cached files and wired memory. This causes inefficiencies and crashes when attempting to keep layers in wired memory on a Mac system. The issue highlights the need for better memory management.
- Reintroduction of chat/instruct templates: This issue is a feature request to reintroduce the previously removed chat/instruct templates in the llama.cpp project. The user found them extremely useful and is experiencing difficulties with the current alternatives. This would improve usability for those relying on these templates.
- Conversion issues with fine-tuned Code Llama model: This issue describes a bug encountered when attempting to convert a fine-tuned Code Llama model file into a GGUF file. The conversion results in an "out of range" error due to token IDs exceeding the maximum allowed value. This indicates a problem with the tokenization process.
- Multi-core support for full GPU offload: This issue requests the implementation of multi-core support for full GPU offload in the llama.cpp project. This would improve performance on systems with lower single-core performance by enabling the `--threads` argument. The feature would enhance the project's efficiency.
- Corrupted outputs with multiple CUDA GPUs: This issue describes a bug where models produce corrupted outputs when offloading to multiple CUDA GPUs. Specific problems include incorrect parsing of prompts and reuse of information from previous prompts. This indicates a problem with the GPU offloading process.
- Offloading specific operations to GPU: This issue is about inquiring whether it is possible to offload specific operations, such as attention calculations, to the GPU while keeping other operations, like layer normalization, on the CPU. This would allow for more efficient use of resources in a GitHub project.
- Quantization process error on mamba architecture: This issue describes a bug where an error occurs during the quantization process to gguf q5_k_m on a mamba architecture. The error is due to a missing 'architectures' key in the model's parameters, resulting in a KeyError. This indicates a problem with the model's metadata.
- LlamaCpp tokenizer bug: This issue describes a bug in the LlamaCpp tokenizer where it fails to correctly tokenize partial UTF-8 byte sequences. This results in an invalid character error and breaks functionality, as demonstrated with the character '歪' being split into two tokens. This indicates a problem with the tokenizer's handling of UTF-8 sequences (see the second sketch after this list).
- Addition of `chat_example` property: This issue requests the addition of a `chat_example` property to the `/props` endpoint of the server. This would provide a recommended chat template for models, facilitating easier access and verification of templates through the server UI.
- Compilation issues on Linux: This issue describes a problem where the user is unable to compile the llama.cpp project on a Linux operating system using either the make or cmake methods. Different errors are encountered with each approach despite trying multiple versions of gcc. This indicates a problem with the build process on Linux.
- Missing 'libggml.so' file during installation: This issue describes a bug where attempting to install a project using CMake on a Linux system fails due to a missing 'libggml.so' file. The error message indicates a problem during the installation process. This highlights the need for proper dependency management.
- GPU acceleration on Android: This issue is about seeking guidance on how to utilize the GPU on an Android device to accelerate inference for the llama.cpp demo. The user encountered problems with OpenCL and is looking for alternatives like Vulkan. This would enhance the performance of the demo on Android devices.
- Lightweight tests for LoRA: This issue is about adding lightweight tests for LoRA by training adapters based on specific datasets. The tests would be conducted with a command-line interface and the outputs verified. An optional task includes creating small models with different architectures.
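For the OpenAI API `max_tokens` bullet above, the reported workaround can be sketched in a few lines. The snippet below is only an illustration of the client-side pattern, assuming an OpenAI-compatible `/v1/completions` endpoint served by `llama-server` on `localhost:8080` with SSE streaming; the URL, payload fields, and token limit are placeholder assumptions, not values taken from the issue.

```python
import json
import requests  # assumed available; any HTTP client with streaming works

# Placeholder endpoint for a local llama-server instance (adjust as needed).
URL = "http://localhost:8080/v1/completions"
TOKEN_LIMIT = 64  # enforce the desired max_tokens on the client side

payload = {"prompt": "Explain RoPE scaling in one paragraph.", "stream": True}

with requests.post(URL, json=payload, stream=True) as resp:
    received = 0
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue  # skip blank keep-alive lines and non-data frames
        chunk = raw[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        piece = json.loads(chunk)["choices"][0].get("text", "")
        print(piece, end="", flush=True)
        received += 1  # each streamed chunk roughly corresponds to one token
        if received >= TOKEN_LIMIT:
            break  # leaving the with-block closes the connection
```

Counting streamed chunks only approximates a token count, which is why the issue describes this as a workaround rather than a fix.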
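The partial UTF-8 tokenizer bullet above boils down to detokenized pieces that can end in the middle of a multi-byte character: '歪' is three bytes in UTF-8, and a piece boundary can fall inside it. The sketch below is not llama.cpp code; it only demonstrates why byte-level buffering with an incremental decoder is needed before token pieces are rendered as text.

```python
import codecs

# '歪' encodes to b"\xe6\xad\xaa"; pretend the detokenizer emits it across two pieces.
pieces = [b"\xe6", b"\xad\xaa"]

# Decoding each piece on its own fails on the incomplete first piece.
try:
    pieces[0].decode("utf-8")
except UnicodeDecodeError as exc:
    print("naive per-piece decode fails:", exc)

# An incremental decoder buffers the partial sequence and emits text once it is complete.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(p) for p in pieces)
print("buffered decode:", text)  # -> 歪
```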
1.2 Top 5 Active Issues:
We consider active issues to be those that have generated the most discussion in their comments.
- Feature Request: Proper Llama 3.1 Support in llama.cpp: This issue is a feature request to add support for Llama 3.1 in the llama.cpp project, which involves updating the model to handle new requirements such as RoPE scaling to ensure coherent text generation. The motivation behind this request is to unlock the full potential of the Llama 3.1 model, as the current implementation without these updates results in suboptimal text generation.
- The comments discuss various aspects of implementing Llama 3.1 support, including function calling, new tokens, and template adjustments. Users share their experiences with different configurations and quantizations, noting issues with accuracy and performance. Some users report success with specific settings, while others highlight ongoing problems, particularly with long context handling and CUDA implementation. The conversation also includes links to relevant code changes and external resources for further testing and validation.
- Number of comments: 111
- server : improvements and maintenance: This issue is about improving and maintaining the server example in the GitHub project, which has grown in functionality but is currently unstable and missing important features. The issue aims to track these points and draw community attention to them, as some tasks are significant and require considerable effort to complete.
- The comments discuss various improvements and suggestions, including adding look-ahead decoding, contrastive search, speculative sampling, and function calling support. They also touch on the need for better error handling, refactoring for stability and performance, and the potential use of templating systems like Jinja. There is a consensus on the importance of making the server more robust and user-friendly, with some debate on the best approaches to achieve these goals.
- Number of comments: 108
- Support BitNet b1.58 ternary models: This issue is about implementing support for BitNet b1.58 ternary models, which use 1.58 bits with ternary values (1, 0, -1) for training, showing performance improvements over fp16 models. The issue highlights the potential for running larger models with less VRAM and discusses the feasibility and benefits of integrating this new training method into llama.cpp.
- The comments discuss the novelty and potential of training models directly in a quantized state, the technical details and challenges of implementing ternary models, the need for further validation and code release from the original authors, and the community's interest in exploring and optimizing this approach for practical use.
- Number of comments: 88
- Investigate gemma 2 generation quality: This issue is about investigating the quality of the Gemma 2 generation in the llama.cpp project, with initial reports suggesting potential problems with the tokenizer. The discussion includes various tests and observations, including discrepancies in tokenization and output quality, especially in handling specific tokens and formatting issues.
- The comments discuss the hard-coded window size of Gemma 2, issues with math questions indicating potential tokenizer problems, differences in quantization quality, and various tests comparing outputs from different implementations. There are also suggestions for fixes, such as changing the vocabulary conversion method and adjusting logit softcapping values, with some users reporting improvements after applying these changes.
- Number of comments: 88
- Support for Phi-3 models: This issue is about adding support for Microsoft's recently released Phi-3 models, which come in three variants: mini, small, and medium. The request is to integrate these new models into the project, ensuring compatibility and functionality.
- The comments discuss various aspects of integrating Phi-3 models, including successful initial tests, issues with long context support, and specific errors encountered during conversion. There are also discussions about the need for new prompt templates, the implementation of longrope techniques, and the eventual merging of a pull request that adds support for Phi-3 4K context length models. However, support for the 128K context length variant remains unresolved, with ongoing efforts and community contributions to address this.
- Number of comments: 83
1.3 Top 5 Quiet Issues:
We consider quiet issues to be those that have been open the longest in this project. The team should work together to get these issues resolved and closed as soon as possible.
- llama : add test for saving/loading sessions to the CI: This issue involves adding a test for saving and loading sessions to the continuous integration (CI) process of the llama project. The task requires understanding the `save-load-state` example and incorporating a simple test into the `ci/run.sh` script.
- Open for 347 days, 01 hours, 18 minutes
- llama : tool for evaluating quantization results per layer: This issue proposes the development of a tool to evaluate quantization results per layer by comparing full-precision and quantized models using `ggml` exported graphs. The tool aims to provide detailed statistical information on intermediate results after each graph node to identify where precision is needed to minimize quantization differences.
- Open for 338 days, 05 hours, 57 minutes
- CUDA non-determinism on identical requests: This issue describes a problem where identical requests to a server using CUDA for layer offloading return different responses the first time, but consistent responses thereafter, suggesting a potential caching issue. The expected behavior is that the output should remain the same when parameters and seed are constant, and this non-deterministic behavior is not observed with Metal offload or without CUDA offload.
- Open for 335 days, 22 hours, 22 minutes
- Windows ROCm Build.: This issue involves a user attempting to compile the llama.cpp project for ROCm on a Windows system, encountering difficulties due to CMake's default paths for the clang and clang++ compilers, which differ from their actual locations on Windows. The user reports that attempts to compile using Visual Studio and CMake result in an error message indicating that "CC is not recognized as an internal or external command."
- Open for 335 days, 20 hours, 44 minutes
- Please support the also official Falcon-rw-1b and Falcon-rw-7b model variants: This issue requests support for the Falcon-RW-1B and Falcon-RW-7B model variants, which are official versions of the Falcon model series. The user has encountered errors when attempting to convert and quantize these models using the `convert-falcon-hf-to-gguf.py` script, and is seeking assistance or confirmation on whether these models will be supported.
- Open for 334 days, 08 hours, 09 minutes
1.4 Closed Issues
Closed Issues This Week: 31
Average Issue Close Time (This Week): 26.31 days
Summarized Issues:
- Log Probabilities in `create_chat_completions`: Users are experiencing issues with obtaining log probabilities using the `create_chat_completions` function in the `llama-cpp` library. Despite setting `logprobs=True`, the expected log probabilities are not included in the output. This issue affects the functionality of the library for users who rely on log probabilities for their applications.
- Model Conversion Errors: Several issues have been reported regarding errors encountered during model conversion to GGUF format. These include unrecognized rope scaling types, debugging errors in `tensor_mapping.py`, and problems with configuration files and model architecture parameters.
- Server and Streaming Issues: Users have reported various issues with the server and completion streaming. These include special tokens being returned as empty strings, performance slowdowns, infinite loops with batch requests, and server crashes after specific commits.
- SYCL Backend Bugs: Multiple issues have been identified with the SYCL backend, including device index errors, operation failures, and build errors with Intel OneAPI. These bugs affect the stability and functionality of the backend in various scenarios.
- Embedding and Tokenization Issues: Problems have been reported with the embedding endpoint and tokenization processes. These include crashes in the tokenizer, unwanted spaces in tokenization, and incorrect formatting in chat templates.
- Compilation and Build Errors: Users have encountered various compilation and build errors, including issues with `ggml-aarch64.c` on Windows ARM64, CUDA build process failures, and symbol lookup errors due to incorrect library linking.
- Feature Requests: There have been requests for new features, such as multi-session chat processing, support for the SmolLM family of models, and support for the Mistral-Large model from Hugging Face. These requests aim to enhance the functionality and versatility of the project.
- Miscellaneous Bugs: Various other bugs have been reported, including issues with the `export-lora` command, the `llama_print_system_info` function, the `train-text-from-scratch` command, and grammar-related generation differences.
1.5 Issue Discussion Insights
This section analyzes the tone and sentiment of discussions in this project's open issues over the past week, in order to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open issues from the past week.
II. Pull Requests
2.1 Open Pull Requests
Open Pull Requests This Week: 18
Pull Requests:
- Graph Nodes Determination: This pull request proposes to determine the maximum number of graph nodes based on the model information, such as architecture and hyperparameters. It addresses issue #8615 and aims to optimize the graph node allocation. This enhancement is crucial for improving the model's performance and scalability.
- CMakePresets Fix: This pull request addresses a fix in the CMakePresets by ensuring that the host value for the MSVC compiler in the toolchain is correctly set to either x86 or x64. This change aligns with the CMake documentation. It ensures proper configuration and compilation of the project.
- Python Binding and Installation: This pull request introduces a pre-compiled Python binding for llama.cpp using CFFI. It supports both CPU and CUDA 12.5 execution and simplifies installation to a single `pip install` command. This enhancement makes it easier for users to integrate and use the library.
- CLI Template Argument: This pull request introduces an optional `--template` argument to the `llava-cli` tool. It allows users to format the output of bulk image descriptions according to a specified template. This feature enhances the utility and customization of the generated descriptions.
- Library Refactoring: This pull request aims to refactor the `llama` library by moving the `llama_sampling_context`. It updates the sampling API to utilize it instead of `llama_context` and removes `LLAMA_API_INTERNAL`. These changes improve the library's structure and maintainability.
- SHA-256 Tensor Hash: This pull request introduces a SHA-256 tensor hash to the key-value store in the project. It provides a strong cryptographic method for tracking models and ensuring data integrity. This feature is particularly useful for model repository maintainers like Hugging Face (an illustrative hashing sketch follows this list).
- Lookup Example Overhaul: This pull request overhauls the lookup example to use a tree of sequences instead of a single sequence. It aims to improve the prediction accuracy of multiple tokens per evaluation. The change potentially increases the token generation rate by 33-50% through a more efficient intermediate data format and cost function prioritization.
- XLMRoberta Embedding Models: This pull request adds support for XLMRoberta embedding models. It modifies tokenization using the new T5 Unigram work and includes necessary adjustments to the position embedding matrix and Unigram tokenizer. These changes enhance the model's capabilities and compatibility.
- Threadpool Management API: This pull request introduces an API for explicit management and fine-grain control of threadpools. It allows for the creation, pausing, resuming, and releasing of multiple threadpools independently. This optimization improves thread scheduling and performance in various execution contexts.
- Rope Scaling Factors: This pull request introduces the generation and integration of rope scaling factors into the Llama 3.1 model during conversion. It enhances inference performance for context windows exceeding 8192 tokens. This improvement is crucial for handling larger context windows efficiently.
- SYCL Backend Convolution Support: This pull request aims to add convolution support to the SYCL backend for the stablediffusion.cpp project. It serves as a temporary solution with plans to introduce OneDNN for improved convolution performance in the future. This addition enhances the backend's capabilities.
- Dockerfile Curl Installation: This pull request aims to install 'curl' in the runtime layer of the `llama-server.Dockerfile`. It enables docker health checks for the basic server image. This addition ensures better monitoring and maintenance of the server.
- Hash Table Reset Optimization: This pull request aims to reduce the reset cost of hash tables by using a bit table to indicate slot usage instead of a `NULL` pointer. It significantly decreases the memory that needs to be cleared during resets. This optimization improves performance, especially in small models.
- Session File Management: This pull request aims to simplify and unify the session file management in the llama project. It consolidates the format for `seq_id`-specific and whole KV cache session files. These changes reduce the number of places that need updating when changes are made and introduce several improvements and breaking changes to enhance maintainability and performance.
- SYCL Backend TIMESTEP_EMBEDDING Operator: This pull request introduces a `TIMESTEP_EMBEDDING` operator for the SYCL backend. It is modeled after the corresponding CUDA kernel and serves as a temporary solution to support the stablediffusion.cpp project. This addition enhances the backend's functionality.
- Runtime SVE Configuration: This pull request involves updating the code to read the runtime SVE configuration of the CPU. The changes are moved from `ggml.c` to `ggml-quants.c`, and it supersedes a previous pull request, which will be closed. This update ensures accurate runtime configuration.
- Multi-NPU Execution Fix: This pull request resolves issue #8580 by fixing the Multi-NPU execution error on the `CANN` backend. It allows users to utilize multiple NPUs with the `-sm layer` option. This fix enhances the backend's multi-NPU capabilities.
- CLI No-Warmup Option: This pull request introduces a `--no-warmup` option to `llama-cli`. It allows users to bypass the warmup `llama_decode` call. This option can be particularly useful for debugging purposes.
2.2 Closed Pull Requests
Closed Pull Requests This Week: 44
Summarized Pull Requests:
- Memory Optimization on 64-bit Platforms: This topic focuses on optimizing memory usage by aligning various structs, resulting in reduced byte sizes for several data structures. The pull requests address the alignment of `ggml_type_traits_t`, `llama_batch`, `llama_model_params`, `hash_node`, `ggml_compute_state`, and `gguf_tensor_info`. These changes aim to improve memory efficiency and performance on 64-bit platforms (a small field-ordering sketch follows this list).
- Docker Container Library Updates: This topic addresses the issue of the missing `libgomp.so.1` library in the `llama.cpp` Docker container. The pull request updates the Dockerfile to include the installation of `libgomp1`, ensuring the necessary library is present. This prevents related errors and ensures smoother operation of the Docker container.
- Performance State Management with NvAPI: This topic introduces support for changing the performance state using NvAPI in the llama project. The pull request includes implementing performance state switching functions and conditional compilation based on CUDA. It also plans for logging and synchronization across multiple instances.
- Chat Template Adjustments: This topic makes adjustments to the pre-defined chat templates for Llama2, Llama3, and Zephyr in the new server UI. The pull request aligns them with recommended versions and removes redundant start-of-text tokens for the Llama models. These changes aim to improve the user experience and template accuracy.
- Python Script Style Improvements: This topic involves making stylistic adjustments to Python scripts. The pull request removes superfluous parentheses, unused arguments, and variables, renames constants, and prevents variable redefinition. These changes aim to improve code readability without affecting functionality.
- Runtime SVE Configuration Reading: This topic addresses the issue of accurately reading the runtime Scalable Vector Extension (SVE) configuration of the CPU in the ggml library. The pull request uses `prctl(PR_SVE_GET_VL)` instead of `svcntb()`. This ensures correct configuration reading and improves compatibility.
- Hosting Multiple Fine-Tuned Models: This topic introduces a method to host multiple fine-tuned derived models on memory-constrained devices. The pull request splits GGUF files into shared and task-specific tensors, allowing dynamic loading and swapping of task-specific tensors. This approach keeps only one copy of the shared tensors in memory.
- Documentation Updates: This topic includes various updates to the documentation. The pull requests add AI Studio to the list of user interfaces, clarify the `n_keep` parameter, and correct the term "quantum models" to "quantized models". These changes aim to improve clarity and accuracy in the documentation.
- Code Refactoring: This topic involves refactoring the `llama` code to improve organization and prepare for future API changes. The pull requests move vocabulary, grammar, and sampling implementations into separate files and update Swift and Android bindings. These changes enhance code clarity and maintainability.
- Windows and ARM Support: This topic addresses improvements for running the project on Windows with Snapdragon X. The pull request adds documentation for building on Windows, especially for ARM, and fixes issues related to MSVC's lack of support for C in-line assembly for ARM. These changes ensure better compatibility and support for Windows and ARM platforms.
- Multi-GPU and SYCL Improvements: This topic addresses issues related to multi-GPU crashes and SYCL support. The pull requests fix a multi-GPU crash on SYCL, add the `-fsycl` flag back to `GGML_EXTRA_LIBS`, and ensure CI builds both static and dynamic libraries for the GGML_SYCL backend. These changes improve stability and compatibility for multi-GPU and SYCL environments.
- CodeShell Support Fixes: This topic addresses issues with CodeShell support that arose after updating `llama.cpp`. The pull request syncs with the latest version of the repository and implements necessary fixes. These changes ensure continued compatibility and functionality of CodeShell.
- Model and Embedding Support: This topic includes updates to support different models and embeddings. The pull requests address shape issues in Mistral Nemo, update the `llama-export-lora` example for the new LoRA format, and add support for the SmolLM pre-tokenizer and XLMRoberta model. These changes enhance the flexibility and compatibility of the project with various models and embeddings.
- Tokenizer and Quantization Enhancements: This topic addresses various enhancements related to tokenizers and quantization. The pull requests re-enable tokenizer tests, add IQ4_NL support to Vulkan, allow overriding specific tokenizer flags, and dequantize tensors from the base model for compatibility with lora adapters. These changes improve the functionality and flexibility of tokenizers and quantization processes.
- SYCL and DPC++ Build Support: This topic enables the llama.cpp project to be built using non-release versions of DPC++ and oneMKL. The pull request uses `clang++` instead of `icpx`, removes some duplicate or unnecessary flags, and slightly rearranges the build logic. These changes enhance the flexibility and compatibility of the build process.
- Browser Compatibility Fixes: This topic addresses compatibility issues in the llama-server UI. The pull request replaces the `URL.parse` method, which is not supported in Safari, with the more universally supported `new URL()` constructor. This ensures consistent functionality across all browsers.
- Adapter Management: This topic introduces the `llama_lora_adapter_clear` function. The pull request allows users to clear loaded adapters in `llama_context` to facilitate switching adapters without knowing which ones are currently loaded. This enhances the flexibility and usability of adapter management.
- System Message Formatting: This topic addresses the issue of incorrect formatting of system messages in the `llama_chat_format_single` function for Mistral. The pull request adds logs and test cases and provides an example of the output with the proposed changes. These changes ensure accurate and consistent message formatting.
- System Info Display Fixes: This topic adds a new function, `ggml_cpu_has_llamafile()`, to the ggml library. The pull request uses this function to fix the system info display issue when using `llamafile`, addressing issue #8656. This ensures accurate system information display.
- Example Removal and Corrections: This topic involves removing non-functional examples and making minor corrections. The pull request removes the `finetune` and `train-text-from-scratch` examples due to their high maintenance requirements and corrects the `export-lora/README` file. These changes reduce maintenance overhead and improve documentation accuracy.
- README Updates: This topic updates the README.md file. The pull request adds a link to a game created by the contributor that depends on the llama library. This addition highlights the practical applications of the library.
- Voice Mode in UI: This topic adds a simple voice mode to the UI. The pull request incorporates features such as speech-to-text initiation, automatic message sending after speech recognition, text-to-speech voice options, and play/pause functionality for messages. These features have been tested across multiple browsers and operating systems.
- Compile Warning Fixes: This topic addresses build issues and fixes compile warnings related to the `fabs` function. The pull request ensures that the code compiles without warnings, improving code quality and maintainability.
- Lifecycle Script Support: This topic introduces support for lifecycle scripts in the `common` module of the project. The pull request allows specific scripts to be executed at various stages of the application's lifecycle, enhancing the flexibility and control over performance state management.
- NULL Pointer Dereference Prevention: This topic addresses a potential NULL pointer dereference issue in the `ggml_init` function. The pull request ensures the code bails out if no unused context is found, preventing a segmentation fault during subsequent calls to `ggml_set_no_alloc`.
- Parameter Order Correction: This topic addresses the issue of parameter order in the usage of the `aclrtGetMemInfo` function. The pull request ensures it aligns with the correct usage as documented, improving code accuracy and functionality.
2.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions in this project's open pull requests over the past week, in order to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open pull requests from the past week.
III. Commits
3.1 Commits
Commits This Week: 35
Summarized Commits:
- Function Parameter Fixes: The order of parameters in the `aclrtGetMemInfo` function was corrected to align with the proper implementation as per the documentation.
- Speech Recognition and Synthesis Integration: Speech Recognition and Synthesis functionalities were integrated into the server's user interface, addressing and fixing related issues.
- Quantized Model Issues: Issues related to quantized base models in the 'export-lora' examples were resolved, as indicated by pull request #8687.
- NULL Pointer Dereference Prevention: A potential NULL pointer dereference issue in the `ggml` module was addressed by ensuring the `ggml_init` function handles cases where no unused context is found.
- Build and Compile Warning Fixes: Build issues and compile warnings related to the `fabs` function in the `llama` project were resolved, as indicated by the message "llama : fix build + fix fabs compile warnings (#8683)".
- Windows on ARM Build Improvements: The build process for Windows on ARM, specifically targeting Snapdragon X, was improved, including reverting a previous commit and updating the documentation.
- Printf Statement Fixes: Issues related to printf statements in test files were resolved, as indicated by the message 'tests : fix printfs (#8068)'.
- Multi-GPU Issue Resolution in SYCL: A multi-GPU issue in SYCL was addressed and resolved, with contributions from Intel's Chen Xi and Hengyu Meng.
- New Function in ggml: The `ggml_cpu_has_llamafile()` function was introduced and utilized within the ggml project.
- Example and Build Process Updates: The `finetune` and `train-text-from-scratch` examples were removed, the build process was fixed, the help message was updated, and a small typo related to `export-lora` was corrected.
- Documentation Updates: References to "quantum models" were replaced with "quantized models" in the imatrix and server README files.
- Sliding Window for phi3 Function: A sliding window was introduced for the phi3 function, a typo was corrected, and the `convert_hf_to_gguf.py` script was updated to incorporate the phi3 sliding window functionality.
- README File Update: The README file was updated to include a link to a game created by the author that relies on the llama dependency.
- SYCL CI Build Enhancements: The SYCL CI builds were updated to include both static and dynamic libraries for testing purposes in the Llama project.
- User Interface List Update: The user interface list in the README file was updated.
- Function Correction in Mistral Project: The `llama_chat_format_single` function for the Mistral project was corrected, a typo was fixed, and the use of `printf` was incorporated.
- Reintroduction of `-fsycl` Flag: The previously removed `-fsycl` flag was reintroduced to the `GGML_EXTRA_LIBS` configuration, addressing issue #8667.
- New Feature Addition: The `llama_lora_adapter_clear` feature was added, as referenced in pull request #8653.
- Example Fixes and Improvements: The `llama-export-lora` example was fixed by adding more logging, rejecting merging subsets, improving checks, and correcting typos.
- URL Parsing Fix: An issue in the server was addressed by fixing the URL.parse function within the user interface, as referenced in issue #8646.
- CMake Configuration Updates: The CMake configuration was updated to support NVIDIA hardware and an open-source compiler, adding support for non-release versions of DPC++ and oneMKL.
- Project Reorganization: The Llama project was reorganized by moving vocabulary, grammar, and sampling code into separate files, deprecating certain functions, updating dependencies, and redirecting external APIs to internal ones with a "_impl" suffix.
- Vulkan Support Enhancements: Multiple issues and enhancements were addressed, including fixing compile errors in Vulkan matmul tests, adding support for Vulkan IQ4_NL, and resolving support issues for Vulkan DeepSeek-Coder-V2-Lite MoE.
- RDNA2 Architecture Support: All RDNA2 architectures were allowed to utilize the `__builtin_amdgcn_sdot4` intrinsic by replacing the specific check for gfx1030 with a more generic RDNA2 define.
- Contribution Guidelines Update: The process of pull request squashing was clarified, a typo was corrected, and a list of modules was added to the contribution guidelines.
- Scratch Size Allocation Fix: The scratch size allocation for the softmax function in the SYCL project was corrected, as referenced by issue number 8642.
- Codeshell Support Fix: The issue of codeshell support in the llama project was addressed by fixing its implementation and adjusting the order of codeshell and smollm to align with the enum sequence.
- SmolLm Pre-tokenizer Support: Support for the SmolLm pre-tokenizer was introduced in the llama project, including updates to relevant scripts and files, handling regex, and removing certain `.inp` and `.out` gguf files.
- Python Code Style Adjustments: Various stylistic adjustments were made to Python files, including removing superfluous parentheses, eliminating unused arguments, replacing an unused variable with an underscore, initializing certain attributes, renaming a constant to uppercase, and preventing the redefinition of a variable.
- Tokenizer Flag Overrides: The ability to override tokenizer flags in the llama project was introduced.
- Tokenizer Test Re-enablement: Tokenizer tests for MPT and DeepSeek were re-enabled, duplicated vocabularies were removed, and the CMake configuration was updated.
- Mistral Nemo Inference Support: Mistral Nemo inference support was added to the llama project.
- Server Documentation Update: The server documentation was updated to clarify the usage of the `n_keep` parameter when a beginning-of-sequence (BOS) token is present.
- RISC-V Compilation Error Fix: A compilation error specific to the RISC-V architecture in the ggml project was addressed.
- Android Example Generation Fix: The issue in the Android example generation process was addressed by ensuring the `completion_loop()` function returns NULL instead of an empty string when the generation ends.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, or created at least 1 pull request in the past month.
Contributor | Commits | Pull Requests | Issues |
---|---|---|---|
GitHub | 184 | 0 | 0 |
ggerganov | 0 | 39 | 2 |
ngxson | 0 | 15 | 2 |
0wwafa | 0 | 0 | 15 |
JohannesGaessler | 0 | 11 | 0 |
compilade | 0 | 10 | 0 |
HanClinto | 0 | 9 | 1 |
Georgi Gerganov | 9 | 0 | 0 |
danbev | 0 | 9 | 0 |
Someone | 8 | 0 | 0 |
mofosyne | 0 | 8 | 0 |
RunningLeon | 0 | 3 | 3 |
luoyu-intel | 0 | 4 | 0 |
slaren | 0 | 4 | 0 |
Alcpz | 0 | 4 | 0 |
iboB | 0 | 4 | 0 |
maruel | 0 | 2 | 2 |
sorasoras | 0 | 0 | 4 |
oldgithubman | 0 | 0 | 4 |
AidanBeltonS | 0 | 3 | 0 |
perpendicularai | 0 | 1 | 2 |
iamlemec | 0 | 3 | 0 |
AndreasKunar | 0 | 1 | 2 |
stduhpf | 0 | 1 | 2 |
joeatodd | 0 | 3 | 0 |
RakshitAralimatti | 0 | 0 | 3 |
yli147 | 0 | 0 | 3 |
mgroeber9110 | 0 | 1 | 1 |
jpodivin | 0 | 2 | 0 |
OuadiElfarouki | 0 | 2 | 0 |
LDLINGLINGLING | 0 | 1 | 1 |
foldl | 0 | 2 | 0 |
dspasyuk | 0 | 1 | 1 |
mtasic85 | 0 | 2 | 0 |
standby24x7 | 0 | 2 | 0 |
b4b4o | 0 | 1 | 1 |
kevmo314 | 0 | 2 | 0 |
jaime-m-p | 0 | 2 | 0 |
jdomke | 0 | 2 | 0 |
zhipenghan | 0 | 2 | 0 |
nicholaiTukanov | 0 | 1 | 1 |
msy-kato | 0 | 2 | 0 |
ClarkChin08 | 0 | 2 | 0 |
0cc4m | 0 | 2 | 0 |
airMeng | 0 | 2 | 0 |
AmgadHasan | 0 | 1 | 1 |
amochkin | 0 | 1 | 1 |
Stillerman | 0 | 1 | 1 |
kaetemi | 0 | 1 | 1 |
jeroen-mostert | 0 | 1 | 1 |
QIANXUNZDL123 | 0 | 0 | 2 |
mirek190 | 0 | 0 | 2 |
ch1y0q | 0 | 0 | 2 |
SimplyCorbett | 0 | 0 | 2 |
yancaoweidaode | 0 | 0 | 2 |
Battlehub0x | 0 | 0 | 2 |
Arashimu | 0 | 0 | 2 |
MathiasSchindler | 0 | 0 | 2 |
Sokartecnologi | 0 | 0 | 2 |
bartowski1182 | 0 | 0 | 2 |
ericcurtin | 0 | 0 | 2 |
vt-alt | 0 | 0 | 2 |
abetlen | 0 | 1 | 0 |
ochafik | 0 | 1 | 0 |
AlexsCode | 0 | 1 | 0 |
iacore | 0 | 1 | 0 |
Zor-X-L | 0 | 1 | 0 |
crashr | 0 | 1 | 0 |
hackingthekernel | 0 | 1 | 0 |
andy-tai | 0 | 1 | 0 |
mcharytoniuk | 0 | 1 | 0 |
Quantaindew | 0 | 1 | 0 |
MistApproach | 0 | 1 | 0 |
ho2103 | 0 | 1 | 0 |
hopto-dot | 0 | 1 | 0 |
akemimadoka | 0 | 1 | 0 |
NeoZhangJianyu | 0 | 1 | 0 |
dwoolworth | 0 | 1 | 0 |
daniandtheweb | 0 | 1 | 0 |
pouwerkerk | 0 | 1 | 0 |
bviksoe | 0 | 1 | 0 |
diimdeep | 0 | 1 | 0 |
prfd | 0 | 1 | 0 |
youth123 | 0 | 1 | 0 |
brochure | 0 | 1 | 0 |
agray3 | 0 | 1 | 0 |
yeahdongcn | 0 | 1 | 0 |
daghanerdonmez | 0 | 1 | 0 |
andysalerno | 0 | 1 | 0 |
fairydreaming | 0 | 1 | 0 |
laik | 0 | 1 | 0 |
monatis | 0 | 1 | 0 |
AragonerUA | 0 | 1 | 0 |
kriation | 0 | 1 | 0 |
danielhanchen | 0 | 1 | 0 |
teleprint-me | 0 | 1 | 0 |
65a | 0 | 1 | 0 |
NikolaiLyssogor | 0 | 1 | 0 |
sbonds | 0 | 1 | 0 |
SommerEngineering | 0 | 1 | 0 |
amitj1jan | 0 | 1 | 0 |
nopperl | 0 | 1 | 0 |
EZForever | 0 | 1 | 0 |
m18coppola | 0 | 1 | 0 |
thxCode | 0 | 1 | 0 |
hankeke303 | 0 | 1 | 0 |
devojony | 0 | 1 | 0 |
zqb-all | 0 | 1 | 0 |
Xarbirus | 0 | 1 | 0 |
FanShupei | 0 | 1 | 0 |
themanyone | 0 | 1 | 0 |
Oliver-Y | 0 | 1 | 0 |
0x4139 | 0 | 1 | 0 |
Ujjawal-K-Panchal | 0 | 1 | 0 |
fmz | 0 | 1 | 0 |
MorganRO8 | 0 | 1 | 0 |
jmorganca | 0 | 1 | 0 |
ElYaiko | 0 | 1 | 0 |
sasha0552 | 0 | 1 | 0 |
DavidKorczynski | 0 | 1 | 0 |
bsquizz | 0 | 1 | 0 |
zhentaoyu | 0 | 1 | 0 |
wangshuai09 | 0 | 1 | 0 |
Smupk2778 | 0 | 0 | 1 |
Green-Sky | 0 | 0 | 1 |
eliranwong | 0 | 0 | 1 |
quarterturn | 0 | 0 | 1 |
rudiservo | 0 | 0 | 1 |
werruww | 0 | 0 | 1 |
unclemusclez | 0 | 0 | 1 |
JohnClaw | 0 | 0 | 1 |
micsthepick | 0 | 0 | 1 |
kherud | 0 | 0 | 1 |
duynt575 | 0 | 0 | 1 |
tomgm777 | 0 | 0 | 1 |
chiranko | 0 | 0 | 1 |
Gomez12 | 0 | 0 | 1 |
starP-W | 0 | 0 | 1 |
nathanodle | 0 | 0 | 1 |
tybalex | 0 | 0 | 1 |
akhilkapil | 0 | 0 | 1 |
LiquidGunay | 0 | 0 | 1 |
flatsiedatsie | 0 | 0 | 1 |
tihom77 | 0 | 0 | 1 |
lorihuang | 0 | 0 | 1 |
ctb111 | 0 | 0 | 1 |
aahouzi | 0 | 0 | 1 |
jim-plus | 0 | 0 | 1 |
Yan-Xiangjun | 0 | 0 | 1 |
josharian | 0 | 0 | 1 |
Aridbhdkkj | 0 | 0 | 1 |
AUTOMATIC1111 | 0 | 0 | 1 |
isaac-mcfadyen | 0 | 0 | 1 |
d-kleine | 0 | 0 | 1 |
warren-lei | 0 | 0 | 1 |
andreys42 | 0 | 0 | 1 |
gpacix | 0 | 0 | 1 |
guinmoon | 0 | 0 | 1 |
bandoti | 0 | 0 | 1 |
apresence | 0 | 0 | 1 |
kasrahabib | 0 | 0 | 1 |
Hardik-Choraria | 0 | 0 | 1 |
99991 | 0 | 0 | 1 |
Sakura4036 | 0 | 0 | 1 |
markat1 | 0 | 0 | 1 |
amakropoulos | 0 | 0 | 1 |
MeemeeLab | 0 | 0 | 1 |
joshknnd1982 | 0 | 0 | 1 |
sealad886 | 0 | 0 | 1 |
lin72h | 0 | 0 | 1 |
jie80219 | 0 | 0 | 1 |
nne998 | 0 | 0 | 1 |
StatPan | 0 | 0 | 1 |
1cekrim | 0 | 0 | 1 |
bong-furiosa | 0 | 0 | 1 |
djain-fujitsu | 0 | 0 | 1 |
m828 | 0 | 0 | 1 |
Fulgurance | 0 | 0 | 1 |
criminact | 0 | 0 | 1 |
VelocityRa | 0 | 0 | 1 |
dafei2017 | 0 | 0 | 1 |
metal3d | 0 | 0 | 1 |
Emmanuel97460 | 0 | 0 | 1 |
vmarchenkoff | 0 | 0 | 1 |
jpoly1219 | 0 | 0 | 1 |
ciekawy | 0 | 0 | 1 |
DanielusG | 0 | 0 | 1 |
hgftrdw45ud67is8o89 | 0 | 0 | 1 |
qnixsynapse | 0 | 0 | 1 |
rhvall | 0 | 0 | 1 |
zucchini-nlp | 0 | 0 | 1 |
hipudding | 0 | 0 | 1 |
suncloudsmoon | 0 | 0 | 1 |
newsletternewsletter | 0 | 0 | 1 |
simon-krannig | 0 | 0 | 1 |
RonanKMcGovern | 0 | 0 | 1 |
nicoboss | 0 | 0 | 1 |
MangoTCF | 0 | 0 | 1 |
TanLam01 | 0 | 0 | 1 |
peter-ch | 0 | 0 | 1 |
auriocus | 0 | 0 | 1 |
cloud11665 | 0 | 0 | 1 |
wencan | 0 | 0 | 1 |
Vaibhavs10 | 0 | 0 | 1 |
Tureti | 0 | 0 | 1 |
tc-wolf | 0 | 0 | 1 |
akashaero | 0 | 0 | 1 |
artiomborovinskii | 0 | 0 | 1 |
mudler | 0 | 0 | 1 |
Azirine | 0 | 0 | 1 |
creeves-anaconda | 0 | 0 | 1 |
hackey | 0 | 0 | 1 |
chigkim | 0 | 0 | 1 |
IcyXi | 0 | 0 | 1 |
8XXD8 | 0 | 0 | 1 |
matteoserva | 0 | 0 | 1 |
Volko61 | 0 | 0 | 1 |
riedgar-ms | 0 | 0 | 1 |
mgonzs13 | 0 | 0 | 1 |
yuanzhiyong1999 | 0 | 0 | 1 |
windowsagent | 0 | 0 | 1 |
ElaineWu66 | 0 | 0 | 1 |
ExtReMLapin | 0 | 0 | 1 |