Weekly GitHub Report for Llama.cpp: June 16, 2025 - June 23, 2025 (12:03:41)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, but the source data does not include specific details, so no notable highlights or trends can be identified this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: Inconsistent Embedding Similarity between llama-server and LlamaCppEmbeddings for BGE-M3 Model: This issue highlights a significant discrepancy in embedding similarity results when using the BGE-M3_FP16.gguf model with llama-server compared to the LlamaCppEmbeddings integration. The embeddings generated via LlamaCppEmbeddings yield expected similarity scores, while those from the llama-server endpoint produce much lower and incorrect similarity scores for the same text pairs.
- The comments discuss potential causes for the discrepancy, such as normalization differences, and suggest debugging steps like posting prompts and raw embedding values. The conversation reveals that the issue might be related to the model conversion process, with suggestions to reconvert the model from a different source. Despite attempts to resolve the issue, the problem persists, with some users experiencing correct results while others do not, indicating a possible configuration or version mismatch.
- Number of comments this week: 21
- Misc. bug: [Windows] GPU layers/tensors still consume system memory after load when mmap = true: This issue describes a problem with the llama-server on Windows where GPU layers and tensors continue to consume system memory even after being loaded to the GPU when the `mmap` option is set to true. This behavior does not occur on Linux, and it poses a significant problem when performing CPU+GPU inference on models that exceed the available system memory, as the system memory remains fully utilized regardless of the GPU allocation.
- The comments discuss the issue's occurrence primarily on Windows, with users noting that disabling `mmap` resolves the excessive RAM usage but prevents loading models larger than the system memory. Some users highlight that this problem does not occur on Linux, where models can be loaded using swap memory without issues. There is a suggestion that the problem might be related to how Windows handles memory mapping compared to Linux.
- Number of comments this week: 4
- Compile bug: gcc-12: error: unrecognized command-line option ‘-compress-mode=size’: This issue involves a compilation error encountered when attempting to build the project with CUDA support, specifically due to the unrecognized command-line option '-compress-mode=size' when using gcc-12. The problem seems to be related to a mismatch between the detected CUDA version (12.9) and the compiler version (12.4), which previously did not cause issues but now results in build failures.
- The comments discuss the mismatch between the detected CUDA version and the compiler version, with users noting that previous builds worked despite this disparity. Attempts to specify the correct CUDA paths and update environment variables to point to the newer CUDA version (12.9) still result in errors, indicating a persistent issue after server updates.
- Number of comments this week: 4
- Misc. bug: LLAMA-SERVER is 40% slower than LLAMA-CLI when using identical parameters including -ot option for tensor offloading: This issue reports a performance discrepancy where the LLAMA-SERVER is operating 40% slower than the LLAMA-CLI when using identical parameters, including the tensor offloading option. The user notes that while CPU usage is at 100% for LLAMA-CLI, it is only 75-80% for LLAMA-SERVER, despite equal GPU/VRAM usage in both cases.
- A comment discusses the use of a 0.01 minimum probability with top-p sampling, suggesting that a temperature of 0.6 and top-p of 0.95 are used to prevent repetitive loops, and that sampling methods should be adjusted based on the situation.
- Number of comments this week: 1
- Eval bug: OpenAI streaming API changed/broken: This issue reports a problem with the OpenAI streaming API, which appears to have changed or broken after a recent upgrade, causing issues with event parsing in JavaScript clients. The user describes that the server seems to push two events at once, which may not be valid, and suggests that a whitespace issue might be causing the events to merge incorrectly.
- A commenter suggests that the server might be pushing two events simultaneously, which could be causing the issue, and notes that while the Python `sseclient` module can handle this, it still requires filtering out "content: null" events.
- Number of comments this week: 1
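For the streaming issue above, a useful client-side check is to parse the stream strictly by SSE framing: events are separated by blank lines, and every payload line should start with "data:". The following Python sketch (not part of llama.cpp) reads a streamed chat completion from a locally running llama-server via its OpenAI-compatible /v1/chat/completions route; the host, model name, and prompt are placeholders, and because lines are processed individually, two "data:" events arriving in one network chunk are still handled separately.

```python
import json

import requests  # assumed available; any HTTP client with line streaming works

# Hypothetical local server and payload; adjust host, model name, and prompt.
URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE events are separated by blank lines; skip separators and
        # whitespace-only lines (the suspected culprit in the issue).
        if not line or not line.strip():
            continue
        if not line.startswith("data:"):
            # Anything that is not a "data:" line is worth inspecting.
            print("unexpected line:", repr(line))
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if delta.get("content"):  # skip null/empty content deltas
            print(delta["content"], end="", flush=True)
```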
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process (a sketch of the callback wiring follows this list).
- kubernetes example: This issue highlights the need for a Helm chart for the `llama.cpp` server to facilitate its deployment on Kubernetes, which is a popular platform for managing containerized applications at scale. The author has initiated the development of this chart and is seeking community assistance to further progress the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the GGML backend with CUDA on a system equipped with an NVIDIA GeForce RTX 3060. The problem arises from a tensor type mismatch and block size inconsistency, leading to a failure in reading tensor information and subsequently preventing the model from loading successfully.
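As background for the parallel-download progress item above, libcurl only invokes progress callbacks when CURLOPT_NOPROGRESS is disabled, and each transfer handle needs its own callback so concurrent shard downloads do not clobber each other's status output. Below is a minimal, hedged sketch of that wiring using pycurl; llama.cpp's downloader uses libcurl's C API directly, and the URL, file name, and label here are placeholders.

```python
import pycurl  # assumed installed; illustrative only

def make_progress_cb(label):
    """Return a per-transfer progress callback so parallel downloads
    do not overwrite each other's status."""
    def progress(download_total, downloaded, upload_total, uploaded):
        if download_total > 0:
            pct = 100.0 * downloaded / download_total
            print(f"{label}: {pct:5.1f}%")
        return 0  # returning non-zero aborts the transfer
    return progress

def download(url, path, label):
    c = pycurl.Curl()
    with open(path, "wb") as f:
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        # Progress callbacks are only invoked when NOPROGRESS is disabled.
        c.setopt(pycurl.NOPROGRESS, False)
        c.setopt(pycurl.XFERINFOFUNCTION, make_progress_cb(label))
        c.perform()
    c.close()

# Placeholder URL and file name for a single shard.
download("https://example.com/model-00001-of-00002.gguf", "shard1.gguf", "shard 1")
```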
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 22
Summarized Issues:
- Performance Discrepancies in LLAMA-SERVER: The LLAMA-SERVER is experiencing performance issues, operating 40% slower than the LLAMA-CLI despite using identical parameters, including tensor offloading. Observations indicate full CPU usage for LLAMA-CLI but only 75-80% for LLAMA-SERVER, with equal GPU/VRAM usage.
- CUDA-Specific Bugs in RWKV Inference: A bug in the RWKV inference process using llama-parallel results in incorrect output when the language model head is offloaded to the GPU using CUDA. This issue does not occur with other backends like Metal and Vulkan, indicating a CUDA-specific problem.
- Bugs in JSON Schema Processing: The llama-server encounters an "Unrecognized Schema" exception when processing complex JSON schemas for tool calling, particularly those using `not` and `anyOf` options. This results in generation failures, while other inference providers can parse them successfully.
- Compilation and Build Issues: A compilation error occurs when building the `llama.cpp` project with CUDA support on Linux, where the GCC compiler version does not recognize a specific command-line option. This may be due to a mismatch between the detected CUDA version and the compiler version, leading to build failures.
- Memory and Resource Management Problems: A memory leak problem is reported when using CANN as the backend on an Ascend 910B4 hardware setup, with CPU memory usage increasing significantly over time. Additionally, storing a model on a hard drive causes a crash, whereas storing it on an SSD allows it to function correctly.
- Web UI and Input Handling Bugs: The web UI of the llama-server module on Linux exhibits erratic cursor placement when using specific command line parameters. Additionally, pasted content is not recognized as a valid prompt, requiring additional input to send a query.
- Feature Requests for Model Support and Enhancements: There are requests to add support for the moonshotai/Kimi-VL-A3B-Instruct model and to implement a flag to limit the maximum input image size for vision models. These enhancements aim to improve model efficiency and prevent out-of-memory errors.
- Bugs in Model Execution and Output: Bugs are reported in the execution of specific models, such as the DeepSeek-r1 32B_Q8 model on an Ascend 910B instance, where the system terminates due to an error related to a pre-allocated tensor. Additionally, the Mistral small 2506 model encounters GPU memory issues with large images.
- Discrepancies in Embedding Similarity Scores: A significant discrepancy in embedding similarity scores is observed when using the BGE-M3_FP16.gguf model with llama-server compared to the LlamaCppEmbeddings integration. This may be due to differences in normalization methods, prompting further investigation.
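For the embedding-similarity discrepancy summarized above, one practical debugging step is to fetch raw embeddings from the running server and compute cosine similarity and vector norms directly, which exposes whether the vectors are L2-normalized. The sketch below is an illustration under assumptions: it presumes llama-server was started with --embeddings and serves the OpenAI-compatible /v1/embeddings route, the host, model name, and texts are placeholders, and the LlamaCppEmbeddings side of the comparison is not shown.

```python
import numpy as np
import requests

# Placeholder host; assumes llama-server was started with --embeddings.
URL = "http://localhost:8080/v1/embeddings"

def get_embeddings(texts):
    """Fetch embeddings for a list of texts from the running server."""
    resp = requests.post(URL, json={"model": "bge-m3", "input": texts})
    resp.raise_for_status()
    return [np.asarray(d["embedding"], dtype=np.float32) for d in resp.json()["data"]]

def cosine(a, b):
    # Explicit normalization makes it obvious whether the raw vectors
    # were already L2-normalized before being returned.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = ["what is a panda?", "The giant panda is a bear species endemic to China."]
a, b = get_embeddings(texts)
print("norms:", np.linalg.norm(a), np.linalg.norm(b))  # ~1.0 if pre-normalized
print("cosine similarity:", cosine(a, b))
```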
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 16
Summarized Issues:
- Compile Bug in `vulkan-shaders-gen` Installation: The `vulkan-shaders-gen` binary is incorrectly installed into the build directory instead of the intended global directory, causing build failures on BSD systems. Discussions suggest removing the installation step to prevent this behavior.
- Qwen 3 Model Reasoning Display Issue: Users experienced a problem where the reasoning thought process was not displayed in the web UI despite being generated by the llama-cli. The issue was resolved by adding the `--reasoning-format none` setting.
- Embedding Request Errors in Llama-Server: A bug in the llama-server caused embedding requests to result in a 500 error with the message "Invalid input batch" when using the Qwen3-Embedding-0.6B-GGUF model. A suggested fix from a pull request resolved the error.
- Simultaneous Embeddings and Chat Completions in Llama-Server: The llama-server module did not support running both embeddings and chat completions simultaneously, requiring a workaround with the `--embeddings` flag. A subsequent fix allowed both functionalities to work together.
- Feature Request for `--no-warmup` Flag in `llama-bench`: A feature request was made to add a `--no-warmup` flag to the `llama-bench` tool, allowing users to disable the internal warm-up phase. This would improve efficiency in automated benchmarking pipelines by eliminating redundant operations.
- Incorrect Output in Qwen3-Embedding-0.6B Model: The GGUF model produced incorrect output compared to the original safetensors and ONNX models, potentially due to missing settings. A suggested solution involved manually adding an EOS token to the input.
- Conversion Failure of Llama 4 Scout Model: A failure occurred when converting the Llama 4 Scout model to the mmproj format due to a missing tensor. The issue was resolved by identifying a typo in the conversion script and adding the missing `+ ".weight"` to the tensor name.
- Conversion Issues with `convert_hf_to_gguf.py` Script: Users faced difficulties converting Hugging Face models to GGUF format due to an outdated `gguf` package and incorrect symlink usage. Updating the package to version 0.17.1 or higher resolved the issue (a version-check sketch follows this list).
- BLAS Backend Configuration Problem: Users encountered a problem where the system information did not reflect the BLAS configuration and defaulted to using the CPU. This occurred despite following the documentation and build steps correctly.
- Dynamic Backend Loading in Llama.cpp: The function `ggml_backend_reg_count()` returned 0 on Linux systems due to a change in how backends are dynamically loaded. This required explicit loading of backends using functions like `ggml_backend_load()` or `ggml_backend_load_all()`.
- RPC Connection Bug on Windows: A bug in the RPC feature caused the connection to close immediately upon client connection, preventing model transfer or execution. The issue was resolved by specifying the number of layers to offload to the RPC server using the `-ngl` option.
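Related to the `convert_hf_to_gguf.py` item above, a quick pre-flight check of the installed `gguf` package version can catch the outdated-package case before a long conversion run. This is a small illustrative sketch using Python's importlib.metadata; the 0.17.1 threshold is the version reported as resolving the issue.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_GGUF = (0, 17, 1)  # minimum version reported to resolve the conversion issue

try:
    installed = version("gguf")
except PackageNotFoundError:
    raise SystemExit("gguf is not installed; run: pip install -U gguf")

# Compare the leading numeric components only (good enough for a sanity check).
parts = tuple(int(p) for p in installed.split(".")[:3] if p.isdigit())
if parts < MIN_GGUF:
    raise SystemExit(
        f"gguf {installed} is older than 0.17.1; upgrade with: pip install -U gguf"
    )
print(f"gguf {installed} looks new enough for convert_hf_to_gguf.py")
```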
2.5 Issue Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. MODEL: Falcon-H1 support: This pull request introduces support for the Falcon-H1 architecture in the llama.cpp project, building upon previous contributions from other developers and pending the merging of related pull requests, while also addressing minor issues before finalizing the integration.
- URL: pull/14238
- Merged: No
- Associated Commits: 1f0fe, dceff, 2bfe9, aff96, e0491, fa358, 38913, 0e601, 273e7, 7d6cb, 2c77d, 87b97, 03d0e, 7a351, 8b15b, 5b8ec, 62b09, 038d9, 80551, 7d16e, 3bc71, 8d8f0, b4e9c, 1ee6c, c9ecf, 35d06, cf4f0, 6def5, 79199, 94c3d, 929fe, d55b0, e94f3, 9864b, 2fa5f, 757aa, 0b6f6, 4dff8, 18224, 4782a, 6a397, 45217, 263d9, a7bd7, af916, 1ee27, 781d5, a3f58, c8c9a, e7750, d327a, ee9b3, c3c0e, df695, 9520a, 6361a, b1150, 86904, fa6db, 42459, 74ad4, b3c94, f3e34, 6253c, 0d80e, 0385f, c5f9f, 665a3, 6497b, acd55, cc40c, f5656, 17abb, b63be, 4c373, 0ffbe, 936da, fb220, 40e7a, 309d7, eaaf4, 58111, 84e30, ae7a0, 1e490, ddc85, 368c9, b82a2, 8b7f6, 3ee22, bd37f, 2e5e4, a42b9, beddd, 8d240, 95b66, e16aa, 9d30f, aff19, 9150f, 11857, ebfcb, 82f59, 837be, 41f25
2. ggml-cpu: enable IBM NNPA Vector Intrinsics: This pull request introduces support for the IBM NNPA instruction set on IBM z16 mainframes and the s390x platform, primarily focusing on enhancing the performance of FP16 to FP32 and FP32 to FP16 data conversions, resulting in a significant performance improvement of approximately 31.21% for F16 token generation, and it supersedes a previous implementation (#14303) that was incorrect.
- URL: pull/14317
- Merged: No
- Associated Commits: 58018, 45a4c, ebf9f, ffe29, 0394a, 48b82, d9cc6, 94f10, 8f3a5, 575ea, 93304, ebc1d, 6a25f, f9f6c, 6d507, 8312a, 27b4c, e0f8f, bb934, 5424d, 4f017, 27131, 946c7, 433d5, 54811, e12e9, 7413d, 373fa, 4621a, 987d1, 8ef51, f1b1d, 1547e, 0e571, 4ad6e, 81298, e7910, 157f8, 48df9, 3004a, 1cacd, fadc1, ed76f, 84593, 72c91, a91c3, ba351, 18d79, 781c2, 263b8, 04a39, c8b3b, ebb84, 3ec0b, e43dc, 5c9b0, 1b4db, 46227, 72965, 07de5, 489cd, 5004e
3. kv-cache : use ggml_set_rows: This pull request introduces the use of the `ggml_set_rows()` function to update the KV cache more efficiently by making the graph static with respect to the KV cells' `head` offset and relaxing the requirement for contiguous KV slots, with the new implementation being conditionally enabled via the `LLAMA_SET_ROWS` environment variable until it is fully supported by all backends.
- URL: pull/14285
- Merged: No
Other Open Pull Requests
- SmolLM3 Addition: This topic covers the introduction of SmolLM3 to the project, which includes several commits that initialize the feature, refactor a model class, fix conversion errors, and update the graph. The pull request references a related pull request from the Hugging Face Transformers repository.
- Llama-bench Enhancements: Enhancements to the llama-bench tool include separate timing outputs for prompt processing and token generation, and a new feature to specify the number of threads per batch. These changes provide more detailed performance metrics and improved configurability.
- CUDA Kernel and Operation Improvements: The `mul_mat_vec` CUDA kernels are extended to support larger batch sizes, enhancing speculative decoding and batched inference with significant performance improvements. Additionally, a new mean operation is introduced to the CUDA implementation, refactoring the sum_rows function for normalization and demonstrating speedup on an RTX 3090 GPU.
- NUMA Optimization: The GGML_NUMA_MIGRATE feature is introduced to optimize cross-NUMA operation computation, enhancing the ggml_barrier for NUMA awareness. This includes a build option and command-line option for page migration, resulting in significant performance improvements on systems with multiple NUMA nodes.
- Backend Device Selection: A feature is introduced allowing users to select the backend device for the Clip/vision encoder by setting the `MTMD_BACKEND_DEVICE` environment variable. This provides flexibility in choosing the desired hardware for processing.
- Windows Null Pointer Bug Fix: A critical null pointer bug on the Windows platform within the `ggml-sycl` module is addressed by adding a null pointer check for tensor data. This improves memory operations for better stability and performance, with additional enhancements in error handling and logging.
- End-Of-Generation Bias: A configurable escalating End-Of-Generation (EOG) bias is introduced, increasing with each token generated after a specified threshold. New command line options are added to control when this bias is applied, based on token count.
- Server PID File Option: A new option is introduced to the server for creating a pidfile, allowing the process ID to be tracked. This is detailed in the commit by Eric Curtin.
- OpenCL Context Management: A reference counter is introduced to the `ggml_backend_opencl_context` for accurate backend reference management in multimodal models. Profiling is refactored, and the `enqueue_ndrange_kernel` function is added for kernel launching.
- RPATH Adjustment for Binaries: The issue of Linux and macOS binaries having absolute RPATHs is addressed by setting the RPATH to "$ORIGIN" for Linux and "@loader_path" for macOS. This ensures executables and dynamic libraries can correctly locate their dependencies.
- Vulkan and CONV_2D Enhancements: A new CONV_2D operation and a direct GEMM Vulkan implementation are introduced, enabling efficient 2D convolution computations. This significantly improves performance on Vulkan backends, particularly on NVIDIA GeForce RTX 2060 SUPER.
- C++17 Transition: The project transitions to C++17, replacing a polyfill with `std::string_view`. This includes a commit with the necessary changes.
- Conv2D CPU Version: A CPU version of the Conv2D operation is introduced, currently in draft status due to inconsistent performance compared to the im2col method. The aim is to optimize it for potential use in issue #14316.
- Matrix Row Copy Function: The function `ggml_set_rows(a, b, c)` is introduced, facilitating the copying of rows from matrix 'b' into matrix 'a' based on indices in 'c'; a sketch of the semantics follows this list. This is referenced in issue #8366.
- Compilation Warning Fix: A simple compilation warning in the llama.cpp project is fixed. The change is contained in the commit with SHA a4988383056bb38b3d100df0ceecd5036670bacf and is currently not merged.
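To make the row-copy semantics of the new ggml_set_rows(a, b, c) function concrete, the effect described above can be mimicked in a few lines of NumPy. This is only an illustration of the operation's behavior on matrices, not the actual ggml C API, its full argument list, or its tensor types.

```python
import numpy as np

def set_rows(a, b, c):
    """Copy row b[i] into row a[c[i]] for each i -- the effect described
    for ggml_set_rows; returns a new array rather than mutating a."""
    out = a.copy()
    out[np.asarray(c, dtype=np.int64)] = b
    return out

a = np.zeros((5, 4), dtype=np.float32)             # destination matrix
b = np.arange(8, dtype=np.float32).reshape(2, 4)   # two source rows
c = [3, 0]                                         # destination row indices

print(set_rows(a, b, c))
# Row 3 of the result equals b[0], row 0 equals b[1]; other rows stay zero.
```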
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 58
Key Closed Pull Requests
1. Add security vulnerability assessment for llama.cpp: This pull request introduces a comprehensive security vulnerability assessment for the llama.cpp repository, focusing on identifying and documenting potential security issues within the HTTP server implementation and core components, using a systematic methodology to analyze security-critical files and enhance the project's security posture.
- URL: pull/14213
- Merged: No
- Associated Commits: 759e3, b69f1, a4090, be023, b7a17, fb1ca, 33983, b4489, 797f2, 42158, 0d5c7, cf4cb, eb0f5, 2aa77, c7653, 5fbfe, 8e186, d643b, edbf4, a4e89, 6b56a, d394a, 5be24, cc74d, ab863, 79799, 8a1d2, 3079e, a127f, c10ed, 1dcd0, e16c4, faaaf, 9ecf3, 8a2af, d13d0, a70a8, b7753, ffd0e, c3a26, 4c328, 25946, 2bd1b, 17fc8, a2d02, f5cd2, 515fd, 4032c, d785f, a08c1, 40aaa, c5082, de2ef, aa50b, 2f099, e121e, 2d38b, fef69, 9012e, 22229, 79c13, f1384, d74e9, 88c12, 03f58, 6f180, 4265a, a26c4, cdf94, 4f81b, f9cd6, 81713, 952f3, 7fe03, 72b09, bc583, 05f6a, a8ea0, 1c49c, f3101, 34b7c, bef81, 1701d, a3c30, 1e865, 26b79, a6824, f7873, a3938, c962a, aa6df, e0e3a, d98f2, 10961, 763d0, 53ae3, 1b8fb, 6385b, 5ca82, 66c92, dd8ba, 21fcc, 54a2c, 2b131, e83ba, ec9e0, 2c90d, 291f2, 07e43, db387, 53f92, b49a8, df0c0, dd665, b47ab, e562e, eb394, 12d01, 51fa7, 3f55f, c7e0a, 3600c, 803f8, e1589, b3a89, 053b1, 0fc16, c0462, 87263, fedf0, 6eba7, a7b8d, af6f9, d3372, 10800, f3a4b, e57bb, c496f, 5e1c3, 7675c, 66344, 093e3, bfd32, c9bbc, 5582c, ea394, 36375, bfb1e, 71e74, ea143, 7e00e, e0e80, 0b4be, 3ac67, 48254, 2589a, 3e63a, 0d398, 5a8ae, 9e31b, 9f47f, d01d1, 3a077, 7f37b, 146b8, 669c1, 1caae, d17a8, 487a5, 745aa, 0974a, 228f3, 5787b, 247e5, 056eb, 91a8e, b460d, 87d34, dc062, e21d2, 201b3, 8f47e, f470b, 7f4fb, 40cbf, 1f63e, 1a3b5, b8e21, 2bb04, 97340, b7ce1, ae92c, 3a12d, 652b7, 3678b, 55f6b, dad5c, 4c763, 1f7d5, 7ae29, 2baf0, 89a18, 7781e, bd248, cc66a, d4e0d, 53280, 2e89f, a20b2, 95965, e2c0b, c3ee4, f6e1a, 7d516, a681b, ed52f, c33fe, 09cf2, c6128, 0889e, ffad0, d714d, cc8d0, b7cc7, 60c66, 26ff3, 80709, 3cfbb, 40643, fb85a, 2e42b, 3cb20, 00ba7, b9912, c311a, 9ae41, 5fce5, 2c2ca, e54b3, 30e5b, cd355, d7da8, 3555b, c89c2, 4ad24, 0bf49, 3ba0d, d3e64, 7d6d9, ad590
2. pr against the wrong repo ...: This pull request, titled "pr against the wrong repo," was not merged and includes multiple commits aimed at reducing logging verbosity, adding helper scripts, integrating a remoting frontend and backend build system, and making various improvements and refactorings to the ggml project, specifically focusing on enhancing the remoting capabilities and support for virtual GPU operations.
- URL: pull/14284
- Merged: No
- Associated Commits: 74eae, 41846, ee79e, 1beda, cd541, ece33, b4b90, 53b42, 5049b, ffa65, 3ba78, a87e3, 3d7b1, 4419c, a25b6, 5febf, 2ee2a, 847a2, 3270c, d8d3c, 151c0, 52d8e, d1185, a5422, 78d16, e3973, 022dd, 9bde8, b5ac3, 0cdcd, f15fe, 0c264, 938ba, 0582b, be5f5, 60bac, abd17, 74772, 00be4, 3dd26, 11f65, f9a01, 2461d, 9ba6e, 9d523, 95ccc, ad578, 1d9d4, 4a687, 0b77f, 8c81f, 319af, 73ed5, 88e8e, 43af3, 25f8d, db107, 248f6, 2e70a, 4d7d6, 6f057, 6f396, d40a3, b815c, b24fb, 3a201, c5608, 14292, 9913b, ede86, 1927c, d3541, 49bb0, 8edd5, 372e6, 1a826, 6ce80, 6fc0c, c927b, f29aa, 17dd2, 14f32, e80e4, 31c0c, 03935, 67d40, c5d44, 83596, 6b4bc, 9ab69, da8bd, 55ce3, 55962, 4f9a2, 4fa0b, 484d5, 609c7, 6ea6a, aac3c, b4837, ecb7a, 3f362, b2067, f0127, 11e2e, efe68, 3769b, 3a730, 7ef07, 6d985, 50326, eeba6, 66b34, 5b5ff, 94fb1, 38b13, ade08, 5c93f, af7ca, 65b92, 34e68, 49213, 1d4bb, 67d00, b511b, af1aa, 1f2c4, a6186, 61a6b
3. feature attn with state: This pull request introduces a series of enhancements and new features to the ggml library, including the implementation of various benchmarking scripts, the addition of T-MAC and QLUTATTN quantization support, improvements in flash attention and mixed precision KV cache functionalities, as well as updates to documentation and testing frameworks, all aimed at optimizing performance and expanding the library's capabilities.
- URL: pull/14299
- Merged: No
- Associated Commits: c5bcd, b5988, 8fb4c, 9ebdd, 794d1, 12bcd, a5e38, a6a9e, 87a1b, 9cc44, ef75f, f89ac, 6c555, 2823d, eca77, b6a6d, 7160c, 3a329, c011e, 85f5f, b9ce3, a4aaf, c7d81, 554b4, 4a258, e7432, 63901, 86c52, 00731, b2744, 12ee6, ca52e, 84742, 8b2e2, 4ba6a, c699a, 1847b, b9a9f, 5db11, f715a, 52682, 76fa9, 2c8f2, 0de95, ff5e9, c5130, afa2b, 2ece7, 7a59d, e889f, 395a4, 47439, f014b, c9bf8, e1f99, 30b9d, 6f224, d5062, 70ed6, 05281, 62fc0, a3896, d447a, 3e9e6, 93fba, 7e847, a4899, ebe5b, 3c820, 8912d, b7459, d34f3, c342f, 86a48, 46ffe, bd2f7, dc1b4, 0dbb2, a4a42, 5f4ad, 0cb4d, 26ffe, 42de4, 2e32a, 4587f, 7f031, bc1dd, 985f7, a525b, 104e5
Other Closed Pull Requests
- IBM NNPA and SIMD Enhancements on s390x: The pull requests focus on enhancing performance on the s390x platform by introducing support for the IBM NNPA instruction set and integrating the SIMD instruction set into Llamafile. These changes result in significant performance improvements for FP16 to FP32 conversions and prompt processing, verified through tests on IBM mainframes.
- SYCL and CUDA Kernel Improvements: These pull requests introduce performance optimizations by using the `sycl_ext_oneapi_enqueue_functions` extension and adding CUDA implementations for convolution operations. The changes improve performance across various models and configurations, with successful tests confirming the enhancements.
- Embedding and Token Handling Enhancements: The pull requests rework embeddings logic and address separator token handling in sequence classification models. They introduce support for embeddings with causal attention and a new `add_sep_token` state, improving model functionality and flexibility.
- Documentation and Build Synchronization: These pull requests enhance documentation for the s390x architecture and synchronize the ggml project by aligning build outputs across platforms. They include updates to model conversion steps and performance optimizations, improving clarity and consistency.
- Error Handling and API Improvements: The pull requests address issues related to silent overflows and optional library dependencies, improving error handling and API documentation. They ensure compatibility and prevent errors by updating logic and versioning strategies.
- Code Refactoring and Optimization: These pull requests focus on refactoring code for better performance and consistency, including renaming interfaces and optimizing output ID handling. The changes improve code clarity and efficiency, as evidenced by successful tests and commits.
- Quantization and Metadata Handling: The pull requests address issues in the quantization process by correcting metadata parameter types and allowing key-value pair overrides. These changes ensure compatibility and prevent errors during model loading.
- Miscellaneous Enhancements: Various pull requests introduce improvements such as adding a new "mean kernel," fixing compile warnings, and refactoring preprocessing logic for image handling. These changes enhance functionality and performance across different components of the project.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 135 | 16 | 0 | 55 |
| CISC | 60 | 11 | 0 | 44 |
| taronaeo | 110 | 4 | 0 | 0 |
| gabe-l-hart | 59 | 0 | 1 | 1 |
| ngxson | 23 | 3 | 0 | 23 |
| kpouget | 48 | 1 | 0 | 0 |
| am17an | 27 | 6 | 0 | 5 |
| Zijie-Tian | 32 | 1 | 0 | 0 |
| slaren | 23 | 2 | 0 | 5 |
| JohannesGaessler | 10 | 2 | 0 | 16 |