Weekly GitHub Report for Llama.cpp: June 16, 2025 - June 23, 2025 (12:03:41)
Weekly GitHub Report for Llama.cpp
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025 introduces key updates and changes, but the source data does not include specific details, so no notable highlights or trends can be identified this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: Inconsistent Embedding Similarity between llama-server and LlamaCppEmbeddings for BGE-M3 Model: This issue highlights a significant discrepancy in embedding similarity results when using the BGE-M3_FP16.gguf model with llama-server compared to the LlamaCppEmbeddings integration. The embeddings generated via LlamaCppEmbeddings yield expected similarity scores, while those from the llama-server endpoint produce much lower and incorrect similarity scores for the same text pairs.
- The comments discuss potential causes for the discrepancy, such as normalization differences, and suggest debugging steps like posting prompts and raw embedding values. The conversation reveals that the issue might be related to the model conversion process, with suggestions to reconvert the model from a different source. Despite attempts to resolve the issue, the problem persists, with some users experiencing correct results while others do not, indicating a possible configuration or version mismatch.
- Number of comments this week: 21
- Misc. bug: [Windows] GPU layers/tensors still consume system memory after load when mmap = true: This issue describes a problem with the llama-server on Windows where GPU layers and tensors continue to consume system memory even after being loaded to the GPU when the `mmap` option is set to true. This behavior does not occur on Linux, and it poses a significant problem when performing CPU+GPU inference on models that exceed the available system memory, as the system memory remains fully utilized regardless of the GPU allocation.
- The comments discuss the issue's occurrence primarily on Windows, with users noting that disabling `mmap` resolves the excessive RAM usage but prevents loading models larger than the system memory. Some users highlight that this problem does not occur on Linux, where models can be loaded using swap memory without issues. There is a suggestion that the problem might be related to how Windows handles memory mapping compared to Linux.
- Number of comments this week: 4
- Compile bug: gcc-12: error: unrecognized command-line option ‘-compress-mode=size’: This issue involves a compilation error encountered when attempting to build the project with CUDA support, specifically due to the unrecognized command-line option '-compress-mode=size' when using gcc-12. The problem seems to be related to a mismatch between the detected CUDA version (12.9) and the compiler version (12.4), which previously did not cause issues but now results in build failures.
- The comments discuss the mismatch between the detected CUDA version and the compiler version, with users noting that previous builds worked despite this disparity. Attempts to specify the correct CUDA paths and update environment variables to point to the newer CUDA version (12.9) still result in errors, indicating a persistent issue after server updates.
- Number of comments this week: 4
- Misc. bug: LLAMA-SERVER is 40% slower than LLAMA-CLI when using identical parameters including -ot option for tensor offloading: This issue reports a performance discrepancy where the LLAMA-SERVER is operating 40% slower than the LLAMA-CLI when using identical parameters, including the tensor offloading option. The user notes that while CPU usage is at 100% for LLAMA-CLI, it is only 75-80% for LLAMA-SERVER, despite equal GPU/VRAM usage in both cases.
- A comment discusses the use of a 0.01 minimum probability with top-p sampling, suggesting that a temperature of 0.6 and top-p of 0.95 are used to prevent repetitive loops, and that sampling methods should be adjusted based on the situation.
- Number of comments this week: 1
- Eval bug: OpenAI streaming API changed/broken: This issue reports a problem with the OpenAI streaming API, which appears to have changed or broken after a recent upgrade, causing issues with event parsing in JavaScript clients. The user describes that the server seems to push two events at once, which may not be valid, and suggests that a whitespace issue might be causing the events to merge incorrectly.
- A commenter suggests that the server might be pushing two events simultaneously, which could be causing the issue, and notes that while the Python `sseclient` module can handle this, it still requires filtering out "content: null" events.
- Number of comments this week: 1
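For the streaming issue above, a useful client-side check is to parse the stream strictly by SSE framing: events are separated by blank lines, and every payload line should start with "data:". The following Python sketch (not part of llama.cpp) reads a streamed chat completion from a locally running llama-server via its OpenAI-compatible /v1/chat/completions route; the host, model name, and prompt are placeholders, and because lines are processed individually, two "data:" events arriving in one network chunk are still handled separately.

```python
import json

import requests  # assumed available; any HTTP client with line streaming works

# Hypothetical local server and payload; adjust host, model name, and prompt.
URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE events are separated by blank lines; skip separators and
        # whitespace-only lines (the suspected culprit in the issue).
        if not line or not line.strip():
            continue
        if not line.startswith("data:"):
            # Anything that is not a "data:" line is worth inspecting.
            print("unexpected line:", repr(line))
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if delta.get("content"):  # skip null/empty content deltas
            print(delta["content"], end="", flush=True)
```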
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is causing a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process (a sketch of the callback wiring follows this list).
- kubernetes example: This issue highlights the need for a Helm chart for the `llama.cpp` server to facilitate its deployment on Kubernetes, which is a popular platform for managing containerized applications at scale. The author has initiated the development of this chart and is seeking community assistance to further progress the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue pertains to a bug encountered while attempting to load a model using the GGML backend with CUDA on a system equipped with an NVIDIA GeForce RTX 3060. The problem arises from a tensor type mismatch and block size inconsistency, leading to a failure in reading tensor information and subsequently preventing the model from loading successfully.
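As background for the parallel-download progress item above, libcurl only invokes progress callbacks when CURLOPT_NOPROGRESS is disabled, and each transfer handle needs its own callback so concurrent shard downloads do not clobber each other's status output. Below is a minimal, hedged sketch of that wiring using pycurl; llama.cpp's downloader uses libcurl's C API directly, and the URL, file name, and label here are placeholders.

```python
import pycurl  # assumed installed; illustrative only

def make_progress_cb(label):
    """Return a per-transfer progress callback so parallel downloads
    do not overwrite each other's status."""
    def progress(download_total, downloaded, upload_total, uploaded):
        if download_total > 0:
            pct = 100.0 * downloaded / download_total
            print(f"{label}: {pct:5.1f}%")
        return 0  # returning non-zero aborts the transfer
    return progress

def download(url, path, label):
    c = pycurl.Curl()
    with open(path, "wb") as f:
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        # Progress callbacks are only invoked when NOPROGRESS is disabled.
        c.setopt(pycurl.NOPROGRESS, False)
        c.setopt(pycurl.XFERINFOFUNCTION, make_progress_cb(label))
        c.perform()
    c.close()

# Placeholder URL and file name for a single shard.
download("https://example.com/model-00001-of-00002.gguf", "shard1.gguf", "shard 1")
```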
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 22
Summarized Issues:
- Performance Discrepancies in LLAMA-SERVER: The LLAMA-SERVER is experiencing performance issues, operating 40% slower than the LLAMA-CLI despite using identical parameters, including tensor offloading. Observations indicate full CPU usage for LLAMA-CLI but only 75-80% for LLAMA-SERVER, with equal GPU/VRAM usage.
- CUDA-Specific Bugs in RWKV Inference: A bug in the RWKV inference process using llama-parallel results in incorrect output when the language model head is offloaded to the GPU using CUDA. This issue does not occur with other backends like Metal and Vulkan, indicating a CUDA-specific problem.
- Bugs in JSON Schema Processing: The llama-server encounters an "Unrecognized Schema" exception when processing complex JSON schemas for tool calling, particularly those using `not` and `anyOf` options. This results in generation failures, while other inference providers can parse them successfully.
- Compilation and Build Issues: A compilation error occurs when building the `llama.cpp` project with CUDA support on Linux, where the GCC compiler version does not recognize a specific command-line option. This may be due to a mismatch between the detected CUDA version and the compiler version, leading to build failures.
- Memory and Resource Management Problems: A memory leak problem is reported when using CANN as the backend on an Ascend 910B4 hardware setup, with CPU memory usage increasing significantly over time. Additionally, storing a model on a hard drive causes a crash, whereas storing it on an SSD allows it to function correctly.
- Web UI and Input Handling Bugs: The web UI of the llama-server module on Linux exhibits erratic cursor placement when using specific command line parameters. Additionally, pasted content is not recognized as a valid prompt, requiring additional input to send a query.
- Feature Requests for Model Support and Enhancements: There are requests to add support for the moonshotai/Kimi-VL-A3B-Instruct model and to implement a flag to limit the maximum input image size for vision models. These enhancements aim to improve model efficiency and prevent out-of-memory errors.
- Bugs in Model Execution and Output: Bugs are reported in the execution of specific models, such as the DeepSeek-r1 32B_Q8 model on an Ascend 910B instance, where the system terminates due to an error related to a pre-allocated tensor. Additionally, the Mistral small 2506 model encounters GPU memory issues with large images.
- Discrepancies in Embedding Similarity Scores: A significant discrepancy in embedding similarity scores is observed when using the BGE-M3_FP16.gguf model with llama-server compared to the LlamaCppEmbeddings integration. This may be due to differences in normalization methods, prompting further investigation.
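For the embedding-similarity discrepancy summarized above, one practical debugging step is to fetch raw embeddings from the running server and compute cosine similarity and vector norms directly, which exposes whether the vectors are L2-normalized. The sketch below is an illustration under assumptions: it presumes llama-server was started with --embeddings and serves the OpenAI-compatible /v1/embeddings route, the host, model name, and texts are placeholders, and the LlamaCppEmbeddings side of the comparison is not shown.

```python
import numpy as np
import requests

# Placeholder host; assumes llama-server was started with --embeddings.
URL = "http://localhost:8080/v1/embeddings"

def get_embeddings(texts):
    """Fetch embeddings for a list of texts from the running server."""
    resp = requests.post(URL, json={"model": "bge-m3", "input": texts})
    resp.raise_for_status()
    return [np.asarray(d["embedding"], dtype=np.float32) for d in resp.json()["data"]]

def cosine(a, b):
    # Explicit normalization makes it obvious whether the raw vectors
    # were already L2-normalized before being returned.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = ["what is a panda?", "The giant panda is a bear species endemic to China."]
a, b = get_embeddings(texts)
print("norms:", np.linalg.norm(a), np.linalg.norm(b))  # ~1.0 if pre-normalized
print("cosine similarity:", cosine(a, b))
```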
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 16
Summarized Issues:
- Compile Bug in `vulkan-shaders-gen` Installation: The `vulkan-shaders-gen` binary is incorrectly installed into the build directory instead of the intended global directory, causing build failures on BSD systems. Discussions suggest removing the installation step to prevent this behavior.
- Qwen 3 Model Reasoning Display Issue: Users experienced a problem where the reasoning thought process was not displayed in the web UI despite being generated by the llama-cli. The issue was resolved by adding the `--reasoning-format none` setting.
- Embedding Request Errors in Llama-Server: A bug in the llama-server caused embedding requests to result in a 500 error with the message "Invalid input batch" when using the Qwen3-Embedding-0.6B-GGUF model. A suggested fix from a pull request resolved the error.
- Simultaneous Embeddings and Chat Completions in Llama-Server: The llama-server module did not support running both embeddings and chat completions simultaneously, requiring a workaround with the `--embeddings` flag. A subsequent fix allowed both functionalities to work together.
- Feature Request for `--no-warmup` Flag in `llama-bench`: A feature request was made to add a `--no-warmup` flag to the `llama-bench` tool, allowing users to disable the internal warm-up phase. This would improve efficiency in automated benchmarking pipelines by eliminating redundant operations.
- Incorrect Output in Qwen3-Embedding-0.6B Model: The GGUF model produced incorrect output compared to the original safetensors and ONNX models, potentially due to missing settings. A suggested solution involved manually adding an EOS token to the input.
- Conversion Failure of Llama 4 Scout Model: A failure occurred when converting the Llama 4 Scout model to the mmproj format due to a missing tensor. The issue was resolved by identifying a typo in the conversion script and adding the missing `+ ".weight"` to the tensor name.
- Conversion Issues with `convert_hf_to_gguf.py` Script: Users faced difficulties converting Hugging Face models to GGUF format due to an outdated `gguf` package and incorrect symlink usage. Updating the package to version 0.17.1 or higher resolved the issue (a version-check sketch follows this list).
- BLAS Backend Configuration Problem: Users encountered a problem where the system information did not reflect the BLAS configuration and defaulted to using the CPU. This occurred despite following the documentation and build steps correctly.
- Dynamic Backend Loading in Llama.cpp: The function `ggml_backend_reg_count()` returned 0 on Linux systems due to a change in how backends are dynamically loaded. This required explicit loading of backends using functions like `ggml_backend_load()` or `ggml_backend_load_all()`.
- RPC Connection Bug on Windows: A bug in the RPC feature caused the connection to close immediately upon client connection, preventing model transfer or execution. The issue was resolved by specifying the number of layers to offload to the RPC server using the `-ngl` option.
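Related to the `convert_hf_to_gguf.py` item above, a quick pre-flight check of the installed `gguf` package version can catch the outdated-package case before a long conversion run. This is a small illustrative sketch using Python's importlib.metadata; the 0.17.1 threshold is the version reported as resolving the issue.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_GGUF = (0, 17, 1)  # minimum version reported to resolve the conversion issue

try:
    installed = version("gguf")
except PackageNotFoundError:
    raise SystemExit("gguf is not installed; run: pip install -U gguf")

# Compare the leading numeric components only (good enough for a sanity check).
parts = tuple(int(p) for p in installed.split(".")[:3] if p.isdigit())
if parts < MIN_GGUF:
    raise SystemExit(
        f"gguf {installed} is older than 0.17.1; upgrade with: pip install -U gguf"
    )
print(f"gguf {installed} looks new enough for convert_hf_to_gguf.py")
```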
2.5 Issue Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. MODEL: Falcon-H1 support: This pull request introduces support for the Falcon-H1 architecture in the llama.cpp project, building upon previous contributions from other developers and pending the merging of related pull requests, while also addressing minor issues before finalizing the integration.
- URL: pull/14238
- Merged: No
- Associated Commits: 1f0fe, dceff, 2bfe9, aff96, e0491, fa358, 38913, 0e601, 273e7, 7d6cb, 2c77d, 87b97, 03d0e, 7a351, 8b15b, 5b8ec, 62b09, 038d9, 80551, 7d16e, 3bc71, 8d8f0, b4e9c, 1ee6c, c9ecf, 35d06, cf4f0, 6def5, 79199, 94c3d, 929fe, d55b0, e94f3, 9864b, 2fa5f, 757aa, 0b6f6, 4dff8, 18224, 4782a, 6a397, 45217, 263d9, a7bd7, af916, 1ee27, 781d5, a3f58, c8c9a, e7750, d327a, ee9b3, c3c0e, df695, 9520a, 6361a, b1150, 86904, fa6db, 42459, 74ad4, b3c94, f3e34, 6253c, 0d80e, 0385f, c5f9f, 665a3, 6497b, acd55, cc40c, f5656, 17abb, b63be, 4c373, 0ffbe, 936da, fb220, 40e7a, 309d7, eaaf4, 58111, 84e30, ae7a0, 1e490, ddc85, 368c9, b82a2, 8b7f6, 3ee22, bd37f, 2e5e4, a42b9, beddd, 8d240, 95b66, e16aa, 9d30f, aff19, 9150f, 11857, ebfcb, 82f59, 837be, 41f25
2. ggml-cpu: enable IBM NNPA Vector Intrinsics: This pull request introduces support for the IBM NNPA instruction set on IBM z16 mainframes and the s390x platform, primarily focusing on enhancing the performance of FP16 to FP32 and FP32 to FP16 data conversions, resulting in a significant performance improvement of approximately 31.21% for F16 token generation, and it supersedes a previous implementation (#14303) that was incorrect.
- URL: pull/14317
- Merged: No
- Associated Commits: 58018, 45a4c, ebf9f, ffe29, 0394a, 48b82, d9cc6, 94f10, 8f3a5, 575ea, 93304, ebc1d, 6a25f, f9f6c, 6d507, 8312a, 27b4c, e0f8f, bb934, 5424d, 4f017, 27131, 946c7, 433d5, 54811, e12e9, 7413d, 373fa, 4621a, 987d1, 8ef51, f1b1d, 1547e, 0e571, 4ad6e, 81298, e7910, 157f8, 48df9, 3004a, 1cacd, fadc1, ed76f, 84593, 72c91, a91c3, ba351, 18d79, 781c2, 263b8, 04a39, c8b3b, ebb84, 3ec0b, e43dc, 5c9b0, 1b4db, 46227, 72965, 07de5, 489cd, 5004e
3. kv-cache : use ggml_set_rows: This pull request introduces the use of the `ggml_set_rows()` function to update the KV cache more efficiently by making the graph static with respect to the KV cells' `head` offset and relaxing the requirement for contiguous KV slots, with the new implementation being conditionally enabled via the `LLAMA_SET_ROWS` environment variable until it is fully supported by all backends.
- URL: pull/14285
- Merged: No
Other Open Pull Requests
- SmolLM3 Addition: This topic covers the introduction of SmolLM3 to the project, which includes several commits that initialize the feature, refactor a model class, fix conversion errors, and update the graph. The pull request references a related pull request from the Hugging Face Transformers repository.
- Llama-bench Enhancements: Enhancements to the llama-bench tool include separate timing outputs for prompt processing and token generation, and a new feature to specify the number of threads per batch. These changes provide more detailed performance metrics and improved configurability.
- CUDA Kernel and Operation Improvements: The `mul_mat_vec` CUDA kernels are extended to support larger batch sizes, enhancing speculative decoding and batched inference with significant performance improvements. Additionally, a new mean operation is introduced to the CUDA implementation, refactoring the sum_rows function for normalization and demonstrating speedup on an RTX 3090 GPU.
- NUMA Optimization: The GGML_NUMA_MIGRATE feature is introduced to optimize cross-NUMA operation computation, enhancing the ggml_barrier for NUMA awareness. This includes a build option and command-line option for page migration, resulting in significant performance improvements on systems with multiple NUMA nodes.
- Backend Device Selection: A feature is introduced allowing users to select the backend device for the Clip/vision encoder by setting the `MTMD_BACKEND_DEVICE` environment variable. This provides flexibility in choosing the desired hardware for processing.
- Windows Null Pointer Bug Fix: A critical null pointer bug on the Windows platform within the `ggml-sycl` module is addressed by adding a null pointer check for tensor data. This improves memory operations for better stability and performance, with additional enhancements in error handling and logging.
- End-Of-Generation Bias: A configurable escalating End-Of-Generation (EOG) bias is introduced, increasing with each token generated after a specified threshold. New command line options are added to control when this bias is applied, based on token count.
- Server PID File Option: A new option is introduced to the server for creating a pidfile, allowing the process ID to be tracked. This is detailed in the commit by Eric Curtin.
- OpenCL Context Management: A reference counter is introduced to the `ggml_backend_opencl_context` for accurate backend reference management in multimodal models. Profiling is refactored, and the `enqueue_ndrange_kernel` function is added for kernel launching.
- RPATH Adjustment for Binaries: The issue of Linux and macOS binaries having absolute RPATHs is addressed by setting the RPATH to "$ORIGIN" for Linux and "@loader_path" for macOS. This ensures executables and dynamic libraries can correctly locate their dependencies.
- Vulkan and CONV_2D Enhancements: A new CONV_2D operation and a direct GEMM Vulkan implementation are introduced, enabling efficient 2D convolution computations. This significantly improves performance on Vulkan backends, particularly on NVIDIA GeForce RTX 2060 SUPER.
- C++17 Transition: The project transitions to C++17, replacing a polyfill with `std::string_view`. This includes a commit with the necessary changes.
- Conv2D CPU Version: A CPU version of the Conv2D operation is introduced, currently in draft status due to inconsistent performance compared to the im2col method. The aim is to optimize it for potential use in issue #14316.
- Matrix Row Copy Function: The function `ggml_set_rows(a, b, c)` is introduced, facilitating the copying of rows from matrix 'b' into matrix 'a' based on indices in 'c'; a sketch of the semantics follows this list. This is referenced in issue #8366.
- Compilation Warning Fix: A simple compilation warning in the llama.cpp project is fixed. The change is contained in the commit with SHA a4988383056bb38b3d100df0ceecd5036670bacf and is currently not merged.
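To make the row-copy semantics of the new ggml_set_rows(a, b, c) function concrete, the effect described above can be mimicked in a few lines of NumPy. This is only an illustration of the operation's behavior on matrices, not the actual ggml C API, its full argument list, or its tensor types.

```python
import numpy as np

def set_rows(a, b, c):
    """Copy row b[i] into row a[c[i]] for each i -- the effect described
    for ggml_set_rows; returns a new array rather than mutating a."""
    out = a.copy()
    out[np.asarray(c, dtype=np.int64)] = b
    return out

a = np.zeros((5, 4), dtype=np.float32)             # destination matrix
b = np.arange(8, dtype=np.float32).reshape(2, 4)   # two source rows
c = [3, 0]                                         # destination row indices

print(set_rows(a, b, c))
# Row 3 of the result equals b[0], row 0 equals b[1]; other rows stay zero.
```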
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 58
Key Closed Pull Requests
1. Add security vulnerability assessment for llama.cpp: This pull request introduces a comprehensive security vulnerability assessment for the llama.cpp repository, focusing on identifying and documenting potential security issues within the HTTP server implementation and core components, using a systematic methodology to analyze security-critical files and enhance the project's security posture.
- URL: pull/14213
- Merged: No
- Associated Commits: 759e3, b69f1, a4090, be023, b7a17, fb1ca, 33983, b4489, 797f2, 42158, 0d5c7, cf4cb, eb0f5, 2aa77, c7653, 5fbfe, 8e186, d643b, edbf4, a4e89, 6b56a, d394a, 5be24, cc74d, ab863, 79799, 8a1d2, 3079e, a127f, c10ed, 1dcd0, e16c4, faaaf, 9ecf3, 8a2af, d13d0, a70a8, b7753, ffd0e, c3a26, 4c328, 25946, 2bd1b, 17fc8, a2d02, f5cd2, 515fd, 4032c, d785f, a08c1, 40aaa, c5082, de2ef, aa50b, 2f099, e121e, 2d38b, fef69, 9012e, 22229, 79c13, f1384, d74e9, 88c12, 03f58, 6f180, 4265a, a26c4, cdf94, 4f81b, f9cd6, 81713, 952f3, 7fe03, 72b09, bc583, 05f6a, a8ea0, 1c49c, f3101, 34b7c, bef81, 1701d, a3c30, 1e865, 26b79, a6824, f7873, a3938, c962a, aa6df, e0e3a, d98f2, 10961, 763d0, 53ae3, 1b8fb, 6385b, 5ca82, 66c92, dd8ba, 21fcc, 54a2c, 2b131, e83ba, ec9e0, 2c90d, 291f2, 07e43, db387, 53f92, b49a8, df0c0, dd665, b47ab, e562e, eb394, 12d01, 51fa7, 3f55f, c7e0a, 3600c, 803f8, e1589, b3a89, 053b1, 0fc16, c0462, 87263, fedf0, 6eba7, a7b8d, af6f9, d3372, 10800, f3a4b, e57bb, c496f, 5e1c3, 7675c, 66344, 093e3, bfd32, c9bbc, 5582c, ea394, 36375, bfb1e, 71e74, ea143, 7e00e, e0e80, 0b4be, 3ac67, 48254, 2589a, 3e63a, 0d398, 5a8ae, 9e31b, 9f47f, d01d1, 3a077, 7f37b, 146b8, 669c1, 1caae, d17a8, 487a5, 745aa, 0974a, 228f3, 5787b, 247e5, 056eb, 91a8e, b460d, 87d34, dc062, e21d2, 201b3, 8f47e, f470b, 7f4fb, 40cbf, 1f63e, 1a3b5, b8e21, 2bb04, 97340, b7ce1, ae92c, 3a12d, 652b7, 3678b, 55f6b, dad5c, 4c763, 1f7d5, 7ae29, 2baf0, 89a18, 7781e, bd248, cc66a, d4e0d, 53280, 2e89f, a20b2, 95965, e2c0b, c3ee4, f6e1a, 7d516, a681b, ed52f, c33fe, 09cf2, c6128, 0889e, ffad0, d714d, cc8d0, b7cc7, 60c66, 26ff3, 80709, 3cfbb, 40643, fb85a, 2e42b, 3cb20, 00ba7, b9912, c311a, 9ae41, 5fce5, 2c2ca, e54b3, 30e5b, cd355, d7da8, 3555b, c89c2, 4ad24, 0bf49, 3ba0d, d3e64, 7d6d9, ad590
2. pr against the wrong repo ...: This pull request, titled "pr against the wrong repo," was not merged and includes multiple commits aimed at reducing logging verbosity, adding helper scripts, integrating a remoting frontend and backend build system, and making various improvements and refactorings to the ggml project, specifically focusing on enhancing the remoting capabilities and support for virtual GPU operations.
- URL: pull/14284
- Merged: No
- Associated Commits: 74eae, 41846, ee79e, 1beda, cd541, ece33, b4b90, 53b42, 5049b, ffa65, 3ba78, a87e3, 3d7b1, 4419c, a25b6, 5febf, 2ee2a, 847a2, 3270c, d8d3c, 151c0, 52d8e, d1185, a5422, 78d16, e3973, 022dd, 9bde8, b5ac3, 0cdcd, f15fe, 0c264, 938ba, 0582b, be5f5, 60bac, abd17, 74772, 00be4, 3dd26, 11f65, f9a01, 2461d, 9ba6e, 9d523, 95ccc, ad578, 1d9d4, 4a687, 0b77f, 8c81f, 319af, 73ed5, 88e8e, 43af3, 25f8d, db107, 248f6, 2e70a, 4d7d6, 6f057, 6f396, d40a3, b815c, b24fb, 3a201, c5608, 14292, 9913b, ede86, 1927c, d3541, 49bb0, 8edd5, 372e6, 1a826, 6ce80, 6fc0c, c927b, f29aa, 17dd2, 14f32, e80e4, 31c0c, 03935, 67d40, c5d44, 83596, 6b4bc, 9ab69, da8bd, 55ce3, 55962, 4f9a2, 4fa0b, 484d5, 609c7, 6ea6a, aac3c, b4837, ecb7a, 3f362, b2067, f0127, 11e2e, efe68, 3769b, 3a730, 7ef07, 6d985, 50326, eeba6, 66b34, 5b5ff, 94fb1, 38b13, ade08, 5c93f, af7ca, 65b92, 34e68, 49213, 1d4bb, 67d00, b511b, af1aa, 1f2c4, a6186, 61a6b
3. feature attn with state: This pull request introduces a series of enhancements and new features to the ggml library, including the implementation of various benchmarking scripts, the addition of T-MAC and QLUTATTN quantization support, improvements in flash attention and mixed precision KV cache functionalities, as well as updates to documentation and testing frameworks, all aimed at optimizing performance and expanding the library's capabilities.
- URL: pull/14299
- Merged: No
- Associated Commits: c5bcd, b5988, 8fb4c, 9ebdd, 794d1, 12bcd, a5e38, a6a9e, 87a1b, 9cc44, ef75f, f89ac, 6c555, 2823d, eca77, b6a6d, 7160c, 3a329, c011e, 85f5f, b9ce3, a4aaf, c7d81, 554b4, 4a258, e7432, 63901, 86c52, 00731, b2744, 12ee6, ca52e, 84742, 8b2e2, 4ba6a, c699a, 1847b, b9a9f, 5db11, f715a, 52682, 76fa9, 2c8f2, 0de95, ff5e9, c5130, afa2b, 2ece7, 7a59d, e889f, 395a4, 47439, f014b, c9bf8, e1f99, 30b9d, 6f224, d5062, 70ed6, 05281, 62fc0, a3896, d447a, 3e9e6, 93fba, 7e847, a4899, ebe5b, 3c820, 8912d, b7459, d34f3, c342f, 86a48, 46ffe, bd2f7, dc1b4, 0dbb2, a4a42, 5f4ad, 0cb4d, 26ffe, 42de4, 2e32a, 4587f, 7f031, bc1dd, 985f7, a525b, 104e5
Other Closed Pull Requests
- IBM NNPA and SIMD Enhancements on s390x: The pull requests focus on enhancing performance on the s390x platform by introducing support for the IBM NNPA instruction set and integrating the SIMD instruction set into Llamafile. These changes result in significant performance improvements for FP16 to FP32 conversions and prompt processing, verified through tests on IBM mainframes.
- SYCL and CUDA Kernel Improvements: These pull requests introduce performance optimizations by using the `sycl_ext_oneapi_enqueue_functions` extension and adding CUDA implementations for convolution operations. The changes improve performance across various models and configurations, with successful tests confirming the enhancements.
- Embedding and Token Handling Enhancements: The pull requests rework embeddings logic and address separator token handling in sequence classification models. They introduce support for embeddings with causal attention and a new `add_sep_token` state, improving model functionality and flexibility.
- Documentation and Build Synchronization: These pull requests enhance documentation for the s390x architecture and synchronize the ggml project by aligning build outputs across platforms. They include updates to model conversion steps and performance optimizations, improving clarity and consistency.
- Error Handling and API Improvements: The pull requests address issues related to silent overflows and optional library dependencies, improving error handling and API documentation. They ensure compatibility and prevent errors by updating logic and versioning strategies.
- Code Refactoring and Optimization: These pull requests focus on refactoring code for better performance and consistency, including renaming interfaces and optimizing output ID handling. The changes improve code clarity and efficiency, as evidenced by successful tests and commits.
- Quantization and Metadata Handling: The pull requests address issues in the quantization process by correcting metadata parameter types and allowing key-value pair overrides. These changes ensure compatibility and prevent errors during model loading.
- Miscellaneous Enhancements: Various pull requests introduce improvements such as adding a new "mean kernel," fixing compile warnings, and refactoring preprocessing logic for image handling. These changes enhance functionality and performance across different components of the project.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 135 | 16 | 0 | 55 |
| CISC | 60 | 11 | 0 | 44 |
| taronaeo | 110 | 4 | 0 | 0 |
| gabe-l-hart | 59 | 0 | 1 | 1 |
| ngxson | 23 | 3 | 0 | 23 |
| kpouget | 48 | 1 | 0 | 0 |
| am17an | 27 | 6 | 0 | 5 |
| Zijie-Tian | 32 | 1 | 0 | 0 |
| slaren | 23 | 2 | 0 | 5 |
| JohannesGaessler | 10 | 2 | 0 | 16 |