Weekly GitHub Report for Llama.cpp: April 07, 2025 - April 14, 2025 (14:19:23)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The current version was released on March 29, 2025. The data does not include release notes for it, so specific changes, notable highlights, or trends cannot be identified this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: ggml_vulkan: Device memory allocation of size N failed with ub > 4096 and c > 4096 and b > 4096: This issue involves a bug in the GGML Vulkan backend where attempting to allocate device memory for running llama-server with specific parameters results in a crash, despite sufficient GPU memory being available. The problem occurs when using the parameters `-ub 8192 -b 8192 -c 8192`, which leads to a failed memory allocation error, although the model loads successfully when any of these parameters is reduced to 4096.
- The comments discuss potential reasons for the memory allocation failure, suggesting that the unusually high `-ub` value might be causing the issue. It is noted that embedding models require the entire context to fit within a single micro-batch due to non-causal attention, which may lead to large memory allocations. A workaround of reducing the `-ub` or batch size is proposed, and it is confirmed that this resolves the issue. Additionally, it is mentioned that the problem might be specific to AMD hardware on Windows, and a feature like Flash Attention could help, but it is currently limited to beta Nvidia drivers.
- Number of comments this week: 12
- Misc. bug: The model's reasoning performance has significantly decreased despite using different versions of the same model architecture, identical parameters, and the same set of questions.: This issue highlights a significant decrease in the reasoning performance of a model across different versions of the same model architecture, despite identical parameters and the same set of questions. The problem is particularly evident when comparing versions b4756 and b4759 of the llama.cpp server, where the latter shows a noticeable degradation in performance, consuming more tokens and failing to solve tasks as efficiently as the former.
- The comments discuss potential issues with the flash attention update in version b4759, with users conducting tests on different hardware setups, including RTX 4090 GPUs with CUDA. Some users confirm inconsistent results across versions when flash attention is enabled, while others note that disabling it leads to consistent performance. There is speculation about precision issues introduced in recent updates, and some users provide perplexity results to track changes across versions, suggesting that small backend math changes might be exposing model instability.
- Number of comments this week: 9
- Misc. bug: LLaVa Projector Possibly Wrong Tensor Order: This issue reports a potential bug in the LLaVa Projector of the llama.cpp project, where the `v.patch_embd.weight` tensor might be written in the wrong order, potentially affecting performance. The user suspects that the tensor shape should be `1152, 3, 14, 14` instead of `14, 14, 3, 1152`, as it is the only tensor with a different order compared to the others when inspected with the `gguf_dump.py` script.
- The comments discuss whether the tensor order is incorrect, with one user suggesting that the tensor is written correctly in row-major order, as per GGML's guidelines, and another user clarifying that the shape is correct and does not affect performance. The conversation includes explanations of row-major and column-major orders (see the sketch after this list), and it is concluded that the tensor shape is appropriate for the intended operations.
- Number of comments this week: 6
- OpenCL: Performance comparison depending on gpu_offloads: This issue discusses the unexpected performance results when using different levels of GPU offloads with the OpenCL backend on a QCS8550 device, specifically when running a model from Hugging Face. The user expected better performance with more GPU offloads, but the benchmark results did not align with these expectations, prompting further investigation and testing with different configurations.
- The comments reveal a discussion about the performance issues with the Q4_K_M model, suggesting trying the Q4_0 model instead, which is optimized for better performance. It is noted that the Adreno 740 should work fine, and the user confirms improved performance with the Q4_0 model, especially with more GPU layers, after following the advice given.
- Number of comments this week: 5
- Eval bug: LLaVa convert_image_encoder_to_gguf.py fails to byteswap v.head.ffn_up.bias tensor on Big-Endian system: This issue describes a bug in the Python script `convert_image_encoder_to_gguf.py` used in the LLaVa project, where the `v.head.ffn_up.bias` tensor is not correctly byteswapped on Big-Endian systems, causing the generated model file to fail on such systems. The problem is specific to the Python code and does not affect the C/C++ code; it arises when the model file is generated on a Big-Endian machine, while the same file generated on a Little-Endian machine works correctly on Big-Endian systems.
- The comments discuss a patch that unexpectedly resolves the issue by conditionally applying a byteswap, questioning whether tensors are byteswapped before being added. Suggestions include testing with other scripts and models, and checking whether the tensor appears twice in the original file, which could cause double byteswapping.
- Number of comments this week: 4
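The row-major versus column-major point raised in the LLaVa projector discussion above can be shown with a small NumPy sketch (illustrative only; the shapes come from the issue, and this is not the GGUF writer's actual code). GGML and `gguf_dump.py` list dimensions starting with the fastest-varying axis, so a row-major tensor of shape (1152, 3, 14, 14) is reported as 14, 14, 3, 1152 while the underlying bytes are unchanged:

```python
import numpy as np

# A projector weight as PyTorch/NumPy report it: (out_features, channels, kh, kw).
w = np.zeros((1152, 3, 14, 14), dtype=np.float32)
assert w.flags["C_CONTIGUOUS"]           # row-major: the last axis varies fastest in memory

# GGML lists dimensions innermost-first, so the same buffer is described with the
# axes reversed; transposing gives exactly that view without copying any data.
w_ggml_view = w.T                        # shape (14, 14, 3, 1152)
assert w_ggml_view.flags["F_CONTIGUOUS"]
assert np.shares_memory(w, w_ggml_view)  # same memory, different reporting convention
print(w.shape, "->", w_ggml_view.shape)
```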
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is generating a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Feature Request: Task Cancellation on Client Disconnection: This issue is about a feature request to enhance the current embedding server setup by implementing task cancellation when a client disconnects, as the existing system continues processing queued tasks even after a client cancels a request, leading to inefficiencies and potential server overload. The proposed modification aims to terminate task processing upon request cancellation to prevent delays in processing subsequent requests, thereby improving server performance and resource management.
- Question: How to generate an MPS gputrace: This issue involves a query about generating a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a project at Hugging Face. The user is seeking guidance or documentation on how to produce this specific type of debugger output, similar to what is available through the Xcode Metal Debugger.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process (a minimal sketch follows this list).
- Eval bug: [/SOLUTION] visible in granite 8B: This issue pertains to a bug in the evaluation process of the granite 8B model, where the output unexpectedly includes the text "[/SOLUTION]" in the web UI interface. The problem occurs when interacting with the model using a specific setup involving an NVIDIA GeForce RTX 3090 GPU and the CUDA backend on a Linux operating system.
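As a rough illustration of the option being discussed, sketched here in Python with pycurl rather than the project's actual C++ downloader: per-transfer progress requires clearing `NOPROGRESS` and installing a transfer-info callback, which is also the natural place to keep parallel downloads from overwriting each other's status lines. The `label` argument and print formatting are assumptions for the sketch.

```python
import pycurl

def make_progress_cb(label):
    # Called periodically by libcurl with byte counts for this transfer.
    def cb(dl_total, dl_now, ul_total, ul_now):
        if dl_total > 0:
            print(f"{label}: {dl_now}/{dl_total} bytes", end="\r")
        return 0  # returning a non-zero value aborts the transfer
    return cb

def download(url, path, label):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOPROGRESS, 0)  # 0 = enable progress reporting
    c.setopt(pycurl.XFERINFOFUNCTION, make_progress_cb(label))
    with open(path, "wb") as f:
        c.setopt(pycurl.WRITEDATA, f)
        c.perform()
    c.close()
```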
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 31
Summarized Issues:
- Metric Naming Conventions in Docker Version: The llama.cpp server's Docker version has an issue with metric names using colons, which violates Prometheus naming conventions. It is suggested to replace colons with underscores to comply with these guidelines (see the sketch after this list).
- Performance Issues with GPU Offloading: Unexpected performance results occur when using different levels of GPU offloading with OpenCL on a Qualcomm Snapdragon platform. Increasing GPU offloads does not improve performance as expected, particularly with the EXAONE-3.5-2.4B-Instruct-Q4_K_M model.
- Model Performance and Reasoning Decrease: There is a significant decrease in reasoning performance of the llama.cpp model when using different versions of the same architecture. Version b4759 performs worse than b4756, with increased token consumption and inconsistent results linked to CUDA changes.
- Memory Allocation and Server Crashes: Memory allocation failures occur in the Vulkan backend of the llama-server when running models with large parameters, leading to crashes on AMD hardware. This is linked to large kv buffer sizes and non-causal attention requirements.
- Tokenizer and Conversion Script Errors: Errors occur when using the `convert_hf_to_gguf.py` script due to tokenizer mismatches and missing attributes. These issues prevent successful conversion of models from Hugging Face to GGUF format.
- Compilation Errors with CUDA and Vulkan: Compilation errors occur when building the llama.cpp project with CUDA and Vulkan backends. These errors are due to undefined identifiers and multiple definition errors, requiring specific flags or environment variables to resolve.
- Performance Bugs with Unused GPUs: A performance bug in a CUDA-based system occurs when an unused GPU card slows down token generation speed. It is suggested that unused GPUs should not be utilized by default to prevent such slowdowns.
- Attribute and Assertion Errors in Scripts: Various scripts encounter AttributeErrors and assertion failures due to missing attributes or incorrect parameter settings. These issues cause failures in model loading and execution.
- CUDA Implementation and Performance Enhancements: Modifications to the CUDA implementation of the `mul_mat_id` function are suggested to enhance performance by avoiding data transfer back to the host. This change is important for the growing adoption of the Mixture of Experts (MoE) architecture.
- Evaluation and Tokenization Issues: Users encounter issues with the `llama-perplexity` tool and `lm-evaluation-harness`, where tasks do not fit within the context window or tokenizer names are not implemented. Adjustments to parameters or guidance on implementation are needed.
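For the metric-naming item above, a minimal illustrative sketch of mapping colon-style names onto Prometheus-friendly ones (not the server's actual code; the example metric names are written in the style discussed and should be treated as assumptions):

```python
import re

# Exposed Prometheus metric names are conventionally [a-zA-Z_][a-zA-Z0-9_]*;
# colons are reserved for recording rules, so they are replaced here.
_INVALID = re.compile(r"[^A-Za-z0-9_]")

def sanitize_metric_name(name: str) -> str:
    name = _INVALID.sub("_", name) or "_"
    if not re.match(r"[A-Za-z_]", name[0]):
        name = "_" + name
    return name

print(sanitize_metric_name("llamacpp:prompt_tokens_total"))    # llamacpp_prompt_tokens_total
print(sanitize_metric_name("llamacpp:tokens_predicted_total")) # llamacpp_tokens_predicted_total
```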
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 10
Summarized Issues:
- GPU Memory Management: A user sought assistance to locate the 'kv buffer' on the CPU instead of the GPU to reduce GPU memory usage. The issue was resolved by disabling KV offload, which successfully addressed the problem.
- Performance Degradation in RDNA4 Prefill: A significant performance degradation was reported in the RDNA4 prefill process when using the CUBLAS_COMPUTE_32F setting. The performance was almost halved compared to previous configurations, and clarification was sought on whether this is a known issue with the latest hipBLASLt and if a fix is anticipated.
- Integration of Llama 4: A feature request was made for integrating the newly released Llama 4, a multimodal large language model (LLM), into the ggml-org/llama.cpp project. The request highlighted its potential benefits and challenges in implementation, particularly regarding the support for chunked attention masks.
- Bugs in Multimodal Inference and Voice Generation: A bug in `llama-gemma3-cli.exe` on Windows caused the Gemma 3 Vision GGUF model to fail during multimodal inference, traced to a missing `general.file_type` in the `mmproj` file. Additionally, the `llama-tts` tool crashed when generating voice using OuteTTS version 0.3 or 1.0, but functioned correctly with version 0.2.
- Random Crashes in Embedding Model: The llama.cpp server randomly crashed with the error message `GGML_ASSERT(q_to_vec_dot && "fattn: unsupported K-type") failed` when running an embedding model on a Windows system using Vulkan. The issue was related to the CPU backend fallback when Flash Attention is not supported on AMD GPUs.
- Quantization and Compilation Failures: A failure occurred in the quantization process of a model using the `llama-quantize` tool due to an unrecognized model architecture labeled as 'granite'. Additionally, build failures were reported on the Power architecture (ppc64le) and the s390x architecture due to errors in simd-mappings and a regression introduced by a specific commit.
- Token Omission in Converted Models: The absence of a specific token with ID 262144, labeled as `<image_soft_token>`, was noted in the converted gguf models for the Gemma3-it token map. This token was intentionally omitted during conversion, as it is designed to be replaced by image tokens (see the sketch after this list).
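A toy sketch of the "placeholder replaced by image tokens" idea mentioned above (illustrative only; apart from the 262144 id quoted in the issue, the other ids, the token count, and the function name are made up and not Gemma 3's actual pipeline):

```python
IMAGE_SOFT_TOKEN_ID = 262144   # placeholder id omitted from the converted token map
IMAGE_EMBED_TOKEN_ID = 262145  # hypothetical stand-in id for one image patch token
N_IMAGE_TOKENS = 256           # assumed number of patch tokens per image

def expand_image_placeholders(token_ids):
    # Replace each placeholder occurrence with the run of per-patch image tokens.
    out = []
    for tok in token_ids:
        if tok == IMAGE_SOFT_TOKEN_ID:
            out.extend([IMAGE_EMBED_TOKEN_ID] * N_IMAGE_TOKENS)
        else:
            out.append(tok)
    return out

print(len(expand_image_placeholders([1, IMAGE_SOFT_TOKEN_ID, 2])))  # 258
```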
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 28
Key Open Pull Requests
1. Llama-3_1-Nemotron-Ultra-253B-v1 support: This pull request introduces support for the new Llama-3_1-Nemotron-Ultra-253B-v1 model by modifying the `convert_hf_to_gguf.py` script and the `src/llama-model.*` files to accommodate a new type of layer, referred to as a dummy layer, which lacks attention and feed-forward network components. It also requests community assistance to test the changes on the 253B model due to resource constraints.
- URL: pull/12843
- Merged: No
- Associated Commits: ecad9, 12ade, 643e5, e68c7, 6a480, f9a1c, c1736, 984ff, 909a7, 4dad2, cc615, 0ac08, 80af2, 2a260, f8f67, 1600d, bd3d4
2. DeepSeek V2/V3 MLA implementation: This pull request introduces the implementation of DeepSeek V2/V3 MLA (Multi-head Latent Attention) in the `llama.cpp` project, ensuring backward compatibility with legacy non-MLA GGUF files while adding context-shifting capabilities and optimizing the handling of MQA models by converting 3D batched matrix multiplication into 2D matrix multiplication (illustrated in the sketch after these key pull requests), with significant contributions from @fairydreaming and additional optimizations for CUDA performance.
- URL: pull/12801
- Merged: No
- Associated Commits: 8c024, ddab5, 76125, fed66, c4494, 2a4e1, 77fe5, e2153, 815f4, 57788, 77ad5, 5d037, 638b0
3. metal : add memory pool for temp allocs: This pull request introduces a memory pool for temporary allocations in the Metal backend of the project, aiming to facilitate the allocation of temporary buffers for storing intermediate results during composite operations, similar to the existing functionality in the CUDA backend. Tasks include creating and managing `MTLHeap` objects, dynamically resizing heaps, and ensuring memory efficiency and leak prevention.
- URL: pull/12850
- Merged: No
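As background for the MQA optimization mentioned in the MLA pull request above, here is a small NumPy sketch (illustrative only, with made-up dimensions; it is not the PR's actual kernels) of why a 3D batched matrix multiplication against a weight shared across heads can be flattened into a single 2D multiplication:

```python
import numpy as np

n_head, n_tokens, d_k, d_v = 8, 16, 64, 128
rng = np.random.default_rng(0)

q = rng.standard_normal((n_head, n_tokens, d_k))  # per-head queries
w = rng.standard_normal((d_k, d_v))               # weight shared by every head (MQA-style)

# 3D batched matmul: one (n_tokens, d_k) @ (d_k, d_v) product per head.
out_batched = q @ w                               # shape (n_head, n_tokens, d_v)

# Because w is shared, the head and token axes can be merged into one 2D matmul,
# which is usually friendlier to GEMM libraries than many small batched calls.
out_2d = (q.reshape(n_head * n_tokens, d_k) @ w).reshape(n_head, n_tokens, d_v)

assert np.allclose(out_batched, out_2d)
```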
Other Open Pull Requests
- OpenCL and GPU Compatibility Enhancements: This topic covers improvements to the OpenCL backend and GPU compatibility, including splitting the `ggml-opencl.cl` file for better compatibility with older Adreno GPUs and addressing a logging issue in OpenCL profiling. These changes ensure the OpenCL backend functions with newer compilers and correct the output of local_size dimensions for accurate profiling.
- CI and Build Process Improvements: This topic addresses synchronization issues in the Linux cross-compile CI setup, which were causing build failures and delays. The changes include continuing on error during setup and caching packages in the toolchain to streamline the build process.
- Date Handling and Template Injection: This topic involves injecting a `date_string` into the llama 3.x template and fixing date handling for firefunction v2. The pull request ensures the date appears correctly in prompts by passing the current time to Minja and includes an end-to-end test for verification.
- Tensor Writing and Conversion Speed: This topic introduces parallel writing of tensors using a thread pool to enhance conversion speed (see the sketch after this list). It adds a `--threads` argument to specify the number of threads and addresses compatibility with multithreaded remote tensor fetching.
- Performance Optimization: This topic focuses on optimizing the ROPE operator to enhance inference performance by approximately 10% through reducing memory allocations and eliminating redundant operations. It also includes code style adjustments and a fix for a precision issue.
- Image Token Access and API Enhancements: This topic introduces new methods to access `mtmd_image_tokens`, including enhancements like additional API functionalities and the ability to calculate image hashes. It also suggests future improvements for handling image preprocessing.
- Regex Support and String Functionality: This topic introduces a new feature called `common_regex` for partial regex support by implementing a method to reverse patterns for matching. It includes updates to move certain string functions to a common area and add support for partial matches.
- Intel GPU and SYCL Enhancements: This topic covers the implementation of a reordered Q4_0 MMVQ for Intel GPUs using SYCL, enhancing text generation performance. It includes refactoring vecdot traits and adding a new entry point for reordered MMVQ vecdots.
- GEMM Operation and AVX512 Implementation: This topic introduces an AVX512 implementation of the GEMM operation for the Q4_Kx8 model, showing significant performance improvements in prompt processing. The implementation demonstrates a 25.70% speedup in the Q4_K_M model and a 33.52% speedup in the Q4_K_S model.
- Byteswapping and Tensor Shape Restoration: This topic focuses on improving byteswapping functionality in the gguf-py module by implementing byteswapping for the Q4_0 format. It restores original tensor shapes post-byteswap for potential future use and reorganizes the byteswapping code.
- Asynchronous Task Submission and Memory Allocation: This topic introduces asynchronous task submission to the CANN module and a new method for allocating memory from the CANN memory pool. It enhances code efficiency and allows users to select the allocation method via an environment variable.
- SSE 4.2 and x64 Architecture Support: This topic introduces a new variant of the ggml library supporting SSE 4.2 and x64 base architecture for CPUs lacking AVX support. It addresses issue #12866 by providing compatibility for older CPU architectures.
- Vision Support and Integration: This topic introduces experimental vision support to the server by integrating `libmtmd` into `server.cpp`, currently supporting only GEMMA 3. It aims to test the integration of vision models and proposes a new input structure for managing text and image tokens.
- SYCL Implementation and Compatibility Fixes: This topic addresses a necessary fix for the im2col function in the SYCL implementation to ensure compatibility with Gemma 3 vision. It includes restoring local workgroup size adjustments for handling large inputs.
- Web UI and Conversation Management: This topic introduces a "Clear All Conversations" feature to the llama-server web UI, allowing users to delete all chat history from IndexedDB. It includes a new button in the sidebar with a confirmation dialog and necessary backend logic.
- Simple API Implementation: This topic proposes a minimal implementation of a simple API for the llama.cpp project, focusing on basic functionalities such as chat, inference, and grammar. It is detailed in the commit with SHA 0ea5f86845c9724332c4fd704a37812dc4ab0016.
- Llama-bench Tool Enhancements: This topic enhances the llama-bench tool by introducing separate measurements for end-to-end, prompt processing, and token generation throughput. It provides more accurate performance metrics and corrects the previously incorrect throughput calculation.
- ROPE Vision Kernel and SYCL Framework: This topic introduces a new ROPE vision kernel to the SYCL framework, essential for Vision Transformers (ViTs). It includes image projectors for Vision-Language Models (VLMs) with all backend operations tests passing successfully.
- Standard Input and Text Prompt Handling: This topic introduces a feature to the llama-tts project that allows reading input from standard input (stdin) when a text prompt is missing. It enables functionalities like piping text and input redirection while ensuring the program exits with an error if the text prompt is empty.
- CUDA Graphs and Compilation Error Fixes: This topic addresses issue #12798 by disabling CUDA graphs for unsupported DUP and CONT node types. It also addresses a compilation error related to CUDA in the llama.cpp project.
- CPU Model Retrieval on FreeBSD: This topic enhances the functionality of `ggml_backend_cpu_device_context` by enabling it to retrieve the CPU model on FreeBSD systems, as indicated by the commit titled "Get CPU model in ggml_backend_cpu_device_context on FreeBSD."
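For the parallel tensor-writing item above, a minimal Python sketch of the general thread-pool pattern (illustrative only; `serialize_tensor`, `write_all`, and the `tensors` iterable are hypothetical names, not the actual gguf-py changes):

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

def serialize_tensor(name_and_data):
    # Hypothetical stand-in for the per-tensor work (convert/byteswap/pack bytes).
    name, data = name_and_data
    return name, data.tobytes()

def write_all(tensors, out_file, n_threads):
    # Serialize tensors concurrently, but consume results in input order so the
    # output file is byte-identical regardless of the thread count chosen.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for name, blob in pool.map(serialize_tensor, tensors):
            out_file.write(blob)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--threads", type=int, default=4)
    args = parser.parse_args()
    # write_all(tensors, open("model.gguf", "wb"), args.threads)  # hypothetical usage
```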
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 58
Key Closed Pull Requests
1. DeepSeek V2/V3 with `-mla` option (final): This pull request involves the implementation of DeepSeek V2/V3 with a `-mla` option, addressing issues related to tensor duplication by maintaining `attn_k_b_trans` and `attn_v_b` as separate tensors, and includes various code fixes, optimizations, and renaming for clarity, although the author has decided to step back after extensive testing and development efforts.
- URL: pull/12772
- Merged: No
- Associated Commits: b4c16, 10207, ea3c0, 1f604, 1de07, 7f92e, 319e3, ee4b3, c00cd, 55ad3, 0c86f, b0c8a, 8c329, 68302, 937a4, 1fd0a, 4fb43, f9a0e, 5fe40, 9b862, 8e23e, b3840, 5dbf9, 01a61, c0ffe, 8d12c, 997a4
2. llama : Support llama 4 text-only: This pull request introduces support for the Llama 4 text-only model, specifically targeting Llama-4-Scout-17B-16E-Instruct, by adding a new `llama4` architecture with modifications to the existing computational graph and attention scaling, while also addressing design considerations for future-proofing and performance improvements, as evidenced by the detailed performance metrics and a series of commits covering model conversion, tokenizer fixes, and attention mask adjustments.
- URL: pull/12791
- Merged: 2025-04-07T21:06:45Z
- Associated Commits: 79ebe, b19db, f6d8e, 1fb18, 869d7, 6ceae, 7cfc2, edbaa, a518c, 46fe5, ab91a, f9c78, e4012, 2a9b2, ee06e, f8f1b, e6a28, af196, 09eba, b28cd, d3e67
3. convert : ability to lazy-load safetensors remotely without downloading to disk: This pull request introduces the capability to lazy-load safetensors files remotely without downloading them to disk by adding a `SafetensorRemote` class, which facilitates reading tensor data directly from a remote source (a rough sketch of the idea follows below), and includes multiple commits that address style fixes, add multithreading support, and implement a writable buffer for remote lazy tensors.
- URL: pull/12820
- Merged: 2025-04-10T15:24:45Z
- Associated Commits: 2507c, 7f61d, 08ecb, 3a368, df95a, 4f657, b584e, 78094, 4c017, 42fc8, 63f06, c8760, 2e535, e8b7d
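A very rough sketch of the remote lazy-loading idea behind that pull request, using HTTP Range requests against the safetensors layout (illustrative only; `fetch_metadata` and `read_tensor_bytes` are hypothetical helpers, not the actual `SafetensorRemote` implementation):

```python
import json
import struct
import requests

def _range_get(url, start, end):
    # Fetch bytes [start, end] (inclusive) without downloading the whole file.
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    r.raise_for_status()
    return r.content

def fetch_metadata(url):
    # safetensors layout: 8-byte little-endian header size, then a JSON header.
    header_len = struct.unpack("<Q", _range_get(url, 0, 7))[0]
    header = json.loads(_range_get(url, 8, 8 + header_len - 1))
    return header_len, header

def read_tensor_bytes(url, name):
    header_len, header = fetch_metadata(url)
    begin, end = header[name]["data_offsets"]  # offsets relative to the data section
    data_start = 8 + header_len
    return _range_get(url, data_start + begin, data_start + end - 1)
```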
Other Closed Pull Requests
- Multimodal Library Introduction: This pull request introduces a new library called `mtmd` designed to provide a unified vision API that supports multimodal inputs such as text, images, and audio. It aims to streamline the integration of new models by eliminating the need for separate command-line interfaces for each model, and includes features like C++ and C-style APIs, tokenization of input prompts with markers, and encoding/decoding of image and audio data.
- GLM-4-0414 Model Implementation: This pull request introduces the implementation of the `Glm4Model` for GLM-4-0414, a multilingual, multitask autoregressive language model developed by Zhipu AI. It includes several commits addressing bug fixes, formatting, and integration with the master branch.
- Clip Image Batch Refactor: This pull request introduces a breaking change to the `clip.cpp` file by converting the `struct clip_image_f32_batch` into an opaque pointer. It replaces direct data access with specific API calls and provides guidance on migrating existing code to accommodate the new smart pointer usage.
- Llama-Perplexity Tool Enhancement: This pull request introduces an option to the llama-perplexity tool that allows users to ignore context window overflow errors during the calculation of scores for various benchmarks. It also logs the offending tasks for further analysis.
- Windows ARM64 Support: This pull request aims to add support for Windows ARM64 by introducing CMake presets and toolchain files. It enhances logging checks for Clang, enables optimized builds with MSVC and LLVM, and improves the matmul-int8 implementations by fixing typos and removing unnecessary casts.
- Web UI Textarea Enhancement: This pull request introduces an enhancement to the chat input `textarea` in the web UI by implementing an auto-sizing mechanism. It dynamically adjusts the height of the `textarea` to fit the content as the user types, improving the user experience for composing multi-line messages.
- CANN Backend Operator Optimization: This pull request optimizes the `CONV_TRANSPOSE_1D` and `ELU` operators in the CANN backend using the aclnn acceleration library. It ensures successful testing on the Ascend910B3 device and passes all tests across both CANN and CPU backends.
- SYCL Unary Operation Kernel Support: This pull request adds support for the fp16 data type to unary operation kernels in a SYCL context. It includes multiple commits for refining casting operations and adjusting device support checks.
- GGML Component Synchronization: This pull request involves synchronizing the ggml component of the llama.cpp project, including fixing a script and adding a more generic custom operation. It also introduces bilinear upscale support and addresses a compatibility issue with CUDA 12 and ARM Neon.
- ModelScope Community Support: This pull request adds support for downloading and using gguf models from the ModelScope community across Linux, Mac, and Windows platforms. It includes various commits for downloading support, login functionality, code fixes, and platform-specific adjustments.
- CPU Operations Refactor: This pull request involves refactoring the CPU operations in the ggml project by moving most operators into a separate C++ file. It addresses warnings, makes improvements to the SIMD mappings and vectorized operations, and simplifies Arm fp16 CPU logic.
- SYCL Backend Memory Copy Reversion: This pull request reverts a previous change that removed a redundant memory copy operation in the function `ggml_backend_sycl_buffer_set_tensor` within the SYCL backend. It includes additional updates to the `ggml-sycl.cpp` file.
- CANN Backend Operator Optimization: This pull request optimizes the `LOG`, `MEAN`, `PAD_REFLECT_1D`, `COUNT_EQUAL`, `STEP`, and `SGN` operators in the CANN backend using the aclnn acceleration library. It ensures successful testing on the Ascend910B3 device with all tests passing for the CANN0 backend.
- s390x Platform Compilation Fix: This pull request addresses and resolves a compilation error specific to the s390x platform that was introduced in a previous commit. It includes multiple commits for fixing the error, adding documentation, and refactoring code with type-casting.
- Qwen3 Architecture Support: This pull request aims to introduce support for the Qwen3 architecture in the llama.cpp project. It remains unmerged while awaiting the availability of model files.
- Android Build CI Issue Resolution: This pull request aims to resolve a persistent continuous integration issue affecting the Android build of the ggml-org/llama.cpp project. It has been verified through a specific GitHub Actions run before submission.
- FA Branch Handling Fix: This pull request addresses an issue in the llama.cpp project by fixing the handling of the FA branch when the KV cache is not used. It adds CPU FA support for F32 V-type and introduces a server test for embeddings with FA enabled.
- Qwen3 Model Support: This pull request adds support for the upcoming Qwen3 and Qwen3MoE models to the project. It was successfully merged on April 9, 2025.
- Llama4 Tensor Name Mapping Fix: This pull request addresses the issue of improper tensor name mapping for Llama4 by removing a hacky renaming method in the Python code. It has been successfully tested with random weights.
- Segmentation Fault Fix in Clip.cpp: This pull request addresses a segmentation fault issue in the `clip.cpp` file of the `llava` project by properly managing and using backend `unique_ptr`s. It ensures that the `backend_cpu` is correctly moved and checked to prevent null pointer access.
- BF16 KV-Cache Copy Operation: This pull request introduces a feature that adds a copy operation from f32 to bf16 to enable BF16 KV-cache on CUDA. It was moved from ggml-org/ggml#1182 to avoid synchronization conflicts.
- Lazy Tensor Splitting Support: This pull request introduces support for lazy tensor splitting in the gguf-py module to optimize RAM usage during the conversion of Llama4 models. It includes testing to ensure consistent output and compliance with flake8 linting standards.
- Include Path Clashes Resolution: This pull request addresses the issue of potential include path clashes by detaching the 'common' component from the 'llama' library. It also fixes a compilation problem in the chat template test by adding the 'u8' prefix to strings.
- POWER Architecture Compilation Fix: This pull request addresses a compilation issue on the POWER architecture, specifically for ppcle64 and AIX, caused by a recent redesign of ggml-cpu. It includes necessary changes such as adding a limits file on AIX.
- RPC Backend Documentation Update: This pull request involves adding an RPC backend to the project's README. It also includes a subsequent commit to fix a typo.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
ngxson | 147 | 13 | 0 | 90 |
ggerganov | 92 | 7 | 2 | 54 |
zhouwg | 90 | 2 | 1 | 0 |
BradHutchings | 79 | 0 | 0 | 0 |
ochafik | 50 | 3 | 0 | 1 |
CISC | 34 | 2 | 0 | 16 |
jukofyork | 41 | 2 | 1 | 1 |
0cc4m | 12 | 1 | 0 | 31 |
No author found | 43 | 0 | 0 | 0 |
bandoti | 34 | 1 | 1 | 7 |