Weekly GitHub Report for Llama.cpp: April 07, 2025 - April 14, 2025 (12:06:53)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
Release notes for the version published on March 29, 2025 are not included in the available data, so specific highlights and trends cannot be summarized here.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: ggml_vulkan: Device memory allocation of size N failed with ub > 4096 and c > 4096 and b > 4096: This issue involves a bug in the GGML Vulkan backend where attempting to allocate device memory for running llama-server with specific parameters results in a crash due to exceeding the device memory allocation limit. The problem occurs when using the parameters `-ub 8192 -b 8192 -c 8192`, leading to a failed memory allocation, while reducing any of these parameters to 4096 allows the model to load successfully.
  - The comments discuss potential reasons for the memory allocation issue, suggesting that the unusually high `-ub` value might be causing the problem. Users share workarounds, such as reducing `-ub` or the context size, and discuss the requirements of embedding models for non-causal attention. The conversation also touches on the competition for context in concurrent tasks and the limitations of Vulkan on AMD hardware, with suggestions for parameter adjustments to avoid crashes.
  - Number of comments this week: 12
- Misc. bug: The model's reasoning performance has significantly decreased despite using different versions of the same model architecture, identical parameters, and the same set of questions.: This issue highlights a significant decrease in the reasoning performance of the llama.cpp model across different versions, despite using identical parameters and the same set of questions. The problem is particularly evident when comparing versions b4756 and b4759, where the latter shows a marked degradation in performance, consuming more tokens and failing to solve tasks efficiently.
  - The comments discuss potential issues with the flash attention update in version b4759, with users conducting tests on different hardware setups, including RTX 4090 GPUs with CUDA. They note inconsistent performance across versions when flash attention is enabled, suggesting a possible precision issue. Some users report that disabling flash attention results in consistent performance, while others provide perplexity results indicating minor changes across versions. There is a consensus that small backend changes might be exposing model instability, particularly with the QwQ 32b model.
  - Number of comments this week: 9
- Misc. bug: Something recently has broken the `-ot` option to override model tensor buffers - causes CUDA crash: This issue reports a bug in the `llama.cpp` project where the `-ot` option to override model tensor buffers is causing a CUDA crash, specifically an "unspecified launch failure" error when using certain NVIDIA GPUs. The problem seems to be related to recent changes in the codebase, and the issue persists across different model versions, indicating a broader compatibility issue with the CUDA setup.
  - The comments discuss potential workarounds and fixes, including disabling CUDA graphs, which resolves the issue temporarily. A contributor mentions being on vacation but plans to address the problem soon, while another contributor identifies the root cause related to the `ggml_cuda_cpy` mechanism and proposes a fix to disable CUDA graphs for specific node types.
  - Number of comments this week: 6
- Misc. bug: LLaVa Projector Possibly Wrong Tensor Order: This issue concerns a potential bug in the LLaVa Projector of the llama.cpp project, where the tensor order for `v.patch_embd.weight` might be incorrect when using the `convert_image_encoder_to_gguf.py` script, as it appears to be reversed compared to other tensors. The user suspects that this discrepancy in tensor shape could affect performance, although it does not seem to impact the output directly.
  - The comments discuss whether the tensor shape is indeed incorrect, with one user suggesting that the tensor is written to disk in the wrong order, while another clarifies that the discrepancy is simply the difference between row-major and column-major ordering (see the short sketch after this list). Further clarification is provided that the shape is correct and does not affect performance, as a genuinely incorrect shape would prevent computation altogether.
  - Number of comments this week: 6
- Eval bug: MUSA backend cause non-sense output on unsloth/deepseek-r1 quantized model: This issue reports a bug in the MUSA backend that causes nonsensical output when using the unsloth/deepseek-r1 quantized model on a system with multiple MTT S4000 devices. The problem is characterized by the model outputting repeated "DDDDD" characters, and the GPUs showing inconsistent utilization patterns during execution.
  - The comments discuss GPU utilization issues and request additional system information, including kernel and driver versions. The user provides the requested details, and a recommendation is made to update to MUSA SDK rc3.1.1. The user confirms the issue is related to multi-GPU setups and the MUSA SDK version, noting that official support for the S4000 is lacking, and they are awaiting a driver update.
  - Number of comments this week: 5
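The row-major versus column-major point raised in the LLaVa Projector discussion above can be illustrated directly. The following sketch uses plain NumPy (not llama.cpp or gguf code) to show that listing a tensor's dimensions in the opposite order describes exactly the same bytes, which is why a "reversed" shape in the converter's output is not by itself a sign of corrupted data.

```python
# Illustration only (NumPy, not llama.cpp/gguf code): the same contiguous
# buffer can be described with its dimensions listed in either order,
# depending on whether a row-major or column-major convention is used.
import numpy as np

w = np.arange(12, dtype=np.float32).reshape(3, 4)   # row-major shape: (3, 4)
reversed_shape = w.shape[::-1]                       # column-major listing: (4, 3)

# Re-reading the raw bytes under the column-major convention simply yields
# the transposed view; the underlying data is untouched.
w_col = np.frombuffer(w.tobytes(), dtype=np.float32).reshape(reversed_shape, order="F")
assert np.array_equal(w, w_col.T)
print(w.shape, reversed_shape)                       # (3, 4) (4, 3)
```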
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows a GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is displaying a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Feature Request: Task Cancellation on Client Disconnection: This issue is a feature request for implementing task cancellation in the embedding server setup when a client disconnects, as currently, tasks continue processing even after a client cancels a request, leading to inefficiencies and potential server overload. The requester highlights the need for the server to terminate task processing upon request cancellation to prevent delays in subsequent requests and avoid server paralysis when a client makes numerous requests and then disconnects.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference. The user is working on improving the Metal backend for a project and is looking for documented methods or known practices to obtain this type of debugger output, similar to what is available in the Xcode Metal Debugger.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progress indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process (a brief illustration follows this list).
- Eval bug: [/SOLUTION] visible in granite 8B: This issue pertains to a bug in the evaluation process of the "granite-3.1-8b-instruct-Q6_K.gguf" model, where the output unexpectedly includes the text "[/SOLUTION]" when interacting with the web UI interface. The problem is observed on a system running Linux with an NVIDIA GeForce RTX 3090 GPU, and it has been open for over 30 days without a response identifying the first bad commit.
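For context on the `CURLOPT_NOPROGRESS` item above: in libcurl, clearing that option enables per-transfer progress reporting through a callback. The sketch below uses Python's pycurl bindings purely to illustrate the option's semantics; llama.cpp's downloader is C++ code using libcurl directly, and the labelled-callback approach shown here is just one hypothetical way to keep parallel progress updates from colliding.

```python
# Illustration only (pycurl, not llama.cpp's C++ downloader): clearing
# NOPROGRESS enables the transfer-info callback, and giving each transfer its
# own labelled callback keeps parallel progress lines distinguishable.
import pycurl

def make_progress_cb(label):
    def cb(download_total, downloaded, upload_total, uploaded):
        if download_total > 0:
            print(f"{label}: {downloaded / download_total:6.1%}")
        return 0  # returning non-zero would abort the transfer
    return cb

def fetch(url, path, label):
    with open(path, "wb") as f:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.NOPROGRESS, 0)   # 0 = enable progress reporting
        c.setopt(pycurl.XFERINFOFUNCTION, make_progress_cb(label))
        c.perform()
        c.close()
```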
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 33
Summarized Issues:
- Feature Requests for llama-server and Web UI: This topic covers requests for enhancing the llama-server and its web UI to support image input, which would enable compatibility with OpenAI frontends like open-webui. This enhancement aims to reduce the need for additional servers to handle multi-modal input with models such as gemma3.
- CUDA and Compilation Errors: Several issues involve errors related to CUDA and compilation, such as a crash caused by the `-ot` option due to an unspecified launch failure, and a compilation error on NVIDIA AGX Orin due to undefined identifiers. These issues highlight the need for specific build configurations and flags to resolve these errors.
- Docker and Metric Naming Conventions: An issue with the current Docker version of the llama.cpp server involves the use of colons in metric names, which violates Prometheus naming conventions. This requires replacing colons with underscores to adhere to the expected format (a short example follows this list).
- Performance and Compatibility Issues: Various issues discuss unexpected performance results and compatibility problems, such as decreased reasoning performance with different model versions and unexpected results on specific hardware. These issues necessitate further investigation and testing with different settings and configurations.
- Memory Allocation and Parameter Issues: Memory allocation failures are reported in the Vulkan backend when running models with high parameter values, leading to crashes on AMD hardware. Reducing parameter values can mitigate these issues, highlighting the need for careful resource management.
- Tokenizer and Conversion Script Errors: Several issues involve errors in conversion scripts, such as tokenizer mismatches and missing attributes, which prevent successful model conversion and execution. These errors indicate the need for script updates and environment checks.
- Server and Execution Errors: Issues with server crashes and execution errors are reported, such as assertion failures due to parameter mismatches and segmentation faults on older hardware. These issues suggest the need for parameter checks and hardware compatibility assessments.
- Quantization and Model Loading Bugs: Bugs in the quantization process and model loading are highlighted, such as calibration issues with MoE layers and missing hyperparameters. These issues affect model efficiency and loading success, requiring adjustments in the quantization methods and parameter settings.
- Evaluation and Benchmarking Issues: Problems in evaluation and benchmarking tools are reported, such as incorrect tensor split values and errors in the `llama-perplexity` tool. These issues necessitate tool updates and configuration adjustments to ensure accurate performance assessments.
- Miscellaneous Bugs and Errors: Various other bugs and errors are reported, including issues with the `llama-cli` tool, compilation failures with Vulkan support, and potential bugs in the LLaVa Projector. These issues require specific fixes and further investigation to resolve.
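Regarding the Prometheus naming issue above: colons are syntactically legal in Prometheus metric names but are, by convention, reserved for recording rules, so exporter-emitted names are expected to use underscores. The snippet below is an illustrative Python sanitizer, not the server's actual C++ code, showing the intended renaming.

```python
# Illustration only: map exporter metric names onto Prometheus conventions by
# replacing colons (reserved for recording rules) and any other disallowed
# characters with underscores.
import re

def sanitize_metric_name(name: str) -> str:
    name = name.replace(":", "_")
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

# e.g. a name of the form "llamacpp:prompt_tokens_total" becomes
# "llamacpp_prompt_tokens_total"
print(sanitize_metric_name("llamacpp:prompt_tokens_total"))
```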
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 13
Summarized Issues:
- GPU Memory Usage and KV Offload: Users experienced excessive GPU memory usage when the 'kv buffer' was located on the GPU instead of the CPU. The issue was resolved by disabling KV offload, which alleviated the memory strain.
- Compilation Warnings and Errors: Several issues were reported regarding compilation warnings and errors across different platforms and compilers. These included warnings from Clang due to casting issues, a missing 'curl/curl.h' file in Visual Studio, and build failures on Power and s390x architectures due to macro and intrinsic function errors.
- Performance Degradation in RDNA4 Prefill: A significant performance degradation was reported in the RDNA4 prefill process when using the CUBLAS_COMPUTE_32F setting. The performance was almost halved compared to previous configurations, prompting inquiries about known issues and potential fixes.
- Feature Request for Llama 4 Integration: A feature request was made for integrating Llama 4, a newly released multimodal large language model by Meta, into the project. The request highlighted Llama 4's advanced capabilities in multimodality and improved processing efficiency.
- Bugs in Multimodal Inference and Benchmarking: Bugs were reported in the `llama-gemma3-cli.exe` and `llama-bench` tools, where the former failed to generate output during multimodal inference and the latter incorrectly defaulted to using the GPU or NPU backend. Both issues were traced to specific code and configuration problems.
- Crashes and Errors in Model Execution: Users reported crashes and errors when running models, such as the `llama-tts` tool crashing with certain OuteTTS versions and the server crashing with an unsupported K-type error. These issues were linked to specific software and hardware configurations.
- Quantization and Token Map Issues: Problems were encountered during the quantization process and with token maps, such as an unrecognized model architecture labeled 'granite' and the absence of a specific token in converted models. These issues were due to design and conversion decisions.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 31
Key Open Pull Requests
1. Llama-3_1-Nemotron-Ultra-253B-v1 support: This pull request introduces support for the new Deci model, Llama-3_1-Nemotron-Ultra-253B-v1, by modifying the `convert_hf_to_gguf.py` script and `src/llama-model.*` files to accommodate a new type of layer called a "dummy layer," which lacks attention and feed-forward network components, and requests community assistance to test the changes on the 253B model due to resource constraints.
- URL: pull/12843
- Merged: No
- Associated Commits: ecad9, 12ade, 643e5, e68c7, 6a480, f9a1c, c1736, 984ff, 909a7, 4dad2, cc615, 0ac08, 80af2, 2a260, f8f67, 1600d, bd3d4
2. DeepSeek V2/V3 MLA implementation: This pull request introduces the implementation of DeepSeek V2/V3 MLA, ensuring backward compatibility with legacy non-MLA GGUF files while adding context-shifting capabilities and optimized paths for MQA models, requiring new GGUF files with specific metadata and involving changes primarily in `llm_graph_context::build_attn_mha()` to support MLA, with a focus on maintaining clean and maintainable code.
- URL: pull/12801
- Merged: No
- Associated Commits: 8c024, ddab5, 76125, fed66, c4494, 2a4e1, 77fe5, e2153, 815f4, 57788, 77ad5, 5d037, 638b0
3. metal : add memory pool for temp allocs: This pull request introduces a memory pool mechanism for temporary allocations in the Metal backend of the project, aiming to efficiently manage intermediate buffers for operations like convolution and data rearrangement, similar to the existing CUDA backend functionality, with tasks including creating and managing `MTLHeap` objects, dynamically resizing heaps, and ensuring memory reuse and leak prevention.
- URL: pull/12850
- Merged: No
Other Open Pull Requests
- OpenCL and GPU Compatibility Enhancements: This topic covers improvements to the OpenCL backend and GPU compatibility, including splitting the `ggml-opencl.cl` file for better compatibility with older Adreno GPUs and fixing crash issues related to kernel launch failures and sub-buffer creation. These changes ensure that the OpenCL backend can run on newer compilers and improve stability on devices with specific limitations.
- Performance Optimizations: Several pull requests focus on enhancing performance, such as optimizing the ROPE operator for a 10% inference speedup and implementing an AVX512 GEMM function for significant prompt processing improvements. These optimizations reduce unnecessary operations and leverage advanced instruction sets to boost efficiency.
- SYCL and Intel GPU Enhancements: Enhancements to the SYCL backend include implementing a reordered Q4_0 MMVQ for Intel GPUs and introducing a new ROPE vision kernel for Vision Transformers. These updates improve text generation performance and support Vision-Language Models with successful backend operations tests.
- New Features and Functionalities: New features include a "Clear All Conversations" button in the llama-server web UI and a feature allowing llama-tts to read input from stdin. These additions enhance user interaction by providing more control over chat history and input methods.
- Memory and Task Management Improvements: Improvements in memory and task management include introducing asynchronous task submission in the CANN component and a new method for memory allocation from the CANN memory pool. These changes enhance functionality and allow users to select allocation methods via environment variables.
- Vision and Image Processing Enhancements: Enhancements in vision and image processing include experimental vision support using `libmtmd` and new methods for accessing `mtmd_image_tokens`. These updates aim to integrate vision models and improve image preprocessing capabilities.
- CI and Build Process Improvements: Improvements to the CI and build process address synchronization issues during Linux cross-compile CI setup and enhance the llama-bench tool for more accurate performance metrics. These changes reduce build failures and provide granular throughput measurements.
- Regex and String Handling Enhancements: The introduction of `common_regex` provides partial regex support by implementing pattern reversal for string matching. This update includes moving certain string functions to a common area and updating tests for partial matches.
- Date Handling and Template Injection: Enhancements in date handling involve injecting a `date_string` into the llama 3.x template and fixing date handling for firefunction v2. These changes ensure correct date appearance in prompts and include end-to-end tests for verification.
- Tensor Writing and Conversion Speed: Parallel writing of tensors using a thread pool in the `GGUFWriter.write_tensors_to_file` function enhances conversion speed. A new `--threads` argument in the `convert_hf_to_gguf.py` script allows specifying the number of threads for multithreaded operations (see the sketch after this list).
- TTS and Audio Processing Enhancements: Support for OuteTTS 1.0 includes features like JSON speaker loading and text chunking for long inputs. These updates reorganize code and address compatibility issues with newer PyTorch versions.
- Byteswapping and Tensor Shape Restoration: Improvements in byteswapping functionality for the Q4_0 format include restoring original tensor shapes post-byteswap. These changes reorganize the byteswapping code to separate details from tensor blocks.
- API and Interface Enhancements: A minimal implementation of a simple API focuses on basic functionalities like chat, inference, and grammar. This update aims to streamline interactions with the llama.cpp project.
- Logging and Profiling Corrections: Corrections in OpenCL profiling address a logging issue by ensuring accurate output of local_size dimensions. This update corrects the previous omission of local_size[1] in logs.
- CUDA and Node Type Support: Addressing issue #12798, CUDA graphs are disabled for unsupported DUP and CONT node types. This change ensures compatibility and stability in the ggml project.
- SSE and x64 Support: A new SSE 4.2 and x64 base variant supports CPUs without AVX capabilities. This update addresses issue #12866 and expands compatibility for older hardware.
- Key-Value Cache Management: The creation of distinct classes for handling key-value cache implementations separates recurrent and non-recurrent types. This update seeks guidance for effective implementation within the llama.cpp project.
- Vision Kernel and Image Processing: A necessary fix for the im2col function in the SYCL implementation ensures compatibility with Gemma 3 vision. This update includes adjustments for handling large inputs.
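The parallel tensor-writing item above follows a common pattern: run the expensive per-tensor preparation on worker threads, then write the results from a single thread so file offsets stay deterministic. The sketch below is a simplified illustration with hypothetical helper names, not the actual GGUFWriter code.

```python
# Simplified illustration (hypothetical helpers, not the gguf-py API):
# prepare tensor payloads on a thread pool, then write them in input order
# from a single writer so the on-disk layout remains deterministic.
from concurrent.futures import ThreadPoolExecutor

def write_tensors_parallel(out_file, tensors, num_threads=4):
    def prepare(tensor):
        # stand-in for quantization/byteswapping/dtype conversion work
        return tensor.tobytes()

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order even though work runs concurrently
        for payload in pool.map(prepare, tensors):
            out_file.write(payload)
```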
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 62
Key Closed Pull Requests
1. DeepSeek V2/V3 with `-mla` option (final): This pull request involves the implementation of DeepSeek V2/V3 with a `-mla` option, addressing issues related to tensor separation for `attn_k_b_trans` and `attn_v_b`, and includes various code fixes, optimizations, and renaming for clarity, although it remains unmerged due to unresolved challenges and the author's decision to step back after extensive efforts.
- URL: pull/12772
- Merged: No
- Associated Commits: b4c16, 10207, ea3c0, 1f604, 1de07, 7f92e, 319e3, ee4b3, c00cd, 55ad3, 0c86f, b0c8a, 8c329, 68302, 937a4, 1fd0a, 4fb43, f9a0e, 5fe40, 9b862, 8e23e, b3840, 5dbf9, 01a61, c0ffe, 8d12c, 997a4
2. llama : Support llama 4 text-only: This pull request introduces support for the Llama 4 text-only model, specifically targeting the Llama-4-Scout-17B-16E-Instruct, by adding a new architecture to accommodate updated hyperparameters and implementing various modifications such as attention scaling and chunked masks, while also ensuring compatibility with existing structures to prevent breaking changes.
- URL: pull/12791
- Merged: 2025-04-07T21:06:45Z
- Associated Commits: 79ebe, b19db, f6d8e, 1fb18, 869d7, 6ceae, 7cfc2, edbaa, a518c, 46fe5, ab91a, f9c78, e4012, 2a9b2, ee06e, f8f1b, e6a28, af196, 09eba, b28cd, d3e67
3. cmake : enable curl by default: This pull request enables the use of `curl` by default in the `llama.cpp` project, reflecting its integral role in the user experience of the project's examples, and addresses various build and configuration issues across different platforms to ensure compatibility and functionality.
- URL: pull/12761
- Merged: 2025-04-07T11:35:20Z
- Associated Commits: 6080f, 64557, 2cc89, 79307, 2238e, 707f2, 79509, 21c42, 9bf42, a8a7e, a9637, 04edd, 64faa, 1c1c2
Other Closed Pull Requests
- Lazy-loading Safetensors Remotely: This topic introduces the `SafetensorRemote` class, which allows safetensors to be lazy-loaded from a remote source without downloading them to disk. The pull request includes enhancements for functionality, style fixes, and multithreaded download support, although the latter was reverted.
- Multimodal Library `mtmd`: The `mtmd` library provides a unified vision API supporting multimodal inputs like text, images, and audio. It aims to streamline model integration by eliminating separate command-line interfaces and includes features like C++ and C-style APIs, tokenization, and encoding/decoding capabilities.
- GLM-4-0414 Model Implementation: This pull request implements the `Glm4Model` for the GLM-4-0414, a multilingual, multitask autoregressive language model. It includes bug fixes, formatting, and integration with the master branch.
- Breaking Change in `clip.cpp`: A breaking change in `clip.cpp` converts `struct clip_image_f32_batch` into an opaque pointer, replacing direct data access with API calls. Guidance is provided for migrating existing code that initializes `clip_image_f32_batch` on the heap.
- Llama-perplexity Tool Enhancement: An option is added to the llama-perplexity tool to ignore context window overflow errors during score calculations for various tasks. This ensures offending tasks are logged, allowing score computation to continue without interruption.
- Windows ARM64 Support: Support for Windows ARM64 is introduced with CMake presets and toolchain files, along with enhanced logging checks for Clang. The pull request also enables optimized builds with MSVC and LLVM and improves matmul-int8 implementations.
- Web UI Chat Input Enhancement: The chat input `textarea` in the web UI is enhanced with an auto-sizing mechanism that adjusts its height to fit content. This provides a more intuitive user experience by preventing indefinite growth and disabling the native resize handle.
- CANN Backend Operator Optimization: The `CONV_TRANSPOSE_1D` and `ELU` operators are optimized in the CANN backend using the aclnn library. Testing on the Ascend910B3 device ensures successful operation across CANN and CPU backends.
- SYCL Unary Operation Kernels: Support for the fp16 data type in SYCL unary operation kernels is introduced, with considerations for hardware lacking fp16 support. The pull request includes code improvements and adjustments for proper functionality and testing.
- GGML Component Synchronization: Synchronization of the ggml component includes fixing a synchronization script, adding a generic custom operation, and introducing bilinear upscale support. It also addresses compatibility issues with CUDA 12 and ARM Neon.
- GGUF Model Download Support: Support for downloading gguf models from the ModelScope community is introduced, with successful testing on multiple platforms. The pull request includes download support, login functionality, and platform-specific adjustments.
- CPU Operations Refactoring: CPU operations in the ggml project are refactored by moving operators into a separate C++ file. This addresses warnings, simplifies Arm fp16 CPU logic, and reintroduces CUDA/MUSA checks.
- SYCL Backend Memory Copy Reversion: A previous change removing a redundant memory copy in the SYCL backend is reverted. The pull request includes updates to `ggml-sycl.cpp` and minor formatting adjustments.
- CANN Backend Operator Optimization: The `LOG`, `MEAN`, `PAD_REFLECT_1D`, `COUNT_EQUAL`, `STEP`, and `SGN` operators are optimized in the CANN backend using the aclnn library. All tests pass successfully on the Ascend910B3 device.
- s390x Platform Compilation Error: A compilation error specific to the s390x platform is resolved, with contributions from IBM. The pull request includes documentation updates and fixes the error introduced in a previous commit.
- Qwen3 Architecture Support: Initial support for the Qwen3 architecture is introduced, based on a related pull request from the huggingface/transformers repository. The pull request remains unmerged pending model file availability.
- Android Build CI Issue: A continuous integration issue affecting the Android build is resolved. The pull request is verified through a specific GitHub Actions run before submission.
- FA Branch Tensor Casting Issue: An issue where the FA branch did not cast K and V tensors to F16 format is fixed. The pull request includes support for F32 V-type in CPU FA and a server test for embeddings with FA enabled.
- Qwen3 and Qwen3MoE Model Support: Support for the Qwen3 and Qwen3MoE models is added, with successful merging on April 9, 2025. The pull request details the commits involved in the addition.
- Llama4 Tensor Name Mapping: Proper tensor name mapping for Llama4 is addressed by removing a hacky renaming method. Testing confirms functionality using random weights, with acknowledgment to a contributor.
- Segmentation Fault in `clip.cpp`: A segmentation fault issue in `clip.cpp` is addressed by managing backend `unique_ptr`s properly. This ensures the `backend_cpu` is correctly moved and checked to prevent null pointer access.
- BF16 KV-cache on CUDA: A float32 to bfloat16 copy operation is added to enable BF16 KV-cache on CUDA. The feature is inspired by the `ik_llama.cpp` repository and moved to avoid synchronization conflicts.
- Lazy Tensor Splitting in gguf-py: Support for lazy tensor splitting in the gguf-py module is introduced to optimize RAM usage during Llama4 conversion. Testing ensures consistent output and improved performance (a rough illustration of the idea follows this list).
- Include Path Clashes Resolution: Potential include path clashes are resolved by detaching the 'common' component from the 'llama' library. The pull request also fixes a compilation problem related to the 'u8' prefix.
- POWER Architecture Compilation Issue: A compilation issue on the POWER architecture is addressed, with necessary changes for ppcle64 and AIX systems. The pull request is successfully merged on April 9, 2025.
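For the lazy tensor splitting item above, the essential idea is to produce and write one slice of a large tensor at a time instead of materializing every split up front, so peak RAM stays close to the size of a single slice. The following is a rough, generic sketch of that pattern, not the gguf-py implementation.

```python
# Rough sketch of lazy splitting (generic NumPy, not the gguf-py code):
# yield a view of one slice at a time so the caller can process and write
# each piece before touching the next.
import numpy as np

def lazy_split(tensor, parts, axis=0):
    step = -(-tensor.shape[axis] // parts)          # ceiling division
    for start in range(0, tensor.shape[axis], step):
        idx = [slice(None)] * tensor.ndim
        idx[axis] = slice(start, start + step)
        yield tensor[tuple(idx)]                    # a view; nothing is copied here

big = np.zeros((8, 4096), dtype=np.float32)
for i, piece in enumerate(lazy_split(big, 4)):
    print(i, piece.shape)                           # each piece is (2, 4096)
```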
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ngxson | 150 | 13 | 0 | 91 |
| ggerganov | 93 | 7 | 2 | 54 |
| zhouwg | 90 | 2 | 1 | 0 |
| BradHutchings | 79 | 0 | 0 | 0 |
| ochafik | 57 | 3 | 0 | 1 |
| CISC | 35 | 2 | 0 | 16 |
| jukofyork | 41 | 2 | 1 | 1 |
| 0cc4m | 12 | 1 | 0 | 31 |
| No author found | 43 | 0 | 0 | 0 |
| bandoti | 34 | 1 | 1 | 7 |