Weekly GitHub Report for Llama.cpp: May 05, 2025 - May 12, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
This version was released on March 29, 2025; no release notes or details about its changes were provided in the source information.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Feature Request: tensor split needs control over where CPU layers go: This issue is a feature request for llama.cpp proposing a command-line switch to control whether CPU layers are loaded first or last when assigning tensor splits, which is crucial for optimizing the performance of hybrid layer quantizations. The motivation is to enable more efficient offloading by loading the smaller layers onto the GPU first, leaving room for more layers or a larger KV cache on the GPU while maintaining high performance.
  - The comments discuss various suggestions and workarounds for controlling layer distribution across devices, including the use of flags like `--override-tensor` and `--tensor-split`, and propose enhancements such as specifying layer storage order and limiting RAM usage per device. There is consensus on the need for more user-friendly solutions, with some humor about the complexity of current methods and the potential for ugly but effective workarounds. (A hypothetical sketch of the requested layer-ordering control appears after this list.)
  - Number of comments this week: 12
- (Discussion) Improve usability of llama-server: This issue discusses improving the usability of the `llama-server` by allowing users to control it entirely via a web UI, which includes functionalities like loading/unloading models and turning off the server. The author proposes three ideas: adding an API for model management, implementing a detach flag for headless operation, and creating a desktop shortcut to enhance user experience.
  - The comments reflect a positive reception to the proposed ideas, with suggestions for error handling, parallelizing requests across GPUs, and dynamically loading models via API. Some users mention existing solutions like `llama-swap` and discuss the feasibility of implementing these features, especially on different operating systems. There is also a discussion about the potential complexity of supporting multiple server instances and the need for user-friendly solutions, particularly for Windows users.
  - Number of comments this week: 12
- Differential mode for llama-bench + plotting code: This issue proposes the addition of a differential mode to the `llama-bench` tool, allowing users to compare outputs more easily by providing separate numbers for each model evaluation in a benchmark run, rather than a single aggregated number. The feature would include a `--differential` flag and plotting capabilities using Matplotlib to visualize the performance data, with the potential for future enhancements like polynomial fitting, though the latter is deemed unnecessary for now.
  - The comments discuss the implementation details and potential improvements for the proposed `--differential` feature, including suggestions for using ranges and step sizes, alternative plotting tools like Mermaid, and the feasibility of running benchmarks with small batch sizes at high depths. There is a consensus on using familiar tools like NumPy and Matplotlib for plotting, and contributors express willingness to work on different aspects of the implementation, such as adding JSONL support.
  - Number of comments this week: 12
- Compile bug: I tried compiling llama.cpp for HIP on my system (elementaryOS 8/ubuntu 24.04, rocm 6.4.0, gfx1100) using the installation guide: This issue involves a user attempting to compile the `llama.cpp` project for HIP on their system using the provided installation guide, but encountering multiple errors during the process. The user is unable to diagnose the problem due to a lack of expertise and has provided partial error logs for further assistance.
  - Several users report similar compilation issues, with one suggesting deleting a specific file to regenerate it, which resolves the initial compilation problem. However, subsequent issues arise when loading models, with errors indicating failure to open or load model files. Another user identifies a mistake in manually installing binaries, which causes models to be unrecognized when run from the PATH, suggesting proper installation as a solution.
  - Number of comments this week: 7
- Misc. bug: The web UI of llama-server is not displaying correctly.: This issue reports a bug in the web UI of the llama-server, where certain buttons are not displaying correctly after a recent pull request, although they remain clickable. The problem seems to be related to a CSS class that sets the opacity of buttons to zero, preventing them from being visible unless manually adjusted.
  - The comments discuss potential solutions, including trying different browsers or disabling conflicting plugins, as the issue might be browser-specific. One user identifies that the problem is related to the `show-on-hover` class, which sets the button's opacity to zero, and suggests a CSS fix to make the buttons visible. Another user suggests adding a CSS rule specifically for Microsoft Edge if the issue is isolated to that browser.
  - Number of comments this week: 6
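To make the layer-placement request in the first issue above concrete, the following is a minimal, hypothetical C++ sketch of assigning each model layer to a GPU or to the CPU while honoring a "CPU layers first or last" switch. The function and parameter names are invented for illustration and do not correspond to llama.cpp's actual implementation.

```cpp
// Hypothetical illustration only - not llama.cpp code.
// Assigns each transformer layer to a GPU (by index) or to the CPU,
// with a switch controlling whether the CPU-resident layers are the
// first or the last layers of the model.
#include <cstdio>
#include <vector>

static std::vector<int> assign_layers(int n_layers,
                                      const std::vector<int> & gpu_layers, // layers per GPU
                                      bool cpu_layers_first) {
    std::vector<int> device_of_layer(n_layers, -1); // -1 == CPU
    int gpu_total = 0;
    for (int n : gpu_layers) {
        gpu_total += n;
    }
    const int n_cpu = n_layers > gpu_total ? n_layers - gpu_total : 0;

    // The CPU block occupies either the front or the back of the layer range;
    // GPU layers fill the remaining slots in device order.
    int layer = cpu_layers_first ? n_cpu : 0;
    for (size_t dev = 0; dev < gpu_layers.size() && layer < n_layers; ++dev) {
        for (int i = 0; i < gpu_layers[dev] && layer < n_layers; ++i) {
            device_of_layer[layer++] = (int) dev;
        }
    }
    return device_of_layer;
}

int main() {
    // 40-layer model, two GPUs taking 24 and 10 layers, the remaining 6 on the CPU.
    const auto placement = assign_layers(40, {24, 10}, /*cpu_layers_first=*/false);
    for (int l = 0; l < (int) placement.size(); ++l) {
        if (placement[l] < 0) {
            std::printf("layer %2d -> CPU\n", l);
        } else {
            std::printf("layer %2d -> GPU%d\n", l, placement[l]);
        }
    }
    return 0;
}
```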
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue involves a problem with the Kompute-based Vulkan backend, which is displaying a GGML_OP_GET_ROWS error. The error does not occur with other Vulkan backends, indicating a specific compatibility or implementation issue with the Kompute-based approach.
- Feature Request: Task Cancellation on Client Disconnection: This issue is a feature request for implementing task cancellation in an embedding server setup when a client disconnects, to prevent queued tasks from continuing to process unnecessarily, which can lead to inefficiencies and potential server overload. The request highlights the need for the server to terminate task processing upon request cancellation, ensuring that new requests can be processed promptly without delay, thereby improving server performance and resource management.
- Question: How to generate an MPS gputrace: This issue is about a user seeking guidance on how to generate a Metal Performance Shaders (MPS) gputrace for the llama.cpp project during model inference, as part of efforts to enhance the Metal backend for a related project. The user is specifically interested in obtaining a debugger output similar to what is provided by the Metal Debugger in Xcode, and is inquiring if there is any documented or known method to achieve this.
- common: download from URL, improve parallel download progress status: This issue addresses the need to improve the progress status display for parallel downloads when retrieving sharded models, as the current implementation causes conflicts in the progression indicators. The proposed solution involves properly implementing the `CURLOPT_NOPROGRESS` option to ensure accurate and non-conflicting progress updates during the download process. (A generic libcurl progress-callback sketch appears after this list.)
- kubernetes example: This issue highlights the need for a Helm chart for the `llama.cpp` server to facilitate its deployment on Kubernetes, which is a popular platform for managing containerized applications at scale. The author has initiated the development of this chart and is seeking community assistance to further progress the project.
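As background for the CURLOPT_NOPROGRESS item above, here is a generic libcurl sketch showing how a per-transfer progress callback is enabled (NOPROGRESS must be cleared for the callback to fire). It illustrates the libcurl API only, not llama.cpp's downloader code; the URL and label are placeholders.

```cpp
// Generic libcurl progress reporting, for illustration only (not llama.cpp's downloader).
// Build with: g++ progress_sketch.cpp -lcurl
#include <cstdio>
#include <curl/curl.h>

// Called repeatedly by libcurl while a transfer is running.
static int progress_cb(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char * label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "\r[%s] %3.0f%%", label, 100.0 * (double) dlnow / (double) dltotal);
    }
    return 0; // returning non-zero aborts the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }

    const char * label = "shard-00001"; // placeholder identifier for this download
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf"); // placeholder URL
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, progress_cb);
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, (void *) label);
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L); // 0 = progress callbacks enabled

    const CURLcode res = curl_easy_perform(curl);
    std::fprintf(stderr, "\nresult: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```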
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 29
Summarized Issues:
- Model Output Issues: This topic covers issues where certain models, such as Qwen3-8B, Mistral, Hermes, and LLaMA, generate repetitive or nonsensical output when using the llama.cpp backend. The problems are potentially due to tokenizer mismatches, prompt template injection errors, or quantization compatibility, while Gemma models do not exhibit this behavior.
- Feature Requests for Model Support and Optimization: Several issues request new features for the llama.cpp project, including support for the moondream2 model and a command-line argument to optimize CPU-GPU layer allocation. These enhancements aim to improve performance and expand model compatibility in memory-constrained environments.
- Compilation and Runtime Errors: Users report various compilation and runtime errors, such as failures on different architectures and environments, including RISC-V and HIP on elementaryOS. These issues often involve undefined references, linking errors, and incorrect operand specifications in assembly code.
- Regex and JSON Handling Bugs: Bugs in regex handling and JSON decoding are reported, causing errors in model processing and remote conversion tasks. These issues highlight the need for improved error handling and compatibility with different data formats.
- Performance and Usability Enhancements: Discussions on improving performance and usability include proposals for new flags and API enhancements. These suggestions aim to streamline user interaction and optimize model evaluation across different context sizes.
- Assertion Failures and Segmentation Faults: Several issues involve assertion failures and segmentation faults, often linked to specific model configurations or execution parameters. These errors indicate underlying problems in model loading and token processing.
- Model Loading and Tokenization Issues: Problems with model loading and tokenization are reported, including slow loading times on Apple Silicon and incorrect token handling in multiturn conversations. These issues affect the efficiency and accuracy of model responses.
- Web UI and CSS Display Bugs: The llama-server's web UI has display issues due to CSS class problems, causing buttons to be invisible in certain browsers. This affects user interaction and requires adjustments to ensure consistent visibility across platforms.
- Quantization and Shader Compilation Errors: Errors in quantization processes and shader compilation are reported, affecting model performance and build success. These issues highlight the need for updates to dependencies and careful handling of quantization parameters.
- Token Generation Speed Decline: A significant decline in token generation speed is observed with GGUF format models on M3 Ultra machines, unlike MLX format models. This performance issue is linked to context length and affects inference speeds.
- False Positive Errors in CI Checks: A false positive error is reported by the CI's editorconfig-checker, incorrectly identifying trailing whitespace in a pull request. This issue suggests a need for improved accuracy in automated code checks.
- Model Loading Failures on Android: The llama.cpp model fails to load on Android using NDK with JNI, returning null from the loading method. This issue persists despite attempts to adjust build configurations and suggests using existing CMake scripts.
- Spurious Token Addition in Responses: A bug in the llama-cli tool causes a spurious token to be added to responses, due to incorrect assumptions in token generation. This affects the accuracy of assistant responses and requires code adjustments.
- Assertion Failures in GGML Library: Assertion failures occur in the GGML library when running specific models, particularly with long prompts or on certain hardware. These issues are linked to recent commits and require investigation to resolve.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 20
Summarized Issues:
- GitHub Actions CI Build Failures: The GitHub Actions CI build process for Intel container images frequently fails due to a "No space left on device" error. This issue prevents the publication of recent images, causing delays in the deployment process.
- Dependency and Compilation Issues: Several issues involve dependency and compilation problems, such as a missing `PySide6` module in the `gguf-dump` CLI tool and incompatible pointer type errors on a Linux system with CUDA support. These issues highlight the need for better dependency management and compatibility checks during the build process.
- Model and Inference Bugs: Various bugs affect model performance and inference, including crashes and incorrect outputs in models like Qwen3 30B A3B Q4_K_M and DeepSeek-R1-UD-Q2_K_XL. These issues often relate to memory limitations and configuration errors, requiring workarounds or specific settings to resolve.
- CUDA and GPU-Related Errors: Several issues involve CUDA and GPU-related errors, such as illegal memory access and launch errors due to large batch sizes. These problems often require adjustments in configuration or code to handle edge cases and prevent crashes.
- Feature Requests and Enhancements: Feature requests include adding a pure C API for mtmd functionality and support for YaRN RoPE scaling in conversion scripts. These enhancements aim to improve usability and compatibility with third-party tools and models.
- Documentation and Usability Improvements: There is a need for improved documentation, such as updating the LLaMA.cpp HTTP Server README with new configuration options. Clearer documentation can help users avoid common pitfalls and better utilize available features.
- Conversion and Tokenization Issues: Problems with model conversion and tokenization, such as unrecognized BPE pre-tokenizers, indicate the need for updates to conversion scripts. These issues can prevent successful model deployment and require attention to ensure compatibility.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 27
Key Open Pull Requests
1. [Perf] [CPU] eliminate redundant memory access in group query attention: This pull request aims to enhance the performance of CPU-based group query attention in modern large language models by eliminating redundant memory access, thereby improving spatial locality and achieving a 25% speedup in decoding, as demonstrated by the provided test results.
- URL: pull/13319
- Merged: No
2. CUDA: update build CTK version to 12.8: This pull request updates the CUDA Toolkit version from 12.4 to 12.8 to enable compilation for the sm120 real architecture used by Blackwell GPUs, as demonstrated by a continuous integration run with the updated configuration file.
- URL: pull/13360
- Merged: No
3. Add --disable-op-offload to improve -ot pp perf in MoE models like llama4 400B: This pull request introduces the `--disable-op-offload` option to enhance performance in MoE models like llama4 400B, addresses issue #13241, and includes llama-bench support for performance tuning, with commits focusing on adding the new option, avoiding negative booleans in the library, and fixing the default value of `ggml_backend_sched_new`.
- URL: pull/13386
- Merged: No
Other Open Pull Requests
- llama-bench Enhancements: The `llama-bench` tool now supports ranges for integer parameters, allowing users to specify sequences of values for benchmarking. This enhancement is demonstrated with examples like `1-5`, `10-100+10`, and `256`, providing more flexibility in performance testing (see the parsing sketch after this list).
- Web UI Configuration Management: Dynamic configuration loading and reset behavior have been implemented for the web UI. This ensures that the application checks for existing configurations in localStorage and fetches defaults from the server if none are found, making the server the single source of truth.
- Helper Functions Refactoring: Helper functions have been moved to a dedicated file to prevent the accidental use of internal mtmd API. This change addresses exceptions related to `mtmd_helper_bitmap_init_from_file` and `mtmd_helper_bitmap_init_from_buf` due to their reliance on internal structures.
- Performance Optimization: The `GGUFReader` has been optimized in read-only mode by utilizing native Python file I/O instead of memmap arrays. This change significantly reduces execution time and memory usage.
- Kernel Simplification: The `bin-bcast` kernel has been simplified by flattening it to improve memory access checks. A special code path for contiguous inputs has been introduced, enhancing performance slightly by around 1 tk/s on some models.
- Server Context Initialization: A constructor has been added to the `server_context` class to initialize `server_context::batch`. This prevents the destructor from calling `llama_batch_free()` and causing an invalid free when `bind()` fails.
- Sampling Support: Smooth Sampling and Quadratic Sampling support have been ported to a refactored sampler structure. This includes additional tests and seeks assistance for testing the server implementation.
- AMD Genoa Support: Support for AMD Genoa has been added to the project. This enhancement is indicated by the commit titled "add AMD Genoa."
- NUMA Optimization: Cross-NUMA memory access penalties in multi-node systems have been addressed by introducing an `mbind` call. This ensures optimal NUMA locality by moving page cache pages to the target node where `llama-bench` is executed.
- Special Token Functionality: A special token function call behavior has been introduced, along with end-of-generation detection logic. Detailed instructions for building, converting model weights, running inference, and testing function calls are provided.
- Mistral-7B Chat Model Preset: A new preset script for the Mistral-7B chat model has been introduced. This enhances the usability of the llama-server by providing a simplified command structure and optimized settings for running the model in chat mode.
- Transformers Library Update: The version of the Transformers library has been updated to address an issue reported by @bartowski1182. This update seeks confirmation on whether it resolves the problem in a Docker environment.
- Reranker Presets: Default reranker presets for the models "bge-reranker-v2-m3" and "jina-reranker-v1-turbo-en" have been introduced. This enhances the reranking capabilities of the project by providing examples and instructions for running a server with these presets.
- SYCL Backend Fixes: Crashes occurring in the SYCL backend when running specific operations on a CUDA backend have been addressed. Fixes for issues related to recording commands and blocking waits have been implemented.
- Typographical Error Corrections: Typographical errors across multiple files in the llama.cpp project have been corrected. This change is indicated by the commit message and the associated commit link.
- MUSA Graph Settings Restoration: The MUSA graph settings in the CMakeLists.txt file have been restored. This ensures compatibility with MUSA architectures and enables CUDA graphs.
- Model Catalog Addition: A model catalog has been added to the project, enhancing the existing "preset" system. This includes a dedicated `catalog.h` file with guidelines for contributors and supports various protocols for model names as positional arguments.
- Regex Handling Fix: The issue of handling misplaced special regex characters has been addressed. This prevents segmentation violation errors for certain regex patterns and aims to fix issue #13390.
- Interim Server Implementation: A proof-of-concept implementation of an "interim" server has been proposed. This introduces a `/load` endpoint to dynamically load models via an API and seeks feedback on the approach.
- README Update for Word Add-in: The README.md file has been updated to include instructions for integrating llama.cpp as a local Word Add-in. This enables its use within Microsoft Word and is currently open for review.
- SYCL Backend Compatibility: The SYCL backend of LLaMA has been enabled to build with nightly DPC++ compilers. This ensures compatibility with oneMKL and oneDNN libraries, resolving a CMake error related to the missing MKL_SYCL target.
- Main Function Refactoring: The main function of the llama-server has been refactored by breaking it down into smaller, more manageable functions. This improves code maintainability and readability.
- CUDA FlashAttention Optimization: The CUDA FlashAttention kernel has been optimized to enhance Deepseek performance. This includes allowing batch sizes for K, V, and result combination to be set based on compute capability and the number of Q columns per CUDA block.
- MoE Offloading Crash Fix: A crash issue related to the partial offloading of Mixture of Experts (MoE) in CUDA has been addressed. The fix involves reverting to cuBLAS to handle an edge case concerning padding.
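To illustrate the range syntax mentioned in the llama-bench Enhancements item above (plain value, start-end, or start-end+step), here is a hypothetical, self-contained C++ parser sketch. It is not the code from the pull request; the function name and error handling are invented for illustration.

```cpp
// Hypothetical parser for llama-bench style integer ranges - not the PR's code.
// Accepted forms: "256", "1-5", "10-100+10" (start-end+step, inclusive).
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

static std::vector<int> parse_int_range(const std::string & s) {
    const size_t dash = s.find('-', 1); // start at 1 so a leading minus sign still parses
    if (dash == std::string::npos) {
        return { std::stoi(s) }; // single value
    }
    const size_t plus = s.find('+', dash + 1);
    const int first = std::stoi(s.substr(0, dash));
    const int last  = std::stoi(s.substr(dash + 1, plus == std::string::npos ? std::string::npos : plus - dash - 1));
    const int step  = plus == std::string::npos ? 1 : std::stoi(s.substr(plus + 1));
    if (step <= 0 || last < first) {
        throw std::invalid_argument("invalid range: " + s);
    }
    std::vector<int> values;
    for (int v = first; v <= last; v += step) {
        values.push_back(v);
    }
    return values;
}

int main() {
    const char * args[] = { "256", "1-5", "10-100+10" };
    for (const char * arg : args) {
        std::printf("%-10s ->", arg);
        for (int v : parse_int_range(arg)) {
            std::printf(" %d", v);
        }
        std::printf("\n");
    }
    return 0;
}
```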
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 67
Key Closed Pull Requests
1. server : (webui) revamp the input area, plus many small UI improvements: This pull request focuses on revamping the input area of the web UI and includes numerous small UI/UX improvements such as enabling file uploads, allowing non-text file uploads, checking server capabilities for multimodal support, renaming conversations, grouping conversations by time, improving autoscroll performance, removing background color in assistant messages, using consistent icons, moving conversation options to the sidebar, and enhancing the "thought process" display.
- URL: pull/13365
- Merged: 2025-05-08T13:37:30Z
- Associated Commits: f4af3, 7d594, 44761, c8641, 7c87f, eb0d6, f994a, 9d076, d813d, 3ff07, 2a814, 993b4, 47e73, 2d2b8, 9a24d, b0be0, ae5a8, 2a3cd, a64f8, 163bd, 3da93, e7e28
2. mtmd : add C public API: This pull request introduces a C public API for the `mtmd` library by creating a C-only wrapper around C++ types, converting structs containing C++ types to opaque pointers, adding setter/getter functions for interaction, and implementing C++ wrappers to manage memory automatically, thereby addressing issue #13124 on the GitHub project. (A generic sketch of the opaque-pointer pattern appears after the key pull requests below.)
- URL: pull/13184
- Merged: 2025-05-04T21:43:42Z
- Associated Commits: 4a4f3, f6b65, e0806, 82f42, f8c27, 33579, 92d24, 111d5, 08d0f, a2308, 863db, a0fb7, 6bc7a, 4d842
3. clip : refactor graph builder: This pull request refactors the graph builder in the project by introducing a new struct `clip_graph` to streamline the construction of various graph types, such as `build_llava` and `build_attn`, in preparation for future support of flash attention and enhanced debugging capabilities, while also unifying the graph usage for different models like qwen2vl and qwen2.5vl.
- URL: pull/13321
- Merged: 2025-05-06T20:40:24Z
- Associated Commits: bfd57, ba3f5, 2c6f6, 52f63, 47517, e16e3, a1551, b429d, b6940, 9eb49, 4947b, 56b41, 37e24
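The mtmd pull request above exposes C++ internals through a C-callable interface. The following is a generic sketch of that opaque-pointer pattern with invented names (it does not reproduce the actual mtmd API): the public declarations expose only a forward-declared handle plus create/set/get/free functions, while the C++ type stays hidden in the implementation.

```cpp
// Generic C-style wrapper around a C++ type, for illustration only;
// names are invented and do not match the real mtmd API.

// ---- public C header (would normally live in its own .h) ----
#ifdef __cplusplus
extern "C" {
#endif

typedef struct my_context my_context;            // opaque: callers never see the layout

my_context * my_context_new  (void);
void         my_context_free (my_context * ctx);
void         my_context_set_name(my_context * ctx, const char * name);
const char * my_context_get_name(const my_context * ctx);

#ifdef __cplusplus
}
#endif

// ---- implementation (C++ internals hidden from C callers) ----
#include <string>

struct my_context {
    std::string name;  // C++ member that must not leak into the C interface
};

my_context * my_context_new(void)              { return new my_context(); }
void         my_context_free(my_context * ctx) { delete ctx; }
void         my_context_set_name(my_context * ctx, const char * name) { ctx->name = name; }
const char * my_context_get_name(const my_context * ctx)              { return ctx->name.c_str(); }

// ---- usage from C-compatible code ----
#include <cstdio>

int main() {
    my_context * ctx = my_context_new();
    my_context_set_name(ctx, "demo");
    std::printf("name = %s\n", my_context_get_name(ctx));
    my_context_free(ctx);
    return 0;
}
```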
Other Closed Pull Requests
- Scalar Flash Attention Shader Implementation: This pull request implements a scalar flash attention (FA) shader using scalar math in the Vulkan backend to address performance issues related to the lack of FA support. It includes optimizations for scenarios with few rows and seeks assistance for testing on non-NVIDIA GPUs and determining a suitable placeholder for Intel's shader core count.
- InternVL Support and Enhancements: This pull request introduces support for InternVL versions 2.5 and 3, including testing and conversion for various model sizes. It addresses issues such as broken tokenizers and removes the MobileVLM test.
- Llama-vscode Project Modifications: This pull request involves modifying the `setText` command in the parent window for the llama-vscode project to automatically send messages. It also upgrades package versions to address vulnerabilities and includes several commits for code formatting and other improvements.
- Naming Convention and Conversion Script Updates: This pull request addresses the issue of incorrect naming of `ffn_up` and `ffn_down` in the `clip.cpp` file, which caused confusion during the migration of a conversion script. It establishes a new naming convention and includes additional changes to align naming with the `llama.cpp` style.
- Top_n_sigma Sampler Integration and Modification: This pull request integrates the `top_n_sigma` sampler into the main sampling chain of the project, allowing it to be combined with other sampling methods. It also modifies the behavior of the `top_n_sigma` sampler to become a no-op when its value is less than or equal to zero (a simplified sketch of the idea appears after this list).
- RPC Server Enhancements: This pull request introduces enhancements to the RPC server by incorporating a backend registry and adding support for `GGML_BACKEND_DL`. It also provides a new `-d, --device` option for device selection and relocates CPU memory detection code to the CPU backend.
- FlashAttention CUDA Support for Deepseek Models: This pull request introduces FlashAttention CUDA support for Deepseek models on Ampere or newer architectures, optimizing memory and speed. It also implements the ability to use different head sizes for K and V and makes a matrix multiplication optimization in `llama-graph.cpp`.
- Security and Functionality Improvements in CI Workflow: This pull request addresses security and functionality improvements by limiting write permissions to only the release step in the CI workflow. It also fixes the Windows CUDA release file name and corrects the license file copy process for multi-config generators.
- Dependency and Script Updates: This pull request addresses an issue where the `gguf-dump` script incorrectly required the PySide6 module by removing the unnecessary dependency. It updates the `pyproject.toml` to directly reference the main functions of the scripts and adds PySide6 to the `*-extra` devShells.
- Cache-less Context for Embeddings-only Models: This pull request introduces a feature that allows for a cache-less context when using embeddings-only models like BERT, eliminating the need to create a KV cache. It includes commits that enable reranking with the `encode()` function and ensure `encode()` clears the `embd_seq`.
- Directory Renaming and Synchronization: This pull request involves renaming the 'llava' directory to 'mtmd' in the ggml-org/llama.cpp project. It also synchronizes the 'ggml' component by removing 'stdc++fs' from CMake configurations and eliminating MSVC-specific warning pragmas.
- Feed-forward Network Gate Check: This pull request addresses an issue in the `llm_graph_context::build_ffn` function where the absence of a gate in the feed-forward network (FFN) leads to incorrect calculations. It proposes either to add a gate check or assert its presence to ensure correct functionality.
- Rope Scaling Type Renaming: This pull request addresses the renaming of the `rope_scaling` `type` to `rope_type` in the `transformers` library. It adds support for both naming conventions to ensure compatibility and functionality within the project.
- New API Function for Image Embeddings: This pull request introduces a new API function, `mtmd_helper_decode_image_chunk`, which allows for the standalone decoding of image embeddings previously encoded and cached using `mtmd_encode`. It includes additional functions for managing image tokens and output embeddings.
- Web UI Modality Support Update: This pull request involves renaming the `has_multimodal` property to `modalities` in the server's web UI to accommodate future support for multiple input types. It requires updating the `/props` handler to return a list of supported modalities.
- Mistral-Small-2503 Model Chat Template Fix: This pull request addresses a one-off fix for the Mistral-Small-2503 model by adding a default chat template to prevent potential performance issues. The model `Mistral-Small-3.1-24B-Instruct-2503-GGUF` lacked this feature despite the availability of numerous GGUF quantizations online.
- Vocabulary Support for ByteDance-Seed/Seed-Coder-8B Model: This pull request adds vocabulary support for the ByteDance-Seed/Seed-Coder-8B model to the ggml-org/llama.cpp project. It was successfully merged on May 10, 2025.
- Experimental Vision Support Integration: This pull request introduces experimental vision support to the server using `libmtmd`, specifically integrating it into `server.cpp` to support GEMMA 3. It aims to test the integration of vision models and adapt `libmtmd` for non-CLI contexts.
- SYCL Component Optimization Workaround: This pull request addresses a workaround for issue #13163 by disabling the reorder optimization by default in the SYCL component. It ensures that tensor extras are not set when this optimization is disabled.
- ChatGLM Architecture Support for Tied Embeddings: This pull request addresses an issue with the chatglm architecture by implementing support for tied embeddings. It resolves the 'missing tensor ‘output.weight’’ error encountered with the GLM 1.5B model.
- YaRN Metadata for Qwen2/3MoE Project: This pull request involves setting YaRN metadata for the Qwen2/3MoE project, similar to the existing implementation in Qwen2/3. It includes commits that add this functionality and provide comments on enabling YaRN.
- Matrix Multiplication Fixes in SYCL Backend: This pull request addresses and fixes issues with non-contiguous source matrices in matrix multiplication operations within the SYCL backend. It involves disabling a specific path while acknowledging a slight performance regression.
- RPC_CMD_SET_TENSOR_HASH Request Struct: This pull request introduces a dedicated struct for handling the RPC_CMD_SET_TENSOR_HASH request. It enhances code clarity and organization in the ggml-org/llama.cpp project.
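To illustrate the top_n_sigma behavior summarized above, including the no-op when the parameter is not positive, here is a simplified, stand-alone C++ sketch of the underlying idea of keeping only tokens whose logits lie within n standard deviations of the maximum. It is not the sampler code merged into llama.cpp.

```cpp
// Simplified illustration of a top-n-sigma logit filter - not llama.cpp's sampler.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

// Mask out logits more than n_sigma standard deviations below the maximum.
// A non-positive n_sigma leaves the logits untouched (the "no-op" behavior).
static void top_n_sigma_filter(std::vector<float> & logits, float n_sigma) {
    if (n_sigma <= 0.0f || logits.empty()) {
        return;
    }
    float  max_logit = logits[0];
    double sum = 0.0, sum_sq = 0.0;
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
        sum    += l;
        sum_sq += (double) l * l;
    }
    const double mean   = sum / logits.size();
    const double sigma  = std::sqrt(std::max(0.0, sum_sq / logits.size() - mean * mean));
    const float  cutoff = max_logit - n_sigma * (float) sigma;

    for (float & l : logits) {
        if (l < cutoff) {
            l = -std::numeric_limits<float>::infinity(); // excluded from sampling
        }
    }
}

int main() {
    std::vector<float> logits = { 2.0f, 1.5f, 0.2f, -3.0f, -8.0f };
    top_n_sigma_filter(logits, 1.0f);
    for (float l : logits) {
        std::printf("%g ", l);
    }
    std::printf("\n");
    return 0;
}
```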
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ngxson | 243 | 21 | 7 | 79 |
| ggerganov | 57 | 7 | 0 | 27 |
| JohannesGaessler | 26 | 13 | 1 | 45 |
| slaren | 32 | 10 | 0 | 43 |
| danielhanchen | 62 | 0 | 1 | 3 |
| CISC | 22 | 6 | 0 | 34 |
| BradHutchings | 61 | 0 | 0 | 1 |
| matteoserva | 33 | 2 | 4 | 12 |
| jeffbolznv | 16 | 3 | 1 | 9 |
| No author found | 27 | 0 | 0 | 0 |