Weekly GitHub Report for Llama.cpp: July 28, 2025 - August 04, 2025 (12:00:41)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is b4991
1.2 Version Information:
The version released on March 29, 2025, introduces key updates and improvements, focusing on enhanced performance and user experience. Notable highlights include optimized features and bug fixes that streamline functionality and increase stability.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Eval bug: unsloth Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf and bartowski/Q8_0 seems to be broken, repeats itself: This issue reports a problem where running the unsloth Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf and bartowski Q8_0 models on the Vulkan backend causes the model to produce repetitive output, effectively repeating itself indefinitely. The user confirms that the models work correctly on the CPU and CUDA backends, indicating the issue is specific to Vulkan, and further investigation suggests it may be related to driver bugs with AMD Vega20 GPUs or the RADV Vulkan driver.
- The discussion explores whether the problem is due to sampling parameters or templates but rules these out. Users confirm the issue persists on multiple commits and with different tools, but not on CPU or CUDA. Suggestions include testing with the older amdvlk driver, running Vulkan backend debug checks, and using test-backend-ops to identify Vulkan operation errors. Debug logs reveal some Vulkan ops failing or missing checks, and the issue is suspected to stem from driver-level bugs in Mesa RADV for Vega20 GPUs. The conversation also notes that the problem does not appear in llama-cli but only in llama-server, and that fixing it may require filing detailed bug reports to Mesa with shader references and hardware info.
- Number of comments this week: 19
- Eval bug: Unable to run Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL: This issue reports a problem running the Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL model locally using llama.cpp with the CUDA backend on Linux, where the model fails to execute tool calls properly and instead prints the intended commands without running them. The discussion in the comments revolves around troubleshooting template errors, updating chat templates, fixing function-calling issues, and coordinating fixes between llama.cpp and the qwen-code CLI, eventually leading to partial success with updated templates and code changes, though some instability and crashes remain.
- The commenters investigated template-related errors and function call failures, applied patches to llama.cpp and qwen-code, tested updated chat templates from Hugging Face, and confirmed partial fixes enabling the model to run and respond correctly; however, some users still experienced crashes and random outputs, indicating ongoing instability and the need for further debugging.
- Number of comments this week: 16
- Feature Request: Qwen3-Coder Tool Call Parser: This issue requests a feature to support native tool call parsing for the Qwen3-Coder model, which uses a custom XML format for tool calls that current systems expecting JSON cannot handle properly. The user provides detailed explanations and examples contrasting Qwen3-Coder’s XML-based tool call format with Qwen3-Instruct’s JSON-based format, and suggests implementing a custom parser to convert Qwen3-Coder outputs into JSON for compatibility.
- The comments discuss related issues and attempts to fix or improve the parser, including a quick fix branch and a pull request for testing on different model sizes. Users report crashes and parsing errors when using the Qwen3-Coder template, especially with complex tool calls, and share configurations and logs illustrating these failures. Some confirm partial success with fixes or alternative templates, while others experience persistent runtime errors and server crashes, indicating ongoing challenges with stable integration and parsing of the custom XML tool call format. (A minimal sketch of such an XML-to-JSON conversion follows this list of active issues.)
- Number of comments this week: 11
- Feature Request: Implement missing ops from backends: This issue requests the implementation of missing operations across different backends in the ggml project to achieve feature parity, encouraging contributors to pick unimplemented operations, implement and test them, and submit pull requests without asking for assignments. It provides guidelines on how to contribute, including testing procedures and notes on operations related to training, aiming to make this a good first issue for contributors.
- The comments include contributors expressing interest in working on specific operations, clarifications on whether partially supported ops can be targeted, and updates on which operations are already in progress, with maintainers encouraging contributions and providing guidance on how to proceed without creating duplicate issues.
- Number of comments this week: 6
- Eval bug: imatrix generation for LFM2 fails; collect_imatrix: inconsistent size for blk.0.shortconv.in_proj.weight: This issue reports a bug in the llama.cpp project where generating the importance matrix (imatrix) for LFM2 models fails due to inconsistent tensor sizes, specifically for the parameter blk.0.shortconv.in_proj.weight, with observed size mismatches following a 4/3 ratio pattern. The user provides detailed reproduction steps and logs showing the error occurs across multiple LFM2 model sizes, and a problematic commit is identified; subsequent comments diagnose the cause related to handling multiple parallel sequences and propose a workaround and a fix.
- The discussion identifies that the error stems from the shape of intermediate tensors when processing multiple sequences in parallel, which does not match the expected dimensions in imatrix computation. A workaround to use a batch size equal to the chunk size is suggested, and a fix is implemented and merged. Additionally, a related but separate issue about missing importance matrices during quantization of another model is raised, with detailed logs provided, but no resolution is included in this thread.
- Number of comments this week: 5
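To make the parsing gap described in the Qwen3-Coder tool-call issue above more concrete, here is a minimal Python sketch of the kind of XML-to-JSON conversion the request asks for. The tag names and the example payload are illustrative assumptions, not the model's exact template.

```python
# Illustrative sketch only: the tag names below are hypothetical stand-ins for
# Qwen3-Coder's XML-style tool-call markup, not the model's exact format.
import json
import re

def xml_tool_call_to_json(text: str) -> str:
    """Convert an XML-style tool call into an OpenAI-style JSON tool call."""
    func = re.search(r"<function=([\w.-]+)>(.*?)</function>", text, re.S)
    if func is None:
        raise ValueError("no tool call found")
    name, body = func.group(1), func.group(2)
    args = {
        key: value.strip()
        for key, value in re.findall(r"<parameter=([\w.-]+)>(.*?)</parameter>", body, re.S)
    }
    return json.dumps({"name": name, "arguments": args})

example = (
    "<tool_call><function=read_file>"
    "<parameter=path>src/main.cpp</parameter>"
    "</function></tool_call>"
)
print(xml_tool_call_to_json(example))
# {"name": "read_file", "arguments": {"path": "src/main.cpp"}}
```

A production parser would also need to cope with streamed, partial, or malformed tags, which the sketch deliberately ignores.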
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 490 days and highlights a discrepancy in behavior between different Vulkan backends used in the project.
- Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce debugger output similar to that provided by Apple's Metal debugger, to aid in performance analysis and debugging.
- common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress indicators when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the parallel download progress status by properly utilizing the CURLOPT_NOPROGRESS option to ensure accurate and non-overlapping progress reporting.
- kubernetes example: This issue discusses the creation of a Kubernetes example for the llama.cpp project, specifically focusing on developing a Helm chart to facilitate deploying the server in a scalable and efficient manner. The author has made initial progress but seeks community assistance to continue the work, highlighting the importance of Kubernetes in industry deployments and the potential benefits for users of the project.
- Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. Specifically, the error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 31
Summarized Issues:
- Backend Support and Model Compatibility: Multiple issues highlight missing or incomplete backend support and model compatibility problems in llama.cpp, including requests for new model architectures like GLM 4.5 MoE, T5Gemma, and StepFun step3, as well as issues with loading certain GGUF models and handling unique model formats. These gaps cause failures such as segmentation faults, unsupported model type errors, and corrupted outputs, indicating a need for expanded and improved backend and model support.
- issues/14909, issues/14921, issues/14940, issues/14951, issues/14988, issues/14998
- CUDA and GPU Backend Issues: Several reports describe problems with CUDA and GPU backends, including failures to detect GPUs, crashes during inference, slow model loading on ROCm, and broken or missing output on specific hardware like Nvidia Jetson Orin Nano. These issues often involve compatibility problems, improper GPU detection, or memory errors, which prevent proper GPU acceleration and degrade performance.
- issues/14915, issues/14936, issues/14974, issues/14999, issues/15010, issues/15016, issues/15018, issues/15034, issues/15041
- Compilation and Build Failures: Multiple issues report compilation errors and build failures across different platforms and configurations, including Windows with Vulkan and CUDA, macOS with Xcode, Ubuntu 24.04 Docker builds, and older CUDA toolchains. These errors stem from missing definitions, incompatible toolchain requirements, or environment-specific problems, blocking successful builds and deployment.
- issues/14953, issues/14954, issues/15004, issues/15027
- Server and Runtime Crashes: Several issues describe runtime crashes and exceptions in llama-server, including JSON parsing errors, audio decoding failures, deferred task deadlocks, and assertion failures during CUDA operations. These bugs cause server instability, crashes, or hangs during inference or embedding requests, indicating robustness problems in server-side handling.
- issues/14923, issues/14963, issues/15008, issues/15041
- Memory and Cache Management Bugs: Reports include memory errors when running MoE models without memory mapping, KV cache failures causing repeated empty tokens, and tensor size mismatches during importance matrix generation. These issues lead to decoding failures, memory aborts, or incorrect computations, highlighting problems in memory and cache handling.
- issues/14938, issues/14972, issues/14979, issues/14999
- Quantization and Decoding Quality Issues: One issue notes a significant drop in acceptance rate for draft models using q4_0 quantization in speculative decoding compared to higher-bit quantization methods, suggesting a potential bug affecting decoding quality and model output reliability. (A brief sketch of how acceptance rate can be measured follows this list of issue groups.)
- issues/15039
- Platform-Specific Execution Failures: Problems are reported running llama.cpp on specific hardware platforms like RISC-V VisionFive2 and QCM6490 with Adreno GPU, where illegal instructions or missing GPU detection prevent execution or GPU utilization. These issues indicate incomplete platform support or hardware compatibility gaps.
- issues/14926, issues/14936
- Model Output and Parsing Errors: Issues include failures to parse custom XML tool call outputs from Qwen3-Coder models and corrupted chat outputs from OLMoE models in containerized environments, causing errors and nonsensical responses due to incompatibilities in output handling and template processing.
- issues/15012, issues/14988
- API and Binding Bugs: A critical bug in the Swift binding causes immediate termination of text generation due to an unreset flag and unsafe optional unwrapping, leading to crashes and failed generations, indicating the need for safer state management and error handling in language bindings.
- issues/15028
- User Interface and Example Application Improvements: A proposal suggests modernizing the Android example app by adopting MVVM architecture, Jetpack Compose, and Material Design 3 to improve UI, error handling, and developer experience, aiming to create a more robust and attractive reference implementation.
- issues/15022
- Model Conversion and Loading Failures: Attempts to convert or load certain models fail due to unsupported model types or unrecognized architectures, causing errors and preventing usage of these models within llama.cpp.
- issues/14951, issues/14998, issues/15016
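For context on the speculative-decoding regression noted above (issues/15039), "acceptance rate" is the fraction of draft-model tokens that the target model keeps. The sketch below uses a simplified greedy notion of acceptance; it is an illustration of the metric, not llama.cpp's actual verification logic.

```python
# Minimal sketch (not llama.cpp's verification code): with greedy decoding,
# a draft token is "accepted" while it matches what the target model would
# have produced; the first mismatch ends the accepted run.
def greedy_acceptance_rate(draft_tokens: list[int], target_tokens: list[int]) -> float:
    """Fraction of proposed draft tokens accepted before the first mismatch."""
    if not draft_tokens:
        return 0.0
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted / len(draft_tokens)

# A heavily quantized draft (e.g. q4_0) that diverges from the target model
# earlier yields a lower rate, which is the regression the issue describes.
print(greedy_acceptance_rate([11, 42, 7, 99], [11, 42, 8, 99]))  # 0.5
```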
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 17
Summarized Issues:
- Compilation and Build Issues with ROCm and Vulkan Backends: Multiple issues report compilation errors and warnings related to the HIP and Vulkan backends on Linux, including missing header files, incorrect function calls, and unexpected GPU target detection. These problems cause build failures or degraded performance, complicating development and deployment on ROCm and Vulkan platforms.
- issues/14538, issues/14776, issues/14971
- Runtime Errors and Performance Problems on ROCm and Vulkan GPUs: Several issues describe runtime failures such as "invalid device function" errors on ROCm AMD GPUs and out-of-memory errors on Vulkan backends due to buffer allocation limits. These errors prevent successful model evaluation or loading, severely impacting usability on these GPU platforms.
- issues/14696, issues/15009, issues/14854
- Quantization Bugs and Assertion Failures: There are multiple bugs in the quantization tools where incorrect assumptions about dataset formats or tensor dimensions cause assertion errors. These issues block quantization workflows, especially when handling GGUF format files or tensors with non-divisible row sizes.
- issues/14952, issues/14996
- Model Output and Generation Instability: Some issues report unexpected changes in embedding outputs between commits and repeated token sequences during image-to-text generation, raising concerns about output consistency and stability. These problems affect model reliability and user trust in generated results.
- issues/14848, issues/14888
- GPU Backend Recognition and Utilization on Apple Silicon: An issue highlights that the Metal backend on Apple M4 GPUs is not properly recognized or utilized by the llama.cpp tool installed via Homebrew, resulting in no GPU acceleration during inference. This limits performance benefits on Apple hardware.
- issues/14966
- Code Refactoring to Reduce Duplication: One issue proposes merging two related functions to eliminate code duplication by adding an optional parameter, improving code maintainability and clarity. This refactor aims to simplify the codebase without changing functionality.
- issues/14920
- Server-Client Data Transmission Bug in OpenAI API Endpoint: There is a bug where delta reasoning data generated on the server is sent with empty content fields to the client, causing missing reasoning data in client responses despite correct server logs. This disrupts the expected data flow and client-side functionality.
- issues/15000
- Vulkan Backend Parallel Stream Output Errors: A bug in the Vulkan backend causes incorrect outputs when parallel streams are used with specific flags disabled, traced to issues in handling batched 4D matrix multiplications with permuted inputs. This affects multi-head attention computations and output correctness.
- issues/15006
- Documentation and Package Maintenance for Arch Linux: An issue discusses updating Arch Linux installation instructions and clarifying the maintenance status of related AUR packages, ensuring users have accurate and current setup guidance.
- issues/14976
- New Inference Engine Announcement: A notification introduces the uzu inference engine claiming significant speed improvements on Apple Silicon, suggesting potential interest in evaluating its Rust-based implementation despite lacking independent benchmarks.
- issues/14958
- Plugin Loading Failures in VS Code: The cline plugin for VS Code fails to load system prompts with GGUF models in llama-server, producing random errors and preventing effective use of certain tools. This limits developer productivity and integration capabilities.
- issues/14866
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. ggml: initial IBM zDNN backend: This pull request introduces the IBM zDNN accelerator as an initial backend for the ggml library on the s390x platform, supporting only the GGML_TYPE_F32 data type initially, and demonstrates significant performance improvements on IBM Granite models while laying the groundwork for future support of IBM Spyre PCIe accelerator cards.
- URL: pull/14975
- Merged: No
- Associated Commits: e0848, fd491, 02cfc, 36d76, 1989f, b9756, 60b98, 529bd, 11d58, 77a75, 04ddb, 7c639, ae2f6, af9f4, 9e847, 13c05, 13c64, ee0ed, b7f4b, 63fbc, da2e0, 18658, 82851, c1653, 59e98, a1d85, 1c75e, f263f, aef93, bee7d, f800c, 4b2f1, 6d717, f7e8d, 092fa, f239b, cf0e1, 032dc, 08de8, fc692, eefa9, 6f425, 4cc62, 03ec5, 09051, f99b2, e0549, fc926, 2cfa1, 28722, 1a052, 1c6ca, a9438, 0ae2d, ab60a, 2d45e, 34468, b28b4, b1376, 8dbca, e695e, 213f1, 4493b, e30b1, fd766, b4dff, ad0cb, 7b50d, 4fb6b, 20d69, 4d5ed, b7a77, 1eb7c, 70224, 803dd, e67fe, 90d46, 92a17, cf8cd, 867d3, 12e6b, 732df, 6b6eb, fb024
2. Implementation of GGML_NUMA_MIRROR for 64% inferencing performance gain on numa systems: This pull request introduces the implementation of GGML_NUMA_MIRROR, a strategy that mirrors the model in the local memory of each NUMA node to eliminate slow interconnect bottlenecks, resulting in a reported 64% performance improvement during inferencing on NUMA systems, along with necessary code cleanup, build instructions, and extensive testing on a dual Xeon Gold system.
- URL: pull/14969
- Merged: No
- Associated Commits: f98ac, 99b0e, c060a, 82483, daed6, 89567, b0012, c2ba0, b8223, 7e539, ab371, 9b8e7, 14bfb, 7cfc6, afbff, 4f0c3, ea046, 1553d, 4998a, debae, ebaf5, 07047, b97df, d1d3e, bf2d6, 0f4bf, b956e, 7faf5, fa72a, 92593, a7092, 18f3c, 1a053, 2275a, febde, 892b0, 8bbb0, f3540, f57ea, 5fa23, 3a9be, b8ce4, e6072, d82ca, 756fb, 9d664
3. model: Add support for GLM 4.5 family of models (#14921): This pull request adds comprehensive support for the newly released GLM 4.5 family of models to llama.cpp, including architecture registration with MoE components, multi-variant model loading, expert weight handling with merged 3D tensor formats, HuggingFace conversion integration, and implementation of a new graph class for expert routing, while ensuring compatibility with both the 47-layer Air and 93-layer full variants.
- URL: pull/14939
- Merged: No
- Associated Commits: c7550, 0edf7, 6b478, 96528, 07bb0, fae4d, 03fad, b61fc, 999c0, 5baa6, 62447, 58898, ab318, 6f3d9, b25f4, c90f6, bdfe0, 3d15c, dbfad
Other Open Pull Requests
- Model support and integration: Multiple pull requests add support for new models and backends, including GLM-4.5 Mixture-of-Experts (MoE) large language models, the CogVLM model with GGUF mappings and visual encoder integration, and a home-cooked Mistral Small Omni model supporting audio and image inputs. Additionally, support for the MUSA backend is introduced, expanding the project's operational capabilities.
- pull/15026, pull/15002, pull/14928, pull/14941
- Backend improvements and optimizations: Several pull requests focus on backend enhancements, including WebGPU backend improvements with parameter buffer pooling and thread safety, initial F16/F32 fused attention support in the OpenCL backend, and an optimized SIMD-based implementation of the L2 norm operation yielding a 6% CPU performance boost. There is also a fix to ensure OpenCL command queue completion before profiling to prevent crashes on Adreno 830 devices.
- pull/14978, pull/14987, pull/14970, pull/15042
- Tensor and operation handling fixes: Fixes and optimizations are made to tensor handling, such as correcting 3D activation processing in imatrix for hybrid and recurrent models, optimizing the MUL_MAT_ID computation path by repacking data types for GEMM operations, and adding extra CPU buffer types with CPU_REPACK overrides. These changes improve tensor management and prepare for further backend deduplication.
- pull/14994, pull/14918, pull/14925
- CUDA and memory operation enhancements: A pull request proposes adding a CUDA-based set operation with initial draft implementations and replaces the copy kernel with an asynchronous CUDA memory copy to improve performance.
- pull/14980
- Bug fixes and code corrections: Various fixes include correcting a typographical error in OpenCL-related code, fixing a compilation bug related to the BASE_CUDA_DEV_CONTAINER on Ubuntu 24.04, and ensuring the /completion endpoint flushes partial stop strings correctly in streaming mode to deliver complete output (a sketch of the stop-string buffering idea follows this list).
- pull/14908, pull/15005, pull/15007
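The /completion streaming fix mentioned above concerns the text a server holds back because it might be the start of a stop string. The following sketch illustrates that buffering idea; it is a simplified illustration, not the llama-server implementation, and it omits truncation when a full stop string actually appears.

```python
# Sketch of partial stop-string handling in a streaming endpoint
# (illustrative only): hold back any trailing text that could still turn
# into a stop string, and flush whatever is held back once generation ends.
def longest_partial_stop(text: str, stop_strings: list[str]) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of a stop string."""
    best = 0
    for stop in stop_strings:
        for n in range(min(len(stop) - 1, len(text)), 0, -1):
            if stop.startswith(text[-n:]):
                best = max(best, n)
                break
    return best

def stream_chunks(pieces: list[str], stop_strings: list[str]):
    held = ""
    for piece in pieces:
        held += piece
        keep = longest_partial_stop(held, stop_strings)
        if keep < len(held):
            yield held[: len(held) - keep]   # safe to emit now
            held = held[len(held) - keep:]   # might still become a stop string
    if held:                                  # end of stream: flush the held-back tail
        yield held

print(list(stream_chunks(["Hello <", "|notastop"], ["<|endoftext|>"])))
# ['Hello ', '<|notastop']  -- the '<' is held until it clearly is not a stop string
```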
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 74
Key Closed Pull Requests
1. Add support for SmallThinker model series: This pull request adds support for the SmallThinker series of on-device native Mixture-of-Experts (MoE) language models to the llama.cpp project by introducing a new model architecture, conversion tools, and inference functions tailored to SmallThinker’s unique design optimized for resource-constrained local deployment.
- URL: pull/14898
- Merged: 2025-07-28T11:47:00Z
- Associated Commits: efe27, a6d6e, a5274, 8e2cb, e28d2, ebd78, 8c6af, 92b51, 4186b, f1d46, f10cd, 4af8b, e2c90, 594af, 29e1f, 5d09d, bb3dd, e338c
2. model: add hunyuan dense: This pull request adds support for the hunyuan_dense model, fixes the hunyuan_moe chat template, and includes various updates and corrections related to hunyuan model versions and chat functionality.
- URL: pull/14878
- Merged: 2025-08-01T13:31:12Z
3. ggml-cpu : deduplicate scalar implementations: This pull request cleans up and removes redundant fallback scalar implementations across multiple CPU architectures in the ggml-cpu project, streamlining the codebase following a previous refactor.
- URL: pull/14897
- Merged: 2025-07-28T15:40:24Z
Other Closed Pull Requests
- Vulkan-based convolution optimizations: This pull request introduces multiple Vulkan-based optimizations for direct convolution, including empirically chosen tile sizes, fixes for shared memory bank conflicts with 16-byte padding, explicit loop unrolling, and skipping computations for out-of-bounds tile regions. Additional improvements include fast division optimizations and disabling shuffles on NVIDIA hardware, collectively enhancing convolution performance across various test cases.
- MiniCPM-V 4.0 integration: This pull request integrates support for MiniCPM-V 4.0, a multimodal model optimized for phone-sized devices, into the llama.cpp project. It includes initial model adaptation with plans for Apple NPU acceleration and a reference app demo to enable efficient on-device deployment across Mac, iPad, and iPhone platforms.
- CUDA backend softcap fusion and refactoring: This pull request introduces softcap fusion (scale->tanh->scale) to the CUDA backend and includes minor refactoring of the ggml_cuda_can_fuse function to better handle unary operations.
- Configurable seed in server benchmarking: This pull request enhances the server-bench.py script by making the seed choice configurable, allowing users to reproduce multiple benchmark results consistently by setting a specific seed while keeping all other parameters unchanged.
- SYCL backend support for GGML_OP_SET_ROWS and BF16: This pull request adds support for the GGML_OP_SET_ROWS operation for various quantized tensor types and BF16 in the SYCL backend, moving quantization/dequantization copy kernels to a shared header. It also addresses related TODOs and notes potential GPU compatibility considerations for BF16.
- Experimental SYCL gemv kernel for q4_K: This pull request introduces an experimental SYCL-based gemv kernel for q4_K that restructures block layouts to improve performance on Intel GPUs, achieving significant speedups on phi3 hardware but causing regressions on llama2 models. It includes configurable tiling and Intel-specific intrinsics, with the feature marked as opt-in due to observed performance trade-offs.
- Test updates for LLAMA_SET_ROWS=1 default: This pull request updates several tests to support the upcoming default setting of LLAMA_SET_ROWS=1, including limiting CPU threads per context in test-thread-safety to prevent hangs, setting cparams.n_seq_max to 1, and defaulting to a unified KV cache in the embedding and save-load-state tests when -np is not specified.
- Docker-based CANN build pipeline: This pull request significantly enhances the build process by adding a Docker-based CANN build pipeline to support Huawei's CANN backend integration.
- Voxtral model support in mtmd tool: This pull request adds support for the Voxtral model to the mtmd tool, including testing instructions, model conversion steps, and integration of tokenizer and chat template code from related projects.
- llama-quantize README update: This pull request updates the llama-quantize README.md by removing outdated information and adding current capabilities and examples to improve documentation clarity.
- Fix server crash on broken UTF-8 sequences: This pull request fixes a server crash caused by broken or incomplete UTF-8 sequences in common_chat_parse() by adding a UTF-8 truncation helper and skipping debug logging when partial or unfinished UTF-8 sequences are detected, preventing assertion failures during JSON serialization (a minimal sketch of the truncation idea appears at the end of this list).
- ggml submodule synchronization and fixes: This pull request synchronizes the ggml submodule with llama.cpp, including fixes for 32-bit Vulkan builds and improvements to the BLAS link interface in CMake configuration.
- llama-server webUI reasoning content support: This pull request adds support for handling delta.reasoning_content in the llama-server's webUI with the default --reasoning-format deepseek and improves compatibility with new Qwen3 2507 thinking-specific models that do not send a <think> tag by default.
- Update vendored google/minja with chat template fixes: This pull request updates the vendored copy of google/minja to the latest version, incorporating fixes for the SmolLM3 chat template, including support for the in operator on strings and the string.replace function, and resolving issues in the Deepseek R1 and older Qwen chat templates.
- Model pre-tokenizer hash CI check and update script: This pull request introduces a new continuous integration check to verify that model pre-tokenizer hashes are up-to-date, ensuring synchronization between convert_hf_to_gguf.py and convert_hf_to_gguf_update.py. It also adds a --check-missing parameter to disable downloads and trigger errors for missing models.
- Fix batched GEMM stride calculations in CUDA and SYCL: This pull request fixes the batched GEMM implementation for the CUDA and SYCL backends by correcting stride calculations when ne02 == 1 and ne03 > 1, ensuring the source tensor is always treated as contiguous during conversion, and improving variable naming and comments.
- LLaDA 8b diffusion model support and example CLI: This pull request adds support for the LLaDA 8b diffusion model with a new example CLI tool distinct from the dream-7b model due to differing token generation semantics. It includes a README for usage and initiates discussion on potential server API integration.
- SYCL 8-bit quantization refactor: This pull request refactors the 8-bit quantization implementation for SYCL by separating quantization kernels into a dedicated header, unifying kernel submission with sycl::nd_item<1>, restructuring the quantize_q8_1 kernel to match the reorder q8_1 kernel format, and adding exception handling to improve flexibility without affecting existing functionality.
- Reduce redundant splits in computation graph: This pull request reduces redundant splits in the computation graph for recurrent and hybrid models by creating and storing two views of s_copy at graph input creation instead of per model layer, significantly improving generation speed as shown by benchmark tests.
- Extended test case filtering in test-backend-ops: This pull request extends test case filtering by allowing multiple comma-separated operations and supporting full test-case variation strings for more precise selection and profiling of individual test variations, enhancing convenience and flexibility in testing and benchmarking.
- Vulkan debug mode fixes: This pull request addresses minor issues in Vulkan debug mode by fixing debug-related problems and removing broken check_results support for GGML_OP_SET_ROWS.
- ggml submodule indentation improvement: This pull request synchronizes the ggml submodule and improves readability by indenting the ggml-config.cmake file.
- CUDA roll function implementation: This pull request adds a CUDA implementation of the roll function, improving GPU support as part of issue #14909.
- Fix CLS pooling extraction logic: This pull request fixes the extraction of CLS pooling results by correcting logic in llm_graph_input_cls::set_input to separate handling of the RANK case while combining the CLS and LAST cases, addressing issue #14848.
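The UTF-8 crash fix listed above hinges on trimming an incomplete multi-byte sequence from the end of a partially streamed buffer before it is serialized. The sketch below shows the general technique; it is an illustration of the idea in Python, not the common_chat_parse() fix itself.

```python
# Illustrative sketch of a UTF-8 truncation helper (not the actual fix):
# before serializing a partially streamed byte buffer, drop any incomplete
# multi-byte sequence at the end so JSON encoding never sees invalid UTF-8.
def truncate_incomplete_utf8(data: bytes) -> bytes:
    """Return `data` with any incomplete trailing UTF-8 sequence removed."""
    # Walk back over up to three trailing continuation bytes (0b10xxxxxx).
    i = len(data)
    while i > 0 and i > len(data) - 4 and (data[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return b""
    lead = data[i - 1]
    if lead < 0x80:                  # ASCII byte: nothing to trim
        return data
    # Expected sequence length from the lead byte.
    if   (lead & 0xE0) == 0xC0: need = 2
    elif (lead & 0xF0) == 0xE0: need = 3
    elif (lead & 0xF8) == 0xF0: need = 4
    else:                            # invalid lead: drop it and the trailing bytes
        return data[: i - 1]
    have = len(data) - (i - 1)
    return data if have >= need else data[: i - 1]

text = "naïve 🙂".encode("utf-8")
print(truncate_incomplete_utf8(text[:-1]).decode("utf-8"))  # 'naïve ' (emoji cut mid-sequence is dropped)
```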
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ggerganov | 76 | 18 | 1 | 63 |
| CISC | 31 | 2 | 0 | 107 |
| taronaeo | 92 | 1 | 1 | 1 |
| JohannesGaessler | 26 | 8 | 0 | 50 |
| jeffbolznv | 29 | 6 | 0 | 18 |
| am17an | 36 | 5 | 2 | 9 |
| dbsanfte | 46 | 1 | 0 | 0 |
| 0cc4m | 9 | 3 | 1 | 27 |
| ryan-mangeno | 33 | 0 | 1 | 1 |
| sammcj | 19 | 1 | 1 | 11 |