Weekly Project News

Weekly GitHub Report for Llama.cpp: December 01, 2025 - December 08, 2025 (12:01:08)

Weekly GitHub Report for Llama.cpp

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current release of this repository is b4991.

1.2 Version Information:

The release of March 29, 2025 (build b4991) is an incremental update focused on performance, stability, and usability improvements.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Possible Improvement Required: Prompt Processing speed decreases on Vulkan AMD Renoir APU with -fa 1: This issue reports a performance regression when flash attention (-fa 1) is enabled on an AMD Renoir APU using the Vulkan backend in llama.cpp, with prompt processing speed dropping noticeably, especially at larger prompt sizes. Backed by detailed benchmarks and hardware information, the user asks whether this slowdown is expected behavior for integrated GPUs, what the underlying causes are, and whether flash attention should be enabled on such hardware.

    • The discussion reveals that a slight slowdown with flash attention on some hardware, particularly integrated GPUs with limited memory bandwidth, is expected due to the scalar flash attention path and lack of AMD-specific optimizations. Multiple users share benchmark results showing performance drops ranging from minor to significant depending on model size and prompt length. Attempts to tune Vulkan pipeline parameters like split_k did not improve performance, but reducing tile size to mitigate register spilling showed promise. The issue remains open for potential future AMD optimization contributions.
    • Number of comments this week: 36
  2. Eval bug: Request for Qwen3-Next-80B-A3B Vulkan Inference Optimization: This issue reports that the Qwen3-Next-80B-A3B model's current Vulkan backend implementation is significantly slower compared to other A3B Qwen models, and requests optimization efforts to improve its inference speed. The discussion centers around identifying inefficiencies caused by unsupported or partially supported operations like DIAG and CONT, the impact of graph splits on performance, and experimentation with kernel parameters and batch sizes to enhance Vulkan inference throughput.

    • Commenters analyzed the model's graph and Vulkan backend support, noting that some operations run on CPU due to lack of Vulkan support, which fragments execution and reduces speed. They shared detailed profiling data, discussed potential kernel improvements such as increasing chunk sizes in SOLVE_TRI, and tested workarounds like replacing DIAG with TRI operations, concluding that optimizing reshape/cont/permutation ops and tuning batch sizes could yield modest speedups, while awaiting further backend enhancements.
    • Number of comments this week: 17
  3. Feature Request: Add Video Modality Support (Qwen2.5-VL) via llama-mtmd-cli: This issue requests the addition of native video modality support to the llama.cpp multimodal pipeline, specifically targeting the Qwen2.5-VL model which is designed for video understanding using dynamic-resolution processing and absolute time encoding. The goal is to enable direct video input handling for tasks like video captioning and temporal reasoning on edge devices, bridging the gap between existing Python frameworks and the C++ inference engine while also supporting Nvidia’s Cosmos Reason1 model for physical AI applications.

    • The comments clarify that Conv3D operations are already supported via decomposed Conv2D operations, and the feature is actively being worked on by multiple contributors. There is some caution expressed about AI-generated content in issues and PRs, emphasizing manual review and vetting of code. Discussion also highlights the importance of designing a flexible API to support multiple video models beyond Qwen, including future multimodal inputs like audio, and the potential for streaming video processing via frame callbacks to reduce memory usage and enable live video input.
    • Number of comments this week: 12
  4. Misc. bug: [SYCL][Intel GPU] GPT-OSS 20B not loading fully to VRAM with -ngl 99 (A770): This issue reports a problem where the GPT-OSS 20B model does not fully load into the VRAM of an Intel Arc A770 GPU when using the SYCL backend with a very high -ngl setting, causing heavy CPU load due to partial offloading of computations. The user provides detailed system information, logs, and reproduction steps, and the discussion reveals that certain tensor operations and data types (like MXFP4 and SWIGLU_OAI) are not fully supported on SYCL for this GPU, leading to fallback on CPU processing and reduced performance.

    • The comments confirm that the Vulkan backend does not exhibit this issue, and logs show that many tensors cannot be used with the preferred SYCL_Host buffer and fall back to CPU memory. It is explained that the Intel Arc A770 GPU does not support some operations and data types (e.g., MXFP4, SWIGLU_OAI) on SYCL, causing partial CPU offloading (an API-level sketch of how full layer offload is requested appears after this list). A contributor reports progress in shifting more workload to the GPU, reducing host memory usage but also lowering token processing speed, and mentions ongoing work to optimize unsupported operations and implement flash-attention support. The original poster expresses gratitude while acknowledging limited expertise to contribute further.
    • Number of comments this week: 10
  5. Eval bug: rpc-server crashes with a gpf: This issue reports a crash of the RPC server backend when using the CUDA and RPC backends together, triggered after initial successful model loading and a brief interaction with the model. The crash manifests as a general protection fault and illegal memory access errors on NVIDIA GPUs, with detailed logs showing Vulkan and CUDA backend failures, and the user investigates by testing different versions, models, and backends, seeking clarification on whether the problem lies in llama.cpp or NVIDIA drivers.

    • The discussion includes attempts to reproduce the crash with smaller models and different versions, revealing that some versions do not crash while others do; detailed backtraces and logs are shared showing Vulkan device lost errors and CUDA illegal memory accesses; the user confirms the issue occurs across multiple servers and hardware; a workaround with an older version (7041) and manual tensor splitting is found to avoid crashes; the conversation ends with the user requesting guidance on whether this is a llama.cpp bug or an NVIDIA driver/kernel issue.
    • Number of comments this week: 9
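
To make the -ngl behavior discussed in issue 4 concrete, the following is a minimal sketch of requesting full layer offload through the llama.cpp C API. The function and field names follow the public llama.h as of recent builds but should be verified against your checkout; whether layers actually land on the GPU still depends on backend support for each tensor, which is exactly what the SYCL report above describes.

```cpp
// Minimal sketch (not taken from the issue): ask for full GPU offload, the
// API-level equivalent of running llama-cli/llama-server with -ngl 99.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // request that all layers be offloaded

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        std::fprintf(stderr, "failed to load model\n");
        llama_backend_free();
        return 1;
    }

    // Even with n_gpu_layers set high, tensors whose operations or data types
    // are unsupported by the selected backend (e.g. MXFP4 on SYCL for the A770)
    // fall back to host/CPU buffers, which is the partial-offload behavior the
    // issue reports.

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```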

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. Kompute-based Vulkan backend shows an GGML_OP_GET_ROWS error: This issue reports an error related to the Kompute-based Vulkan backend, specifically a GGML_OP_GET_ROWS error that does not occur with the other Vulkan backend. The problem has been open for over 616 days and highlights a discrepancy in behavior between two Vulkan backends used in the project.
  2. Question: How to generate an MPS gputrace: This issue is a request for guidance on how to generate an MPS gputrace specifically for the llama.cpp project during model inference, as part of efforts to improve the Metal backend in a related project. The user is seeking a documented or known method to produce the type of GPU debugger output used by Apple's Metal debugger to aid in performance analysis and debugging.
  3. common: download from URL, improve parallel download progress status: This issue addresses the problem of conflicting progress displays when downloading multiple shards of a model in parallel, which was introduced in a previous update. It proposes improving the implementation of the parallel download progress status by properly utilizing libcurl's CURLOPT_NOPROGRESS option to ensure accurate, non-overlapping progress reporting (a libcurl progress-callback sketch appears after this list).
  4. kubernetes example: This issue discusses the creation of a Kubernetes Helm chart for deploying the llama.cpp server, aiming to facilitate scalable application deployment within the community. The original poster has made initial progress on this example but is seeking additional contributions and plans to continue development when time permits.
  5. Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf: This issue reports a problem with loading the microsoft/bitnet-b1.58-2B-4T-gguf model using CUDA on a Windows system with an NVIDIA GeForce RTX 3060 GPU. The error occurs because a tensor in the model file has a number of elements per row that is not a multiple of the expected block size, causing the model loader to fail when reading tensor information.
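
The CURLOPT_NOPROGRESS fix proposed in stale issue 3 boils down to wiring libcurl's per-transfer progress callback correctly. The sketch below shows the standard pattern for a single handle; the callback name and label are illustrative, and the actual llama.cpp downloader would additionally need to route each shard's callback into one shared, non-overlapping display.

```cpp
// Illustrative libcurl progress reporting: CURLOPT_NOPROGRESS must be 0 for the
// transfer callback to fire. Callback and variable names are hypothetical.
#include <curl/curl.h>
#include <cstdio>

static int xfer_cb(void * clientp, curl_off_t dltotal, curl_off_t dlnow,
                   curl_off_t /*ultotal*/, curl_off_t /*ulnow*/) {
    const char * label = static_cast<const char *>(clientp);
    if (dltotal > 0) {
        std::fprintf(stderr, "\r%s: %3.0f%%", label, 100.0 * (double) dlnow / (double) dltotal);
    }
    return 0; // returning non-zero would abort the transfer
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (curl == nullptr) return 1;

    const char * label = "shard-00001";
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/model-00001-of-00002.gguf");
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);            // enable progress callbacks
    curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, xfer_cb); // modern progress callback
    curl_easy_setopt(curl, CURLOPT_XFERINFODATA, (void *) label);

    const CURLcode res = curl_easy_perform(curl);
    std::fprintf(stderr, "\n");
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```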

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 35

Summarized Issues:

  • Vulkan Backend Performance and Stability Issues: Multiple issues report performance regressions, crashes, and inefficiencies related to the Vulkan backend on various hardware including Intel Arc A770, AMD Renoir APU, and rk3588-based Orange Pi 5 Plus. These problems include token generation slowdowns, driver errors, and missing libraries causing device detection failures, highlighting the need for optimization and stability improvements in Vulkan support.
  • issues/17628, issues/17715, issues/17751, issues/17761, issues/17783
  • GPU Backend Crashes and Memory Errors: Several issues describe crashes and illegal memory access errors when using CUDA or RPC backends on NVIDIA GPUs, including RTX 4090 and DGX Spark systems, often triggered by specific models or input sizes. These errors cause device loss, server crashes, and unstable inference, indicating critical bugs in GPU memory handling and backend integration.
  • issues/17647, issues/17796, issues/17818
  • SYCL Backend and Intel iGPU Compatibility Problems: Issues report incomplete model loading, corrupted outputs, and fallback to CPU processing when using the SYCL backend on Intel Arc A770 and Intel iGPU hardware. Unsupported tensor operations and data types cause partial VRAM utilization and performance degradation, necessitating fixes to fully leverage GPU capabilities.
  • issues/17643, issues/17656
  • Feature Requests for Multimodal and Audio Support: Requests include adding support for Qwen Omni audio generator with Whisper-like API, image embedding models like CLIP and siglip, and native video modality support for Qwen2.5-VL in the multimodal pipeline. These enhancements aim to expand the project's capabilities for speech recognition, text-to-speech, and video understanding on edge devices.
  • issues/17634, issues/17635, issues/17660
  • Model Loading and Inference Optimization Requests: Users request optimizations for Vulkan backend speed on specific Qwen models, CUDA efficiency on Jetson AGX Orin, and GPU VRAM usage reduction during high-resolution image inference. These requests highlight performance bottlenecks and seek improvements in resource management and throughput.
  • issues/17747, issues/17751, issues/17801
  • llama-server Validation and Slot Selection Bugs: Issues describe validation failures when optional keys are omitted, inconsistent slot selection based on LRU despite prompt similarity, and rejection of non-standard content types like "thinking." These bugs cause incorrect request handling and unclear behavior in prompt caching and reasoning trace management.
  • issues/17667, issues/17673, issues/17700
  • Web UI Usability and Feature Enhancements: Requests include adding search functionality in the model selector, multiple response support, sidebar visibility toggle, and improved console readline navigation with arrow keys and history. Additionally, a UI bug causes double scroll bars on window resize, affecting user experience.
  • issues/17741, issues/17798, issues/17807, issues/17819, issues/17828, issues/17829
  • Model-Specific Output and Tag Handling Issues: Some models produce incorrect outputs such as infinite question marks or empty strings, and the <think> tag is not properly handled in certain GGUF model variants, causing thinking content to be mixed with the final output (a minimal tag-splitting sketch appears after this list). These issues affect model correctness and UI presentation.
  • issues/17743, issues/17815, issues/17832
  • Codebase Refactoring and Compilation Errors: Proposals include deprecating old DIAG_MASK_* operators in favor of TRI operator usage, and reports of compile errors due to undefined functions and regex optimization flags causing platform-specific build failures. These highlight maintenance and portability challenges in the codebase.
  • issues/17654, issues/17771, issues/17830
  • Model Management and Server Mode Improvements: A feature request suggests adding an --auto-unload flag to automatically unload models in router mode to simplify model switching, improving usability for WebUI and API users managing multiple models.
  • issues/17833
  • Miscellaneous Feature Requests and Use Cases: Requests include enabling host-memory prompt caching for mtmd, using diffusion models for subtitle translation improvements, and adding arrow key support for console input, reflecting diverse user needs for enhanced functionality.
  • issues/17820, issues/17821, issues/17828
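
For the <think> handling problem mentioned in the model-output bullet above, the snippet below is a purely illustrative, standalone sketch of separating reasoning content from the final answer; it is not the parsing logic llama.cpp or its WebUI actually use.

```cpp
// Illustrative only: split "<think>...</think>" reasoning from the final output.
#include <iostream>
#include <string>
#include <utility>

static std::pair<std::string, std::string> split_thinking(const std::string & text) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const size_t b = text.find(open);
    const size_t e = text.find(close, b == std::string::npos ? 0 : b);
    if (b == std::string::npos || e == std::string::npos) {
        return {"", text}; // no complete thinking block: everything is the answer
    }
    std::string reasoning = text.substr(b + open.size(), e - (b + open.size()));
    std::string answer    = text.substr(e + close.size());
    return {reasoning, answer};
}

int main() {
    const auto [reasoning, answer] = split_thinking("<think>draft a greeting</think>Hello!");
    std::cout << "reasoning: " << reasoning << "\nanswer:    " << answer << '\n';
    return 0;
}
```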

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 25

Summarized Issues:

  • Model metadata and conversion issues: Several issues highlight problems with model metadata handling and conversion tools. The llama_model_meta_val_str function currently skips array-typed metadata such as 'general.tags', causing retrieval failures (a usage sketch of this API appears after this list), while the --mistral-format conversion option fails due to compatibility issues with the latest mistral-common and pending official transformers support for Mistral 3 Large model conversion.
    • issues/17608, issues/17691, issues/17705
  • CUDA and ROCm backend bugs: Multiple issues report bugs in GPU backends affecting model execution and compilation. The CUDA backend causes incorrect outputs for the IBM Granite 4.0-h-1B model with long contexts due to floating-point precision and Flash Attention kernel problems, fixed by a patch, while ROCm backend builds fail to find libmtmd.so.0 and produce gibberish output for Qwen models, indicating unresolved runtime and compilation errors.
    • issues/17610, issues/17777, issues/17797
  • Server and API response header issues: The llama-server exhibits multiple HTTP header problems in multi-model mode. Duplicate HTTP response headers on /v1/chat/completions cause proxy errors, and conflicting Transfer-Encoding: chunked and Content-Length headers break reverse proxies like Nginx, leading to connection failures and proxy incompatibility.
    • issues/17693, issues/17710
  • Memory and quantization failures: Memory allocation and quantization processes fail under certain conditions. Using --no-warmup causes memory allocation failure for vision models on first request, and quantization crashes occur for IBM Granite 4-h 1B and DeepSeek-R1-Distill-Qwen-14B models due to assertion failures and NaN tensor values, respectively, preventing successful model preparation.
    • issues/17676, issues/17677, issues/17787
  • Build and compilation errors: Building llama.cpp with CUDA on Linux triggers symbol redefinition errors due to conflicts with system headers, causing build failures. Additionally, the ROCm-targeted Docker image build results in missing shared libraries at runtime, indicating incomplete or faulty build processes.
    • issues/17678, issues/17777
  • Server crashes and input handling bugs: The llama-server crashes with segmentation faults when processing prompts with large repeated characters due to catastrophic regex backtracking, while image input fails for the Lingshu-32B-q8_0.gguf model because the required vision-specific project file is missing, limiting functionality.
    • issues/17636, issues/17720
  • Web UI and CLI usability issues: The new WebUI incorrectly shows the model name as "gpt-3.5-turbo" in single-model mode and displays a fixed total context size of 4096 instead of the actual size in multi-model mode. The CLI tool requires the --model argument even when displaying help, which is a usability bug.
    • issues/17666, issues/17723, issues/17754
  • Server performance and architecture improvements: Refactoring is needed to move server_context::update_chat_msg execution to the HTTP thread to improve text generation performance, as the current implementation negatively impacts efficiency by calling expensive diff computations on the wrong thread.
    • issues/17726
  • Model management testing enhancements: New targeted test cases were added to test_router.py to improve model management testing, covering unloading models, enforcing maximum loaded models with eviction, disabling auto-loading, and requiring API key authentication, while keeping tests efficient for CI.
    • issues/17704
  • Filesystem and archive extraction problems: Extraction of provided archives on Linux Mint dumps files into the current directory due to a top-level directory named ".", causing clutter. Additionally, directory iteration can get stuck on symlinks, requiring fixes to handle such cases properly.
    • issues/17757, issues/17769
  • Feature requests for UI and tooling: Users requested a stop icon button to replace the loading indicator on active conversation items in the sidebar and the ability to specify the temporary file location for the convert_hf_to_gguf.py script to avoid disk space issues.
    • issues/17805, issues/17770
  • Release notes formatting improvement: A request was made to change the new releases format to show change details at the top by default and hide links initially, improving accessibility and reducing redundancy when viewing release information.
    • issues/17717
  • Server error with Codex CLI tool: Using the gpt-oss-20b model with Codex CLI causes a server HTTP 500 error when a tool role message is received without a preceding assistant message containing a tool call, indicating a likely bug in the Codex client rather than the llama.cpp backend.
    • issues/17702
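
Since the first bullet in this group concerns llama_model_meta_val_str, here is a brief usage sketch of that C API call. The call signature matches the public llama.h; the surrounding program structure is illustrative, and, per the issue, array-typed keys such as "general.tags" are currently not returned by this string accessor.

```cpp
// Sketch: reading string-typed GGUF metadata via the public C API.
#include "llama.h"
#include <cstdio>

static void print_meta(const llama_model * model, const char * key) {
    char buf[512];
    // Returns the value length on success, or a negative value when the key is
    // missing -- and, as the issue notes, array-typed metadata is skipped too.
    const int32_t n = llama_model_meta_val_str(model, key, buf, sizeof(buf));
    if (n < 0) {
        std::printf("%-24s <not available as a string>\n", key);
    } else {
        std::printf("%-24s %s\n", key, buf);
    }
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();
    llama_model * model = llama_model_load_from_file(argv[1], llama_model_default_params());
    if (model == nullptr) return 1;

    print_meta(model, "general.name");
    print_meta(model, "general.architecture");
    print_meta(model, "general.tags"); // array-typed: currently skipped

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```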

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 48

Key Open Pull Requests

1. Add Support for Microsoft Phi-3.5 Vision Instruct Models: This pull request adds support for Microsoft’s Phi-3.5 Vision Instruct models to the llama.cpp project, enabling multimodal prompts that combine image and text inputs by updating the GGUF conversion script, implementing a custom high-definition image preprocessing pipeline, creating a new graph build path with specialized layers, and adding runtime logic to integrate image features with the language model.

  • URL: pull/17687
  • Merged: No
  • Associated Commits: a8ad5, 51eff, 0fcca, be9f1, e7677, a49cb, 3c5c7, d9954, 8ae6f, b4fe0, 2cd62, b8b69, e0f63, e1b7b, 689a6, 6c3fb, 01208, 88c32, 80bea, 7a0b1, 83f43, 36a3f, bf33a, 1b035, adf26, a4d1f, 5d0d7, 32973, 68167, eea11, 40845, 30108, c7ade, ced2f, 13eca, ca3ca, 88821, 3822d, 4fc34

2. llama-router, the C++ "llama-swap" for llama.cpp: This pull request introduces "llama-router," a C++ multi-model serving router for llama.cpp that auto-spawns and manages llama-server instances with dynamic backends, supports advanced per-model configuration, seamless HF Hub integration, real-time streaming via SSE, flexible VRAM management, backward compatibility with existing clients, comprehensive admin endpoints, and embedded WebUI support, all designed to provide a plug-and-play, zero-configuration experience while enabling advanced users to customize and extend functionality.

  • URL: pull/17629
  • Merged: No
  • Associated Commits: 07b50, 25f14, 4cbed, cb7c4, dbf32, 70eec, 0f090, dac95, ee94b, e4723, 728bc, cbcc8, 635b7, 23279, 7f274, bfb3e, 4bc8f, b14ea, c5fdd, cb44f, 85f41, 41f50, da65c, 919e5, b2488, 6e933, 47408, 1a014, d99d9

3. cuda: optimize SOLVE_TRI using registers and FMAF: This pull request optimizes the SOLVE_TRI CUDA kernel used in qwen3-next models by switching from shared memory to registers to eliminate bank conflicts and double occupancy, splitting the reduction loop to remove conditional branches, replacing division with inverse multiplication and fused multiply-add instructions for better pipelining, and cleaning up unused definitions, resulting in significant performance improvements in duration and throughput on RTX 4070 Ti and RTX 2070 GPUs. A scalar illustration of the reciprocal and fused multiply-add changes appears after this entry's commit list.

  • URL: pull/17703
  • Merged: No
  • Associated Commits: 9e398, 4c9c6, b1488, 68881, a2983, 642e8, c55b5, 2fd92, b27ce, 12d10, ec9b6, a34a4, 4a637
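
The arithmetic changes described in this pull request (one reciprocal per row instead of repeated division, and fused multiply-add accumulation) are easier to see outside CUDA. The following is a scalar, CPU-only illustration of the same idea applied to a plain lower-triangular solve; it is not the kernel itself.

```cpp
// Scalar illustration of reciprocal + FMA accumulation in a triangular solve.
#include <cmath>
#include <cstdio>
#include <vector>

// Forward substitution for L * x = b, with L lower triangular.
static std::vector<float> solve_lower_triangular(const std::vector<std::vector<float>> & L,
                                                 const std::vector<float> & b) {
    const size_t n = b.size();
    std::vector<float> x(n, 0.0f);
    for (size_t i = 0; i < n; ++i) {
        const float inv_diag = 1.0f / L[i][i];   // one reciprocal per row ...
        float acc = b[i];
        for (size_t j = 0; j < i; ++j) {
            acc = std::fma(-L[i][j], x[j], acc); // ... and FMA in the inner loop
        }
        x[i] = acc * inv_diag;                   // multiply instead of divide
    }
    return x;
}

int main() {
    const std::vector<std::vector<float>> L = {{2, 0, 0}, {1, 3, 0}, {4, 5, 6}};
    const std::vector<float> b = {2, 5, 32};
    const std::vector<float> x = solve_lower_triangular(L, b);
    std::printf("x = %.2f %.2f %.2f\n", x[0], x[1], x[2]); // expected: 1.00 1.33 3.56
    return 0;
}
```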

Other Open Pull Requests

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 86

Key Closed Pull Requests

1. QVAC-8256: Prepare for MacOS dynamic backends #66: This pull request addresses the __ARM_FEATURE_SVE flag being enabled incorrectly on macOS during CI dynamic backend builds by adding additional guards to prevent this, and also updates common parsing functions with an option that avoids loading all backends multiple times, reducing duplication when they are called from multiple model instances. An illustrative guard sketch appears after these key pull requests.

  • URL: pull/17662
  • Merged: No
  • Associated Commits: a3aba, 70359, 9cfd7, 3b7c4, 88782, 4e55d, 62cd6, 84e94, b1795, 03b0c, 55557, 80e1f, dc3fd, e4aae, 73529, 257a4, 9c446, 23fd1, 3ecb1, 66522, ae638, 7b780, f9575, 44f71, b8da4, bb0bd, 00ac3, 14c66, 6339e, 38ab5, c2c74, 6755f, 69f55, 8ced3, 9326f, 8cbbb, 25cad, d8d18, 8bba3, 1536f, a0186, f1adc, 440de, 8d245, f9f10, b2664, e1e62, 85fbe, 73020, aba94, fec6b, 1296c, f6ccd, 06bed, 9b6d2, 2b785, affe5, 898df, 1e727, 291d0, aaa80, 7c23b, d7776, 48ce8, 1a335, f4df3, 86e2b, ad73b, 8a342, bcdd5, 9324c, 006b1, 90f85, 7577c, d1c58, 9c1a5

2. model: support Ministral3: This pull request adds support for the Ministral3 model format to the ggml-org/llama.cpp project, including conversion scripts and architecture improvements, with collaboration from Mistral and performance testing on the 14B-Instruct variant.

  • URL: pull/17644
  • Merged: Yes
  • Associated Commits: 3e41c, 2b2f4, 4cebf, 84be0, 786b3, 55a19, a4f54, bf08f, 34234, b185b, 56003

3. convert: support Mistral 3 Large MoE: This pull request adds support for converting the Mistral 3 Large Mixture of Experts (MoE) model to the GGUF format in the llama.cpp project, enabling usage with the --mistral-format argument despite the current lack of transformers compatibility and incomplete C++ scaling code, and includes various fixes and improvements contributed through multiple commits.

  • URL: pull/17730
  • Merged: Yes
  • Associated Commits: 3e623, 1a308, 08e0a, 249ed, aebab, 646e4, ab247, 49c4e, 2f8c2, 4955d, 15f78
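
The guard described in the first key pull request above is, conceptually, a compile-time gate; the snippet below sketches that pattern in isolation. It is not the project's actual change, only an illustration of preventing an SVE code path from being selected on Apple builds.

```cpp
// Illustrative guard only: claim SVE support when the compiler defines
// __ARM_FEATURE_SVE and the target is not an Apple platform.
#include <cstdio>

#if defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)
    #define SKETCH_HAS_SVE 1
#else
    #define SKETCH_HAS_SVE 0
#endif

int main() {
    std::printf("SVE path enabled: %s\n", SKETCH_HAS_SVE ? "yes" : "no");
    return 0;
}
```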

Other Closed Pull Requests

  • Server multiple completions support: This pull request implements support in the server for generating multiple completions from a single prompt using the OpenAI "n" option by managing parent and child tasks with synchronized slot states to ensure correct processing and output formatting. It also addresses related issues such as invalid input batches and context shifting.
    pull/17775
  • HTTP thread message state tracking: This pull request moves the message differences tracking functionality to the HTTP thread by delegating state tracking to it and creating a new task_result_state for each generation session. The state object is updated with each new result to manage partial returns and session state consistently.
    pull/17740
  • Filesystem error handling improvement: This pull request improves robustness by replacing the use of std::filesystem::exists without error handling with the overload that accepts a std::error_code parameter, illustrated in the sketch after this list. This prevents exceptions caused by inaccessible paths and enhances error management during file existence checks.
    pull/17653
  • MacOS Metal residency heartbeat: This pull request implements a background heartbeat thread that periodically calls Metal's residency set requestResidency() to keep memory buffers wired on MacOS. It addresses recent OS changes causing memory unwiring after idling by attaching residency sets to the Metal queue and allowing configurable keep-alive timing via an environment variable.
    pull/17766
  • Executable path resolution fix: This pull request addresses an issue where the first argument from the parent process may not correctly reflect the binary path by explicitly resolving and setting the executable path when creating a new server instance. This ensures reliable execution.
    pull/17669
  • WebUI OpenAI-compatible model listing: This pull request updates the WebUI to use the OpenAI-compatible /v1/models endpoint by default for listing models. It also includes small refactor improvements related to data fetching via services and stores.
    pull/17689
  • convert_hf_to_gguf script fixes and enhancements: These pull requests fix the convert_hf_to_gguf script to correctly map the updated output format of mistral-common's valid_tokenizer_files function, preventing crashes with the --mistral-format option, and add a configurable --temp-dir option that creates the temporary directory if it does not exist. They also ensure the conversion process uses models' existing local chat templates instead of defaulting to a generic template, supporting new Instruct variants.
    pull/17712, pull/17774, pull/17749
  • Model router tests and utilities: This pull request adds four detailed test cases for the llama-server model router covering explicit model unloading, LRU eviction behavior, disabling automatic model autoloading, and API key authentication. It also introduces related utility enhancements and fixes to support asynchronous model loading and error handling.
    pull/17722
  • WebGPU backend improvements: This pull request enhances the WebGPU backend by adding basic support for most unary operators, refactoring backend code to use dynamic pipeline storage with std::map, improving shader generation for unary operations, introducing helper functions for common calculations, fixing parameter passing issues for the XIELU operation, and updating operator documentation.
    pull/17764
  • Model name handling and display update: This pull request removes the default "gpt-3.5-turbo" model name from the server, changes model name resolution to prioritize user-set aliases or registry format before falling back to file names, and stops reflecting the input request's "model" field back in the response. These changes improve model name handling and display in the web UI.
    pull/17668
  • Logging verbosity enhancements: This pull request improves log verbosity level definitions by introducing a more granular hierarchy of logging levels ranging from output data to debug information. This enhances the usefulness of logs in CLI and server-based applications.
    pull/17630
  • Vulkan top_k shader bug fix and optimization: This pull request fixes a bug in the Vulkan top_k shader that caused incorrect output when there were ties in input values, reduces temporary memory usage for the top_k operation, and updates tests to properly verify this edge case.
    pull/17659
  • Additional file types support in text section: This pull request adds support for additional file types to the text section of the llama.cpp project, specifically including source files from the codebase, addressing issue #17556.
    pull/17670
  • ZenDNN backend addition: This pull request adds a new ZenDNN backend enabling accelerated inference on AMD EPYC CPUs by implementing optimized matrix multiplication primitives with FP32 and BF16 support. It integrates build system options, provides documentation, and demonstrates significant performance improvements across various models and batch sizes.
    pull/17690
  • Operations documentation and CSV regeneration: This pull request updates the operations documentation (ops.md) and regenerates the Metal and BLAS CSV files based on running the suggested scripts to reflect the latest project changes.
    pull/17768
  • Vulkan memory allocation priority setting: This pull request sets all Vulkan memory allocations to high priority with an option to gate this behavior via an environment variable. It aims to address issue #17605 though its effectiveness is uncertain.
    pull/17624
  • AI-generated code disclosure update: This pull request updates contributing guidelines to require contributors and collaborators to disclose the use of AI-generated code in pull requests and security reports, excluding trivial tab auto-completions.
    pull/17625
  • Vulkan validation extension replacement: This pull request replaces the deprecated Vulkan extension VK_EXT_validation_features with VK_EXT_layer_settings to reduce validation warnings and facilitate easier debugging of shaders using the Debug Printf feature.
    pull/17637
  • Code clarity refactor in ggml-cpu: This pull request refactors ggml-cpu code by removing duplicate conditional checks involving the variable 'iid' to improve code clarity and maintainability.
    pull/17650
  • mtmd_context_params warmup option: This pull request introduces a warmup option to mtmd_context_params allowing users to disable automatic warmup for different image sizes via the existing --no-warmup CLI argument. This enables manual warmup with a custom-sized image immediately after mtmd initialization.
    pull/17652
  • Server error message and store refactor: This pull request adds missing context information to server error messages and includes a small DRY refactor for the chat.svelte.ts store to improve code maintainability.
    pull/17663
  • Vulkan performance logger improvements: This pull request improves the Vulkan performance logger by moving it from the device to the context, adding an environment variable to control stats dumping frequency, introducing a fusion info string for fused operations, and fixing FLOPS calculation for the MUL_MAT_ID operation.
    pull/17672
  • Server --media-path option: This pull request introduces a new server option --media-path that allows the use of relative paths for accessing local media files, enabling requests to reference files like images via a simplified file:// URI scheme.
    pull/17697
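
The std::filesystem change referenced in the "Filesystem error handling improvement" bullet comes down to choosing the non-throwing overload of std::filesystem::exists. The standalone sketch below contrasts the two; it mirrors the idea of the pull request, not its exact code.

```cpp
// Contrast the throwing and non-throwing overloads of std::filesystem::exists.
#include <filesystem>
#include <iostream>
#include <system_error>

int main() {
    const std::filesystem::path p = "/some/possibly-inaccessible/path";

    // Throwing overload (what the code used before): may raise
    // std::filesystem::filesystem_error if the path's status cannot be read.
    //   bool ok = std::filesystem::exists(p);

    // Non-throwing overload: failures are reported via std::error_code instead.
    std::error_code ec;
    const bool ok = std::filesystem::exists(p, ec);
    if (ec) {
        std::cerr << "cannot check " << p << ": " << ec.message() << '\n';
    } else {
        std::cout << p << (ok ? " exists" : " does not exist") << '\n';
    }
    return 0;
}
```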

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
hanishkvc            359              3       0         0
ngxson               210             23      10        94
CISC                  18              5       0       113
pwilkin               62              7       5        60
jeffbolznv            56             13       0        63
ggerganov             79             14       2        32
allozaur              78              8       7         5
wsbagnsv1             62              2       0        27
aldehir               68              3       0        17
ServeurpersoCom       57              8       0        23
