Weekly GitHub Report for PyTorch: September 01, 2025 - September 08, 2025 (12:05:46)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including `torch.compile` support for Python 3.13, a new performance control API `torch.compiler.set_stance`, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of `torch.load` to `weights_only=True`, alongside numerous bug fixes, performance optimizations, and deprecations such as the discontinuation of official Conda packages.
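For readers updating older loading code, here is a minimal sketch of the flipped `torch.load` default described above; the checkpoint paths and model are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
torch.save(model.state_dict(), "linear.pt")  # hypothetical checkpoint path

# PyTorch 2.6 flips the default to weights_only=True, which restricts
# unpickling to tensors and other allowlisted types.
state = torch.load("linear.pt", weights_only=True)
model.load_state_dict(state)

# Loading arbitrary pickled objects now requires opting out explicitly,
# which should only be done for checkpoints from trusted sources.
# state = torch.load("legacy.pt", weights_only=False)
```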
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- torch compile not work properly with selective activation checkpoint: This issue reports a bug where using selective activation checkpointing (SAC) prevents torch.compile from working properly, as it causes the compiler to skip generating optimized Triton kernels. The problem arises because SAC is implemented as a TorchDispatchMode that interferes with the tracing process during compilation, and users are discussing the challenges and potential workarounds for combining SAC with torch.compile effectively.
- The comments include a minimal reproducible example demonstrating the issue, explanations that SAC(compile(fn)) is unsupported due to TorchDispatchMode skipping frames during tracing, and suggestions to use the wrapping order compile(SAC(fn)) instead. Users discuss the infeasibility of this wrapping order in training scenarios, request future support for SAC with compile, and provide clarifications and fixes for errors in the repro code (a sketch of the recommended ordering appears after this list).
- Number of comments this week: 7
- ZeroBubble and DualPipeV pipeline parallel schedules fail with torch.compiled model: This issue reports a bug where the ZeroBubble and DualPipeV pipeline parallel schedules fail to work correctly with models compiled using `torch.compile`, resulting in a runtime error related to legacy autograd function access patterns. The user provides a minimal reproducible example and notes that other schedules like 1F1B work fine with compiled models, seeking guidance or a workaround for this incompatibility.
- The comments clarify that the error occurs specifically with the `aot_eager` or lower compile backend and that ZeroBubble schedules are inherently incompatible with `torch.compile` because the backward pass fusion prevents splitting input and weight gradients. A suggested workaround is to compile entire pipeline stages rather than individual blocks, though this may not be optimal in all cases. The discussion also highlights the need for clearer error messages about this incompatibility and ongoing efforts to extend support to DualPipeV schedules.
- Number of comments this week: 6
- Torch.compile does not guard on requires_grad_ state - uses stale compilation: This issue reports that `torch.compile` does not recompile or invalidate its cached compilation when the `requires_grad` attribute of a parameter changes, leading to stale compilations where gradients are unexpectedly not computed. The user provides a minimal reproducible example demonstrating that after freezing a parameter and compiling the model, unfreezing the parameter does not trigger recompilation, resulting in `grad=None` even though gradients are expected.
- The comments discuss the problem as a guard evaluation issue where the correct guard is generated but recompilation does not occur. It is suggested that enabling `torch._dynamo.config.wrap_top_frame = True` may resolve the issue, which is confirmed by one user, though another notes that this configuration option is missing in an earlier PyTorch version, indicating a potential version-specific workaround or limitation (a sketch of the reported pattern and workaround appears after this list).
- Number of comments this week: 5
- inductor's process pool seems to time out while cleaning up after a cold start: This issue describes a timeout problem occurring in the inductor's process pool during cleanup after a cold start on an H100 development GPU, where the process pool waits for 5 minutes before timing out. The user observes that the number of spawned processes is capped at 32 despite a high CPU affinity count, and the timeout does not occur on warm starts or when limiting compile threads to one.
- The comments reveal that others cannot reproduce the timeout issue on different hardware setups, including A100 GPUs and even the same devserver, despite attempts with cache clearing and environment variable changes. The original reporter notes the problem is specific to cold starts and does not happen on subsequent runs, suggesting a potential environment or configuration-specific bug rather than a widespread issue.
- Number of comments this week: 5
- torch.ScriptObjectProperty.name.deleter occurs Segmentation fault (core dumped): This issue reports a segmentation fault occurring when using the deleter method on a torch.ScriptObjectProperty in PyTorch version 2.5.1. The user provides a minimal reproducible example that triggers a core dump on Ubuntu 22.04.5 LTS with CUDA 12.4, highlighting a critical bug in the property deleter implementation for ScriptObjects.
- The comments reveal multiple users expressing interest in investigating the issue, with one user confirming active work on a fix and promising an upcoming pull request; another user reports a related error in PyTorch 2.6.0+cu124 involving an invalid memory allocation abort.
- Number of comments this week: 4
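As a rough illustration of the compile(SAC(fn)) ordering recommended in the selective-checkpointing item above, here is a minimal sketch using plain activation checkpointing; the module and shapes are made up, and the policy-based selective variant would wrap the checkpointed call in the same position.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

def forward_with_ac(x):
    # Checkpointing is applied inside the region handed to torch.compile,
    # i.e. compile(AC(fn)) rather than AC(compile(fn)).
    return checkpoint(block, x, use_reentrant=False)

compiled = torch.compile(forward_with_ac)
out = compiled(torch.randn(8, 64, requires_grad=True))
out.sum().backward()
```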
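For the requires_grad item above, this is a hedged sketch of the reported freeze-then-unfreeze pattern together with the suggested `wrap_top_frame` workaround; the flag is version-dependent, so the assignment is guarded.

```python
import torch
import torch.nn as nn

# Suggested workaround from the discussion; the option may be absent in
# older releases, so only set it if it exists.
if hasattr(torch._dynamo.config, "wrap_top_frame"):
    torch._dynamo.config.wrap_top_frame = True

model = nn.Linear(4, 4)
model.weight.requires_grad_(False)        # freeze, then compile
compiled = torch.compile(model)
compiled(torch.randn(2, 4)).sum().backward()

model.weight.requires_grad_(True)         # unfreeze: a recompilation is expected
compiled(torch.randn(2, 4)).sum().backward()
print(model.weight.grad)                  # None would indicate a stale compilation
```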
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch and CUDA versions, and demonstrates that the error occurs while compiling specific pipeline components with torch.compile, indicating a potential compatibility or packaging problem with the Triton compiler integration.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation (a quick verification sketch appears after this list).
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at `/tmp` having permissions set to `1777`. Although the model compiles successfully, execution fails with an error indicating that the shared object `cuda_utils.so` cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions (a possible mitigation is sketched after this list).
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting approximately 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, handling known formatting-related problems, and organizing the work by directory to facilitate incremental and reviewable formatting updates.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function that allows users to exclude debug files, specifically `.debug_pkl` files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are not necessary for model execution, can occupy a significant portion of the archive, especially in small or quantized models, and removing them manually does not affect model correctness.
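The MaxPool2D proposal above relies on a standard identity: a stride-1 max pool with a 5x5 window equals two stacked 3x3 stride-1 pools, because max pooling composes like a morphological dilation. A quick check, with arbitrarily chosen shapes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# Two 3x3 stride-1 windows cover the same 5x5 neighborhood, and MaxPool2d's
# implicit -inf padding keeps the boundary behavior identical.
print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```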
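For the `cuda_utils.so` item above, the symptom matches a compilation cache placed on a filesystem that cannot execute shared objects. A possible mitigation, assuming that diagnosis, is to point Inductor's cache at an exec-mounted location via the documented `TORCHINDUCTOR_CACHE_DIR` variable; the path below is a placeholder.

```python
import os

# Set before the first compilation so Inductor writes its generated shared
# objects (including cuda_utils.so) to an executable mount.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/workspace/inductor_cache"  # placeholder path

import torch

fn = torch.compile(lambda t: t * 2)
print(fn(torch.arange(4)))
```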
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 84
Summarized Issues:
- Segmentation Faults and Crashes in Convolution and Tensor Operations: Multiple issues report segmentation faults and crashes related to convolution operations with extremely large padding or stride values, in-place operations on tensors, and unsupported method calls on specialized tensor types. These problems occur across CPU and Linux systems and highlight the need for better input validation and error handling to prevent memory corruption and invalid pointer dereferences.
- torch.compile Compatibility and Numerical Correctness Issues: Several issues describe problems with `torch.compile`, including incompatibility with selective activation checkpointing, pipeline parallel schedules, and the MPS backend, as well as inconsistent numerical results and failure to detect parameter state changes. These highlight limitations and bugs in the current compilation and optimization infrastructure affecting model correctness and runtime behavior.
- Distributed and Parallel Backend Bugs and Documentation Errors: Issues report incorrect behavior and crashes in distributed training setups, including incorrect gradient reductions with GLOO backend, silent incorrect behavior with FakeProcessGroup, and misleading documentation about backend capabilities. These problems affect distributed model training correctness and user understanding of supported features.
- Memory and Performance Regressions in CPU and CUDA Backends: Reports include performance regressions in the CPU inductor backend, inefficient storage I/O during FSDP1 model saving, and performance degradation due to thread yielding in NCCL 2.27.7. These issues indicate regressions and bottlenecks that impact training and inference speed.
- MPS Backend Bugs and Feature Requests: Several issues concern the Apple MPS backend, including segmentation faults, runtime errors during backward passes, missing native_dropout and embedding_bag implementations, and tensor contiguity bugs causing crashes. These highlight incomplete support and stability problems on MPS devices.
- Symbolic and Export-Related Errors in torch.export and torch.jit: Issues describe errors during model export and tracing, including internal attribute errors from invalid dynamic shapes, incorrect output shapes due to caching, and type errors when registering CUDA parameters. These problems affect model serialization and deployment workflows.
- issues/161902, issues/161935, [issues/162279](https://github.com/issues/162279)
- Incorrect or Confusing Behavior in Tensor Indexing and Operations: Reports include inconsistent behavior combining boolean and integer indexing, errors with nested jagged tensors using min/max functions, and inconsistent handling of negative zero values across CPU and CUDA. These inconsistencies cause confusion and potential bugs in tensor computations.
- Bugs and Limitations in Inductor Backend and Compilation Pipeline: Multiple issues report hangs, incorrect results, missing operation support, and excessive re-recording in the Inductor backend, affecting stability and correctness of compiled models. These include problems with process pool shutdown, unsupported ops.store in ops.masked, and silent inefficiencies.
- issues/162135, issues/162146, issues/162151, issues/162198, issues/162199, [issues/162299](https://github.com/issues/162299)
- Documentation and Naming Errors: Some issues point out incorrect or confusing documentation and naming, such as wrong link text in the PyTorch Basics Wiki and inaccurate claims about backend support in distributed documentation. These require corrections to improve user guidance.
- Build and Installation Failures: Issues include build failures due to missing symbols in CUDA 13.0 libtorch builds, import errors after editable installs, and runtime crashes on Windows with CUDA 13.0 nightly builds. These affect developer and user ability to build and run PyTorch correctly.
- Runtime Errors and Exceptions in Specific Operations: Reports include floating point exceptions in PixelShuffle with complex tensors, heap-buffer-overflow in max_unpool1d, and Z3 exceptions in Dynamo due to symbolic boolean casting. These bugs cause crashes or compilation failures in specific scenarios.
- Requests for New Features and Refactoring: Proposals include adding LazyGroupNorm, decoupling CUDA code for modularity, introducing XPUGraph for XPU devices, and expanding DTensor dtype coverage. These aim to improve PyTorch's modularity, hardware support, and feature set.
- Model Export and Quantization Issues: Problems include UnpicklingError due to weights_only=True default, QAT internal pattern modifications degrading accuracy, and ONNX export failures due to enable_gqa flag in scaled_dot_product_attention. These affect model portability and quantization workflows.
- Test Failures and CI Issues: Several issues report disabled or failing tests on ROCm platforms, test skipping due to decorator bugs, and CI job result overwrites between architectures, impacting test coverage and reliability.
- Distributed and Parallel Runtime Errors: Issues include runtime errors in DDP models compiled with torch.compile due to alias reconstruction, and argument order mismatches in all_gather_into_tensor_coalesced causing runtime errors. These affect distributed training stability.
- Miscellaneous Bugs and Questions: Other issues cover silent correctness problems with vllm inductor, questions about parameter defaults, and requests for pin_memory support in DTensor. These represent smaller but relevant concerns in PyTorch usage and development.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 72
Summarized Issues:
- Test Flakiness and Disabling on ROCm and XPU Platforms: Several tests on ROCm and XPU platforms are consistently failing or flaky, leading to their disabling to maintain CI stability. These failures include SIGIOT errors on ROCm for sparse tests and multiple failing tests on XPU such as grid sampler, broadcast, autotune, and autograd inductor guard tests, which hinder unit test parity and reliable validation.
- [issues/160784, issues/160785, issues/160786, issues/160946, issues/160947, issues/160948, issues/161162, issues/161384, issues/161483, issues/161655, issues/161697, issues/161931, issues/162036, issues/162039, issues/162048, issues/162128, issues/162139, issues/162140, issues/162141, issues/162175]
- PyTorch Compiler and Graph Export Issues: Multiple issues arise from PyTorch's compilation and export mechanisms, including empty FX graph generation causing unnecessary compilation, removal of synchronization calls in aot_eager mode leading to missed exceptions, inconsistent results with `torch.nn.functional.interpolate` under `torch.compile`, and errors exporting models with `torch.export` or ONNX using dynamo. These problems affect correctness and stability of compiled models and exports.
- [issues/160437, issues/160751, issues/160840, issues/161080, issues/161864, issues/162061, issues/161906]
- XPU Platform Build and Runtime Failures: The XPU platform experiences build and runtime issues such as permission denied errors during header installation, wheel naming errors for aarch64 architecture, and missing or invalid platform specifications causing test failures or CI disruptions. These issues complicate development and deployment on XPU hardware.
- [issues/161498, issues/162255, issues/161384]
- Numerical and Backend Computation Discrepancies: Several numerical operations show inconsistent or incorrect results on GPU or specific backends, including unstable gradients from `logsumexp` with `-inf` values, incorrect negation of uint8 tensors in the Inductor backend, numerical instability in `torch.inverse` and `torch.pinverse` on GPU, and incorrect results from `torch.linalg.tensorinv` on CUDA devices. These discrepancies impact numerical reliability across devices.
- [issues/161638, issues/161763, issues/162064, issues/162065, issues/162302]
- Memory and Resource Management Bugs: Memory leaks and out-of-memory errors occur in scenarios such as activation checkpointing with custom autograd Functions and large stride indexing on ROCm MI200 GPUs. These issues cause resource exhaustion and failures during training or testing.
- [issues/161186, issues/161655]
- Documentation and Link Errors: The PyTorch documentation and website contain broken or malformed links, missing examples for APIs like `torch.is_complex` and `torch.full_like`, and typos such as "overrideable" instead of "overridable," which reduce usability and clarity for users.
- [issues/161375, issues/161859, issues/161899, issues/162054, issues/161985, issues/161997]
- CI and Continuous Integration Failures: CI failures arise from force merging without validation, network timeouts affecting ROCm MI2xx workflows, and attempts to run removed test files, causing unstable or red CI signals and requiring reverts or fixes.
- [issues/161632, issues/161784, issues/162274]
- CUDA and GPU Architecture Support Issues: Newer GPU architectures like Nvidia RTX 5090 (sm_120) and RTX 5060 Ti (sm_120) face build warnings or crashes due to unsupported or misconfigured CUDA kernel parameters and missing precompiled kernels, limiting hardware compatibility.
- [issues/161376, issues/162196]
- PyTorch API and Feature Requests: Users request new features such as a `torch.dtype.kind` attribute for dtype identification, addition of numerical algorithm packages, and improved exception types for clarity, reflecting ongoing efforts to enhance PyTorch's usability and extensibility.
- [issues/161623, issues/161774, issues/161921]
- Performance Regressions and Optimization Issues: Performance regressions are reported in operations like toDLPack conversion and CPU multithreading usage, where expected speedups are not realized due to internal bottlenecks or overhead increases.
- [issues/162113, issues/161948]
- Pattern Matcher and Graph Compilation Bugs: The pattern matcher can produce topological ordering errors by using nodes before definition when replacing multi-output patterns, requiring workarounds to maintain correct graph compilation order.
- [issues/162019]
- Operator and Backend Implementation Bugs: Specific operators such as `rotary_embedding` fail on CPU-only setups due to missing registrations, and the Inductor backend's use of fast math in `exp` causes numerical discrepancies, indicating backend implementation issues.
- [issues/161735, issues/161944]
- Model Loading and Compilation Errors: Loading traced TorchScript models in C++ can fail due to missing headers, and compilation errors occur when undefined functions like `cuda_capability_geq` are referenced in generated Triton code, causing runtime failures.
- [issues/162156, issues/161868]
- License and Repository Metadata Issues: The PyTorch repository README and LICENSE files show inconsistencies regarding BSD licensing, which may affect clarity for users and AI parsing tools.
- [issues/162074]
- ONNX Export Serialization Bugs: ONNX export incorrectly serializes attributes such as `allowzero` in reshape operations as booleans instead of integers, breaking downstream workflows and causing serialization errors.
- [issues/161941]
- Segmentation Faults and Runtime Crashes: Segmentation faults occur in modules like `torch.nn.MaxUnpool2d` under specific CUDA versions and hardware, leading to crashes during test execution.
- [issues/161888]
- Miscellaneous Bugs and Requests: Other issues include fixing duplicate imports, adding examples to documentation, and addressing static image output bugs on specific GPU architectures during ComfyUI compilation.
- [issues/161684, issues/161899, issues/161861]
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 200
Key Open Pull Requests
1. Add LazyLayerNorm: This pull request introduces the LazyLayerNorm module to PyTorch, implementing a layer normalization layer with lazy parameter initialization that defers weight and bias creation until the input shape is known, includes validation for dimension parameters, and adds comprehensive tests to ensure correctness and compatibility with existing LayerNorm functionality (a sketch of the lazy-initialization pattern appears after this list).
- URL: pull/162339
- Merged: No
- Associated Commits: 60372, 04c36, f510d, b9f97, d54cd, 47fd3, ec5c2, 1db83, 9f053, d9529, aee53, aeed9, be579, f2d8c, ce786, 396a7, c7c4a, be997, 71b50, 8d144, 848cc, 682d0, 5fd6f, 12b3f, 3b0de, ca71d, 3a94d, 71a8b, 7f3e9, dfb41, 5e8c1, 8e47c, 40150, 3150f, 9324a, e47ab, 51572, be360, 2f545, 5315b, 052a1, d1ff8, 0e509, ee43f, 09ad5, 69afd, d692a, 2a916, 44ba7, 448a0, 4a3cd, 3f789, 45ddc, 4717f, 6df55
2. Build vLLM nightly wheels: This pull request introduces a nightly build system for vLLM wheels using the `pytorch/manylinux2_28-builder` base image, fixes bugs related to CUDA version and wheel path settings in GitHub actions, updates the vLLM Dockerfile for correct CUDA indexing and package installation, bumps the xformers version, and standardizes wheel naming conventions to align with PyTorch nightly releases for seamless compatibility.
- URL: pull/162000
- Merged: No
- Associated Commits: 4584b, 826a5, f9901, 614b8, c3e99, 71536, 22ac2, 1d188, ede4f, 24086, 8c4a7, a7750, 0afa8, 30c8d, 8e418, 0afa0, 92a44, 44ece, 9aa00, 3bde7, 41086, e64b0, 784e9, 581fc, 2968a, 5b0c6, 8013c, 5dcf9, 2f81b, 28545, e345e, 9c59a, 591d0, 8829a, 1b357, 0e16c, c28d8, 26400, 754cd
3. Add LazyGroupNorm: This pull request introduces the initial implementation of LazyGroupNorm in PyTorch, including documentation, tests, and various fixes to ensure proper functionality and integration, addressing issue #161869.
- URL: pull/161870
- Merged: No
- Associated Commits: f9f76, 377fd, f4e92, 91e6b, dab46, feaa8, b7517, 06240, cbc9e, 0243a, 89acb, f6191, d6bfa, 8c3a7, 4d239, 6c4be, c5630, 0a26e, 17411, 985de, 37196, bf66b, a0ec6, 02327, ee506, 6597f, f88c9, 0dee8, 06eca, c5d2d, b3992
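To illustrate the lazy-initialization pattern behind the LazyLayerNorm and LazyGroupNorm proposals above (not the pull requests' actual code), here is a rough sketch built on the existing `LazyModuleMixin` and `UninitializedParameter` machinery; the class name and shape-inference rule are assumptions for demonstration.

```python
import torch
import torch.nn as nn
from torch.nn.modules.lazy import LazyModuleMixin

class DemoLazyLayerNorm(LazyModuleMixin, nn.Module):
    """Hypothetical layer norm that infers normalized_shape from the first input."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.normalized_shape = None
        self.weight = nn.UninitializedParameter()
        self.bias = nn.UninitializedParameter()

    def initialize_parameters(self, x):
        # Invoked by LazyModuleMixin's forward pre-hook on the first call.
        if self.has_uninitialized_params():
            self.normalized_shape = (x.shape[-1],)
            with torch.no_grad():
                self.weight.materialize(self.normalized_shape)
                self.bias.materialize(self.normalized_shape)
                self.weight.fill_(1.0)
                self.bias.zero_()

    def forward(self, x):
        return nn.functional.layer_norm(
            x, self.normalized_shape, self.weight, self.bias, self.eps
        )

ln = DemoLazyLayerNorm()
print(ln(torch.randn(2, 16)).shape)  # parameters materialize on this first call
```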
Other Open Pull Requests
- OpenReg test improvements and execution limits: These pull requests focus on enhancing the OpenReg testing framework by migrating test cases to a dedicated directory and adding `test_openreg` to blocklists to reduce unnecessary computation. This ensures that OpenReg runs only when needed and aligns tests more closely with their functional code.
- Inductor backend enhancements: Multiple pull requests introduce a performance model interface for ranking configurations, enable the LOAF optimization by default, and fix issues related to Triton kernel launcher signatures. These changes improve modularity, performance, and runtime stability of the Inductor compiler backend.
- Stable ABI enhancements for torch::stable::Tensor: Several pull requests add new methods such as sizes(), strides(), clone(), copy_, and template accessors to the torch::stable::Tensor class. These additions improve the stable ABI interface by increasing functionality and flexibility for tensor operations.
- Testing and test infrastructure updates: These pull requests add additional tests for the vllm module, improve testing infrastructure for strict export flow, and introduce workflows to test fallback commands. They collectively enhance test coverage and reliability across different components.
- Build and deployment improvements: These pull requests update the continuous deployment workflow by using the `setup-python` GitHub action for Mac wheel builds and add the NVIDIA NCCL library as a git submodule to fix CUDA build failures. These changes streamline the build process and resolve critical build issues.
- Error handling and bug fixes: Pull requests in this group address error handling improvements such as raising errors when no record is found in `extra_files` during save/load, fixing ROCm batchnorm memory format issues, and adding `error_on_graph_break()` to improve compiler error reporting. These fixes enhance robustness and correctness.
- Ahead-of-time (AOT) compilation interface: This pull request introduces a new _aot_compile interface to the OptimizedModule, enabling AOT compilation of models with multiple input contexts such as training and evaluation. This facilitates direct precompilation and training of models like NanoGPT.
- Library upgrades and third-party integrations: These pull requests upgrade the dlpack library to version 1.1 to support fp8 and fp4 data types and add the NVIDIA NCCL library as a submodule to fix build failures. These updates improve hardware compatibility and build stability.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 265
Key Closed Pull Requests
1. [ROCm] Bump AOTriton to 0.11b: This pull request proposes to update the AOTriton library to version 0.11b for ROCm, introducing significant new features and optimizations such as invoking high-performance AITER Assembly kernels on specific AMD GPUs, aligning logsumexp behavior with CUDA, enabling new causal variants, and implementing a revamped build and packaging system that selectively downloads GPU image packs and avoids ABI breaks, while also addressing kernel bugs and known issues with certain GPU targets.
- URL: pull/161754
- Merged: No
- Associated Commits: 85386, 69342, c194d, bd319, 543da, f7db9, 01f1c, 287cf, 2fb59, 407d6, 38463, 829c0, 11913, 0be57, d161c, 2c261, 92ebb, 4384a, 0b7d8, b0ce2, 0f524, 37b79, 88e10, fbe1c, 61d8d, 8550a, d82aa, 6c8fe, 6afe8, d574e, a4d97, f7a9b
2. MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump: This pull request introduces support for MXFP8 grouped GEMM operations in the `torch._scaled_grouped_mm` function, including dispatching, input validation, meta registration for compilation compatibility, and unit tests, and updates the FBGEMM submodule to incorporate recent related improvements necessary for the backward pass of MXFP8 Mixture of Experts training with grouped GEMMs.
- URL: pull/162209
- Merged: No
- Associated Commits: 2018c, e166e, de229, 6ccc9, 345e9, d26cf, 868b5, 2ff19, 1fc6c, fd320, 240da, 6b2cf, 4cc4b, 04e26, 28387, 8d73b, 1e464, 1edeb, 79a36, df07f, ced70, 50339, 9029f, 3915c, 855c4, 950d9
3. [Inductor] Improve RoPE: This pull request aims to improve the Rotary Positional Embedding (RoPE) implementation in the Inductor backend by fusing two separate RoPE kernels into a single kernel, optimizing performance based on the Llama3 model configuration.
- URL: pull/161420
- Merged: No
- Associated Commits: 8bed3, 5fc60, f21c2, 3818b, a0908, 24276, 48ee2, 2f0b7, a9b09, e8f3e, 218ff, 0b483, 1b6a2, d5cf6, 8edeb, 4b568, 89a05
Other Closed Pull Requests
- Inductor Benchmark Docker Image Consolidation and CI Job Renaming: This pull request consolidates multiple inductor benchmark Docker images used in continuous integration into a single streamlined image to simplify workflows. It also renames related CI jobs to use shorter, more human-friendly names for better clarity.
- Intel GPU Support in Distributed Tests: These pull requests port distributed tensor parallel and Fully Sharded Data Parallel (FSDP) test files to support Intel GPUs by utilizing torch.accelerator for general GPU compatibility. They maintain original code styles and selectively skip tests on XPU devices with known issues to ensure stability.
- [pull/161261](https://github.com/pytorch/pytorch/pull/161261), [pull/161533](https://github.com/pytorch/pytorch/pull/161533), [pull/161601](https://github.com/pytorch/pytorch/pull/161601), pull/161604
- Inductor Template Heuristics Enhancements: Multiple pull requests enhance the inductor backend by moving template arguments and workspace handling into heuristics infrastructure. This enables all template choices for matrix multiplication operations to be handled through a single call with fixed extra keyword arguments, simplifying operation handling and supporting non-tensor scalar inputs.
- [pull/161123](https://github.com/pytorch/pytorch/pull/161123), [pull/161124](https://github.com/pytorch/pytorch/pull/161124), [pull/161125](https://github.com/pytorch/pytorch/pull/161125), [pull/161126](https://github.com/pytorch/pytorch/pull/161126), pull/161533
- Inductor Matrix Multiplication Refactor and Heuristics for Extra Keyword Arguments: This pull request introduces extra_kwargs for matrix multiplication configurations in the inductor component, enabling tracking and interception of consistent keyword arguments. It sets up infrastructure for template heuristics to use these extra kwargs to simplify future operation handling.
- CUDA Version Update and Related Dockerfile Fixes: This pull request adds CUDA 13.0 libtorch builds while removing CUDA 12.9 builds, along with various fixes and updates to Dockerfiles and dependencies to support this transition.
- Removal of Unused ONNX Verification Logic and Private Members: These pull requests remove unused logic from the internal ONNX verification module and eliminate the import of two private functions from the torch.onnx namespace to clean up the codebase.
- [pull/161449](https://github.com/pytorch/pytorch/pull/161449), pull/161546
- XPU Device UUID Addition and Large Tensor Test Fix: One pull request adds a UUID to the XPU device properties for device identification, while another fixes the malfunction of the largeTensorTest on XPU devices caused by a previous change.
- [pull/161392](https://github.com/pytorch/pytorch/pull/161392), pull/161988
- Type Consistency Fixes in torch.slice_scatter and torch.export.export: These pull requests address type inconsistency issues in torch.slice_scatter and fix a TypeError in torch.export.export by adding explicit decompositions and modifying type checking.
- [pull/160851](https://github.com/pytorch/pytorch/pull/160851), [pull/161688](https://github.com/pytorch/pytorch/pull/161688)
- Inductor Contiguous Matrix Multiplication Refactor: This pull request proposes a mild refactor of the inductor contiguous matrix multiplication code by consolidating checks into new heuristics logic, correcting device type, and delegating keyword argument passing, although it was not merged.
- Graph Partition Module CUDA Graph Wrapper Interface: This pull request adds an interface to the Graph Partition module allowing users to specify a custom CUDA graph wrapper, demonstrated by a user example from the vllm project.
- Shared Module Shim Layer and nativeRT Implementation for Inference Engines: This pull request introduces a draft of a shared module shim layer and nativeRT implementation to enable various inference engines to share a unified API, allowing flexible runtime selection without user code changes.
- SVE128 Support Enhancements: This pull request imports and integrates multiple enhancements for SVE128 support, including vectorized template layers for various data types, differentiation from general SVE, and enabling compilation targeting SVE128 CPUs to improve performance.
- MIOpen Integration Revamp in ROCm Backend: This pull request revamps the MIOpen integration by updating source files to follow best practices and avoid reshape_ calls inside backward operations, although it was not merged.
- Linux Binary Wheels Upgrade Blocking for ROCm: This pull request explicitly specifies which Linux binary wheels should block the viable strict upgrade process to prevent ROCm binary builds from causing unnecessary delays.
- Autograd while_loop_with_checkpoint Feature Addition: This pull request adds a hop while_loop_with_checkpoint feature to the autograd while_loop functionality to enhance checkpointing during iterative computations.
- Flex_attention API Auxiliary Output Feature: This pull request adds an optional feature to the flex_attention API to return maximum post-modulation scores as auxiliary outputs, introducing a flexible request and output structure to manage these returns without breaking backward compatibility.
- Custom FX Backend Registration in torch._inductor.compile_aot: This pull request adds support for registering custom FX backends in torch._inductor.compile_aot, aligning FX codegen with Python and C++ codegen capabilities, and includes a CI test for verification.
- SymmetricMemory API Enhancements for Remote Tensor Access: This pull request introduces the get_remote_tensor API to return symmetric tensors from peer ranks without offsets and refactors get_buffer and get_signal_pad implementations to the SymmetricMemory level for backend unification.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
yangw-dev | 479 | 6 | 1 | 7 |
malfet | 101 | 12 | 12 | 113 |
coconutruben | 179 | 32 | 0 | 4 |
guangyey | 101 | 16 | 0 | 44 |
swolchok | 105 | 26 | 1 | 12 |
huydhn | 101 | 7 | 0 | 13 |
anijain2305 | 68 | 9 | 5 | 26 |
etaf | 62 | 13 | 30 | 1 |
guilhermeleobas | 79 | 18 | 1 | 7 |
kwen2501 | 59 | 23 | 0 | 21 |