Weekly GitHub Report for PyTorch: October 27, 2025 - November 03, 2025 (12:05:04)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward enhanced performance, security, and streamlined deployment.
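Two of these changes are directly visible in user code. Below is a minimal sketch of both, assuming PyTorch 2.6 or later (the filename is illustrative):

```python
import torch

# torch.compiler.set_stance controls how torch.compile behaves at runtime;
# "force_eager" skips compilation entirely, which is handy for debugging.
torch.compiler.set_stance("force_eager")

@torch.compile
def f(x):
    return x.sin() + 1

print(f(torch.randn(4)))  # runs eagerly because of the stance above

# torch.load now defaults to weights_only=True, so loading arbitrary pickled
# objects requires opting out explicitly (only do so for trusted files).
torch.save({"w": torch.ones(2)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # weights_only=True is the new default
print(state["w"])
```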
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- CuTe DSL NVFP4 GEMM/Grouped GEMM kernels: This issue discusses the recent release of NVFP4 GEMM and Grouped GEMM kernels by the CuTe DSL team and explores whether integrating these kernels into PyTorch, specifically within `torch.nn.functional.scaled_mm` and `torch.nn.functional.scaled_grouped_mm`, is beneficial. The conversation centers on the motivation for adopting CuTe DSL kernels over existing cuBLAS or FBGemm implementations, challenges related to API compatibility, kernel authoring complexity, debugging experience, and the potential use cases in eager versus compiled modes, including inductor template dispatch.
  - The comments highlight concerns about the necessity and advantages of CuTe DSL kernels compared to current solutions, noting that grouped GEMM APIs require additional data layout transformations and that debugging is currently difficult but being improved. Participants discuss the feasibility of matching PyTorch’s grouped MM API, the current limitations such as lack of metadata support and dynamic K issues inherited from CUTLASS, and the question of whether these kernels should be used in eager or compiled modes. There is also mention of ongoing work to optimize and fix these kernels, with a focus on composability, performance parity, and integration with PyTorch’s dispatch mechanisms.
  - Number of comments this week: 11
- Aarch64 unit test failures from nightly/manylinux build, jammy upgrade to gcc13 needed: This issue reports test failures on the AArch64 architecture in nightly builds, which do not occur in the existing Linux AArch64 workflow. The failures appear linked to a recent commit and are suspected to be caused by discrepancies between the GCC versions used in the jammy build environment and the manylinux build environment, prompting a request to upgrade jammy images to GCC 13 and potentially revert the problematic commit.
  - The discussion centers on the mismatch between GCC versions in jammy and manylinux causing test failures, with suggestions to revert a recent commit and upgrade jammy to GCC 13. Commenters debate the rationale behind building release wheels in manylinux but testing in jammy, proposing to unify build and test environments to avoid such issues. There is consensus on the need to align compiler versions and possibly support both GCC and Clang to ensure compatibility and stability across builds.
  - Number of comments this week: 7
- The design of register_fork_handler_for_device_init for poison fork: This issue discusses the design limitations of the `register_fork_handler_for_device_init` function in PyTorch, which currently only supports initialization for the first registered device despite appearing to be designed for multiple devices. The main concern is whether to refactor the function to truly support multiple devices simultaneously or to explicitly support only a single device to avoid ambiguity, given that other parts of the codebase assume a single accelerator.
  - The comments reflect a consensus that while supporting multiple devices might be ideal, there is no current practical use case for it, and multiple parts of the codebase assume a single accelerator, which could cause confusion. Contributors suggest that maintaining support for a single device is reasonable to avoid complexity, and there is a request to add tests related to this issue.
  - Number of comments this week: 6
- [CUDA][B200][Inductor] Inductor Loop Ordering After Fusion (#162030) causes > 70% perf. regression: This issue reports a significant performance regression of over 70% on GB200 machines after a recent code change related to Inductor loop ordering following fusion. The user notes that performance is restored by reverting to a previous commit or disabling the new loop ordering feature via an environment variable, and requests clarification on whether this regression is a known problem, since the feature is not enabled by default in the main codebase.
  - The comments discuss whether to revert the change due to the large regression, with suggestions to disable the feature by default until the issue is understood. There is clarification on the correct environment variable setting to restore performance, a request for a reproducible test case, and a recommendation to investigate the root cause before deciding on a fix or disabling the feature permanently.
  - Number of comments this week: 6
- B200 queuing was observed on Oct 29: This issue reports queuing delays observed on the B200 Linux DGX system during distributed tests, which were unusually long, exceeding 9 hours instead of the typical 6 hours. The root cause was identified as a version mismatch between the NVIDIA Datacenter GPU Manager (DCGM) and CUDA, leading to failures in GPU runtime detection and extended job durations, prompting a need to optimize the distributed test suite and address ROCm-specific concerns separately.
  - The comments detail diagnosing the problem as a DCGM and CUDA version mismatch, followed by planned and executed fixes involving upgrading DCGM packages from CUDA 12 to CUDA 13 on multiple DGX nodes. After applying these fixes, diagnostic checks confirmed system health, and further maintenance was scheduled to prevent recurrence, with ongoing monitoring and re-provisioning efforts to restore full GPU functionality.
  - Number of comments this week: 5
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and shares code snippets demonstrating the error occurring while compiling custom pipeline components with torch.compile.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, with demonstrated speedup in testing (a sanity-check sketch follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at `/tmp` having permissions set to `1777`. Although the model compiles successfully, execution fails with an error indicating that the shared object `cuda_utils.so` cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase by removing approximately 1,500 files currently excluded from UFMT and applying consistent formatting to them. It outlines the process for updating the `.lintrunner.toml` configuration, running the formatting tool, handling known edge cases that require preparatory fixes, and organizing the work by directory to facilitate manageable and reviewable pull requests.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function that allows users to exclude debug files, specifically `.debug_pkl` files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the archive size, especially for small or quantized models, without affecting the model's correctness or functionality.
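The MaxPool2D proposal above is easy to sanity-check: with stride 1, two chained 3x3 max pools cover exactly the same window as one 5x5 pool. A minimal sketch of the equivalence (not the proposed CPU implementation itself):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# One 5x5 max pool with stride 1; padding keeps the spatial size.
pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

# Two chained 3x3 max pools: the max over 3x3 windows of 3x3 maxima
# equals the max over the combined 5x5 window.
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

assert torch.equal(pool5(x), pool3(pool3(x)))
```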
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 88
Summarized Issues:
- Autocast and Tracing Issues: Dynamo cannot reliably retrace output graphs due to non-traceable operations like `torch.amp.autocast_mode._enter_autocast` and `_exit_autocast`, causing errors and improper autocast state resets during tracing. Additionally, the function `___check_global_state()` fails to reveal whether autocast is enabled when a graph is compiled, limiting visibility into the relevant global state information.
- Inductor Backend Compilation Errors: Multiple assertion errors and runtime failures occur during compilation with the Inductor backend, including stride length mismatches, input size mismatches in `aten.cat`, and stride-related indexing errors on CUDA tensors. These errors cause failures not seen in eager execution and include LoweringExceptions and IndexErrors.
- Inductor Backend Warnings and API Conflicts: The Inductor backend triggers unnecessary warnings about deprecated TF32 API settings during `torch.compile`, and compilation fails due to conflicts between legacy and new TF32 API usage for cuBLAS matmul despite only the new API being set.
- ONNX Export and Progress Bar Feature Requests: There is a request to enable the progress bar when saving ONNX files in PyTorch 2.10 by setting the verbose parameter in the onnxscript stable API. Additionally, exporting a ResNet50 model via the Dynamo export path produces an incorrect ONNX model with a wrong bias tensor shape, unlike the classical export path.
- Distributed and Parallelism Issues: FSDP does not support non-contiguous parameters, preventing direct use of channels-last convolutions in image models, with a proposed workaround to add channels-last convolution support in `nn.functional`. Also, a bug in DistributedDataParallel with the MPI backend causes Broadcast operations to not properly wait for completion, potentially leading to incomplete execution.
- Documentation and Usability Improvements: Requests include adding usage examples for several loss functions to improve clarity, clarifying documentation on reproducibility and randomness in distributed training with multi-process data loading, and addressing challenges in accessing documentation for custom operators on older PyTorch versions.
- Profiler and Debugging Failures: The PyTorch profiler fails to export traces after abort signals due to an AttributeError from a None internal results object, hindering debugging of NCCL deadlocks.
- Lowering and Compilation Exceptions: LoweringExceptions occur during compilation with `torch.compile` due to assertion failures in size variable guarding logic and missing required keyword arguments in pattern matching, causing exceptions instead of safe skips.
- Tensor Gathering and Device Mismatch Bugs: The function `torch.nn.parallel.comm.gather` produces incorrect concatenated tensors when the destination device is CUDA, causing wrong outputs in `nn.DataParallel`, while gathering to CPU works correctly. Also, `torch.compile` fails due to device mismatch errors during in-place addition between CPU and meta tensors.
- Flex Attention Parameter and Initialization Issues: Flex Attention does not support learnable scalar parameters in its score modification function, causing backward and forward pass errors, with a workaround that reduces training speed. Additionally, inconsistent default initialization of `in_proj_weight` in `nn.MultiheadAttention` leads to discrepancies that could be fixed by adjusting the gain parameter.
- Numerical and Device Inconsistencies: Functions like `torch.masked.argmax` and `torch.polygamma` produce inconsistent results across CPU and CUDA devices, while `F.batch_norm` improperly accepts negative epsilon values, causing NaNs or infinite values depending on the device (a small repro sketch follows this list).
- Segmentation Faults and Platform-Specific Failures: A segmentation fault occurs during inference on Linux when loading models saved with AOTInductor involving mixed Torch-TensorRT compiled graphs. On Windows, importing PyTorch after PyQt6 causes DLL initialization failures starting from PyTorch 2.9.0.
- Test Disabling and CI Failures on XPU and ROCm: Multiple tests on the XPU platform are disabled due to consistent failures on the main branch, and ROCm runners experience prolonged queue times and workflow failures due to ongoing maintenance and SDK updates.
- Reference Cycles and Memory Leaks: A recursive function in PyCodegen captures its own instance in a closure, causing prolonged tensor lifetimes and out-of-memory errors, with suggested fixes including removing recursion or clearing closure variables.
- Torch.compile and FakeTensor Limitations: `torch.compile` fails when using unsupported operations on FakeTensor objects, such as `share_memory_()`, and when handling mixed device tensors during embedding operations, causing runtime errors during graph tracing.
- Graph Compilation and Dtype Mismatches: Compilation fails due to dtype mismatches between float16 and float32 tensors in matrix multiplication within models having mixed parameter data types, causing errors during graph compilation.
- Performance Regressions and Optimization Requests: A greater than 70% performance regression on GB200 machines is caused by Inductor loop ordering changes after fusion, and scalar zero additions to tensors are not optimized away as no-ops, prompting requests for removal of redundant operations.
- Build and Environment Failures: Build failures occur due to missing build result files on windows-arm64, undefined CUDA driver API members with CUDA 12.3, and compiler version warnings during extension compilation with CUDA 13.0. Additionally, environment variables default incorrectly, causing test binaries to be included unnecessarily.
- API and Signature Inconsistencies: The type signature of `torch.nn.Module.__setattr__()` restricts accepted value types to `Tensor` or `Module` despite allowing any type, causing confusion for static type checkers. Also, `fx.Interpreter.boxed_run` silently ignores extra input arguments without warning, leading to debugging difficulties.
- MoE and ROCm Support Limitations: MoE Dispatch and Combine layers are not supported on ROCm due to missing implementations of critical all-to-all communication operations, resulting in significantly slower performance compared to CUDA for both single-node and multi-node setups.
- Miscellaneous Feature Requests and Proposals: Proposals include adding methods for common model transformations on ONNXProgram, enabling keyword arguments in `torch.utils.checkpoint.checkpoint()`, making `_grouped_gemm` a public API with dtype control, updating the programming model docs to recommend `torch.compiler.set_stance("force_eager")`, and enabling GitHub Actions for Power architecture CI.
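The `F.batch_norm` epsilon issue above is straightforward to reproduce on CPU. A minimal repro sketch of the reported behavior (a negative `eps` makes the variance term negative, so the normalization yields NaNs):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 4)  # (N, C) input
running_mean = torch.zeros(4)
running_var = torch.full((4,), 0.1)  # small variance

# eps is not validated: var + eps = -0.1, and the square root of a
# negative number produces NaN in the normalized output.
out = F.batch_norm(x, running_mean, running_var, eps=-0.2)
print(torch.isnan(out).any())  # tensor(True)
```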
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 18
Summarized Issues:
- Compilation and Runtime Errors with torch.compile and DDP: Multiple issues report failures when using `torch.compile` in combination with other features. One issue describes `torch.full` returning incorrect fixed values on CPU with `dtype=torch.float64`, while another details a NotImplementedError when using `torch.compile()` with DistributedDataParallel and custom autograd functions due to unsupported higher order operations in the DDPOptimizer backend.
- [issues/166253, issues/166305]
- Module and Dependency Issues Related to ONNX and onnxscript: Users face problems due to missing or improperly linked ONNX components. One issue reports that starting with PyTorch 2.9, the `onnxscript` module is no longer installed by default, causing `ModuleNotFoundError` for `torch.onnx.export` with `dynamo=True`. Another issue describes build failures when compiling with `USE_SYSTEM_ONNX=ON` due to missing linker references to ONNX symbols.
- [issues/166352, issues/166546]
- Memory and Performance Regressions: There are significant memory-related problems reported in PyTorch 2.9.0. A severe memory leak occurs in DataLoader on Windows with long-lived iterators, causing resource exhaustion, and a large memory usage regression is observed in `F.conv3d` with `bfloat16` inputs, where memory consumption nearly triples compared to previous versions and to `float32`.
- [issues/166513, issues/166643]
- Test Failures and Disabling on XPU Platform: Several tests in the NoMixOrderReductionTest suite on the XPU platform were consistently failing and were subsequently disabled. These include `test_rms_norm_bwd_float32_shape1` and `test_layer_norm_bwd_no_bias_shape0`, both of which were later fixed by linked pull requests.
- [issues/166491, issues/166503]
- Type Checking and Decorator Issues: The Pyright type checker reports errors due to missing type annotations in PyTorch decorators. Specifically, the `torch.no_grad` decorator is considered untyped because the underlying `_NoParamDecoratorContextManager.__new__` method lacks type annotations, causing Pyright to emit errors about untyped function decorators.
- [issues/166413]
- Sparse Tensor Support Limitations on Mac MPS: PyTorch's sparse array functionality is not supported on Mac when using the MPS device, resulting in a NotImplementedError. This limitation prevents efficient use of sparse tensors on Mac hardware despite basic tensor operations working fine on CPU.
- [issues/166426]
- Compiler and Graph Break Handling Issues in Dynamo: The Dynamo compiler has issues correctly handling graph breaks. The `error_on_graph_break(True)` setting fails to raise errors when graph breaks occur in some cases, causing compilation to succeed incorrectly and masking the graph break.
- [issues/166589]
- Logging and Format String Errors in Distributed Launcher: A logging error in the PyTorch distributed launcher causes a `TypeError` due to a missing format string argument `signals_to_handle` in a logging statement, resulting in incorrect string formatting.
- [issues/166630]
- Numerical Inconsistencies in Linear Algebra Operations: The behavior of `torch.linalg.inv()` differs between the CPU and CUDA backends when inverting singular matrices. The CPU backend correctly raises a `LinAlgError` for singular inputs, while the CUDA backend returns unstable numerical results without error (a short demonstration follows this list).
- [issues/166490]
- Contributor License Agreement (CLA) Confusion: A user reports confusion about a persistent "Missing CLA Authorization" message on their pull request despite having signed the Contributor License Agreement, indicating a possible issue with CLA status recognition.
- [issues/166282]
- Attribute and API Compatibility Issues After Upstream Changes: The Inductor compiler fails with an AttributeError related to a missing 'constexprs' attribute in a 'JITFunction' object after a recent Triton commit, causing test failures and requiring an upstream revert. Additionally, pyrefly error codes need to be disabled for ONNX script APIs in `torch/onnx/_internal/exporter/_torchlib/ops` due to incompatibility with type checkers.
- [issues/166403, issues/166475]
- Runtime Error Due to Missing Backward Implementation: A runtime error occurs in PyTorch 2.9 because the `aten::_scaled_mm` operation lacks a derivative implementation during the backward pass in autograd, causing failures in gradient computation.
- [issues/166729]
- External Infrastructure Outage Affecting CI/CD: A temporary failure of all Docker builds in the PyTorch project was caused by an Ubuntu archive server outage, leading to multiple CI/CD job timeouts during image calculation until the issue was resolved and a triggering pull request was reverted.
- [issues/166363]
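The `torch.linalg.inv()` inconsistency above can be demonstrated on the CPU side in a few lines; the CUDA half of the report requires a GPU, so it is only noted in a comment:

```python
import torch

singular = torch.zeros(3, 3)  # rank-0 matrix, clearly non-invertible

try:
    torch.linalg.inv(singular)
except torch.linalg.LinAlgError as e:
    # The CPU backend detects the singular input and raises.
    print("CPU raised:", e)

# Per the report, the same call on a CUDA tensor (singular.cuda())
# returns numerically meaningless values instead of raising, which
# is the cross-backend inconsistency at issue.
```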
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 223
Key Open Pull Requests
1. [DO NOT REVIEW] Inductor lite mode with CUDAGraph support: This pull request introduces an Inductor "lite" mode with CUDA Graph support by tracking the inductor fallback cudagraph backend and implementing related features such as caching `get_free_symbol_uses` for faster compilation, adding a lite mode for the inductor backend, enabling overlap scheduling in lite mode, and preparing the codebase for incremental landing through smaller PRs.
- URL: pull/166320
- Merged: No
- Associated Commits: 647cb, 0ad1c, f1ab8, c5259, d5616, be6a8, c0687, 0a648, 3f6f5, 81011, 72f24, 4f066, a4e6e, f86ce, 3e46d, bc8ef, 1e077, 5d8f9, 53731, ee8e1, 53efe, 2efa6, d316b, eaaf1, ffea6, 19b47, 9740e, 7aed6, 21446, e74ed, d8a9a, 251cf, 701b8, 31ed0, 67b1d, e26ba, 70af3, 6770e, 47244, cf176, 66569, 25668, 72986, 68628, 05bd4, 37379, eaded, b377e, 732af, 47818, 76547, 9bb2e, 8b6e8, 16bc6, 15464, 3079d, 11847, 8a55c, 4a61e, ff88d, 29d0f, 2ed5d, 78c93, 1da6a, 6abac, a609b, 78530, 5d726, 33ef8, 3e757, f100f, 71679, 45657, d69fa, dd6ad, 86cd2, 31c0d, 6b2a7, 523c8, ce01c, 58c73, e4201, 63545, dc776, 7a10b, a7b0b, b79f4, ab8f9, 43595, 0d505, 8e31a, f1dc6, 18965, 4c45f, 6685b, 53f4b, 3092c, fdbe5, ef080, 6700b, 57da3, 3e678, bca75, d9700, c34c5, 8ee7d, eaa32, 9e76a, 1bdf5, 443d1, 603ff, 87141, 8c018, 7b70c, b30ed, aea2b, 7e7a2, b323f, d2249, f49c4, 8b842, 9dc41, 8028e, 74cb0, 68162, 58bfb, 4ba8c, 7e0cc, 90ff6, 47c3e, fc72f, f6ed8, f3c4d, 02ad5, 815b6, efce2, ea14a, 6619f, 9acf5, 1a67a, 7b208, 94f25, e4f9e, c74d2, 4236b, 21e30, f6a98, 3653d, 8f785, 8ae04, 9a52d, 7752c, 7331a, 64ca5, 6fb66, 8d76d, 79652, 3ed99, 1fc4f, d9f29, c24e0, 67249, c8799, 9e226, 713b7, fc3f3, 5db34, 4df2b, 032da, b5167, f83ef, 6c22a, 6a466, edf34, 3aa0c, bc38d, 16171, d73e3, 85d6f, 22e19, fc6a0, b77f0, fb4c5, 23d65, aa7fe, 85e64, dce87, d7d72, 3b288, f31fa, 62a68, f020e, 998ab, 3e46b, 7116f, f1101, 4ce1b, b5de5, 20fd6, 54cdc, 1b362, 63fbd, e33f4, 783be, 35c4f, 9bc37, f369e, 4d2fd, 7a9e3, 25f88, 1f620, 07993, 7195a, b9c60
2. [Inductor] refine the logic in (mm + bias) -> addmm: This pull request aims to refine the logic in the PyTorch Inductor backend for the operation combining matrix multiplication with bias addition, specifically improving the implementation of the (mm + bias) to addmm transformation.
- URL: pull/166300
- Merged: No
- Associated Commits: 3dcc5, 0e103, 4faee, 31200, abda6, d5bd7, feb2c, 9d5d9, 93349, 641a0, 182b9, 191f9, dd750, 412ec, aed0a, 24bf2, 0e635, b1a54, 4f29d, 358cb, c9ff1, fb305, 18a9d
3. Introducing statefulness for single process dataloader: This pull request introduces statefulness functionality to the single process PyTorch DataLoader by upstreaming features from the StatefulDataLoader, ensuring no performance degradation through extensive benchmarking across various dataset types, and laying the groundwork for a subsequent multiprocess DataLoader update.
- URL: pull/166732
- Merged: No
- Associated Commits: 88440, 5793e, 299e6, 6581d, bf753, d7317, db69f, a43a9, 68679, e5a69, e6a99, 48945, a7fe6, f1d06, 9d008, 6fe22, c01e5, 657dd, f00e4, 16028, d639c, 5fc72, cad27
Other Open Pull Requests
- XPU Expandable Segment Feature: Multiple pull requests introduce and enhance the expandable segment feature on XPU, which dynamically extends physical memory within a reserved virtual address range to reduce fragmentation and reallocation overhead. These include the initial implementation, the addition of the `ExpandableSegment` struct, and unit tests that validate the feature while addressing driver upgrade requirements.
- [pull/166292, pull/166495, pull/166299]
- DTensor Dispatch Improvements: Several pull requests focus on improving the DTensor dispatch mechanism by introducing the DTensor dispatch key and removing custom operations from the critical path to delegate dispatch responsibilities to the dispatcher. These changes aim to prepare the codebase for future optimizations and ensure compatibility.
- [pull/166369, pull/166370]
- Inductor Refactoring: A pull request proposes retiring the use of the `is_pointwise_use` functionality in the Inductor component, aiming to simplify and clean up the codebase.
- [pull/166402]
- PeerToPeerAccess API for XPU: A pull request introduces a PeerToPeerAccess API for XPU devices to enable querying and enabling peer-to-peer connections between devices for both regular and expandable segment memory allocations, optimizing copy operations when supported.
- [pull/166424]
- ROCm Group GEMM Support: One pull request enables group GEMM operations through the CK library for ROCm, supporting all four matrix dimension combinations with verified forward and backward passes on gfx942 and gfx950 GPUs, while noting pending support for gfx90a due to kernel errors.
- [pull/166334]
- PyObject Preservation Rework: A comprehensive rework of PyObject preservation is proposed, including cleanup of dead code, deletion of obsolete types, fixes to Tensor traversal and base types, and ensuring Python objects for Tensor and Storage always own their underlying data, along with linting and formatting improvements.
- [pull/166342]
- User-Streams Improvements: Two pull requests focus on the user-streams component by cleaning up the StreamVariable signature and switching to using fx annotations at trace time to improve tracing.
- [pull/166471, pull/166472]
- Dynamo Guards Refinement: A pull request removes the use of FUNCTION_MATCH and replaces cautious ID_MATCH with UNCLASSIFIED_ID_MATCH in the dynamo guards system to improve recompilation reason tracking and avoid potential failures.
- [pull/166321]
- Narrow Function and Tensor Improvements: Multiple pull requests address improvements to the narrow function and related tensor operations by fixing handling of unbacked start indices, making narrow_tensor_symint free of dynamic dispatch errors, and correcting absolute tolerance scaling in fast gradcheck for complex backpropagation with added regression tests.
- [pull/166361, pull/166379, pull/166386]
- Functionalization and Autocast Support: A pull request adds support for autocast (AC) in the default partitioner when functionalization is enabled, enhancing compatibility and performance.
- [pull/166610]
- Automatic Checkpointing Enhancements: One pull request adds an API to annotate disjoint backward passes and handle them in the automatic checkpointing context, enabling zero-bubble and DualPipeV support by ensuring recomputation triggers at most once per checkpointed region during multiple backward calls on non-overlapping graph regions.
- [pull/166536]
- Custom Operation Backpropagation Fix: A pull request modifies PyTorch to prevent hard errors when compiling backpropagation through a custom operation without an autograd key, improving robustness.
- [pull/166367]
- Operator Arguments Refactoring: A pull request extracts OperatorArgsKwargsView from the parseIValuesToPyArgsKwargs function to facilitate easier reuse of operator argument processing logic in future changes.
- [pull/166368]
- MPS Backend Error Checking: A work-in-progress pull request focuses on implementing error checking for the MPS backend in PyTorch.
- [pull/166273]
- Runtime Assertion Fix in torch.cond: A pull request fixes an issue where runtime assertions inside torch.cond subgraphs incorrectly update the global shape environment during tracing by making these asserts conditional on the outer predicate, preventing leakage of local assertion information.
- [pull/166294]
- Type Annotations Addition: A pull request adds type annotations to various Python files in PyTorch, including torch_function.py and modules under torch/_dynamo/variables, to improve type safety and coverage.
- [pull/166354]
- Local Tensor Send/Receive Prototype: A pull request introduces a prototype implementation for send and receive support in local tensors, demonstrating circular data rotation among tensor ranks.
- [pull/166595]
- Typos Linter for CI Testing: A pull request adds a new typos linter feature along with related updates and fixes, intended solely for continuous integration testing and not for merging.
- [pull/166635]
- cuSolver Backend for Eigenvalue Computations: A pull request introduces an initial implementation of NVIDIA’s cuSolver backend into the ATen/Linalg framework, adding support for torch.linalg.eig and torch.linalg.eigvals to enable faster eigenvalue computations on CUDA devices using the cuSolverDnXgeev API from CUDA 12.8, while maintaining numerical consistency and existing dispatch mechanisms without changing the public API (a usage sketch follows this list).
- [pull/166715]
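Since the cuSolver PR above keeps the public API unchanged, usage stays the standard `torch.linalg.eig` call. A minimal sketch (runs on CPU today; moving the input to CUDA would exercise the new backend if the PR lands):

```python
import torch

# Non-symmetric matrix: eigenvalues may be complex, so eig (not eigh) applies.
A = torch.tensor([[0.0, -1.0],
                  [1.0,  0.0]])

eigenvalues, eigenvectors = torch.linalg.eig(A)
print(eigenvalues)  # tensor([0.+1.j, 0.-1.j]) up to ordering

# Reconstruction check: A should equal V @ diag(w) @ V^{-1}.
V, w = eigenvectors, eigenvalues
reconstructed = (V @ torch.diag(w) @ torch.linalg.inv(V)).real
assert torch.allclose(reconstructed, A, atol=1e-6)
```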
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 205
Key Closed Pull Requests
1. [CUDA][cuBLASLt] addmm -- extend bias fusions to cases with (1 by n) shapes: This pull request proposes extending bias fusion optimizations in the CUDA cuBLASLt addmm operation to support cases where the bias has a shape of (1 by n).
- URL: pull/166307
- Merged: No
- Associated Commits: a7ec7, ad49c, 45433, 69555, 078f2, 74316, e8ea5, 5cbd4, e6abd, 20aa6, bb2c8, 8602d, daf5a, 15849, 2759e, 4ff38, 4db6a, 51ef9, 2e05c
2. [FlexFlash] CuteDSL flat indexer needs to be colexigraphic in coordinate space: This pull request addresses the need for the CuteDSL flat indexer in the FlexFlash project to be colexicographic in coordinate space, as indicated by the title and supported by a series of updates and benchmark notes included in the description.
- URL: pull/166657
- Merged: No
- Associated Commits: 851f0, 53f33, 404c9, 5402f, c6c21, fb7ae, c5923, 086ca, 4bf23, 0fd3b, a6952, 361fa, 57792, 797b8, cba62
3. [GraphPartition] cache get_free_symbol_uses: This pull request addresses the performance issue in the GraphPartition component by caching the results of the recursive get_free_symbol_uses() function, which was causing significant slowdowns when processing large graphs with thousands of nodes, and validates the fix on the torchtitan benchmark.
- URL: pull/166338
- Merged: No
- Associated Commits: 0361c, 46110, 47d22, ab232, e4bc3, 1bb96, 7c593, dbc76, 261e6, 261b4, 565ff, 48172, f8b2b, 0d670
Other Closed Pull Requests
- Bucketing and Collective Communication Enhancements: Multiple pull requests improve the bucketing mechanism in PyTorch by adding features such as bucket all reduce, collective LIFO semantics awareness, and multi-dtype bucketing for gather operations. These changes optimize collective communication scheduling and data handling to enhance performance and correctness.
[pull/166528, pull/166324, pull/166527]
- Inductor Backend Optimization and Mix Order Reduction: Several pull requests focus on improving the PyTorch inductor backend by implementing more aggressive mix order reductions, allowing independent split size determination, and tuning heuristics for better fused kernel generation and performance. One of these tuning attempts was ultimately not merged.
[pull/166382, pull/166461, pull/166585]
- Flash Attention and DTensor Testing Improvements: Pull requests integrate mask_mod and blockmask functionalities into flash attention and enable local tensor mode for DTensor attention and convolution tests. These updates enhance the capabilities and testing coverage of attention mechanisms in PyTorch.
[pull/166359, pull/166406]
- Dynamo Component Enhancements: Pull requests add comprehensive type annotations to the Dynamo variables directory and propose selective application of guards on specific PyTorch APIs within Dynamo. These changes improve type coverage and control over API usage in the Dynamo component.
[pull/166569, pull/166329]
- DebugMode and Logging Enhancements: A pull request introduces customizable hooks on `__torch_dispatch__` calls within DebugMode to enable logging and recording of arbitrary values, including outputs and tensor hashes. This facilitates numerical equivalence checks and detailed operation tracing (a generic dispatch-interception sketch follows this list).
[pull/166348]
- GitHub Actions and CI Workflow Improvements: A pull request creates a new periodic GitHub Actions workflow to separate ROCm jobs from the main periodic workflow, introduces a new label for ROCm distributed tests, and reverts the ROCm runner label to target more CI nodes due to network and scaling issues.
[pull/166544]
- Code Quality and Maintenance Fixes: Multiple pull requests address fixing typos across various folders, removing unused loop variables, and refactoring the context parallel codebase by consolidating files into a dedicated folder while preserving backward compatibility.
[pull/166606, pull/166258, pull/166456]
- Tensor Initialization and Compatibility Fixes: A pull request updates the weight tensor initialization in the RMSNormalization module to ensure compatibility with ONNX Runtime by requiring the weight to have more than one dimension. Another fixes stride comparison issues in inductor to prevent differential data errors.
[pull/166550, pull/166277]
- Type Checking and Setup Improvements: A pull request adds a warning for incomplete type checking setups, especially for users using Mercurial instead of Git, ensuring necessary binaries and type stubs are installed to prevent spurious errors.
[pull/166603]
- Miscellaneous Feature Additions: Pull requests add support for NVFP4 scaled grouped GEMM operations via FBGEMM kernels, create a timer subclass in compile_worker for inactivity-triggered actions, and move static functions into a new shim_common.cpp file preparing for versioning-aware implementations.
[pull/166308, pull/166465, pull/166373]
- Bug Fixes and Integration Repairs: A pull request attempts to fix torch.compile integration breakage caused by a previous Triton change by removing certain constexpr and argument names.
[pull/166280]
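The DebugMode hooks described above build on the `__torch_dispatch__` mechanism. A generic sketch of intercepting dispatch calls with `TorchDispatchMode` follows; it illustrates the underlying mechanism, not the PR's actual DebugMode hook API:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    """Log every ATen op that passes through the dispatcher."""

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # A hook like the PR describes could record arbitrary values here,
        # e.g. output tensor hashes for numerical-equivalence checks.
        print(f"{func}: out_shape={getattr(out, 'shape', None)}")
        return out

with LoggingMode():
    x = torch.randn(2, 2)
    y = x @ x + 1  # prints the aten ops hit: randn, mm, add, ...
```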
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| cyyever | 188 | 70 | 0 | 41 |
| guangyey | 166 | 17 | 0 | 48 |
| malfet | 105 | 22 | 7 | 65 |
| anijain2305 | 157 | 21 | 2 | 9 |
| Skylion007 | 11 | 6 | 1 | 160 |
| bobrenjc93 | 112 | 27 | 21 | 9 |
| pianpwk | 128 | 32 | 1 | 8 |
| eellison | 91 | 13 | 2 | 45 |
| laithsakka | 113 | 16 | 2 | 18 |
| Lucaskabela | 62 | 15 | 0 | 63 |