Weekly Project News

Weekly GitHub Report for PyTorch: September 29, 2025 - October 06, 2025

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements, including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement that flips the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
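
The two headline API changes are easy to exercise in a few lines. A minimal sketch of both (the function body and checkpoint path are illustrative):

```python
import torch

# torch.compiler.set_stance (new in 2.6) changes how already-compiled
# functions behave without editing their call sites.
@torch.compile
def f(x):
    return x.sin() + x.cos()

torch.compiler.set_stance("force_eager")  # run f eagerly, skipping compilation
f(torch.randn(8))
torch.compiler.set_stance("default")      # restore normal compile behavior

# torch.load now defaults to weights_only=True: only tensors and other
# allowlisted types are deserialized unless the caller explicitly opts out.
torch.save({"w": torch.ones(2)}, "ckpt.pt")  # illustrative path
state = torch.load("ckpt.pt")                # weights_only=True by default
```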

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Failed to change backward stream: This issue discusses the challenge of running the backward pass of a PyTorch model on a different CUDA stream than the forward pass, which is currently constrained by PyTorch’s autograd engine design. The user wants to achieve fine-grained control over CUDA streams during backward computation, particularly to leverage a "green context" that partitions GPU resources for overlapping communication and computation, but finds that torch.cuda.set_stream() only affects the scope of a single autograd node in backward, limiting flexibility.

    • The comments clarify that PyTorch applies a stream guard per autograd Function, making global stream switching during backward impossible without extensive synchronization. The user explains their motivation related to FSDP performance optimization using green context streams, and after discussion, it is suggested that implementing custom backward functions or wrapping specific operators (like linear layers) to control streams locally is a more feasible and safer approach than trying to globally override backward streams (a sketch of this per-op pattern follows this list).
    • Number of comments this week: 15
  2. Unexpected recompilation under TorchDispatchMode with is_infra_mode set to True: This issue reports an unexpected excessive recompilation error triggered when using a custom TorchDispatchMode subclass with is_infra_mode set to True while running a compiled function under PyTorch's Dynamo compiler. The user demonstrates that wrapping the compiled function call inside this dispatch mode context causes the recompile limit to be hit, leading to a FailOnRecompileLimitHit exception, and seeks guidance on how to prevent Dynamo from tracing through the __torch_dispatch__ method during compilation while still intercepting execution.

    • The discussion in the comments revolves around diagnosing the cause of the recompilations using guard logs and tlparse output, an attempted fix that added ignore_compile_internals to the dispatch mode class but did not resolve the issue, and clarifications on how to avoid tracing inside __torch_dispatch__ during compilation, including a recommended pattern that asserts compilation is not active within __torch_dispatch__.
    • Number of comments this week: 8
  3. Segfault trying to import torch: This issue reports a segmentation fault occurring when importing the torch library in a non-CUDA build, traced to a recent code change affecting the handling of an enum type during Python bindings initialization. The fault appears related to infinite recursion between the repr() and int() methods of a specific enum, causing a stack overflow in debug builds but not in release builds, and the user suspects a type casting problem introduced by the change.

    • The comments discuss that the segfault is specific to debug builds and does not occur in release builds, with multiple users confirming similar deep stack overflow backtraces. Investigation reveals an infinite recursion between repr() and int() calls on an enum used in Python bindings, likely due to a type casting regression introduced by the recent pull request, and the consensus leans toward reverting the change to maintain debug build stability while debugging the issue further offline.
    • Number of comments this week: 8
  4. Official support for sm_120 (RTX 50-series / Blackwell) in stable PyTorch builds: This issue requests official support for the sm_120 architecture (RTX 50-series / Blackwell GPUs, such as the RTX 5070 Ti) in stable PyTorch builds, highlighting that while CUDA 12.8/12.9 and PyTorch nightly builds partially support sm_120, stable releases do not, causing failures in applications like DeepLabCut despite successful GPU computations. The user provides detailed reproduction steps and logs showing intermittent DLL initialization errors when importing PyTorch after DeepLabCut modules, emphasizing the need for stable sm_120 integration to enable productive use of RTX 50-series GPUs in research workflows.

    • The discussion centers on reproducing and diagnosing an intermittent DLL initialization failure ([WinError 1114]) occurring when PyTorch is imported after DeepLabCut on Windows 11 with RTX 5070 Ti using nightly CUDA 12.9 builds; stable builds do not exhibit this issue. Contributors request environment details and repro scripts, and the user provides comprehensive logs, environment info, and a minimal repro demonstrating that the failure is specific to the nightly build and the import order, suggesting a complex interaction between DeepLabCut’s dependencies and PyTorch’s CUDA DLL loading on sm_120 hardware.
    • Number of comments this week: 6
  5. qta and export onnx error: This issue reports a failure when exporting a quantized YOLOv8 model to ONNX format, specifically encountering a NotImplementedError related to the quantized::conv2d.new operator not being supported on the CPU or CUDA backends during export. The user shares detailed code for quantization-aware training (QAT) and ONNX export, and the error persists despite attempts to switch backends and adjust the export process, indicating a possible missing or unsupported quantized operator in the PyTorch build or backend.

    • The comments suggest trying different backends such as CUDA, but the error remains unresolved on both CPU and CUDA. A link to a related issue is shared for potential help, and there is a request for guidance on tagging the appropriate team responsible for quantization operations to address the problem.
    • Number of comments this week: 4
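
For the stream-control discussion in item 1, the locally-scoped approach the commenters recommend can be sketched as a custom autograd Function that owns its backward stream. StreamedLinear and its synchronization policy below are illustrative, not code from the issue:

```python
import torch

class StreamedLinear(torch.autograd.Function):
    """Illustrative sketch: run this op's backward on a caller-chosen stream."""

    @staticmethod
    def forward(ctx, x, weight, bwd_stream):
        ctx.save_for_backward(x, weight)
        ctx.bwd_stream = bwd_stream
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        s = ctx.bwd_stream
        s.wait_stream(torch.cuda.current_stream())  # order after upstream work
        with torch.cuda.stream(s):                  # compute grads on our stream
            grad_x = grad_out @ weight
            grad_w = grad_out.t() @ x
        torch.cuda.current_stream().wait_stream(s)  # hand results back safely
        return grad_x, grad_w, None

# Usage sketch (requires CUDA): each wrapped op synchronizes locally instead
# of the autograd engine switching streams globally.
if torch.cuda.is_available():
    side = torch.cuda.Stream()
    x = torch.randn(4, 8, device="cuda", requires_grad=True)
    w = torch.randn(16, 8, device="cuda", requires_grad=True)
    StreamedLinear.apply(x, w, side).sum().backward()
```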

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's Inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and includes a code snippet demonstrating the error triggered by compiling specific model components with torch.compile.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, potentially yielding significant speed improvements as demonstrated by the provided testing code (the underlying identity is verified in the sketch after this list).
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp having permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and directory permissions being correct.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling UFMT (a formatting tool) on all files in the PyTorch codebase by removing approximately 1,500 files currently excluded from formatting and applying consistent code style across them. It outlines the process for updating the .lintrunner.toml configuration, running the formatter, handling known edge cases that require preparatory fixes, and organizing the work by directory to facilitate manageable and reviewable pull requests.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are primarily for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on resource-constrained devices like mobile.
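
The MaxPool2D proposal in item 2 rests on a checkable identity: with stride 1, chaining two 3x3 max pools covers the same window as one 5x5 pool (effective kernel 3 + 3 - 1 = 5). A quick verification sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
ref = F.max_pool2d(x, kernel_size=5, stride=1)
alt = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                   kernel_size=3, stride=1)
assert torch.equal(ref, alt)  # same outputs, same 28x28 shape
```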

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 72

Summarized Issues:

  • Inductor Triton Backend Compilation Errors: Multiple issues report compilation failures in the Triton backend of PyTorch Inductor due to incompatible operand types or kernel configuration limits, causing kernel compilation errors or assertion failures during backward pass compilation. These errors prevent successful execution despite eager mode working correctly, highlighting type and size constraints in kernel generation.
  • [issues/164086, issues/164157, issues/164185]
  • Inductor Backend Runtime and Compilation Failures: Several issues describe runtime errors and assertion failures in the Inductor backend related to symbolic size usage before definition, invalid node outputs, and incorrect mask generation during compilation, leading to backend crashes or failed backward passes. These problems indicate challenges in graph compilation and symbolic shape handling within Inductor.
  • [issues/164088, issues/164186, issues/164428, issues/164465]
  • Dynamic Shapes and Export Issues: Problems arise when exporting models with dynamic shapes or simplifying dynamic shape expressions, causing constraint violations or assertion failures during export. These issues affect both torch.export and ONNX export workflows, especially with models using TransformersKwargs or dynamic shape dictionaries.
  • [issues/164313, issues/164375]
  • Backward Pass and CUDA Stream Control: Forcing the backward pass to run on a different CUDA stream than the forward pass is problematic because torch.cuda.set_stream only affects a single autograd node scope, making it difficult to globally control streams during backward. This limitation complicates overlapping communication and computation in FSDP with green context.
  • [issues/164094]
  • FlexAttention Module Bugs: The FlexAttention module has inconsistent tensor types returned by its create mask function, with boolean and float tensors reversed between score and mask modes. Additionally, the module fails with an IndexError when compiled with Inductor, and Dynamo graph tracing breaks due to unsupported id() calls in dynamic masking code.
  • [issues/164096, issues/164247, issues/164248]
  • Quantization and ONNX Export Failures: Exporting quantized models to ONNX fails due to unimplemented quantized convolution operators on CPU and CUDA backends, causing NotImplementedErrors after quantization-aware training. This limits deployment options for quantized YOLOv8 models.
  • [issues/164116]
  • TorchDispatchMode and Dynamo Recompilation Issues: Using a custom TorchDispatchMode subclass with is_infra_mode=True causes excessive recompilations in the Dynamo compiler, hitting recompile limits and resulting in failures, whereas running outside the dispatch mode avoids this problem.
  • [issues/164119]
  • Sparse Operations and CUDA Support Limitations: The torch.sparse.spsolve function is unavailable on CUDA due to missing cuDSS library support, causing runtime errors when solving sparse linear systems on CUDA devices. Additionally, sparse layout copying fails in GradTrackingTensor wrappers, leading to runtime errors during vector-Jacobian product computations.
  • [issues/164122, issues/164286]
  • Kernel Autotuning and Caching Bugs: Autotuning Triton kernels for multiple input shapes without an existing cache compiles multiple kernels but uses incorrect kernel configurations on the first run, only selecting correct ones after caching, causing inefficiencies and potential incorrect executions.
  • [issues/164124]
  • MPS Backend Numerical and Printing Issues: The MPS backend produces all NaN outputs for conv_transpose3d when weights contain inf values, unlike CPU backend results. Also, printing large tensors on MPS fails due to dimension limits in MPSGraph, requiring alternative implementations for operations like torch.cat.
  • [issues/164125, issues/164415]
  • Profiler and Anomaly Detection Enhancements: The PyTorch profiler needs improvements to better support bitwise equivalence verification between eager and aot_eager modes by adding documentation and consistent event listings. Additionally, anomaly detection mode should populate backward stacks with stack traces using torch.profiler.profile(with_stack=True) to enhance backward pass profiling.
  • [issues/164145, issues/164331]
  • CI and Benchmarking Workflow Failures: Operator benchmark workflows fail due to outdated baseline numbers requiring updates, and the @pytorchbot merge -f command fails to force merge pull requests because it waits for multiple CI jobs, blocking merges. There is also a request for notification alerts for operator benchmark regressions due to infrequent testing schedules.
  • [issues/164089, issues/164154, issues/164197]
  • Graph Fusion and Input Ordering Bugs: The fuse_as_graphmodule function in Executorch partitioner changes input order of partitioned ExportedPrograms due to node visitation order, causing inputs to be reordered unexpectedly, which can break downstream code relying on input order.
  • [issues/164204]
  • LibTorch CUDA Memory Allocator Thread Safety Concerns: Concurrent multi-threaded tensor allocations in LibTorch on Windows with CUDA 11.7 cause intermittent illegal memory access errors, raising concerns about the thread safety of LibTorch’s CUDA memory allocator or potential user code issues.
  • [issues/164208]
  • Test Failures Due to Missing Imports and Platform Issues: The test_custom_ops module fails due to ImportError for make_fx from functorch, and the HelionTests suite disables a kernel test on ROCm due to unknown failures linked to a pytorchbot revert.
  • [issues/164220, issues/164239]
  • Distributed Checkpointing and Optimizer State Bugs: Distributed checkpointing fails with missing internal optimizer states when optimizers include parameters unused in loss calculation, causing errors during checkpoint loading even with non-strict options. Also, torch.distributed.checkpoint.save_async fails with AttributeError in single-rank environments due to missing storage_meta attribute.
  • [issues/164179, issues/164257]
  • FSDP2 CUDA Graph and Memory Issues: FSDP2 lacks robust CUDA graph support due to unsafe pinned host memory allocation, limitations with make_graphed_callables API for modules with hooks, and stream synchronization failures, causing host latency bottlenecks in workloads using FSDP2.
  • [issues/164264]
  • Documentation and Usability Requests: Requests include adding comprehensive documentation for symmetric memory implementations, adding a lightweight tqdm-based progress bar utility to torch.utils, and improving error message clarity for sharding propagation failures in DTensor.
  • [issues/164281, issues/164360, issues/164543]
  • Memory Leaks and Segmentation Faults: A long-standing memory leak exists in the MPS backend due to unusual memory freeing logic, and a segmentation fault occurs when importing PyTorch in a non-CUDA debug build on Ubuntu due to infinite recursion between enum methods during Python bindings setup.
  • [issues/164299, issues/164297]
  • CUDA Graph Capture and Stream Dependency Conflicts: Mixing CUDA graph capture with non-CUDA graph code causes autograd backward failures due to illegal stream dependencies on the default CUDA stream, requiring special handling of the null stream in the autograd engine to avoid conflicts.
  • [issues/164302]
  • Model Export and ONNX Decomposition Errors: Exporting models with custom TorchDispatchMode applying quantize-dequantize operations fails with SpecViolationError due to missing constant tensors during ONNX export validation, blocking model deployment workflows.
  • [issues/164461]
  • Numerical Accuracy and Precision Failures: Accuracy failures occur in sum-reduction tests on NVIDIA A100 GPUs with CUDA 12.8, where float16 and float32 reductions exceed error thresholds, and numerical mismatches appear in Inductor backend on Blackwell machines due to kernel selection differences.
  • [issues/164249, issues/164563]
  • Matrix Multiplication and Sparse Backward Support Bugs: Internal functions _int_mm and _scaled_mm mishandle row-major right-hand side matrices, with _scaled_mm raising errors and _int_mm being inefficient, while sparse backward pass for CSR x dense with bfloat16 fails due to missing kernel implementation, causing NotImplementedErrors.
  • [issues/164491, issues/164666]
  • Distributed Process Group API Enhancement Proposal: A new Process Group API shrink_group() is proposed to improve fault tolerance by excluding faulty ranks without requiring all ranks to participate, enhancing distributed training flexibility and performance.
  • [issues/164529]
  • Compilation Cache and Dynamo Reset Issues: Dynamo compilation cache corruption causes incorrect outputs in compiled mode for batch sizes greater than one, which can be fixed by resetting Dynamo between runs or using nightly builds (the reset workaround is sketched after this list).
  • [issues/164608]
  • Build and Dependency Failures: Nightly Docker builds fail due to conda update errors related to solver backends and import issues, prompting proposals to remove conda dependency from official Docker images. Also, updating FlashInfer version for vLLM tests faces challenges with missing dependencies and CUDA driver linking on CPU-only runners.
  • [issues/164574, issues/164562]
  • GPU Architecture Support and Compatibility Issues: Requests for official support of sm_120 architecture (RTX 50-series / Blackwell GPUs) in stable PyTorch builds highlight intermittent DLL initialization failures on Windows. Additionally, running ComfyUI workflows on ROCm 7.0 with AMD gfx1151 GPUs results in invalid device function errors due to incomplete ROCm support.
  • [issues/164342, issues/164346]
  • Miscellaneous Bug Reports and Proposals: Other issues include removing obsolete CUDA bug workarounds, adopting Spin as a unified developer CLI, fixing outdated CONTRIBUTING.md references, and improving error messages for non-contiguous out= tensor usage.
  • [issues/164348, issues/164469, issues/164478, issues/164555]
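
For the cache-corruption workaround mentioned under "Compilation Cache and Dynamo Reset Issues", torch._dynamo.reset() clears Dynamo's compiled-graph caches so the next call retraces from scratch; the function below is illustrative:

```python
import torch

@torch.compile
def step(x):
    return x * 2 + 1

step(torch.randn(4))   # compiles and caches a graph
torch._dynamo.reset()  # drop all cached graphs and guards
step(torch.randn(8))   # recompiles from scratch on the next call
```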

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 23

Summarized Issues:

  • Test Failures and Disabling on Specific Platforms: Multiple tests on the XPU and ROCm platforms were consistently failing on the main branch and were disabled to maintain stability. These include conv1d- and conv3d-related tests that were problematic across different hardware backends.
  • [issues/164097, issues/164098, issues/164099, issues/164100, issues/164137]
  • Typographical Errors in Code Comments: There are minor typographical errors in comments within the PyTorch codebase, such as "registraion" instead of "registration" and "Unitialized" instead of "Uninitialized," which affect documentation clarity. These issues highlight the need for careful proofreading to improve code readability.
  • [issues/164071, issues/164483]
  • PyTorch Version and ROCm Compatibility Updates: Users inquired about the release timeline for PyTorch versions compatible with ROCm 7.0, with updates provided on nightly wheel builds and eventual publication. This reflects ongoing efforts to support ROCm hardware with up-to-date PyTorch releases.
  • [issues/164074, issues/164150]
  • PyTorch Compilation and Runtime Errors: Several issues report errors during PyTorch compilation or runtime, including a TypeError with 'fp32' during backward pass compilation, ImportError due to corrupted compiled files, and a pipeline graph error causing dtype mismatches in distributed pipelining. These problems affect model training and execution reliability.
  • [issues/164063, issues/164203, issues/164486]
  • Performance Regression in TorchInductor Code Generation: A significant slowdown was observed in torch.compile inductor code generation for mxfp8 quantization on NVIDIA B200 GPUs, traced to a specific change and resolved by reverting the related pull request. This regression impacted performance on CUDA 12.8 environments.
  • [issues/164301]
  • Documentation and User Guidance Feedback: Positive feedback was given on the clarity of the Meta device documentation, while confusion was expressed regarding differences between CPU and GPU PyTorch versions on conda channels and installation recommendations. This indicates varying user experiences with documentation and installation processes.
  • [issues/164076, issues/164536]
  • Feature Limitations and Suggestions: Enabling DebugMode silently disables torch.compile without raising an error, suggesting a need for error messaging or compatibility improvements. Additionally, using functions wrapped with functools.partial as context_fn in checkpointing is unsupported, causing integration issues with dynamo and sac components.
  • [issues/164143, issues/164300]
  • Build and Infrastructure Issues: The Docker build for the vllm test failed due to flashinfer problems, marking the build unstable, and B200 runners experienced offline status due to routing and switch configuration errors, causing job queuing delays. These infrastructure problems affect continuous integration and testing workflows.
  • [issues/164362, issues/164283]
  • Symbolic Computation Bug: The FloorDiv operation incorrectly generates a SymPy rational expression instead of an integer floor division result when applied to complex symbolic expressions, indicating a bug in symbolic arithmetic handling (the expected floor semantics are illustrated after this list).
  • [issues/164385]
  • Project Feature Testing: The autorevert functionality was disabled temporarily to test this feature, reflecting ongoing experimentation with project automation tools.
  • [issues/164148]
  • ONNX Model Export Inquiry: A user requested guidance on exporting an ONNX model after training with libtorch, indicating interest in model interoperability and deployment workflows.
  • [issues/164133]
  • PyTorch Wheel Index Hash Issues: The PyTorch wheel index for versions after 2.6, including 2.8.0, lacked SHA256 hashes for some packages, causing downstream specification violations that were fixed by recalculating and updating the missing hashes.
  • [issues/164347]
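
The expectation behind the "Symbolic Computation Bug" entry is that a symbolic floor division should evaluate like Python's //, never to a SymPy Rational. A small illustration using plain SymPy (PyTorch's own FloorDiv helper is not shown here):

```python
import sympy

a, b = sympy.symbols("a b", integer=True, positive=True)
print(sympy.floor(a / b))  # stays symbolic: floor(a/b)

# On concrete values, floor division must match Python's integer semantics.
print(7 // 2)                             # 3
print(sympy.floor(sympy.Rational(7, 2)))  # 3, not the rational 7/2
```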

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 236

Key Open Pull Requests

1. Schema error clarity (#122129): This pull request improves the clarity of schema-related error messages by enhancing the representation of iterable contents in RuntimeErrors, ensuring that the internal types of iterables are accurately displayed for better debugging, while maintaining friendly naming conventions and adding comprehensive tests to cover various iterable scenarios.

  • URL: pull/164476
  • Merged: No
  • Associated Commits: 87bf5, efb7c, 6f5be, b747f, a657d, 00131, 2315a, c1c01, fd5ae, cdf70, f7df3, da3bf, 023b4, bfaa8, 2c1c9

2. Move version.h to torch/headeronly: This pull request proposes relocating the version.h file to the torch/headeronly directory within the PyTorch project to improve header-only organization.

  • URL: pull/164381
  • Merged: No
  • Associated Commits: 52cc8, d3592, 24574, 82b5c, 0189a, 019f5, fd651, 9fda8, 54ac7, 4becb, 55928, 45374, ce4cf

3. [Inductor] Enable Custom op Autotune Decompositions: This pull request introduces a new autotuning feature for custom operations in Inductor, enabling users to provide multiple decomposition implementations for a custom op and automatically select the best performing variant via benchmarking, with fallback to the default implementation if all decompositions fail, supported by a new CustomOpTemplate class, an autotune_custom_op API, intelligent layout inference, and comprehensive tests validating numerical equivalence and performance across different implementations.

  • URL: pull/164212
  • Merged: No
  • Associated Commits: 93031, 807e3, 72062, 46759, ce751, 75bf7, ffc17, aae72, 6efa5, 579ff, ea6d1, 13bed

Other Open Pull Requests

  • CUDA Bug Workaround Removal: This pull request proposes removing a previously implemented workaround for an old CUDA bug to verify whether the workaround is still necessary, following up on a related GitHub issue. The change aims to clean up legacy code that may no longer be required.
    [pull/164354]
  • Codebase Readability and Refactoring: Multiple pull requests focus on improving code organization and readability by applying clang-tidy fixes to JIT source files, moving functions like toString(ScalarType) and ScalarType ostream operator into header-only files, and converting AT_FORALL_... macros to header-only format. These changes aim to enhance maintainability and potentially improve compilation efficiency.
    [pull/164652, pull/164405, pull/164350]
  • StableIValue ABI Compatibility: This pull request adds scaffolding for StableIValue forward and backward compatibility by introducing parameters to handle ABI versioning and providing helper methods to distinguish calls from libtorch versus extension code. It does not yet include a linter for ABI changes but updates related functions and callsites accordingly.
    [pull/164332]
  • Backward Pass Implementation: This pull request introduces and iteratively updates the implementation of the backward pass ("bwd pass") in the PyTorch codebase, refining this feature through multiple commits.
    [pull/164504]
  • Build Process Simplification for AOTI Inference: This pull request rewrites the CMakeLists.txt for test_aoti_inference to enable a one-pass build that compiles models at test time using AOTI, eliminating the previous two-pass build procedure and updating the CI test script accordingly.
    [pull/164277]
  • Graph Break Logging Enhancements: This pull request adds the most recent 20 bytecode instructions or the most viable user frame bytecode to the graph break log in PyTorch's Dynamo when a graph break is triggered by the user. It improves understanding of bytecode transformation and symbolic conversion during non-error graph breaks while addressing testing challenges related to Python version-dependent bytecode.
    [pull/164422]
  • Scaled Matrix Multiplication API and Implementation: Two pull requests introduce a new _scaled_mm_v2 API to improve and future-proof scaled matrix multiplication with multi-level scaling and refactored dispatch logic, and add the torch.quantization.scaled_mm Python API as an abstraction over existing C++ methods, wrapping the newer API by default with tests included.
    [pull/164141, pull/164142]
  • Pipeline Parallelism BlockMask Splitting: This pull request modifies the pipeline parallelism implementation by splitting the BlockMask tensor into smaller micro-BlockMask tensors and adjusts the batch index input for the mask_mod function by wrapping it in a closure to ensure correct behavior after the split.
    [pull/164111]
  • CompositeImplicitAutograd Removal for _fused_rms_norm: This pull request removes the CompositeImplicitAutograd registration from the _fused_rms_norm operator to prevent unwanted decomposition by the Functionalize pass, addressing bitwise equivalence failures and simplifying backend handling.
    [pull/164289]
  • DTensor Local Tensor Mutation Support: This pull request enables mutation operations on DTensor.local_tensor by implementing it as a real operator, similar to approaches in the NJT workstream, though it currently cannot support grad placements without representing placements in the JIT schema.
    [pull/164359]
  • Thor Testing Fixes and Improvements: This pull request introduces small fixes and improvements for Thor testing, including updates to test_matmul_cuda.py, adjustments to inductor utilities from NVIDIA internal CI, addition of SM version control based on CUDA 13 or higher, removal of redundant test conditions, and simplification of Thor SM logic.
    [pull/164379]
  • Kernel Status and API Updates: This pull request addresses multiple kernel status updates as of October 3rd, including fixes to compiled autograd tests, API updates, and integration of changes from the main branch and a feature branch related to dtensor shape metadata guards.
    [pull/164633]
  • True Division Numeric Discrepancy Fix: This pull request fixes numeric discrepancies in the true division operation between PyTorch's eager execution mode and compiled mode, addressing a reported issue.
    [pull/164144]
  • Stride Computation Semantics Update: This pull request refines the behavior of the compute_elementwise_output_strides function and clone_meta to better handle meta tensor strides, especially in the single-tensor case, while renaming and adjusting related utility functions and tests for improved clarity and correctness.
    [pull/164252]
  • Einops Compatibility Patch: This pull request patches functions wrapped with lru_cache to avoid a TypeError caused by the missing hash method on SymInt, addressing a compatibility issue between einops version 0.6.1 and PyTorch nightlies using torch.compile (the general failure mode is illustrated after this list).
    [pull/164564]
  • Inductor Overlap Enhancement: This pull request enhances the inductor component by respecting the planned overlap defined in ATen and utilizing newly introduced implicit dependencies to improve communication and computation overlap.
    [pull/164569]
  • Pyrefly Typechecker Suppressions: This pull request adds and cleans up suppressions for the Pyrefly typechecker to ensure a clean typecheck with zero errors, improving type safety and linting.
    [pull/164615]
  • All-to-All Exchange Plan API: Two pull requests introduce a new API called make_exchange_plan to create reusable plans for all-to-all variable splits and offsets, and enhance the all_to_all_v API to accept an ExchangePlan, enabling efficient multiple data exchanges without additional metadata costs.
    [pull/164164, pull/164181]
  • NCCL Scatter Integration: This pull request integrates the ncclScatter operation into the PyTorch distributed communication package (c10d) to enhance collective communication capabilities.
    [pull/164267]
  • Unbacked Symbolic Integer Allocation Fixes: This pull request fixes issues related to allocation of unbacked symbolic integers during select_scatter and setitem operations by ignoring irrelevant unbacked symbols and ensuring strides are materialized before access, resolving bugs documented in multiple GitHub issues.
    [pull/164341]
  • TORCH_TARGET_VERSION Macro Introduction: This pull request introduces the TORCH_TARGET_VERSION macro to support a stable ABI and enables version comparisons via the preprocessor, while noting the need to gate this feature and determine enforcement mechanisms.
    [pull/164356]
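
The einops patch above runs into a general functools.lru_cache constraint: every argument must be hashable, and SymInt lacked a usable hash under torch.compile. A generic illustration of the failure mode (the Unhashable class is a stand-in for SymInt):

```python
import functools

@functools.lru_cache(maxsize=None)
def cached(n):
    return n * 2

class Unhashable:
    __hash__ = None  # mimics a type without a usable hash, like SymInt

print(cached(3))     # fine: int is hashable
try:
    cached(Unhashable())
except TypeError as e:
    print(e)         # unhashable type: 'Unhashable'
```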

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 236

Key Closed Pull Requests

1. [AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-24: This pull request is an autogenerated integration update for the rocm7.1_internal_testing branch of PyTorch, incorporating multiple upstream changes and fixes related to ROCm support, including build updates, test skips, performance improvements, and compatibility fixes targeting ROCm 7.1 and associated internal testing workflows as of September 24, 2025.

  • URL: pull/164327
  • Merged: No
  • Associated Commits: 2e0b1, 1f8eb, 8a7fd, 97f3d, 550bc, e7cb7, f61af, 0fd19, 167b4, 06da6, 0412e, 123a1, 2ee3a, a95ad, f070d, bef73, 95105, 0036d, 6c0d1, 0dd76, eb265, c20a8, 6894b, baf34, 51916, 3d6ba, 1a5a7, c113e, 78867, be308, ab8a9, cc13b, 63cbb, 5286c, 9d8f0, 79fa0, 1dea6, dec5b, 81e75, a771d, 2fbd2, 15f91, f7b26, 73cf3, 222ae, ec0c5, 45e1d, bb655, e4c1c, 45985, d37c4, 3a570, 46344, be4f8, befce, 1aa5d, aef0f, 5b344, dc726, b345d, bbae9, fa9fa, 9e184, 08da4, 0b79e, f1ad4, cf324, 13a86, 3057d, a0a9d, 8ffba, 80cca, 347ef, 944be, cc2a6, a7d3b, e9093, 5dd3d, afe8b, c3f75, 24dfd, c7f61, d9e68, 7435c, 2d567, a8d8a, fc804, ef94e, 0d083, a97f4, 89423, 4eaa5, 07077, 84fdb, 622c1, 345b9, 3d404, 5a29b, 54625, 1ce55, fc180, d36b5, 8e494, 377ae, d5c98, 97f8a, 1897a, 36e36, c375f, c1ee5, 23c08, 69a4c, 970a2, 5ec8b, ee4a7, 291c5, 3322e, d533e, 6e3be, 6e6e4, b65a0, 08daf, 12580, 45f88, 93fa9, 0ea05, 14550, 28f82, 88a9b, d6ae9, 272d5, 681e6, ab557, ff2ba, dba85, 9a66f, 1c576, 9e7df, 492e2, 3004b, d1c2f, 99400, de591, c762a, 48cac, b4019, f3e82, 06998, 0ad83, 63fcd, 77f45, f3ab7, 34403, 34c42, da56f, ebd29

2. Fix assertion error: This pull request addresses an assertion error in a unit test by adding a check that detects whether an XPU device is available and skips the assertion accordingly, since, unlike CUDA, the XCCL backend on XPU implicitly synchronizes during certain tensor operations, preventing output mismatches.

  • URL: pull/164160
  • Merged: No
  • Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55, 7dade, 580aa, c0f57, 6dedb, 20d07, 124ff, 0bea1, 7409a, 636cb, 0d5a8, 3826e, cb711, d6cd1, 624be, 8d8c5, 1cf78, 41475, 0628c, b0d93, 58eb8, e558e, 83ac5, 0e7a7, a2b2f, 8de00, 91f5d, 39e6c, 06d6c, a0590, 31ddf, 9a6df, 50cb9, 08559, f6a8c, dc0be, 9fd7e, cce98, cf3bf, e3957, cd29a, 7b81c, 42401, 1dee9, 815db, bae7d, c283c, f7846, af361, 22cc1, 482f1, 9818e, 57d20, 2a5f3, 9f0c5, bb73a, ec4ea, 084ee, 70689, 491cc, 4a2d0, a6164, 12663, a03d1, 7da46, cfe26, 4d68d, 4c74d, 42219, 267bd, 36291, b6a17, 98cbe, 7245c, 878e1, 948b9, 2176c, f632e, 84ccf, 97d4c, 07bf4, 1e846, 493b2, 9b3f5, bf100, a0522, 6c84e, d7fcd, 14f79, da233, 04b39, 95ef9, ab42f, a1c29, 62c60, 94863, 0bb6d, 79c00, c8938, 976a6, 35462

3. Fix vllm build issue: This pull request attempts to fix the vllm build issue in the PyTorch project by adding multiple configuration changes and tweaks, although it was ultimately not merged.

  • URL: pull/164361
  • Merged: No
  • Associated Commits: 1da3d, 73c23, 8cd19, 8746e, 5a722, 36201, 019d9, 03d7c, 73995, 1b278, 700d6, 98e55, 58478, 3f3d8, e6d33, 02d16, 257bf, 3ad3d, ef473, 813ca, 2f5c2, 95654, 4efdd, b54dc, cac72, 517f2, 2f10f

Other Closed Pull Requests

  • torchfuzz enhancements and additions: Multiple pull requests propose adding new operators and functionalities to the torchfuzz testing framework, including layout operators, matrix multiplication, neural network functional operations, and operator statistics tracking. These changes aim to expand the coverage and monitoring capabilities of torchfuzz, with some proposals not merged.
    • pull/164210, pull/164284, pull/164434, pull/164334, pull/164397
  • Dynamo improvements and fixes: Several pull requests focus on enhancing the PyTorch Dynamo component by adding missing trace rules, implementing a special cloning path for torch dispatch tensors, preventing StackRef compilation on Windows, and improving export functionality including handling of torch.compile and flex attention export. These changes improve tracing, execution, and export reliability in Dynamo.
    • pull/164080, pull/164081, pull/164400, pull/164171
  • ROCm and benchmarking updates: Pull requests propose adding ROCm support for operator microbenchmarks and new Dynamo benchmarks for the inductor-periodic feature on ROCm, along with CI accuracy and tolerance updates. These efforts aim to improve testing and benchmarking on AMD GPU platforms.
    • pull/164173, pull/164279
  • Type checking and linting improvements: Pull requests add suppressions for the Pyrefly type checker to ensure zero errors and apply various ruff linting rules to fix style issues such as unnecessary casts and formatting inconsistencies. These changes enhance code quality and maintainability.
    • pull/164513, pull/164333, pull/164460
  • Bug fixes in core operations: Fixes include addressing a bug in FloorDiv optimization to avoid incorrect non-integer rational results, adding input checks for out_dtype variants of bmm and baddbmm to fix silent incorrectness, and resolving symnode tracking issues in aot_stage1_graph_capture to unblock internal training workflows. These fixes improve correctness and stability in core PyTorch operations.
    • pull/164398, pull/164095, pull/164113
  • Export and module support enhancements: Pull requests add support for exporting unspecialized neural network modules within the hops framework and improve printing support for tensor subclasses like DTensor in dynamo graph inputs. These changes enhance export capabilities and debugging clarity.
    • pull/164082, pull/164403
  • Code style and warning fixes: A pull request fixes ruff warnings by applying SIM rules, removing unnecessary casts, empty else statements, and simplifying boolean expressions to improve code clarity and consistency.
    • pull/164460
  • Custom operator registration: One pull request adds registration for the _varlen_attn() custom operator, including a test that verifies its invocation and demonstrates failure without registration. This adds new operator support with validation.
    • pull/164406
  • Unmerged proposals for infrastructure and data types: Some pull requests propose adding a CUDA release architecture matrix and updating mask data types, but these were not merged. These proposals aimed to improve release transparency and data type handling.
    • pull/164471, pull/164472

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
bobrenjc93           316             57      34        25
malfet               105             17       8        94
ezyang                73             19      11       121
Skylion007            15              9       1       186
laithsakka           125             23       8        41
cyyever              121             63       0        13
kwen2501             115             23       7        46
tugsbayasgalan       115             28       3        29
anijain2305          100             30       3         9
fduwjj                83             15       0        43
