Weekly Project News


Weekly GitHub Report for PyTorch: November 10, 2025 - November 17, 2025

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, a security-focused backward-incompatible change flipping the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.
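As a quick illustration of two of the changes above, the sketch below saves and reloads a checkpoint under the new torch.load default and toggles a compile stance. This is a minimal sketch, not code from the release notes; the file name and function are placeholders.

```python
import torch

# PyTorch 2.6 flips torch.load's default to weights_only=True, which
# restricts unpickling to tensors and other allowlisted types; passing
# it explicitly keeps the intent visible.
torch.save({"w": torch.randn(4)}, "checkpoint.pt")  # placeholder file
state_dict = torch.load("checkpoint.pt", weights_only=True)

# torch.compiler.set_stance (new in 2.6) changes how torch.compile
# behaves without touching the compiled function, e.g. forcing eager
# execution while debugging.
torch.compiler.set_stance("force_eager")

@torch.compile
def f(x):
    return torch.sin(x) + 1

f(torch.randn(8))                     # runs eagerly under this stance
torch.compiler.set_stance("default")  # restore normal compilation
```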

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [CUDA][Distributed][Symmetric Memory] _SymmetricMemory.rendezvous fails on DGX B200: This issue reports a failure of the Torch symmetric memory rendezvous function during initialization on a DGX B200 system, resulting in a CUDA driver error indicating that the system is not yet initialized. The problem appears specific to the B200 hardware and involves multicast operations in symmetric memory, with suggested workarounds including disabling multicast or rebooting the machine, but the root cause may relate to system setup, kernel issues, or fabric manager configuration.

    • The discussion involved attempts to reproduce the issue on other hardware, confirmation that it is likely specific to the DGX B200, and troubleshooting steps such as disabling multicast and enabling detailed stack traces. Users shared logs showing the CUDA driver error, tried workarounds like pre-allocating CUDA tensors and barriers, and concluded that the problem might stem from system-level issues requiring a reboot or OS/kernel upgrade, with recommendations to consult system administrators or NVIDIA support.
    • Number of comments this week: 13
  2. Umbrella bug: failures in basic custom MemPool patterns: This issue documents multiple failures and problematic patterns encountered when using custom CUDA memory pools (MemPools) in PyTorch, particularly focusing on temporary MemPool objects and nested MemPool contexts. It highlights that temporary MemPool objects lead to unsafe memory frees and that nested MemPool usage triggers internal assertions, while discussing potential solutions such as associating private pools with active MemPools at creation time and improving tracking of nested MemPool states (a usage sketch follows this list).

    • The comments clarify that some observed failures stem from misunderstandings about the caching allocator’s behavior, emphasizing that memory is only freed back to the system upon explicit calls to empty_cache rather than tensor deletion. There is consensus that temporary MemPool objects complicate memory management and that nested MemPool handling requires untangling from CUDA graph capture logic, with suggestions to expose empty_cache functionality for specific MemPools to better control memory release without imposing implicit synchronization on tensor deletion.
    • Number of comments this week: 13
  3. [CI][CUDA] test_scaled_matmul_cuda.py test_blockwise_mxfp8_nvfp4_mxfp4_numerics_test_case_name_a_eye_b_eye_fast_accum_False_128_128_128_recipe_mxfp4_cuda Failures: This issue reports a failure of the mxfp4 recipe unit tests on the RTX5090 GPU after enabling the mxfp4 recipe on CUDA, despite the tests passing on the B200 GPU. The failure is due to "Cutlass cannot initialize," and the discussion focuses on whether to disable the mxfp4 routines on non-B200 devices and verifying compatibility on other GPUs like the GB300.

    • The comments suggest temporarily disabling the MXFP4 routines on non-B200 GPUs to avoid failures, with a proposed pull request to implement this. Testing on the GB300 (SM103) showed the tests pass, helping to determine the scope of supported devices, and a skip test was added for SM120 or later architectures to prevent the failure on RTX5090.
    • Number of comments this week: 5
  4. 'import torch' fails when pytorch is built with onnx-1.19.1: Undefined symbol "_ZN4onnx29_GraphProto_default_instance_E": This issue describes a failure when importing the PyTorch library built with ONNX version 1.19.1, resulting in an undefined symbol error related to "_ZN4onnx29_GraphProto_default_instance_E". The problem occurs specifically with PyTorch version 2.9.0 on FreeBSD 15 STABLE, indicating a compatibility or build issue between PyTorch and the ONNX 1.19.1 library.

    • The comments clarify that ONNX 1.19.1 causes the build failure because it hides a necessary symbol while still exposing it to users, leading to import errors in PyTorch. A patch to ONNX's build system is suggested to resolve the issue, and a related ONNX issue is referenced for further context.
    • Number of comments this week: 3
  5. significant torchbench regression in cudagraphs configuration: This issue reports a significant performance regression in the torchbench benchmarks specifically related to the cudagraphs configuration observed over the last three days. The regression is linked to recent changes in the codebase, with one pull request identified as the cause and subsequently reverted, while caution is advised regarding another pending pull request to avoid similar regressions.

    • The comments confirm that a specific pull request caused the regression and that reverting it should restore performance. There is also a request for assistance to verify that a new pull request does not introduce further regressions, including guidance on how to run torchbench on that PR.
    • Number of comments this week: 3
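On item 2, the patterns under discussion look roughly like the sketch below: a private torch.cuda.MemPool serves allocations inside a context, deleting a tensor only returns its block to the pool, and memory reaches the system again only on an explicit empty_cache(). This is a minimal sketch of the still-experimental MemPool API, assuming a CUDA build; it is not the reproducer from the issue.

```python
import torch

if torch.cuda.is_available():
    pool = torch.cuda.MemPool()  # private allocation pool (experimental API)

    # Allocations inside this context come from `pool` instead of the
    # default caching-allocator pool.
    with torch.cuda.use_mem_pool(pool):
        x = torch.randn(1024, 1024, device="cuda")

    del x  # frees the block back to the pool, not to the system

    # Per the discussion, memory is only released to the driver on an
    # explicit empty_cache(); a *temporary* pool that goes out of scope
    # while its blocks are live is one of the unsafe patterns the issue
    # documents.
    torch.cuda.empty_cache()
```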

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and demonstrates the error occurring in a pipeline setup that uses torch.compile on custom UNet modules.
  2. Alternate algorithm for computing MaxPool2D under specific condition: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3, which reduces the computational cost per cell. The suggested modification targets the MaxPool2D layer itself to avoid additional backpropagation overhead and is expected to yield faster performance specifically on CPU; a numerical check of the equivalence appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling UFMT (a formatting tool) on all files within the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from formatting. It outlines the process for removing files from the exclusion list, running the formatter, and managing preparatory fixes for known problems, while also providing a detailed worklist organized by directory to coordinate and track progress on this large-scale formatting effort.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, specifically .debug_pkl files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, while useful for debugging, significantly increase the model size without affecting runtime correctness, which is particularly important for deploying smaller models on mobile devices.
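Item 2 above rests on a composition identity: with stride 1, two stacked 3x3 max pools select the maximum over the same 5x5 window as a single 5x5 pool. A quick numerical check of that claim:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# One 5x5 max pool with stride 1 ...
big = nn.MaxPool2d(kernel_size=5, stride=1)(x)

# ... equals two stacked 3x3 max pools with stride 1, since the max
# over a 5x5 window is the max of its overlapping 3x3 maxima.
small = nn.MaxPool2d(kernel_size=3, stride=1)
composed = small(small(x))

assert torch.equal(big, composed)  # both are (1, 1, 28, 28)
```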

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 111

Summarized Issues:

  • Tensor stride and layout inconsistencies: Multiple issues report problems related to tensor stride mismatches and layout inconsistencies during operations such as all_reduce, FFT, and complex tensor reconstruction. These mismatches cause errors in compiled code paths, leading to runtime failures or incorrect behavior when fake or meta kernels handle strides differently from eager execution.
  • [issues/167430, issues/167636, issues/167641]
  • DTensor and distributed tensor bugs: Several issues highlight bugs in the DTensor library including API registration failures with TensorList arguments, assertEqual incompatibilities with DTensor objects, redistribution placement misreporting in DebugMode, and tracing failures with redistribute operations. These problems cause test failures, runtime errors, and difficulties in debugging distributed tensor workflows.
  • [issues/167435, issues/167549, issues/167655, issues/167657]
  • Torch.compile and Dynamo recompilation and performance issues: There are multiple reports of recompilation problems due to hardcoded thread counts, missing recompilation reasons, excessive recompilations in loops, and missing input type names in debug logs. Additionally, performance regressions and slowdowns are noted when using features like parametrize.cached() with torch.export, and issues with guard creation causing unnecessary recompilations.
  • [issues/167453, issues/167459, issues/167463, issues/167504, issues/167566, issues/167596, issues/167645]
  • Documentation and tutorial inaccuracies or missing content: Issues include broken links in tutorials, missing documentation for constants like torch.pi and torch.e, absent instructions for Inductor CUTLASS backend activation, and outdated or incomplete autograd.Function ctx usage guidance. These gaps hinder user understanding and proper usage of PyTorch features.
  • [issues/167437, issues/167526, issues/167535, issues/167843]
  • Memory management and leaks: Problems are reported with memory-related APIs not supported on MPS devices, memory leaks during compilation of exported programs, and issues with custom CUDA memory pools causing unsafe frees and assertion failures. These issues affect memory efficiency and stability during training and compilation.
  • [issues/167447, issues/167630, issues/167745]
  • Build, compatibility, and platform-specific failures: Several issues describe build failures on architectures like aarch64 due to outdated dependencies, incompatibilities with ONNX versions causing import errors, and test failures on ROCm and XPU platforms. These platform-specific problems block successful deployment and testing on diverse hardware.
  • [issues/167525, issues/167642, issues/167616, issues/167793, issues/167881]
  • Inductor and backend-specific bugs and regressions: Inductor backend tests show multiple failures including segmentation faults, accuracy regressions, and crashes when enabling combo kernels. Performance regressions in torchbench benchmarks and issues with Triton kernel compilation on Windows due to path length limits are also reported.
  • [issues/167559, issues/167561, issues/167780, issues/167891, issues/167892]
  • Export and serialization issues: Problems with torch.export include inability to handle dynamic string keys in ModuleDict, lack of multiple entry point exports, and failures loading exported programs due to missing deserialization attributes. These limit the usability of the export system for complex models.
  • [issues/167719, issues/167631, issues/167872]
  • Profiling and debugging enhancements requested: Requests include adding line number tracking in profiler call stacks, improving Dynamo recompile reason messages with full stack traces, and better tensor identity tracking in DebugMode to aid debugging and performance analysis.
  • [issues/167605, issues/167898, issues/167656]
  • Test suite stability and CI infrastructure issues: Multiple tests are disabled due to consistent failures on XPU and ROCm platforms, and CI workload imbalance caused by incomplete test timing data leads to timeouts and inefficient resource use.
  • [issues/167807, issues/167808, issues/167809, issues/167810, issues/167746, issues/167616]
  • CUDA kernel and runtime errors: Issues include invalid kernel launch configurations causing crashes, illegal memory accesses in custom CUDA kernels, and fallback to slower kernels on ARM devices due to dtype mismatches. These errors degrade runtime stability and performance on CUDA devices.
  • [issues/167509, issues/167724, issues/167754, issues/167902]
  • API and function enhancements requested: Proposals include extending torch.hypot to accept scalar arguments for efficiency (see the sketch after this list), adding hooks in torch.pipelining for better control, and improving error messages for networkx exceptions. These aim to improve usability and developer experience.
  • [issues/167567, issues/167897, issues/167708]
  • Miscellaneous bugs and improvements: Other issues cover typographical errors in code comments, reference cycles causing memory leaks, recursion limit handling in torch.compile, and problems with torch.jit.script on Python 3.14 due to missing annotations. These affect code quality and compatibility.
  • [issues/167905, issues/167906, issues/167789, issues/167910]
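On the torch.hypot proposal above: both arguments currently have to be tensors, so a scalar leg is typically wrapped first, which is the extra allocation the issue wants to remove. A sketch of the current workaround, assuming the behavior described in the issue summary:

```python
import torch

x = torch.tensor([3.0, 5.0, 8.0])

# torch.hypot currently requires two tensors, so a scalar has to be
# materialized as a tensor first; the issue proposes allowing
# torch.hypot(x, 4.0) directly.
y = torch.hypot(x, torch.tensor(4.0))
print(y)  # tensor([5.0000, 6.4031, 8.9443])
```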

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 25

Summarized Issues:

  • Performance regressions and compilation issues: Several issues report performance regressions and failures related to PyTorch's compilation and runtime optimizations. These include a speed regression in torch.topk starting from version 2.5, failures when using torch.compile() inside CUDA Graph capture due to disallowed CUDA RNG state access, and inductor provenance tracking failures when combining torch.vmap with torch.compile, all impacting performance and usability.
  • issues/167462, issues/167639, issues/167699
  • Security vulnerabilities in build processes: A critical security vulnerability exists in the torch.utils.cpp_extension precompiled-header build process where user input is unsafely passed to shell commands, enabling OS command injection and arbitrary code execution. This highlights a dangerous flaw in subprocess usage that could be exploited by attackers controlling certain parameters.
  • issues/167480
  • Compiler detection and build environment bugs: The compiler detection logic in torch.utils.cpp_extension.check_compiler_is_gcc() fails to recognize versioned GCC compilers like g++-13 or g++-14, causing GCC-dependent features or tests to be disabled or skipped silently. Additionally, the initDeviceStreamState() function is not called in some OpenReg stream APIs, leading to failures, indicating issues in initialization sequences within the build or runtime environment.
  • issues/167499, issues/167527
  • Device-specific bugs on Apple Silicon and MPS backend: Multiple issues report device- and environment-specific bugs on Apple Silicon Macs using the MPS backend. These include NaN outputs when using torch.no_grad() with padded attention masks on Apple M4 Pro, incorrect or unavailable MPS backend on macOS 26.0 (fixed by upgrading to 26.1), incorrect mm/addmm results for large complex64 tensors, and torch.clamp caching bugs causing unexpected outputs on MPS.
  • issues/167515, issues/167679, issues/167727, issues/167767
  • ONNX export and model accuracy issues: Exporting ONNX models with the dynamo=True flag in PyTorch 2.7 causes significant accuracy drops and output mismatches during inference, which was resolved by upgrading to PyTorch 2.9. This indicates issues in the export pipeline affecting model correctness.
  • issues/167533
  • Tensor parsing and API misuse bugs: Passing scalar torch.Tensor objects to parameters expecting IntList leads to incorrect parsing and unexpected behavior due to improper casting to Python lists. This bug necessitates updating parsing logic to correctly handle scalar tensors in such contexts.
  • issues/167562
  • Inductor kernel and layout configuration limitations: Inductor currently lacks support for configuring layout constraints at the individual FX node level, limiting fine-grained control over layout requirements. Additionally, important components like the flex attention backward kernel are omitted from generated Inductor kernel names, causing confusion in performance analysis.
  • issues/167591, issues/167706
  • Library loading and runtime errors on ARM CUDA setups: Running Stable Diffusion on ARM-based CUDA setups with the latest PyTorch nightly on Ubuntu 24.04 fails due to cuDNN frontend errors caused by missing preload of libnvrtc-builtins.so.*, resulting in no valid execution plans for scaled dot product attention operations.
  • issues/167602
  • Documentation and example code errors: The docstring for torch/onnx/_internal/exporter/_torchlib/ops/nn.py contains syntax errors, misapplies the attention mask, and omits necessary scaling factors in scaled dot-product attention calculations. A corrected version was proposed to align with PyTorch’s MultiheadAttention implementation.
  • issues/167627
  • Feature requests and user feedback: Users have requested support for a TileLang kernel in the transformers library and provided feedback on the documentation for CUDAGraph Trees, indicating ongoing community interest in expanding features and improving documentation clarity.
  • issues/167643, issues/167715
  • Deprecation warnings and CI test instability: Users face excessive and unhelpful deprecation warnings related to float32 matrix multiplication precision settings that cannot be suppressed. Additionally, the rocm inductor-periodic CI tests intermittently fail due to accuracy fluctuations in the repvgg_a2 model, causing instability in test results.
  • issues/167644, issues/167652
  • Sparse tensor and memory safety bugs: Multiplying two sparse COO tensors with torch.sparse.mm in PyTorch 2.9.0 produces corrupted sparse tensors that cause segmentation faults when converting to dense, traced to invalid indices causing out-of-bounds memory access.
  • issues/167716
  • Macro naming conflicts and integration issues: The DIM preprocessor directive in PyTorch conflicts with macros in other libraries like GROMACS, prompting a proposal to rename it to TORCH_DIM to avoid clashes and ease integration.
  • issues/167718
  • Pattern matcher and FX tracing errors: The pattern matcher in PyTorch’s tracing does not correctly accept bound (instance) methods due to argument count mismatches caused by how inspect.signature() handles the implicit self parameter, leading to runtime errors. A fix involves wrapping bound methods to adjust their signatures for proper tracing (a minimal demonstration follows this list).
  • issues/167776
  • Memory layout inconsistencies on Metal backend: The Metal backend creates tensors with a column-major memory layout by default, demonstrated by unexpected stride values in half-precision tensors, which may affect assumptions about tensor contiguity.
  • issues/167794
  • Profiler trace file corruption: The torch.profiler.profile function in a recent main branch commit produces broken saved trace files that cannot be viewed properly, breaking downstream tools like the vllm profiler, though this was fixed by a subsequent patch.
  • issues/167707
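The pattern-matcher item above reduces to standard inspect behavior, shown in isolation below: a bound method's signature omits the implicit self, so parameter counts computed from the class function and from the bound method disagree by one. The class and method names here are illustrative, not from the issue.

```python
import inspect

class Model:
    def forward(self, x, y):
        return x + y

m = Model()

# The unbound function still lists `self`; the bound method does not,
# so naive parameter counting disagrees by one.
print(list(inspect.signature(Model.forward).parameters))  # ['self', 'x', 'y']
print(list(inspect.signature(m.forward).parameters))      # ['x', 'y']
```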

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 195

Key Open Pull Requests

1. [HOP][print][dynamo]Add dynamo for hop print: This pull request adds Dynamo support for the HOP print functionality in the PyTorch project, enhancing the integration between HOP and Dynamo for improved printing capabilities.

  • URL: pull/167571
  • Merged: No
  • Associated Commits: 1317b, f52b2, 827f9, 8ceb3, c23a9, 655b6, cc0ca, 32f86, b5e99, 9883e, e2ed8, e6511, def36, d3b2c, 5ee4f, d349e, 3dc18, 282fd

2. [Inductor][HOP][print]Add inductor and IR class for print: This pull request introduces new classes for the Inductor backend and intermediate representation (IR) to enable printing functionality within the PyTorch project.

  • URL: pull/167680
  • Merged: No
  • Associated Commits: 808b6, 0b0ad, 0a52f, 1ed12, f1282, c7932, c7ce4, 58a71, be424, dd65a, 75238, 474c5, 24548

3. Change NamedTupleVariable implementation to subclass UserDefinedTupleVariable: This pull request modifies the implementation of NamedTupleVariable to subclass UserDefinedTupleVariable, enabling it to handle methods that differentiate between structseq or dynamic namedtuple subclasses while defaulting to UserDefinedTupleVariable for other cases, as a continuation of previous related work.

  • URL: pull/167468
  • Merged: No
  • Associated Commits: 67954, 210c1, d55df, 3d9f7, 82210, b166b, 645ab, 2de3b, fb22a, 768d3, 5e249, feecc

Other Open Pull Requests

  • ComplexTensor subclass implementation: This pull request introduces a new ComplexTensor subclass in PyTorch, including its implementation, associated operations, and comprehensive tests. It also involves necessary refactoring and linting to integrate the subclass properly within the project's structure.
    pull/167621
  • Inductor testing and optimization: Multiple pull requests focus on improving the Inductor backend, including adding a unit test for run-to-run determinism and proposing fusion of the activation function with the Addmm operation to optimize performance. Additionally, new continuous integration jobs were added to run Inductor core tests on Python 3.11 and 3.12, splitting tests into two shards for nightly runs.
    pull/167482, pull/167469, pull/167542
  • DTensor Partial bug fixes: Several pull requests address bugs in the DTensor Partial implementation, including fixing the incorrect return of local tensor values instead of global values when calling .item(), and correcting behavior when adding a scalar to a Partial dTensor by forcing redistribution to replication. These fixes ensure correct handling of scalar operations and global tensor values in distributed contexts.
    pull/167598, pull/167813
  • NestedTensor min/max operation fix: This pull request fixes the NestedTensor min/max operations by using torch.iinfo() for integer data types instead of torch.finfo(), resolving TypeErrors for integer tensors. It also adds comprehensive CPU tests to verify functionality across multiple integer and floating-point data types (the underlying dtype detail is sketched after this list).
    pull/167685
  • Triton wheel build enhancements: The Triton wheel build workflow was expanded to support building for both current and previous ROCm versions. Updates to build configuration and scripts enable multiple jobs for different ROCm versions to run concurrently.
    pull/167536
  • MPS backend additions and fixes: Pull requests propose adding linalg.lu_solve and linalg.lu functions to the MPS backend and address crashes caused by race conditions by adding mutex guards around caches and the main MTLLibrary. These changes improve functionality and stability of the MPS backend.
    pull/167569, pull/167541
  • Codebase improvements and refactoring: Several pull requests focus on codebase improvements such as moving the enum_tag implementation to a header-only format, extending the TORCH_STABLE_ONLY mechanism to hide all non-stable symbols, and improving error messages related to intermediate opaque objects. These changes aim to improve compilation efficiency, symbol management, and debugging clarity.
    pull/167615, pull/167496, pull/167742
  • Performance optimizations on ROCm: A pull request optimizes the PyTorch TopK operator on ROCm by addressing performance regressions through enhancements like non-temporal loads, 2D kernel fusion, heuristic tuning for large cases, and support for larger input sizes, resulting in an overall performance boost.
    pull/167650
  • Build and workflow management: Pull requests add a clean command and workflow regeneration functionality to the spin configuration, and test the libtorch_agnostic build configuration using the TORCH_TARGET_VERSION environment variable. These changes enhance build environment management and compatibility testing.
    pull/167550, pull/167551, pull/167804
  • Collective operations and reduction fixes: One pull request fixes the all gather bucketing fusion process by correcting dtype cast handling during buffer allocation and copying, while another fixes reduction hint selection for innermost dimension reductions in large tensors by reusing tiling_scores ratio logic and adding tests to ensure correctness.
    pull/167853, pull/167472
  • Inductor compiler enhancements: A pull request introduces a new option for the Inductor compiler that wraps all generated code in a Higher-Order Primitive (HOP) during compilation. This enables compatibility with torch dispatch mechanisms, ensures cache safety, and facilitates support for compiled regions alongside other torch dispatch tracers with minimal CPU overhead.
    pull/167844
  • Multiple reductions and normalization: This pull request implements handling of multiple reductions in node splits along with normalization of read and write operations, improving the correctness and efficiency of reduction operations in the codebase.
    pull/167845
  • Sparse tensor matrix multiplication output: A pull request enables the matrix multiplication output operation (mm out) for sparse tensors, including the addition of the operation and corresponding tests to ensure functionality.
    pull/167908
  • XPU backend scaled matrix multiplication: This pull request implements the scaled_mm_v2 operation for the XPU backend, continuing previous work on XPU scaled matrix multiplication and oneDNN kernel integration as part of a multi-PR stack enhancing XPU support.
    pull/167518
  • Module export enhancements: This pull request enables the use of the module.to method within the export forward process by inlining eager module.to calls, implementing polyfills for C++ bound functions, and using a specialized tracer to handle graph breaks during export.
    pull/167555
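The NestedTensor fix above turns on a small dtype detail: torch.finfo is only defined for floating-point dtypes and raises a TypeError for integer ones, where torch.iinfo is needed. A minimal sketch of that dispatch; the helper name is ours, not the PR's.

```python
import torch

def dtype_min(dtype: torch.dtype):
    # finfo only covers floating-point dtypes; integer dtypes need
    # iinfo, which is the substitution the PR makes when computing
    # identity values for min/max reductions.
    if dtype.is_floating_point:
        return torch.finfo(dtype).min
    return torch.iinfo(dtype).min

print(dtype_min(torch.float32))  # most negative finite float32
print(dtype_min(torch.int32))    # -2147483648
# torch.finfo(torch.int32) would raise TypeError
```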

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 137

Key Closed Pull Requests

1. Add Muon to CPP: This pull request attempts to port the Muon optimizer, previously added to the Python API, directly to the C++ backend of PyTorch by closely following the Python implementation and using Adam as a template, while also seeking guidance on appropriate testing strategies and potential fused versions combining AdamW and Muon for different parameter types.

  • URL: pull/167876
  • Merged: No
  • Associated Commits: f1cb5, 1700f, a7224, d261a, 01706, 5af5d, 0285c, a7294, e1401, e8e85, dc412, a06c7, a2124, 51e12, 8ba5d, 9768b, f74fd, f3a3a, 92908, 25df3, 52ac9, 50bb7, 73911, 2937a, ab467, b0a97, 911ce, 95ec0, 74c28, 0b86d, 17d02, 49487, e2ed5, 07d7b, a38c8, 059f6, 93aa8, 444ce, 6ed5d, e0183, fea1b, 3d569, b6ffa, e4e40, 8d625, e6464, f5917, 1492f, c6e20, c6115, e06ff, ef57a, 888d7, 86881, 40e66, 0d644, 119de, ae67f, c75f6

2. DTensor fast path: port return_and_correct_aliasing and inplace/out checks: This pull request proposes porting the functions return_and_correct_aliasing and inplace/out checks to the DTensor fast path in PyTorch, aiming to achieve a several-microsecond performance improvement in the detach benchmark, although it was ultimately not merged.

  • URL: pull/167475
  • Merged: No
  • Associated Commits: 6d56f, 92a4b, 9a5da, 0783b, a9cfd, 2e6c9, 600f9, 20ba0, 74e2d, 77ff3, c96e8

3. [1/N][BugFix][Refactor] fix several instances which use f = open(...) without a corresponding f.close(): This pull request addresses a bug by refactoring multiple instances where files were opened using f = open(...) without a corresponding f.close(), thereby preventing potential file descriptor leaks that could lead to resource exhaustion or other unpredictable issues. A before/after sketch of the pattern appears below.

  • URL: pull/167423
  • Merged: No
  • Associated Commits: cfc77, 97aeb, df391, 9642c, 6c915, e21d0, 83b15, a1389, 6a2f5, fe3bc
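PR 3 above is the classic resource-management refactor. A minimal before/after sketch of the pattern, with a placeholder file rather than any of the call sites the PR actually touches:

```python
from pathlib import Path

Path("config.txt").write_text("example\n")  # placeholder file for the demo

# Before: the descriptor leaks if an exception fires before close().
f = open("config.txt")
data = f.read()
f.close()

# After: the with-block closes the file even on error, which is the
# general pattern the PR applies across the touched call sites.
with open("config.txt") as f:
    data = f.read()
```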

Other Closed Pull Requests

  • DebugMode Enhancements: This set of pull requests introduces new debugging features including a tensor hashing variant torch.hash_tensor for enhanced log annotations and inline stack trace display via .debug_string(show_stack_trace=True). These improvements provide more detailed debugging information by supporting multiple hash functions and showing forward and backward operation traces, although some limitations remain in capturing all backward dispatch calls and compiled region traces.
    • pull/167486, pull/167589
  • Pallas XLA Backend Development: Multiple pull requests focus on the Pallas backend integration for PyTorch Inductor, including initial setup and infrastructure, implementation of strided and scatter access methods, and attempts to add complex indexing functionality. While the initial setup and infrastructure were merged, the strided, scatter, and complex indexing features were proposed but not merged.
    • pull/167675, pull/167426, pull/167493
  • PyTorch Dynamo Improvements: These pull requests enhance PyTorch Dynamo by refactoring higher-order operator infrastructure with speculate_subgraph_with_auto_output_flattening() to support complex Python object returns and improve tracing flexibility. Additionally, they address restoring side effects in invoke_subgraph and include tests for non proxy-able outputs, with some tests marked as failing to be fixed later.
    • pull/167438, pull/167446
  • ROCm CI and Workflow Enhancements: This group of pull requests adds and updates GitHub Actions workflows to improve continuous integration for ROCm environments, including Docker image caching for MI3xx ROCm and a full unit test suite for ROCm MI3xx CI runners with both default and distributed configurations. These workflows are designed to optimize CI efficiency and evaluate new CI capacity without blocking PR merges.
    • pull/167554, pull/167587
  • Memory and Kernel Optimizations: Pull requests in this category address kernel and memory efficiency improvements such as fixing a tiling bug by preventing unnecessary splits for broadcasted loads and optimizing CUDA kernels for torch.nn.EmbeddingBag by reducing register pressure through a separate validation loop. These changes result in improved memory load efficiency and significant performance gains across various GPUs.
    • pull/167771, pull/167834
  • Codebase Maintenance and Compatibility Fixes: These pull requests improve code quality and compatibility by refactoring file handling to ensure proper resource management, updating FX graph handling for Python 3.14 lazy function annotations, and switching to c10::filesystem for better maintainability. They also include fixes for backward kernel layout constraints and special handling for numpy array copy modes to maintain compatibility with updated dependencies.
    • pull/167628, pull/167573, pull/167619, pull/167821, pull/167600, pull/167427
  • Inductor Backend and Configuration Updates: This set of pull requests adds meta registration and FakeTensor tests for torch._scaled_mm_v2, introduces an inductor configuration option assume_32bit_indexing for runtime assertions to improve performance, and proposes adding tensor manipulation functions like reshape, view, and flatten to the stable directory. These changes aim to enhance backend correctness, performance, and API completeness.
    • pull/167653, pull/167784, pull/167588, pull/167667
  • Stream and Library Helper Improvements: These pull requests propose initializing device stream states for all devices to improve stream management and add the TORCH_BOX helper for the STABLE_TORCH_LIBRARY_IMPL to support user kernels with header-only concepts and unbox type mapping. These enhancements improve device stream handling and kernel development ergonomics.
    • pull/167528, pull/167478

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits  Pull Requests  Issues  Comments
cyyever                199             44       0        28
malfet                 110             26       2        82
williamwen42           126             22      10        29
Skylion007              11              6       1       162
guangyey               105             10       2        54
anijain2305            134             12       1        13
ezyang                  63             12      17        56
pianpwk                104             18       1        18
mlazos                 111             11       2         9
mikaylagawarecki        99             18       0        15
