Weekly Project News


Weekly GitHub Report for PyTorch: October 06, 2025 - October 13, 2025

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
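
A minimal sketch of two of the user-facing changes mentioned above, assuming PyTorch 2.6 or later; the checkpoint paths and the toy function are placeholders.

```python
import torch

# torch.load now defaults to weights_only=True; loading arbitrary pickled
# objects requires an explicit opt-out (only do this for trusted files).
state_dict = torch.load("checkpoint.pt")                  # weights-only by default
legacy_obj = torch.load("legacy.pt", weights_only=False)  # explicit opt-out

# torch.compiler.set_stance controls how torch.compile behaves at runtime,
# for example forcing compiled functions to fall back to eager execution.
@torch.compile
def f(x):
    return torch.sin(x) + 1

torch.compiler.set_stance("force_eager")  # run f eagerly on subsequent calls
y = f(torch.randn(4))
torch.compiler.set_stance("default")      # restore normal compile behavior
```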

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Ban and remove plain asserts with no message in our python code: This issue addresses the removal of plain assert statements without explanatory messages in the Python codebase, aiming to replace them with proper error handling to avoid issues when Python is run with optimizations. The plan includes enabling a Ruff lint rule to detect such asserts and systematically refactoring existing code by adding descriptive messages and converting asserts into explicit exceptions, with a focus on maintaining backward compatibility by using AssertionError (a sketch of this refactor pattern appears after this list).

    • The discussion involved a contributor analyzing the codebase to quantify asserts and seeking guidance on the scope and approach, including whether to replace all asserts or only those without messages, and which exception types to use. The maintainer recommended enabling linting to block new plain asserts, fixing files incrementally with small PRs, prioritizing asserts without messages, and using AssertionError for compatibility. The work was then divided into smaller modules for collaboration, with contributors volunteering for specific areas and coordinating progress through multiple PRs.
    • Number of comments this week: 11
  2. [Fuzzer][Eager/Compile Divergence] DDE when calling .item() fake impl not called.: This issue reports a discrepancy between eager execution and compiled execution in PyTorch when calling the .item() method on a tensor, specifically noting that the fake implementation for .item() is not invoked during compilation, leading to a data-dependent error (DDE). The problem appears to stem from the compiled code calling the native C++ .item() method directly on a FakeTensor instead of routing through the expected fake tensor dispatch mechanism, causing a guard failure related to symbolic size expressions.

    • The comments discuss whether the bug is realistic or user error, clarify that the issue is not limited to boolean tensors but also occurs with int64 tensors, and identify that the .item() call bypasses the fake tensor dispatch system. Debugging suggestions include examining the C++ stack trace, and a linked pull request is mentioned as a fix for the problem.
    • Number of comments this week: 8
  3. Something broke from functorch import make_fx on main: This issue reports a backward compatibility-breaking problem with the import statement from functorch import make_fx on the main branch, which appears to have been introduced recently. The discussion revolves around whether the functorch module should be removed or deprecated in favor of torch.func, with attempts to identify the cause and confirm the regression, while noting that the issue does not reproduce in nightly builds.

    • The comments highlight the urgency of investigating this regression, debate the future of the functorch module including plans for its removal or deprecation, share attempts to reproduce and isolate the problem, and confirm that the issue is not present in nightly versions, suggesting a need for a proper deprecation cycle rather than immediate removal.
    • Number of comments this week: 8
  4. torch.compile fails to trace datetime.now() with Dynamo guard check failure: This issue reports that torch.compile fails to trace the Python builtin function datetime.now() due to a Dynamo guard check failure, causing an assertion error when compiling a model that uses this function. The error message indicates that Dynamo cannot trace this builtin and suggests filing an issue to add support or using workarounds such as wrapping the function or allowing it in the graph.

    • The comments discuss previous occurrences of this problem and propose adding a variable tracker similar to randInt to support tracing datetime.now(). Contributors highlight challenges with meaningful timing measurements if datetime calls are reordered during compilation and suggest alternatives like moving datetime calls outside compiled regions or using the PyTorch profiler. One participant expresses intent to start working on a solution involving a new variable tracker and requests assignment to the issue. The move-the-call-outside workaround is sketched after this list.
    • Number of comments this week: 6
  5. torch.cuda._check_capability could raise a false-negative warning: This issue reports that the function torch.cuda._check_capability in PyTorch may raise a false-negative warning when detecting CUDA capability versions, particularly when the minor version of the device's CUDA capability is not properly accounted for. The user observes that despite having a compatible CUDA environment and source build, a warning is issued suggesting an unsupported CUDA version, and they suggest the check should consider the minor version as well as the major version.

    • The comments discuss that this behavior was somewhat intentional, as upstream builds typically do not include the minor version 8.9 in TORCH_CUDA_ARCH_LIST, only 8.6. There is a suggestion that similar warnings would occur for 8.6 if built with that version, and the issue is linked to a known effort to address this. A commit resolving the problem was initially overlooked but has since been identified and is being considered for backporting in the 2.9.1 release.
    • Number of comments this week: 5
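
For the first item above, a minimal sketch of the refactor pattern described there, using a hypothetical shape check; per the maintainer's guidance, AssertionError is kept so existing callers are unaffected.

```python
import torch

def project(x: torch.Tensor) -> torch.Tensor:
    # Before: a plain assert with no message, silently stripped under `python -O`
    #     assert x.dim() == 2
    # After: an explicit check with a descriptive message, still raising
    # AssertionError for backward compatibility.
    if x.dim() != 2:
        raise AssertionError(f"expected a 2D tensor, got {x.dim()} dimensions")
    return x @ x.T
```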
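
For the fourth item, a small sketch of the workaround contributors suggested, assuming a toy compiled function: the timing calls are taken outside the compiled region so Dynamo never needs to trace datetime.now().

```python
import datetime
import torch

@torch.compile
def step(x):
    # Calling datetime.datetime.now() here would hit the Dynamo tracing failure
    return torch.relu(x) * 2

x = torch.randn(8)
start = datetime.datetime.now()  # measured outside the compiled region
y = step(x)
elapsed = datetime.datetime.now() - start
print(f"step took {elapsed.total_seconds():.4f}s (first call includes compile time)")
```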

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and demonstrates the error occurring while compiling specific pipeline components with torch.compile in a custom pipeline setup.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, with demonstrated speedup in testing (the equivalence is sketched after this list).
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and directory permissions being correct.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling UFMT (a formatting tool) on all files within the PyTorch codebase, specifically targeting approximately 1,500 files that are currently excluded from UFMT formatting. It outlines the process for removing files from the exclusion list, running the formatter, and managing preparatory fixes for known problems, while also providing a detailed worklist organized by directory to coordinate the incremental application of these formatting changes.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on mobile devices where storage is limited.
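
For the second stale issue above, a minimal sketch of the equivalence the proposal relies on, assuming stride 1 and no padding: one 5x5 max pool equals two stacked 3x3 max pools.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# Reference: a single 5x5 max pool with stride 1 and no padding
ref = F.max_pool2d(x, kernel_size=5, stride=1)

# Decomposition: two 3x3 max pools with stride 1; the max of overlapping
# 3-window maxima covers exactly the same 5-wide window.
out = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)

print(torch.equal(ref, out))  # True
```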

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 86

Summarized Issues:

  • TorchDynamo compilation errors with dynamic shapes and symbolic sizes: Multiple issues report failures in TorchDynamo compilation due to inability to guard on data-dependent expressions involving symbolic sizes or dynamic shapes, leading to user errors or assertion failures that do not occur in eager execution. These problems manifest in operations like .item() calls on tensors with dynamic shapes, while_loop constructs with symbolic integers, and complex chained matrix multiplications, causing divergence between eager and compiled execution.
  • [issues/164704, issues/164725, issues/164800, issues/165081, issues/165105]
  • Performance discrepancies and backend inefficiencies: There are significant performance differences between backends such as CUSPARSELT and CUTLASS for SparseSemiStructuredTensor matmul, with CUTLASS outperforming dense operations while CUSPARSELT is slower. Additionally, torch._int_mm and torch._scaled_mm functions perform poorly or error out when the right-hand side matrix is in row-major layout due to inefficient handling or lack of support for strided memory access.
  • [issues/164707, issues/165230, issues/165231]
  • MPS and Metal backend test failures on MacOS: Multiple tests fail on MacOS Tahoe with M4 Pro using the Metal Performance Shaders backend, including signal processing, exponential, matrix inverse, Cholesky, logarithmic operations, and memory allocator validations, due to tensor mismatches and allocation discrepancies. These failures indicate instability and correctness issues in the MPS backend on this hardware and OS combination.
  • [issues/164712]
  • FSDP and pipeline parallelism bugs: Issues with Fully Sharded Data Parallel (FSDP) unshard/reshard operations not working correctly within pipeline parallelism arise from non-recursive unshard calls and the Interleaved1F1B schedule not using the pipeline runtime, causing incorrect allgather timings and performance degradation. Additionally, broadcasting model parameters with FSDP2 triggers assertion errors due to missing DeviceMesh in distributed tensor arguments.
  • [issues/164756, issues/164843]
  • Torch.compile and Inductor backend limitations: Several issues report failures when using torch.compile with the Inductor backend, including inability to compile functions using .to_sparse(), runtime errors compiling flex_attention on RTX 2080 Ti with CUDA 12.8 and Triton 3.3.0, and assertion errors when calling DeviceMesh methods inside compiled functions without required arguments. These limitations cause runtime failures and restrict model compatibility with Inductor.
  • [issues/164813, issues/164823, issues/165215]
  • ONNX export and graph fusion problems: Exporting models to ONNX format fails or raises exceptions in cases such as exporting GRU layers with dynamo=True enabled and exporting UNet models with MultiheadAttention and dynamic input shapes, due to operator support and shape mismatch issues. Additionally, the fuse_as_graphmodule function changes input order in partitioned exported programs, altering expected input sequences.
  • [issues/164834, issues/164837, issues/165041]
  • Symbolic integer (SymInt) and fake tensor issues: Bugs arise from fake tensors leaking into model state during aot_autograd passes, causing corruption, and from unbacked symbolic integers causing sharding propagation errors during DTensor matrix multiplication tracing. There is also inconsistency between torch.export and Dynamo in handling lifted symbolic integer inputs, leading to divergence in tracing behavior.
  • [issues/164732, issues/165034, issues/165073]
  • Dependency and version conflicts in nightly and release builds: Nightly ROCm7 versions of PyTorch and related packages like torchaudio and torchvision are misaligned, causing dependency conflicts requiring previous torch versions. Users also face difficulties installing PyTorch 2.8.0 with CUDA 12.8 on Windows due to torchaudio requiring older torch versions, complicating upgrade paths.
  • [issues/164710, issues/164768]
  • CUDA capability and driver issues: The torch.cuda._check_capability function may incorrectly raise false-negative warnings by not properly considering GPU minor versions, leading to unnecessary warnings despite correct environment setup. Additionally, automated NVIDIA driver updates fail on H100 and A100 instances, requiring manual intervention to update drivers and fabric manager to avoid CI failures.
  • [issues/164708, issues/164860]
  • Test suite instability and disabled tests: Several tests are disabled due to consistent failures or flaky behavior on main branches or specific platforms, including fuzzer tests on Linux and ROCm, CUDA matmul tests on Linux, XPU tests in Inductor suites, and AOTInductor kernel profiling tests on XPU. These disabled tests indicate ongoing instability in various components and platforms.
  • [issues/164840, issues/164845, issues/165025, issues/165130]
  • Distributed and RPC errors: Distributed checkpointing incorrectly assumes CUDA compatibility due to reliance on CPU loader, causing bugs. RPC module's wait() on RRef objects from rpc.remote() triggers AttributeErrors and aborts due to unhandled exceptions in thread pools. Also, pipeline parallelism across nodes fails with EOFError due to message truncation during inter-stage communication.
  • [issues/165097, issues/165122, issues/165143]
  • Build and compilation environment problems: Building PyTorch from source on CentOS Stream 9 fails due to missing generated header files like TensorBody.h. The build process incorrectly uses FBGEMM's Cutlass instead of PyTorch's third_party/cutlass, causing build inconsistencies. Additionally, gfx1100 test jobs time out due to missing build support for the gfx1100 target, and an IndentationError occurs importing nightly torch builds in Python 3.13.8.
  • [issues/165100, issues/165110, issues/165040, issues/165238]
  • Memory allocation and tensor slicing errors: Slicing tensor subclasses like Float8Tensor during inference mode causes runtime errors due to invalid version counter updates. Slicing PyTorch tensors with negative steps raises ValueErrors, inconsistent with Python and NumPy behavior. Enabling expandable_segments in CUDA allocation config causes segmentation faults on certain Linux systems.
  • [issues/164872, issues/165244, issues/165208]
  • Error handling and code quality improvements: The PyTorch codebase is undergoing systematic removal of plain Python assert statements without messages, replacing them with proper error handling to improve clarity and reliability. Additionally, exception handling in validate_input_col is incomplete, catching only ValueError but not TypeError, leading to silent failures in data pipeline validation.
  • [issues/164878, issues/164862]
  • Documentation and user experience issues: The Adagrad optimizer documentation incorrectly references SGD in examples, causing confusion. MaskedTensors documentation contains broken links, suggesting updates or removal. Users request improved support for specifying multiple default distributed backends to enhance usability of distributed APIs.
  • [issues/164932, issues/165134, issues/164925]
  • Runtime errors due to deterministic algorithm settings and optimizer state: Enabling deterministic algorithms causes runtime errors in CUDA torch.kthvalue() due to lack of deterministic implementation, affecting quantized model training workflows. Calling get_optimizer_state_dict unexpectedly modifies fresh optimizer state, causing inconsistent behavior due to assumptions about zero learning rate steps.
  • [issues/165227, issues/164929]
  • CI and infrastructure issues: Docker image pulls on H100 runners take excessively long, impacting CI efficiency. macOS CI runners are unavailable, causing indefinite job queuing and inability to test on macOS. ROCm GPU machines are under maintenance, increasing queue times for workflows.
  • [issues/164951, issues/165207, issues/165240]
  • Symbolic tracing and guard failure diagnostics: Dynamo guard failures on the same frame produce unclear error messages, complicating debugging. There is a need to retain stack traces at mutation points to guide strict-export application and user code rewriting. Dynamo tracing breaks on C-implemented optree structures, requiring prototypes to replace optree with pytree for compatibility.
  • [issues/164990, issues/164971, issues/164972]
  • Miscellaneous feature requests and proposals: Proposals include adding immutable ones/zeros tensors for constant-time all()/any() operations, implementing tofile method on TorchTensor in ONNX, dynamically loading CUDSS library from Nvidia PyPI wheels to reduce wheel size, and reinstating CUDA 12.9 support for performance-critical workloads like vLLM.
  • [issues/165042, issues/165120, issues/165070, issues/165165]

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 14

Summarized Issues:

  • Compilation and Runtime Errors in TorchInductor and CUDA Backends: Several issues report errors during compilation or runtime involving tensor dimensionality mismatches and illegal memory accesses. These include problems with tensor sizes and strides mismatching in TorchInductor, assertion errors when compiling torch.isin with scalar inputs, and CUDA illegal memory access errors when stacking many permuted tensors.
  • [issues/164814, issues/164849, issues/164924]
  • Tensor and Kernel Miscompilation Bugs: There is a miscompilation bug in torch.compile where a fused kernel copying from and saving to the same state tensor causes a race condition on CUDA devices, leading to incorrect tensor updates. Additionally, a data-dependent error occurs in the autograd slicing function, indicating issues with tensor slicing operations.
  • [issues/164701, issues/164835]
  • Linker and Type Annotation Issues in PyTorch Extensions: A multiple definitions linker error arises when including <torch/csrc/stable/tensor.h> in both CPU and CUDA sources due to a non-inline function definition. Separately, a type annotation error exists in torch/utils/cpp_extension.py where a function is incorrectly declared to return None but returns other types.
  • [issues/164742, issues/165125]
  • Backend Compatibility and OS Version Confusion: The MPS backend incorrectly reports a macOS version requirement, stating it needs macOS 13.0 or higher when it actually requires macOS 14.0 or higher, causing confusion and preventing usage on supported newer OS versions.
  • [issues/164943]
  • Build and Access Control Interruptions: Temporary AWS access denial caused build job failures in the PyTorch project, which were resolved by restoring permissions and rerunning jobs.
  • [issues/164850]
  • Model Export and Evaluation Mode Handling: Exporting a torchaudio model with torch.onnx.export fails due to issues handling model evaluation mode and None outputs, with suggested fixes including setting the model to eval mode or wrapping it to drop None outputs before export.
  • [issues/165096]
  • Debugging and Logging Failures in Multi-GPU Training: During an 8-GPU training session simulating failures, the watchdog detects timeouts and prints stack traces but fails to generate the expected dump file in the /tmp directory.
  • [issues/165117]
  • Project Maintenance and Documentation Improvements: The autorevert functionality was temporarily disabled to investigate a problematic revert, with plans to improve it later. Additionally, a documentation issue was reported in the "khan" project, and a full linting process was ensured on pull requests after adding ciflow/trunk.
  • [issues/165054, issues/165166, issues/165168]

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 224

Key Open Pull Requests

1. Bugfix to forward autodiff causing differing data types: This pull request addresses a bug in PyTorch's forward-mode automatic differentiation where the dtype promotion logic failed to correctly handle Python scalars and 0-dimensional tensors. It introduces a new property, was_wrapped_number, to distinguish wrapped numbers, updates autograd code to set this property during arithmetic operations, and modifies the Python dtype promotion logic to ensure consistent data types, accompanied by new tests targeting these cases (a sketch of the scenario appears after these key pull requests).

  • URL: pull/164784
  • Merged: No
  • Associated Commits: ddf3e, c29c2, c327c, ad40b, 1a10b, 76ac4, 1ed00, 145d9, 11e9d, 18faa, d1f7c, 7a21c, 2ccd5, 7bc18, dc6cd, 8c2d2, f8a0d, 167cb, d2d2b, b08d7, 4eca0, c3885, c3fb4, f8efa, b1fd1, 36a4c, 6b230, 03fd0

2. Patch the flex_attention._get_mod_type to not use inspect.signature when computing num_positional_args (an alternative fix for flex attention graph break on create_block_mask): This pull request patches the flex_attention._get_mod_type function to eliminate the use of inspect.signature when computing num_positional_args by instead relying on properties from NestedUserFunctionVariable.const_getattr, thereby providing an alternative fix to resolve the flex attention graph break issue related to create_block_mask.

  • URL: pull/164923
  • Merged: No
  • Associated Commits: 52067, 87e3e, e530f, 337be, 385ae, 19312, 289a9, 1c8b0, f303f, 48429, 3592f, 06b43, 4a0ea, c707f, 96aea, b1afa, 69159, a6004, 20fbb

3. [Bugfix][Dynamo] Fix Sparse tensors by graph break in Dynamo: This pull request addresses bug #164823 by explicitly handling the lack of support for sparse tensors in Dynamo through graph breaks, with fixes applied across fake tensor, inductor, and lowering code to improve consistency and formalize support.

  • URL: pull/164873
  • Merged: No
  • Associated Commits: 56b3c, a1f47, 2a0fe, d7780, 4c076, ec8f0, 52874, 06950, 8d5e7, 55f4c, b7582, 0d7ff
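
For the first key pull request above, a minimal sketch of the kind of mixed scalar/dual-tensor arithmetic the fix targets, using the public forward-mode AD API; whether the primal and tangent dtypes stay consistent in cases like this is what the new tests exercise, so the exact output depends on the fix.

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3, dtype=torch.float16)
tangent = torch.ones(3, dtype=torch.float16)

with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)
    # Arithmetic with a Python scalar (a "wrapped number" internally) goes
    # through the dtype promotion path that the pull request adjusts.
    out = dual * 2.0
    p, t = fwAD.unpack_dual(out)
    print(p.dtype, t.dtype)  # expected to agree once the fix lands
```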

Other Open Pull Requests

  • DTensor Redistribution and API Enhancements: Multiple pull requests improve DTensor functionality by implementing a graph-based redistribution algorithm using weighted collective operations and Dijkstra's algorithm to optimize device order redistribution. Additionally, a user-facing API was introduced to specify tensor distribution with placements and shard_order parameters, enhancing control and consistency in tensor distribution.
    • pull/164902, pull/165092
  • DebugMode and Guard Handling Improvements: Pull requests address issues with DebugMode by adding tensor node IDs and output annotations for better traceability and fixing compile failures related to DebugMode's stack guards. These changes include a helper to globally add guard filter functions and customize C++ guard checks to ignore DebugMode, resolving assertion errors.
    • pull/165076, pull/165012
  • Proxy Tensor and FakeTensor Symbolic Shape Tracking: Changes were made to maintain symbolic shape node tracking in ProxyTorchDispatchMode even when proxy tracing is disabled, preventing confusion in proxy tracing with dynamic shapes. The FakeTensor cache was also improved to handle SymNode objects correctly by avoiding caching operations producing SymNodes and updating cache keys accordingly.
    • pull/164717, pull/164718
  • Continuous Integration and Environment Updates: Several pull requests update CI workflows by migrating Python version testing from 3.13 to 3.14, enabling AOTI tests on Windows, and updating binary YAML files to use the linux.rocm.gpu.mi250.1 label for ROCm GPU compatibility.
    • pull/164711, pull/164935, pull/164791
  • Code Quality and Cleanup: Multiple pull requests focus on improving code quality by polishing docstrings and comments, removing unused variables and code, and replacing std::runtime_error with TORCH_CHECK for better error handling.
    • pull/165039, pull/165136, pull/165119
  • CUDA and Performance Fixes: Fixes include correcting the CUDA reduction warp shuffle order to align with Triton-generated code for better bitwise equivalence and minimal performance impact, as well as addressing numerical issues related to cuDNN through iterative debugging.
    • pull/164790, pull/164950
  • Sparse Tensor Operations on MPS Backend: A pull request implements matrix multiplication for sparse tensors on the MPS backend, completing most core sparse operations and providing performance comparisons between MPS and CPU on an M1 Pro.
    • pull/165232
  • Bug Fixes and Warning Mechanisms: Bug fixes include correcting the tolist method for GradTrackingTensors by recursive unwrapping and fixing stream graph output semantics. A warning mechanism was also added to alert users when AccumulateGrad node streams do not match their producer node streams to identify synchronization issues.
    • pull/165184, pull/164819, pull/165065
  • Optimizer and Linker Fixes: Fixes address linker errors and C++/Python inconsistencies in optimizer default parameter handling by moving implementations, adding explicit template instantiations, and enabling automatic parameter group inheritance to maintain API parity without breaking changes.
    • pull/165182
  • Documentation and Benchmarking Enhancements: Documentation efforts include setting up compiler documentation in the User Guide, while benchmarking improvements expand the generic Benchmarker.benchmark function to support additional devices like Triton CPU and callable arguments for more flexible performance testing.
    • pull/164940, pull/164938

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 201

Key Closed Pull Requests

1. [compile] Regional inductor compilation with fx.annotate: This pull request introduces a mechanism to compile specific regions of an FX graph using fx.traceback.annotate, allowing users to mark code regions with annotations like {"compile_with_inductor": 0}. Annotated regions are recognized and compiled into subgraphs via a CapabilityBasedPartitioner and torch._inductor.standalone_compile, with explicit compiler invocation and metadata preservation required to control compilation scope and enable regional inductor compilation within PyTorch's FX framework (a usage sketch appears after these key pull requests).

  • URL: pull/164776
  • Merged: No
  • Associated Commits: 7508d, 7b02e, 3fec7, 8f730, 7fdaa, 80733, a8fa3, ea770, 0eb30, 3a585, 4b912, 29b5a, 92782, ab91c, e95ed, 954f7, 72ee6, 9b0c0, af4e7, 3ceb4

2. [ROCm][CI] Change gfx1100 workflow to run periodically and with 6 shards: This pull request proposes changing the ROCm gfx1100 continuous integration workflow to run periodically with six shards, including updates to job names, test labels, and conditions to only run on the main branch, aiming to optimize and refine the testing process.

  • URL: pull/164975
  • Merged: No
  • Associated Commits: d9ea6, 6de07, 914bb, 9f846, 3275d, 2ea58, 45bfb, fae03, 71675, 8e331, 76cea, 641fe, d35a2, a0b46, 2bd47, b86d3

3. [4/N] [DTensor device order] Support debugmode to show dtensor distribution transform path: This pull request proposes adding a DebugMode feature to the DTensor device order system that enables printing detailed updates of placements and shard_order during the execution of transform_infos, thereby allowing developers to trace the transformation path from source to target placement in the tensor distribution.

  • URL: pull/164821
  • Merged: No
  • Associated Commits: 94153, 87a72, a90ce, 7da17, dc430, 48661, 5bca7, 23af7, bb7b6, a97e5, efee0, 5a0ac, ae808, b8a7b, 0fbda
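
A rough sketch of the usage pattern described for the first key closed pull request, assuming torch.fx.traceback.annotate accepts a dictionary annotation as described there; since the pull request was not merged, this is illustrative only and the annotation key is taken from its description.

```python
import torch
import torch.fx.traceback as fx_traceback

class Block(torch.nn.Module):
    def forward(self, x):
        x = torch.sin(x)
        # Mark only this region for regional inductor compilation; per the PR,
        # annotated nodes are grouped by a CapabilityBasedPartitioner and
        # compiled with torch._inductor.standalone_compile.
        with fx_traceback.annotate({"compile_with_inductor": 0}):
            x = torch.matmul(x, x.transpose(-1, -2))
        return torch.cos(x)
```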

Other Closed Pull Requests

  • DTensor device order and placement improvements: Multiple pull requests propose enhancements to the DTensor system by adding a new shard_order attribute to DTensorSpec and improving utility functions within the device mesh component to support device order placement. These changes aim to better specify and manage the order of distributed tensor shards across devices.
    • pull/164806, pull/164797
  • Removal of check_is_size function attempts: There are two pull requests attempting to remove additional instances of the _check_is_size function or check from the PyTorch codebase. Both attempts were ultimately not merged despite multiple commits focused on eliminating or replacing this check.
    • pull/164753, pull/164706
  • TorchInductor benchmark pruning: Two pull requests focus on reducing continuous integration costs by pruning the TorchInductor benchmark dashboard, specifically by removing or reducing the number of timm and huggingface models included. These changes follow documented guidelines to streamline the benchmark suite.
    • pull/164816, pull/164805
  • Fixes and improvements for ROCm and CUDA build/test issues: Several pull requests address platform-specific issues including fixing test failures on ROCm caused by API refactoring, skipping hipify steps to fix build errors, and making the libnvToolsExt library optional to resolve CUDA 13.0 build failures on Amazon Linux 2023. These fixes improve compatibility and stability across different hardware and software environments.
    • pull/164897, pull/164735, pull/164870
  • Dynamo and Fx graph enhancements: One pull request proposes changes to the Dynamo component to support eager execution of torch.fx.traceback.annotate and ensure every Fx node includes a custom meta node. This builds on prior work to facilitate future regional inductor compilation despite current limitations with graph breaks.
    • pull/164678
  • Optimization and bug fixes in Inductor and list comparison: Pull requests include an optimization to list equality comparisons by adding an early exit for different lengths to avoid unnecessary checks and a fix to achieve bitwise equivalence between compiled and eager modes for small reductions by adjusting inductor numerics. These changes improve performance and test stability.
    • pull/165091, pull/164755
  • Memory Estimator refactoring: One pull request refactors the Memory Estimator to use node storages for analysis, simplifying bookkeeping and factoring logic into a reusable class. Tests were added to verify correctness on forward and backward passes, improving maintainability and accuracy.
    • pull/164783
  • Fixes for stride and tensor size handling in Inductor: A pull request addresses stride incorrectness for tensors with size zero or symbolic stride equal to one by fixing errors in graph capture and Inductor code generation. This resolves related bugs and ensures correct handling of these edge cases.
    • pull/164897
  • Dynamo DebugMode recompilation fix: One pull request fixes unnecessary recompilations in PyTorch's Dynamo DebugMode caused by guard failures on dispatch key set checks by masking out Python dispatch keys in guard comparisons. This prevents spurious recompilations during repeated executions of compiled functions.
    • pull/164992
  • Pipeline schedule migration: A pull request aims to migrate various pipeline schedules to use a unified _PipelineScheduleRuntime that standardizes execution logic by adding UNSHARD and RESHARD operations. This addresses a known issue and fixes a gradient scaling problem but has not been merged yet.
    • pull/164777
  • Deprecation of sizelike functionality: One pull request proposes further deprecation of the sizelike functionality by removing C++ bindings and usages of the expect_size feature in the PyTorch codebase.
    • pull/164803
  • Custom operator autotune fixes: A pull request fixes issues related to the custom operator autotune feature by removing redundant code, refining decorators, including default implementations in choices, and improving test cases and code quality. These changes ensure correct fallback behavior and maintainability.
    • pull/164689
  • Test and type checker suppressions: One pull request adds suppressions to the pyrefly type checker configuration to ensure a clean typecheck by removing certain exclusions, running checks, and updating suppressions accordingly. This pull request was not merged.
    • pull/164748
  • Embedding backward int32 overflow fix attempt: A pull request addresses a potential int32 overflow issue in the embedding_dense_backward function caused by large values of max_partial_segment that can overflow the gid variable, but it was not merged.
    • pull/165095
  • Log classifier enhancement for flaky tests: One pull request enhances the log classifier to better distinguish between rerun tests and genuine failures by adding print statements that identify flaky tests which fail and then succeed upon rerun in a new subprocess.
    • pull/165163
  • Distributed capacity test submission: A non-merged test submission pull request aims to evaluate distributed capacity by distributing gfx942 workloads and updating the periodic ROCm workflow.
    • pull/165124
  • Placement class static method proposal: One pull request proposes making certain methods in the Placement class static to allow usage without instantiating the class, facilitating operations like distributing tensors from normal tensors more conveniently.
    • pull/164820

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits   Pull Requests   Issues   Comments
bobrenjc93       334       48              47       32
laithsakka       163       31              7        47
cyyever          144       56              0        17
ezyang           65        16              6        117
malfet           86        15              12       77
anijain2305      144       23              4        13
Skylion007       12        5               2        159
eellison         68        19              0        52
kwen2501         74        15              7        40
tugsbayasgalan   86        22              4        23
