Weekly GitHub Report for PyTorch: February 24, 2025 - March 3, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks the deprecation of PyTorch's official Anaconda channel, with a shift towards using Manylinux 2.28 for Linux binaries, and introduces a backward compatibility-breaking change by setting weights_only=True as the default for torch.load.
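As a quick illustration of the torch.load change, the sketch below (file name and tensor contents are illustrative) shows the new default and the explicit opt-out:

```python
import torch

# PyTorch 2.6 changes the default of torch.load to weights_only=True, so only
# plain tensors/state dicts (plus an allowlist of safe types) are unpickled.
torch.save({"w": torch.randn(3, 3)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # equivalent to weights_only=True on 2.6

# Checkpoints containing arbitrary Python objects now require an explicit
# opt-out, which should only be used for checkpoints from a trusted source:
# state = torch.load("ckpt.pt", weights_only=False)
```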
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL: This issue is a feature request for a stream management API in PyTorch's Process Group NCCL to address asynchronous communication challenges, particularly the "read-before-write" issue that arises when each NCCL process group operates on its own dedicated stream. The request suggests allowing users to define or set streams in process groups to ensure proper execution order of collective operations, which is crucial for overlapping communication with compute operations in multi-data center training scenarios.
  - The comments discuss related issues and potential solutions, including using existing APIs with user-defined streams to manage synchronization and execution order. Some contributors suggest that the current behavior is not a bug but a user responsibility to ensure correct synchronization, while others propose improvements for better stream control. There is a consensus on the need for clearer documentation and possibly a new feature to allow more direct control over which stream NCCL uses.
  - Number of comments this week: 16
- Checkpoint doesn't work with torch_function if torch_function change tensor metadata: This issue highlights a problem with the PyTorch checkpoint functionality when used in conjunction with TorchFunctionMode, specifically when __torch_function__ alters tensor metadata, leading to a metadata mismatch error during the recomputation phase. The user provides a minimal reproducible example demonstrating the error and seeks advice on whether there is a way to make __torch_function__ compatible with checkpointing.
  - The comments discuss potential reasons for the issue, such as the violation of autograd conditions due to metadata changes, and suggest workarounds like manually activating TorchFunctionMode within the checkpointed function. There is also a discussion about the possibility of checkpoint detecting and re-enabling modes during recomputation, with some users expressing optimism about the feasibility of this solution.
  - Number of comments this week: 9
- FlexAttention compiled has illegal memory access or device-side assert even though all tensors are contiguous: This issue involves a bug in the PyTorch library where the FlexAttention module, when compiled, encounters illegal memory access or device-side assertions despite all tensors being contiguous. The problem arises under specific parameter configurations, and while a workaround involving padding the rel_bias tensor can mitigate the illegal memory access, it does not resolve the device-side assertion issue, and further complications occur during the backward pass.
  - The comments discuss potential causes and solutions for the issue, including a suggestion to modify the block-mask function and score-mod function to prevent out-of-bounds access. There is a debate about whether the score_mod function should be evaluated conditionally based on the mask_mod function, and it is noted that this problem has been a long-standing issue with some workarounds available.
  - Number of comments this week: 6
- [inductor][cpu]AOT inductor AMP static shape default wrapper occupied almost 3x disk than before: This issue reports a significant increase in disk usage when using the AOT inductor AMP static shape default wrapper, which is observed to occupy almost three times more disk space than before, specifically when running the ResNet50 model. The problem is suspected to be caused by a specific commit, which has been identified as potentially responsible for this behavior.
  - The comments discuss the need to identify the commit causing the issue, with some users suggesting manual searching due to the lack of an automated mechanism. A suspected commit is identified, and further investigation is requested from other contributors to understand the cause of the increased disk usage. There is also a mention of an upcoming change that might resolve the issue.
  - Number of comments this week: 6
- No gradient for residuals in the return value of torch.linalg.lstsq: This issue highlights a concern with the torch.linalg.lstsq function in PyTorch, where the residuals in its return value do not have a gradient, unlike the solution. The user is questioning whether this behavior is expected and is considering contributing a solution to address this limitation.
  - The comments discuss the inefficiency of manually computing gradients for residuals and suggest that the current API design is flawed. A user proposes a code solution and offers to submit a pull request to improve the functionality, while another commenter advises against changing the API and suggests users implement their own solutions if needed. (A short workaround sketch follows this list.)
  - Number of comments this week: 5
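For the torch.linalg.lstsq item above, a minimal workaround sketch (shapes are illustrative): since the returned residuals are not differentiable, they can be recomputed from the differentiable solution so gradients flow back to the inputs.

```python
import torch

A = torch.randn(6, 3, requires_grad=True)
B = torch.randn(6, 2)

out = torch.linalg.lstsq(A, B)
# out.residuals does not carry a gradient, as reported in the issue above.

# Recompute the squared residuals from the differentiable solution instead,
# so autograd can propagate back to A (and to B if it requires grad).
residuals = ((A @ out.solution - B) ** 2).sum(dim=0)
residuals.sum().backward()
print(A.grad.shape)  # torch.Size([6, 3])
```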
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 135
Summarized Issues:
- ONNX Export Issues: The PyTorch ONNX export functionality encounters various issues, such as unsupported operations and errors during model export. These include failures with dynamic shapes, unsupported operators like aten::bucketize, and slicing operations on complex tensors. These issues prevent successful model export and require workarounds or updates to the ONNX export process. (A minimal export sketch follows this list.)
- Segmentation Faults and Errors: Several issues report segmentation faults and errors in PyTorch, often related to specific operations or configurations. These include faults in the Triton upstream, dtype view conversions, and the use of torch.sparse.sum, leading to crashes and requiring investigation and fixes.
- Distributed and Parallel Computing Challenges: PyTorch faces challenges in distributed and parallel computing, such as the lack of sharding strategies for certain operators, issues with the ProcessGroupNCCL, and problems with the fully_shard function in FSDP2. These issues affect performance and require enhancements to improve distributed training capabilities.
- Backend and Device-Specific Bugs: Various backend and device-specific bugs are reported, including issues with the MPS backend, CUDA, and the Inductor backend. These bugs lead to incorrect results, crashes, and performance discrepancies, necessitating backend-specific fixes and optimizations.
- Compilation and Export Errors: Errors during the compilation and export processes in PyTorch are reported, including issues with torch.compile, torch.onnx.dynamo_export, and the handling of dynamic shapes. These errors hinder model deployment and require updates to the compilation and export mechanisms.
- Performance and Optimization Issues: PyTorch experiences performance and optimization issues, such as slow operations with certain data types, inefficient memory usage, and discrepancies in execution times across different backends. These issues require performance tuning and optimization strategies to enhance efficiency.
- Model and Operation Inconsistencies: Inconsistencies in model outputs and operations are reported, such as incorrect gradients, mismatched outputs between backends, and unexpected behavior in certain functions. These inconsistencies require debugging and corrections to ensure reliable model performance.
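As a minimal repro sketch for the ONNX export bullet above (module and shapes are illustrative; whether aten::bucketize is supported depends on the exporter path), this exports a module that uses the reportedly problematic operator with the dynamo-based exporter:

```python
import torch

class Bucketize(torch.nn.Module):
    def forward(self, x, boundaries):
        # aten::bucketize is one of the operators reported as unsupported above
        return torch.bucketize(x, boundaries)

x = torch.randn(4, 8)
boundaries = torch.tensor([-1.0, 0.0, 1.0])

# With the dynamo-based exporter, an operator without an ONNX lowering
# surfaces as an export error or a fallback rather than a silent success.
onnx_program = torch.onnx.export(Bucketize(), (x, boundaries), dynamo=True)
```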
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 48
Summarized Issues:
- Unit Test Failures Due to Triton Update: The PyTorch project experienced widespread unit test failures due to an AttributeError caused by a deprecated API in the latest Triton update. This issue specifically affected the cpp_wrapper for GPU and XPU, requiring fixes for compatibility with NVIDIA and AMD platforms.
- torch.export API Issues: The torch.export function in PyTorch has issues with marking inputs as constant, leading to errors during the export process. Additionally, it fails to provide useful error messages when unrecognized dataclasses are used as input. (A registration sketch follows this list.)
- Test Failures in TestViewOpsLAZY Suite: The test test_real_imag_view_lazy_complex128 in the TestViewOpsLAZY suite was disabled due to its failure on the main branch of the PyTorch project. Multiple issues document the disabling of this test with references to recent failure examples.
- Bugs in PyTorch's MPS Backend: PyTorch's MPS backend has several bugs, including crashes when using scaled_dot_product_attention with non-contiguous tensors and torch.randn producing identical random values for 5D tensors. These issues highlight problems with the MPSGraph rather than PyTorch itself.
- Memory Access and Allocation Issues: PyTorch encountered memory access violations on the ROCm platform and memory allocator lock contention slowing down operations in the Inductor-CPU project. These issues suggest solutions like modifying backend allocation and memory buffer allocation strategies.
- Compilation and Runtime Errors: Various compilation and runtime errors were reported in PyTorch, including assertion errors with torch.norm and torch.nn.Fold, and a SubgraphLoweringException during inductor compilation. These issues indicate compatibility problems with certain functions and backends.
- Backend and Platform Compatibility Issues: PyTorch faced compatibility issues with different backends and platforms, such as the OffsetBasedRNGTracker always using CUDA, and the GradScaler not functioning on Intel ARC A770 GPUs. These issues highlight the need for more flexible backend support.
- Test Failures and Disabling: Several tests in PyTorch were disabled due to failures, including test_flatten_nonview_xla and test_mkldnn.py::TestMkldnnCPU::test_mul_cpu. These failures were not detected by the test detection system, indicating a limitation in the current testing framework.
- Errors with Triton and ROCm: PyTorch encountered errors with Triton and ROCm, such as a "Cannot bitcast data-type" error and a Triton HIP error indicating no kernel image available. These issues suggest compatibility challenges with the latest Triton updates.
- Miscellaneous Issues: Other issues in PyTorch include a malicious link promotion, a ResourceWarning from the tempfile module, and a request for guidance on building a specific PyTorch version. These issues highlight a range of challenges faced by the project.
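For the torch.export dataclass item above, a minimal sketch (the class and model are illustrative; it assumes torch.export.register_dataclass, which is the documented way to make a dataclass input traceable):

```python
import dataclasses
import torch
from torch.export import export, register_dataclass

@dataclasses.dataclass
class Batch:
    x: torch.Tensor
    y: torch.Tensor

# Registering the dataclass as a pytree node lets export() flatten it;
# without this step, export is where the unhelpful error described above
# is reported.
register_dataclass(Batch)

class Model(torch.nn.Module):
    def forward(self, batch: Batch):
        return batch.x + batch.y

ep = export(Model(), (Batch(torch.randn(2), torch.randn(2)),))
print(ep)
```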
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 180
Key Open Pull Requests
1. [test] Linter docker image: This pull request is focused on testing a linter Docker image for the PyTorch project, as indicated by the title "[test] Linter docker image," and it aims to address a specific issue referenced as #ISSUE_NUMBER, although it has not yet been merged.
- URL: pull/147789
- Merged: No
- Associated Commits: 4545e, ff328, c3f77, c65fb, 28551, 5488f, d143c, eb7af, dc5d2, 3f852, 029b1, 8e0c1, d9995, 2a3d9, a1f32, cfec8, 111b7, 821f8, d698d, 7a52c, a95d2, 14bcd, e92fb, cf903, 15d46, 1ce32, d6df9, 4fcbc, 097fd, f491b, 74240
2. Add note to get start xpu: This pull request aims to add a note to the PyTorch documentation to inform users that installing PyTorch from binaries will automatically include Intel® Deep Learning Essentials runtime packages, which could lead to environment issues if oneAPI is activated in a standalone installation, and thus advises users to avoid this situation.
- URL: pull/148168
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 055f5
3. [pytree] add another simplified pytree module torch.pytree: This pull request introduces a new simplified module, torch.pytree, to the PyTorch library, which aligns more closely with the JAX pytree API by removing the tree_ prefix from its functions and reversing the argument order of the unflatten function to enhance compatibility with functools.partial, without causing any backward compatibility issues.
- URL: pull/148180
- Merged: No
- Associated Commits: 3ae64, 51232, c5f43, c3620, 08aff, 4e773, 308f6, 00a64, f6206, c93a2, 4b8ce, b8ead, c7e5b, ef3ad, 0518f, 7c8ae, a5d05
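For context on the torch.pytree proposal above, here is a hedged sketch using today's private torch.utils._pytree helpers; the torch.pytree names in the comments are the pull request's proposal, not a released API.

```python
import torch
import torch.utils._pytree as pytree  # existing (private) utilities

tree = {"a": torch.tensor([1.0]), "b": (torch.tensor([2.0]), torch.tensor([3.0]))}
leaves, spec = pytree.tree_flatten(tree)
rebuilt = pytree.tree_unflatten(leaves, spec)

# Under the proposal, these would roughly become torch.pytree.flatten and
# torch.pytree.unflatten, with unflatten taking the spec first so that
# functools.partial(torch.pytree.unflatten, spec) works, mirroring JAX.
```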
Other Open Pull Requests
- Intel Triton Update for PyTorch 2.7: This topic involves updating the Intel Triton component within the PyTorch project to prepare for the release of version 2.7. The pull request is marked as work-in-progress and includes multiple commits refining the update, with several contributors involved for review and collaboration.
- Configurable Record and Storage Alignment in torch.save: This topic focuses on making the record and storage alignment in the torch.save function configurable. The pull request is part of a series of changes tracked through the ghstack tool, with multiple updates and contributions from several collaborators.
- Eager Then Compile Stance in PyTorch: This topic introduces a new eager_then_compile stance through the set_stance APIs. The pull request aims to reduce compile times and improve the ergonomics of using dynamic shapes by initially running the compile in eager mode and then deriving input dynamism on subsequent invocations. (A hypothetical usage sketch appears at the end of this list.)
- Support for contextlib.suppress in PyTorch: This topic involves adding support for the contextlib.suppress feature in the PyTorch project. The pull request is part of a series of related changes tracked by the ghstack tool and involves multiple commits that are currently not merged.
- Enhancements to torch.load in FakeTensorMode: This topic focuses on enhancing the PyTorch library by adding information about checkpoint offsets to untyped storages when using the torch.load function under the FakeTensorMode. The pull request is part of a series of changes tracked by the ghstack tool.
- Threading of boxed_forward_device_index in CompiledFXGraph: This topic addresses the threading of the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. The pull request ensures accurate updates of BoxedDeviceIndex from cache hits and includes testing with a specific Python benchmark.
- Performance Enhancements in PyTorch Symbolic Tracing: This topic involves enhancing the performance of the PyTorch project by moving the Node._prepend and Node._remove_from_list methods to C++. The pull request results in a significant reduction in function calls and execution time during microbenchmarking of symbolic tracing.
- RuntimeEstimator and SACEstimator Modifications: This topic involves modifications to the RuntimeEstimator and SACEstimator in the PyTorch project. The pull request addresses an unspecified issue and includes various commits such as testing fake utilities, fixing default arguments and bindings, and resolving linting issues.
- Loading FakeTensors with Correct Devices in FakeTensorMode: This topic addresses the issue of loading FakeTensors with the correct devices under FakeTensorMode in PyTorch. The pull request fixes the functions _rebuild_tensor_v2 and _rebuild_tensor_v3 as part of a series of related changes tracked through ghstack.
- torch.serialization.skip_data Compatibility with torch.load: This topic aims to enhance the PyTorch library by enabling the torch.serialization.skip_data functionality to be compatible with the torch.load method. The pull request is part of a series of related updates tracked through the ghstack tool.
- Propagation Strategy Change in Inductor Pattern Matcher: This topic proposes a change in the propagation strategy of arg_kwarg_vals within the PyTorch project. The pull request modifies the Inductor Pattern Matcher to trace replacement graphs using arg_kwarg_vals instead of node.meta['val'].
- New Operator Tag "needs_exact_strides" for Inductor: This topic introduces a new operator tag, "needs_exact_strides," for the Inductor component in the PyTorch project. The pull request enforces exact strides on custom operators, with plans to make this behavior the default in a subsequent update.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: This topic introduces a new templatized CUDA kernel designed to replace three existing non-ROCM CUDA kernels for the GammaBeta backwards pass. The pull request addresses performance issues by utilizing warp shuffles, coalesced loads, and parallelism across the M dimension.
- Regression Fix in evaluate_expr Function: This topic addresses a regression issue in the evaluate_expr function caused by a previously added logging argument. The pull request refactors the code to eliminate the use of expr_sym_node_id in cache lookups and introduces a new function evaluate_sym_node.
- Renaming of node.meta["arg_kwarg_vals"] Key: This topic proposes renaming the key node.meta["arg_kwarg_vals"] to node.meta["orig_arg_kwarg_vals"] in the codebase. The pull request aims to improve clarity and is awaiting continuous integration (CI) test results for validation.
- Optimization of PyTorch Codebase: This topic aims to optimize the PyTorch codebase by removing unnecessary tensor clones and redundant variable calls. The pull request also adds argument comments for clarity as part of addressing a specific issue referenced in the project.
- Selective Code Building at Optimization Level O1: This topic is focused on selectively building code at optimization level O1 as part of an ongoing stack of changes. The pull request includes tasks such as porting improvements to cpp_wrapper mode and addressing CMake packaging issues.
- Support for Rowwise Scaling in Scaled GEMM Operations: This topic introduces support for rowwise scaling in scaled GEMM operations. The pull request includes various enhancements such as fixes for offline tuning and the addition of new unit tests for offline scaled GEMM.
- Sharding Propagation Refactor for Cross-Mesh Computations: This topic refactors the sharding propagation mechanism in the PyTorch project to handle cross-mesh computations. The pull request moves the same mesh check from the sharding propagation level to the individual operator level.
- Registration Process for at::_weight_int4pack_mm_with_scale_and_zeros: This topic aims to facilitate the registration process related to the at::_weight_int4pack_mm_with_scale_and_zeros function. The pull request is part of a stack of changes managed through the ghstack tool.
- Stride Consistency in while Loop for body_fn Function: This topic addresses the requirement for the stride in a while loop to remain consistent with the input for the body_fn function. The pull request is part of a series of updates tracked via ghstack and involves multiple contributors for review and collaboration.
- Preservation of Torch Function Mode Stack in torch.utils.checkpoint: This topic addresses the issue of preserving the torch function mode stack during recompute in torch.utils.checkpoint. The pull request ensures that the TorchFunctionModeTLS remains active even when .backward() is called.
- New Utility Function torch._library.utils.normalize_args_kwargs: This topic introduces a new utility function, torch._library.utils.normalize_args_kwargs. The pull request standardizes the (args, kwargs) to align with the PyTorch dispatcher calling convention and includes new tests to validate this functionality.
- Integration of XPU Device Support in LayerNormKernel: This topic aims to integrate support for the XPU device into the LayerNormKernel devices within the PyTorch project. The pull request is in collaboration with a related pull request from the Intel torch-xpu-ops repository.
- Performance Improvement by Moving map_aggregate to C++: This topic involves moving the map_aggregate function to C++ within the PyTorch project. The pull request results in improved performance as demonstrated by a microbenchmark showing a reduction in function calls and execution time.
- Crash Fix in gen_patterns.py Script: This topic addresses a crash issue encountered when running the gen_patterns.py script in the PyTorch project. The pull request fixes a TypeError related to the issubclass() function and includes multiple commits for fixing the error and cleaning up the code.
- Argument Passing Support in DeviceMesh.get_group Function: This topic introduces support for passing arguments to the DeviceMesh.get_group function in the PyTorch project. The pull request includes adding tests and updating relevant files like test_dtensor_compile.py and distributed.py.
- Enhancements to torchgen Tool for C Shim Files: This topic aims to enhance the torchgen tool by enabling it to automatically update C shim files with a version number and a list of new arguments for modified operations. The pull request ensures backward compatibility when adding new arguments to fallback operations in Python.
- Addition of Meta Kernels in PyTorch: This topic aims to enhance the PyTorch project by adding additional meta kernels. The pull request is associated with the ghstack tool, which helps manage stacked pull requests.
- Support for Custom Operations with Arbitrary Input Types: This topic introduces support for custom operations in PyTorch that can handle arbitrary input types. The pull request demonstrates this through a test case involving a custom operation that processes dictionary and tensor inputs.
- Fix for linspace Function Decomposition: This topic addresses an issue in the PyTorch project by fixing the decomposition for the linspace function. The pull request ensures that no non-functional operations are performed on functional operators.
- Recompile Limit Handling in Dynamo Component: This topic addresses the issue of exceeding the recompile limit in the Dynamo component of the PyTorch project. The pull request implements a solution that allows the system to run recursively only when this limit is surpassed.
- Data Type Checks in torch.matmul and Related Functions: This topic addresses an issue related to the data type checks in the torch.matmul function and its related operations. The pull request ensures correct behavior when handling different dimensional inputs.
- Replacement of unimplemented with unimplemented_v2 in torch/_dynamo/variables/base.py: This topic involves replacing the unimplemented function with unimplemented_v2 in the torch/_dynamo/variables/base.py file. The pull request is part of issue #147913 and includes several commits for updates and fixes.
- Removal of Internal Stack Traces for Graph Breaks: This topic aims to remove internal stack traces for graph breaks when the fullgraph=True option is used in the Dynamo component of the PyTorch project. The pull request is indicated by the title and commit messages.
- Renaming of Test File in PyTorch Project: This topic involves renaming a test file from "test_graph_break_messages" to "test_error_messages" in the PyTorch project. The pull request is part of a series of changes managed through the ghstack tool.
- XPU Support for Inductor MM Triton Kernel Benchmark: This topic aims to enable the XPU (Intel GPU) support for the Inductor MM Triton Kernel Benchmark. The pull request addresses a test case regression issue introduced in a previous update.
- PT2 Enablement Tests for torch.float8_e8m0fnu Data Type: This topic introduces PT2 enablement tests for the torch.float8_e8m0fnu data type. The pull request addresses specific functionalities such as displaying e8m0 in TORCH_LOGS output and testing uint8 to e8m0 conversions.
- Support for invoke_subgraph Feature in PyTorch: This topic introduces support for the invoke_subgraph feature in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and includes multiple commits with updates and discussions.
- CK Submodule Version Usage in ROCm Environment: This topic aims to ensure that PyTorch uses the CK submodule's version of the config.h file instead of potentially defaulting to the system version. The pull request enhances compatibility and consistency within the ROCm environment.
- Enhancements to nonstrict_trace Functionality: This topic aims to enhance the nonstrict_trace functionality in the PyTorch project. The pull request allows it to handle objects whose types have been registered as constants using pytree.register_constant.
- Support for Navi4 Architecture in CUDA Tests: This topic involves updating the test/test_matmul_cuda.py and torch/testing/_internal/common_cuda.py files to introduce support for the Navi4 architecture. The pull request adds an IS_NAVI4 constant and implements conditional test skipping for row-wise FP8 tests.
- Autotuning of User-Defined Triton Kernels: This topic aims to enhance the autotuning process of user-defined Triton kernels in the PyTorch project. The pull request utilizes real input data to ensure the correct execution path is followed, with test cases included to validate the changes.
- Upgrade of oneDNN Submodule to Version 3.7: This topic aims to upgrade the oneDNN submodule to version 3.7 in the PyTorch project. The pull request enhances performance for various operations on Intel Xeon processors and Intel GPUs, while also addressing several issues related to accuracy and performance.
- Main Tests for Cutlass Backend Matrix Multiplication: This topic introduces the initial step of adding main tests for matrix multiplication (mm), addition and multiplication (addmm), and batch matrix multiplication (bmm) functionalities in the Cutlass backend of the PyTorch project. The pull request is part of a series of related changes tracked through the ghstack tool.
- Spelling Corrections Across PyTorch Codebase: This topic addresses and corrects various spelling errors across the PyTorch codebase. The pull request enhances documentation quality and code readability, ensuring consistent spelling without affecting the functionality of the code.
- Introduction of RandomBatchSampler in PyTorch: This topic introduces a new RandomBatchSampler to the PyTorch project. The pull request optimizes the process of generating batch indices by replacing the traditional iteration method with slicing, resulting in significant speed improvements.
- Skipping of Intel GPU TestCommon::test_dtypes Test: This topic aims to skip the Intel GPU TestCommon::test_dtypes test for the bmm and addbmm operations. The pull request addresses the lack of complex64 support and extends the DecorateInfo to accommodate a list of device types.
- CK Backend for Memory-Efficient Attention in ROCm: This topic introduces CK as the backend for memory-efficient attention in ROCm. The pull request enables the use of attention bias while noting limitations such as the lack of support for Nested Tensors.
- Removal of Unused rand Function Call: This topic addresses issue #147171 by removing an unused call to the rand function when not falling back to eager execution. The pull request includes commits that eliminate dead code in the Graph component of the PyTorch project.
- Stack-Allocated Buffer in GEMM Template: This topic proposes using a stack-allocated buffer in the GEMM template to reduce memory allocator lock contention. The pull request potentially saves a few cycles and removes some non-determinism, although no significant performance difference was observed.
- Test Update for ghstack-Poisoned Changes: This topic, titled "test," is part of a stack of changes managed by ghstack. The pull request includes updates marked as "[ghstack-poisoned]" with two specific commits, but it has not yet been merged into the PyTorch project.
- Application of torch_compile_options to C10 Libraries: This topic aims to enhance the PyTorch project by applying torch_compile_options to the C10 libraries. The pull request addresses a specific issue and includes commits that introduce this change and fix a semicolon error.
- Enhancement of MSVC Build Process with /permissive- Flag: This topic aims to enhance the MSVC build process of the torch libraries by implementing the /permissive- flag. The pull request addresses build errors as part of the solution and involves collaboration with several contributors.
- Parallelization of bf16 to f32 Conversion in at::addmm and Linear Kernels: This topic aims to enhance the performance of at::addmm and linear kernels by parallelizing the conversion from bf16 to f32. The pull request focuses on parallelization and vectorization of the conversion process.
- Switching TestConsistency to Use MPS Device: This topic proposes switching the TestConsistency to use the MPS device. The pull request is part of a stack of changes aimed at eventually moving decorators away from test_mps to OpDB.
- Prevention of Premature Garbage Collection in THPGenerator_reduce: This topic addresses the issue of premature garbage collection of the state tensor in the THPGenerator_reduce function. The pull request increases its reference count, preventing runtime errors when using the multiprocessing spawn methods "forkserver" and "spawn" in PyTorch.
- Resetting AOT Counter in torch._dynamo.reset Function: This topic aims to reset the AOT (Ahead-Of-Time) counter when the torch._dynamo.reset function is called. The pull request is part of a series of changes tracked by ghstack and involves multiple contributors for review and collaboration.
- Introduction of export_cache Feature in PyTorch: This topic introduces a feature called export_cache, which is a modified version of @mark_compiled_region. The pull request is designed to handle function call differentiation by input metadata in non-strict export scenarios.
- Annotation of Forward Graph Dynamic Tensor Outputs: This topic involves annotating forward graph dynamic tensor outputs with mark_dynamic in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and is currently a work in progress.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update aimed at upgrading the oneDNN library to version 3.7 without making any changes to iDeep. The pull request is not intended for merging.
- Reconstruction of WeakRefVar in Dynamo Component: This topic involves the reconstruction of the WeakRefVar in the Dynamo component of the PyTorch project. The pull request is part of a stack of changes managed by ghstack, with multiple contributors being notified for review or collaboration.
- Resolution of HSDP Custom Hook Unit Test Issues: This topic addresses the issue of HSDP custom hook unit tests being multi-threaded and using a single physical GPU. The pull request removes the device rank setting to prevent referencing the same GPU with multiple ranks.
- Draft for Stable Version of Torch Library: This topic is a draft for a stable version of the Torch library. The pull request is part of a stack of changes managed by ghstack, with the continuous integration checks intentionally skipped.
- Change of force_nn_module_property_static_shapes Flag Default Setting: This topic proposes changing the default setting of the force_nn_module_property_static_shapes flag to False. The pull request supports the dynamic shapes roadmap by reducing the number of unrolled out flags.
- Correction of Parameter and Function Descriptions in Test Package: This topic addresses inaccuracies in parameter and function descriptions within the test package of the PyTorch project. The pull request aims to correct these issues for improved clarity and accuracy.
- Enablement of Kineto for XPU: This topic aims to enable Kineto for XPU by updating the intel-pti to version 0.10.1. The pull request turns on the XPU_ENABLE_KINETO flag as part of the ongoing development in the PyTorch project.
- Replacement of unimplemented with unimplemented_v2 in torch/_dynamo/variables/constant.py: This topic proposes replacing the existing 'unimplemented' functionality with 'unimplemented_v2' in the torch/_dynamo/variables/constant.py file. The pull request is part of a series of changes tracked by ghstack and is linked to issue #147913 on GitHub.
- Data Type Checks for torch.addbmm, torch.addmv, and torch.baddbmm: This topic addresses the issue of incorrect data type checks for the output of the PyTorch functions torch.addbmm, torch.addmv, and torch.baddbmm. The pull request includes updates to ensure proper handling of these data types.
- Migration of Python Formatting Tool to ruff format: This topic aims to migrate the Python formatting tool for the torch/ao/ directory from PYFMT to ruff format. The pull request is part of a stack of changes tracked via ghstack and involves multiple commits with updates.
- Use of itertools.chain.from_iterable for Code Enhancement: This topic proposes the use of itertools.chain.from_iterable in the codebase to enhance readability, efficiency, and support for infinite iterables. The pull request is currently open for review on the PyTorch GitHub repository.
- Update of XPU Triton Build to Manylinux 2.28 Environment: This topic aims to update the continuous integration process by moving the XPU Triton build to the manylinux 2.28 environment. The pull request includes a change to use GCC version 13.
- Untracked Unbacked Symbols Handling in Conditional Statements: This topic addresses the issue of untracked unbacked symbols leaking from the true and false branches of a conditional statement in the PyTorch project. The pull request ensures that these symbols are properly identified and tracked as outputs of the while_loop operator.
- Exposure of Functions in torch_python DLL for Custom Backend: This topic aims to expose functions used in a custom backend within the torch_python DLL to improve performance. The pull request addresses issue #148208 while referencing a related discussion on symbol hiding.
- Introduction of AppendingByteSerializer Utility Class: This topic introduces a new utility class called AppendingByteSerializer to the PyTorch project. The pull request is designed to facilitate the efficient appending of sequential byte data with customizable serialization and deserialization processes.
- Performance Enhancement of save_cache_artifacts Function: This topic aims to significantly enhance the performance of the save_cache_artifacts function. The pull request redesigns the serialization algorithm and opts out of using pickle, reducing the computational expense incurred when the function is called repeatedly in internal workloads.
- Modification of Cutlass Backend for Self-Multiplication Operations: This topic aims to modify the Cutlass backend by removing an assertion that previously prevented self-multiplication operations. The pull request allows such operations to proceed without restriction.
- Expansion of addmm Test in Cutlass Backend: This topic aims to enhance the cutlass backend by expanding the addmm test to cover all four broadcastable shape biases. The pull request is part of a series of related changes tracked through a stack of pull requests.
- Modification of require_contiguous Function for Exact Strides: This topic aims to modify the require_contiguous function to necessitate exact strides rather than just the stride order. The pull request is part of a series of changes tracked by ghstack and involves multiple contributors and reviewers.
- Reenablement of Subprocess Addition Matrix Multiplication Test: This topic aims to reenable a subprocess addition matrix multiplication test in the Cutlass backend of the PyTorch project. The pull request involves multiple contributors for review and collaboration.
- Fix for Vectorized Code Generation of tanh Function: This topic addresses an issue in the PyTorch project by fixing the vectorized code generation for the tanh function. The pull request resolves this by switching to the Sleef implementation to ensure consistent outputs.
- Default USE_LIBUV to 0 for dist.init_process_group on Windows: This topic addresses an issue with the dist.init_process_group function on Windows by proposing to default USE_LIBUV to 0. The pull request includes a more informative error message to improve user experience.
- Performance Enhancement of Interpolation Operations on MPS Backend: This topic aims to enhance the performance of interpolation operations in PyTorch on the Metal Performance Shaders (MPS) backend. The pull request addresses a bug in the benchmarking script and optimizes the computation of spatial coordinates.
- Enhancement of Error Messaging for Missing Ninja Build System: This topic aims to enhance the error messaging related to missing Ninja build system in the cpp_extensions module of the PyTorch project. The pull request is indicated by the commit titled "Update ninja missing error message."
- Fix for Multiple OpenMP Runtimes Linked to libtorch_cpu.so: This topic addresses an issue in PyTorch where building with OpenBLAS support and directly linking libopenblas with libgomp.so results in multiple OpenMP runtimes being linked to libtorch_cpu.so. The pull request proposes a fix by avoiding linking against libomp.so if OpenBLAS is already linked with libgomp.so.
- Tolerance Adjustments for test_torchinductor_opinfo on AArch64: This topic addresses the failure of the test_torchinductor_opinfo test for nn.functional.triplet_margin_loss on AArch64. The pull request increases the acceptable absolute and relative tolerances (ATOL and RTOL) for this test when using F16.
- Minimum Viable Product for P1 INT16 Full Quantization Target: This topic introduces a minimum viable product (MVP) for the P1 INT16 Full quantization target. The pull request involves quantizing the input to int16 as part of the PyTorch project.
- Registration of Normal Class to register_dataclass Function: This topic aims to address a specific issue discussed in a previous pull request by registering a normal class to the register_dataclass function within the PyTorch project. The pull request is indicated by the commit message and linked discussion.
- Rank Local Checkpointing Demonstration in DCP: This topic is a work in progress aimed at demonstrating rank local checkpointing in the Distributed Checkpointing Protocol (DCP) for the PyTorch project. The pull request is not yet ready for review.
- Respecting priority_order Setting in torch.compile Path: This topic addresses the issue where the torch.compile path was not respecting the priority_order setting of sdpa_kernel. The pull request ensures that the context manager handling within torch.compile now properly acknowledges this configuration.
- Handling of Real-Tensor Fallback Failures in Dynamic Shapes: This topic addresses an issue in the PyTorch project by implementing a solution to ignore failures when the real-tensor fallback mechanism does not succeed. The pull request is part of handling dynamic shapes during export.
- Deprecation of Silent Fallback Mechanism for GEMM Tuning: This topic initiates the first stage of deprecating the silent fallback mechanism for tuning GEMM in the PyTorch project. The pull request involves the removal of a conditional block related to the eager mode implementation for int_mm.
- Reversion of copy2d Implementation for Data Transfers: This topic aims to revert a previous change that implemented the use of "copy2d" for host-to-device and device-to-host data transfers. The pull request is indicated by the original commit changeset aa7d1b82ac9d.
- Enablement of AddressSanitizer in CUDA Tests: This topic aims to enable AddressSanitizer (ASAN) in CUDA tests for the PyTorch project. The pull request involves collaboration with multiple contributors mentioned in the body, although it has not yet been merged.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project, specifically for testing purposes and not intended for merging. The pull request involves multiple contributors tagged for review or awareness.
- Optimization in PyTorch Distributed Library: This topic introduces an optimization in the PyTorch Distributed (PTD) library by allowing the use of the current compute stream as the NCCL stream when operating in async=False mode. The pull request significantly reduces CPU overhead by 50% and overall CPU/GPU time by 15% during collective communication operations.
- Removal of Assertion in expand_to_full_mesh_op_strategy Function: This topic addresses issue #147732 by removing an assertion in the expand_to_full_mesh_op_strategy function. The pull request involves several contributors for review and discussion, but it has not yet been merged.
- Fix for is_compile_supported() Function in PyTorch: This topic addresses a bug in the PyTorch project by fixing the is_compile_supported() function to correctly handle cases where the device_type includes a device index. The pull request is referenced in issue #147826.
- Setting disable_clone Parameter to True in opt_gm Function: This topic proposes setting the disable_clone parameter to True when executing the opt_gm function. The pull request addresses issue #147843 in the PyTorch project.
- Transition of mkldnn_linear Components to oneDNN MatMul: This topic aims to transition the mkldnn_linear and mkldnn_linear_backward components from using oneDNN Inner Product to oneDNN MatMul. The pull request is part of a series of changes tracked by ghstack and is currently not intended for merging.
- Performance Enhancement of gemv Operator in PyTorch: This topic aims to enhance the performance of the gemv operator in PyTorch by offloading OpenBLAS gemv calls to a dedicated OpenBLAS kernel. The pull request results in a 14% performance improvement for operations on matrices of shape 1x4096x4096.
- Testing of optree Component with Latest HEAD Version: This topic is focused on testing the optree component with the latest HEAD version in the PyTorch project. The pull request is indicated by the title and commit message and has not yet been merged.
- Introduction of Dim._OBLIVIOUS Feature in Export Dynamic Shapes: This topic introduces the Dim._OBLIVIOUS feature in export dynamic shapes and the _mark_oblivious() function in dynamo decorators. The pull request allows developers to opt into size-oblivious reasoning and avoid 0/1 specialization.
- Addition of Missing Matrix Cases in CI Setup: This topic adds missing matrix cases for the pytorch-linux-focal-py{3.12,3.13}-clang10 configuration in the continuous integration setup. The pull request references specific lines in the project's GitHub workflow files to ensure comprehensive testing coverage.
- Removal of Unnecessary Tombstone Messages from TARGETS Files: This topic aims to proactively remove unnecessary tombstone messages from TARGETS files. The pull request addresses the redundancy of these messages due to the merging of files using non_fbcode_target.
- Allowing Tensor Types in allowed_getattr_types_for_subgm: This topic addresses an issue in the PyTorch project by allowing tensor types in the allowed_getattr_types_for_subgm when verifying export processes. Previously, invalid get_attr types in non-lowerable parts of a graph caused a SpecViolationError.
- Increase of Persistent Reduction Threshold for Inductor Multikernel Flag: This topic proposes increasing the persistent reduction threshold for the inductor multikernel flag from 16 to 32. The pull request is expected to yield significant performance improvements, as demonstrated by benchmark results.
- Use of TorchFunctionMode for SDPA Dispatch in CP Feature: This topic introduces the use of TorchFunctionMode to dispatch the Scaled Dot-Product Attention (SDPA) for the CP (Checkpointing) feature in the PyTorch project. The pull request is indicated by the title and the associated commit.
- Triggering of MI300-Specific CI Workflows on PRs: This topic aims to enable the triggering of MI300-specific continuous integration workflows on pull requests. The pull request uses a PR label, with a temporary workaround via the ciflow/unstable label.
- Opportunity Finder Feature for GEMM Horizontal Fusion Search: This topic introduces an "opportunity finder" feature within the inductor for General Matrix Multiply (GEMM) horizontal fusion search. The pull request includes a detailed test plan for local reproduction and performance benchmarking on a GPU.
- Handling of Partial and Scalar Values in PyTorch: This topic addresses a specific issue related to the handling of partial and scalar values in the PyTorch project. The pull request involves collaboration with multiple contributors, although it is not intended to be merged at this time.
- Enablement of cpu_offload Feature for _distribute_state_dict Function: This topic aims to enhance the PyTorch project by enabling the cpu_offload feature for the _distribute_state_dict function. The pull request is part of an ongoing effort to address a specific issue and involves collaboration with multiple contributors.
- Update of basic.TestSqueeze for 0-Dimensional Squeeze Operations: This topic aims to update the basic.TestSqueeze by replacing a TODO with a test for 0-dimensional squeeze operations. The pull request ensures that scalars remain unchanged as part of the PyTorch project.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is marked as a test and not intended for merging.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update that aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is not intended to be merged.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test and not intended for merging, aiming to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors for review and feedback.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test and not intended for merging, aiming to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update that aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is not intended for merging as indicated by the title and the lack of a specific issue number in the body.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Test-Related Update for PyTorch Project: This topic is a test-related update for the PyTorch project, as indicated by the title '[TEST]' and the test plan mentioned in the body. The pull request includes a single commit with a differential revision reference, but it has not yet been merged.
- Update of Protobuf Dependency to Version 5.29: This topic aims to update the Protobuf dependency to version 5.29 in the PyTorch project. The pull request successfully addresses CMake build compatibility while expressing uncertainty about resolving issues with Bazel builds.
- Order Maintenance in ElasticDistributedSampler: This topic addresses an issue in the ElasticDistributedSampler where the order of indices is not maintained correctly when the start_index is not zero. The pull request ensures that training resumes from the correct point if a job is restarted.
- Extension of CUDA Test to Include XPU SyclExtension Case: This topic extends the existing CUDA test to include the XPU SyclExtension case for the py_limited_api feature. The pull request cannot be merged until the commit pin for torch-xpu-ops is updated.
- Documentation Update for py_limited_api Feature in SyclExtension: This topic updates the documentation to align the description of the py_limited_api feature in the SyclExtension with the existing descriptions for CPP and CUDA. The pull request addresses a previously missed change due to concurrent work on the SyclExtension.
- CI Workflow Modification to Avoid Workspace Cleaning: This topic addresses an issue in the continuous integration process by modifying the workflow to avoid cleaning the workspace when fetching the repository. The pull request is indicated by the title and the reference to a specific issue number.
- Default Generation of AOTI Size and Stride Input Checks: This topic introduces a default generation of AOTI (Ahead-Of-Time Inductor) size and stride input checks in the PyTorch project. The pull request ensures these checks are only executed when the AOT_INDUCTOR_DEBUG_COMPILE environment variable is set.
- Reversion of Triton Call in Worker Process: This topic aims to revert a previous change that involved calling Triton in the worker process and compiling ahead of time. The pull request is indicated by the title and commit message referencing the original changeset and differential revision.
- Reversion of Triton Call in Worker Process: This topic aims to revert a previous change that involved calling Triton in the worker process and compiling ahead of time. The pull request is indicated by the original commit changeset 5e70e713d95b and the associated differential revision D70210584.
- Skipping of test_reference_numerics_large_jiterator_unary_cuda_complex64 Test: This topic proposes to skip the test_reference_numerics_large_jiterator_unary_cuda_complex64 test on CUDA due to a change in recent NumPy versions that alters the convention from nan+infj to -inf+infj, similar to a previous skip on ROCm.
- Gradient Computation Corner Case in torch.nn.functional.hardswish: This topic addresses a corner case in the gradient computation of torch.nn.functional.hardswish. The pull request modifies the condition for gradient calculation, enabling CUDA support for the test test_hardswish_grad_corner.
- Modification of torch.celu and quantized_celu Operations: This topic addresses issue #148065 by modifying the torch.celu and quantized_celu operations to return the input directly when alpha is set to infinity. The pull request ensures that the function celu(x, inf) is well-defined for all input values x.
- Replacement of unimplemented with unimplemented_v2 in codegen.py: This topic aims to address issue #147913 by replacing the unimplemented function with unimplemented_v2 in the codegen.py file. The pull request removes the unused import of unimplemented.
- Compatibility Fix for Atomic Operations on ARMv8-A Platforms: This topic addresses compatibility issues with atomic operations on ARMv8-A platforms, such as the Raspberry Pi 4. The pull request adjusts the compilation flags to use -march=armv8-a+sve, ensuring that PyTorch builds correctly without generating unsupported instructions.
- NameError Fix in PyTorch Test Suite: This topic addresses a NameError in the PyTorch project's test suite by providing a dummy DataType. The pull request ensures syntactical correctness when the TEST_TENSORBOARD flag is set to False.
- Consistency and Brevity Enhancement in Test Code: This topic aims to enhance the consistency and brevity of the test code in the PyTorch project. The pull request utilizes the existing load_torchbind_test_lib function to replace multiple slightly different implementations of repeated code.
- OpenMP Flag Parsing Fix for clang-cl on Windows: This topic addresses the issue of incorrect OpenMP flag parsing by clang-cl on Windows. The pull request ensures that MSVC-style arguments are used and clang-style arguments are properly prefixed with -Xclang.
- Experimental Change for C++ Wrapper with CUDA Graphs: This topic is an experimental change aimed at measuring the impact of integrating a C++ wrapper with CUDA graphs in the PyTorch project. The pull request is indicated by its title and the associated commit.
- Integration of myst_nb Plugin into PyTorch Documentation: This topic proposes the integration of the myst_nb plugin into the PyTorch documentation. The pull request enables the rendering of Jupyter notebooks and execution of code blocks within markdown documents.
- Output Memory Planning for ATen Convolution Operation: This topic aims to enable the ATen convolution operation to plan its output memory for potential fusion opportunities. The pull request is part of the lowerings process in the PyTorch project.
- Enhancement of OrderedPreservingDictTest.test_range_insert Functionality: This topic enhances the OrderedPreservingDictTest.test_range_insert by incorporating functionality to check key-value pair indexing and order. The pull request addresses an unused loop variable issue.
- Setting of force_parameter_static_shapes Parameter to False: This topic involves setting the parameter force_parameter_static_shapes to False in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and has not yet been merged.
- Disabling of cuDNN During Export Tracing for Batch Normalization: This topic addresses the issue of ConstraintViolation errors in the batch normalization operation by disabling cuDNN during export tracing. The pull request prevents the creation of problematic guards.
- Removal of Outdated CUDA Version Checks: This topic proposes the removal of outdated CUDA version checks from the PyTorch project. The pull request is based on the framework now requiring a minimum CUDA version of 11.
- Transformation of UnpackedDualTensor into Namedtuple: This topic aims to enhance the PyTorch project by transforming the UnpackedDualTensor into a true namedtuple. The pull request is part of a series of related changes tracked through the ghstack tool.
- Incorporation of Python 3.9 Typing Features: This topic aims to update the PyTorch project by incorporating Python 3.9 typing features. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Draft for Addressing Specific Issue in PyTorch Project: This topic is a draft aimed at addressing a specific issue in the PyTorch project. The pull request is indicated by the placeholder '#ISSUE_NUMBER' and involves collaboration with multiple contributors.
- Upgrade of oneDNN Submodule to Version 3.7 with PDB Build Focus: This topic aims to upgrade the submodule oneDNN to version 3.7 in the PyTorch project. The pull request focuses on building PDB with the Z7 option and is currently not intended for merging.
- Enhancement of Code with Docstrings and Type Annotations: This topic focuses on enhancing the code by adding comprehensive docstrings and implementing proper type annotations. The pull request also introduces a class method for context retrieval and improves overall code organization.
- Introduction of HPU Profiler Activity: This topic introduces a new profiler activity specifically for HPU (Habana Processing Unit) devices. The pull request addresses issue #148181 in the PyTorch project.
- Fix for FlexibleLayout Weights in Batch Matrix Multiplication: This topic addresses an issue in the PyTorch project where an error occurs with FlexibleLayout weights in Batch Matrix Multiplication (BMM). The pull request potentially alters node B's layout during a specific kernel selection process.
- Distributed Data Handling for Hugging Face Readers and Writers: This topic introduces the capability for Hugging Face (HF) readers and writers to handle data in a distributed manner. The pull request ensures that all tensors intended for the same file are directed to the same rank.
- Enhancement of Cache Size Limit Error Message: This topic aims to enhance the cache size limit error message by including the configured limit size. The pull request provides more informative feedback when the cache size limit is reached.
- Dispatch Logic Update for BF16 Linear Layers: This topic updates the dispatch logic for linear layers using BF16 in the PyTorch project. The pull request utilizes oneDNN instead of OpenBLAS, based on profiling results on AArch64.
- Refactoring of Estimate Runtime and Pick Loop Order Heuristics: This topic involves refactoring the code by moving the estimate runtime and pick loop order heuristics into the choices.py file. The pull request is part of an ongoing effort to reorganize similar elements within the scheduler.
- Fix for [No available kernel] Error with cuDNN on A100 GPUs: This topic addresses a '[No available kernel]' error encountered with cuDNN on A100 GPUs. The pull request is part of a stack of changes tracked via ghstack and involves multiple contributors for review and collaboration.
- Support for Dilation in max_pool2d Lowering Process: This topic introduces support for dilation in the max_pool2d lowering process within the PyTorch project. The pull request is part of a stack of changes aimed at enhancing the functionality of the inductor component.
- Lowerings for max_pool3d Function in PyTorch: This topic introduces lowerings for the max_pool3d function in the PyTorch project. The pull request is part of a stack of changes and is currently open with a single commit linked to it.
- Addition of kBatch_sweep Option in ROCm Configuration: This topic introduces a new feature to the ROCm configuration by adding a kBatch_sweep option. The pull request allows users and tests to specify a set of kBatches to evaluate.
- Disabling of Torch Check for Float8_e5m2 Matrix Multiplication on ROCm: This topic proposes disabling the torch check for the multiplication of two Float8_e5m2 matrices on ROCm. The pull request includes a test command for verification on ROCm hardware that supports fp8.
- Fix for Logging Mechanism to Prevent Maximum Recursion Error: This topic addresses an issue in the PyTorch project by fixing the logging mechanism to prevent a maximum recursion error. The pull request is detailed in the test plan and associated with differential revision D70416613.
- Fallback Mechanism for JK Error on Platform Without Service Network: This topic addresses an issue where a "jk error" occurs on a platform lacking a service network. The pull request implements a fallback mechanism when JK is disabled.
- Enhancement of qlinear_pointwise_binary Fusion Process: This topic aims to enhance the qlinear_pointwise_binary fusion process by enabling dimension collapse for 3D linear cases. The pull request specifically targets the qlinear+add path with sum as a post-operation.
- Modification of TensorMaker::make_tensor() Function: This topic addresses issue #146419 by modifying the TensorMaker::make_tensor() function to set the requires_grad attribute. The pull request is currently open for review on the PyTorch GitHub repository.
- Addition of Recursive Glob Support to setuptools: This topic aims to enhance the build process by adding recursive glob support to setuptools in the PyTorch project. The pull request ensures that all necessary files are included during the setup.
- Update of 'fmt' Submodule to Version 11.1.4: This topic aims to update the 'fmt' submodule to version 11.1.4 in the PyTorch project. The pull request primarily addresses bug fixes, ABI fixes, and improvements in compiler support.
- Hot Fix for Inductor Component Following Changes in Pull Request #148011: This topic addresses a hot fix for the Inductor component following changes made in pull request #148011. The pull request involves multiple contributors for review and collaboration.
- Fix for Include Directories with Spaces on Windows: This topic addresses a bug in the PyTorch project where include directories containing spaces on Windows systems cause errors during execution. The pull request implements a fix that ensures paths are correctly handled without being split.
- CMake and RowwiseScaledMM.cu File Updates for SM10.0a Architecture: This topic updates the CMake files and the RowwiseScaledMM.cu file to enable building on the SM10.0a architecture. The pull request ensures compatibility with CUDA toolkit 12.8.
- Fix for Test Errors in aot_inductor_package: This topic addresses test errors in the aot_inductor_package by ensuring that script.ld is copied to the build-time directory. The pull request fixes the fbcode test failures introduced by a previous pull request.
- New Test for Layernorm CUDA Backwards Pass Accuracy: This topic introduces a new test to ensure the accuracy of the layernorm CUDA backwards pass. The pull request serves as a foundational step towards future performance improvements.
- Upgrade of oneDNN Submodule to Version 3.7: This topic aims to upgrade the oneDNN submodule to version 3.7 in the PyTorch project. The pull request brings various performance improvements and optimizations for convolution and matrix multiplication primitives on Intel Xeon processors.
- Test for Code Base of Previous Pull Request: This topic is a test for the code base of a previous pull request (https://github.com/pytorch/pytorch/pull/147498) in the PyTorch project. The pull request aims to build the Windows binary and test the test_mkldnn.py.cc file.
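For the eager_then_compile item near the top of this list, a hypothetical usage sketch: torch.compiler.set_stance exists as of PyTorch 2.6, but the "eager_then_compile" stance is what the pull request proposes, so the exact name and behavior may differ from what lands.

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + x

# Hypothetical: the stance string below is the PR's proposal and may not be
# accepted by released builds of torch.compiler.set_stance.
torch.compiler.set_stance("eager_then_compile")

f(torch.randn(8))   # would run eagerly while input dynamism is observed
f(torch.randn(16))  # later calls compile using the derived dynamic shapes
```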
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 249