Weekly GitHub Report for PyTorch: February 24, 2025 - March 03, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default behavior of `torch.load`.
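As a quick illustration of two of these changes, the hedged sketch below shows `torch.compiler.set_stance` switching a compiled function back to eager execution and the stricter `torch.load` default; the specific stance name and the `weights_only=True` default are assumptions drawn from the 2.6 release notes rather than details taken from this report.

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + x

# Assumed 2.6 API: tell the compiler to skip compilation and run eagerly.
torch.compiler.set_stance("force_eager")
print(f(torch.randn(4)))
torch.compiler.set_stance("default")

# torch.load is assumed to default to weights_only=True in 2.6: plain tensors
# and state dicts still load, but arbitrary pickled objects need an explicit
# (and less safe) opt-out.
torch.save({"w": torch.randn(2, 2)}, "ckpt.pt")
safe_state = torch.load("ckpt.pt")                        # new default behavior
legacy_state = torch.load("ckpt.pt", weights_only=False)  # old behavior
```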
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL: This issue is a feature request for a stream management API in PyTorch's Process Group NCCL to address the challenges of asynchronous communication, which can lead to a "read-before-write" issue due to each NCCL process group operating on its own dedicated stream. The proposal suggests allowing users to define or set streams for process groups to ensure proper execution order of collective operations, thereby enabling better overlap with compute operations and improving performance in multi-data center training scenarios.
- The comments discuss related issues and potential solutions, including using existing APIs with user-defined streams to manage synchronization and execution order. Some contributors suggest that the current behavior is not a bug but a user responsibility to manage synchronization, while others propose changes to improve user experience and documentation. There is also mention of ongoing work to address these concerns in future updates.
- Number of comments this week: 15
- Checkpoint doesn't work with torch_function if torch_function change tensor metadata: This issue highlights a problem with the PyTorch checkpoint functionality when used in conjunction with `TorchFunctionMode`, specifically when `__torch_function__` alters tensor metadata, leading to a metadata mismatch error during the backward pass. The user provides a minimal reproducible example demonstrating the error and seeks advice on whether there is a way to make `__torch_function__` compatible with checkpointing.
- The comments discuss potential reasons for the issue, such as the violation of conditions due to metadata changes and the absence of `TorchFunctionMode` during recomputation. Suggestions include manually activating `TorchFunctionMode` within the checkpointed function and considering changes to checkpoint logic to detect and re-enable modes during recomputation (a minimal sketch of this workaround appears after this list). There is also a discussion about the feasibility of implementing these changes and alternative approaches to avoid the problem.
- Number of comments this week: 9
- FlexAttention compiled has illegal memory access or device-side assert even though all tensors are contiguous: This issue involves a bug in the PyTorch library where the FlexAttention module, when compiled, encounters illegal memory access or device-side assertions despite all tensors being contiguous. The problem arises under specific parameter configurations, and while a workaround involving padding the `rel_bias` tensor can mitigate the illegal memory access, it does not resolve the device-side assertion issue.
- The comments discuss potential causes and solutions for the issue, including code modifications to address out-of-bounds errors and the need for conditional evaluation of the `score_mod` function. A workaround involving padding is mentioned, and the issue is acknowledged as a known problem with suggestions for further investigation.
- Number of comments this week: 6
- [inductor][cpu]AOT inductor AMP static shape default wrapper occupied almost 3x disk than before: This issue reports a significant increase in disk usage when using the AOT Inductor AMP static shape default wrapper, which is observed to occupy almost three times more disk space than before, specifically when running the ResNet50 model. The problem is suspected to be caused by a specific commit, and the user is seeking assistance to identify and address the root cause of this behavior.
- The comments discuss the need to identify the commit responsible for the increased disk usage, with suggestions to manually search for the guilty commit. A specific commit is identified as potentially responsible, and further investigation is requested from other contributors to understand the cause and determine if a fix is necessary, considering upcoming changes that might resolve the issue.
- Number of comments this week: 6
- No gradient for `residuals` in the return value of `torch.linalg.lstsq`: This issue highlights a concern with the `torch.linalg.lstsq` function in PyTorch, where the `residuals` in the return value do not have a gradient, unlike the `solution`. The user is questioning whether this behavior is expected and is considering contributing a pull request to address it by modifying the autograd functionality.
- The comments discuss the inefficiency of manually computing gradients for `residuals` and suggest that the current API design is flawed (see the sketch after this list for the manual recomputation workaround). A user proposes a code solution and asks for feedback on submitting a pull request, while another commenter advises against changing the API, suggesting that users implement their own solutions if needed.
- Number of comments this week: 5
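For the checkpointing issue above, the following hedged sketch illustrates the workaround discussed in the comments, namely re-entering the `TorchFunctionMode` inside the checkpointed function so it is also active during recomputation; the mode shown here is a trivial stand-in and does not reproduce the metadata-changing behavior from the actual report.

```python
import torch
from torch.overrides import TorchFunctionMode
from torch.utils.checkpoint import checkpoint

class PassthroughMode(TorchFunctionMode):
    """Trivial stand-in mode; the real issue involves a mode that edits metadata."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

def fn(x):
    # Workaround sketch: re-activate the mode inside the checkpointed region so
    # it is also present during the backward-pass recomputation.
    with PassthroughMode():
        return (x * 2).sin()

x = torch.randn(4, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False)
out.sum().backward()
print(x.grad)
```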
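And for the `torch.linalg.lstsq` issue, this minimal sketch shows the manual workaround mentioned in the comments: recomputing the residuals from the differentiable `solution` so gradients flow. The shapes are illustrative assumptions, not taken from the report.

```python
import torch

A = torch.randn(6, 3, requires_grad=True)
b = torch.randn(6, 2)

result = torch.linalg.lstsq(A, b)
# Per the issue, the `residuals` field of the result does not carry a gradient
# (and with some drivers it is not populated at all), so the workaround is to
# recompute the squared residuals from the differentiable `solution`.
residuals = ((A @ result.solution - b) ** 2).sum(dim=0)
residuals.sum().backward()
print(A.grad.shape)  # gradients now reach A through the recomputed residuals
```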
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 125
Summarized Issues:
- Compilation and Runtime Errors in PyTorch: Compilation and runtime errors are prevalent in PyTorch, affecting various functions and modules. Issues include failures in compiling functions like `flex_attention` and `Dropout` with `SequenceParallel`, and runtime errors due to improper handling of tensor types or device assignments. These errors often suggest trying different PyTorch versions or involve specific contributors for resolution.
- Bugs in PyTorch Functions and Modules: Various bugs are reported in PyTorch functions and modules, such as incorrect gradient calculations, inconsistent outputs, and failures in specific operations. These bugs affect functions like `torch._check`, `torch.distributed.context_parallel`, and `torch.nn.functional.hardswish`, often requiring workarounds or patches to address the issues.
- ONNX Export and Dynamic Shape Issues: Exporting PyTorch models to ONNX format often encounters issues, particularly with dynamic shapes and unsupported operations. Problems include errors with slicing operations on complex tensors and failures due to unsupported operators, necessitating workarounds or downgrades to previous PyTorch versions.
- Performance Discrepancies and Regressions: Performance discrepancies and regressions are noted in PyTorch, affecting operations like matrix multiplication and data parallel training. These issues highlight slower performance in certain configurations or backends, prompting investigations into optimization and efficiency improvements.
- Inductor and Backend Inconsistencies: Inconsistencies between the Inductor backend and eager execution mode are reported, affecting functions like `torch.slice_scatter` and `torch.cdist`. These inconsistencies often lead to assertion errors or incorrect outputs, requiring adjustments to ensure consistent behavior across backends (a minimal reproduction pattern is sketched after this list).
- Dynamic Control Flow and Compilation Errors: Dynamic control flow and compilation errors are prevalent in PyTorch, particularly when using `torch.compile` with complex models. These errors often involve unsupported operations or internal assertion failures, necessitating debugging and potential code modifications.
- Documentation and Usability Improvements: Several issues highlight the need for documentation and usability improvements in PyTorch, such as clarifying function arguments and enhancing profiling capabilities. These improvements aim to provide clearer guidance and better support for users, particularly in distributed and performance-critical scenarios.
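For the backend inconsistency category above, the sketch below shows the typical pattern used to reproduce and report an eager-versus-Inductor mismatch; the use of `torch.cdist` is only an example drawn from the grouping, not a confirmed failing case.

```python
import torch

def fn(x, y):
    return torch.cdist(x, y)

x, y = torch.randn(8, 3), torch.randn(5, 3)

eager_out = fn(x, y)
compiled_out = torch.compile(fn, backend="inductor")(x, y)

# Reports of this kind usually boil down to a failing closeness check between
# the two backends; a passing check means the behavior is consistent.
torch.testing.assert_close(eager_out, compiled_out)
print("eager and inductor outputs match")
```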
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 35
Summarized Issues:
- Triton Integration Issues: The integration of the Triton library with PyTorch has led to multiple issues, including widespread unit test failures due to an AttributeError from an API deprecation and memory access violations on the ROCm platform. These problems highlight the challenges of maintaining compatibility across different platforms and the need for careful management of library updates.
- PyTorch Backend and API Bugs: Several bugs have been identified in PyTorch's backend and API, such as crashes on Apple's MPS backend with non-contiguous tensors and issues with the `OffsetBasedRNGTracker` being limited to CUDA. These bugs indicate the need for more robust backend support and flexible API design to accommodate various hardware and use cases.
- FSDP and GradScaler Issues: The Fully Sharded Data Parallel (FSDP) module and GradScaler have encountered issues, such as index errors when called with zero arguments and non-functioning mixed precision on Intel ARC A770 GPUs. These issues suggest the need for more flexible argument handling and better support for diverse hardware configurations.
- API and Export Functionality: There are issues with PyTorch's API and export functionality, including the need for improved constant marking in `torch.export` and problems with `torch._dynamo.mark_dynamic` causing compilation errors (a minimal usage sketch appears after this list). These highlight the importance of clear API documentation and robust export mechanisms to prevent errors during model deployment.
- Compilation and Build Errors: Compilation and build errors have been reported, such as assertion errors with `torch.norm` and build failures with CUDA 12.6 and 12.8 on Amazon Linux. These issues underscore the challenges of maintaining compatibility with evolving compiler and library versions.
- Performance and Optimization Issues: Performance issues have been identified, such as slowdowns in templated GEMMs due to memory allocator lock contention and the inability to use GEMM templates in the LLaMA model. These issues highlight the need for efficient memory management and support for advanced quantization techniques to optimize performance.
- Testing and Error Handling: Various testing and error handling issues have been reported, including a `SubgraphLoweringException` during inductor compilation and a `ResourceWarning` from the `tempfile` module. These issues emphasize the importance of comprehensive testing and error handling to ensure reliable software behavior.
- Operator and Mode Conflicts: Conflicts have arisen from operator registration and mode handling, such as a `RuntimeError` from operator name conflicts and incorrect mode mutation in `auto_functionalization`. These issues suggest the need for careful management of operator namespaces and mode handling to prevent conflicts.
- Documentation and Compliance: Documentation errors and compliance issues have been identified, such as a mistake in the `replace_pattern` docstring and the need for FIPS compliance in hashlib usage. These highlight the importance of accurate documentation and adherence to security standards.
- Model and Inference Issues: Model inference issues have been reported, such as the Detection Transformer failing with a batch size of 1 and non-contiguous outputs causing errors in Triton kernels. These issues indicate the need for robust model handling and inference support across different configurations.
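As referenced in the export-related grouping above, here is a hedged sketch of how `torch._dynamo.mark_dynamic` is typically used; the function and input are placeholders and do not correspond to the code in the closed issue.

```python
import torch
import torch._dynamo

def fn(x):
    return x.sum(dim=0)

x = torch.randn(8, 16)

# Ask the compiler to treat dimension 0 as dynamic rather than specializing on
# the observed size of 8; the closed issue concerned compilation errors that
# this marking could trigger in some models.
torch._dynamo.mark_dynamic(x, 0)
out = torch.compile(fn)(x)
print(out.shape)
```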
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in the 'Other Pull Requests' section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 169
Key Open Pull Requests
1. [test] Linter docker image: This pull request is focused on testing a linter Docker image for the PyTorch project, as indicated by the title "[test] Linter docker image," and it aims to address a specific issue referenced as #ISSUE_NUMBER, although it has not yet been merged.
- URL: pull/147789
- Merged: No
- Associated Commits: 4545e, ff328, c3f77, c65fb, 28551, 5488f, d143c, eb7af, dc5d2, 3f852, 029b1, 8e0c1, d9995, 2a3d9, a1f32, cfec8, 111b7, 821f8, d698d, 7a52c, a95d2, 14bcd, e92fb, cf903, 15d46, 1ce32, d6df9, 4fcbc, 097fd, f491b, 74240
2. Add note to get start xpu: This pull request aims to add a note to the PyTorch documentation to inform users that installing PyTorch from binaries will automatically include Intel® Deep Learning Essentials runtime packages, which could lead to environment issues if oneAPI is activated in a standalone installation, and thus advises users to avoid this situation.
- URL: pull/148168
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 055f5
3. [pytree] add another simplified pytree module `torch.pytree`: This pull request introduces a new simplified module `torch.pytree` to the PyTorch library, which aligns more closely with the JAX pytree API by removing the `tree_` prefix from its functions and reversing the argument order of the `unflatten` function to enhance compatibility with `functools.partial`, while ensuring no backward compatibility issues since it is a completely new module.
- URL: pull/148180
- Merged: No
- Associated Commits: 3ae64, 51232, c5f43, c3620, 08aff, 4e773, 308f6, 00a64, f6206, c93a2, 4b8ce, b8ead, c7e5b, ef3ad, 0518f, 7c8ae, a5d05
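Since the proposed `torch.pytree` module is new and not yet merged, the sketch below uses the existing private `torch.utils._pytree` helpers to show the kind of flatten/unflatten round trip the pull request intends to simplify; the public names and argument order described for `torch.pytree` come from the PR description and remain subject to change.

```python
import torch
from torch.utils import _pytree as pytree  # existing private module

params = {"w": torch.ones(2), "layers": [torch.zeros(3), torch.ones(1)]}

# Today: tree_-prefixed names, with unflatten taking (leaves, spec).
leaves, spec = pytree.tree_flatten(params)
rebuilt = pytree.tree_unflatten([leaf * 2 for leaf in leaves], spec)
print(rebuilt["w"])

# The PR proposes torch.pytree.flatten / torch.pytree.unflatten, dropping the
# tree_ prefix and reversing unflatten's arguments to (spec, leaves) so it
# composes with functools.partial, JAX-style.
```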
Other Open Pull Requests
- Re-merge of Previously Reverted Pull Request: This pull request is a re-merge of a previously reverted pull request (#144974) in the PyTorch project. It involves recreating the branch and updating several commits, as indicated by the multiple "Update" and "base update" commit messages.
- Configurable Record and Storage Alignment in torch.save: This pull request aims to make the record and storage alignment in the `torch.save` function configurable. It allows users to adjust how data is stored and aligned, as part of a series of changes tracked through the ghstack tool.
- Eager Then Compile Stance: This pull request introduces a new `eager_then_compile` stance through the set_stance APIs. It aims to reduce compile times and improve the ergonomics of using dynamic shapes by initially running in eager mode before compiling.
- Support for contextlib.suppress: This pull request aims to add support for the `contextlib.suppress` feature in the PyTorch project. It is part of a series of related changes tracked by the ghstack tool.
- Enhancements in FakeTensorMode: This pull request aims to enhance the PyTorch library by adding information about checkpoint offsets to untyped storages when using the `torch.load` function under `FakeTensorMode`. It is part of a series of updates tracked through the ghstack tool.
- Threading boxed_forward_device_index: This pull request addresses the issue of correctly threading the `boxed_forward_device_index` from `graph_kwargs` to `CompiledFXGraph.post_compile`. It ensures accurate updates of `BoxedDeviceIndex` from cache hits.
- Performance Enhancements in Symbolic Tracing: This pull request aims to enhance the performance of the PyTorch project by moving the `Node._prepend` and `Node._remove_from_list` methods to C++. It results in a significant reduction in function calls and execution time during microbenchmarking of symbolic tracing operations.
- Loading FakeTensors with Correct Devices: This pull request addresses the issue of loading FakeTensors with the correct devices under FakeTensorMode in PyTorch. It fixes the functions _rebuild_tensor_v2 and _rebuild_tensor_v3 as part of a series of related changes tracked through ghstack.
- torch.serialization.skip_data Compatibility: This pull request aims to enhance the PyTorch library by enabling the `torch.serialization.skip_data` functionality to be compatible with the `torch.load` method (a brief usage sketch appears after this list). It is part of a series of updates tracked through the ghstack tool.
- Propagation Strategy in Inductor Pattern Matcher: This pull request proposes a change in the propagation strategy of `arg_kwarg_vals` within the PyTorch Inductor Pattern Matcher. It aims to trace replacement graphs using `arg_kwarg_vals` instead of `node.meta['val']`.
- needs_exact_strides Operator Tag: This pull request introduces a new operator tag, "needs_exact_strides," for the Inductor component in the PyTorch project. It enforces exact strides on custom operators, with plans to make this behavior the default in a subsequent update.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: This pull request introduces a new templatized CUDA kernel designed to replace three existing non-ROCm CUDA kernels for the GammaBeta backwards pass. It addresses performance issues by optimizing for warp shuffles, coalesced loads, and parallelism across the `M` dimension.
- Regression Issue in evaluate_expr Function: This pull request addresses a regression in the `evaluate_expr` function, which was caused by a previously added logging argument that disrupted cache lookups. It refactors the code to eliminate the use of `expr_sym_node_id` in cache lookups.
- Renaming arg_kwarg_vals Key for Clarity: This pull request proposes renaming the key `node.meta["arg_kwarg_vals"]` to `node.meta["orig_arg_kwarg_vals"]` in the codebase to improve clarity. It is accompanied by an explanatory comment to prevent confusion.
- Optimization by Removing Unnecessary Operations: This pull request aims to optimize the PyTorch codebase by removing unnecessary tensor clone operations and redundant variable calls. It also adds argument comments for clarity.
- Selective Build at Optimization Level O1: This pull request aims to selectively build code at optimization level O1 as part of a series of changes. It includes tasks like porting improvements to `cpp_wrapper` mode and addressing CMake packaging issues.
- Support for Rowwise Scaling in Scaled GEMM: This pull request introduces support for rowwise scaling in scaled GEMM operations. It includes various enhancements such as fixes for offline tuning and updates to online unit tests.
- Refactoring Sharding Propagation Mechanism: This pull request refactors the sharding propagation mechanism in the PyTorch project to handle cross mesh computations. It moves the same mesh check from the sharding propagation level to each individual operator level.
- Registration Process for at::_weight_int4pack_mm_with_scale_and_zeros: This pull request aims to facilitate the registration process for the `at::_weight_int4pack_mm_with_scale_and_zeros` function in the PyTorch project. It is part of a stack of changes managed through the ghstack tool.
- Stride Consistency in while_loop Operation: This pull request proposes a change to the PyTorch project, specifically requiring that the stride in a `while_loop` operation remain consistent with the input for the `body_fn` function. It is detailed in the commits and discussed among several contributors.
- Preserving Torch Function Mode Stack During Recompute: This pull request addresses the issue of preserving the torch function mode stack during recompute in `torch.utils.checkpoint`. It ensures that the TorchFunctionModeTLS remains active even when `.backward()` is called.
- Utility Function for Normalizing Args and Kwargs: This pull request introduces a new utility function, `torch._library.utils.normalize_args_kwargs`, which standardizes the (args, kwargs) to align with the PyTorch dispatcher calling convention. It includes new tests to validate this functionality.
- Support for XPU Device in LayerNormKernel: This pull request aims to enhance the PyTorch project by adding support for the XPU device to the LayerNormKernel. It is in collaboration with the Intel torch-xpu-ops project.
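Related to the `torch.serialization.skip_data` item above, the following hedged sketch shows the save-side use of the context manager that the pull request aims to make compatible with `torch.load`; the availability and exact behavior of `skip_data` in your installed version are assumptions, and the load side is intentionally not demonstrated since that is what the PR adds.

```python
import torch

state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}

# Assumed existing API: write the checkpoint structure without the actual
# storage bytes, leaving the data to be filled in separately.
with torch.serialization.skip_data():
    torch.save(state_dict, "skeleton.pt")

# Loading such a "skeleton" checkpoint is what the open pull request aims to
# support.
```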
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in the 'Other Pull Requests' section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 233
Key Closed Pull Requests
1. Make Tensor.set_ validate storage_offset when sizes/strides are unchanged: This pull request aims to enhance the `Tensor.set_` function in the PyTorch library by adding validation for the `storage_offset` parameter when the sizes and strides of a tensor remain unchanged, ensuring more robust and error-free tensor operations.
- URL: pull/147354
- Merged: No
- Associated Commits: d5583, 9677c, 56c4c, 2db7e, 649dd, 439a6, 5eb7f, 4d08c, 6be5a, 8e4de, fa37c, b73cb, 5b7f8, 38f65, 72410
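To make the scope of that change concrete, here is a small hedged sketch of the `Tensor.set_` call pattern the validation applies to; the shapes, offset, and use of an untyped storage are illustrative assumptions rather than code from the pull request.

```python
import torch

src = torch.arange(10, dtype=torch.float32)
view = torch.empty(0)

# set_(source, storage_offset, size, stride): point `view` at an existing
# storage starting two elements in. The closed PR adds validation of the
# storage_offset even when sizes and strides are left unchanged.
view.set_(src.untyped_storage(), 2, (4,), (1,))
print(view)  # elements 2..5 of the original buffer
```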
2. Validate inputs to _nested_view_from_buffer to prevent overflows: This pull request aims to enhance the robustness of the PyTorch codebase by validating inputs to the `_nested_view_from_buffer` function to prevent potential buffer overflow issues, as evidenced by multiple updates and commits addressing this concern.
- URL: pull/147356
- Merged: No
- Associated Commits: 0aae8, 9856c, af85a, e5e15, 420f1, 9ce79, b6578, 1a53d, b604b, f54f4, fef6d, c07cd, f3c46, 119c3
3. torch._scaled_mm with MXFP8: This pull request introduces blockwise MXFP8 support to the `torch._scaled_mm` function for devices with CUDA capability 10.0 and higher, enabling dispatch to a blockwise kernel from cuBLAS when the scales for matrices A and B are of dtype `torch.float8_e8m0fnu`, and includes tests for basic functionality, such as numerics of simple matrices and end-to-end quantization with GEMM, while noting that MXFP4 support will be addressed in a future update.
- URL: pull/147548
- Merged: No
- Associated Commits: 503dd, 0c9c9, 61149, 9478e, d050b, be8f6, a8d24, 49752, 2079b, abcb3, 94933, ed495, b4324
Other Closed Pull Requests
- Error Message Improvements: Several pull requests focus on enhancing error messages in the PyTorch project. These include improving the readability of graph break error messages and adding explicit error messages for small tensor sizes in the FlexAttention module. Additionally, there are efforts to introduce generic graph break hints and context manager debug information, although some of these changes were not merged.
- Graph and Tensor Manipulations: Various pull requests address issues related to graph and tensor manipulations in PyTorch. These include fixing a bug in the "reshape -> scaled mm -> reshape" pattern, optimizing the lifting of ID_GUARDED tensors, and extracting `codegen_unbacked_symbol_defs` for handling subgraphs with unbacked symbolic integers.
- Performance Optimizations: Several pull requests focus on optimizing performance in the PyTorch project. These include implementing the `masked_fill_scalar` function as a shader for MPS, updating `ck_conv_template` code generation for ROCm CK kernels, and optimizing outer loop fusion heuristics.
- Bug Fixes and Enhancements: Multiple pull requests address various bugs and enhancements in the PyTorch project. These include fixing a crash issue with `torch._C.ScriptFunction`, addressing a broken int8 WoQ GEMM AMX implementation, and resolving an overflow issue in the `checkInBoundsForStorage` function.
- Unmerged and Experimental Changes: Some pull requests involve unmerged or experimental changes. These include introducing a sourceless builder for `types.MethodType`, enabling SDPA functionality on Intel GPUs, and testing penetration attempts, none of which were intended for merging.
- Module and API Adjustments: A few pull requests focus on module and API adjustments. These include fixing the register constant functionality, ensuring proper export of classes from `torch.utils.tensorboard`, and removing manylinux builds for the Triton project.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
zou3519 | 38 | 8 | 4 | 82 |
williamwen42 | 61 | 11 | 2 | 34 |
mikaylagawarecki | 80 | 10 | 1 | 15 |
malfet | 38 | 14 | 1 | 35 |
jansel | 23 | 5 | 0 | 57 |
BoyuanFeng | 80 | 3 | 0 | 2 |
bobrenjc93 | 47 | 14 | 0 | 16 |
clee2000 | 62 | 8 | 3 | 4 |
oulgen | 52 | 11 | 0 | 7 |
Skylion007 | 10 | 7 | 2 | 50 |