Weekly GitHub Report for PyTorch: February 24, 2025 - March 03, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default behavior of `torch.load`.
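As a quick illustration of two of these changes, the hedged sketch below shows `torch.compiler.set_stance` switching a compiled function back to eager execution and the stricter `torch.load` default; the specific stance name and the `weights_only=True` default are assumptions drawn from the 2.6 release notes rather than details taken from this report.

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + x

# Assumed 2.6 API: tell the compiler to skip compilation and run eagerly.
torch.compiler.set_stance("force_eager")
print(f(torch.randn(4)))
torch.compiler.set_stance("default")

# torch.load is assumed to default to weights_only=True in 2.6: plain tensors
# and state dicts still load, but arbitrary pickled objects need an explicit
# (and less safe) opt-out.
torch.save({"w": torch.randn(2, 2)}, "ckpt.pt")
safe_state = torch.load("ckpt.pt")                        # new default behavior
legacy_state = torch.load("ckpt.pt", weights_only=False)  # old behavior
```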
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL: This issue is a feature request for a stream management API in PyTorch's Process Group NCCL to address the challenges of asynchronous communication, which can lead to a "read-before-write" issue due to each NCCL process group operating on its own dedicated stream. The proposal suggests allowing users to define or set streams for process groups to ensure proper execution order of collective operations, thereby enabling better overlap with compute operations and improving performance in multi-data center training scenarios.
- The comments discuss related issues and potential solutions, including using existing APIs with user-defined streams to manage synchronization and execution order. Some contributors suggest that the current behavior is not a bug but a user responsibility to manage synchronization, while others propose changes to improve user experience and documentation. There is also mention of ongoing work to address these concerns in future updates.
- Number of comments this week: 15
- Checkpoint doesn't work with torch_function if torch_function change tensor metadata: This issue highlights a problem with the PyTorch checkpoint functionality when used in conjunction with `TorchFunctionMode`, specifically when `__torch_function__` alters tensor metadata, leading to a metadata mismatch error during the backward pass. The user provides a minimal reproducible example demonstrating the error and seeks advice on whether there is a way to make `__torch_function__` compatible with checkpointing.
- The comments discuss potential reasons for the issue, such as the violation of conditions due to metadata changes and the absence of `TorchFunctionMode` during recomputation. Suggestions include manually activating `TorchFunctionMode` within the checkpointed function and considering changes to checkpoint logic to detect and re-enable modes during recomputation (a minimal sketch of this workaround appears after this list). There is also a discussion about the feasibility of implementing these changes and alternative approaches to avoid the problem.
- Number of comments this week: 9
- FlexAttention compiled has illegal memory access or device-side assert even though all tensors are contiguous: This issue involves a bug in the PyTorch library where the FlexAttention module, when compiled, encounters illegal memory access or device-side assertions despite all tensors being contiguous. The problem arises under specific parameter configurations, and while a workaround involving padding the `rel_bias` tensor can mitigate the illegal memory access, it does not resolve the device-side assertion issue.
- The comments discuss potential causes and solutions for the issue, including code modifications to address out-of-bounds errors and the need for conditional evaluation of the `score_mod` function. A workaround involving padding is mentioned, and the issue is acknowledged as a known problem with suggestions for further investigation.
- Number of comments this week: 6
- [inductor][cpu]AOT inductor AMP static shape default wrapper occupied almost 3x disk than before: This issue reports a significant increase in disk usage when using the AOT Inductor AMP static shape default wrapper, which is observed to occupy almost three times more disk space than before, specifically when running the ResNet50 model. The problem is suspected to be caused by a specific commit, and the user is seeking assistance to identify and address the root cause of this behavior.
- The comments discuss the need to identify the commit responsible for the increased disk usage, with suggestions to manually search for the guilty commit. A specific commit is identified as potentially responsible, and further investigation is requested from other contributors to understand the cause and determine if a fix is necessary, considering upcoming changes that might resolve the issue.
- Number of comments this week: 6
- No gradient for `residuals` in the return value of `torch.linalg.lstsq`: This issue highlights a concern with the `torch.linalg.lstsq` function in PyTorch, where the `residuals` in the return value do not have a gradient, unlike the `solution`. The user is questioning whether this behavior is expected and is considering contributing a pull request to address it by modifying the autograd functionality.
- The comments discuss the inefficiency of manually computing gradients for `residuals` and suggest that the current API design is flawed (see the sketch after this list for the manual recomputation workaround). A user proposes a code solution and asks for feedback on submitting a pull request, while another commenter advises against changing the API, suggesting that users implement their own solutions if needed.
- Number of comments this week: 5
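For the checkpointing issue above, the following hedged sketch illustrates the workaround discussed in the comments, namely re-entering the `TorchFunctionMode` inside the checkpointed function so it is also active during recomputation; the mode shown here is a trivial stand-in and does not reproduce the metadata-changing behavior from the actual report.

```python
import torch
from torch.overrides import TorchFunctionMode
from torch.utils.checkpoint import checkpoint

class PassthroughMode(TorchFunctionMode):
    """Trivial stand-in mode; the real issue involves a mode that edits metadata."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

def fn(x):
    # Workaround sketch: re-activate the mode inside the checkpointed region so
    # it is also present during the backward-pass recomputation.
    with PassthroughMode():
        return (x * 2).sin()

x = torch.randn(4, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False)
out.sum().backward()
print(x.grad)
```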
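And for the `torch.linalg.lstsq` issue, this minimal sketch shows the manual workaround mentioned in the comments: recomputing the residuals from the differentiable `solution` so gradients flow. The shapes are illustrative assumptions, not taken from the report.

```python
import torch

A = torch.randn(6, 3, requires_grad=True)
b = torch.randn(6, 2)

result = torch.linalg.lstsq(A, b)
# Per the issue, the `residuals` field of the result does not carry a gradient
# (and with some drivers it is not populated at all), so the workaround is to
# recompute the squared residuals from the differentiable `solution`.
residuals = ((A @ result.solution - b) ** 2).sum(dim=0)
residuals.sum().backward()
print(A.grad.shape)  # gradients now reach A through the recomputed residuals
```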
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 125
Summarized Issues:
- Compilation and Runtime Errors in PyTorch: Compilation and runtime errors are prevalent in PyTorch, affecting various functions and modules. Issues include failures in compiling functions like `flex_attention` and `Dropout` with `SequenceParallel`, and runtime errors due to improper handling of tensor types or device assignments. These errors often suggest trying different PyTorch versions or involve specific contributors for resolution.
- Bugs in PyTorch Functions and Modules: Various bugs are reported in PyTorch functions and modules, such as incorrect gradient calculations, inconsistent outputs, and failures in specific operations. These bugs affect functions like `torch._check`, `torch.distributed.context_parallel`, and `torch.nn.functional.hardswish`, often requiring workarounds or patches to address the issues.
- ONNX Export and Dynamic Shape Issues: Exporting PyTorch models to ONNX format often encounters issues, particularly with dynamic shapes and unsupported operations. Problems include errors with slicing operations on complex tensors and failures due to unsupported operators, necessitating workarounds or downgrades to previous PyTorch versions.
- Performance Discrepancies and Regressions: Performance discrepancies and regressions are noted in PyTorch, affecting operations like matrix multiplication and data parallel training. These issues highlight slower performance in certain configurations or backends, prompting investigations into optimization and efficiency improvements.
- Inductor and Backend Inconsistencies: Inconsistencies between the Inductor backend and eager execution mode are reported, affecting functions like `torch.slice_scatter` and `torch.cdist`. These inconsistencies often lead to assertion errors or incorrect outputs, requiring adjustments to ensure consistent behavior across backends (a minimal reproduction pattern is sketched after this list).
- Dynamic Control Flow and Compilation Errors: Dynamic control flow and compilation errors are prevalent in PyTorch, particularly when using `torch.compile` with complex models. These errors often involve unsupported operations or internal assertion failures, necessitating debugging and potential code modifications.
- Documentation and Usability Improvements: Several issues highlight the need for documentation and usability improvements in PyTorch, such as clarifying function arguments and enhancing profiling capabilities. These improvements aim to provide clearer guidance and better support for users, particularly in distributed and performance-critical scenarios.
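For the backend inconsistency category above, the sketch below shows the typical pattern used to reproduce and report an eager-versus-Inductor mismatch; the use of `torch.cdist` is only an example drawn from the grouping, not a confirmed failing case.

```python
import torch

def fn(x, y):
    return torch.cdist(x, y)

x, y = torch.randn(8, 3), torch.randn(5, 3)

eager_out = fn(x, y)
compiled_out = torch.compile(fn, backend="inductor")(x, y)

# Reports of this kind usually boil down to a failing closeness check between
# the two backends; a passing check means the behavior is consistent.
torch.testing.assert_close(eager_out, compiled_out)
print("eager and inductor outputs match")
```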
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 35
Summarized Issues:
- Triton Integration Issues: The integration of the Triton library with PyTorch has led to multiple issues, including widespread unit test failures due to an AttributeError from an API deprecation and memory access violations on the ROCm platform. These problems highlight the challenges of maintaining compatibility across different platforms and the need for careful management of library updates.
- PyTorch Backend and API Bugs: Several bugs have been identified in PyTorch's backend and API, such as crashes on Apple's MPS backend with non-contiguous tensors and issues with the `OffsetBasedRNGTracker` being limited to CUDA. These bugs indicate the need for more robust backend support and flexible API design to accommodate various hardware and use cases.
- FSDP and GradScaler Issues: The Fully Sharded Data Parallel (FSDP) module and GradScaler have encountered issues, such as index errors when called with zero arguments and non-functioning mixed precision on Intel ARC A770 GPUs. These issues suggest the need for more flexible argument handling and better support for diverse hardware configurations.
- API and Export Functionality: There are issues with PyTorch's API and export functionality, including the need for improved constant marking in `torch.export` and problems with `torch._dynamo.mark_dynamic` causing compilation errors (a minimal usage sketch appears after this list). These highlight the importance of clear API documentation and robust export mechanisms to prevent errors during model deployment.
- Compilation and Build Errors: Compilation and build errors have been reported, such as assertion errors with `torch.norm` and build failures with CUDA 12.6 and 12.8 on Amazon Linux. These issues underscore the challenges of maintaining compatibility with evolving compiler and library versions.
- Performance and Optimization Issues: Performance issues have been identified, such as slowdowns in templated GEMMs due to memory allocator lock contention and the inability to use GEMM templates in the LLaMA model. These issues highlight the need for efficient memory management and support for advanced quantization techniques to optimize performance.
- Testing and Error Handling: Various testing and error handling issues have been reported, including a `SubgraphLoweringException` during inductor compilation and a `ResourceWarning` from the `tempfile` module. These issues emphasize the importance of comprehensive testing and error handling to ensure reliable software behavior.
- Operator and Mode Conflicts: Conflicts have arisen from operator registration and mode handling, such as a `RuntimeError` from operator name conflicts and incorrect mode mutation in `auto_functionalization`. These issues suggest the need for careful management of operator namespaces and mode handling to prevent conflicts.
- Documentation and Compliance: Documentation errors and compliance issues have been identified, such as a mistake in the `replace_pattern` docstring and the need for FIPS compliance in hashlib usage. These highlight the importance of accurate documentation and adherence to security standards.
- Model and Inference Issues: Model inference issues have been reported, such as the Detection Transformer failing with a batch size of 1 and non-contiguous outputs causing errors in Triton kernels. These issues indicate the need for robust model handling and inference support across different configurations.
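As referenced in the export-related grouping above, here is a hedged sketch of how `torch._dynamo.mark_dynamic` is typically used; the function and input are placeholders and do not correspond to the code in the closed issue.

```python
import torch
import torch._dynamo

def fn(x):
    return x.sum(dim=0)

x = torch.randn(8, 16)

# Ask the compiler to treat dimension 0 as dynamic rather than specializing on
# the observed size of 8; the closed issue concerned compilation errors that
# this marking could trigger in some models.
torch._dynamo.mark_dynamic(x, 0)
out = torch.compile(fn)(x)
print(out.shape)
```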
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in the 'Other Pull Requests' section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 169
Key Open Pull Requests
1. [test] Linter docker image: This pull request is focused on testing a linter Docker image for the PyTorch project, as indicated by the title "[test] Linter docker image," and it aims to address a specific issue referenced as #ISSUE_NUMBER, although it has not yet been merged.
- URL: pull/147789
- Merged: No
- Associated Commits: 4545e, ff328, c3f77, c65fb, 28551, 5488f, d143c, eb7af, dc5d2, 3f852, 029b1, 8e0c1, d9995, 2a3d9, a1f32, cfec8, 111b7, 821f8, d698d, 7a52c, a95d2, 14bcd, e92fb, cf903, 15d46, 1ce32, d6df9, 4fcbc, 097fd, f491b, 74240
2. Add note to get start xpu: This pull request aims to add a note to the PyTorch documentation to inform users that installing PyTorch from binaries will automatically include Intel® Deep Learning Essentials runtime packages, which could lead to environment issues if oneAPI is activated in a standalone installation, and thus advises users to avoid this situation.
- URL: pull/148168
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 055f5
3. [pytree] add another simplified pytree module `torch.pytree`: This pull request introduces a new simplified module `torch.pytree` to the PyTorch library, which aligns more closely with the JAX pytree API by removing the `tree_` prefix from its functions and reversing the argument order of the `unflatten` function to enhance compatibility with `functools.partial`, while ensuring no backward compatibility issues since it is a completely new module.
- URL: pull/148180
- Merged: No
- Associated Commits: 3ae64, 51232, c5f43, c3620, 08aff, 4e773, 308f6, 00a64, f6206, c93a2, 4b8ce, b8ead, c7e5b, ef3ad, 0518f, 7c8ae, a5d05
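Since the proposed `torch.pytree` module is new and not yet merged, the sketch below uses the existing private `torch.utils._pytree` helpers to show the kind of flatten/unflatten round trip the pull request intends to simplify; the public names and argument order described for `torch.pytree` come from the PR description and remain subject to change.

```python
import torch
from torch.utils import _pytree as pytree  # existing private module

params = {"w": torch.ones(2), "layers": [torch.zeros(3), torch.ones(1)]}

# Today: tree_-prefixed names, with unflatten taking (leaves, spec).
leaves, spec = pytree.tree_flatten(params)
rebuilt = pytree.tree_unflatten([leaf * 2 for leaf in leaves], spec)
print(rebuilt["w"])

# The PR proposes torch.pytree.flatten / torch.pytree.unflatten, dropping the
# tree_ prefix and reversing unflatten's arguments to (spec, leaves) so it
# composes with functools.partial, JAX-style.
```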
Other Open Pull Requests
- Re-merge of Previously Reverted Pull Request: This pull request is a re-merge of a previously reverted pull request (#144974) in the PyTorch project. It involves recreating the branch and updating several commits, as indicated by the multiple "Update" and "base update" commit messages.
- Configurable Record and Storage Alignment in torch.save: This pull request aims to make the record and storage alignment in the `torch.save` function configurable. It allows users to adjust how data is stored and aligned, as part of a series of changes tracked through the ghstack tool.
- Eager Then Compile Stance: This pull request introduces a new `eager_then_compile` stance through the set_stance APIs. It aims to reduce compile times and improve the ergonomics of using dynamic shapes by initially running in eager mode before compiling.
- Support for contextlib.suppress: This pull request aims to add support for the `contextlib.suppress` feature in the PyTorch project. It is part of a series of related changes tracked by the ghstack tool.
- Enhancements in FakeTensorMode: This pull request aims to enhance the PyTorch library by adding information about checkpoint offsets to untyped storages when using the `torch.load` function under `FakeTensorMode`. It is part of a series of updates tracked through the ghstack tool.
- Threading boxed_forward_device_index: This pull request addresses the issue of correctly threading the `boxed_forward_device_index` from `graph_kwargs` to `CompiledFXGraph.post_compile`. It ensures accurate updates of `BoxedDeviceIndex` from cache hits.
- Performance Enhancements in Symbolic Tracing: This pull request aims to enhance the performance of the PyTorch project by moving the `Node._prepend` and `Node._remove_from_list` methods to C++. It results in a significant reduction in function calls and execution time during microbenchmarking of symbolic tracing operations.
- Loading FakeTensors with Correct Devices: This pull request addresses the issue of loading FakeTensors with the correct devices under FakeTensorMode in PyTorch. It fixes the functions _rebuild_tensor_v2 and _rebuild_tensor_v3 as part of a series of related changes tracked through ghstack.
- torch.serialization.skip_data Compatibility: This pull request aims to enhance the PyTorch library by enabling the `torch.serialization.skip_data` functionality to be compatible with the `torch.load` method (a brief usage sketch appears after this list). It is part of a series of updates tracked through the ghstack tool.
- Propagation Strategy in Inductor Pattern Matcher: This pull request proposes a change in the propagation strategy of `arg_kwarg_vals` within the PyTorch Inductor Pattern Matcher. It aims to trace replacement graphs using `arg_kwarg_vals` instead of `node.meta['val']`.
- needs_exact_strides Operator Tag: This pull request introduces a new operator tag, "needs_exact_strides," for the Inductor component in the PyTorch project. It enforces exact strides on custom operators, with plans to make this behavior the default in a subsequent update.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: This pull request introduces a new templatized CUDA kernel designed to replace three existing non-ROCm CUDA kernels for the GammaBeta backwards pass. It addresses performance issues by optimizing for warp shuffles, coalesced loads, and parallelism across the `M` dimension.
- Regression Issue in evaluate_expr Function: This pull request addresses a regression in the `evaluate_expr` function, which was caused by a previously added logging argument that disrupted cache lookups. It refactors the code to eliminate the use of `expr_sym_node_id` in cache lookups.
- Renaming arg_kwarg_vals Key for Clarity: This pull request proposes renaming the key `node.meta["arg_kwarg_vals"]` to `node.meta["orig_arg_kwarg_vals"]` in the codebase to improve clarity. It is accompanied by an explanatory comment to prevent confusion.
- Optimization by Removing Unnecessary Operations: This pull request aims to optimize the PyTorch codebase by removing unnecessary tensor clone operations and redundant variable calls. It also adds argument comments for clarity.
- Selective Build at Optimization Level O1: This pull request aims to selectively build code at optimization level O1 as part of a series of changes. It includes tasks like porting improvements to `cpp_wrapper` mode and addressing CMake packaging issues.
- Support for Rowwise Scaling in Scaled GEMM: This pull request introduces support for rowwise scaling in scaled GEMM operations. It includes various enhancements such as fixes for offline tuning and updates to online unit tests.
- Refactoring Sharding Propagation Mechanism: This pull request refactors the sharding propagation mechanism in the PyTorch project to handle cross mesh computations. It moves the same mesh check from the sharding propagation level to each individual operator level.
- Registration Process for at::_weight_int4pack_mm_with_scale_and_zeros: This pull request aims to facilitate the registration process for the `at::_weight_int4pack_mm_with_scale_and_zeros` function in the PyTorch project. It is part of a stack of changes managed through the ghstack tool.
- Stride Consistency in while_loop Operation: This pull request proposes a change to the PyTorch project, specifically requiring that the stride in a `while_loop` operation remain consistent with the input for the `body_fn` function. It is detailed in the commits and discussed among several contributors.
- Preserving Torch Function Mode Stack During Recompute: This pull request addresses the issue of preserving the torch function mode stack during recompute in `torch.utils.checkpoint`. It ensures that the TorchFunctionModeTLS remains active even when `.backward()` is called.
- Utility Function for Normalizing Args and Kwargs: This pull request introduces a new utility function, `torch._library.utils.normalize_args_kwargs`, which standardizes the (args, kwargs) to align with the PyTorch dispatcher calling convention. It includes new tests to validate this functionality.
- Support for XPU Device in LayerNormKernel: This pull request aims to enhance the PyTorch project by adding support for the XPU device to the LayerNormKernel. It is in collaboration with the Intel torch-xpu-ops project.
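Related to the `torch.serialization.skip_data` item above, the following hedged sketch shows the save-side use of the context manager that the pull request aims to make compatible with `torch.load`; the availability and exact behavior of `skip_data` in your installed version are assumptions, and the load side is intentionally not demonstrated since that is what the PR adds.

```python
import torch

state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}

# Assumed existing API: write the checkpoint structure without the actual
# storage bytes, leaving the data to be filled in separately.
with torch.serialization.skip_data():
    torch.save(state_dict, "skeleton.pt")

# Loading such a "skeleton" checkpoint is what the open pull request aims to
# support.
```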
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in the 'Other Pull Requests' section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 233
Key Closed Pull Requests
1. Make Tensor.set_ validate storage_offset when sizes/strides are unchanged: This pull request aims to enhance the `Tensor.set_` function in the PyTorch library by adding validation for the `storage_offset` parameter when the sizes and strides of a tensor remain unchanged, ensuring more robust and error-free tensor operations.
- URL: pull/147354
- Merged: No
- Associated Commits: d5583, 9677c, 56c4c, 2db7e, 649dd, 439a6, 5eb7f, 4d08c, 6be5a, 8e4de, fa37c, b73cb, 5b7f8, 38f65, 72410
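To make the scope of that change concrete, here is a small hedged sketch of the `Tensor.set_` call pattern the validation applies to; the shapes, offset, and use of an untyped storage are illustrative assumptions rather than code from the pull request.

```python
import torch

src = torch.arange(10, dtype=torch.float32)
view = torch.empty(0)

# set_(source, storage_offset, size, stride): point `view` at an existing
# storage starting two elements in. The closed PR adds validation of the
# storage_offset even when sizes and strides are left unchanged.
view.set_(src.untyped_storage(), 2, (4,), (1,))
print(view)  # elements 2..5 of the original buffer
```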
2. Validate inputs to _nested_view_from_buffer to prevent overflows: This pull request aims to enhance the robustness of the PyTorch codebase by validating inputs to the `_nested_view_from_buffer` function to prevent potential buffer overflow issues, as evidenced by multiple updates and commits addressing this concern.
- URL: pull/147356
- Merged: No
- Associated Commits: 0aae8, 9856c, af85a, e5e15, 420f1, 9ce79, b6578, 1a53d, b604b, f54f4, fef6d, c07cd, f3c46, 119c3
3. torch._scaled_mm with MXFP8: This pull request introduces blockwise MXFP8 support to the `torch._scaled_mm` function for devices with CUDA capability 10.0 and higher, enabling dispatch to a blockwise kernel from cuBLAS when the scales for matrices A and B are of dtype `torch.float8_e8m0fnu`, and includes tests for basic functionality, such as numerics of simple matrices and end-to-end quantization with GEMM, while noting that MXFP4 support will be addressed in a future update.
- URL: pull/147548
- Merged: No
- Associated Commits: 503dd, 0c9c9, 61149, 9478e, d050b, be8f6, a8d24, 49752, 2079b, abcb3, 94933, ed495, b4324
Other Closed Pull Requests
- Error Message Improvements: Several pull requests focus on enhancing error messages in the PyTorch project. These include improving the readability of graph break error messages and adding explicit error messages for small tensor sizes in the FlexAttention module. Additionally, there are efforts to introduce generic graph break hints and context manager debug information, although some of these changes were not merged.
- Graph and Tensor Manipulations: Various pull requests address issues related to graph and tensor manipulations in PyTorch. These include fixing a bug in the "reshape -> scaled mm -> reshape" pattern, optimizing the lifting of ID_GUARDED tensors, and extracting `codegen_unbacked_symbol_defs` for handling subgraphs with unbacked symbolic integers.
- Performance Optimizations: Several pull requests focus on optimizing performance in the PyTorch project. These include implementing the `masked_fill_scalar` function as a shader for MPS, updating `ck_conv_template` code generation for ROCm CK kernels, and optimizing outer loop fusion heuristics.
- Bug Fixes and Enhancements: Multiple pull requests address various bugs and enhancements in the PyTorch project. These include fixing a crash issue with `torch._C.ScriptFunction`, addressing a broken int8 WoQ GEMM AMX implementation, and resolving an overflow issue in the `checkInBoundsForStorage` function.
- Unmerged and Experimental Changes: Some pull requests involve unmerged or experimental changes. These include introducing a sourceless builder for `types.MethodType`, enabling SDPA functionality on Intel GPUs, and testing penetration attempts, none of which were intended for merging.
- Module and API Adjustments: A few pull requests focus on module and API adjustments. These include fixing the register constant functionality, ensuring proper export of classes from `torch.utils.tensorboard`, and removing manylinux builds for the Triton project.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
zou3519 | 38 | 8 | 4 | 82 |
williamwen42 | 61 | 11 | 2 | 34 |
mikaylagawarecki | 80 | 10 | 1 | 15 |
malfet | 38 | 14 | 1 | 35 |
jansel | 23 | 5 | 0 | 57 |
BoyuanFeng | 80 | 3 | 0 | 2 |
bobrenjc93 | 47 | 14 | 0 | 16 |
clee2000 | 62 | 8 | 3 | 4 |
oulgen | 52 | 11 | 0 | 7 |
Skylion007 | 10 | 7 | 2 | 50 |