Weekly GitHub Report for PyTorch: June 02, 2025 - June 09, 2025 (12:03:14)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notably, FP16 support is now available on X86 CPUs, and the release marks a shift away from publishing on Conda, directing users to alternative package sources. Additionally, the release includes a backward-compatibility-breaking change to the default value of the `weights_only` parameter in `torch.load`, and experimental Linux binaries are now built with CXX11_ABI=1 using the Manylinux 2.28 platform.
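For readers updating their load paths, here is a minimal sketch of what the `weights_only` default change implies in practice; the checkpoint file name and toy model are illustrative assumptions, not taken from the release notes.

```python
import torch

# Save a plain state_dict (tensors only) -- this remains loadable under the
# new weights_only=True default in torch.load.
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "checkpoint.pt")

# In PyTorch 2.6 the default flipped to weights_only=True, which uses a
# restricted unpickler that only reconstructs tensors and other
# allow-listed types.
state = torch.load("checkpoint.pt")  # equivalent to weights_only=True

# Checkpoints containing arbitrary pickled Python objects now need the old
# behavior requested explicitly (only do this for trusted files):
# state = torch.load("checkpoint.pt", weights_only=False)
model.load_state_dict(state)
```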
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Cannot build docs via `make html`: This issue is about a user encountering an error when attempting to build the PyTorch documentation using the `make html` command, as specified in the project's README file. The error is related to an unsupported dispatch key, "NestedTensorHPU," which causes the build process to fail on their macOS system.
  - The comments discuss various attempts to resolve the issue, including trying different versions of PyTorch and using nightly builds, but these attempts are met with errors related to version compatibility and platform support. Some users suggest that the documentation build process is more suited for Linux, and there are concerns about the feasibility of building on Intel Macs due to deprecation. Others share their successful setups on different environments, while some express frustration over the lack of support for macOS x86 builds.
  - Number of comments this week: 11
- [FSDP2] Slower Convergence with fully_shard() Compared to DDP during Qwen2-VL Fine-Tuning: This issue discusses the slower convergence observed when using the `fully_shard()` method compared to Distributed Data Parallel (DDP) during the fine-tuning of the Qwen2-VL model. The user has provided a script and detailed observations, including validation loss rates, and is seeking insights or similar experiences from the community to understand the cause of this performance difference.
  - The comments involve a detailed discussion about potential causes for the slower convergence, including requests for loss curves, questions about checkpoint loading, and known issues with gradient clipping. The user provides additional data and experiments, including loss values for different configurations, and eventually a solution is proposed that aligns the loss values between FSDP2 and DDP. The conversation also touches on issues with activation checkpointing, leading to the creation of a separate issue for further investigation.
  - Number of comments this week: 10
- [FR] Expose CUDAGraph handle to allow customized modification on the graph: This issue is a feature request to expose the `cudaGraph_t` handle in PyTorch, allowing users to make customized modifications to CUDA graphs, which is currently not possible due to the encapsulation of these attributes in the PyTorch library. The request highlights the need to access specific CUDA graph parameters and edges for further editing, which is crucial for advanced CUDA operations (a sketch of the existing capture API follows this list).
  - The comments discuss the feasibility of accessing `cudaGraph_t` during or after stream capture, with suggestions on using cuda-python and the torch C++ API for modifications. A draft PR is mentioned, and there is a discussion on injecting host nodes into CUDA graphs, with explanations on using `cudaLaunchHostFunc` and its limitations. The conversation also touches on specific use cases like mixture of experts and dynamic behaviors in GPU operations.
  - Number of comments this week: 10
- documentation of Adafactor does not match the implementation: This issue highlights a discrepancy between the documentation and implementation of the Adafactor optimizer in PyTorch: the documentation and the original Adafactor paper suggest using the sum of the squared gradients, but the actual implementation uses the mean. The user questions why this difference is not documented or discussed, seeking clarification on the rationale behind this choice.
  - The comments discuss the equivalence of using a mean versus a sum in the implementation, noting that a mean is a scaled sum and advantageous for numerical representation. There is a discussion about the implications of using a mean, particularly regarding block-wise learning rates, and a suggestion to document this difference for clarity. A contributor expresses willingness to review a pull request to add this explanation, and another user volunteers to work on it.
  - Number of comments this week: 7
- don't require recompiles when switching between torch.Tensor vs AsyncCollectiveTensor graph inputs: This issue addresses the need to avoid recompiling when switching between `torch.Tensor` and `AsyncCollectiveTensor` as graph inputs in PyTorch, particularly when the input to a compiled region changes from a regular tensor to one that requires synchronization due to asynchronous collective operations. The proposed solutions include automatically inserting `wait_tensor()` calls to handle synchronization, with options for user configuration and default behaviors to balance performance and usability.
  - The comments discuss the challenges of handling `AsyncCollectiveTensor` in compiled graphs, emphasizing the need for synchronization to prevent recompilation. Contributors suggest various solutions, including unconditional insertion of `wait_tensor()`, and consider the implications of graph breaks and asynchronous operations. There is a consensus on the importance of supporting different use cases, such as those involving eager and compiled modes, to ensure flexibility and performance.
  - Number of comments this week: 6
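For context on the CUDAGraph feature request above, the sketch below shows the existing high-level capture API, which keeps the underlying `cudaGraph_t` handle internal to `torch.cuda.CUDAGraph`. The toy model, shapes, and warm-up loop are illustrative assumptions, and a CUDA device is required.

```python
import torch

assert torch.cuda.is_available(), "CUDA graph capture requires a GPU"

model = torch.nn.Linear(64, 64).cuda()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture (recommended practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a CUDA graph. The underlying cudaGraph_t
# handle is managed internally here, which is what the feature request
# asks to expose for custom graph editing.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay with new data written into the static input buffer.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_output.shape)
```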
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment using the 'inductor' backend. The problem arises during the execution of a Python script that utilizes the OotdPipeline and involves compiling specific components with Torch's compile function, and it has been open for over 434 days without resolution.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing MaxPool2D in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method (a sketch of the equivalence follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when executing a compiled model in a Docker environment with a `tmpfs` permission set to `1777`, where execution of the cached `cuda_utils.so` file in the `/tmp` directory fails due to a missing execution bit, despite the directories having the correct permissions. The error occurs specifically when using PyTorch's `torch.compile` function, and the problem is highlighted by an `ImportError` indicating a failure to map a segment from the shared object, which is critical for the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues in certain files before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this request is to reduce the file size for deployment, especially on mobile devices, as demonstrated by the user's experience where removing these files manually resulted in a substantial reduction in the archive size without affecting the model's functionality.
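To make the MaxPool2D proposal above concrete, here is a small sketch (with an arbitrary input shape chosen for illustration) checking that a 5x5 max pool with stride 1 produces the same result as two stacked 3x3 max pools with stride 1.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# One 5x5 max pool with stride 1.
pool5 = nn.MaxPool2d(kernel_size=5, stride=1)

# Two stacked 3x3 max pools with stride 1 cover the same 5x5 window,
# since the composed receptive field offsets span 0..4 in each dimension.
pool3 = nn.Sequential(
    nn.MaxPool2d(kernel_size=3, stride=1),
    nn.MaxPool2d(kernel_size=3, stride=1),
)

out5 = pool5(x)
out3 = pool3(x)
print(out5.shape == out3.shape)  # True: both reduce H and W by 4
print(torch.equal(out5, out3))   # True: identical results when stride == 1
```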
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 137
Summarized Issues:
- PyTorch Autograd and In-Place Operations: This topic covers issues related to bugs in PyTorch's autograd function where in-place operations are incorrectly reordered, leading to unexpected behavior during forward and backward passes. These issues persist even with custom operations, causing incorrect gradient applications and runtime errors due to size mismatches.
- Torch Inductor and Cudagraph Tree Bugs: This topic involves bugs in the PyTorch project where the torch inductor cudagraph tree may incorrectly release input nodes during replay, causing errors when these inputs are also used as outputs. This is due to the release of tensors and storage making path_weakrefs invalid.
- Torchrun Signal Handling Enhancements: This topic addresses the enhancement of the `torchrun` utility to handle `SIGUSR1` and `SIGUSR2` signals, which are commonly used in SLURM environments to indicate imminent job preemption. The enhancement aims to add support for these signals similar to how other signals like `SIGINT` and `SIGTERM` are managed.
- PyTorch Documentation Conversion to MyST Markdown: This topic involves converting several `.rst` documentation files in the PyTorch project to MyST markdown format. The conversion process ensures that the documentation tests pass and follows specific procedures for file conversion, local documentation building, and PR submission to maintain the git history and ensure the converted pages render correctly.
  - issues/155013, issues/155014, issues/155015, issues/155016, issues/155018, issues/155019, issues/155020, issues/155021, issues/155022, issues/155023, issues/155024, issues/155025, issues/155026, issues/155027, issues/155028, issues/155029, issues/155030, issues/155031, issues/155032, issues/155033, issues/155034, issues/155035, issues/155036, issues/155037, issues/155038, issues/155039, issues/155040
- PyTorch MPS Backend Bugs: This topic covers various bugs in the PyTorch library related to the Metal Performance Shaders (MPS) backend. These include failures in operations like `cumsum`, `max_pool2d_with_indices`, and `addmm` for non-float input types, leading to runtime errors and assertion failures.
- PyTorch Compilation and Graph Recompilation Issues: This topic involves issues with PyTorch's `torch.compile` function, where unnecessary graph recompilations occur due to nearly identical graph traces being generated with minor differences. This leads to significant time and memory overhead during training.
- PyTorch Profiler and CUDA Graphs: This topic addresses issues with the PyTorch profiler, including segmentation faults and crashes when using `torch.profiler.profile` with `ProfilerActivity.CUDA` (a minimal usage sketch follows this list). These issues are potentially due to nested profiler contexts and improper handling of the Global Interpreter Lock (GIL).
- PyTorch ROCm and AMD GPU Bugs: This topic covers bugs in the PyTorch ROCm distribution, including memory access faults and indexing issues on AMD GPUs. These bugs affect functions like `torch.cholesky_inverse` and lead to errors on specific GPU models.
- PyTorch Export and FakeTensor Issues: This topic involves issues with PyTorch's export functionality, where models exported using `torch.export.export` result in FakeTensors instead of regular Tensors. This is due to changes in model attributes during the export process, leading to unintended modifications.
- PyTorch Tensor Handling and Errors: This topic covers various issues related to tensor handling in PyTorch, including segmentation faults, runtime errors, and incorrect output when using functions like `torch.concat`, `torch.nonzero`, and `torch.histc`. These issues highlight the need for better error handling and consistency across different modes and backends.
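As a reference point for the profiler-related reports above, here is a minimal, non-nested profiler invocation. The toy model and iteration count are illustrative assumptions, and CUDA activity is only added when a GPU is available.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

# A single, non-nested profiler context; the crashes reported this week
# were observed with nested contexts and CUDA activity enabled.
with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```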
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 23
Summarized Issues:
- Older PyTorch Versions Compatibility with Newer CUDA: Users face challenges in using older PyTorch versions with newer CUDA versions required for the latest GPUs. This issue highlights the need for support in older PyTorch versions to ensure stability and compatibility while utilizing advanced hardware.
- MPS Backend Operation Support and Errors: Several issues have been identified with the MPS backend in PyTorch, including lack of support for certain operations and runtime errors. These issues have been addressed through recent updates and pull requests to improve functionality and compatibility.
- Triton and PyTorch Integration Issues: Problems have arisen from Triton's integration with PyTorch, including type inference failures and toolchain version mismatches. These issues affect the compilation process and test outcomes, necessitating updates and adjustments.
- Documentation Format Conversion: The PyTorch project is undergoing a conversion of documentation files from reStructuredText to MyST markdown. This process aims to maintain git history and ensure documentation tests pass while preserving visual consistency.
- PyTorch Model Integration with Tauri and Rust: There have been proposals to integrate PyTorch models into applications using Tauri and Rust. However, these requests were closed as duplicates or redirected to discussion forums for further exploration.
- Regression and Compatibility Issues in PyTorch Builds: Various regressions and compatibility issues have been reported in PyTorch builds, affecting model predictions and dependency management. These issues are often traced back to specific library versions or missing dependencies.
- Continuous Integration and Resource Management Challenges: The PyTorch project has faced challenges with CI queue times and resource management, particularly related to AWS storage limits and GitHub incidents. These issues have been addressed through scaling adjustments and incident resolutions.
- Import and Execution Errors in PyTorch: Users have encountered import errors and execution discrepancies in PyTorch, such as missing functions and improper error handling. These issues highlight the need for updates and clearer error messaging in newer versions.
- Tensor Operation and Graph Saving Bugs: Bugs have been reported in PyTorch related to tensor operations and graph saving, leading to incorrect outputs and missing information. These issues require fixes to ensure accurate and complete data handling.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 203
Key Open Pull Requests
1. DOC: Convert to markdown: torch.compiler_best_practices_for_backends.rst, torch.compiler_cudagraph_trees.rst, torch.compiler_custom_backends.rst, torch.compiler_dynamic_shapes.rst, torch.compiler_dynamo_deepdive.rst: This pull request involves converting several documentation files related to PyTorch's compiler features from reStructuredText (.rst) format to Markdown (.md) format, addressing formatting issues, updating syntax for headings, code blocks, links, and tables, and fixing a 404 error in the documentation.
- URL: pull/155137
- Merged: No
- Associated Commits: a77a3, 98423, 2670a, 3d0b8, d5b7a, a9c46, 19af0, 5b640, 73f8c, 6da82, 1df84, f0e14, 5ade4, ed25c, 3bf09, 953e7, 7ab66
2. Convert to markdown onnx rst: This pull request involves converting several ONNX-related documentation files from reStructuredText (rst) format to MyST markdown format, ensuring that the documentation tests pass, and addressing specific files such as `onnx_dynamo_onnxruntime_backend.rst`, `onnx_dynamo.rst`, and `onnx_ops.rst`, while noting that `onnx_torchscript_supported_aten_ops.rst` remains unchanged due to its autogenerated nature.
- URL: pull/155228
- Merged: No
- Associated Commits: 11994, 7700e, 2a0e6, 2ce77, e1f07, 8cf7a, 8de50, 2660e, 317c2, 6fbae, c3350, 044e4, ba078, 16d62, 957b7, 792a0
3. distributions/constraints type annotations + public classes + some refactoring: This pull request involves making certain classes and constraints public, adding additional constraint subclasses, refactoring type annotations, and updating method signatures and documentation within the PyTorch project, as part of ongoing efforts to address issues #144196 and #144219, and as an alternative to a previous pull request #154711.
- URL: pull/154827
- Merged: No
- Associated Commits: 84da6, 5e0da, 322e8, c3ca8, cc6e2, c3e6a, 0e667, de02f, 23f6c, 6f59f, d515a, ee7ea, e9d62, 2e90a, 0feaa
Other Open Pull Requests
- Testing and Validation Workflows: This topic includes pull requests focused on testing and validation workflows in the PyTorch project. One pull request is dedicated to testing the mi300 workflows on a Vultr cluster, marked as "do not merge" for experimental purposes. Another pull request implements a "keep going" feature in the testing process to ensure test jobs continue running even after encountering errors.
- Documentation Format Conversion: Several pull requests address the conversion of documentation files from reStructuredText (RST) to Markdown format in the PyTorch project. These conversions are part of addressing specific issues and include updates and suggestions from code reviews.
- Memory Management and Access: Pull requests in this category focus on memory management and access issues within the PyTorch project. One pull request addresses a device-side memory access fault by correcting tensor life management, while another fixes illegal memory access in the attention mechanism.
- Enhancements and Features: This topic covers pull requests introducing new features and enhancements to the PyTorch project. These include adding unit tests for the memory-related API of `torch.accelerator`, integrating FP8 headers in the AOTI C++ wrapper, and implementing a `@deprecate` decorator for internal functions (a generic sketch of such a decorator follows this list).
- Build and Compatibility Updates: Pull requests in this category focus on build process updates and compatibility improvements. One pull request integrates support for building the Magma library with CUDA version 12.9, while another addresses incompatibility between the `--user` flag and Python virtual environments.
- Code and API Improvements: This topic includes pull requests aimed at improving code and API functionality in the PyTorch project. These include implementing `__eq__` and `__ne__` methods for the `dict` class and introducing a "dont constant fold" flag for FP8 quantization.
- Error Handling and Debugging: Pull requests in this category focus on error handling and debugging improvements. One pull request addresses an issue with the `torch.export` function failing to preserve Python `Enum` types, while another fixes an error during a ghstack commit involving multiple pull requests.
- Miscellaneous Enhancements: This topic covers various enhancements and updates in the PyTorch project. These include introducing guard filter helper functions for the Dynamo project and enhancing the export process of scripted functions by inlining the original callable.
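The `@deprecate` decorator mentioned above is internal to PyTorch; the following is a generic, hypothetical sketch of how such a decorator is typically structured, not the actual implementation from the pull request.

```python
import functools
import warnings


def deprecate(reason: str):
    """Hypothetical decorator that emits a FutureWarning when the wrapped
    function is called; a sketch, not PyTorch's internal implementation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__qualname__} is deprecated: {reason}",
                FutureWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecate("use new_helper() instead")  # new_helper is a made-up name
def old_helper(x):
    return x * 2


old_helper(3)  # emits a FutureWarning, then returns 6
```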
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 174
Key Closed Pull Requests
1. DOC: Convert to markdown: torch.overrides.rst, type_info.rst, utils.rst, xpu.rst: This pull request involves converting several documentation files from reStructuredText (.rst) format to MyST Markdown (.md) format, specifically for the files torch.overrides, type_info, utils, and xpu, while also addressing issues with tables and links, as part of fixing issue #155041.
- URL: pull/155088
- Merged: No
- Associated Commits: 0dd62, acdad, cc060, df006, d1ecb, e2ffb, ddf4e, c62fb, 7b4d5, ecc02, 13b33, 4707d, 970f3, d5953, c8d4b, 82eda, 952e3, 3ce2d, 10154, c9a92, aca49, 8f41b, 62d1c, 18ec4, 688bb, f44b0, 0108f, 7c418, 91401, 8a883, 65954, d7d57, 40ab5, bce46, e1d3b, d64e1, f19f1, 7f357, 35624, 5edc2, 3d788
2. Custom FX pass for inductor's backend registration: This pull request introduces an extension to Inductor's backend registration interface, allowing for the registration of custom FX passes, and includes multiple commits addressing various fixes and improvements such as linting, typing, and device checks, although it was ultimately not merged.
- URL: pull/154841
- Merged: No
- Associated Commits: f1473, 1f32f, a40ee, 962d5, c935d, baa2e, 8e40f, fbedb, 7c58b, 9f95d, 494ef, c3ae4, 4c785, fcf15, 848dd
3. Converting .rst files to .md: This pull request involves updating the documentation for several modules in the PyTorch project by converting files from the reStructuredText (.rst) format to Markdown (.md), specifically targeting the `torch.ao.ns._numeric_suite`, `torch.ao.ns._numeric_suite_fx`, AOTInductor, AOTInductor Minifier, and the `torch.compiler` API, with the aim of enhancing readability and usability.
- URL: pull/155273
- Merged: No
- Associated Commits: 1f385, 526e7, 6f125, d59a3, 0cac0, e0d9c, 631f6, c8f71, 664e4, abb85, 1ff7a, 6ff2a, 65df7, 11c56, c0e62
Other Closed Pull Requests
- Dynamic Shapes in Cutlass EVT: This topic covers the introduction and testing of dynamic shapes in the Cutlass EVT within the PyTorch project. The pull requests aimed to implement dynamic shapes using the fp8 format and add references to the dynamic shapes documentation, but they were ultimately not merged.
- Documentation Conversion to Markdown: Several pull requests focused on converting documentation files from reStructuredText (.rst) to Markdown (.md) format using MyST Markdown. These efforts aimed to improve consistency and readability, but some were not merged due to issues like branch contamination.
- Inductor and Triton Enhancements: Enhancements in the Inductor project included support for autotuning Triton kernels and updating unit tests for CUDA version compatibility. These changes aimed to improve performance and ensure compatibility with new Triton versions.
- Registry and Command Updates: New functionalities were introduced to update the registry via terminal commands, allowing users to add or modify "gb_type" properties. These updates aimed to streamline the process of managing registry entries.
- Bug Fixes in Inductor's FX Backend: Specific bugs in Inductor's FX backend were addressed, including issues with offset extraction and constant input values. Tests were added to expose and verify the fixes for these issues.
- 2D AllToAllv Operations: The introduction of a 2D AllToAllv shuffle operation aimed to facilitate efficient data distribution in distributed computing contexts. This operation transposes data from rank-major to expert-major order.
- Enhancements in PyTorch Functions and Templates: Enhancements included a new overload for the `randint_like` function and support for batch matrix multiplication in generated templates (see the snippet after this list for the existing APIs). These changes aimed to expand functionality and improve performance.
- Triton API Compatibility: A shim layer was implemented to ensure compatibility with Triton 3.4 by handling the removal of experimental TMA APIs. This change allowed the system to adapt to new API versions while maintaining functionality.
- Miscellaneous Enhancements and Fixes: Various pull requests addressed issues such as recompilation hints for integer attributes, fixing the TD indexer workflow, and enhancing type annotations in the `torch._logging` module. These changes aimed to improve functionality and code quality.
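For reference, the existing `randint_like` and batched matrix multiplication APIs touched by these changes can be exercised as below; the new overload and template changes themselves are not shown, and the shapes are arbitrary.

```python
import torch

# randint_like: integer tensor with the same shape/dtype/device as the reference.
ref = torch.empty(4, 3, dtype=torch.int64)
idx = torch.randint_like(ref, 10)  # values in [0, 10)

# Batch matrix multiplication: (B, n, m) @ (B, m, p) -> (B, n, p).
a = torch.randn(8, 4, 6)
b = torch.randn(8, 6, 5)
c = torch.bmm(a, b)

print(idx.shape, c.shape)  # torch.Size([4, 3]) torch.Size([8, 4, 5])
```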
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- [MPS][BE] Refactor round_decimals shader code to leverage new macro
  - Toxicity Score: 0.55 (Frustration, defensive responses, lack of consensus)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone is generally tense, with moments of constructive feedback overshadowed by misunderstandings and repeated requests for clarification. The conversation is marked by a lack of consensus and growing impatience, particularly from username1, who feels their concerns are not being adequately addressed.
-
- Toxicity Score: 0.55 (Frustration expressed, Defensive tone, Unresolved tension.)
- This GitHub conversation involves a series of interactions where username1 initially provides a solution, but username2 expresses frustration over its ineffectiveness. Username3 attempts to mediate by offering alternative suggestions, but username1's defensive tone escalates the tension. The conversation remains unresolved, with username2 and username1 exchanging terse comments.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| malfet | 150 | 24 | 8 | 135 |
| Skylion007 | 102 | 27 | 2 | 163 |
| bobrenjc93 | 230 | 37 | 7 | 20 |
| svekars | 60 | 4 | 32 | 102 |
| guilhermeleobas | 121 | 23 | 2 | 1 |
| anijain2305 | 108 | 20 | 1 | 11 |
| laithsakka | 80 | 21 | 4 | 30 |
| eellison | 59 | 17 | 2 | 45 |
| ngimel | 35 | 5 | 1 | 65 |
| cyyever | 71 | 23 | 0 | 9 |