Weekly GitHub Report for PyTorch: July 14, 2025 - July 21, 2025 (12:05:38)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for Python 3.13 with torch.compile, a new performance tuning feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks a shift away from publishing on Conda, with a focus on using Manylinux 2.28 for Linux binaries, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Transition Adam and maybe other optimizers to fused code path by default to avoid foreach=True-specific VRAM peak due to temp TensorList for bias-corrected moments: This issue discusses the proposal to transition the default code path of the Adam optimizer, and potentially other optimizers, to a fused code path to avoid a significant VRAM peak caused by the current foreach=True setting, which allocates a temporary TensorList for bias-corrected moments. The author suggests switching to fused=True to alleviate this memory allocation issue and inquires about the state of compiled optimizers and their ability to avoid extra allocations by integrating bias correction into the weight update rule; a minimal sketch of the fused setting appears after this list.
- The comments discuss plans to make fused the default setting, acknowledging unresolved performance issues, and suggest improvements to memory usage. There is a consensus on the potential benefits of using compiled optimizers and the need for benchmarks. The discussion also covers the compatibility of certain techniques with FSDP, the possibility of using torch.compile for optimizers, and the idea of including torch.compile-powered implementations in the core.
- Number of comments this week: 13
- received a FakeScriptObject input when dispatching DispatchKey.AutocastCUDA. but no python implementation is found: This issue involves a bug encountered when using custom operators with PyTorch's JIT Script Tracing and Autocast, specifically when a FakeScriptObject input is received during the dispatching of DispatchKey.AutocastCUDA, but no Python implementation is found. The user is seeking clarification on whether this is a known limitation and is looking for guidance on how to resolve the issue, as it affects the composition of custom operators with various PyTorch systems.
- The comments discuss whether the custom operator is registered via C++ or Python, with suggestions to use Python registration for better support. The user provides a reproducible setup and asks for advice on the best approach, while another participant confirms a small reproduction of the issue and suggests that the Python registration method is the correct path to pursue.
- Number of comments this week: 6
- torch 2.8 RC regression - part 2: This issue reports a regression in the PyTorch 2.8 RC version, specifically related to a test failure when running a script on an NVIDIA A10 GPU. The error is caused by a stricter aliasing check in the higher-order operations, which results in a graph break during the execution of a test case in the transformers library.
- The comments discuss the cause of the test failure, which is due to a stricter aliasing check introduced in PyTorch. A suggested fix is to clone the sliced output before returning it in the code. It is confirmed that this change should be made in the transformers library, and the issue is removed from the PyTorch 2.8.0 milestone as it is considered expected behavior.
- Number of comments this week: 5
- Remove tuple of ints as supported dtype for torch.max(): This issue highlights a discrepancy in the documentation for the torch.max() function, where it incorrectly states that the dim parameter supports a tuple of integers, which leads to an error when used. The issue suggests correcting the documentation by removing the mention of a tuple of integers as a supported type for the dim parameter; a short illustration appears after this list.
- A user offers to collaborate on the task, and another user assigns the issue to themselves. A discussion follows about the inconsistency between the documentation and the source code, with images provided for clarification. Finally, a pull request is submitted to address the documentation error.
- Number of comments this week: 4
- GPT-2 compiled model perf/accuracy: This issue describes a bug encountered when attempting to compile a GPT-2 model using PyTorch, following a tutorial by Karpathy, which results in a syntax error related to Metal shader compilation. The user provides links to their script and error logs, and speculates that the error might be due to a misunderstanding in the shader script regarding variable scope within a loop.
- The comments indicate that the issue has been fixed in a future PyTorch release, but there are concerns about silent correctness errors. The user reports that compiling from the main source resolves the compilation issue, although performance and accuracy are still not optimal.
- Number of comments this week: 3
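As context for the first issue above, here is a minimal sketch of the one-argument change under discussion (the toy model is an assumption for illustration, not code from the issue; the fused Adam path currently targets CUDA tensors):

```python
import torch

# Toy model used only to illustrate the optimizer flag.
model = torch.nn.Linear(1024, 1024).cuda()

# Current default multi-tensor path: foreach=True builds a temporary
# TensorList for the bias-corrected moments, which is the VRAM peak
# described in the issue.
opt_foreach = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

# Proposed default: the fused path updates parameters with a single fused
# CUDA kernel and avoids that temporary allocation.
opt_fused = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt_fused.step()
opt_fused.zero_grad()
```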
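And for the torch.max() documentation issue above, a short illustration of the behavior the fix documents (the two-dimensional example is an assumption, not taken from the issue):

```python
import torch

x = torch.randn(3, 4)

# Supported: dim as a single int returns (values, indices) along that dim.
values, indices = torch.max(x, dim=1)
print(values.shape, indices.shape)  # torch.Size([3]) torch.Size([3])

# Not supported, despite the old docstring: a tuple of ints raises a TypeError.
try:
    torch.max(x, dim=(0, 1))
except TypeError as err:
    print("tuple dim rejected:", err)
```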
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment using the 'inductor' backend. The problem arises in a Python script that configures a pipeline with specific model components and attempts to compile them using Torch's compile function, but fails due to the missing import, affecting the execution of the script on a system running Ubuntu 22.04.3 LTS with CUDA support.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for larger kernel sizes (a rough sketch of this equivalence follows this list). The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method.
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, cuda_utils.so, fails due to a missing execution permission, despite the script being executed with root privileges. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the absence of the execution bit on the cuda_utils.so file, which leads to an ImportError during the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the storage footprint of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files that can occupy a substantial portion of the model's total size.
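As a rough sketch of the equivalence claimed in the MaxPool2D proposal above (the padding choices are assumptions for illustration; PyTorch pads max pooling with negative infinity, so the chained result matches the single pool):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# One 5x5 max pool with stride 1.
pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

# Two chained 3x3 max pools with stride 1 cover the same 5x5 window,
# because a max of maxes over overlapping windows equals one max over
# their union.
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```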
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 64
Summarized Issues:
- Incorrect or NaN Gradients in Custom Attention Mask: This issue involves incorrect or NaN gradients for k_proj and v_proj when using a custom attention mask with compiled flex attention in PyTorch. The problem is not related to precision or hardware and persists across different data types and hardware configurations, similar to a previously reported issue.
- Compilation Output Discrepancies: Using torch.compile() on a transformer model with causal attention results in significantly different outputs compared to the uncompiled version. The first two elements of the model output have opposite signs relative to the eager output, indicating a potential problem in the compilation process affecting the model's behavior.
- Performance and Compatibility Issues: A customer experiences significantly prolonged execution times when running torchscripted PyTorch models on a Windows system with an NVIDIA 5090 GPU using libtorch 2.7.0. Additionally, a critical incompatibility arises when nesting torch.no_grad and torch.autocast contexts while using Fully Sharded Data Parallel (FSDP) with a param_dtype of torch.float32, leading to a runtime error during the backward pass.
- Build and Compilation Challenges: The significantly slower build times of Windows compilers compared to Linux are being investigated to identify the root cause and potential solutions. Additionally, a memory access fault occurs when the function at::_flash_attention_forward is called on an AMD Radeon RX 7900 XTX GPU, resulting in an error message indicating a page not present or supervisor privilege problem.
- Feature Requests and Enhancements: There is a request for updates and access to a design document regarding the support for Mixture of Experts (MoE) and Expert Parallelism (EP) in DTensor. Additionally, a proposal to enable the use of C++20's std::format in Torch and Kineto as an alternative to libfmt is made, requiring a conditional check similar to the implementation in spdlog.
- Logging and Debugging Issues: Starting from the PyTorch nightly build dated 20250630, logging and debugging are forcefully set to the INFO level in applications using PyTorch, such as reForge and ComfyUI, without an apparent way to disable it. This behavior was not present in previous versions.
- Incompatibility and Update Requirements: Updating the PyTorch Elastic Run to support the etcd v3 API as a rendezvous backend is necessary due to the current etcd backend's incompatibility with the latest etcd version 3.6.0. Additionally, upgrading the XNNPACK library in the PyTorch project is required as the current version is outdated, leading to compilation errors.
- Optimization and Efficiency Improvements: The _rename_without_collisions function within the placeholder_naming_pass of the torch/_export/utils.py module currently consumes over 35% of the total runtime for large FX graphs due to its O(n^2) complexity. A more efficient algorithm could significantly reduce export times.
- Memory and Performance Enhancements: A feature request for the memory visualization tool proposes implementing flexible time or range memory event filtering controls to prevent browser unresponsiveness. Additionally, transitioning the default code path for the Adam optimizer to a fused implementation is discussed to mitigate peak VRAM usage.
- Regression and Bug Fixes: A regression in the PyTorch 2.8 RC version causes a stricter aliasing check to fail during higher-order operation tracing. Additionally, a bug in the PyTorch project involves the while_loop implementation encountering errors during decomposition and inference.
- Internal Structure and Error Handling: Reorganizing the internal module structure of AOTAutograd in the PyTorch project is proposed to improve clarity and maintainability. Additionally, the full internal stack trace is displayed with the torch._dynamo.exc.BackendCompilerFailed error, suggesting the use of TORCHDYNAMO_VERBOSE=1 for additional context.
- Training and Model Support: Extending the AOTInductor to support model training is discussed, addressing challenges such as parameter representation and gradient accumulation. Additionally, a bug in the PyTorch project involves the @torch.compile decorator failing unexpectedly with an internal "tuple index out of range" error.
- Inconsistent Results and Numerical Discrepancies: The torch.absolute function produces inconsistent results for complex128 data types between CPU and CUDA. Additionally, a bug in the PyTorch library involves the torch.copysign function producing inconsistent results between CPU and CUDA for float16 data types.
- Functionality and Compatibility Issues: A bug in the PyTorch library involves the torch.logaddexp function producing inconsistent results for complex128 data types when executed on CPU versus CUDA. Additionally, a bug in the Inductor backend of PyTorch fails to compile models using the Tensor.copy_() method with a scalar source value.
- Terms of Service and Performance Optimization: The new Anaconda Terms of Service are causing continuous integration (CI) failures in the PyTorch project. Additionally, a proposal to introduce CUDA streams in the BatchLinearAlgebraLib.cpp file aims to address performance drop-offs in CUDA-accelerated batch operations.
- Documentation and Latency Issues: A documentation error in the PyTorch project involves the term "SourceBuilder" being mentioned incorrectly. Additionally, a discrepancy in latency when using torch.compile with torch.nn.functional.cross_entropy in different wrapping methods is identified as a user experience issue.
- Compiler and Warning Issues: A failure in the Inductor compiler occurs when attempting to compile models that utilize torch.randperm in conjunction with advanced indexing. Additionally, a warning is triggered in PyTorch 2.8 rc5 due to the use of deprecated logical operators 'and' and 'or' for non-scalar tensors in flex attention kernels.
- State Dict and Scalar Float Support: A bug in the PyTorch library involves the get_model_state_dict function returning inconsistent output keys before and after using the set_model_state_dict function. Additionally, the incorrect handling of scalar float support in the static CUDA launcher within the PyTorch project is addressed.
- Numerical Discrepancies and Conversion Issues: A numerical discrepancy is observed when using torch.compile in PyTorch, where the compiled model's output significantly deviates from the float64 eager execution baseline. Additionally, a small numerical discrepancy is observed when using the torch.compile function with the max-autotune-no-cudagraphs mode on CUDA. A minimal comparison sketch follows this list.
- Test Failures and Compilation Errors: The test test_einsum_to_pointwise in the TestKernelOptimization suite is disabled on ROCm platforms due to failures. Additionally, a bug in the PyTorch library involves compiling a model using the inductor backend, which combines 2D convolution and FFT operations, resulting in an internal stride validation error.
- Runtime Errors and Compilation Issues: A persistent RuntimeError in PyTorch 2.3.0 related to CUDAGraphs occurs during a module's forward pass. Additionally, a bug in PyTorch involves using torch.compile with an in-place operation like copy_() raising a RuntimeError related to gradient computation.
- CUDA and Compilation Errors: A bug in the PyTorch library involves the torch.cuda.make_graphed_callables function failing to correctly capture functions that include slice operations. Additionally, a bug in the PyTorch library involves attempting to compile a model using torch.compile with the torch.ops.higher_order.scan operation resulting in an UncapturedHigherOrderOpError.
- Regression and Build Errors: A regression in PyTorch 2.8 causes the async-TP compiler pass to throw an assertion error if any all-gather or reduce-scatter operation is performed on a ProcessGroup without symmetric-memory enabled. Additionally, a build error in the FBGEMM GPU/GenAI CI is related to the function c10::cuda::get_cuda_check_suffix() requiring more arguments.
- Gradient and Memory Issues: A bug in the PyTorch library involves the vmap function inadvertently disabling gradient calculation by setting requires_grad to False. Additionally, a memory leak in the AOTIModelPackageLoader component of the PyTorch project is reported, with a pull request submitted to address the likely cause.
- Scheduler and Documentation Errors: A bug in PyTorch 2.7.1 involves the _inductor.Scheduler leaving node mappings in an inconsistent state when the requires_grad parameter is modified. Additionally, a documentation error in the PyTorch project involves the RandomSampler documentation incorrectly stating the data_source parameter type.
- Function Crashes and Feature Enhancements: A bug in the PyTorch library involves the torch.nansum function crashing when applied to a torch.tensor of type complex32 on a CUDA device. Additionally, a feature enhancement is proposed for the PyTorch library, suggesting the addition of an explicit keyword argument set_grad_to_none=True to the optimizer.step() function.
- CUDA Support and Graph Compilation: Adding support for the checkPoolLiveAllocations feature in cudaMallocAsync within the PyTorch library is requested. Additionally, a problem with PyTorch's Dynamo graph compilation process involves the use of the Tensor.item() operation causing a graph compilation interruption.
- Inconsistent Outputs and Documentation Discrepancies: A problem with torch.compile() involves models containing dropout and in-place ReLU operations producing inconsistent outputs. Additionally, a discrepancy in the PyTorch documentation for the torch.max() function is highlighted, where it incorrectly states that the dim parameter supports a tuple of integers.
- Checkpointing and Export Issues: Extending the torch.util.checkpoint functionality in PyTorch to support offloading checkpointed tensors to CPU and optionally NVMe storage is proposed. Additionally, a bug is encountered when exporting a PyTorch model trained with mixed precision bfloat16 to ONNX format, where a non-learnable scalar tensor of dtype fp32 is incorrectly cast to complex128.
- Test Disabling and Optimization: The test_sum_all_cpu_float64 test within the TestReductionsCPU suite is disabled due to flaky failures. Additionally, optimizing the Fully Sharded Data Parallel (FSDP) process in PyTorch by fusing the all-reduce operation in fp32 with the dtype conversion to bf16 is discussed.
- Precision and Runtime Errors: A discrepancy in numerical precision between the compiled and eager execution of a simple rms_norm function in PyTorch is reported. Additionally, a RuntimeError in PyTorch version 2.5.1 involves an internal assertion failing in the profiling graph executor implementation.
- DataLoader and Gradient Computation: Enhancing the DataLoader in PyTorch to allow prefetching of batches across epochs is proposed. Additionally, a user seeks guidance on computing gradients in float32 precision while storing model weights in bfloat16 precision for fine-tuning a large model.
- In-Place Conversion and Compilation Issues: A feature request for implementing an in-place downcast dtype conversion in PyTorch is made. Additionally, a separate report raises the same Tensor.item() graph compilation interruption in Dynamo described above.
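Several of the discrepancy reports above come down to comparing a compiled function against an eager (often float64) baseline. The following is a minimal sketch of that style of check; the rms_norm workload and the tolerances are placeholders chosen for illustration, not taken from any specific issue:

```python
import torch

def rms_norm(x, eps=1e-6):
    # Stand-in workload; any small differentiable function would do.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

compiled = torch.compile(rms_norm)
x = torch.randn(64, 512)

ref = rms_norm(x.double())   # eager float64 reference
out = compiled(x).double()   # compiled float32 result, upcast for comparison

print("max abs error:", (out - ref).abs().max().item())
print("within tolerance:", torch.allclose(out, ref, rtol=1e-4, atol=1e-5))
```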
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 19
Summarized Issues:
- Discrepancies in CPU vs GPU Computations: Several issues highlight discrepancies in PyTorch computations between CPU and GPU, often due to differences in underlying libraries or precision errors. These include differences in the torch.det function, torch.nn.LogSoftmax, and torch.nn.functional.celu outputs, which have been attributed to variations in LAPACK and cuSOLVER operations, floating-point errors, and precision issues, respectively.
- ROCm Platform Test Failures: Multiple issues report the disabling of tests on the ROCm platform due to consistent failures on the main branch. These failures are linked to specific pull requests and recent instability in ROCm jobs, affecting tests like test_triton_relu_fused_gemm and test_consolidate_to_one_file.
- Regression Issues in PyTorch 2.8 RC: The PyTorch 2.8 RC version has introduced regressions affecting specific tests, such as the Doge model and TimesFmModelTest::test_sdpa_can_compile_dynamic. These regressions are due to changes in attention mask handling and scalar conversion overflow, with proposed fixes and workarounds being discussed.
- Random Number Generation Discrepancy on CUDA 12.6: A critical issue in PyTorch involves the randn function producing different outputs for the same seed on different machines when using CUDA 12.6, affecting reproducibility in experiments. This discrepancy is not observed on CPUs, suggesting a potential issue with the GPU or CUDA kernel.
- Documentation vs Implementation Discrepancies: An issue highlights a discrepancy between the documentation and implementation of the scaled_dot_product_attention function in PyTorch. The documentation suggests False values in a boolean mask should be filled with negative infinity, while the implementation fills True values, leading to confusion about padding positions.
- Build and Compilation Errors: Several issues report build and compilation errors in PyTorch, including a Windows build failure due to incorrect flag usage in Ninja and a Clang compilation error due to missing symbolic functions. These issues suggest the need for conditional fallbacks and guidance on manual function implementations.
- Bugs in PyTorch Functions: Various bugs in PyTorch functions have been reported, such as the unwrap_to_op_info function's failure with TensorList inputs, the @cached_property decorator not raising exceptions correctly, and the index_put operation modifying incorrect tensor dimensions. These issues highlight the need for fixes to ensure expected behavior.
- Accuracy Regression in Inductor Unit Tests: An accuracy regression has been identified in the Inductor unit tests for PyTorch's Triton backend, specifically affecting standard deviation and variance computations on XPU with float64 precision. This regression is linked to updating the CI driver from LTS to LTS2.
- Support for FP8_E4M3 Format on CUDA: An issue discusses the need for support in training models using the FP8_E4M3 format on CUDA devices, as users encounter a "NotImplementedError" when fine-tuning E4M3 models. Clarification is sought on whether this limitation has been addressed in recent updates or influenced by specific hardware.
- Inconsistent Behavior with CuPy Arrays: An issue addresses inconsistent behavior in PyTorch's torch.as_tensor function when working with CuPy arrays, due to a CUDA driver bug affecting device inference. A workaround is suggested by allowing users to override automatic device detection with a device argument; a minimal sketch follows this list.
- Inappropriate Use of GitHub for Discussions: An issue highlights the inappropriate use of GitHub for discussions, where a user implementing PyTorch in Dart was advised to use the PyTorch developer forum instead. This suggests the need for clearer guidelines on the appropriate platforms for different types of contributions.
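For the CuPy interoperability item above, the suggested workaround amounts to passing the device explicitly instead of relying on automatic inference. A minimal sketch, assuming CuPy and a CUDA device are available:

```python
import cupy as cp
import torch

# A CuPy array resident on the current CUDA device.
cp_arr = cp.arange(6, dtype=cp.float32).reshape(2, 3)

# Workaround discussed in the issue: state the target device explicitly
# rather than letting torch.as_tensor infer it from the CuPy array.
t = torch.as_tensor(cp_arr, device="cuda")
print(t.shape, t.device)
```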
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 203
Key Open Pull Requests
1. Clarify torch.max Docstring by Removing Incorrect Tuple of Ints Reference: This pull request aims to clarify the documentation for the torch.max function by removing an incorrect reference to a tuple of integers as a supported data type for the dim parameter, thereby addressing issue #158645.
- URL: pull/158707
- Merged: No
- Associated Commits: d7062, d363d, 092ad, ccf3b, 3b132, e982d, f1c0d, 4f2c7, dfdf9, 1caab, 19541, 9cdce, 46cec, 7d463, f7abf, 68f87, 403f9, 3f4a2, 01a57, 6d9ca, d0f0e, 9113b, 064a3, a9214, e6921, 05bf7, 45b74, 09ed6
2. Setup TorchBench in Docker: This pull request aims to set up TorchBench within a Docker environment to significantly reduce the setup time on A100 and H100 GPUs by approximately half an hour, with testing benchmarks provided for both GPU models to ensure all models are correctly integrated.
- URL: pull/158613
- Merged: No
- Associated Commits: 0e37f, ec8fa, 3aec2, 30c21, 444df, 55316, a151d, 0ce83, 9fe49, 9ff8a, 753fe, 2435a, 2bc4a, 14a38, 80d24, c6847, 550d9, 1eaa3, 046bc, 837ea, 626d1, 1b2cd, 1d309, e6bd5, 701a4
3. [DO NOT MERGE] Enable MI355X ROCm CI testing.: This pull request is focused on enabling and testing the PyTorch ROCm Continuous Integration (CI) system on MI355X nodes, involving multiple updates such as enabling ROCm 7.0 Alpha CI, updating docker image names, adjusting runner labels, and incorporating compatibility changes for MI350x, all while ensuring the necessary infrastructure and dependencies are in place for effective testing.
- URL: pull/158221
- Merged: No
- Associated Commits: ce1e4, 1322b, 929cd, ab0c1, 31f1a, 05783, 1d2c2, 64294, 39e02, 74319, c4a54, 88f41, 0b932, c5d76, b98a6, c1ad0, e4187, 8b85f, ca7d5
Other Open Pull Requests
- Event Management in PyTorch: Several pull requests focus on enhancing the management of events within the PyTorch project. These changes include reusing the EventPool::Event within the CUDAAllocator, moving the Event class to the c10 namespace, and relocating CUDAEvent and XPUEvent to their respective directories to maintain consistency and simplify the codebase.
- DataLoader Enhancements: Enhancements to the PyTorch DataLoader are addressed in multiple pull requests. These include deprecating the pin_memory_device parameter and introducing multi-threading capabilities within each worker to improve data fetching performance from high-latency storage systems.
- Autograd and Operator Improvements: Improvements to autograd and operator functionality in PyTorch are covered in several pull requests. These include introducing autograd rules for the aten::aminmax operator and addressing issues with the clamp strategies to fix op coverage test failures.
- Documentation and Code Quality: Enhancements to documentation and code quality are the focus of several pull requests. These involve documenting torch.nn.modules functions, generating CLAUDE.md files, and profiling linters to improve performance or configuration.
- Performance and Optimization: Various pull requests aim to optimize performance in the PyTorch project. These include enhancing the TENSOR_MATCH operation, updating the OpenBLAS commit for performance boosts, and optimizing test job durations by using larger runners.
- Code Refactoring and Cleanup: Code refactoring and cleanup are addressed in several pull requests. These include replacing the check_pyobj(bool) function, removing _RegisterOrVerify logic, and eliminating the pyinterpreter struct for space savings.
- New Features and Functionality: New features and functionality are introduced in several pull requests. These include adding a lowering mechanism for repeat_interleave.Tensor, enhancing device placement in CuPy, and introducing a new replicate function using FSDP.
- Miscellaneous Enhancements: Various other enhancements are covered in several pull requests. These include enhancing C++ code generation tabbing, fixing the update constants buffer, and addressing CPU kernel generation issues with MPS and CPU code.
- Automation and Tooling: Automation and tooling improvements are introduced in several pull requests. These include automating the JSON registry update using lintrunner and enhancing the torch.cat meta function for better error handling.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 195
Key Closed Pull Requests
1. [release/2.5] upgrade3.13: This pull request, titled "[release/2.5] upgrade3.13," is a QA request that involves multiple commits aimed at upgrading and fixing various components and dependencies in the PyTorch project for the 2.5 release, including updates to the ONNX exporter documentation, improvements to the Inductor AMP FP16 benchmark, fixes for xpu memory stats errors, and several other enhancements and bug fixes across different modules, as indicated by the detailed commit messages and the involvement of numerous contributors and reviewers.
- URL: pull/158280
- Merged: No
- Associated Commits: e1306, 4a3da, ba275, cea56, 612fc, 828d6, 6b14e, 9e315, 98870, 813e0, a889c, 4b5bf, 1db2a, 39209, 19544, b7de7, fb276, becdf, ffed7, c5e52, dd732, 4e6a9, 7a007, 7c550, 6e867, 9b80d, 4b030, cbe47, 2e2c0, 17157, ecd33, c2578, 39641, 6101a, 24bd8, aa574, bc421, 051df, 550ed, 57717, d80f5, 69ed7, 70298, 17d25, 058d3, 8a71e, 8af31, f8c4c, 3a541, 1a0b1, 0b45a, 0b1b6, 1de13, 53752, 783a6, 119e7, 417a0, 32f58, 7d26c, 4be8e, ff940, f6ad6, da863, 50eb2, bb7fd, 8fa58, d8002, ba7c2, 00314, dc956, b253b, 757cb, 8a12b, db943, 5e412, 6efce, 79740, b654f, 362be, 2acd2, aaa31, ed475, 09af6, 088b8, dca35, 02220, 6df27, dca53, 08c07, d88da, feade, 5b76f, 8ec01, 06b6a, f0fb4, 2096c, 75e26, 0e2b4, 9ba9a, 6a4c4, 17dc4, 49c8b, 72908, 7d7ec, 5d212, c3ba1, f0927, abbfe, f069c, f8544, 1ed41, 826ee, b0ea6, e814e, f6389, 78a47, f0207, c5667, d8a7a, 45946, 61ba0, 01137, 0a0be, 78426, a4935, f929e, 33911, 4bed2, 60cb6, dacf5, ff48a, 80f18, 47074, b5380, 5d018, 6a281, 8b752, 8b59e, 38c82, a6f13, 4b515, 8e47c, 23d1a, f27cc, a07b6, 80e18, 1550e, 8ecf0, 4ed5c, 595a2, 8b7ad, 58e54, a02ca, 0e782, c040e, dda59, e481a, a1efa, 4b826, 5eaa4, 0e8bd, 2b906, 01a0b, fb716, 94e61, 94412, fe82c, 973af, cb954, 5a980, 45e62, f2ca4, a92b4, b2e4f, 0119c, 6d856, a0985, a0aaf, 5ca0b, 64033, fc456, 84209, 6c84d, 87431, 554c7, fd9f7, 4904f, fcfca, f2af1, af9d0, 2bd6c, 4ac3a, c2e67, 433ce, 4ce2c, 47a02, e63b8, fd3c6, 8a1ad, eaaa7, 6c674, 5ddb6, 8cda9, 57d41, 0f2b8, 00316, 0fc21, db67d, b4977, c0bc4, 8d82c, 75163, 675b2, d401b, faab7, a1ad1, eea96, 2269e, 66db7, 1be0e, 31ceb, 7db77, 0a2ec, b3cf5, 0b718
2. [CUDA] Use runtime driver API for cuStreamWriteValue32: This pull request aims to update the PyTorch project by utilizing the CUDA runtime driver API for the cuStreamWriteValue32 function, addressing issues from previous pull requests and incorporating various code improvements such as refactoring, version checks, and lint fixes.
- URL: pull/158295
- Merged: No
- Associated Commits: 072f5, b4505, 05605, ac8f4, 7af44, bbdb8, 14a3f, 8b852, d17fd, 1057c, b8612, e37ab, f37d6, cb140, 84152, a5c26, 7d313, 449bc, 1d1c1, c4a63
3. [BE] Add pre-push hook for lintrunner to the PyTorch repo: This pull request introduces an optional pre-push hook for the PyTorch repository that automatically runs the lintrunner tool in a consistent, isolated virtual environment before every git push, ensuring code quality checks are performed on changes, with the ability to skip these checks if necessary.
- URL: pull/158389
- Merged: No
- Associated Commits: 887f9, 892e1, ae1fc, c3ec7, cbd7a, bdb15, 6bcb7, 24125, 3c479, 73da8, 9d4fd, 3c65f, b45b2, d9e4b, ea23d, 4ee6c, 16e54, dacdc, b0332, 0e533
Other Closed Pull Requests
- GenAI Layer Benchmarking: This pull request introduces a GenAI layer benchmark to compare the performance of various PyTorch execution modes, including eager execution, PyTorch compiler, Liger, and Quack. It evaluates all kernels supported by Quack, such as CrossEntropy, Softmax, RMSNorm, and LayerNorm, while addressing common benchmarking errors and providing a clear example for proper benchmarking practices.
- Deprecation of torch::deploy: Multiple pull requests focus on deprecating the torch::deploy feature in the open-source PyTorch project. They involve removing the test_deploy_interaction test, eliminating all torch._running_with_deploy checks, and removing the USE_DEPLOY flag, while noting that MyPy errors may need to be addressed separately.
- Docker Build Transition: This pull request addresses an issue with Docker builds by proposing a transition from using Miniconda to Miniforge. The change is due to problems related to the acceptance of Anaconda's Terms of Service, which caused build failures, and suggests using Miniforge as a solution since it does not require the same terms acceptance.
- CTC Loss API Enhancement: This pull request aims to enhance the Python API for CTC (Connectionist Temporal Classification) loss by adding support for raw, unnormalized inputs. It ensures backward compatibility with the existing log-simplex space and includes various commits addressing gradient checks, test case updates, and documentation improvements.
- merge_remote_group Implementation: This pull request involves a tentative implementation of the merge_remote_group function as part of the PyTorch project. It follows a proposal outlined in a linked Google document and includes multiple updates and discussions with contributors.
- Symbolic Integer Variables Handling: This pull request addresses the issue of redundant wrapping of symbolic integer variables (symInts) in PyTorch's dynamo. It ensures that these variables are properly tracked and de-duplicated in side effects, preventing the creation of multiple unique SymNodeVariable instances for the same symInt during subgraph speculation.
- Multi-Kernel Support for ROCm: This pull request aims to address and fix the issue of disabling multi-kernel support for ROCm in the PyTorch project. It involves multiple updates and contributions from various developers.
- Warnings Handling in PyTorch: This pull request addresses the issue of hiding warnings in the PyTorch project without invalidating the warnings cache. It implements a solution that avoids directly modifying the undocumented and private __warningregistry__, instead opting to suppress warnings in the torch/_dynamo module.
- PyObjectSlot and PyInterpreter Simplification: This pull request aims to simplify the PyObjectSlot and PyInterpreter by using a global PyInterpreter. It removes the "tags" of the PyInterpreter by deprecating PyInterpreterStatus and updates all call sites to rely on the assumption of a single global interpreter.
- DTensor copy_ Strategy Bug Fix: This pull request addresses a bug in the DTensor copy_ strategy by correcting the mapping of dimensions between the self and src inputs. It ensures proper functionality even when the src input is broadcasted, as demonstrated with specific sharding combinations for a given tensor example.
- NVIDIA Driver Compatibility: This pull request aims to address issues with older NVIDIA drivers, specifically version 525.*, by adding a periodic test in the CI that installs driver 525.105.17. It ensures CUDA can be initialized and the entire test suite runs successfully.
- DTensor Strategy Improvements: This pull request addresses several bugs in the DTensor's default strategy by fixing incorrect strategy returns and adjusting redistribute costs. It separates the copy_ operation from the default strategy and renames the function to propagate_single_input_strategy for clarity.
- ONNX Component Legacy Removal: Multiple pull requests aim to remove legacy components from the ONNX module in the PyTorch project. These changes are part of a series managed through the ghstack tool, although they were ultimately not merged.
- Registry Format Update: This pull request involves updating the registry format from JSON to YAML to store code samples for additional information. It introduces a new feature that opens a nano editor in the terminal for developers to input content when the --additional info command is executed.
- PyObjectSlot Modifications: This pull request proposes modifications to the PyObjectSlot by removing unnecessary components such as check_interpreter and has_pyobj_nonhermetic. It simplifies the codebase and reduces potential risks under the assumption that only a single interpreter is in use.
- Public API Type Issues: This pull request aims to address issues with the public API of types in the PyTorch project by attempting to use a new type statement. It also seeks alternative solutions to maintain compatibility with older Python versions, as evidenced by multiple commits exploring different approaches and linting adjustments.
- DTensor Documentation: This pull request aims to document the "redistribute_costs" feature within the DTensor module of the PyTorch project. It involves multiple commits and discussions with several contributors, although it was ultimately not merged.
- Inductor CI Test Skipping: This pull request addresses the issue of excessive time consumption by the test_torchinductor_opinfo.py unit test during the enabling of inductor CI on Windows. It temporarily skips the test while a solution for the compiler building speed on Windows is sought.
- ShardingPropagator Documentation: This pull request aims to add documentation to the ShardingPropagator.register_op_strategy function. It provides guidance on drafting strategy_func and determining when to use schema_info as part of the PyTorch project.
- Triton Integration Update: This pull request updates the Triton commit hash to the latest release/3.4.x branch and modifies the handling of the HAS_WARP_SPEC. It accommodates changes in Triton 3.4, where the warp spec interface now automatically determines the number of consumer groups.
- Inductor Imports Modification: This pull request aims to modify the PyTorch project by making the Inductor imports conditional on the TYPE_CHECKING flag. It is part of a stack of changes managed by ghstack, although it was ultimately not merged.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- DO NOT MERGE, testing lint failures and how they appear in the CI.
- Toxicity Score: 0.55 (Defensive responses, passive-aggressive remarks, unresolved technical issues.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing concern over potential issues in the code, while username2 responds with a defensive tone. The conversation escalates as username3 joins, supporting username1's concerns, which leads to username2 showing signs of frustration. The tone remains tense as the discussion continues, with users focusing on resolving the technical issues but occasionally slipping into passive-aggressive remarks.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| XuehaiPan | 292 | 35 | 1 | 15 |
| bobrenjc93 | 170 | 36 | 0 | 3 |
| malfet | 73 | 8 | 4 | 82 |
| atalman | 102 | 18 | 8 | 22 |
| williamwen42 | 35 | 9 | 4 | 84 |
| ezyang | 48 | 17 | 4 | 59 |
| wconstab | 52 | 10 | 0 | 64 |
| guangyey | 83 | 7 | 3 | 29 |
| coconutruben | 49 | 8 | 0 | 47 |
| jansel | 26 | 7 | 1 | 70 |