Weekly GitHub Report for PyTorch: July 14, 2025 - July 21, 2025 (12:05:38)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for Python 3.13 with torch.compile, a new performance tuning feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks a shift away from publishing on Conda, with a focus on using Manylinux 2.28 for Linux binaries, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Transition Adam and maybe other optimizers to fused code path by default to avoid foreach=True-specific VRAM peak due to temp TensorList for bias-corrected moments: This issue discusses the proposal to transition the default code path of the Adam optimizer, and potentially other optimizers, to a fused code path to avoid a significant VRAM peak caused by the current foreach=True setting, which allocates a temporary TensorList for bias-corrected moments. The author suggests switching to fused=True to alleviate this memory allocation issue and inquires about the state of compiled optimizers and their ability to avoid extra allocations by integrating bias correction into the weight update rule; a minimal sketch of the fused setting appears after this list.
- The comments discuss plans to make fused the default setting, acknowledging unresolved performance issues, and suggest improvements to memory usage. There is a consensus on the potential benefits of using compiled optimizers and the need for benchmarks. The discussion also covers the compatibility of certain techniques with FSDP, the possibility of using torch.compile for optimizers, and the idea of including torch.compile-powered implementations in the core.
- Number of comments this week: 13
- received a FakeScriptObject input when dispatching DispatchKey.AutocastCUDA. but no python implementation is found: This issue involves a bug encountered when using custom operators with PyTorch's JIT Script Tracing and Autocast, specifically when a FakeScriptObject input is received during the dispatching of DispatchKey.AutocastCUDA, but no Python implementation is found. The user is seeking clarification on whether this is a known limitation and is looking for guidance on how to resolve the issue, as it affects the composition of custom operators with various PyTorch systems.
- The comments discuss whether the custom operator is registered via C++ or Python, with suggestions to use Python registration for better support. The user provides a reproducible setup and asks for advice on the best approach, while another participant confirms a small reproduction of the issue and suggests that the Python registration method is the correct path to pursue.
- Number of comments this week: 6
- torch 2.8 RC regression - part 2: This issue reports a regression in the PyTorch 2.8 RC version, specifically related to a test failure when running a script on an NVIDIA A10 GPU. The error is caused by a stricter aliasing check in the higher-order operations, which results in a graph break during the execution of a test case in the transformers library.
- The comments discuss the cause of the test failure, which is due to a stricter aliasing check introduced in PyTorch. A suggested fix is to clone the sliced output before returning it in the code. It is confirmed that this change should be made in the transformers library, and the issue is removed from the PyTorch 2.8.0 milestone as it is considered expected behavior.
- Number of comments this week: 5
- Remove tuple of ints as supported dtype for torch.max(): This issue highlights a discrepancy in the documentation for the torch.max() function, where it incorrectly states that the dim parameter supports a tuple of integers, which leads to an error when used. The issue suggests correcting the documentation by removing the mention of a tuple of integers as a supported type for the dim parameter; a short illustration appears after this list.
- A user offers to collaborate on the task, and another user assigns the issue to themselves. A discussion follows about the inconsistency between the documentation and the source code, with images provided for clarification. Finally, a pull request is submitted to address the documentation error.
- Number of comments this week: 4
- GPT-2 compiled model perf/accuracy: This issue describes a bug encountered when attempting to compile a GPT-2 model using PyTorch, following a tutorial by Karpathy, which results in a syntax error related to Metal shader compilation. The user provides links to their script and error logs, and speculates that the error might be due to a misunderstanding in the shader script regarding variable scope within a loop.
- The comments indicate that the issue has been fixed in a future PyTorch release, but there are concerns about silent correctness errors. The user reports that compiling from the main source resolves the compilation issue, although performance and accuracy are still not optimal.
- Number of comments this week: 3
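As context for the first issue above, here is a minimal sketch of the one-argument change under discussion (the toy model is an assumption for illustration, not code from the issue; the fused Adam path currently targets CUDA tensors):

```python
import torch

# Toy model used only to illustrate the optimizer flag.
model = torch.nn.Linear(1024, 1024).cuda()

# Current default multi-tensor path: foreach=True builds a temporary
# TensorList for the bias-corrected moments, which is the VRAM peak
# described in the issue.
opt_foreach = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

# Proposed default: the fused path updates parameters with a single fused
# CUDA kernel and avoids that temporary allocation.
opt_fused = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt_fused.step()
opt_fused.zero_grad()
```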
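And for the torch.max() documentation issue above, a short illustration of the behavior the fix documents (the two-dimensional example is an assumption, not taken from the issue):

```python
import torch

x = torch.randn(3, 4)

# Supported: dim as a single int returns (values, indices) along that dim.
values, indices = torch.max(x, dim=1)
print(values.shape, indices.shape)  # torch.Size([3]) torch.Size([3])

# Not supported, despite the old docstring: a tuple of ints raises a TypeError.
try:
    torch.max(x, dim=(0, 1))
except TypeError as err:
    print("tuple dim rejected:", err)
```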
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment using the 'inductor' backend. The problem arises in a Python script that configures a pipeline with specific model components and attempts to compile them using Torch's compile function, but fails due to the missing import, affecting the execution of the script on a system running Ubuntu 22.04.3 LTS with CUDA support.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for larger kernel sizes (a rough sketch of this equivalence follows this list). The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method.
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, cuda_utils.so, fails due to a missing execution permission, despite the script being executed with root privileges. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the absence of the execution bit on the cuda_utils.so file, which leads to an ImportError during the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the storage footprint of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files that can occupy a substantial portion of the model's total size.
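As a rough sketch of the equivalence claimed in the MaxPool2D proposal above (the padding choices are assumptions for illustration; PyTorch pads max pooling with negative infinity, so the chained result matches the single pool):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# One 5x5 max pool with stride 1.
pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

# Two chained 3x3 max pools with stride 1 cover the same 5x5 window,
# because a max of maxes over overlapping windows equals one max over
# their union.
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```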
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 64
Summarized Issues:
- Incorrect or NaN Gradients in Custom Attention Mask: This issue involves incorrect or NaN gradients for k_proj and v_proj when using a custom attention mask with compiled flex attention in PyTorch. The problem is not related to precision or hardware and persists across different data types and hardware configurations, similar to a previously reported issue.
- Compilation Output Discrepancies: Using torch.compile() on a transformer model with causal attention results in significantly different outputs compared to the uncompiled version. The first two elements of the model output have opposite signs relative to the eager output, indicating a potential problem in the compilation process affecting the model's behavior.
- Performance and Compatibility Issues: A customer experiences significantly prolonged execution times when running torchscripted PyTorch models on a Windows system with an NVIDIA 5090 GPU using libtorch 2.7.0. Additionally, a critical incompatibility arises when nesting torch.no_grad and torch.autocast contexts while using Fully Sharded Data Parallel (FSDP) with a param_dtype of torch.float32, leading to a runtime error during the backward pass.
- Build and Compilation Challenges: The significantly slower build times of Windows compilers compared to Linux are being investigated to identify the root cause and potential solutions. Additionally, a memory access fault occurs when the function at::_flash_attention_forward is called on an AMD Radeon RX 7900 XTX GPU, resulting in an error message indicating a page not present or supervisor privilege problem.
- Feature Requests and Enhancements: There is a request for updates and access to a design document regarding the support for Mixture of Experts (MoE) and Expert Parallelism (EP) in DTensor. Additionally, a proposal to enable the use of C++20's std::format in Torch and Kineto as an alternative to libfmt is made, requiring a conditional check similar to the implementation in spdlog.
- Logging and Debugging Issues: Starting from the PyTorch nightly build dated 20250630, logging and debugging are forcefully set to the INFO level in applications using PyTorch, such as reForge and ComfyUI, without an apparent way to disable it. This behavior was not present in previous versions.
- Incompatibility and Update Requirements: Updating the PyTorch Elastic Run to support the etcd v3 API as a rendezvous backend is necessary due to the current etcd backend's incompatibility with the latest etcd version 3.6.0. Additionally, upgrading the XNNPACK library in the PyTorch project is required as the current version is outdated, leading to compilation errors.
- Optimization and Efficiency Improvements: The _rename_without_collisions function within the placeholder_naming_pass of the torch/_export/utils.py module currently consumes over 35% of the total runtime for large FX graphs due to its O(n^2) complexity. A more efficient algorithm could significantly reduce export times.
- Memory and Performance Enhancements: A feature request for the memory visualization tool proposes implementing flexible time or range memory event filtering controls to prevent browser unresponsiveness. Additionally, transitioning the default code path for the Adam optimizer to a fused implementation is discussed to mitigate peak VRAM usage.
- Regression and Bug Fixes: A regression in the PyTorch 2.8 RC version causes a stricter aliasing check to fail during higher-order operation tracing. Additionally, a bug in the PyTorch project involves the while_loop implementation encountering errors during decomposition and inference.
- Internal Structure and Error Handling: Reorganizing the internal module structure of AOTAutograd in the PyTorch project is proposed to improve clarity and maintainability. Additionally, the full internal stack trace is displayed with the torch._dynamo.exc.BackendCompilerFailed error, suggesting the use of TORCHDYNAMO_VERBOSE=1 for additional context.
- Training and Model Support: Extending the AOTInductor to support model training is discussed, addressing challenges such as parameter representation and gradient accumulation. Additionally, a bug in the PyTorch project involves the @torch.compile decorator failing unexpectedly with an internal "tuple index out of range" error.
- Inconsistent Results and Numerical Discrepancies: The torch.absolute function produces inconsistent results for complex128 data types between CPU and CUDA. Additionally, a bug in the PyTorch library involves the torch.copysign function producing inconsistent results between CPU and CUDA for float16 data types.
- Functionality and Compatibility Issues: A bug in the PyTorch library involves the torch.logaddexp function producing inconsistent results for complex128 data types when executed on CPU versus CUDA. Additionally, a bug in the Inductor backend of PyTorch fails to compile models using the Tensor.copy_() method with a scalar source value.
- Terms of Service and Performance Optimization: The new Anaconda Terms of Service are causing continuous integration (CI) failures in the PyTorch project. Additionally, a proposal to introduce CUDA streams in the BatchLinearAlgebraLib.cpp file aims to address performance drop-offs in CUDA-accelerated batch operations.
- Documentation and Latency Issues: A documentation error in the PyTorch project involves the term "SourceBuilder" being mentioned incorrectly. Additionally, a discrepancy in latency when using torch.compile with torch.nn.functional.cross_entropy in different wrapping methods is identified as a user experience issue.
- Compiler and Warning Issues: A failure in the Inductor compiler occurs when attempting to compile models that utilize torch.randperm in conjunction with advanced indexing. Additionally, a warning is triggered in PyTorch 2.8 rc5 due to the use of deprecated logical operators 'and' and 'or' for non-scalar tensors in flex attention kernels.
- State Dict and Scalar Float Support: A bug in the PyTorch library involves the get_model_state_dict function returning inconsistent output keys before and after using the set_model_state_dict function. Additionally, the incorrect handling of scalar float support in the static CUDA launcher within the PyTorch project is addressed.
- Numerical Discrepancies and Conversion Issues: A numerical discrepancy is observed when using torch.compile in PyTorch, where the compiled model's output significantly deviates from the float64 eager execution baseline. Additionally, a small numerical discrepancy is observed when using the torch.compile function with the max-autotune-no-cudagraphs mode on CUDA. A minimal comparison sketch follows this list.
- Test Failures and Compilation Errors: The test test_einsum_to_pointwise in the TestKernelOptimization suite is disabled on ROCm platforms due to failures. Additionally, a bug in the PyTorch library involves compiling a model using the inductor backend, which combines 2D convolution and FFT operations, resulting in an internal stride validation error.
- Runtime Errors and Compilation Issues: A persistent RuntimeError in PyTorch 2.3.0 related to CUDAGraphs occurs during a module's forward pass. Additionally, a bug in PyTorch involves using torch.compile with an in-place operation like copy_() raising a RuntimeError related to gradient computation.
- CUDA and Compilation Errors: A bug in the PyTorch library involves the torch.cuda.make_graphed_callables function failing to correctly capture functions that include slice operations. Additionally, a bug in the PyTorch library involves attempting to compile a model using torch.compile with the torch.ops.higher_order.scan operation resulting in an UncapturedHigherOrderOpError.
- Regression and Build Errors: A regression in PyTorch 2.8 causes the async-TP compiler pass to throw an assertion error if any all-gather or reduce-scatter operation is performed on a ProcessGroup without symmetric-memory enabled. Additionally, a build error in the FBGEMM GPU/GenAI CI is related to the function c10::cuda::get_cuda_check_suffix() requiring more arguments.
- Gradient and Memory Issues: A bug in the PyTorch library involves the vmap function inadvertently disabling gradient calculation by setting requires_grad to False. Additionally, a memory leak in the AOTIModelPackageLoader component of the PyTorch project is reported, with a pull request submitted to address the likely cause.
- Scheduler and Documentation Errors: A bug in PyTorch 2.7.1 involves the _inductor.Scheduler leaving node mappings in an inconsistent state when the requires_grad parameter is modified. Additionally, a documentation error in the PyTorch project involves the RandomSampler documentation incorrectly stating the data_source parameter type.
- Function Crashes and Feature Enhancements: A bug in the PyTorch library involves the torch.nansum function crashing when applied to a torch.tensor of type complex32 on a CUDA device. Additionally, a feature enhancement is proposed for the PyTorch library, suggesting the addition of an explicit keyword argument set_grad_to_none=True to the optimizer.step() function.
- CUDA Support and Graph Compilation: Adding support for the checkPoolLiveAllocations feature in cudaMallocAsync within the PyTorch library is requested. Additionally, a problem with PyTorch's Dynamo graph compilation process involves the use of the Tensor.item() operation causing a graph compilation interruption.
- Inconsistent Outputs and Documentation Discrepancies: A problem with torch.compile() involves models containing dropout and in-place ReLU operations producing inconsistent outputs. Additionally, a discrepancy in the PyTorch documentation for the torch.max() function is highlighted, where it incorrectly states that the dim parameter supports a tuple of integers.
- Checkpointing and Export Issues: Extending the torch.util.checkpoint functionality in PyTorch to support offloading checkpointed tensors to CPU and optionally NVMe storage is proposed. Additionally, a bug is encountered when exporting a PyTorch model trained with mixed precision bfloat16 to ONNX format, where a non-learnable scalar tensor of dtype fp32 is incorrectly cast to complex128.
- Test Disabling and Optimization: The test_sum_all_cpu_float64 test within the TestReductionsCPU suite is disabled due to flaky failures. Additionally, optimizing the Fully Sharded Data Parallel (FSDP) process in PyTorch by fusing the all-reduce operation in fp32 with the dtype conversion to bf16 is discussed.
- Precision and Runtime Errors: A discrepancy in numerical precision between the compiled and eager execution of a simple rms_norm function in PyTorch is reported. Additionally, a RuntimeError in PyTorch version 2.5.1 involves an internal assertion failing in the profiling graph executor implementation.
- DataLoader and Gradient Computation: Enhancing the DataLoader in PyTorch to allow prefetching of batches across epochs is proposed. Additionally, a user seeks guidance on computing gradients in float32 precision while storing model weights in bfloat16 precision for fine-tuning a large model.
- In-Place Conversion and Compilation Issues: A feature request for implementing an in-place downcast dtype conversion in PyTorch is made. Additionally, a separate report raises the same Tensor.item() graph compilation interruption in Dynamo described above.
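Several of the discrepancy reports above come down to comparing a compiled function against an eager (often float64) baseline. The following is a minimal sketch of that style of check; the rms_norm workload and the tolerances are placeholders chosen for illustration, not taken from any specific issue:

```python
import torch

def rms_norm(x, eps=1e-6):
    # Stand-in workload; any small differentiable function would do.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

compiled = torch.compile(rms_norm)
x = torch.randn(64, 512)

ref = rms_norm(x.double())   # eager float64 reference
out = compiled(x).double()   # compiled float32 result, upcast for comparison

print("max abs error:", (out - ref).abs().max().item())
print("within tolerance:", torch.allclose(out, ref, rtol=1e-4, atol=1e-5))
```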
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 19
Summarized Issues:
- Discrepancies in CPU vs GPU Computations: Several issues highlight discrepancies in PyTorch computations between CPU and GPU, often due to differences in underlying libraries or precision errors. These include differences in the torch.det function, torch.nn.LogSoftmax, and torch.nn.functional.celu outputs, which have been attributed to variations in LAPACK and cuSOLVER operations, floating-point errors, and precision issues, respectively.
- ROCm Platform Test Failures: Multiple issues report the disabling of tests on the ROCm platform due to consistent failures on the main branch. These failures are linked to specific pull requests and recent instability in ROCm jobs, affecting tests like test_triton_relu_fused_gemm and test_consolidate_to_one_file.
- Regression Issues in PyTorch 2.8 RC: The PyTorch 2.8 RC version has introduced regressions affecting specific tests, such as the Doge model and TimesFmModelTest::test_sdpa_can_compile_dynamic. These regressions are due to changes in attention mask handling and scalar conversion overflow, with proposed fixes and workarounds being discussed.
- Random Number Generation Discrepancy on CUDA 12.6: A critical issue in PyTorch involves the randn function producing different outputs for the same seed on different machines when using CUDA 12.6, affecting reproducibility in experiments. This discrepancy is not observed on CPUs, suggesting a potential issue with the GPU or CUDA kernel.
- Documentation vs Implementation Discrepancies: An issue highlights a discrepancy between the documentation and implementation of the scaled_dot_product_attention function in PyTorch. The documentation suggests False values in a boolean mask should be filled with negative infinity, while the implementation fills True values, leading to confusion about padding positions.
- Build and Compilation Errors: Several issues report build and compilation errors in PyTorch, including a Windows build failure due to incorrect flag usage in Ninja and a Clang compilation error due to missing symbolic functions. These issues suggest the need for conditional fallbacks and guidance on manual function implementations.
- Bugs in PyTorch Functions: Various bugs in PyTorch functions have been reported, such as the unwrap_to_op_info function's failure with TensorList inputs, the @cached_property decorator not raising exceptions correctly, and the index_put operation modifying incorrect tensor dimensions. These issues highlight the need for fixes to ensure expected behavior.
- Accuracy Regression in Inductor Unit Tests: An accuracy regression has been identified in the Inductor unit tests for PyTorch's Triton backend, specifically affecting standard deviation and variance computations on XPU with float64 precision. This regression is linked to updating the CI driver from LTS to LTS2.
- Support for FP8_E4M3 Format on CUDA: An issue discusses the need for support in training models using the FP8_E4M3 format on CUDA devices, as users encounter a "NotImplementedError" when fine-tuning E4M3 models. Clarification is sought on whether this limitation has been addressed in recent updates or influenced by specific hardware.
- Inconsistent Behavior with CuPy Arrays: An issue addresses inconsistent behavior in PyTorch's torch.as_tensor function when working with CuPy arrays, due to a CUDA driver bug affecting device inference. A workaround is suggested by allowing users to override automatic device detection with a device argument; a minimal sketch follows this list.
- Inappropriate Use of GitHub for Discussions: An issue highlights the inappropriate use of GitHub for discussions, where a user implementing PyTorch in Dart was advised to use the PyTorch developer forum instead. This suggests the need for clearer guidelines on the appropriate platforms for different types of contributions.
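For the CuPy interoperability item above, the suggested workaround amounts to passing the device explicitly instead of relying on automatic inference. A minimal sketch, assuming CuPy and a CUDA device are available:

```python
import cupy as cp
import torch

# A CuPy array resident on the current CUDA device.
cp_arr = cp.arange(6, dtype=cp.float32).reshape(2, 3)

# Workaround discussed in the issue: state the target device explicitly
# rather than letting torch.as_tensor infer it from the CuPy array.
t = torch.as_tensor(cp_arr, device="cuda")
print(t.shape, t.device)
```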
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 203
Key Open Pull Requests
1. Clarify torch.max Docstring by Removing Incorrect Tuple of Ints Reference: This pull request aims to clarify the documentation for the torch.max function by removing an incorrect reference to a tuple of integers as a supported data type for the dim parameter, thereby addressing issue #158645.
- URL: pull/158707
- Merged: No
- Associated Commits: d7062, d363d, 092ad, ccf3b, 3b132, e982d, f1c0d, 4f2c7, dfdf9, 1caab, 19541, 9cdce, 46cec, 7d463, f7abf, 68f87, 403f9, 3f4a2, 01a57, 6d9ca, d0f0e, 9113b, 064a3, a9214, e6921, 05bf7, 45b74, 09ed6
2. Setup TorchBench in Docker: This pull request aims to set up TorchBench within a Docker environment to significantly reduce the setup time on A100 and H100 GPUs by approximately half an hour, with testing benchmarks provided for both GPU models to ensure all models are correctly integrated.
- URL: pull/158613
- Merged: No
- Associated Commits: 0e37f, ec8fa, 3aec2, 30c21, 444df, 55316, a151d, 0ce83, 9fe49, 9ff8a, 753fe, 2435a, 2bc4a, 14a38, 80d24, c6847, 550d9, 1eaa3, 046bc, 837ea, 626d1, 1b2cd, 1d309, e6bd5, 701a4
3. [DO NOT MERGE] Enable MI355X ROCm CI testing.: This pull request is focused on enabling and testing the PyTorch ROCm Continuous Integration (CI) system on MI355X nodes, involving multiple updates such as enabling ROCm 7.0 Alpha CI, updating docker image names, adjusting runner labels, and incorporating compatibility changes for MI350x, all while ensuring the necessary infrastructure and dependencies are in place for effective testing.
- URL: pull/158221
- Merged: No
- Associated Commits: ce1e4, 1322b, 929cd, ab0c1, 31f1a, 05783, 1d2c2, 64294, 39e02, 74319, c4a54, 88f41, 0b932, c5d76, b98a6, c1ad0, e4187, 8b85f, ca7d5
Other Open Pull Requests
- Event Management in PyTorch: Several pull requests focus on enhancing the management of events within the PyTorch project. These changes include reusing the EventPool::Event within the CUDAAllocator, moving the Event class to the c10 namespace, and relocating CUDAEvent and XPUEvent to their respective directories to maintain consistency and simplify the codebase.
- DataLoader Enhancements: Enhancements to the PyTorch DataLoader are addressed in multiple pull requests. These include deprecating the pin_memory_device parameter and introducing multi-threading capabilities within each worker to improve data fetching performance from high-latency storage systems.
- Autograd and Operator Improvements: Improvements to autograd and operator functionality in PyTorch are covered in several pull requests. These include introducing autograd rules for the aten::aminmax operator and addressing issues with the clamp strategies to fix op coverage test failures.
- Documentation and Code Quality: Enhancements to documentation and code quality are the focus of several pull requests. These involve documenting torch.nn.modules functions, generating CLAUDE.md files, and profiling linters to improve performance or configuration.
- Performance and Optimization: Various pull requests aim to optimize performance in the PyTorch project. These include enhancing the TENSOR_MATCH operation, updating the OpenBLAS commit for performance boosts, and optimizing test job durations by using larger runners.
- Code Refactoring and Cleanup: Code refactoring and cleanup are addressed in several pull requests. These include replacing the check_pyobj(bool) function, removing _RegisterOrVerify logic, and eliminating the pyinterpreter struct for space savings.
- New Features and Functionality: New features and functionality are introduced in several pull requests. These include adding a lowering mechanism for repeat_interleave.Tensor, enhancing device placement in CuPy, and introducing a new replicate function using FSDP.
- Miscellaneous Enhancements: Various other enhancements are covered in several pull requests. These include enhancing C++ code generation tabbing, fixing the update constants buffer, and addressing CPU kernel generation issues with MPS and CPU code.
- Automation and Tooling: Automation and tooling improvements are introduced in several pull requests. These include automating the JSON registry update using lintrunner and enhancing the torch.cat meta function for better error handling.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 195
Key Closed Pull Requests
1. [release/2.5] upgrade3.13: This pull request, titled "[release/2.5] upgrade3.13," is a QA request that involves multiple commits aimed at upgrading and fixing various components and dependencies in the PyTorch project for the 2.5 release, including updates to the ONNX exporter documentation, improvements to the Inductor AMP FP16 benchmark, fixes for xpu memory stats errors, and several other enhancements and bug fixes across different modules, as indicated by the detailed commit messages and the involvement of numerous contributors and reviewers.
- URL: pull/158280
- Merged: No
- Associated Commits: e1306, 4a3da, ba275, cea56, 612fc, 828d6, 6b14e, 9e315, 98870, 813e0, a889c, 4b5bf, 1db2a, 39209, 19544, b7de7, fb276, becdf, ffed7, c5e52, dd732, 4e6a9, 7a007, 7c550, 6e867, 9b80d, 4b030, cbe47, 2e2c0, 17157, ecd33, c2578, 39641, 6101a, 24bd8, aa574, bc421, 051df, 550ed, 57717, d80f5, 69ed7, 70298, 17d25, 058d3, 8a71e, 8af31, f8c4c, 3a541, 1a0b1, 0b45a, 0b1b6, 1de13, 53752, 783a6, 119e7, 417a0, 32f58, 7d26c, 4be8e, ff940, f6ad6, da863, 50eb2, bb7fd, 8fa58, d8002, ba7c2, 00314, dc956, b253b, 757cb, 8a12b, db943, 5e412, 6efce, 79740, b654f, 362be, 2acd2, aaa31, ed475, 09af6, 088b8, dca35, 02220, 6df27, dca53, 08c07, d88da, feade, 5b76f, 8ec01, 06b6a, f0fb4, 2096c, 75e26, 0e2b4, 9ba9a, 6a4c4, 17dc4, 49c8b, 72908, 7d7ec, 5d212, c3ba1, f0927, abbfe, f069c, f8544, 1ed41, 826ee, b0ea6, e814e, f6389, 78a47, f0207, c5667, d8a7a, 45946, 61ba0, 01137, 0a0be, 78426, a4935, f929e, 33911, 4bed2, 60cb6, dacf5, ff48a, 80f18, 47074, b5380, 5d018, 6a281, 8b752, 8b59e, 38c82, a6f13, 4b515, 8e47c, 23d1a, f27cc, a07b6, 80e18, 1550e, 8ecf0, 4ed5c, 595a2, 8b7ad, 58e54, a02ca, 0e782, c040e, dda59, e481a, a1efa, 4b826, 5eaa4, 0e8bd, 2b906, 01a0b, fb716, 94e61, 94412, fe82c, 973af, cb954, 5a980, 45e62, f2ca4, a92b4, b2e4f, 0119c, 6d856, a0985, a0aaf, 5ca0b, 64033, fc456, 84209, 6c84d, 87431, 554c7, fd9f7, 4904f, fcfca, f2af1, af9d0, 2bd6c, 4ac3a, c2e67, 433ce, 4ce2c, 47a02, e63b8, fd3c6, 8a1ad, eaaa7, 6c674, 5ddb6, 8cda9, 57d41, 0f2b8, 00316, 0fc21, db67d, b4977, c0bc4, 8d82c, 75163, 675b2, d401b, faab7, a1ad1, eea96, 2269e, 66db7, 1be0e, 31ceb, 7db77, 0a2ec, b3cf5, 0b718
2. [CUDA] Use runtime driver API for cuStreamWriteValue32: This pull request aims to update the PyTorch project by utilizing the CUDA runtime driver API for the cuStreamWriteValue32 function, addressing issues from previous pull requests and incorporating various code improvements such as refactoring, version checks, and lint fixes.
- URL: pull/158295
- Merged: No
- Associated Commits: 072f5, b4505, 05605, ac8f4, 7af44, bbdb8, 14a3f, 8b852, d17fd, 1057c, b8612, e37ab, f37d6, cb140, 84152, a5c26, 7d313, 449bc, 1d1c1, c4a63
3. [BE] Add pre-push hook for lintrunner to the PyTorch repo: This pull request introduces an optional pre-push hook for the PyTorch repository that automatically runs the lintrunner tool in a consistent, isolated virtual environment before every git push, ensuring code quality checks are performed on changes, with the ability to skip these checks if necessary.
- URL: pull/158389
- Merged: No
- Associated Commits: 887f9, 892e1, ae1fc, c3ec7, cbd7a, bdb15, 6bcb7, 24125, 3c479, 73da8, 9d4fd, 3c65f, b45b2, d9e4b, ea23d, 4ee6c, 16e54, dacdc, b0332, 0e533
Other Closed Pull Requests
- GenAI Layer Benchmarking: This pull request introduces a GenAI layer benchmark to compare the performance of various PyTorch execution modes, including eager execution, PyTorch compiler, Liger, and Quack. It evaluates all kernels supported by Quack, such as CrossEntropy, Softmax, RMSNorm, and LayerNorm, while addressing common benchmarking errors and providing a clear example for proper benchmarking practices.
- Deprecation of torch::deploy: Multiple pull requests focus on deprecating the torch::deploy feature in the open-source PyTorch project. They involve removing the test_deploy_interaction test, eliminating all torch._running_with_deploy checks, and removing the USE_DEPLOY flag, while noting that MyPy errors may need to be addressed separately.
- Docker Build Transition: This pull request addresses an issue with Docker builds by proposing a transition from using Miniconda to Miniforge. The change is due to problems related to the acceptance of Anaconda's Terms of Service, which caused build failures, and suggests using Miniforge as a solution since it does not require the same terms acceptance.
- CTC Loss API Enhancement: This pull request aims to enhance the Python API for CTC (Connectionist Temporal Classification) loss by adding support for raw, unnormalized inputs. It ensures backward compatibility with the existing log-simplex space and includes various commits addressing gradient checks, test case updates, and documentation improvements.
- merge_remote_group Implementation: This pull request involves a tentative implementation of the merge_remote_group function as part of the PyTorch project. It follows a proposal outlined in a linked Google document and includes multiple updates and discussions with contributors.
- Symbolic Integer Variables Handling: This pull request addresses the issue of redundant wrapping of symbolic integer variables (symInts) in PyTorch's dynamo. It ensures that these variables are properly tracked and de-duplicated in side effects, preventing the creation of multiple unique SymNodeVariable instances for the same symInt during subgraph speculation.
- Multi-Kernel Support for ROCm: This pull request aims to address and fix the issue of disabling multi-kernel support for ROCm in the PyTorch project. It involves multiple updates and contributions from various developers.
- Warnings Handling in PyTorch: This pull request addresses the issue of hiding warnings in the PyTorch project without invalidating the warnings cache. It implements a solution that avoids directly modifying the undocumented and private __warningregistry__, instead opting to suppress warnings in the torch/_dynamo module.
- PyObjectSlot and PyInterpreter Simplification: This pull request aims to simplify the PyObjectSlot and PyInterpreter by using a global PyInterpreter. It removes the "tags" of the PyInterpreter by deprecating PyInterpreterStatus and updates all call sites to rely on the assumption of a single global interpreter.
- DTensor copy_ Strategy Bug Fix: This pull request addresses a bug in the DTensor copy_ strategy by correcting the mapping of dimensions between the self and src inputs. It ensures proper functionality even when the src input is broadcasted, as demonstrated with specific sharding combinations for a given tensor example.
- NVIDIA Driver Compatibility: This pull request aims to address issues with older NVIDIA drivers, specifically version 525.*, by adding a periodic test in the CI that installs driver 525.105.17. It ensures CUDA can be initialized and the entire test suite runs successfully.
- DTensor Strategy Improvements: This pull request addresses several bugs in the DTensor's default strategy by fixing incorrect strategy returns and adjusting redistribute costs. It separates the copy_ operation from the default strategy and renames the function to propagate_single_input_strategy for clarity.
- ONNX Component Legacy Removal: Multiple pull requests aim to remove legacy components from the ONNX module in the PyTorch project. These changes are part of a series managed through the ghstack tool, although they were ultimately not merged.
- Registry Format Update: This pull request involves updating the registry format from JSON to YAML to store code samples for additional information. It introduces a new feature that opens a nano editor in the terminal for developers to input content when the --additional info command is executed.
- PyObjectSlot Modifications: This pull request proposes modifications to the PyObjectSlot by removing unnecessary components such as check_interpreter and has_pyobj_nonhermetic. It simplifies the codebase and reduces potential risks under the assumption that only a single interpreter is in use.
- Public API Type Issues: This pull request aims to address issues with the public API of types in the PyTorch project by attempting to use a new type statement. It also seeks alternative solutions to maintain compatibility with older Python versions, as evidenced by multiple commits exploring different approaches and linting adjustments.
- DTensor Documentation: This pull request aims to document the "redistribute_costs" feature within the DTensor module of the PyTorch project. It involves multiple commits and discussions with several contributors, although it was ultimately not merged.
- Inductor CI Test Skipping: This pull request addresses the issue of excessive time consumption by the test_torchinductor_opinfo.py unit test during the enabling of inductor CI on Windows. It temporarily skips the test while a solution for the compiler building speed on Windows is sought.
- ShardingPropagator Documentation: This pull request aims to add documentation to the ShardingPropagator.register_op_strategy function. It provides guidance on drafting strategy_func and determining when to use schema_info as part of the PyTorch project.
- Triton Integration Update: This pull request updates the Triton commit hash to the latest release/3.4.x branch and modifies the handling of the HAS_WARP_SPEC. It accommodates changes in Triton 3.4, where the warp spec interface now automatically determines the number of consumer groups.
- Inductor Imports Modification: This pull request aims to modify the PyTorch project by making the Inductor imports conditional on the TYPE_CHECKING flag. It is part of a stack of changes managed by ghstack, although it was ultimately not merged.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- DO NOT MERGE, testing lint failures and how they appear in the CI.
- Toxicity Score: 0.55 (Defensive responses, passive-aggressive remarks, unresolved technical issues.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing concern over potential issues in the code, while username2 responds with a defensive tone. The conversation escalates as username3 joins, supporting username1's concerns, which leads to username2 showing signs of frustration. The tone remains tense as the discussion continues, with users focusing on resolving the technical issues but occasionally slipping into passive-aggressive remarks.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| XuehaiPan | 292 | 35 | 1 | 15 |
| bobrenjc93 | 170 | 36 | 0 | 3 |
| malfet | 73 | 8 | 4 | 82 |
| atalman | 102 | 18 | 8 | 22 |
| williamwen42 | 35 | 9 | 4 | 84 |
| ezyang | 48 | 17 | 4 | 59 |
| wconstab | 52 | 10 | 0 | 64 |
| guangyey | 83 | 7 | 3 | 29 |
| coconutruben | 49 | 8 | 0 | 47 |
| jansel | 26 | 7 | 1 | 70 |