Weekly GitHub Report for PyTorch: September 15, 2025 - September 22, 2025 (12:06:16)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on X86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
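For readers updating code for 2.6, the two behavior changes called out above look roughly like this in practice; this is a minimal sketch, the checkpoint path is a placeholder, and weights_only=False should only be used for trusted files.

```python
import torch

# PyTorch 2.6 flips the torch.load default to weights_only=True, so checkpoints
# containing arbitrary pickled Python objects now need an explicit opt-out.
state = torch.load("checkpoint.pt")                       # hypothetical path; weights-only by default
legacy = torch.load("checkpoint.pt", weights_only=False)  # only for files you trust

# The new dynamic compilation control API can, for example, force eager execution
# globally while debugging compiled code paths.
torch.compiler.set_stance("force_eager")
```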
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- RuntimeError: Intel XPU device doesn't support querying free memory on PyTorch 2.9+xpu (LN/L Arc 140V): This issue reports a RuntimeError encountered when attempting to load models on an Intel Arc 140V GPU using PyTorch 2.9.0+xpu, where the device does not support querying free memory via torch.xpu.mem_get_info(). This limitation prevents models from being loaded into GPU memory as expected, and the user seeks guidance on driver versions or fixes to resolve this problem, especially when running under WSL2.
- The comment discussion centers on recommending the latest Intel rolling graphics driver installation and clarifying that WSL2 support is still experimental, so the issue may persist there. A temporary workaround involving monkey-patching the memory query function was shared (a sketch of such a workaround appears after this list), and users were advised to consider using the native Windows PyTorch version for better support.
- Number of comments this week: 6
- Flex Attention Decoding Silently Incorrect with H=None Broadcasting and Sliced mask_mod: This issue reports two bugs in the torch.nn.attention.flex_attention module that cause silently incorrect results during autoregressive decoding when using BlockMask slicing. The first bug involves incorrect broadcasting behavior when H=None is used, leading to significant errors, while the second bug concerns the mask_mod attribute becoming a no-op after slicing a BlockMask, which can cause subtle errors if users modify the sliced mask incorrectly.
- The discussion confirms the second issue is intentional behavior but suggests improving documentation and error handling to prevent silent failures; a proposal to replace the no-op mask_mod with an error-throwing function was positively received, and a pull request was encouraged to implement this improvement.
- Number of comments this week: 6
- Poor bf16 precision in backward torch.asin: This issue reports a precision discrepancy in the backward computation of the torch.asin function when using bfloat16 (bf16) on CPU compared to GPU, where the CPU implementation shows up to 2 units in the last place (ulp) of error. The root cause was identified as the CPU’s torch.rsqrt function for bf16 lacking intermediate fp32 precision, unlike the CUDA implementation, and a patch was proposed to improve accuracy by casting inputs to fp32 during computation (a short reproduction sketch appears after this list).
- The discussion involved debugging suggestions focusing on the derivative definition and isolating the precision problem to the rsqrt operation. A detailed demonstration confirmed the CPU’s rsqrt in bf16 is less accurate due to missing fp32 intermediate steps, unlike CUDA. A patch addressing this by introducing fp32 intermediate computation was created and positively reviewed, resolving the precision issue.
- Number of comments this week: 5
- [ci/cd] A couple of xpu nightly builds cancelled on 15/7: This issue reports that two XPU nightly builds were cancelled on July 15th due to timeouts during the continuous integration process, specifically during the linking stage of the build. The problem appears related to increased build times caused by recent upgrades to the XPU support package, which have extended the build duration and led to performance regressions in the CI runner.
- The comments discuss the cancellation being caused by timeouts, with observations that CUDA and ROCm builds also failed, possibly due to a recent PR or the long build times exceeding limits. It was confirmed that the XPU builds now take around 5.5 to 6 hours due to multiple AOT targets and package upgrades, and a rerun was planned while efforts to fix the build performance regression were underway.
- Number of comments this week: 5
- [dynamo] graph break when attempting to construct ConstDictVariable with LazyVariable members: This issue reports a graph break occurring in the Dynamo component when attempting to construct a ConstDictVariable containing LazyVariable members, resulting in a TypeError due to unhashable LazyVariableTracker objects. The user provides a stack trace highlighting the failure point and discusses potential fixes and improvements to error messaging with contributors, who suggest patches and plan further investigation after a related pull request merges.
- The comments include an initial suggestion that a missing .realize() call might be causing the issue, followed by a commitment to submit a fix. Further discussion clarifies that the problem may stem from inserting non-hashable objects, leading to a proposed patch to improve error messages by realizing the variable type. The user agrees with the approach, and contributors note that a deeper investigation will proceed after a related pull request is merged.
- Number of comments this week: 5
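Regarding the Intel XPU free-memory issue above, the shared workaround involved monkey-patching torch.xpu.mem_get_info. The sketch below is an illustrative reconstruction, not the exact patch from the thread, and assumes torch.xpu.get_device_properties reports a total_memory field.

```python
import torch

_orig_mem_get_info = torch.xpu.mem_get_info

def _patched_mem_get_info(device=None):
    try:
        return _orig_mem_get_info(device)
    except RuntimeError:
        # The driver cannot report free memory, so pretend all memory is free.
        total = torch.xpu.get_device_properties(device).total_memory
        return total, total

torch.xpu.mem_get_info = _patched_mem_get_info
```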
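For the torch.asin precision report above, a minimal comparison of the bf16 CPU backward against an fp32 reference looks roughly like this; the input range and size are illustrative.

```python
import torch

x = torch.linspace(-0.9, 0.9, 101, dtype=torch.bfloat16, requires_grad=True)
torch.asin(x).sum().backward()

# Reference derivative d/dx asin(x) = 1 / sqrt(1 - x^2), computed in fp32.
ref = torch.rsqrt(1 - x.detach().float() ** 2)
max_err = (x.grad.float() - ref).abs().max()
print(f"max abs error vs fp32 reference: {max_err.item():.6f}")
```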
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Ubuntu 22.04, and shares code snippets demonstrating the error occurring while compiling specific model components with torch.compile.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3 (a sketch of the equivalence appears after this list). This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, and preliminary testing shows a speedup of approximately 1.3 times compared to the standard approach.
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp having permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and directory permissions being correct.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, and managing preparatory fixes for known formatting conflicts, while also providing a detailed worklist organized by directory to coordinate contributions and track progress.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on resource-constrained devices like mobile.
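The MaxPool2D item above can be illustrated concretely. Below is a minimal sketch of the mathematical equivalence for stride 1, not the proposed in-layer implementation: two stacked 3x3 max pools cover the same 5x5 receptive field as a single 5x5 pool, because max is associative and idempotent.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)   # reference 5x5 pool
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)   # smaller building block

# For stride 1, composing two 3x3 pools reproduces the 5x5 result exactly.
assert torch.equal(pool5(x), pool3(pool3(x)))
```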
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 84
Summarized Issues:
- CUDA Distributed and Fused Operations Failures: Multiple issues report failures in CUDA distributed tests and fused operations, including AsyncTPTest failing due to improper MMA instruction targeting causing 100% output mismatch, and fused scaled matmul reduce-scatter tests failing due to accuracy mismatches. These problems highlight challenges in maintaining correctness and performance in distributed CUDA kernels and fused operations.
- Torch.compile and Inductor Backend Bugs: Several issues describe bugs and runtime errors when using torch.compile with the Inductor backend, such as failures with .mH views of complex tensors, incorrect gradient computations losing dependencies, errors with compiled higher-order operators under custom TorchDispatchMode, and layout conversion errors in flex_attention. These indicate ongoing instability and edge cases in the compilation and optimization pipeline.
- Memory and Performance Regressions: Issues report significant GPU memory spikes in torch.linalg.eigh in version 2.8.0 compared to 2.7.0, inefficient immediate scatter operations in parallelize_module, and build time regressions causing CI timeouts for XPU builds. These problems affect resource usage and CI reliability (a peak-memory measurement sketch for the eigh case appears after this list).
- Export and ONNX Compatibility Issues: Multiple issues describe failures and errors during model export to ONNX or PyTorch export APIs, including segmentation faults from mismatched dynamic_axes keys, inability to guard on data-dependent symbolic shapes, and errors exporting models with past_key_values or using torch.func.jvp. These highlight challenges in exporting complex models with dynamic or symbolic inputs.
- Distributed Tensor and DTensor Bugs: Issues report incorrect results from adding partial DTensors with scalars, wrong placements and outputs from inplace operations on DTensors, and inefficiencies in distributed tensor sharding due to immediate scatter operations. These bugs affect correctness and performance in distributed tensor computations.
- Testing and CI Failures and Flakiness: Several tests are reported as flaky or disabled due to consistent failures on specific platforms like ROCm and XPU, including test_ring_flex_attention, test_graph_partition_default_device_context, and test_partial_flat_weights. Additionally, lintrunner performance issues and CI workflow linting gaps are noted. These issues reduce test reliability and CI effectiveness.
- API and Functionality Enhancement Requests: Requests include adding support for torch.embedding to accept torch.uint8 indices, enhancing FSDPModule.set_requires_gradient_sync for finer control, adding custom determinism checks, and integrating Flash Dynamic Mask Attention as a new backend for scaled_dot_product_attention. These aim to improve flexibility, efficiency, and feature coverage.
- Global Interpreter Lock (GIL) and CUDA Interaction Issues: Issues report that torch.compile kernels and CUDA device tensor creation do not release the GIL as expected, causing potential thread parallelism limitations and implicit synchronized device-to-host copies while holding the GIL. These behaviors deviate from typical CUDA API expectations.
- Symbolic Tracing and Graph Partitioning Bugs: Bugs include failures in symbolic tracing due to decorator unwrapping mismatches, graph partitioning errors with custom mutating ops causing UnboundLocalError, and ConstDictVariable construction failures from unhashable LazyVariableTrackers. These issues hinder reliable graph transformations and tracing.
- Documentation and Build System Issues: Documentation errors include incorrect default values for align_corners in interpolation and outdated CUDA version info in "Get Started Locally". Build system problems include missing headers causing Vulkan backend build failures, Python environment misdetection in CMake/ninja, and excessive CI git tag noise. These reduce usability and build reliability.
- Stable API and C++ Header Anti-patterns: Issues highlight the use of anonymous namespaces inside stable API headers causing symbol overloading and propose adding helper utilities for boxing wrappers to reduce manual code maintenance. These address code quality and maintainability concerns in the C++ API.
- Type Checking and Type Annotation Problems: Problems include learning rate schedulers aliasing optimizer learning rates causing corruption, and a proposal to migrate from mypy to Pyrefly for improved type checking performance and features. These affect code correctness and developer experience.
- Kernel Compilation and Runtime Improvements: Progress and issues are reported on the experimental torch.cuda._compile_kernel API for inline kernel compilation, including cross-platform support and integration with torch.compile. Additionally, a bug in the inductor runtime related to TRITON_CACHE_DIR environment variable handling may cause cache replacement conflicts.
- Attention and Masking Bugs: Bugs in torch.nn.attention.flex_attention cause silently incorrect results during autoregressive decoding with BlockMask slicing, including broadcasting errors and no-op mask_mod attributes after slicing. These lead to silent failures in attention computations.
- Exported Program and Device Metadata Issues: Casting exported programs to different devices causes runtime assertion failures due to hardcoded device metadata in IR, breaking device-agnostic export introduced in version 2.8. This limits flexibility in model deployment.
- Miscellaneous Bugs and Crashes: Other reported bugs include segmentation faults in torch.nn.MaxUnpool3d with complex and uint32 inputs, build failures with rocWMMA on ROCm due to float to half conversion issues, and runtime crashes in AOTAutograd view replay due to non-contiguous aliased base tensors. These affect stability across various components.
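For the torch.linalg.eigh memory-spike group above, peak GPU memory for a single call can be observed as in this sketch; the matrix size is illustrative and not taken from the issue.

```python
import torch

A = torch.randn(8192, 8192, device="cuda")
A = A @ A.mT  # symmetrize so eigh's assumptions hold

torch.cuda.reset_peak_memory_stats()
w, v = torch.linalg.eigh(A)
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```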
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 21
Summarized Issues:
- Test Failures on XPU Platform: Multiple tests including test_libtorch_free_so_xpu, test_automatic_dynamo_graph_breaks_device_xpu, test_lazy_backward_backend_inductor_device_xpu, test_lazy_backward_backend_eager_device_xpu, and test_graph_break_bomb_backend_inductor_device_xpu have been consistently failing on the main branch for the xpu platform, leading to their disabling. These recurring failures indicate stability issues with the xpu platform tests that affect the reliability of the test suite.
- CI and Test Infrastructure Issues: The PyTorch CI pipeline's run_test.py suppresses test failure errors by returning a success exit code due to rerun and skip logic, masking actual test failures and causing inaccurate test results. Additionally, nvshmem_triton tests broke due to a Triton update causing key errors and compilation failures that were not properly caught by the CI system.
- Documentation and Usability Improvements: There is a need for clearer explanations and additional examples regarding the "out" parameter in PyTorch tensor documentation, along with a suggestion to rename the title for better clarity to improve user understanding.
- Sparse Tensor Format Implementation: A new sparse tensor format is being proposed with a request for guidance on adding a method to the tensor object to enable conversion to this new format via a call like A.to_new_format(). This indicates ongoing development to extend PyTorch's tensor capabilities.
- CUDA and Build Failures: Windows CUDA builds have been failing due to a missing identifier __builtin_clz in CUDA source files, and manywheel CUDA builds on Linux have been failing due to missing CUDA header files like cuda_runtime.h, likely caused by recent submodule updates. These build issues are blocking successful compilation and deployment.
- Learning Rate Scheduler Bugs: The ReduceLROnPlateau._reduce_lr method triggers unnecessary recompilation when the learning rate is set as a float instead of a tensor, causing performance inefficiencies. Also, the SWALR scheduler improperly serializes its anneal_func attribute, causing an UnpicklingError when loading its state_dict with weights_only=True, leading to checkpoint loading failures.
- Attention Masking Implementation Concern: There is a concern about whether the torch.nn.functional.scaled_dot_product_attention function correctly applies attention masking by adding -infinity to masked positions in the attention bias, which is critical for excluding those positions during the softmax operation. This raises questions about the correctness of the attention mechanism implementation.
- MPS Backend Error Handling: The max_unpool2d and max_unpool3d functions do not raise errors for invalid indices on the MPS backend, unlike CPU and CUDA where invalid indices correctly trigger runtime errors, indicating inconsistent error handling across backends.
- PyTorch Compile Tuple Assignment Bug: Using torch.compile causes a TypeError when directly assigning a tuple slice from torch.linalg.qr to a variable, while tuple unpacking works correctly, highlighting a discrepancy between compiled and eager execution modes (a sketch of the pattern appears after this list).
- FSDP Example and Tutorial Development: A tutorial script for sentiment classification on the IMDB dataset using Fully Sharded Data Parallel (FSDP) has been introduced to help new users understand FSDP through practical experience, including support for checkpointing on distributed multi-GPU systems.
- CUDA Out-of-Memory Error in Distributed Process Group: An ncclUnhandledCudaError occurs during distributed.destroy_process_group with a CUDA out-of-memory error despite monitoring only 8 MB of GPU memory usage, causing confusion about the underlying cause of the failure.
- Global Namespace Pollution in API Headers: A stable API header contains a using declaration that pollutes the global namespace, causing ambiguous symbol errors when PyTorch headers are included before other libraries like CUTLASS, leading to integration issues.
- FSDP Implicit Prefetching Issue: In the PyTorch FSDP2 example, implicit prefetching does not work as expected because the all-gather operation does not overlap with computation due to an unintended dependency caused by memory access patterns. This was traced to the environment variable CUDA_DEVICE_MAX_CONNECTIONS=1 limiting CUDA connections, resulting in delayed GPU kernel execution.
- CUDA Version Support Inquiry: A user inquires whether PyTorch supports CUDA 3.0 for their GPU server, seeking information on the minimum CUDA version required to run PyTorch.
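For the tuple-assignment report above, the contrast between the two access patterns looks roughly like this; the exact reproducer in the issue may differ, so treat this as an assumed illustration of the pattern.

```python
import torch

x = torch.randn(3, 3)

@torch.compile
def unpacked(x):
    q, r = torch.linalg.qr(x)    # tuple unpacking: reported to work
    return q @ r

@torch.compile
def sliced(x):
    qr = torch.linalg.qr(x)[:2]  # assigning a slice of the returned tuple
    return qr[0] @ qr[1]

unpacked(x)  # fine
sliced(x)    # reported to raise a TypeError under torch.compile
```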
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 207
Key Open Pull Requests
1. Fix for ValueError: ProcessGroupXCCL::gather: invalid tensor type at index 0 : This pull request addresses a ValueError in ProcessGroupXCCL::gather caused by a device mismatch where tensors participating in the gather operation must be on the XPU device but were incorrectly gathered into a CPU tensor, and it includes fixes to ensure proper tensor device placement for distributed operations on XPU.
- URL: pull/163362
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55, 7dade, 580aa, c0f57, 6dedb, 20d07, 124ff, 0bea1, 7409a, 636cb, 0d5a8, 3826e, cb711, d6cd1, 624be, 8d8c5, 1cf78, 41475, 0628c, b0d93, 58eb8, e558e, 83ac5, 0e7a7, a2b2f, 8de00, 91f5d, 39e6c, 06d6c, a0590, 31ddf, 9a6df, 50cb9, 08559, f6a8c, dc0be, 9fd7e, cce98, cf3bf, e3957, cd29a, 7b81c, 42401, 1dee9, 815db, bae7d, c283c, f7846, af361, 22cc1, 482f1, 9818e, 57d20, 2a5f3, 9f0c5, bb73a, ec4ea, 084ee, 70689, 491cc, 4a2d0, a6164, 12663, a03d1, 7da46, cfe26, 4d68d, 4c74d, 42219, 267bd, 36291, b6a17, 98cbe, 7245c, 878e1, 948b9, 2176c, f632e, 84ccf, 97d4c, 07bf4, 1e846, 493b2, 9b3f5, bf100, a0522, 6c84e, d7fcd, 14f79, da233, 04b39, 95ef9, ab42f, a1c29, 62c60, 94863, 0bb6d, 79c00, c8938, 976a6, 65292, 2e2d1, 4d289, a4627, 48e56, 02f1c, 09f47, 04fb8, e1dd6, c36f7, 1a3bd, ad3f9, 164bd, afa17, 32120, b293c, 339c0, 8e8c0, be778, caade, 380e3, dcd90, 9d255, 20dee, 6049b, e2a43, ec50d, 64864
2. [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth: This pull request introduces a library for querying device hardware limits such as FLOPS and memory bandwidth on CUDA devices, aiming to replace hardcoded values in benchmarks with a structured utility that can be extended to support more devices and parameters for both CPUs and accelerators.
- URL: pull/162942
- Merged: No
- Associated Commits: 36ea5, 23695, a6bf5, 9e5f4, 5633b, f3ab8, 86931, 815a5, 8d90d, d4796, 4d68f, 9a7ae, 8bfaa, a4333, d030e, 0d65d, dc8ae, 29e96, b03fa, 74ddd, 4ac03, 8c8be, f03b2, 66c59, 0d0f2, 7e2df, 5c782, c3dc3, 64f2d, e1e07, 007a4, b0b04
3. [RFC]: No Distributed Log Spew: This pull request proposes a novel, more intrusive approach to prevent excessive logging ("log spew") in PyTorch distributed environments by monkey patching the logging, warning, and optionally print functions to ensure that log statements only execute on the rank 0 process after distributed initialization, thereby making all logging statements distributed-aware without relying on linters or piecemeal code fixes.
- URL: pull/162999
- Merged: No
- Associated Commits: d093a, 606c4, eca1b, 607f6, fbdb9, c94cc, da969, cb062, 074af, d1a64, ff61b, 9c85b, 259bb, bb9d8, e105c
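The third key pull request above proposes making logging distributed-aware by patching the logging machinery itself. A minimal sketch of that general idea, which is not the PR's actual implementation, might look like this.

```python
import logging
import torch.distributed as dist

_orig_handle = logging.Logger.handle

def _rank0_only_handle(self, record):
    # Drop log records on non-zero ranks once the process group is initialized.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    _orig_handle(self, record)

logging.Logger.handle = _rank0_only_handle
```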
Other Open Pull Requests
- Cutlass library update: This pull request updates the Cutlass library version used in the fbcode environment of the PyTorch project, incorporating multiple commits to ensure the integration is pristine and up-to-date. It ensures that the latest improvements and fixes in Cutlass are reflected in the PyTorch build environment.
- [pull/163091]
- C++ migration of DTensor components: Multiple pull requests introduce C++ implementations and improvements for DTensor functionality, including the compute_global_tensor_info function and access to Placement data via pybind11 bindings. These changes represent incremental progress toward migrating DTensor operations from Python to C++ for better performance and interoperability.
- [pull/162990, pull/163030, pull/163031]
- Size-hint multi-kernel selection improvements: This pull request introduces a variant of the size-hint multi-kernel selection method that chooses an optimal pre-generated kernel based on shape similarity measured by L1 distance in log2 space for novel runtime shapes (a small illustration of the metric appears after this list). It also adjusts activation conditions and limits dimensionality for pre-generation searches to improve kernel selection efficiency.
- [pull/163090]
- Torchfuzz introduction: This pull request introduces the initial implementation of torchfuzz, a fuzz testing tool for PyTorch operations, with plans to transition from a stack-based to a graph-based representation. This tool aims to improve testing coverage and robustness of PyTorch operations by handling increasing complexity more effectively.
- [pull/163417]
- Argument parser bug fix: This pull request addresses a bug in PyTorch's argument parser where methods with a single sequence argument, such as tensor.reshape(), incorrectly ignored additional arguments when the first argument was a tuple. The fix extends to other similar methods and includes unit tests to ensure correct argument handling and prevent silent bugs.
- [pull/163081]
- DeviceMesh refactoring with CuTe layout: This pull request refactors DeviceMesh internal bookkeeping by leveraging the CuTe layout to simplify and generalize index operations, replacing dimension-mapping methods with layout-based approaches. It introduces new functions for layout overlap checking and remapping, enabling slicing and flattening of non-contiguous dimensions without changing existing behavior.
- [pull/163213]
- Testing for compute_global_tensor_info: This pull request adds basic tests for the function torch.distributed.tensor._utils.compute_global_tensor_info in preparation for its subsequent C++ implementation. These tests help ensure correctness before migrating the function to C++.
- [pull/162968]
- Learning rate scheduler aliasing fix: This pull request prevents unintended aliasing between self._last_lr and base_lrs with tensor learning rates in optimizer parameter groups, updates type annotations to support floats and tensors, enhances documentation, and adds tests to ensure correct behavior with tensor learning rates.
- [pull/163120]
- GPU health monitoring integration: This pull request integrates GPU health monitoring functionality from the NVIDIA Resiliency Extension into PyTorch’s distributed elastic training system. It provides comprehensive GPU health checks, recovery action detection, thread-safe asynchronous monitoring, and includes tests, documentation, and usage examples to enhance robustness.
- [pull/163192]
- Hessian function optimization: This pull request introduces an opt-in is_scalar argument to the hessian function that treats scalar functions separately, resulting in a 15-20% CPU and CUDA speedup without changing default behavior. The change is supported by extensive benchmarking and additional unit tests to prevent regressions.
- [pull/162915]
- ROCm MI350 GPU kernel autotuning: This pull request introduces heuristic improvements and additional autotuning configurations for pointwise kernels targeting the MI350 GPU on the ROCm platform. The goal is to enhance performance by optimizing kernel grid configurations and increasing maximum block size for better execution efficiency.
- [pull/163197]
- Broadcast_in_dim decomposition: This pull request proposes adding a decomposition implementation for the prims.broadcast_in_dim operation in PyTorch, addressing issue #163037. This aims to improve the modularity and maintainability of the operation.
- [pull/163377]
- Python dispatch documentation improvements: This pull request proposes slight improvements to the documentation in the python_dispatch module to clarify the direction of stack iteration, addressing the author's initial confusion.
- [pull/162963]
- User-streams aliasing handling: This pull request aims to properly handle aliasing in the user-streams component of PyTorch, as indicated by a series of commits updating and refining this functionality.
- [pull/163028]
- High priority streams support in ProcessGroupXCCL: This pull request adds support for high priority streams in ProcessGroupXCCL, enabling XPU streams to execute with higher priority similar to CUDA streams, and includes registration of this feature.
- [pull/163049]
- Test updates for dtensor compile: This pull request refactors the test_dtensor_compile by updating tests to accommodate the removal of previously unsupported usage involving FakeStore and init_process_group with a "fake" backend, ensuring compatibility with the current implementation.
- [pull/163058]
- Model recompilation on memory alignment change: This pull request ensures that the model is recompiled when memory alignment changes between runs, specifically addressing a RuntimeError caused by incorrect attn_bias alignment when running torch.compile with different sequence lengths. It adds proper guards to detect alignment differences and trigger recompilation.
- [pull/163083]
- ROCm docker images upgrade: This pull request proposes upgrading all ROCm docker images in PyTorch to the ROCm 7.0.1 release version, including updates to related dependencies such as UCX and UCC, while removing prior alpha cases.
- [pull/163140]
- Intel GPU TF32 support API: This pull request introduces a new API torch.xpu.is_tf32_supported for Intel GPUs to check hardware support for TF32 matrix multiplication acceleration. This aligns with other backends and enables informed use of torch.backends.mkldnn.allow_tf32=True or hardware capability queries for Triton.
- [pull/163141]
- Mark_dynamic name argument support: This pull request introduces support for a name keyword argument in the mark_dynamic function to improve ergonomics by enabling symbol sharing without relying on the complex torch._check paradigm. It addresses issues related to enforcing sequence length consistency in KV cached tensors during dynamic shape tracing.
- [pull/163246]
- MPS backend embedding bag improvements: This pull request proposes computing the offset2bag, bag_size, and max_indices parameters within the _embedding_bag function in the MPS backend to improve embedding bag operations.
- [pull/163281]
- Performance testing preparation for PyTorch 2.9: This pull request prepares performance testing for PyTorch version 2.9 by reinstalling the release candidate, rebuilding dependent packages torchrec and fbgemm due to the absence of RC wheels, and includes several commits aimed at benchmarking and debugging without requiring review.
- [pull/163334]
- Placement type accessibility in C++: This pull request aims to improve the performance of the compute_global_tensor_info function by making the Placement type accessible in C++, addressing overhead involved in unwrapping Placement via pybind11 and considering nanobind to reduce this overhead.
- [pull/163031]
- Device argument support in get_rng_state: This pull request adds support for passing a device argument—either as a device object, string, or integer—to the torch.random.get_rng_state function, enabling retrieval of the random number generator state for the specified device. It includes tests to verify this functionality.
- [pull/163034]
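To illustrate the shape-similarity metric described in the size-hint multi-kernel item above, a hypothetical selection helper could look like this; the function name and candidate shapes are invented for the example and do not come from the pull request.

```python
import math

def log2_l1_distance(shape_a, shape_b):
    """L1 distance between shapes in log2 space, per the heuristic described above."""
    return sum(abs(math.log2(a) - math.log2(b)) for a, b in zip(shape_a, shape_b))

# Hypothetical pre-generated kernels keyed by the size hints they were tuned for.
candidates = {(1024, 512): "kernel_a", (4096, 4096): "kernel_b"}

runtime_shape = (2048, 1024)
best_hint = min(candidates, key=lambda hint: log2_l1_distance(hint, runtime_shape))
print(candidates[best_hint])  # kernel_a (closer in log2 space)
```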
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 171
Key Closed Pull Requests
1. compile_kernel enable pch: This pull request implements support for enabling automatic precompiled headers (PCH) in CUDA kernel compilation within PyTorch, demonstrating significant average compilation speed improvements through benchmarking while noting that maximum compilation times with PCH can be higher, which currently prevents enabling PCH by default.
- URL: pull/162972
- Merged: No
- Associated Commits: 0ec2a, 7a334, 834ab, 5c94b, 506a4, 10453, edbbc, 34aa6, 00525, b9f3d, d04fb, a1080, 344f2, cdff0
2. [Inductor-FX] Support torch.cond: This pull request introduces support for the torch.cond operation in the PyTorch FX converter by implementing a new ConditionalLine wrapper IR line to represent conditionals, generating corresponding torch.ops.higher_order.cond nodes in the FX IR, and adapting the FX backend's code generation and graph input/output handling to properly manage subgraphs and conditionals, along with adding tests for this feature (a brief torch.cond usage sketch appears after this list).
- URL: pull/163234
- Merged: No
- Associated Commits: 90635, 8cadb, 87d26, f274a, 45a61, ebf25, dd8aa, 7fb15, a900c, 32233, 2d41d, 3e07d, cc09a, a94d4
3. Just Sample upload and update: This pull request is about updating the transformers dependency from version 4.54.0 to 4.56.1 and adding various continuous integration workflow files, although it was not merged.
- URL: pull/163172
- Merged: No
- Associated Commits: ad19e, 49a51, 450d9, dc544, c21d9, 4ae32, b9b39, 50a26, 41eb7, e18bc, 77e42, efc52, b4383
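As context for the torch.cond converter work in the second key pull request above, basic usage of the higher-order conditional looks like this; a minimal sketch unrelated to the converter internals.

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(pred, x):
    # torch.cond lowers to torch.ops.higher_order.cond nodes in the traced graph.
    return torch.cond(pred, true_fn, false_fn, (x,))

out = torch.compile(f)(torch.tensor(True), torch.randn(4))
```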
Other Closed Pull Requests
- Triton Inductor Blackwell-specific matrix multiplication: This pull request adds a persistent matrix multiplication template tailored for Blackwell architecture in the Triton Inductor, incorporating device-side TMA and new Triton features like automatic warp specialization and loop flattening to enhance performance. It excludes epilogue subtiling and tuning, with tests confirming functionality on a Blackwell machine.
- Fullgraph mode feature enhancements: Two pull requests propose enabling capture_dynamic_output_shape_ops and capture_scalar_outputs features when fullgraph mode is true, aiming to improve dynamic output shape handling and scalar output capture during full graph capture in PyTorch. These changes enhance the behavior and flexibility of the full graph capture mechanism.
- ROCm codebase cleanup: This pull request removes the HIPBLASLT_ALLOW_TF32 flag along with all related code, documentation, environment variable manipulations, and test dependencies to address several unit test failures in the ROCm codebase.
- CI NVIDIA driver update and NUMBA regression fix: This pull request updates the NVIDIA driver to version 580.82.07 in the CI environment to enable CUDA-13 tests and applies a live patch to address a regression in NUMBA integration as recommended in a related issue discussion.
- FSDP reset_sharded_param idempotency: This pull request makes the reset_sharded_param function in the Fully Sharded Data Parallel module idempotent by checking the storage data pointer to no-op if the local tensor is already padded, preventing redundant resets. Unit tests are included to verify this behavior.
- Flex module backward configuration and performance boost: This pull request updates the backward configuration setup and the default b200 configuration in the Flex module, resulting in up to a 4x performance improvement in backward pass throughput for various attention types and tensor shapes, supported by detailed benchmarking.
- MPS backend embedding_bag forward pass: This pull request adds the forward pass implementation for the embedding_bag operation in the MPS backend of PyTorch.
- clone_meta stride semantics fix: This pull request fixes a bug by ensuring the clone_meta function matches the stride semantics of eager mode tensors with preserve_format. It handles non-overlapping dense tensors by copying input strides and computes contiguous strides for other cases (a short illustration of the eager semantics appears after this list).
- Autograd mutation warnings workaround: This pull request proposes a substantial workaround to suppress warnings or errors from PyTorch's autograd system related to mutations in the threaded process group, as detailed in the accompanying note.
- Inductor floor division operator replacement: This pull request replaces more instances of the floor division operator // with the explicit FloorDiv operation in the inductor code to improve symbolic division representation and reasoning.
- PyTorch Dynamo stack trace preservation: This pull request enhances PyTorch Dynamo by enabling it to preserve stack traces correctly when working with inlined neural network modules, improving debugging and error tracking.
- Export_db tracer update and optional input removal: This pull request moves the export_db functionality to use a new tracer and removes the restriction on optional inputs to improve flexibility and tracing capabilities.
- CUTLASS submodule upgrade and renaming: This pull request upgrades the CUTLASS submodule to version 4.2.0 and renames references from "cutlass" to "cutlass_cppgen" within the PyTorch project.
- cusparseLt buffer allocation fix: This pull request ensures buffer allocations precisely match computed sizes from cusparseLt compression metadata by encoding original dimensions and using computed compressed sizes directly, preventing under-allocation and fixing potential bugs related to metadata size changes.
- Filesystem usage standardization: This pull request redirects all uses of filesystem functionality to the c10/utils/FileSystem.h header to standardize and possibly improve file system operations.
- assert_tensor_metadata decomposition for BatchedTensors: This pull request adds a decomposition rule to the assert_tensor_metadata aten operator to ensure compatibility with BatchedTensors by skipping them and applying the operator to the underlying tensor, fixing issues with Vmap fallback logic during device moves.
- Learning rate scheduler tensor aliasing fix: This pull request prevents problematic tensor aliasing in learning rate schedulers like SequentialLR and ReduceLROnPlateau by introducing a helper function to safely update parameter group values, fixing multiple related bugs and improving robustness.
- Fallback Kernel output aliasing handling: This pull request improves handling of output aliasing in custom operations with multiple outputs within the Fallback Kernel by correctly identifying outputs from mutating ops that should not alias inputs, allowing earlier deallocation of intermediate buffers without affecting downstream usage.
- Global namespace using declarations prohibition: This pull request addresses the prohibition of using declarations in the global namespace within stable header files by configuring clang-tidy checks and updating code to fix lint errors related to global namespace usage.
- DeviceMesh initialization update: This pull request proposes replacing the usage of DeviceMesh with init_device_mesh in backend code to improve or update device mesh initialization, but it has not been merged.
- Documentation typo fixes: This pull request proposes small typo fixes in the PyTorch project documentation but was not merged.
- AOT metadata collection optimization: This pull request optimizes the AOT metadata collection pass by reducing the number of __torch_dispatch__ calls in FakeTensorMode through using the saved device on storage instead of device_custom.
- Functional export naming and closure fix: This pull request fixes issues in functional export of PyTorch modules by unifying the wrapper’s self argument naming to prevent collisions and re-injecting closure variables into the wrapper’s local scope to ensure Dynamo’s guard builder can detect and use them during compilation.
- Communication ops visibility in schedule IR: This pull request adds a small utility change to enable visibility of communication operations such as SEND and RECV within the schedule intermediate representation (IR).
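For the clone_meta item above, the eager-mode semantics being matched can be seen directly: clone with preserve_format copies the strides of a non-overlapping dense input rather than producing a contiguous result. A minimal sketch:

```python
import torch

x = torch.randn(4, 6).t()   # non-contiguous but dense, non-overlapping view
y = x.clone(memory_format=torch.preserve_format)

print(x.stride(), y.stride())   # identical strides in eager mode
print(y.is_contiguous())        # False
```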
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| malfet | 100 | 17 | 6 | 110 |
| coconutruben | 191 | 24 | 0 | 11 |
| swolchok | 143 | 25 | 1 | 21 |
| ezyang | 61 | 20 | 10 | 80 |
| huydhn | 126 | 12 | 2 | 25 |
| kwen2501 | 103 | 29 | 2 | 29 |
| tugsbayasgalan | 112 | 30 | 0 | 15 |
| Skylion007 | 9 | 4 | 0 | 121 |
| anijain2305 | 75 | 13 | 5 | 29 |
| fduwjj | 65 | 15 | 0 | 39 |