Weekly GitHub Report for PyTorch: September 15, 2025 - September 22, 2025 (12:06:16)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on X86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
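For readers updating code for 2.6, the two behavior changes called out above look roughly like this in practice; this is a minimal sketch, the checkpoint path is a placeholder, and weights_only=False should only be used for trusted files.

```python
import torch

# PyTorch 2.6 flips the torch.load default to weights_only=True, so checkpoints
# containing arbitrary pickled Python objects now need an explicit opt-out.
state = torch.load("checkpoint.pt")                       # hypothetical path; weights-only by default
legacy = torch.load("checkpoint.pt", weights_only=False)  # only for files you trust

# The new dynamic compilation control API can, for example, force eager execution
# globally while debugging compiled code paths.
torch.compiler.set_stance("force_eager")
```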
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- RuntimeError: Intel XPU device doesn't support querying free memory on PyTorch 2.9+xpu (LN/L Arc 140V): This issue reports a RuntimeError encountered when attempting to load models on an Intel Arc 140V GPU using PyTorch 2.9.0+xpu, where the device does not support querying free memory via torch.xpu.mem_get_info(). This limitation prevents models from being loaded into GPU memory as expected, and the user seeks guidance on driver versions or fixes to resolve this problem, especially when running under WSL2.
- The comment discussion centers on recommending the latest Intel rolling graphics driver installation and clarifying that WSL2 support is still experimental, so the issue may persist there. A temporary workaround involving monkey-patching the memory query function was shared (a sketch of such a workaround appears after this list), and users were advised to consider using the native Windows PyTorch version for better support.
- Number of comments this week: 6
- Flex Attention Decoding Silently Incorrect with H=None Broadcasting and Sliced mask_mod: This issue reports two bugs in the torch.nn.attention.flex_attention module that cause silently incorrect results during autoregressive decoding when using BlockMask slicing. The first bug involves incorrect broadcasting behavior when H=None is used, leading to significant errors, while the second bug concerns the mask_mod attribute becoming a no-op after slicing a BlockMask, which can cause subtle errors if users modify the sliced mask incorrectly.
- The discussion confirms the second issue is intentional behavior but suggests improving documentation and error handling to prevent silent failures; a proposal to replace the no-op mask_mod with an error-throwing function was positively received, and a pull request was encouraged to implement this improvement.
- Number of comments this week: 6
- Poor bf16 precision in backward torch.asin: This issue reports a precision discrepancy in the backward computation of the torch.asin function when using bfloat16 (bf16) on CPU compared to GPU, where the CPU implementation shows up to 2 units in the last place (ulp) of error. The root cause was identified as the CPU’s torch.rsqrt function for bf16 lacking intermediate fp32 precision, unlike the CUDA implementation, and a patch was proposed to improve accuracy by casting inputs to fp32 during computation (a short reproduction sketch appears after this list).
- The discussion involved debugging suggestions focusing on the derivative definition and isolating the precision problem to the rsqrt operation. A detailed demonstration confirmed the CPU’s rsqrt in bf16 is less accurate due to missing fp32 intermediate steps, unlike CUDA. A patch addressing this by introducing fp32 intermediate computation was created and positively reviewed, resolving the precision issue.
- Number of comments this week: 5
- [ci/cd] A couple of xpu nightly builds cancelled on 15/7: This issue reports that two XPU nightly builds were cancelled on July 15th due to timeouts during the continuous integration process, specifically during the linking stage of the build. The problem appears related to increased build times caused by recent upgrades to the XPU support package, which have extended the build duration and led to performance regressions in the CI runner.
- The comments discuss the cancellation being caused by timeouts, with observations that CUDA and ROCm builds also failed, possibly due to a recent PR or the long build times exceeding limits. It was confirmed that the XPU builds now take around 5.5 to 6 hours due to multiple AOT targets and package upgrades, and a rerun was planned while efforts to fix the build performance regression were underway.
- Number of comments this week: 5
- [dynamo] graph break when attempting to construct ConstDictVariable with LazyVariable members: This issue reports a graph break occurring in the Dynamo component when attempting to construct a ConstDictVariable containing LazyVariable members, resulting in a TypeError due to unhashable LazyVariableTracker objects. The user provides a stack trace highlighting the failure point and discusses potential fixes and improvements to error messaging with contributors, who suggest patches and plan further investigation after a related pull request merges.
- The comments include an initial suggestion that a missing .realize() call might be causing the issue, followed by a commitment to submit a fix. Further discussion clarifies that the problem may stem from inserting non-hashable objects, leading to a proposed patch to improve error messages by realizing the variable type. The user agrees with the approach, and contributors note that a deeper investigation will proceed after a related pull request is merged.
- Number of comments this week: 5
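Regarding the Intel XPU free-memory issue above, the shared workaround involved monkey-patching torch.xpu.mem_get_info. The sketch below is an illustrative reconstruction, not the exact patch from the thread, and assumes torch.xpu.get_device_properties reports a total_memory field.

```python
import torch

_orig_mem_get_info = torch.xpu.mem_get_info

def _patched_mem_get_info(device=None):
    try:
        return _orig_mem_get_info(device)
    except RuntimeError:
        # The driver cannot report free memory, so pretend all memory is free.
        total = torch.xpu.get_device_properties(device).total_memory
        return total, total

torch.xpu.mem_get_info = _patched_mem_get_info
```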
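For the torch.asin precision report above, a minimal comparison of the bf16 CPU backward against an fp32 reference looks roughly like this; the input range and size are illustrative.

```python
import torch

x = torch.linspace(-0.9, 0.9, 101, dtype=torch.bfloat16, requires_grad=True)
torch.asin(x).sum().backward()

# Reference derivative d/dx asin(x) = 1 / sqrt(1 - x^2), computed in fp32.
ref = torch.rsqrt(1 - x.detach().float() ** 2)
max_err = (x.grad.float() - ref).abs().max()
print(f"max abs error vs fp32 reference: {max_err.item():.6f}")
```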
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Ubuntu 22.04, and shares code snippets demonstrating the error occurring while compiling specific model components with torch.compile.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3 (a sketch of the equivalence appears after this list). This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, and preliminary testing shows a speedup of approximately 1.3 times compared to the standard approach.
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp having permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and directory permissions being correct.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, and managing preparatory fixes for known formatting conflicts, while also providing a detailed worklist organized by directory to coordinate contributions and track progress.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on resource-constrained devices like mobile.
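The MaxPool2D item above can be illustrated concretely. Below is a minimal sketch of the mathematical equivalence for stride 1, not the proposed in-layer implementation: two stacked 3x3 max pools cover the same 5x5 receptive field as a single 5x5 pool, because max is associative and idempotent.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)   # reference 5x5 pool
pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)   # smaller building block

# For stride 1, composing two 3x3 pools reproduces the 5x5 result exactly.
assert torch.equal(pool5(x), pool3(pool3(x)))
```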
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 84
Summarized Issues:
- CUDA Distributed and Fused Operations Failures: Multiple issues report failures in CUDA distributed tests and fused operations, including AsyncTPTest failing due to improper MMA instruction targeting causing 100% output mismatch, and fused scaled matmul reduce-scatter tests failing due to accuracy mismatches. These problems highlight challenges in maintaining correctness and performance in distributed CUDA kernels and fused operations.
- Torch.compile and Inductor Backend Bugs: Several issues describe bugs and runtime errors when using torch.compile with the Inductor backend, such as failures with .mH views of complex tensors, incorrect gradient computations losing dependencies, errors with compiled higher-order operators under custom TorchDispatchMode, and layout conversion errors in flex_attention. These indicate ongoing instability and edge cases in the compilation and optimization pipeline.
- Memory and Performance Regressions: Issues report significant GPU memory spikes in torch.linalg.eigh in version 2.8.0 compared to 2.7.0, inefficient immediate scatter operations in parallelize_module, and build time regressions causing CI timeouts for XPU builds. These problems affect resource usage and CI reliability (a peak-memory measurement sketch for the eigh case appears after this list).
- Export and ONNX Compatibility Issues: Multiple issues describe failures and errors during model export to ONNX or PyTorch export APIs, including segmentation faults from mismatched dynamic_axes keys, inability to guard on data-dependent symbolic shapes, and errors exporting models with past_key_values or using torch.func.jvp. These highlight challenges in exporting complex models with dynamic or symbolic inputs.
- Distributed Tensor and DTensor Bugs: Issues report incorrect results from adding partial DTensors with scalars, wrong placements and outputs from inplace operations on DTensors, and inefficiencies in distributed tensor sharding due to immediate scatter operations. These bugs affect correctness and performance in distributed tensor computations.
- Testing and CI Failures and Flakiness: Several tests are reported as flaky or disabled due to consistent failures on specific platforms like ROCm and XPU, including test_ring_flex_attention, test_graph_partition_default_device_context, and test_partial_flat_weights. Additionally, lintrunner performance issues and CI workflow linting gaps are noted. These issues reduce test reliability and CI effectiveness.
- API and Functionality Enhancement Requests: Requests include adding support for torch.embedding to accept torch.uint8 indices, enhancing FSDPModule.set_requires_gradient_sync for finer control, adding custom determinism checks, and integrating Flash Dynamic Mask Attention as a new backend for scaled_dot_product_attention. These aim to improve flexibility, efficiency, and feature coverage.
- Global Interpreter Lock (GIL) and CUDA Interaction Issues: Issues report that torch.compile kernels and CUDA device tensor creation do not release the GIL as expected, causing potential thread parallelism limitations and implicit synchronized device-to-host copies while holding the GIL. These behaviors deviate from typical CUDA API expectations.
- Symbolic Tracing and Graph Partitioning Bugs: Bugs include failures in symbolic tracing due to decorator unwrapping mismatches, graph partitioning errors with custom mutating ops causing UnboundLocalError, and ConstDictVariable construction failures from unhashable LazyVariableTrackers. These issues hinder reliable graph transformations and tracing.
- Documentation and Build System Issues: Documentation errors include incorrect default values for align_corners in interpolation and outdated CUDA version info in "Get Started Locally". Build system problems include missing headers causing Vulkan backend build failures, Python environment misdetection in CMake/ninja, and excessive CI git tag noise. These reduce usability and build reliability.
- Stable API and C++ Header Anti-patterns: Issues highlight the use of anonymous namespaces inside stable API headers causing symbol overloading and propose adding helper utilities for boxing wrappers to reduce manual code maintenance. These address code quality and maintainability concerns in the C++ API.
- Type Checking and Type Annotation Problems: Problems include learning rate schedulers aliasing optimizer learning rates causing corruption, and a proposal to migrate from mypy to Pyrefly for improved type checking performance and features. These affect code correctness and developer experience.
- Kernel Compilation and Runtime Improvements: Progress and issues are reported on the experimental torch.cuda._compile_kernel API for inline kernel compilation, including cross-platform support and integration with torch.compile. Additionally, a bug in the inductor runtime related to TRITON_CACHE_DIR environment variable handling may cause cache replacement conflicts.
- Attention and Masking Bugs: Bugs in torch.nn.attention.flex_attention cause silently incorrect results during autoregressive decoding with BlockMask slicing, including broadcasting errors and no-op mask_mod attributes after slicing. These lead to silent failures in attention computations.
- Exported Program and Device Metadata Issues: Casting exported programs to different devices causes runtime assertion failures due to hardcoded device metadata in IR, breaking device-agnostic export introduced in version 2.8. This limits flexibility in model deployment.
- Miscellaneous Bugs and Crashes: Other reported bugs include segmentation faults in torch.nn.MaxUnpool3d with complex and uint32 inputs, build failures with rocWMMA on ROCm due to float to half conversion issues, and runtime crashes in AOTAutograd view replay due to non-contiguous aliased base tensors. These affect stability across various components.
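For the torch.linalg.eigh memory-spike group above, peak GPU memory for a single call can be observed as in this sketch; the matrix size is illustrative and not taken from the issue.

```python
import torch

A = torch.randn(8192, 8192, device="cuda")
A = A @ A.mT  # symmetrize so eigh's assumptions hold

torch.cuda.reset_peak_memory_stats()
w, v = torch.linalg.eigh(A)
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```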
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 21
Summarized Issues:
- Test Failures on XPU Platform: Multiple tests including test_libtorch_free_so_xpu, test_automatic_dynamo_graph_breaks_device_xpu, test_lazy_backward_backend_inductor_device_xpu, test_lazy_backward_backend_eager_device_xpu, and test_graph_break_bomb_backend_inductor_device_xpu have been consistently failing on the main branch for the xpu platform, leading to their disabling. These recurring failures indicate stability issues with the xpu platform tests that affect the reliability of the test suite.
- CI and Test Infrastructure Issues: The PyTorch CI pipeline's run_test.py suppresses test failure errors by returning a success exit code due to rerun and skip logic, masking actual test failures and causing inaccurate test results. Additionally, nvshmem_triton tests broke due to a Triton update causing key errors and compilation failures that were not properly caught by the CI system.
- Documentation and Usability Improvements: There is a need for clearer explanations and additional examples regarding the "out" parameter in PyTorch tensor documentation, along with a suggestion to rename the title for better clarity to improve user understanding.
- Sparse Tensor Format Implementation: A new sparse tensor format is being proposed with a request for guidance on adding a method to the tensor object to enable conversion to this new format via a call like A.to_new_format(). This indicates ongoing development to extend PyTorch's tensor capabilities.
- CUDA and Build Failures: Windows CUDA builds have been failing due to a missing identifier __builtin_clz in CUDA source files, and manywheel CUDA builds on Linux have been failing due to missing CUDA header files like cuda_runtime.h, likely caused by recent submodule updates. These build issues are blocking successful compilation and deployment.
- Learning Rate Scheduler Bugs: The ReduceLROnPlateau._reduce_lr method triggers unnecessary recompilation when the learning rate is set as a float instead of a tensor, causing performance inefficiencies. Also, the SWALR scheduler improperly serializes its anneal_func attribute, causing an UnpicklingError when loading its state_dict with weights_only=True, leading to checkpoint loading failures.
- Attention Masking Implementation Concern: There is a concern about whether the torch.nn.functional.scaled_dot_product_attention function correctly applies attention masking by adding -infinity to masked positions in the attention bias, which is critical for excluding those positions during the softmax operation. This raises questions about the correctness of the attention mechanism implementation.
- MPS Backend Error Handling: The max_unpool2d and max_unpool3d functions do not raise errors for invalid indices on the MPS backend, unlike CPU and CUDA where invalid indices correctly trigger runtime errors, indicating inconsistent error handling across backends.
- PyTorch Compile Tuple Assignment Bug: Using torch.compile causes a TypeError when directly assigning a tuple slice from torch.linalg.qr to a variable, while tuple unpacking works correctly, highlighting a discrepancy between compiled and eager execution modes (a sketch of the pattern appears after this list).
- FSDP Example and Tutorial Development: A tutorial script for sentiment classification on the IMDB dataset using Fully Sharded Data Parallel (FSDP) has been introduced to help new users understand FSDP through practical experience, including support for checkpointing on distributed multi-GPU systems.
- CUDA Out-of-Memory Error in Distributed Process Group: An ncclUnhandledCudaError occurs during distributed.destroy_process_group with a CUDA out-of-memory error despite monitoring only 8 MB of GPU memory usage, causing confusion about the underlying cause of the failure.
- Global Namespace Pollution in API Headers: A stable API header contains a using declaration that pollutes the global namespace, causing ambiguous symbol errors when PyTorch headers are included before other libraries like CUTLASS, leading to integration issues.
- FSDP Implicit Prefetching Issue: In the PyTorch FSDP2 example, implicit prefetching does not work as expected because the all-gather operation does not overlap with computation due to an unintended dependency caused by memory access patterns. This was traced to the environment variable CUDA_DEVICE_MAX_CONNECTIONS=1 limiting CUDA connections, resulting in delayed GPU kernel execution.
- CUDA Version Support Inquiry: A user inquires whether PyTorch supports CUDA 3.0 for their GPU server, seeking information on the minimum CUDA version required to run PyTorch.
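For the tuple-assignment report above, the contrast between the two access patterns looks roughly like this; the exact reproducer in the issue may differ, so treat this as an assumed illustration of the pattern.

```python
import torch

x = torch.randn(3, 3)

@torch.compile
def unpacked(x):
    q, r = torch.linalg.qr(x)    # tuple unpacking: reported to work
    return q @ r

@torch.compile
def sliced(x):
    qr = torch.linalg.qr(x)[:2]  # assigning a slice of the returned tuple
    return qr[0] @ qr[1]

unpacked(x)  # fine
sliced(x)    # reported to raise a TypeError under torch.compile
```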
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 207
Key Open Pull Requests
1. Fix for ValueError: ProcessGroupXCCL::gather: invalid tensor type at index 0 : This pull request addresses a ValueError in ProcessGroupXCCL::gather caused by a device mismatch where tensors participating in the gather operation must be on the XPU device but were incorrectly gathered into a CPU tensor, and it includes fixes to ensure proper tensor device placement for distributed operations on XPU.
- URL: pull/163362
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55, 7dade, 580aa, c0f57, 6dedb, 20d07, 124ff, 0bea1, 7409a, 636cb, 0d5a8, 3826e, cb711, d6cd1, 624be, 8d8c5, 1cf78, 41475, 0628c, b0d93, 58eb8, e558e, 83ac5, 0e7a7, a2b2f, 8de00, 91f5d, 39e6c, 06d6c, a0590, 31ddf, 9a6df, 50cb9, 08559, f6a8c, dc0be, 9fd7e, cce98, cf3bf, e3957, cd29a, 7b81c, 42401, 1dee9, 815db, bae7d, c283c, f7846, af361, 22cc1, 482f1, 9818e, 57d20, 2a5f3, 9f0c5, bb73a, ec4ea, 084ee, 70689, 491cc, 4a2d0, a6164, 12663, a03d1, 7da46, cfe26, 4d68d, 4c74d, 42219, 267bd, 36291, b6a17, 98cbe, 7245c, 878e1, 948b9, 2176c, f632e, 84ccf, 97d4c, 07bf4, 1e846, 493b2, 9b3f5, bf100, a0522, 6c84e, d7fcd, 14f79, da233, 04b39, 95ef9, ab42f, a1c29, 62c60, 94863, 0bb6d, 79c00, c8938, 976a6, 65292, 2e2d1, 4d289, a4627, 48e56, 02f1c, 09f47, 04fb8, e1dd6, c36f7, 1a3bd, ad3f9, 164bd, afa17, 32120, b293c, 339c0, 8e8c0, be778, caade, 380e3, dcd90, 9d255, 20dee, 6049b, e2a43, ec50d, 64864
2. [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth: This pull request introduces a library for querying device hardware limits such as FLOPS and memory bandwidth on CUDA devices, aiming to replace hardcoded values in benchmarks with a structured utility that can be extended to support more devices and parameters for both CPUs and accelerators.
- URL: pull/162942
- Merged: No
- Associated Commits: 36ea5, 23695, a6bf5, 9e5f4, 5633b, f3ab8, 86931, 815a5, 8d90d, d4796, 4d68f, 9a7ae, 8bfaa, a4333, d030e, 0d65d, dc8ae, 29e96, b03fa, 74ddd, 4ac03, 8c8be, f03b2, 66c59, 0d0f2, 7e2df, 5c782, c3dc3, 64f2d, e1e07, 007a4, b0b04
3. [RFC]: No Distributed Log Spew: This pull request proposes a novel, more intrusive approach to prevent excessive logging ("log spew") in PyTorch distributed environments by monkey patching the logging, warning, and optionally print functions to ensure that log statements only execute on the rank 0 process after distributed initialization, thereby making all logging statements distributed-aware without relying on linters or piecemeal code fixes.
- URL: pull/162999
- Merged: No
- Associated Commits: d093a, 606c4, eca1b, 607f6, fbdb9, c94cc, da969, cb062, 074af, d1a64, ff61b, 9c85b, 259bb, bb9d8, e105c
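The third key pull request above proposes making logging distributed-aware by patching the logging machinery itself. A minimal sketch of that general idea, which is not the PR's actual implementation, might look like this.

```python
import logging
import torch.distributed as dist

_orig_handle = logging.Logger.handle

def _rank0_only_handle(self, record):
    # Drop log records on non-zero ranks once the process group is initialized.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    _orig_handle(self, record)

logging.Logger.handle = _rank0_only_handle
```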
Other Open Pull Requests
- Cutlass library update: This pull request updates the Cutlass library version used in the fbcode environment of the PyTorch project, incorporating multiple commits to ensure the integration is pristine and up-to-date. It ensures that the latest improvements and fixes in Cutlass are reflected in the PyTorch build environment.
- [pull/163091]
- C++ migration of DTensor components: Multiple pull requests introduce C++ implementations and improvements for DTensor functionality, including the compute_global_tensor_info function and access to Placement data via pybind11 bindings. These changes represent incremental progress toward migrating DTensor operations from Python to C++ for better performance and interoperability.
- [pull/162990, pull/163030, pull/163031]
- Size-hint multi-kernel selection improvements: This pull request introduces a variant of the size-hint multi-kernel selection method that chooses an optimal pre-generated kernel based on shape similarity measured by L1 distance in log2 space for novel runtime shapes (a small illustration of the metric appears after this list). It also adjusts activation conditions and limits dimensionality for pre-generation searches to improve kernel selection efficiency.
- [pull/163090]
- Torchfuzz introduction: This pull request introduces the initial implementation of torchfuzz, a fuzz testing tool for PyTorch operations, with plans to transition from a stack-based to a graph-based representation. This tool aims to improve testing coverage and robustness of PyTorch operations by handling increasing complexity more effectively.
- [pull/163417]
- Argument parser bug fix: This pull request addresses a bug in PyTorch's argument parser where methods with a single sequence argument, such as tensor.reshape(), incorrectly ignored additional arguments when the first argument was a tuple. The fix extends to other similar methods and includes unit tests to ensure correct argument handling and prevent silent bugs.
- [pull/163081]
- DeviceMesh refactoring with CuTe layout: This pull request refactors DeviceMesh internal bookkeeping by leveraging the CuTe layout to simplify and generalize index operations, replacing dimension-mapping methods with layout-based approaches. It introduces new functions for layout overlap checking and remapping, enabling slicing and flattening of non-contiguous dimensions without changing existing behavior.
- [pull/163213]
- Testing for compute_global_tensor_info: This pull request adds basic tests for the function torch.distributed.tensor._utils.compute_global_tensor_info in preparation for its subsequent C++ implementation. These tests help ensure correctness before migrating the function to C++.
- [pull/162968]
- Learning rate scheduler aliasing fix: This pull request prevents unintended aliasing between self._last_lr and base_lrs with tensor learning rates in optimizer parameter groups, updates type annotations to support floats and tensors, enhances documentation, and adds tests to ensure correct behavior with tensor learning rates.
- [pull/163120]
- GPU health monitoring integration: This pull request integrates GPU health monitoring functionality from the NVIDIA Resiliency Extension into PyTorch’s distributed elastic training system. It provides comprehensive GPU health checks, recovery action detection, thread-safe asynchronous monitoring, and includes tests, documentation, and usage examples to enhance robustness.
- [pull/163192]
- Hessian function optimization: This pull request introduces an opt-in is_scalar argument to the hessian function that treats scalar functions separately, resulting in a 15-20% CPU and CUDA speedup without changing default behavior. The change is supported by extensive benchmarking and additional unit tests to prevent regressions.
- [pull/162915]
- ROCm MI350 GPU kernel autotuning: This pull request introduces heuristic improvements and additional autotuning configurations for pointwise kernels targeting the MI350 GPU on the ROCm platform. The goal is to enhance performance by optimizing kernel grid configurations and increasing maximum block size for better execution efficiency.
- [pull/163197]
- Broadcast_in_dim decomposition: This pull request proposes adding a decomposition implementation for the prims.broadcast_in_dim operation in PyTorch, addressing issue #163037. This aims to improve the modularity and maintainability of the operation.
- [pull/163377]
- Python dispatch documentation improvements: This pull request proposes slight improvements to the documentation in the python_dispatch module to clarify the direction of stack iteration, addressing the author's initial confusion.
- [pull/162963]
- User-streams aliasing handling: This pull request aims to properly handle aliasing in the user-streams component of PyTorch, as indicated by a series of commits updating and refining this functionality.
- [pull/163028]
- High priority streams support in ProcessGroupXCCL: This pull request adds support for high priority streams in ProcessGroupXCCL, enabling XPU streams to execute with higher priority similar to CUDA streams, and includes registration of this feature.
- [pull/163049]
- Test updates for dtensor compile: This pull request refactors the test_dtensor_compile by updating tests to accommodate the removal of previously unsupported usage involving FakeStore and init_process_group with a "fake" backend, ensuring compatibility with the current implementation.
- [pull/163058]
- Model recompilation on memory alignment change: This pull request ensures that the model is recompiled when memory alignment changes between runs, specifically addressing a RuntimeError caused by incorrect attn_bias alignment when running torch.compile with different sequence lengths. It adds proper guards to detect alignment differences and trigger recompilation.
- [pull/163083]
- ROCm docker images upgrade: This pull request proposes upgrading all ROCm docker images in PyTorch to the ROCm 7.0.1 release version, including updates to related dependencies such as UCX and UCC, while removing prior alpha cases.
- [pull/163140]
- Intel GPU TF32 support API: This pull request introduces a new API torch.xpu.is_tf32_supported for Intel GPUs to check hardware support for TF32 matrix multiplication acceleration. This aligns with other backends and enables informed use of torch.backends.mkldnn.allow_tf32=True or hardware capability queries for Triton.
- [pull/163141]
- Mark_dynamic name argument support: This pull request introduces support for a name keyword argument in the mark_dynamic function to improve ergonomics by enabling symbol sharing without relying on the complex torch._check paradigm. It addresses issues related to enforcing sequence length consistency in KV cached tensors during dynamic shape tracing.
- [pull/163246]
- MPS backend embedding bag improvements: This pull request proposes computing the offset2bag, bag_size, and max_indices parameters within the _embedding_bag function in the MPS backend to improve embedding bag operations.
- [pull/163281]
- Performance testing preparation for PyTorch 2.9: This pull request prepares performance testing for PyTorch version 2.9 by reinstalling the release candidate, rebuilding dependent packages torchrec and fbgemm due to the absence of RC wheels, and includes several commits aimed at benchmarking and debugging without requiring review.
- [pull/163334]
- Placement type accessibility in C++: This pull request aims to improve the performance of the compute_global_tensor_info function by making the Placement type accessible in C++, addressing overhead involved in unwrapping Placement via pybind11 and considering nanobind to reduce this overhead.
- [pull/163031]
- Device argument support in get_rng_state: This pull request adds support for passing a device argument—either as a device object, string, or integer—to the torch.random.get_rng_state function, enabling retrieval of the random number generator state for the specified device. It includes tests to verify this functionality.
- [pull/163034]
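To illustrate the shape-similarity metric described in the size-hint multi-kernel item above, a hypothetical selection helper could look like this; the function name and candidate shapes are invented for the example and do not come from the pull request.

```python
import math

def log2_l1_distance(shape_a, shape_b):
    """L1 distance between shapes in log2 space, per the heuristic described above."""
    return sum(abs(math.log2(a) - math.log2(b)) for a, b in zip(shape_a, shape_b))

# Hypothetical pre-generated kernels keyed by the size hints they were tuned for.
candidates = {(1024, 512): "kernel_a", (4096, 4096): "kernel_b"}

runtime_shape = (2048, 1024)
best_hint = min(candidates, key=lambda hint: log2_l1_distance(hint, runtime_shape))
print(candidates[best_hint])  # kernel_a (closer in log2 space)
```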
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 171
Key Closed Pull Requests
1. compile_kernel enable pch: This pull request implements support for enabling automatic precompiled headers (PCH) in CUDA kernel compilation within PyTorch, demonstrating significant average compilation speed improvements through benchmarking while noting that maximum compilation times with PCH can be higher, which currently prevents enabling PCH by default.
- URL: pull/162972
- Merged: No
- Associated Commits: 0ec2a, 7a334, 834ab, 5c94b, 506a4, 10453, edbbc, 34aa6, 00525, b9f3d, d04fb, a1080, 344f2, cdff0
2. [Inductor-FX] Support torch.cond: This pull request introduces support for the torch.cond operation in the PyTorch FX converter by implementing a new ConditionalLine wrapper IR line to represent conditionals, generating corresponding torch.ops.higher_order.cond nodes in the FX IR, and adapting the FX backend's code generation and graph input/output handling to properly manage subgraphs and conditionals, along with adding tests for this feature (a brief torch.cond usage sketch appears after this list).
- URL: pull/163234
- Merged: No
- Associated Commits: 90635, 8cadb, 87d26, f274a, 45a61, ebf25, dd8aa, 7fb15, a900c, 32233, 2d41d, 3e07d, cc09a, a94d4
3. Just Sample upload and update: This pull request is about updating the transformers dependency from version 4.54.0 to 4.56.1 and adding various continuous integration workflow files, although it was not merged.
- URL: pull/163172
- Merged: No
- Associated Commits: ad19e, 49a51, 450d9, dc544, c21d9, 4ae32, b9b39, 50a26, 41eb7, e18bc, 77e42, efc52, b4383
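As context for the torch.cond converter work in the second key pull request above, basic usage of the higher-order conditional looks like this; a minimal sketch unrelated to the converter internals.

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(pred, x):
    # torch.cond lowers to torch.ops.higher_order.cond nodes in the traced graph.
    return torch.cond(pred, true_fn, false_fn, (x,))

out = torch.compile(f)(torch.tensor(True), torch.randn(4))
```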
Other Closed Pull Requests
- Triton Inductor Blackwell-specific matrix multiplication: This pull request adds a persistent matrix multiplication template tailored for Blackwell architecture in the Triton Inductor, incorporating device-side TMA and new Triton features like automatic warp specialization and loop flattening to enhance performance. It excludes epilogue subtiling and tuning, with tests confirming functionality on a Blackwell machine.
- Fullgraph mode feature enhancements: Two pull requests propose enabling capture_dynamic_output_shape_ops and capture_scalar_outputs features when fullgraph mode is true, aiming to improve dynamic output shape handling and scalar output capture during full graph capture in PyTorch. These changes enhance the behavior and flexibility of the full graph capture mechanism.
- ROCm codebase cleanup: This pull request removes the HIPBLASLT_ALLOW_TF32 flag along with all related code, documentation, environment variable manipulations, and test dependencies to address several unit test failures in the ROCm codebase.
- CI NVIDIA driver update and NUMBA regression fix: This pull request updates the NVIDIA driver to version 580.82.07 in the CI environment to enable CUDA-13 tests and applies a live patch to address a regression in NUMBA integration as recommended in a related issue discussion.
- FSDP reset_sharded_param idempotency: This pull request makes the reset_sharded_param function in the Fully Sharded Data Parallel module idempotent by checking the storage data pointer to no-op if the local tensor is already padded, preventing redundant resets. Unit tests are included to verify this behavior.
- Flex module backward configuration and performance boost: This pull request updates the backward configuration setup and the default b200 configuration in the Flex module, resulting in up to a 4x performance improvement in backward pass throughput for various attention types and tensor shapes, supported by detailed benchmarking.
- MPS backend embedding_bag forward pass: This pull request adds the forward pass implementation for the embedding_bag operation in the MPS backend of PyTorch.
- clone_meta stride semantics fix: This pull request fixes a bug by ensuring the clone_meta function matches the stride semantics of eager mode tensors with preserve_format. It handles non-overlapping dense tensors by copying input strides and computes contiguous strides for other cases (a short illustration of the eager semantics appears after this list).
- Autograd mutation warnings workaround: This pull request proposes a substantial workaround to suppress warnings or errors from PyTorch's autograd system related to mutations in the threaded process group, as detailed in the accompanying note.
- Inductor floor division operator replacement: This pull request replaces more instances of the floor division operator // with the explicit FloorDiv operation in the inductor code to improve symbolic division representation and reasoning.
- PyTorch Dynamo stack trace preservation: This pull request enhances PyTorch Dynamo by enabling it to preserve stack traces correctly when working with inlined neural network modules, improving debugging and error tracking.
- Export_db tracer update and optional input removal: This pull request moves the export_db functionality to use a new tracer and removes the restriction on optional inputs to improve flexibility and tracing capabilities.
- CUTLASS submodule upgrade and renaming: This pull request upgrades the CUTLASS submodule to version 4.2.0 and renames references from "cutlass" to "cutlass_cppgen" within the PyTorch project.
- cusparseLt buffer allocation fix: This pull request ensures buffer allocations precisely match computed sizes from cusparseLt compression metadata by encoding original dimensions and using computed compressed sizes directly, preventing under-allocation and fixing potential bugs related to metadata size changes.
- Filesystem usage standardization: This pull request redirects all uses of filesystem functionality to the c10/utils/FileSystem.h header to standardize and possibly improve file system operations.
- assert_tensor_metadata decomposition for BatchedTensors: This pull request adds a decomposition rule to the assert_tensor_metadata aten operator to ensure compatibility with BatchedTensors by skipping them and applying the operator to the underlying tensor, fixing issues with Vmap fallback logic during device moves.
- Learning rate scheduler tensor aliasing fix: This pull request prevents problematic tensor aliasing in learning rate schedulers like SequentialLR and ReduceLROnPlateau by introducing a helper function to safely update parameter group values, fixing multiple related bugs and improving robustness.
- Fallback Kernel output aliasing handling: This pull request improves handling of output aliasing in custom operations with multiple outputs within the Fallback Kernel by correctly identifying outputs from mutating ops that should not alias inputs, allowing earlier deallocation of intermediate buffers without affecting downstream usage.
- Global namespace using declarations prohibition: This pull request addresses the prohibition of using declarations in the global namespace within stable header files by configuring clang-tidy checks and updating code to fix lint errors related to global namespace usage.
- DeviceMesh initialization update: This pull request proposes replacing the usage of DeviceMesh with init_device_mesh in backend code to improve or update device mesh initialization, but it has not been merged.
- Documentation typo fixes: This pull request proposes small typo fixes in the PyTorch project documentation but was not merged.
- AOT metadata collection optimization: This pull request optimizes the AOT metadata collection pass by reducing the number of __torch_dispatch__ calls in FakeTensorMode through using the saved device on storage instead of device_custom.
- Functional export naming and closure fix: This pull request fixes issues in functional export of PyTorch modules by unifying the wrapper’s self argument naming to prevent collisions and re-injecting closure variables into the wrapper’s local scope to ensure Dynamo’s guard builder can detect and use them during compilation.
- Communication ops visibility in schedule IR: This pull request adds a small utility change to enable visibility of communication operations such as SEND and RECV within the schedule intermediate representation (IR).
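For the clone_meta item above, the eager-mode semantics being matched can be seen directly: clone with preserve_format copies the strides of a non-overlapping dense input rather than producing a contiguous result. A minimal sketch:

```python
import torch

x = torch.randn(4, 6).t()   # non-contiguous but dense, non-overlapping view
y = x.clone(memory_format=torch.preserve_format)

print(x.stride(), y.stride())   # identical strides in eager mode
print(y.is_contiguous())        # False
```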
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| malfet | 100 | 17 | 6 | 110 |
| coconutruben | 191 | 24 | 0 | 11 |
| swolchok | 143 | 25 | 1 | 21 |
| ezyang | 61 | 20 | 10 | 80 |
| huydhn | 126 | 12 | 2 | 25 |
| kwen2501 | 103 | 29 | 2 | 29 |
| tugsbayasgalan | 112 | 30 | 0 | 15 |
| Skylion007 | 9 | 4 | 0 | 121 |
| anijain2305 | 75 | 13 | 5 | 29 |
| fduwjj | 65 | 15 | 0 | 39 |