Weekly Project News


Weekly GitHub Report for Pytorch: July 21, 2025 - July 28, 2025 (12:04:12)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda packages.
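
To make two of these changes concrete, the sketch below is our own minimal illustration (not taken from the release notes) of the new torch.load default and the torch.compiler.set_stance API; it assumes a PyTorch 2.6 or newer install with a working torch.compile backend.

    import torch

    # torch.load now defaults to weights_only=True; loading arbitrary pickled
    # objects requires an explicit opt-out (only for checkpoints you trust).
    torch.save({"w": torch.randn(3)}, "ckpt.pt")
    state = torch.load("ckpt.pt")                       # weights_only=True by default
    unsafe = torch.load("ckpt.pt", weights_only=False)  # explicit opt-out

    # torch.compiler.set_stance adjusts torch.compile behavior at call time,
    # e.g. forcing eager execution without recompiling the model.
    @torch.compile
    def double(x):
        return x * 2

    double(torch.randn(4))                      # compiled path
    torch.compiler.set_stance("force_eager")
    double(torch.randn(4))                      # runs eagerly
    torch.compiler.set_stance("default")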

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [RFC]: Iterations of DeviceMesh based on recent user requests/feedback: This issue discusses proposed iterations and improvements to the DeviceMesh abstraction in PyTorch based on recent user feedback and requests, aiming to simplify distributed parallelism configurations and address UX concerns such as unnecessary subgroup creation and missing reshape support. It also covers challenges related to subclassing DeviceMesh, multi-threading, and performance overhead—particularly in relation to DTensor integration—while seeking community input on design trade-offs and future directions.

    • The comment thread includes detailed clarifications on why current DeviceMesh initialization creates many unused sub-process groups and how a reshape API could improve UX by reducing these. Participants debated the implications of subclassing DeviceMesh and its impact on DTensor’s CPU overhead, with some emphasizing the importance of maintaining eager mode performance. Examples and code snippets were shared to illustrate proposed usage patterns, and there was consensus on splitting the RFC into orthogonal topics for focused discussion. Suggestions were made to unify interfaces with related projects like Monarch, and ongoing collaboration was noted to balance new features with backward compatibility and performance concerns.
    • Number of comments this week: 17
  2. [RFC] Support slicing of submesh and reshape (view) of one device mesh: This issue proposes adding support for slicing submeshes and reshaping (viewing) device meshes to improve the flexibility and usability of device mesh management in distributed training. Currently, users must always pass the root device mesh and cannot easily convert between 1D and 2D meshes or slice submeshes, which complicates APIs like replicate() and impacts user experience in scenarios such as Fully Sharded Data Parallel (FSDP) and Elastic Parallelism (EP).

    • The discussion clarifies that the replicate API should accept a 1D device mesh but internally requires a 2D mesh for HSDP, motivating the need for reshaping capabilities. Commenters emphasize that these operations create new process groups, which are costly and collective, so the API should be private and carefully named. The conversation also touches on limitations like only supporting contiguous reshapes initially and the need to consider corner cases such as non-contiguous flattening and composability with pipeline parallelism. (A minimal sketch of the current DeviceMesh slicing API appears after this list.)
    • Number of comments this week: 8
  3. torchrun failed to start workers on single node multi XPU: This issue describes a problem where using torchrun to start multiple workers on a single node with multiple XPUs fails, while the same setup works correctly with mpirun when PMI environment variables are properly set. The user provides a minimal example and error logs showing that setting PMI environment variables manually for torchrun leads to fatal errors, and the discussion in the comments suggests that torchrun should not use PMI environment variables, as doing so causes the failures observed.

    • The comments clarify that manually setting PMI environment variables for torchrun is not recommended and causes errors, while removing those settings avoids the failure; however, without PMI vars, a different error related to starting workers was initially encountered, and the user plans to provide more details once the server is available again.
    • Number of comments this week: 6
  4. Erroneous Recompile Warnings with Multiple Models: This issue reports that compiling multiple instances of different models with torch.compile triggers unexpected recompile warnings and errors due to the way recompilation limits are enforced, even though each model is only compiled once. The user is concerned that recompilation is counted across separate model instances sharing the same code object, leading to confusing and excessive warnings that hinder identifying truly problematic recompilations, and requests either a workaround or design changes to better distinguish these cases.

    • The discussion clarifies that recompilation in PyTorch is tracked at the code object level, so compiling separate instances of the same module class counts as recompilations, which triggers the limit and errors when fail_on_recompile_limit_hit is set. Suggestions include wrapping model calls in distinct functions to avoid this behavior, though this reduces reuse of compiled artifacts. The conversation also highlights the need for clearer logging and documentation around recompilation semantics to help users differentiate between necessary and unnecessary recompiles.
    • Number of comments this week: 6
  5. torch.compile improperly removes memory layout transformation: This issue reports a bug where using torch.compile improperly removes or alters the memory layout transformation required for a tensor in a float8 rowwise training scenario, causing an inductor error related to the tensor's expected column-major format. The problem arises specifically when compiling with torch.compile, as the error does not occur in eager mode, indicating that the compilation process is not preserving the necessary memory layout constraints for the operation torch._scaled_grouped_mm.

    • The comments discuss the error details and confirm that the tensor layout is incorrectly handled during compilation, with observations that the float8 operations are not fully traced by dynamo and that inductor passes a row-major buffer despite the graph specifying column-major layout. It is suggested that the issue may be due to missing layout tags on the scaled_grouped_mm operation, and various debugging outputs and traces are shared to pinpoint the cause.
    • Number of comments this week: 6
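
For readers new to the DeviceMesh API discussed in the first two issues above, the following is a minimal single-process sketch of the current init and slicing UX that the reshape/view proposals would extend. It is our own illustration (gloo backend on CPU so it runs without GPUs); a real DP-by-TP layout would be launched with torchrun across multiple ranks.

    import os
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh

    # Single-process setup purely for illustration; real meshes span many ranks.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # A 2-D mesh with named dims; today each dim gets its own process groups at init.
    mesh = init_device_mesh("cpu", (1, 1), mesh_dim_names=("dp", "tp"))

    # Slicing out a named 1-D submesh is supported today; reshaping/viewing the mesh
    # (e.g. flattening "dp" and "tp" into one dim) is what the RFCs propose to add.
    tp_mesh = mesh["tp"]
    print(mesh, tp_mesh)

    dist.destroy_process_group()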

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes the PyTorch backend compiler 'inductor' to fail during execution. The user provides detailed environment information and code snippets showing that the error arises while compiling certain model components with torch.compile, indicating a potential compatibility or packaging problem with the Triton compiler integration in PyTorch version 2.4.0 development builds.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, with demonstrated speedup in testing. (A quick check of this equivalence appears after this list.)
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp having permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase by removing approximately 1,500 files currently excluded from UFMT and applying consistent formatting to them. It outlines the process for updating the .lintrunner.toml configuration, running the formatting tool, handling known edge cases that require preparatory fixes, and organizing the work by directory to facilitate manageable and reviewable pull requests.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the saved JIT archive to reduce file size. The motivation stems from observations that these debug files, which are primarily for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on resource-constrained devices like mobile.
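
As a quick sanity check of the equivalence the MaxPool2D proposal above relies on, the sketch below (our own illustration) shows that, with stride 1, two stacked kernel-size-3 pools produce the same output as a single kernel-size-5 pool while doing fewer comparisons per output cell.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)

    # With stride 1, two k=3 max pools cover the same 5x5 neighborhood as one k=5 pool.
    stacked = nn.Sequential(nn.MaxPool2d(3, stride=1), nn.MaxPool2d(3, stride=1))
    single = nn.MaxPool2d(5, stride=1)

    assert torch.equal(stacked(x), single(x))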

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 83

Summarized Issues:

  • Distributed and Multi-Device Execution Issues: Several issues report problems with distributed training and multi-device setups, including failures in torchrun due to improper PMI environment variable handling, DistributedDataParallel (DDP) failing to converge with complex-valued models, and torch.distributed.gather producing incorrect results on non-contiguous tensors. Additionally, warnings about client socket hostname retrieval delays and init_process_group assertion errors with certain backend strings highlight challenges in distributed communication and backend compatibility.
    • issues/158722, issues/158753, issues/158902, issues/159007, issues/159200
  • Network and RPC Communication Bugs: The RPC system incorrectly attempts IPv4 communication on dual-stack networks even when IPv6 is specified, causing connection errors not seen on single-stack IPv6 setups. This indicates a bug in network address handling within the RPC framework.
    • issues/158724
  • Compilation and Build Failures: Multiple issues describe compilation problems, such as missing header checks causing build failures, misuse of C++17 flags with C code on macOS CI jobs, and CUDA kernel compilation errors due to large constant data footprints. These build issues affect various platforms and configurations, requiring fixes in build scripts and source code.
    • issues/158725, issues/158728, issues/158986
  • Torch.compile and Dynamo Compiler Limitations: Several bugs involve torch.compile failing or misbehaving, including unsupported calls to NamedTuple._replace, recompilation limits being hit with multiple model instances, incorrect device inference with FakeTensor, and runtime errors due to layout transformations or device mismatches. These issues reveal limitations and edge cases in the PyTorch compilation and tracing infrastructure. (A small sketch of the recompile-counting behavior appears after this list.)
    • issues/158772, issues/158873, issues/159041, issues/159097, issues/159133, issues/159196
  • Documentation and Usability Requests: There are requests for improved documentation, such as adding torchrun NUMA binding docs, clarifying scaled_dot_product_attention mask parameter behavior, and documenting constraints on parameters like edge_order in torch.gradient(). These highlight gaps between implementation and user-facing documentation.
    • issues/158775, issues/158969, issues/159129
  • Linters and CI Tooling Issues: Problems with linters include false positives from the set linter on f-string width specifiers, stuck linting processes in local environments and CI, and proposals for backend linters to detect missing or unused Docker images. These issues affect developer productivity and CI reliability.
    • issues/158782, issues/158783, issues/158785, issues/158895, issues/158906, issues/159056
  • DeviceMesh and Distributed Parallelism Enhancements: Multiple issues propose iterative improvements to the DeviceMesh abstraction, including support for slicing and reshaping submeshes, simplifying subclassing and multithreading safety, adding Fake PG mode for checkpointing, and enhancing process group bookkeeping to support multiple groups. These aim to improve usability and flexibility in distributed parallelism.
    • issues/158793, issues/159013, issues/159014, issues/159015, issues/159017, issues/159018, issues/159019
  • ONNX Export and Model Export Issues: ONNX export fails due to unsupported operators and dynamic shape handling errors, and torch.export produces invalid programs when exporting models with non-contiguous tensors in evaluation mode. These issues hinder model interoperability and deployment workflows.
    • issues/158739, issues/159072, issues/159126
  • Performance Regressions and Optimization Proposals: Reports include severe performance degradation in batched SVD on GPUs, AMP static shape wrapper slowdowns, and Stable Diffusion throughput regressions. Proposals include Triton-based CUDA kernels for quantized operations, enhancements to foreach_map for kernel fusion and matrix multiplication, and optimizations to ShardedOptimizer and CUDACachingAllocator to reduce memory fragmentation.
    • issues/158831, issues/159031, issues/159121, issues/158849, issues/158968, issues/158970, issues/158975, issues/159147, issues/159164
  • Sparse Tensor and Numeric Operation Bugs: Sparse tensor broadcasting multiplication produces incorrect results compared to dense tensors, and torch.searchsorted behaves unexpectedly with NaN values, placing insertion indices after NaNs contrary to intuitive and NumPy behavior. These bugs affect numerical correctness and consistency.
    • issues/158861, issues/158738
  • Benchmarking and Profiling Tools Issues: Bugs include premature device backend loading causing errors in benchmarking utilities, invalid strides in meta kernels for inductor, and creation of profiling scripts to measure CPU overhead in distributed setups. These affect performance measurement accuracy and tooling stability.
    • issues/158825, issues/159169, issues/159192
  • Torch Classes and Python Introspection Bugs: The torch.classes module does not set all module attributes correctly, breaking Python introspection and returning incorrect types for special attributes, suggesting a need to use importlib.util.module_from_spec for proper module creation.
    • issues/158871
  • Test Flakiness and Disabled Tests: Some tests are flaky or disabled due to intermittent failures or platform-specific issues, such as a flash attention gradient test failing intermittently and a max autotune test disabled on xpu. These reduce test suite reliability.
    • issues/158890, issues/159000
  • Memory and Runtime Errors in CUDA and Inductor: Issues include internal assertion failures in CUDA caching allocator during speech generation, incorrect C++ code generation by inductor causing memory corruption, and inability to preallocate output tensors in AOTInductor leading to inefficient memory usage.
    • issues/159149, issues/159154, issues/159124
  • API and Feature Enhancement Requests: Proposals include adding an API to retrieve the original unsharded module from FSDP, implementing a LazyLayerNorm layer for vision tasks, supporting CUDA Compute Capability sm_120 for new NVIDIA GPUs, and enhancing DTensor dispatch logic to prioritize registered sharding strategies over decompositions. These aim to improve functionality and hardware support.
    • issues/158819, issues/158832, issues/159207, issues/159110
  • Error Message and Warning Improvements: Some issues report unclear or misleading error messages, such as confusing cross-device mismatch errors in torch.compile, misleading error messages in ReflectionPadNd, and documentation inconsistencies about matplotlib version constraints and padding parameters. These affect developer experience and debugging.
    • issues/159133, issues/159138, issues/158992, issues/159141
  • Miscellaneous Bugs and Requests: Other issues include segmentation faults on AMD GPUs with TunableOp enabled, NaN returns from F.gumbel_softmax on Apple MPS devices, AttributeErrors due to missing attributes in model backbones, and requests to remove unused dependencies like protoc. These cover a range of smaller but impactful problems.
    • issues/159070, issues/159103, issues/159203, issues/159156
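
To make the recompile-counting behavior flagged under the torch.compile bullet above more concrete, here is a minimal sketch (our own illustration, not code from the issue) of how two instances of one module class share a code object and are therefore counted against the same recompile budget.

    import torch

    class Tiny(torch.nn.Module):
        def forward(self, x):
            return x * 2

    # Both instances compile the same Tiny.forward code object, so Dynamo treats
    # the second compilation as a recompile of the first, even though each
    # instance is only compiled once.
    a = torch.compile(Tiny())
    b = torch.compile(Tiny())
    a(torch.randn(4))
    b(torch.randn(4))  # may log a recompile warning once the limit is reached

    # One blunt mitigation is to raise the per-code-object limit; the discussion
    # also suggests wrapping each model call in a distinct function.
    torch._dynamo.config.cache_size_limit = 64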

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 39

Summarized Issues:

  • FSDP CPU Offload and Parameter Handling: Using Fully Sharded Data Parallel (FSDP) with CPU offload but no sharding prevents parameters from being prefetched or moved to the GPU automatically, causing errors when manually transferring tensors to the GPU. This requires manual conversion of weights before the forward pass to avoid failures.
  • [issues/157209]
  • PyTorch 2.8 Numerical Regression in Transformers Vision Models: Upgrading to PyTorch 2.8 Release Candidate causes large output differences in vision-related transformers models like ViT, CLIP, and BEiT, increasing numerical discrepancies by about 10,000 times. This leads to numerous CI test failures due to unexpected output deviations on GPU.
  • [issues/157274]
  • ONNX Export and Operator Support Issues: Exporting PyTorch models to ONNX format fails due to unsupported operators like Unfold and data-dependent errors in reshape operations, causing runtime errors during symbolic shape tracing and guard checks. These issues affect modules such as ultralytics CARAFE and Qwen2VL visual encoder.
  • [issues/158194, issues/158786]
  • Import Errors Related to Missing Symbols and CUDA Profiling: ImportErrors occur due to missing symbols like 'scaled_mm_configs' in the inductor backend and undefined CUDA Profiling Tools Interface (CUPTI) symbols when CUDA 12.4 is installed, causing failures in model compilation and tutorial builds.
  • [issues/157343, issues/157381]
  • Dynamo and Torch.compile Compatibility Problems: PyTorch Dynamo faces issues tracing nn.Parameter constructors, causing fragility and requiring graph breaks, and also fails when compiling models using unsupported operations like Tensor.item(), leading to hard errors. Additionally, incompatibility with Python's sys.monitoring module causes graph breaks during torch.compile execution.
  • [issues/157452, issues/158642, issues/158164]
  • torch.compile Output and Numerical Accuracy Issues: Using torch.compile on transformer models with causal attention or performing tensor division by scalars results in significantly incorrect or slightly different outputs due to rounding errors and compilation codegen issues. Converting scalars to tensor parameters can mitigate some errors.
  • [issues/158226, issues/157959]
  • Performance and Scalability Bottlenecks in FX Graph Export: The _rename_without_collisions function causes quadratic time complexity during export of large FX graphs, creating a performance bottleneck that requires optimization to speed up placeholder naming and overall export runtime.
  • [issues/158357]
  • Higher-Order Operator Export and Loop Handling Bugs: Exporting and rewriting models using the while_loop operator leads to runtime errors from symbolic shape tracing and incorrect constant propagation when using torch.tensor(0) as a loop counter, causing loops to process only the initial index.
  • [issues/158366]
  • AOTAutograd Module Reorganization and Memory Leak Fixes: Proposals to reorganize AOTAutograd internal modules aim to improve clarity and maintainability, while a memory leak in the AOTIModelPackageLoader component was identified and fixed, resolving direct leaks detected by sanitizers.
  • [issues/158382, issues/158614]
  • Custom C++ Operator and AutocastCUDA Dispatch Issues: Using custom C++ operators with JIT Script Tracing and AutocastCUDA causes runtime errors due to missing Python implementations for the AutocastCUDA dispatch key, raising questions about proper registration for seamless integration with PyTorch subsystems.
  • [issues/158414]
  • Anaconda Terms of Service Impact on CI Builds: New Anaconda Terms of Service effective July 15, 2025, require explicit acceptance before accessing certain package channels, causing CI build failures unless acceptance commands or environment variables are set.
  • [issues/158438]
  • CUDA Graph Capture Failures with Slice Operations and AOTInductor: The function torch.cuda.make_graphed_callables fails to capture functions containing tensor slice operations, causing runtime errors, and CUDA graph capture also fails when using AOTInductor compilation and packaging workflows, despite success in standard Python and torch.compile environments.
  • [issues/158564, issues/158834]
  • Async Tensor Parallelism Regression and Collective Operation Assertion: Enabling async tensor parallelism in PyTorch 2.8 triggers assertion errors during compiler passes if collective operations occur on ProcessGroups without symmetric memory enabled, breaking workflows that combine tensor and context parallelism due to incorrect enforcement.
  • [issues/158569]
  • Build Errors from Incompatible API Changes in CUDA Libraries: Recent nightly builds introduced incompatible API changes causing build errors, such as calls to c10::cuda::get_cuda_check_suffix() with too few arguments, affecting FBGEMM GPU/GenAI CI.
  • [issues/158588]
  • Documentation and API Inaccuracies: PyTorch documentation contains inaccuracies such as unsupported tuple types for the dim parameter in torch.max(), missing kernel_options documentation in the flex_attention API, and outdated ONNX tutorial references to deprecated technologies, all requiring updates to prevent confusion. (A short sketch of the corrected torch.max usage appears after this list.)
  • [issues/157300, issues/158645, issues/158741]
  • Convolutional Operations and Channel Last Layout Support: PyTorch convolutional operations lack native support for channel last tensor layouts (e.g., (N,L,C), (N,H,W,C)), necessitating costly data transpositions and limiting performance and compatibility with other operations. Proposals include adding flags or new modules to handle these layouts directly.
  • [issues/157663]
  • Data Type Support and Device Compatibility Issues: The MPS backend does not support uint16 dtype causing errors on Apple M1 GPUs, and hipblaslt support is incorrectly disabled on MI100 (gfx908) architectures despite ROCm enabling it, requiring updates to device recognition and dtype support for efficient computation.
  • [issues/159076, issues/159030]
  • GPU Memory Query and Synchronization Problems: The torch.xpu.mem_get_info function fails on certain Intel GPUs due to unsupported memory queries, and CUDA kernel code may lack necessary __syncthreads() calls causing potential race conditions, highlighting inconsistent hardware support and synchronization issues.
  • [issues/159027, issues/158921]
  • Regression in vmap Affecting Per-Sample Gradients: A regression in PyTorch 2.8 RC related to vmap causes significant discrepancies in per-sample gradient calculations, linked to a cuDNN version update and changes around the 0613 nightly build, impacting gradient correctness compared to version 2.7.
  • [issues/158787]
  • Hardware and Benchmarking Anomalies on AMD GPUs: Benchmarking shows the newer AMD 9060XT GPU performing significantly slower than the older 6600 in PyTorch, despite hardware and setup checks, contrasting with TensorFlow results and suggesting potential inefficiencies in PyTorch's ROCm GPU support or benchmarking methodology.
  • [issues/158828]
  • Compilation Failures Due to Missing NVSHMEM Headers: Building PyTorch with NVSHMEM support fails due to missing nvshmem_host.h header files in certain installed NVSHMEM versions, despite CMake detecting NVSHMEM, requiring updates to dependencies or build configurations.
  • [issues/159045]
  • Complex Tensor Representation Inconsistencies: PyTorch incorrectly displays the sign of the real part of complex tensors as positive zero instead of negative zero for certain values with very small imaginary components, causing representation inconsistencies.
  • [issues/158743]
  • Triton Kernel Codegen Fails with Boolean Parameters: Code generation for user-defined Triton kernels fails when boolean parameters are included, due to improper handling of boolean arguments during signature determination, resulting in TypeErrors.
  • [issues/158778]
  • H100 Instance Downtime Due to Expired Capacity Reservations: Two H100 instances, including the only linux.aws.h100.8, experienced downtime caused by expired capacity reservations, leading to job blocking and longer queues, with mitigation involving reservation renewal and plans to prevent future expirations.
  • [issues/158809]
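
Relating to the torch.max documentation fix above: dim accepts a single int, and reducing over several dimensions is done with torch.amax, which does accept a tuple. A minimal sketch (our own illustration):

    import torch

    x = torch.randn(3, 4)

    # torch.max reduces over one dim and returns (values, indices).
    values, indices = torch.max(x, dim=1)

    # Multi-dim reductions use torch.amax, which takes a tuple of dims (values only).
    m = torch.amax(x, dim=(0, 1))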

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 166

Key Open Pull Requests

1. [do not review] Add vllm build: This pull request proposes adding a vllm build setup workflow so that the vllm project can be built against PyTorch.

  • URL: pull/158797
  • Merged: No
  • Associated Commits: 4f2b5, 2c1ab, 558d9, bfa14, 222d4, 9028d, 8c683, 7f761, 5f445, 9cb70, 2cfed, 98484, 2066c, 479cc, 70eef, f2e5f, 5e58d, 33716, db5b1, e6923, e8542, 6131d, 97752, 2c8a5, 4b131, 0e16c, f8f76, 4f28a, 22660, 1159e, 1d587, 7856a, 46db8, 8c350, 5f5b0, 93e72, c8726, 6b8c0, 74581, 5d4e6, 96ab5, 5544c, 5dbbd, ed9c9, e06ce, 1886b, 5f669, 484e6, f128a, 131fc, d2b75, a63f2, 9c33d, 63bbf, 3924b, 0161b, 5c8c2, 77b0c, 56184, 0800e, 6c912, d4023, 22cf0, 8d0ee, a1759, 43d2f, 94240, df7ad, 67ec8, 31ef7, 1c23a, 28b4e, 5a394, fcf17, 77d9b, 0d06f, 5e2cb, d3594, abb7c, f5dc8, 3cd02, 6659b, 28a73

2. [DO NOT MERGE] Test New MI325X Capacity.: This pull request is a test to evaluate additional capacity for the MI325X hardware by updating and renaming various ROCm-related CI configuration files and workflows in the PyTorch project.

  • URL: pull/159059
  • Merged: No
  • Associated Commits: 44b27, 266ab, c17b5, 6a7a1, fdb49, 1e274, 570dd, 61a20, e0abf, 0dacf, a1b0e, 5d8fd, bce86, 3f886, decda, 4f343, b48c0

3. [CUDA] Add experimental green context support for SM carveout: This pull request introduces experimental support for green context in CUDA's SM carveout within PyTorch, aiming to leverage low-level APIs for improved driver API usage stability and usability.

  • URL: pull/159104
  • Merged: No
  • Associated Commits: f6995, 47058, 01a90, 66502, 6982f, b8d2c, 2326e, 097d5, bece9, b2d56, 8a014, 39fe1, d49cc

Other Open Pull Requests

  • Inductor backend memory optimizations: Multiple pull requests focus on improving memory allocation and usage in the PyTorch Inductor backend. These include allocating memory for non-blocking copy destinations in pinned memory, moving all CPU scalar tensors to pinned memory, and respecting layout tags for operations with registered lowerings to ensure correct stride handling and performance. (A minimal pinned-memory copy sketch appears after this list.)
    [pull/158758, pull/158882, pull/159134]
  • ROCm and Composable Kernel (CK) integration: Several pull requests enhance ROCm backend support by enabling gfx950 architecture for autotuning, integrating the Composable Kernel library into the ROCm backend for Inductor, and updating workflows to switch MI300 CI jobs from 2-GPU to 1-GPU runners. These changes improve performance, build flexibility, and CI efficiency.
    [pull/159195, pull/158747, pull/158877]
  • Schema generation and mutation prevention: A group of pull requests add support for generating schemas for operations like map, scan, and associative_scan while enforcing no in-place mutations on inputs. This prevents inter-loop dependencies that could break parallelism and ensures consistent behavior in iterative operations.
    [pull/158884, pull/158740, pull/158864]
  • Flake8 F824 fixes: Two pull requests address the flake8 rule F824 by correcting unnecessary use of global and nonlocal statements in both the torch/ directory and backend test files. They clarify that global and nonlocal are only needed when a name is reassigned inside the enclosing scope, not when the referenced object is merely mutated through its methods.
    [pull/159119, pull/159120]
  • Dynamo documentation and behavior changes: Documentation improvements and semantic changes are introduced for the Dynamo component, including adding docs for fullgraph=False, recompilation, observability, and issue reporting. Additionally, the nn.Parameter constructor behavior is changed to default to graph breaking when lacking a clean source, with an option to revert via config.
    [pull/159050, pull/159062, pull/158800]
  • Support for new platforms and operations: Pull requests add support for OSX and Windows platforms in the OpenReg component and introduce avg_pool3d operation support for the Metal Performance Shaders (MPS) backend. These extend PyTorch's compatibility and functionality across platforms and backends.
    [pull/159029, pull/158877]
  • Performance testing and autotuning infrastructure: A performance testing addition for the addmm_fusion operation and infrastructure to record autotuning lookup tables in Inductor are introduced. These facilitate performance evaluation and easier management of autotuning results, focusing on matrix multiplication templates.
    [pull/159182, pull/158987]
  • Bug fixes and feature enhancements: Various fixes include correcting the itertools accumulate function, adding support for torch.mm with out_dtype in torch.compile, fixing matrix multiplication decomposition for small dynamic-shaped tensors, and adding support for mm on ZeroTensor objects with refactored logic.
    [pull/158774, pull/159026, pull/158998, pull/158740]
  • CTC loss gradient update: A draft pull request modifies the CTC loss gradient calculations to be with respect to raw inputs by applying a correction factor involving the exponential of log probabilities. CPU changes are tested while CUDA and CUDNN validations are ongoing.
    [pull/159106]
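
As background for the pinned-memory pull requests at the top of this list, non-blocking host-to-device copies only overlap with other work when the source tensor lives in pinned (page-locked) memory. A minimal sketch, our own illustration assuming a CUDA device is available:

    import torch

    if torch.cuda.is_available():
        # Allocate the host tensor in pinned memory so the H2D copy can be asynchronous.
        src = torch.randn(1024, 1024, pin_memory=True)
        dst = src.to("cuda", non_blocking=True)  # enqueues the copy and returns immediately
        torch.cuda.synchronize()                 # wait for the copy before reading dst on the host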

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 296

Key Closed Pull Requests

1. Fix unit tests for Navi4x: This pull request addresses fixing and skipping certain unit tests for the Navi4x GPU architecture in the PyTorch ROCm release/2.6 branch, ensuring compatibility by enabling or disabling tests based on hardware support and resolving related test failures as tracked in issue SWDEV-523736.

  • URL: pull/158953
  • Merged: No
  • Associated Commits: 93864, 88b97, bbd00, 1f32b, d33dd, f1481, ea546, ac7d6, ed487, 66dfe, 3783d, 8adc1, 5c4fa, 639ee, b445b, 8d72c, 374e5, 6a3b5, e607b, aafc7, d5947, 1b753, ba1ba, 70f30, 3398f, 8354d, 737cf, 4202f, 7c27e, 2e2c7, 3a818, 53ad2, 8eb5d, dbe8c, fcdff, 92b55, f6789, 2e1ed, 13339, 82ac2, 3608e, bfb23, 86b0a, 03714, 34caa, ac032, 5dd61, 7d528, d9a03, 7c072, 73dd0, b08d9, d70a9, 7ad5a, 2fd46, ed8c6, bf084, 20ad8, 2fb0a, 8cfa9, 50a04, 45896, 9d0a4, 1a808, 6fe84, a3632, 68180, fb24f, e53a9, 2cda1, 9d566, a7044, c7ba8, cbd7b, c3733, faf90, a87c9, ce6b7, dc41a, 469ce, 8ccfc, 1290e, 93693, 75628, f4c96, 9cf15, 5c42a, 1a150, 1ded2, 2ff80, b6e5f, 50924, 4642c, 5de86, f0b4a, 96c61, 751e4, 684f6, 0afd4, 93ff7, e22ae, 2eff7, 2a634, 90ab5, 95d7f, 22d88, 882f3, 8fe3c, 7c63c, cce16, e4e68, 2045a, bbf4b, 0ad73, a8545, ce580, f3092, eb37e, d1c90, 01546, dae14, 9d15d, 8b223, 793db, 5c54a, 538a5, 6b7db, 4ca33, ea75c, 80396, 67972, 212bd, 133e9, 43000, 84e98, 76481, 781c5, 9c357, 5f465, 74d7d, 7a4c2

2. Clarify torch.max Docstring by Removing Incorrect Tuple of Ints Reference: This pull request aims to clarify the torch.max function's docstring by removing the incorrect reference to a tuple of ints as a supported dtype for the parameter dim, addressing issue #158645.

  • URL: pull/158707
  • Merged: No
  • Associated Commits: d7062, d363d, 092ad, ccf3b, 3b132, e982d, f1c0d, 4f2c7, dfdf9, 1caab, 19541, 9cdce, 46cec, 7d463, f7abf, 68f87, 403f9, 3f4a2, 01a57, 6d9ca, d0f0e, 9113b, 064a3, a9214, e6921, 05bf7, 45b74, 09ed6

3. Setup TorchBench in Docker: This pull request sets up TorchBench within a Docker environment to significantly reduce the setup time on A100 and H100 GPUs by about half an hour, with benchmarks run and results to be reviewed on the HUD dashboard to ensure all models are included.

  • URL: pull/158613
  • Merged: 2025-07-26T19:56:03Z
  • Associated Commits: 0e37f, ec8fa, 3aec2, 30c21, 444df, 55316, a151d, 0ce83, 9fe49, 9ff8a, 753fe, 2435a, 2bc4a, 14a38, 80d24, c6847, 550d9, 1eaa3, 046bc, 837ea, 626d1, 1b2cd, 1d309, e6bd5, 701a4

Other Closed Pull Requests

  • ROCm CI and Architecture Support: This topic covers efforts to enable and test PyTorch ROCm continuous integration on MI355X nodes by updating CI configurations, Docker images, and build arguments to support ROCm 7.0 and gfx950 architecture. Although these changes were proposed, the pull request was not merged.
    • pull/158221
  • Dynamo Guards Recursive Dictionary Tagging Optimization: Multiple pull requests propose optimizations for recursive dictionary tagging within the Dynamo guards system, aiming to improve guard-related performance and safety checks in PyTorch.
    • pull/158805, pull/158803
  • Codebase Warning Fixes and Lint Updates: This topic includes fixing over 300 instances of extra semicolon warnings by removing the -Wno-error=extra-semi flag and correcting code to prevent these warnings from appearing as errors. Additionally, there was a proposal to update flake8 and mypy lint dependencies in the CI setup, though it was not merged.
    • pull/158730, pull/158720
  • AOT Export and Compilation Enhancements: Proposals include adding functions aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors to PyTorch, and adding the aot_autograd.fx_utils module with an API validated through autoparallel usage. These changes aim to improve AOT autograd capabilities and output handling.
    • pull/158715, pull/159005, pull/158460
  • Inductor and Triton Kernel Optimizations: This includes enabling Tensor Memory Access (TMA) for the flex-attention mechanism within Inductor and Triton components to optimize kernel block descriptors and data loads on supported devices. Also covered is a refactor of the inductor template system to finalize all registered template hooks before accessing the template's code object.
    • pull/158850, pull/157270
  • Generic c_shim Generation and Dispatcher Compliance: This pull request introduces a generic c_shim that does not bypass the dispatcher by adding c_shim_aten.{h/cpp} files and applying this mechanism specifically to the fill_ operation. This enables more consistent and dispatcher-compliant function calls within PyTorch.
    • pull/158974
  • Example Script Updates for Device Management: An attempt was made to update example scripts to utilize torch.accelerator for device type detection and management, including checks for the number of available devices. However, this pull request was not merged.
    • pull/157317
  • Documentation Coverage Improvements: This pull request improves documentation coverage for the torch.nn.modules.* namespace by adding a new reStructuredText page documenting aliases, removing most entries from the coverage skiplist, and supplementing key methods with basic docstrings to satisfy coverage tests.
    • pull/158491
  • Clamp Strategy Fixes for Op Coverage: This pull request fixes the clamp, clamp_min, and clamp_max strategies to correctly handle cases where min/max inputs can be tensors or scalar values, addressing a failing op coverage test.
    • pull/158619
  • DTensor Fused RMS Normalization Proposal: A proposal was made to add a fused RMS normalization operation strategy for DTensor to address a specific issue, but it was not merged.
    • pull/158716
  • Function Wrapping Improvements in AOTAutograd: This pull request proposes replacing functools.wraps with simple_wraps in AOTAutograd to preserve original function name and module information for better runtime introspection, along with tightening assertions in descriptor code.
    • pull/158734
  • Precompile Feature Enhancements for User-Defined Functions: This pull request enhances the precompile feature by supporting user-defined function calls from bytecode, enabling successful nanogpt inference and training with precompile under torchbench by serializing locations of new user-defined code objects.
    • pull/158947
  • C++ Code Generation Formatting Improvements: This pull request aims to improve tabbing formatting in the C++ code generation process within PyTorch.
    • pull/158351
  • Automation for JSON Registry Updates in Dynamo: This pull request proposes adding automation for adding and updating the JSON registry in the PyTorch dynamo project using the lintrunner tool.
    • pull/158460
  • Device Field Passing Fix for Inductor Compilation: This pull request addresses correct passing of the device field to enable proper CPU kernel generation when both MPS and CPU code can be inductor compiled, ensuring proper device casing in compilation.
    • pull/158349
  • Parallel Testing for sm89 and sm90 CI Jobs: This pull request proposes using three parallel processes for testing on sm89 and sm90 CI jobs to reduce test duration and manage GPU memory usage more efficiently, applying skips where necessary to address failures and out-of-memory issues.
    • pull/158350
  • Windows Support for C++ Compile Command: This pull request adds Windows support for the get_cpp_compile_command function, including additional argument support and Windows-specific compiler commands.
    • pull/158691
  • Type Alias Update for padding_mode Parameter: This pull request updates the type alias for the padding_mode parameter in module/conv.py from a generic string to a more specific Literal type with defined values to address mypy linting errors. (A small sketch of this typing pattern appears after this list.)
    • pull/158732
  • FSDP-Based Replicate Function Introduction: This pull request introduces a new replicate function using Fully Sharded Data Parallel (FSDP) instead of Distributed Data Parallel (DDP), enabling users to utilize replicate with FSDP and tensor parallelism by modifying the fully_shard function and state.
    • pull/158843
  • ONNX Export TorchScript Sentence Filtering: This pull request aims to filter out torchscript sentences in the ONNX export process to address a specific issue, but it was not merged.
    • pull/158850
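
To illustrate the padding_mode typing change above, narrowing a plain str parameter to a Literal lets type checkers reject invalid modes. The sketch below is our own illustration using the padding modes nn.Conv2d accepts; the alias name is hypothetical and the PR's actual alias in the conv module may differ.

    from typing import Literal

    # Hypothetical alias for illustration; the real one lives alongside the conv modules.
    PaddingMode = Literal["zeros", "reflect", "replicate", "circular"]

    def build_conv(padding_mode: PaddingMode = "zeros") -> None:
        ...

    build_conv("reflect")
    # build_conv("refl3ct")  # mypy would reject this: not a valid PaddingMode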

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor   Commits   Pull Requests   Issues   Comments
XuehaiPan         201              32        0         17
malfet            100              23        4         85
ezyang             72              19        2         73
bobrenjc93        132              28        0          1
williamwen42       35               8        4         70
wconstab           48               3        1         65
anijain2305        91              16        0         10
ydwu4              82              25        0          9
guangyey           72               7        1         36
zeshengzong        84               7        0         19
