Weekly GitHub Report for PyTorch: June 30, 2025 - July 07, 2025 (12:01:05)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release marks a shift away from publishing on Conda, with a focus on using official wheel packages, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load, enhancing security.
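Because the new weights_only=True default is backward incompatible, here is a minimal sketch of how existing loading code is affected (file names are illustrative, not taken from the release notes):

```python
import torch

# A plain state_dict (tensors and primitive containers) loads fine under the new default.
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")
state = torch.load("model.pt")          # weights_only=True is now the default
model.load_state_dict(state)

# Checkpoints that pickle arbitrary Python objects (here, an optimizer instance)
# would be rejected under the new default and need an explicit opt-out, which
# should only be used for files from a trusted source.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
torch.save({"step": 10, "optimizer": opt}, "ckpt.pt")
ckpt = torch.load("ckpt.pt", weights_only=False)    # explicit opt-out for pickled objects
print(ckpt["step"])
```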
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- How to compose HSDP with CP?: This issue involves a user encountering difficulties while attempting to compose Hierarchical Sharded Data Parallel (HSDP) with Context Parallel (CP) in a PyTorch project, specifically when trying to flatten a device mesh for expert and non-expert parameters. The user is unsure if the problem is due to a bug or a misunderstanding of the intended behavior of the DeviceMesh, and seeks guidance on whether a flattened mesh can be used as the replication dimension for HSDP.
- The comments discuss potential solutions and clarifications regarding the issue, with contributors suggesting different methods for sharding and highlighting the limitations of the current implementation. A contributor identifies that the problem arises from the inability to slice out non-contiguous flattened dimensions, and a pull request is mentioned to improve error messaging. The conversation also includes a follow-up question about configuring the setup correctly, indicating ongoing challenges with the implementation.
- Number of comments this week: 10
- Unexpected, batch size and device dependent NaN propagation in Conv1d: This issue describes a bug in PyTorch where unexpected NaN propagation occurs during causal 1D convolutions, and the problem is dependent on the batch size and the device used. Specifically, when sequences in a batch contain NaN elements at their ends, the convolution operation produces additional NaNs at the boundary between NaN and non-NaN elements, but this behavior is inconsistent across different devices.
- The comments discuss attempts to reproduce the issue on various hardware configurations, with some users confirming the problem on Apple CPUs and others unable to reproduce it on different ARM architectures. It is noted that the issue seems specific to macOS and occurs with certain data types and batch sizes, with a potential link to NNPACK being investigated.
- Number of comments this week: 8
- torch 2.8 RC gives 10000x larger output difference in some transformers tests: This issue highlights a significant increase in output differences in some transformers tests when using the torch 2.8 RC version, where the discrepancy has grown from a magnitude of 1e-9 to 1e-5, raising concerns about potential impacts on integration tests. The problem is particularly evident in models with a vision component, and the user is seeking assistance from the torch team to investigate the cause before updating the expected outputs in the transformers tests.
- The comments discuss whether the issue pertains to eager or compiler use cases, confirm that the tests use torch.float32, and suggest that the differences might be due to different cuBLAS kernels or accumulation order. The user is advised to use cuBLAS logging or nsys nvprof to investigate kernel selection, and it is noted that using "amax" is a pessimistic approach compared to rtol + atol for determining acceptable differences (see the tolerance sketch after this list).
- Number of comments this week: 8
- Regression in llama2 model export: This issue reports a regression in the export functionality of the Llama2 model using the PyTorch 2.8.0 nightly build, which results in a fake tensor error that was not present in previous versions. The user is uncertain whether this is a regression or a user error and seeks the attention of the exporter team due to an upcoming release.
- The comments discuss attempts to reproduce the error, with one user unable to replicate it and suggesting trying a different PyTorch version. The original poster confirms using transformers version 4.53.0, and another user identifies a change in the transformers library as the cause of the issue. A workaround is suggested using export_with_dynamic_cache to avoid the regression by switching the sdpa version.
- Number of comments this week: 8
- [autograd] Slowdown in backward after #151079: This issue reports a performance regression in the PyTorch library's autograd backward pass, specifically after a recent pull request (#151079), where a simple matrix multiplication benchmark shows a slowdown of over 5%. The user provides a detailed reproduction script and mentions the slowdown is observed on different hardware setups, suggesting the cause might be related to additional event recording during the backward pass.
- The comments discuss potential causes of the slowdown, with some users unable to reproduce the issue and others suggesting it might be due to additional eventWait calls. A user provides a detailed reproduction script, and another user suggests a fix related to event recording, which is acknowledged and will be tested.
- Number of comments this week: 8
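The reproduction script from the autograd slowdown item above is not reproduced in this report; the following is only a minimal sketch of that general style of matmul forward/backward micro-benchmark, run here on CPU for simplicity:

```python
import time
import torch

# Hedged sketch of a matmul backward micro-benchmark, not the issue's exact script.
a = torch.randn(2048, 2048, requires_grad=True)
b = torch.randn(2048, 2048)

def step():
    (a @ b).sum().backward()
    a.grad = None                       # reset so gradients are not accumulated across steps

for _ in range(5):                      # warm-up iterations
    step()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    step()
elapsed_ms = (time.perf_counter() - t0) / iters * 1e3
print(f"avg forward+backward: {elapsed_ms:.2f} ms")
# On GPU, torch.cuda.synchronize() around the timed region would be needed for meaningful numbers.
```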
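The transformers output-difference discussion above contrasts an "amax" (maximum absolute difference) check with an rtol/atol check. A small sketch of the two criteria, with illustrative tensors and tolerances rather than the values from the affected tests:

```python
import torch

expected = torch.randn(4, 1024, dtype=torch.float32)
actual = expected + 1e-5 * torch.randn_like(expected)   # simulate a small numerical drift

# "amax" criterion: a single absolute threshold on the largest elementwise difference.
amax = (actual - expected).abs().max()
print("max abs diff:", amax.item())

# rtol/atol criterion: |actual - expected| <= atol + rtol * |expected|,
# which scales the allowed error with the magnitude of the reference value.
print("allclose:", torch.allclose(actual, expected, rtol=1e-4, atol=1e-6))
```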
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend, and is likely related to compatibility or versioning issues with the Triton library.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing MaxPool2D in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached cuda_utils.so file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the inability to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are exempt from this formatting standard. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature request is to reduce the size of TorchScript files, especially for small models with quantization, to make them more suitable for deployment on mobile devices, as demonstrated by the user's experience where removing these files manually resulted in a substantial reduction in file size without affecting model functionality.
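A minimal sketch of the MaxPool2D decomposition proposed in the second stale issue above, checking the claimed stride-1 equivalence (the input shape and padding choices here are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# Kernel 5 with stride 1 equals two stacked kernel-3 pools (stride 1, "same" padding),
# since max-pooling with stride 1 composes like a morphological dilation.
pool5 = F.max_pool2d(x, kernel_size=5, stride=1, padding=2)
pool3x2 = F.max_pool2d(F.max_pool2d(x, 3, 1, 1), 3, 1, 1)
print(torch.equal(pool5, pool3x2))   # True

# Kernel 7 equals three stacked kernel-3 pools under the same conditions.
pool7 = F.max_pool2d(x, kernel_size=7, stride=1, padding=3)
pool3x3 = F.max_pool2d(pool3x2, 3, 1, 1)
print(torch.equal(pool7, pool3x3))   # True
```

Whether the stacked form is actually faster on a given CPU is the issue's claim, not something this sketch measures.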
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 87
Summarized Issues:
- Unexpected NaN Propagation in PyTorch Conv1d on macOS: This issue involves unexpected NaN propagation in causal 1D convolutions using PyTorch's Conv1d on macOS. The occurrence of additional NaNs in the output is dependent on the batch size and device, specifically appearing with larger batch sizes on CPU but not on MPS, and is suspected to be related to NNPACK.
- Illegal Memory Access Errors in PyTorch 2.8 with Triton: A bug in the PyTorch 2.8 branch causes illegal memory access (IMA) errors when using on-device TMA and AOTI with Triton. These errors have been fixed in a pull request but need to be cherry-picked into the release/2.8 branch to address compatibility issues with the updated Triton 3.4 API.
- Test Failures and Warnings in PyTorch: The test_tensor_with_grad_to_scalar_warning fails when run as part of the test_torch.py suite due to a one-time warning being triggered and silenced by a preceding test. Wrapping certain function calls in a with torch.no_grad(): block could resolve the problem.
- Type Checking Issues in PyTorch Integer-Only Operators: Improved type checking is needed for certain integer-only operators in PyTorch that currently throw a TypeError at runtime but are incorrectly allowed by the type system. A two-commit solution is proposed to expand existing type-checking tests and fix the typing issues to prevent regressions.
- NaN Return Bug in PyTorch CPU Implementation: A bug in the PyTorch library causes the CPU implementation of torch.reciprocal and torch.divide functions to incorrectly return NaN for complex infinity inputs when the tensor has four or more elements. The GPU implementation and CPU implementation for tensors with fewer than four elements correctly return zero.
- AttributeError in Transformers with PyTorch 2.8 RC: The test test_can_compile_fast_image_processor in the transformers library passes with torch 2.7 but fails with torch 2.8 RC due to an AttributeError related to the BitImageProcessorFast object not having the __wrapped__ attribute. This affects multiple models running on a single GPU in the CI suite.
- Output Differences in Transformers Tests with PyTorch 2.8 RC: A significant increase in output differences is observed in some transformers tests when using the torch 2.8 RC version. The discrepancy between expected and actual outputs has grown from a magnitude of 1e-9 to 1e-5, potentially affecting numerous tests, particularly those involving models with vision components.
- PTXAS Error in Transformers Tests on T4 GPUs with PyTorch 2.8 RC: Several transformers tests fail when using torch 2.8 RC on T4 GPUs due to a PTXAS error related to Triton configurations. These tests pass on A10 GPUs and with torch 2.7.1, indicating a regression potentially linked to the Triton version used in the torch 2.8 RC release.
- Inefficient Operations in PyTorch's run_decompositions Function: The run_decompositions function in PyTorch generates inefficient operations by creating unnecessary slices and using slice_scatter to copy data. Optimizing this pattern could be beneficial, especially since it is a common occurrence in large language models (LLMs).
- Outdated Technologies in PyTorch ONNX Documentation: Several PyTorch documentation pages related to ONNX need auditing and updating to ensure they do not contain outdated technologies such as FSDP1, TorchScript, Torchserve, FX tracing, and old TorchMobile. Necessary changes must be made by July 31 to avoid deprecation of the tutorials.
- Timeout Error in PyTorch Elastic Framework: A user encounters a timeout error while waiting on an exit barrier in the PyTorch Elastic framework and inquires about the possibility of increasing the barrier wait timeout. The issue references a specific line in the code and tags several contributors for assistance.
- Symmetric Memory Test Failure with NVSHMEM in PyTorch: A symmetric memory test fails in the PyTorch project with the environment variable TORCH_SYMMMEM set to NVSHMEM, resulting in a runtime error due to the allocation backend NVSHMEM not being found. Tests pass without this environment variable set.
- Regression in Exporting Llama2 Model with PyTorch 2.8.0 Nightly: A regression occurs when exporting the Llama2 model using the latest PyTorch 2.8.0 nightly build, where an assertion error related to fake tensors occurs. This is potentially due to a change in the transformers library that affects the exportability of the model.
- Dynamic Shape Constraint Failures in Hugging Face Models: Several Hugging Face models fail in the cudagraph_dynamic shape configuration due to error guard failures related to dynamic shape constraints being violated. These models have been intentionally excluded from the Inductor Dashboard HUD as of June 30, 2025.
- Performance Discrepancy in PyTorch 2D Depthwise Convolution: A 2D depthwise convolution implemented in PyTorch is observed to run approximately three times slower and consume significantly more power compared to an equivalent implementation using JAX/XLA on a GPU.
- ImportError in PyTorch Project with Scaled MM Configs: An ImportError occurs in a PyTorch project where the code fails to import 'scaled_mm_configs' from 'torch._inductor.kernel.mm_common'. The error message hints at 'scaled_mm_options' instead, suggesting a possible typo or missing module.
- Performance Discrepancy in PyTorch's RMSNorm: PyTorch's implementation of RMSNorm is unexpectedly slower than LayerNorm despite theoretical expectations of a speedup. Significant degradation is observed across various input sizes, particularly focusing on C=1024.
- Incorrect CUDA Version in PyTorch CMake File: A bug in the cmake/public/cuda.cmake file results in a missing CUDA version in the error message. The variable cuda_version_from_findcuda is not properly set, leading to an empty value being displayed instead of the expected nvcc-reported version.
- GuardOnDataDependentSymNode Failure in PyTorch Export: The torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode fails to guard on a data-dependent expression during the export process of a PyTorch model using torch.export.export. This results in a runtime error due to the inability to handle expressions involving infinite values in the input tensor.
- Discrepancy in torch.Tensor.addmm_ Function Results: The torch.Tensor.addmm_ function in PyTorch shows a significant discrepancy in computation results, where the output on a CPU significantly differs from the expected results on GPU and NPU. This suggests a potential bug in the CPU implementation or configuration.
- Incorrect Outputs with nn.Linear Layers in PyTorch: Compiling a neural network model containing nn.Linear layers with the mode="reduce-overhead" option in PyTorch results in incorrect outputs for large inputs. The failure is dependent on the GPU model and size of the input, and it is suggested that the problem may be related to synchronization issues in the tutorial code rather than in PyTorch itself.
- Test Failure in test_linalg_cholesky on M4 Architecture: The test_linalg_cholesky function fails on the M4 architecture, resulting in a significant mismatch between CPU and MPS outputs, despite the same test passing on the M2 architecture.
- Unreliable CI Tests on MacOS-15 for M2Pro: Continuous integration tests on MacOS-15 for M2Pro have become unreliable again after a brief period of stability. The issue is detailed in the provided links.
- ImportError in PyTorch with CUDA 12.4: PyTorch becomes unusable due to an ImportError related to an undefined symbol in libtorch_cpu.so when CUDA version 12.4 is installed locally. This causes compatibility issues with the cuptiActivityEnableDriverApi function in libcupti.so.12.
- Documentation on Modifying DTensor Model Parameters: There is a need to improve documentation regarding the behavior of modifying DTensor model parameters in PyTorch. It highlights that while users can modify DTensor model parameters after full shard initialization and before the first forward pass, using .data does not work, and it is not possible to update unsharded model parameters.
- Error in Composing HSDP with Context Parallel in PyTorch: Composing Hierarchical Sharded Data Parallel (HSDP) with Context Parallel (CP) in a PyTorch project results in an error when attempting to flatten a device mesh for expert and non-expert parameters. The user seeks clarification on whether this behavior is expected or a bug.
- Mixed Precision Casting Responsibility in PyTorch: There is a discussion on whether the responsibility for implementing mixed precision (mp_policy) casting should lie with the user during the pre all-gather phase or if the FSDP2 framework should handle it. The issue references a specific unit test and a related discussion on INT8 mixed precision training.
- Dynamo Tracing Error with Constant Return Values: A problem arises when a function traced with PyTorch's Dynamo in non-strict mode cannot return a constant value. An error occurs when a function intended to return a constant integer is traced, leading to a failure because the operation returns a non-Tensor value, which is unsupported in the Dynamo FX graph output.
- Performance Degradation in PyTorch After Ubuntu Upgrade: After upgrading Ubuntu from version 22.04 to 24.04, the performance of a deep network written in PyTorch degrades due to unexpected behavior when copying data between GPUs. Some tensor values are incorrectly set to 0 instead of 1, despite no apparent GPU memory usage conflicts.
- RecompileError with torch._dynamo.disable Decorator: Using the @torch._dynamo.disable decorator inside a @torch.compile function causes the function to recompile every time it is called, leading to a RecompileError due to cache invalidation. A suggested workaround is to define the disabled function outside the main function (see the sketch at the end of this section).
- TypeError in PyTorch's AOTI with TorchBind ScriptObjects: A TypeError occurs in PyTorch's AOTI when using TorchBind ScriptObjects, where the torch.equal function fails to handle FakeScriptObject inputs. A proposed fix involves adding logic to bypass this check for FakeScriptObject to allow successful compilation.
- RuntimeError with NestedTensor in PyTorch: A RuntimeError is encountered when attempting to call the unbind function on a 2D NestedTensor in PyTorch. A provided script reproduces the error by creating a nested tensor from ragged data and attempting to unbind it, resulting in a condition check failure.
- Performance Regression in PyTorch Autograd Backward Pass: A performance regression is reported in the PyTorch autograd backward pass, specifically a slowdown of over 5% in matrix multiplication operations following a recent pull request. Discussions focus on potential causes such as additional event recording and suggestions for fixes.
- s390x-Periodic Test Failures in PyTorch CI Pipeline: The s390x-periodic tests in the CI pipeline fail due to the unavailability of a compatible version of cuda-bindings, specifically requiring a version between 12.0 and 13.0. The development team has acknowledged the issue, and a fix is being addressed in a separate pull request.
- Compatibility Problem with einops and torch.compile: A compatibility problem arises between einops version 0.6.1 and the torch.compile function in PyTorch nightlies. Executing a specific code snippet results in a TypeError due to an "unhashable type: non-nested SymInt" error when using the repeat function.
- Discrepancy in torch.Tensor.scatter_ Function Documentation: The PyTorch torch.Tensor.scatter_ function documentation states that the self, index, and src tensors should have the same number of dimensions, but in practice, this validation check is not enforced on CPU and GPU. A validation mechanism needs to be implemented to prevent potential runtime errors.
- Discrepancy in nll_loss Function Input Dimensions: The PyTorch nll_loss function unexpectedly computes a result when both the input and target are 1D tensors, contrary to the documented expected input dimensions.
- Discrepancy in torch.gather Documentation: The torch.gather and Tensor.gather documentation states that the input and index must have the same number of dimensions, yet there is no validation implemented to enforce this requirement for both CPU and GPU operations. This leads to potential runtime errors.
- Precision Errors in torch.quantile Function: An edge case in the torch.quantile function results in inconsistent outputs due to precision errors when calculating quantiles with very close floating-point values. A rewrite is suggested to handle such cases more accurately.
- Regression in resnet50_quantized_qat Model with PyTorch 2.8: A regression in PyTorch 2.8 causes the resnet50_quantized_qat model to fail to run, despite working in PyTorch 2.7. The issue is due to changes introduced by a specific pull request and can be reproduced across different hardware backends like CPU, CUDA, and XPU.
- Challenges with DDP and DTensor-Based Tensor Parallelism: Integrating DistributedDataParallel (DDP) with DTensor-based tensor parallelism in PyTorch presents challenges and unexpected behaviors. Issues include problems with parameter conversion logic, activation checkpointing, and inconsistent propagation of requires_grad, resulting in a stateless model and complications with optimizer interactions.
- Indexing Type in Triton Kernel for FlexAttention: Ensuring the correct indexing type is used in the Triton kernel for the FlexAttention component is necessary when dimension sizes grow significantly beyond what int32 indexing can address, as indicated in the PyTorch GitHub project.
- Version Check for einops in Dynamo: A problem with the version check for einops in Dynamo suggests the removal of the check for version 0.8.1 as it is the latest version. A related pull request needs to be addressed.
- Graph Break on nn.Parameter Constructors in PyTorch: There is a need to implement a graph break on nn.Parameter constructors in PyTorch's dynamo tracing, as the current support is fragile. Users are advised to initialize nn.Parameters at model initialization time, outside of the compiled region (see the sketch at the end of this section).
- Deprecation of CUTLASS Python Interface: The deprecation of the CUTLASS Python interface affects users of both PyTorch and the existing Python interfaces of CUTLASS. Clarification is sought on PyTorch's plans for adapting to this change, especially considering that the nvidia-cutlass package does not plan to support the Blackwell architecture.
- Error in PyTorch RNN Documentation Pseudocode: A potential error in the pseudocode of the PyTorch RNN documentation suggests that the expression x[t] should be replaced with (x[t] if layer == 0 else h_t[layer - 1]) to ensure accuracy.
- Missing C++ Compiler in PyTorch 2.9.0.dev on Windows 11: Code that previously ran successfully with PyTorch version 2.7 fails with version 2.9.0.dev due to a missing C++ compiler, specifically the 'cl' compiler. This is required for the code to execute properly on a Windows 11 system with CUDA support.
- Test Failures in vLLM Project with PyTorch 2.8rc: Test failures in the vLLM project when using PyTorch 2.8rc are attributed to changes in CUDA memory management and initialization requirements. A proposed solution includes creating a compatibility layer to ensure functionality with both PyTorch 2.7 and 2.8.
- Missing Optional Features in PyTorch NCCL Builds: The PyTorch nightly builds have a problem where the statically linked NCCL (NVIDIA Collective Communications Library) is missing support for several optional features, such as IBVERBS, MLX5DV, SHARP, and RDMACM. This is due to missing RDMA/IB libraries and headers, raising questions about the logic of compiling NCCL from scratch in docker containers versus using a redistributable version from Nvidia.
- Unit Test for load_state_dict in AOTI Models: A unit test is needed to verify that the load_state_dict function correctly updates the weights of an AOTI (Ahead-Of-Time Inductor) model when using user-managed weights. This ensures that changes to the eager model's weights are accurately reflected in the compiled AOTI model.
- Incompatibility Between torch 2.6 and torchvision 0.21.0: A potential incompatibility exists between torch 2.6 and torchvision 0.21.0, as the official PyTorch documentation suggests compatibility, but an error occurs during installation due to conflicting dependencies. The torchvision wheel requires torch version 2.8.0 or higher, leading to confusion about whether the documentation is outdated or the metadata is incorrect.
- Crash in all_to_all_single_autograd Compilation: A crash occurs when attempting to compile the all_to_all_single_autograd operation in PyTorch using torch.compile, due to the lack of support for dynamic shapes. This results in a runtime error related to an internal assertion failure when handling tensor shapes.
- Backward Pass Error with FSDPModule in PyTorch: Setting a custom reduce scatter divide factor for a subset of modules in a model using the FSDPModule.set_reduce_scatter_divide_factor method results in a backward pass error due to an incorrect data type. A suggested solution involves trying a specific pull request that addresses a related issue with bf16 reduce-scatter in ProcessGroupNCCL.
- Support for Quantized ONNX Gather Layer in PyTorch: There is a request to add support for a quantized version of the ONNX Gather layer in the PyTorch project to facilitate Quantization Aware Training (QAT). The current lack of support is causing errors during model quantization.
- Enhancing Coverage of tensor_metadata and redistribute_cost: There is a need to enhance the coverage of tensor_metadata and redistribute_cost for various operation strategies in the PyTorch project. Several operations currently lack complete OpSpec information, which is essential for the AutoParallel feature to compute strategy costs effectively.
- Build Errors in CUDA 12.9 Periodic CI Test: A failure occurs in the CUDA 12.9 periodic CI test for the PyTorch project, where the use of the deprecated cub::TransformInputIterator is causing build errors. There is a need to investigate why this error does not occur in nightly builds and to identify the source of the CCCL warning.
- Deprecation of Support for Older GPU Architectures in PyTorch 2.9: The planned deprecation of support for Maxwell, Pascal, and Volta GPU architectures in the PyTorch 2.9 release follows NVIDIA's announcement that these architectures will no longer receive enhancements and will be unsupported in future CUDA Toolkit versions. Users are prompted to migrate to newer architectures.
- Runtime Error in Qwen Model Inference with PyTorch: A runtime error is encountered during the inference process using the Qwen model, where an internal assertion failure related to NVML_SUCCESS occurs in the PyTorch CUDACachingAllocator. This is potentially due to package compatibility issues.
- Failure to Detect AVX Capabilities in PyTorch: PyTorch fails to detect and utilize AVX capabilities, despite being built and functioning on a system that supports AVX instructions. The output of the torch.__config__.show() command shows "CPU capability usage: NO AVX," even though the environment and operating system recognize and can use AVX instructions properly.
- RuntimeError with FlexAttention Module in PyTorch: A RuntimeError is encountered when using the FlexAttention module in PyTorch, specifically related to a custom mask function that appears to involve data-dependent control flow with a Tensor. This is not currently supported by the vmap implementation, leading to confusion as the user believes they are only using a list for masking.
- Test Failure in test_dtensor_save_load_import: The test_dtensor_save_load_import test fails due to an autoloader that imports torch._dynamo, causing indirect imports of torch.distributed.tensor and preventing the expected exception from being raised in the negative test path. A revision of the test is suggested to account for such import dependencies.
- Uninformative NotImplementedError in PyTorch Functions: Several torch.* functions in PyTorch raise uninformative NotImplementedErrors when called with an integer dtype. It is suggested that these functions should provide clearer error messages or potentially use a more accurate exception type like ValueError, as the current errors are misleading and do not adequately inform users about the correct usage or alternatives.
- Missing cu128 aarch64 Wheels for PyTorch Nightly Builds: The nightly cu128 aarch64 wheels for PyTorch have not been built since June 13th, prompting a request for investigation into the cause. Some comments suggest a migration to CUDA version 12.9 as a potential reason.
- Incompatibility with NVIDIA RTX Pro 6000 in PyTorch Nightly: The latest PyTorch nightly build (2.9.0.dev20250701 + cu12.9) lacks support for the NVIDIA RTX Pro 6000 (Blackwell, SM122 / Compute Capability 12.2), resulting in immediate failures of any CUDA operations due to incompatibility with the current PyTorch installation, which only supports up to CUDA capabilities sm_90 compute_37.
- Discrepancy in Advanced Indexing Between PyTorch and NumPy: A discrepancy exists between PyTorch's torch.compile and NumPy's behavior, specifically related to advanced indexing. The assumption that both libraries handle advanced indices equivalently is incorrect, leading to differences in behavior when indices are separated by a slice.
- Precompilation Failure in PyTorch Inductor Backend: A precompilation failure occurs during the training of nanogpt using the PyTorch Inductor backend, where a KeyError is raised in the compute_loss function due to incomplete compilation of the backward pass. Attempts to autosave the DynamoCache have been made.
- Improving Communication Cost Model for DTensor in PyTorch: There is a need to improve the communication cost model for DTensor in PyTorch, as the current model does not accurately account for all communication costs, such as the additional all_gather operation required in certain redistribution strategies. Leveraging NVIDIA's NCCL cost model is suggested for a more reliable estimation.
- Migration of PT2E Quantization Code in PyTorch: The migration of PT2E quantization code and documentation from the PyTorch repository to the PyTorch AO repository is being tracked. This includes updating internal call sites, adding deprecation warnings, and revising documentation and tutorials to reflect these changes.
- Compatibility Problem with einops and PyTorch Dynamo Compiler: A potential compatibility problem exists between PyTorch version 2.7.1 and einops versions 0.8.2 or 0.9.0, where the PyTorch Dynamo compiler may fail to handle the einops.rearrange function due to the absence of an allow_in_graph for it. This leads to a possible error when attempting to inline through the function.
- Incorrect Inference of groups Parameter in channel_shuffle: A bug in the PyTorch project causes the groups parameter in the channel_shuffle function to be incorrectly inferred as a Tensor type instead of an int when compiling ShuffleNet using torch.jit.script, leading to a runtime error.
- Missing Collective Operations in PyTorch Backward Pass: A bug in the PyTorch project results in the last collective operations, specifically the backward allreduce for Tensor Parallelism (TP) and the backward reduce_scatter for Sharded Parallelism (SP), being missing during the backward pass in both the TP and SP examples and unit tests.
- Failure with torch.compile and torch.vdot() on Complex Tensors: A bug in the PyTorch library causes a failure when using torch.compile() on a model that utilizes torch.vdot() with complex tensors. This results in an unimplemented feature error, causing an AssertionError regardless of whether the code is executed on a CPU or CUDA.
- Runtime Error with torch.ops.prims.broadcast_in_dim.default: A bug in PyTorch occurs when using torch.compile() on a module that calls torch.ops.prims.broadcast_in_dim.default, resulting in a runtime error due to alias annotations on the output tensor. Potential solutions include removing the alias annotation or using .clone() to prevent storage sharing.
- NotImplementedError with flex_attention Module on CPU: A bug is encountered when using torch.compile with the flex_attention module, where a NotImplementedError is raised on CPU due to the query, key, and value being the same buffer. The code runs successfully in eager mode and on CUDA.
- Fixed Batch Size in ONNX Export of ResNet50 Model: A bug occurs when exporting a PyTorch ResNet50 model to ONNX format with a dynamic batch size for variable batch size inferencing. The model is instead exported with a fixed batch size of 1, contrary to the user's expectations.
- Errors with DTensors and TORCH_DISTRIBUTED_DEBUG=DETAIL: Setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL in PyTorch 2.8.0-rc1 causes errors with DTensors due to the NCCL ProcessGroup being wrapped in a ProcessGroupWrapper that does not override all necessary methods. This leads to unimplemented exceptions for certain collective operations.
- File Adapter Error on Windows in PyTorch: A bug in the PyTorch project causes the file_adapter component to fail to correctly read the file_path on Windows, resulting in an error due to the file_name being output as an empty string. This occurs specifically in the pytorch\caffe2\serialize\file_adapter.cc file at line 31.
- Feature Map Extraction in PyTorch Graphs: There is an inquiry about whether there is an official method in the PyTorch toolbox, particularly around pt2 produced by torch.export, to extract the feature map of each node in a graph. This is similar to the functionality provided by torchvision.models.feature_extraction.create_feature_extractor().
- Segmentation Faults in test_ops.py with GCC 13 on AArch64: Segmentation faults occur in specific tests within the test_ops.py file when using GCC 13 on AArch64 architecture, specifically on neoverse-v1, due to a recent pull request. Potential resolutions include reverting the pull request, downgrading GCC for AArch64 images, or disabling GCC's auto-vectorizer.
- Segmentation Fault in PyTorch with gather_object and destroy_process_group: A regression in the PyTorch library causes a script using torch.distributed.gather_object and destroy_process_group to successfully complete its main logic but crash with a segmentation fault upon exit. This problem is not present in an older version of the NVIDIA PyTorch container.
- Assertion Error with torch.einsum and DTensor Inputs: The torch.einsum function fails when used in inference mode with DTensor inputs, resulting in an assertion error due to the absence of a DeviceMesh from the DTensor arguments. A possible solution is to add a missing record for einsum to OpDispatcher.sharding_propagator.op_to_schema_info with the runtime_schema_info.needs_pytree=True property.
- RuntimeError with torchvision::nms Operator in PyTorch: A RuntimeError indicates that the operator torchvision::nms does not exist when using specific versions of PyTorch and Torchvision in a Python 3.12 environment on macOS. The issue persists even with the latest nightly builds and when using conda for installation, suggesting a potential problem with the integration or availability of certain operators in the specified setup.
- Metal4 Update and PyTorch MPS Backend Performance: There is a discussion on whether the Metal4 update, which has been optimized for machine learning and now natively supports tensors, will lead to significant performance improvements and enhanced compatibility for PyTorch's metal performance shader (mps) backend on Mac systems.
- Deadlock in OffsetBasedRNGTracker in PyTorch: A deadlock occurs in the OffsetBasedRNGTracker class's run_state_sync method within PyTorch's distributed tensor module. This is caused by an inconsistent order of the dist.broadcast operation across ranks, leading to synchronization failures when processes do not initialize the tracker consistently.
- Channel-Last Layout Support in PyTorch Convolution Operations: There is a need for PyTorch's convolution operations to support a channel-last layout (e.g., (N,L,C), (N,H,W,C), (N,D,H,W,C)) to improve performance and compatibility with other operations that assume channels in the last axis. Current workarounds are inefficient and cumbersome (see the sketch at the end of this section).
- Version Parsing Error in torch.utils.cpp_extension with Clang: The torch.utils.cpp_extension module fails to correctly parse the version string of Clang when it includes a suffix like '20.1.7+libcxx', leading to a ValueError during the compilation of extensions such as torchvision. This can be resolved by adjusting the version parser.
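For the RecompileError item above, a sketch of the suggested workaround of defining the @torch._dynamo.disable-decorated helper outside the compiled function (function names here are illustrative, not taken from the issue):

```python
import torch

# Defining the disabled helper at module level, outside the compiled function,
# avoids creating a fresh disabled wrapper on every call, which the issue reports
# can invalidate the cache and trigger repeated recompilation.
@torch._dynamo.disable
def eager_only_helper(x):
    return x.tolist()          # something deliberately kept out of the compiled graph

@torch.compile
def fn(x):
    y = x * 2
    eager_only_helper(y)       # graph break here, but no per-call recompilation
    return y + 1

print(fn(torch.arange(4)))
```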
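For the nn.Parameter graph-break item above, a sketch of the advised pattern of constructing parameters at module initialization time rather than inside a compiled region (the module here is illustrative):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Parameters are created once at init time, outside any compiled region.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Avoid constructing nn.Parameter here: building parameters inside a compiled
        # forward is the fragile pattern the issue wants to guard with a graph break.
        return x * self.weight

model = torch.compile(Scale(8))
print(model(torch.randn(2, 8)).shape)
```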
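For the channel-last layout item above, a sketch of the kind of transpose-based workaround currently required when data arrives as (N, L, C) but Conv1d expects (N, C, L); the shapes are illustrative:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

x_nlc = torch.randn(8, 128, 16)           # batch, length, channels (channel-last)

# Current workaround: transpose to (N, C, L), convolve, transpose back.
# The extra transposes and copies are what the issue calls inefficient and cumbersome.
y = conv(x_nlc.transpose(1, 2)).transpose(1, 2)
print(y.shape)                             # torch.Size([8, 128, 32])
```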
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 14
Summarized Issues:
- Bugs in PyTorch operations: Several issues highlight bugs in PyTorch operations, such as incorrect results in Conv2d-unsqueeze-AdaptiveAvgPool3d with Triton backend and stride==0 tensors, and matrix multiplication on MPS backend not checking data type mismatches. These bugs can lead to incorrect results and silent failures, affecting model performance and reliability.
- PyTorch version inconsistencies: Issues have been raised about inconsistencies between PyTorch versions, such as torch.export tests failing in version 2.8.0 but passing in 2.7.1, and a regression identified between versions 2.6 and 2.7. These inconsistencies can cause confusion and require developers to adjust their code or wait for fixes.
- CUDA and PyPI package concerns: There are concerns about the size difference between CUDA 12.6 and 12.9 wheel packages and the availability of only the 12.6 package on PyPI. The size increase is due to added support for the Blackwell architecture, and PyPI's limitations in hosting multiple CUDA versions are noted.
- MPS support on Apple Silicon: Confusion arose over the absence of a dedicated MPS-compatible build for PyTorch 2.7.1 on Apple Silicon M4, though it was clarified that all macOS builds support MPS. This suggests a need for documentation updates to prevent misunderstandings about GPU acceleration support.
- Model compilation and execution errors: Issues with model compilation and execution include a graph break error in YOLOv5 due to time.time function tracing and a failure in the vision_maskrcnn model on H100 and MI300 GPUs. These errors require troubleshooting and guidance to resolve, impacting model deployment.
- AMP training and GradScaler issues: A potential bug in amp.GradScaler during AMP training causes the scale value to drop unexpectedly on multiple GPUs. This behavior contradicts expectations and could affect training stability and performance.
- Typographical errors in documentation and code: Minor typographical errors have been identified in PyTorch documentation and code, such as "see" instead of "seen" and "paramter" instead of "parameter." These errors, while minor, can affect readability and should be corrected.
- Continuous integration and compiler warnings: A CI job was temporarily disabled, requiring a reason and manual updates, and a compiler warning about an unused variable "threshold" was noted. Addressing these issues ensures smoother development and integration processes.
- Torch.compile and user code bugs: An issue with Torch.compile Dynamo failing to execute an FX node with fake tensors was identified as a bug in the user's code. This highlights the importance of debugging user code to ensure proper execution of PyTorch features.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 144
Key Open Pull Requests
1. [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/torch.py: This pull request aims to address part of issue #147913 by replacing the unimplemented function with unimplemented_v2 in the torch/_dynamo/variables/torch.py file, as part of the PyTorch project, and involves multiple updates and contributions, including co-authored commits by William Wen.
- URL: pull/157344
- Merged: No
- Associated Commits: 20d1f, c7c70, 53b5c, b402f, fd8e5, 7d53e, 8b88e, f6854, a0dc3, a7eca, dcfbe, fc5ec, e70be, 0987a, 96080, 4bdd3, 80763, 08c72, a42df, 1673c, 43f68, c023d, f40d3, f0387, b832a, 4f764, 37379, 7ef8b, 51447, 4b6e8, 5168d, 641c9, 474d1, ec5d5, 37d44, a8aaa, aacfe, c9f62, 49dab, 00135, d0c69, d03d5, 17a4e, 335e9, 3df8c, 2ec26, 1f808, 8c0c3, f120d, 69a11, 7a1a5, be7ff, f41d6, 9a3a7, 54714, 61a19, db537
2. [build] make SDist buildable: bootstrap git repo and submodules: This pull request aims to make the source distribution (SDist) of the PyTorch project buildable by bootstrapping the Git repository and its submodules, as part of a stack of changes managed by ghstack, and involves multiple updates across several commits.
- URL: pull/157432
- Merged: No
- Associated Commits: 5efc8, 348da, af9b9, ea88a, ba698, 8e9b5, 1eb1d, 05ae8, f98d9, 7f970, 250f5, 05951, 3ea44, 3156d, af210
3. Fix init CUDA preload: get correct versions (#147001): This pull request addresses the issue of initializing CUDA preload by ensuring the correct versions are selected, primarily by updating the cuda_libs dictionary to search for specific library versions first and using less-specific patterns as a backup, while also implementing a sorting mechanism in _preload_cuda_deps to prioritize newer library versions, and it includes several commits to fix related issues such as type hints and merge conflicts.
- URL: pull/157264
- Merged: No
Other Open Pull Requests
- Code Refactoring and Hook Finalization: This topic involves refactoring code to ensure all registered template hooks are finalized before accessing the template's code object. A new function RenderPartial.finalize_remaining was introduced to finalize any remaining active hooks, and a test was included to verify proper finalization by the scheduler.
- Continuous Integration and Testing Enhancements: Several pull requests focus on improving CI and testing processes. These include re-enabling the ET test, addressing CI failures for CUDA versions greater than 12.9, and resolving NVSHMEM compilation issues by updating CI docker environments.
- Precision and Performance Improvements: Enhancements in computational efficiency and precision handling are addressed by allowing BF16 and TF32 as internal precision for various operations and enabling TF32 for matrix multiplication, linear, and convolution operations in the MKL-DNN backend.
- Documentation and Typo Corrections: Updates to documentation dependencies and typo corrections in backend code are covered. These changes involve fixing build errors, adding new documentation files, and correcting a typo from "inpt" to "input."
- Removal of 'allow-untyped-defs' Directives: A series of pull requests aim to remove the 'allow-untyped-defs' directive from various files in the PyTorch project. These changes are part of a series managed through the ghstack tool and include multiple commits marked as "[ghstack-poisoned]."
- Feature Implementations and Enhancements: New features and enhancements include a backward pass for max_pool3d for MPS, AI-generated inductor lowerings for pooling functions, and a work-in-progress feature for saving and loading compiled models with torch._dynamo.save/load().
- Bug Fixes and Issue Resolutions: Bug fixes include addressing an integer overflow in the FlexAttention kernel and resolving an issue with einops and torch.compile interaction by reverting code to a previous state.
- Experimental and Diagnostic Efforts: Experimental attempts to diagnose and resolve issues include trying different approaches on the CI system for an issue that cannot be reproduced locally and ensuring consistent behavior of integer attributes within [dynamo][fsdp] components.
- Optimization and Code Efficiency: Efforts to optimize code include avoiding unnecessary slices and using reduceOpSum for scenarios where the world size is 1, as well as updating scripts to utilize torch.accelerator and ensuring compatibility with device count checks (see the sketch below).
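A minimal sketch of a device-agnostic check in the spirit of the torch.accelerator update above, assuming the torch.accelerator module exposes device_count() and current_accelerator() as in recent PyTorch releases; this is not code from the pull request itself:

```python
import torch

# Pick an accelerator if one is present, otherwise fall back to CPU.
if torch.accelerator.device_count() > 0:
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

x = torch.ones(4, device=device)
print(device, x.device)
```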
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 124
Key Closed Pull Requests
1. Fix docs issue 153531: This pull request addresses a documentation issue by providing a more detailed explanation of the Short-Time Fourier Transform (STFT) function in torch/functional.py, specifically clarifying the use of n_fft instead of win_length in the STFT equation exponent, with changes made between lines 598 and 614 (see the formula sketch after this list).
- URL: pull/157561
- Merged: No
- Associated Commits: 54aca, 24397, 99718, 12bcf, 0863b, 44d11, e57f0, 15185, 1b702, 8be26, 7f55e, 6ca19, da4bf, e8ebe, 5a4f1, 2ad9c, f9e2b, 32e18, df3ca, bc244, 9fd51, be254, 596bb, 444e1, ce29e, 0288d, 953c9, ab750, 1a3e3, 7e97e, 4cf10, 738b4, 96d2d, 1c8ba, 3a44b, 24903, 8ac9b, 44ab7, 0cd06, 55d10, 574f4, e9956, a412d, d65d0, b126b, 6a3a3, b9814, bac09, 4ae86, 7b436, 2304d, bbfcf, 95ea4, 445b0, e80c8, 4f882, dcaee, 24e47, 94035, eef51, 0aa3f, 3eaae, 07779, f00f0, 6c8c5, f9386, 56a20, 3184b, d37ef, da3f5, 5ba8a, 49022, abe17, c1f8e, 13a51, 9e6f4, 39901, 1fee5
2. [Distributed] Add check to verify if local shards in FSDP have tensors before accessing: This pull request addresses an issue in the PyTorch project where a test in the Fully Sharded Data Parallel (FSDP) module was causing a "list index out of range" error by attempting to access local_shards[0] without verifying the existence of tensors on all ranks, and it introduces a check to ensure that local shards contain tensors before accessing them, thereby resolving the error across all cases.
- URL: pull/157275
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55, 7dade, 580aa, c0f57, 6dedb, 20d07, 124ff, 0bea1, 7409a, 636cb, 0d5a8, 3826e, cb711, d6cd1, 624be, 8d8c5, 1cf78, 41475, 0628c, b0d93, 58eb8, e558e, 83ac5, 0e7a7, a2b2f, 8de00, 91f5d, 39e6c, 06d6c, a0590, 31ddf, 9a6df, 50cb9, 08559, f6a8c, dc0be, 9fd7e, cce98, cf3bf, e3957, cd29a, 7b81c, 42401, 1dee9, c3f9f
3. Update _linux-test to support B200 runner: This pull request aims to update the _linux-test workflow to support the B200 runner by enabling OIDC for ECR access and S3 stats upload, disabling sccache connection to S3 on B200, and making necessary adjustments to permissions and environment variables across related workflows.
- URL: pull/157341
- Merged: No
- Associated Commits: 8f291, 9cae9, b1172, 7a318, 7a56b, b9d5f, d0cd6, c7316, 170ea, db0af, 6615d, 20963, 6c17f, a23be, 3df91, 45d6a
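Key closed pull request 1 above concerns the exponent of the STFT equation in torch/functional.py. As a sketch that paraphrases the documented definition rather than quoting the updated docs verbatim, the clarified point is that the exponent divides by n_fft, not win_length:

X[\omega, m] = \sum_{k=0}^{\text{win\_length}-1} \text{window}[k] \, \text{input}[m \cdot \text{hop\_length} + k] \, \exp\!\left(-j \, \frac{2\pi\, \omega k}{\text{n\_fft}}\right)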
Other Closed Pull Requests
- CUDA and Triton Updates: This topic covers updates and fixes related to CUDA and Triton in the PyTorch project. One pull request addresses critical fixes for CUDA 12.9 aarch64 GPU builds by updating the CUDA_STABLE variable and incorporating a missed Triton change. Another pull request addresses an ImportError issue by updating the triton_key import to be compatible with the latest version of Triton.
- Code Refactoring and Maintenance: Several pull requests focus on code refactoring and maintenance within the PyTorch project. One pull request aims to refactor the setup.py file by replacing os.path.* functions with pathlib.Path for better readability (see the sketch after this list). Another pull request involves relocating functions to improve code organization and maintainability.
- Testing and CI Improvements: This topic includes pull requests aimed at improving testing and continuous integration (CI) processes. One pull request introduces a new CI job to test multiple versions of the einops library with torch.compile. Another pull request addresses a CI failure by fixing the dependency order in the CMake build for AOT inductor.
- Performance and Optimization: Pull requests under this topic focus on performance enhancements and optimizations. One pull request addresses slow performance in CUDA-11.3 by adding an exit condition for NaN values in special operations. Another pull request involves porting passes to bucket all_gathers to optimize performance and enhance memory optimizations.
- Type and Serialization Enhancements: This topic covers enhancements related to type handling and serialization. One pull request aims to remove the 'allow-untyped-defs' option from specific files as part of a series of changes. Another pull request makes the "DUPLICATED_INPUT" guard serializable by ensuring it can be safely serialized and reconstructed.
- Inductor and Kernel Improvements: Pull requests in this category focus on improvements to the Inductor component and kernel definitions. One pull request addresses an issue where Triton kernel definitions can break if they contain triple quotes. Another pull request enhances the NVSHMEM discovery process by extending the search to include system locations.
- Experimental Features and Enhancements: This topic includes pull requests introducing experimental features and enhancements. One pull request involves running an experiment to enable a "keep going" feature on the main branch. Another pull request proposes the addition of a progressive compile mode, although it was not merged.
- Bug Fixes and Issue Resolutions: Pull requests under this topic address bug fixes and issue resolutions. One pull request addresses a bug related to the dict(mapping_proxy) functionality within the Dynamo component. Another pull request involves additional testing of Python arithmetic operators when applied between tensors and scalars.
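For the setup.py refactor noted above, a small sketch of the general os.path-to-pathlib pattern involved; the paths are illustrative and not taken from the pull request:

```python
import os.path
from pathlib import Path

# Old style: string-based path manipulation with os.path.* helpers.
root_old = os.path.dirname(os.path.realpath(__file__))
build_old = os.path.join(root_old, "build", "lib")

# New style: the same locations expressed with pathlib.Path objects.
root_new = Path(__file__).resolve().parent
build_new = root_new / "build" / "lib"

assert str(build_new) == build_old   # both spell the same filesystem location
print(build_new)
```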
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
XuehaiPan | 508 | 48 | 3 | 29 |
bobrenjc93 | 231 | 50 | 0 | 15 |
malfet | 160 | 23 | 10 | 102 |
atalman | 104 | 12 | 14 | 20 |
guilhermeleobas | 106 | 24 | 1 | 1 |
williamwen42 | 24 | 8 | 5 | 92 |
svekars | 47 | 3 | 0 | 72 |
Skylion007 | 18 | 5 | 3 | 93 |
davidberard98 | 79 | 12 | 6 | 12 |
guangyey | 75 | 7 | 3 | 23 |