Weekly GitHub Report for PyTorch: June 30, 2025 - July 07, 2025 (12:01:05)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release marks a shift away from publishing on Conda, with a focus on using official wheel packages, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load, enhancing security.
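Because the new weights_only=True default is backward incompatible, here is a minimal sketch of how existing loading code is affected (file names are illustrative, not taken from the release notes):

```python
import torch

# A plain state_dict (tensors and primitive containers) loads fine under the new default.
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")
state = torch.load("model.pt")          # weights_only=True is now the default
model.load_state_dict(state)

# Checkpoints that pickle arbitrary Python objects (here, an optimizer instance)
# would be rejected under the new default and need an explicit opt-out, which
# should only be used for files from a trusted source.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
torch.save({"step": 10, "optimizer": opt}, "ckpt.pt")
ckpt = torch.load("ckpt.pt", weights_only=False)    # explicit opt-out for pickled objects
print(ckpt["step"])
```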
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- How to compose HSDP with CP?: This issue involves a user encountering difficulties while attempting to compose Hierarchical Sharded Data Parallel (HSDP) with Context Parallel (CP) in a PyTorch project, specifically when trying to flatten a device mesh for expert and non-expert parameters. The user is unsure if the problem is due to a bug or a misunderstanding of the intended behavior of the DeviceMesh, and seeks guidance on whether a flattened mesh can be used as the replication dimension for HSDP.
- The comments discuss potential solutions and clarifications regarding the issue, with contributors suggesting different methods for sharding and highlighting the limitations of the current implementation. A contributor identifies that the problem arises from the inability to slice out non-contiguous flattened dimensions, and a pull request is mentioned to improve error messaging. The conversation also includes a follow-up question about configuring the setup correctly, indicating ongoing challenges with the implementation.
- Number of comments this week: 10
- Unexpected, batch size and device dependent NaN propagation in Conv1d: This issue describes a bug in PyTorch where unexpected NaN propagation occurs during causal 1D convolutions, and the problem is dependent on the batch size and the device used. Specifically, when sequences in a batch contain NaN elements at their ends, the convolution operation produces additional NaNs at the boundary between NaN and non-NaN elements, but this behavior is inconsistent across different devices.
- The comments discuss attempts to reproduce the issue on various hardware configurations, with some users confirming the problem on Apple CPUs and others unable to reproduce it on different ARM architectures. It is noted that the issue seems specific to macOS and occurs with certain data types and batch sizes, with a potential link to NNPACK being investigated.
- Number of comments this week: 8
- torch 2.8 RC gives 10000x larger output difference in some transformers tests: This issue highlights a significant increase in output differences in some transformers tests when using the torch 2.8 RC version, where the discrepancy has grown from a magnitude of 1e-9 to 1e-5, raising concerns about potential impacts on integration tests. The problem is particularly evident in models with a vision component, and the user is seeking assistance from the torch team to investigate the cause before updating the expected outputs in the transformers tests.
- The comments discuss whether the issue pertains to eager or compiler use cases, confirm that the tests use torch.float32, and suggest that the differences might be due to different cuBLAS kernels or accumulation order. The user is advised to use cuBLAS logging or nsys nvprof to investigate kernel selection, and it is noted that using "amax" is a pessimistic approach compared to rtol + atol for determining acceptable differences (see the tolerance sketch after this list).
- Number of comments this week: 8
- Regression in llama2 model export: This issue reports a regression in the export functionality of the Llama2 model using the PyTorch 2.8.0 nightly build, which results in a fake tensor error that was not present in previous versions. The user is uncertain whether this is a regression or a user error and seeks the attention of the exporter team due to an upcoming release.
- The comments discuss attempts to reproduce the error, with one user unable to replicate it and suggesting trying a different PyTorch version. The original poster confirms using transformers version 4.53.0, and another user identifies a change in the transformers library as the cause of the issue. A workaround is suggested using export_with_dynamic_cache to avoid the regression by switching the sdpa version.
- Number of comments this week: 8
- [autograd] Slowdown in backward after #151079: This issue reports a performance regression in the PyTorch library's autograd backward pass, specifically after a recent pull request (#151079), where a simple matrix multiplication benchmark shows a slowdown of over 5%. The user provides a detailed reproduction script and mentions the slowdown is observed on different hardware setups, suggesting the cause might be related to additional event recording during the backward pass.
- The comments discuss potential causes of the slowdown, with some users unable to reproduce the issue and others suggesting it might be due to additional eventWait calls. A user provides a detailed reproduction script, and another user suggests a fix related to event recording, which is acknowledged and will be tested.
- Number of comments this week: 8
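The reproduction script from the autograd slowdown item above is not reproduced in this report; the following is only a minimal sketch of that general style of matmul forward/backward micro-benchmark, run here on CPU for simplicity:

```python
import time
import torch

# Hedged sketch of a matmul backward micro-benchmark, not the issue's exact script.
a = torch.randn(2048, 2048, requires_grad=True)
b = torch.randn(2048, 2048)

def step():
    (a @ b).sum().backward()
    a.grad = None                       # reset so gradients are not accumulated across steps

for _ in range(5):                      # warm-up iterations
    step()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    step()
elapsed_ms = (time.perf_counter() - t0) / iters * 1e3
print(f"avg forward+backward: {elapsed_ms:.2f} ms")
# On GPU, torch.cuda.synchronize() around the timed region would be needed for meaningful numbers.
```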
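The transformers output-difference discussion above contrasts an "amax" (maximum absolute difference) check with an rtol/atol check. A small sketch of the two criteria, with illustrative tensors and tolerances rather than the values from the affected tests:

```python
import torch

expected = torch.randn(4, 1024, dtype=torch.float32)
actual = expected + 1e-5 * torch.randn_like(expected)   # simulate a small numerical drift

# "amax" criterion: a single absolute threshold on the largest elementwise difference.
amax = (actual - expected).abs().max()
print("max abs diff:", amax.item())

# rtol/atol criterion: |actual - expected| <= atol + rtol * |expected|,
# which scales the allowed error with the magnitude of the reference value.
print("allclose:", torch.allclose(actual, expected, rtol=1e-4, atol=1e-6))
```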
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend, and is likely related to compatibility or versioning issues with the Triton library.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing MaxPool2D in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached cuda_utils.so file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the inability to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are exempt from this formatting standard. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature request is to reduce the size of TorchScript files, especially for small models with quantization, to make them more suitable for deployment on mobile devices, as demonstrated by the user's experience where removing these files manually resulted in a substantial reduction in file size without affecting model functionality.
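A minimal sketch of the MaxPool2D decomposition proposed in the second stale issue above, checking the claimed stride-1 equivalence (the input shape and padding choices here are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# Kernel 5 with stride 1 equals two stacked kernel-3 pools (stride 1, "same" padding),
# since max-pooling with stride 1 composes like a morphological dilation.
pool5 = F.max_pool2d(x, kernel_size=5, stride=1, padding=2)
pool3x2 = F.max_pool2d(F.max_pool2d(x, 3, 1, 1), 3, 1, 1)
print(torch.equal(pool5, pool3x2))   # True

# Kernel 7 equals three stacked kernel-3 pools under the same conditions.
pool7 = F.max_pool2d(x, kernel_size=7, stride=1, padding=3)
pool3x3 = F.max_pool2d(pool3x2, 3, 1, 1)
print(torch.equal(pool7, pool3x3))   # True
```

Whether the stacked form is actually faster on a given CPU is the issue's claim, not something this sketch measures.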
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 87
Summarized Issues:
- Unexpected NaN Propagation in PyTorch Conv1d on macOS: This issue involves unexpected NaN propagation in causal 1D convolutions using PyTorch's Conv1d on macOS. The occurrence of additional NaNs in the output is dependent on the batch size and device, specifically appearing with larger batch sizes on CPU but not on MPS, and is suspected to be related to NNPACK.
- Illegal Memory Access Errors in PyTorch 2.8 with Triton: A bug in the PyTorch 2.8 branch causes illegal memory access (IMA) errors when using on-device TMA and AOTI with Triton. These errors have been fixed in a pull request but need to be cherry-picked into the release/2.8 branch to address compatibility issues with the updated Triton 3.4 API.
- Test Failures and Warnings in PyTorch: The test_tensor_with_grad_to_scalar_warning fails when run as part of the test_torch.py suite due to a one-time warning being triggered and silenced by a preceding test. Wrapping certain function calls in a with torch.no_grad(): block could resolve the problem.
- Type Checking Issues in PyTorch Integer-Only Operators: Improved type checking is needed for certain integer-only operators in PyTorch that currently throw a TypeError at runtime but are incorrectly allowed by the type system. A two-commit solution is proposed to expand existing type-checking tests and fix the typing issues to prevent regressions.
- NaN Return Bug in PyTorch CPU Implementation: A bug in the PyTorch library causes the CPU implementation of torch.reciprocal and torch.divide functions to incorrectly return NaN for complex infinity inputs when the tensor has four or more elements. The GPU implementation and CPU implementation for tensors with fewer than four elements correctly return zero.
- AttributeError in Transformers with PyTorch 2.8 RC: The test test_can_compile_fast_image_processor in the transformers library passes with torch 2.7 but fails with torch 2.8 RC due to an AttributeError related to the BitImageProcessorFast object not having the __wrapped__ attribute. This affects multiple models running on a single GPU in the CI suite.
- Output Differences in Transformers Tests with PyTorch 2.8 RC: A significant increase in output differences is observed in some transformers tests when using the torch 2.8 RC version. The discrepancy between expected and actual outputs has grown from a magnitude of 1e-9 to 1e-5, potentially affecting numerous tests, particularly those involving models with vision components.
- PTXAS Error in Transformers Tests on T4 GPUs with PyTorch 2.8 RC: Several transformers tests fail when using torch 2.8 RC on T4 GPUs due to a PTXAS error related to Triton configurations. These tests pass on A10 GPUs and with torch 2.7.1, indicating a regression potentially linked to the Triton version used in the torch 2.8 RC release.
- Inefficient Operations in PyTorch's run_decompositions Function: The run_decompositions function in PyTorch generates inefficient operations by creating unnecessary slices and using slice_scatter to copy data. Optimizing this pattern could be beneficial, especially since it is a common occurrence in large language models (LLMs).
- Outdated Technologies in PyTorch ONNX Documentation: Several PyTorch documentation pages related to ONNX need auditing and updating to ensure they do not contain outdated technologies such as FSDP1, TorchScript, Torchserve, FX tracing, and old TorchMobile. Necessary changes must be made by July 31 to avoid deprecation of the tutorials.
- Timeout Error in PyTorch Elastic Framework: A user encounters a timeout error while waiting on an exit barrier in the PyTorch Elastic framework and inquires about the possibility of increasing the barrier wait timeout. The issue references a specific line in the code and tags several contributors for assistance.
- Symmetric Memory Test Failure with NVSHMEM in PyTorch: A symmetric memory test fails in the PyTorch project with the environment variable TORCH_SYMMMEM set to NVSHMEM, resulting in a runtime error due to the allocation backend NVSHMEM not being found. Tests pass without this environment variable set.
- Regression in Exporting Llama2 Model with PyTorch 2.8.0 Nightly: A regression occurs when exporting the Llama2 model using the latest PyTorch 2.8.0 nightly build, where an assertion error related to fake tensors occurs. This is potentially due to a change in the transformers library that affects the exportability of the model.
- Dynamic Shape Constraint Failures in Hugging Face Models: Several Hugging Face models fail in the cudagraph_dynamic shape configuration due to error guard failures related to dynamic shape constraints being violated. These models have been intentionally excluded from the Inductor Dashboard HUD as of June 30, 2025.
- Performance Discrepancy in PyTorch 2D Depthwise Convolution: A 2D depthwise convolution implemented in PyTorch is observed to run approximately three times slower and consume significantly more power compared to an equivalent implementation using JAX/XLA on a GPU.
- ImportError in PyTorch Project with Scaled MM Configs: An ImportError occurs in a PyTorch project where the code fails to import 'scaled_mm_configs' from 'torch._inductor.kernel.mm_common'. The error message hints at 'scaled_mm_options' instead, suggesting a possible typo or missing module.
- Performance Discrepancy in PyTorch's RMSNorm: PyTorch's implementation of RMSNorm is unexpectedly slower than LayerNorm despite theoretical expectations of a speedup. Significant degradation is observed across various input sizes, particularly focusing on C=1024.
- Incorrect CUDA Version in PyTorch CMake File: A bug in the cmake/public/cuda.cmake file results in a missing CUDA version in the error message. The variable cuda_version_from_findcuda is not properly set, leading to an empty value being displayed instead of the expected nvcc-reported version.
- GuardOnDataDependentSymNode Failure in PyTorch Export: The torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode fails to guard on a data-dependent expression during the export process of a PyTorch model using torch.export.export. This results in a runtime error due to the inability to handle expressions involving infinite values in the input tensor.
- Discrepancy in torch.Tensor.addmm_ Function Results: The torch.Tensor.addmm_ function in PyTorch shows a significant discrepancy in computation results, where the output on a CPU significantly differs from the expected results on GPU and NPU. This suggests a potential bug in the CPU implementation or configuration.
- Incorrect Outputs with nn.Linear Layers in PyTorch: Compiling a neural network model containing nn.Linear layers with the mode="reduce-overhead" option in PyTorch results in incorrect outputs for large inputs. The failure is dependent on the GPU model and size of the input, and it is suggested that the problem may be related to synchronization issues in the tutorial code rather than in PyTorch itself.
- Test Failure in test_linalg_cholesky on M4 Architecture: The test_linalg_cholesky function fails on the M4 architecture, resulting in a significant mismatch between CPU and MPS outputs, despite the same test passing on the M2 architecture.
- Unreliable CI Tests on MacOS-15 for M2Pro: Continuous integration tests on MacOS-15 for M2Pro have become unreliable again after a brief period of stability. The issue is detailed in the provided links.
- ImportError in PyTorch with CUDA 12.4: PyTorch becomes unusable due to an ImportError related to an undefined symbol in libtorch_cpu.so when CUDA version 12.4 is installed locally. This causes compatibility issues with the cuptiActivityEnableDriverApi function in libcupti.so.12.
- Documentation on Modifying DTensor Model Parameters: There is a need to improve documentation regarding the behavior of modifying DTensor model parameters in PyTorch. It highlights that while users can modify DTensor model parameters after full shard initialization and before the first forward pass, using .data does not work, and it is not possible to update unsharded model parameters.
- Error in Composing HSDP with Context Parallel in PyTorch: Composing Hierarchical Sharded Data Parallel (HSDP) with Context Parallel (CP) in a PyTorch project results in an error when attempting to flatten a device mesh for expert and non-expert parameters. The user seeks clarification on whether this behavior is expected or a bug.
- Mixed Precision Casting Responsibility in PyTorch: There is a discussion on whether the responsibility for implementing mixed precision (mp_policy) casting should lie with the user during the pre all-gather phase or if the FSDP2 framework should handle it. The issue references a specific unit test and a related discussion on INT8 mixed precision training.
- Dynamo Tracing Error with Constant Return Values: A problem arises when a function traced with PyTorch's Dynamo in non-strict mode cannot return a constant value. An error occurs when a function intended to return a constant integer is traced, leading to a failure because the operation returns a non-Tensor value, which is unsupported in the Dynamo FX graph output.
- Performance Degradation in PyTorch After Ubuntu Upgrade: After upgrading Ubuntu from version 22.04 to 24.04, the performance of a deep network written in PyTorch degrades due to unexpected behavior when copying data between GPUs. Some tensor values are incorrectly set to 0 instead of 1, despite no apparent GPU memory usage conflicts.
- RecompileError with torch._dynamo.disable Decorator: Using the @torch._dynamo.disable decorator inside a @torch.compile function causes the function to recompile every time it is called, leading to a RecompileError due to cache invalidation. A suggested workaround is to define the disabled function outside the main function (see the sketch at the end of this section).
- TypeError in PyTorch's AOTI with TorchBind ScriptObjects: A TypeError occurs in PyTorch's AOTI when using TorchBind ScriptObjects, where the torch.equal function fails to handle FakeScriptObject inputs. A proposed fix involves adding logic to bypass this check for FakeScriptObject to allow successful compilation.
- RuntimeError with NestedTensor in PyTorch: A RuntimeError is encountered when attempting to call the unbind function on a 2D NestedTensor in PyTorch. A provided script reproduces the error by creating a nested tensor from ragged data and attempting to unbind it, resulting in a condition check failure.
- Performance Regression in PyTorch Autograd Backward Pass: A performance regression is reported in the PyTorch autograd backward pass, specifically a slowdown of over 5% in matrix multiplication operations following a recent pull request. Discussions focus on potential causes such as additional event recording and suggestions for fixes.
- s390x-Periodic Test Failures in PyTorch CI Pipeline: The s390x-periodic tests in the CI pipeline fail due to the unavailability of a compatible version of cuda-bindings, specifically requiring a version between 12.0 and 13.0. The development team has acknowledged the issue, and a fix is being addressed in a separate pull request.
- Compatibility Problem with einops and torch.compile: A compatibility problem arises between einops version 0.6.1 and the torch.compile function in PyTorch nightlies. Executing a specific code snippet results in a TypeError due to an "unhashable type: non-nested SymInt" error when using the repeat function.
- Discrepancy in torch.Tensor.scatter_ Function Documentation: The PyTorch torch.Tensor.scatter_ function documentation states that the self, index, and src tensors should have the same number of dimensions, but in practice, this validation check is not enforced on CPU and GPU. A validation mechanism needs to be implemented to prevent potential runtime errors.
- Discrepancy in nll_loss Function Input Dimensions: The PyTorch nll_loss function unexpectedly computes a result when both the input and target are 1D tensors, contrary to the documented expected input dimensions.
- Discrepancy in torch.gather Documentation: The torch.gather and Tensor.gather documentation states that the input and index must have the same number of dimensions, yet there is no validation implemented to enforce this requirement for both CPU and GPU operations. This leads to potential runtime errors.
- Precision Errors in torch.quantile Function: An edge case in the torch.quantile function results in inconsistent outputs due to precision errors when calculating quantiles with very close floating-point values. A rewrite is suggested to handle such cases more accurately.
- Regression in resnet50_quantized_qat Model with PyTorch 2.8: A regression in PyTorch 2.8 causes the resnet50_quantized_qat model to fail to run, despite working in PyTorch 2.7. The issue is due to changes introduced by a specific pull request and can be reproduced across different hardware backends like CPU, CUDA, and XPU.
- Challenges with DDP and DTensor-Based Tensor Parallelism: Integrating DistributedDataParallel (DDP) with DTensor-based tensor parallelism in PyTorch presents challenges and unexpected behaviors. Issues include problems with parameter conversion logic, activation checkpointing, and inconsistent propagation of requires_grad, resulting in a stateless model and complications with optimizer interactions.
- Indexing Type in Triton Kernel for FlexAttention: Ensuring the correct indexing type is used in the Triton kernel for the FlexAttention component is necessary when dimension sizes grow significantly beyond what int32 indexing can address, as indicated in the PyTorch GitHub project.
- Version Check for einops in Dynamo: A problem with the version check for einops in Dynamo suggests the removal of the check for version 0.8.1 as it is the latest version. A related pull request needs to be addressed.
- Graph Break on nn.Parameter Constructors in PyTorch: There is a need to implement a graph break on nn.Parameter constructors in PyTorch's dynamo tracing, as the current support is fragile. Users are advised to initialize nn.Parameters at model initialization time, outside of the compiled region (see the sketch at the end of this section).
- Deprecation of CUTLASS Python Interface: The deprecation of the CUTLASS Python interface affects users of both PyTorch and the existing Python interfaces of CUTLASS. Clarification is sought on PyTorch's plans for adapting to this change, especially considering that the nvidia-cutlass package does not plan to support the Blackwell architecture.
- Error in PyTorch RNN Documentation Pseudocode: A potential error in the pseudocode of the PyTorch RNN documentation suggests that the expression x[t] should be replaced with (x[t] if layer == 0 else h_t[layer - 1]) to ensure accuracy.
- Missing C++ Compiler in PyTorch 2.9.0.dev on Windows 11: Code that previously ran successfully with PyTorch version 2.7 fails with version 2.9.0.dev due to a missing C++ compiler, specifically the 'cl' compiler. This is required for the code to execute properly on a Windows 11 system with CUDA support.
- Test Failures in vLLM Project with PyTorch 2.8rc: Test failures in the vLLM project when using PyTorch 2.8rc are attributed to changes in CUDA memory management and initialization requirements. A proposed solution includes creating a compatibility layer to ensure functionality with both PyTorch 2.7 and 2.8.
- Missing Optional Features in PyTorch NCCL Builds: The PyTorch nightly builds have a problem where the statically linked NCCL (NVIDIA Collective Communications Library) is missing support for several optional features, such as IBVERBS, MLX5DV, SHARP, and RDMACM. This is due to missing RDMA/IB libraries and headers, raising questions about the logic of compiling NCCL from scratch in docker containers versus using a redistributable version from Nvidia.
- Unit Test for load_state_dict in AOTI Models: A unit test is needed to verify that the load_state_dict function correctly updates the weights of an AOTI (Ahead-Of-Time Inductor) model when using user-managed weights. This ensures that changes to the eager model's weights are accurately reflected in the compiled AOTI model.
- Incompatibility Between torch 2.6 and torchvision 0.21.0: A potential incompatibility exists between torch 2.6 and torchvision 0.21.0, as the official PyTorch documentation suggests compatibility, but an error occurs during installation due to conflicting dependencies. The torchvision wheel requires torch version 2.8.0 or higher, leading to confusion about whether the documentation is outdated or the metadata is incorrect.
- Crash in all_to_all_single_autograd Compilation: A crash occurs when attempting to compile the all_to_all_single_autograd operation in PyTorch using torch.compile, due to the lack of support for dynamic shapes. This results in a runtime error related to an internal assertion failure when handling tensor shapes.
- Backward Pass Error with FSDPModule in PyTorch: Setting a custom reduce scatter divide factor for a subset of modules in a model using the FSDPModule.set_reduce_scatter_divide_factor method results in a backward pass error due to an incorrect data type. A suggested solution involves trying a specific pull request that addresses a related issue with bf16 reduce-scatter in ProcessGroupNCCL.
- Support for Quantized ONNX Gather Layer in PyTorch: There is a request to add support for a quantized version of the ONNX Gather layer in the PyTorch project to facilitate Quantization Aware Training (QAT). The current lack of support is causing errors during model quantization.
- Enhancing Coverage of tensor_metadata and redistribute_cost: There is a need to enhance the coverage of tensor_metadata and redistribute_cost for various operation strategies in the PyTorch project. Several operations currently lack complete OpSpec information, which is essential for the AutoParallel feature to compute strategy costs effectively.
- Build Errors in CUDA 12.9 Periodic CI Test: A failure occurs in the CUDA 12.9 periodic CI test for the PyTorch project, where the use of the deprecated cub::TransformInputIterator is causing build errors. There is a need to investigate why this error does not occur in nightly builds and to identify the source of the CCCL warning.
- Deprecation of Support for Older GPU Architectures in PyTorch 2.9: The planned deprecation of support for Maxwell, Pascal, and Volta GPU architectures in the PyTorch 2.9 release follows NVIDIA's announcement that these architectures will no longer receive enhancements and will be unsupported in future CUDA Toolkit versions. Users are prompted to migrate to newer architectures.
- Runtime Error in Qwen Model Inference with PyTorch: A runtime error is encountered during the inference process using the Qwen model, where an internal assertion failure related to NVML_SUCCESS occurs in the PyTorch CUDACachingAllocator. This is potentially due to package compatibility issues.
- Failure to Detect AVX Capabilities in PyTorch: PyTorch fails to detect and utilize AVX capabilities, despite being built and functioning on a system that supports AVX instructions. The output of the torch.__config__.show() command shows "CPU capability usage: NO AVX," even though the environment and operating system recognize and can use AVX instructions properly.
- RuntimeError with FlexAttention Module in PyTorch: A RuntimeError is encountered when using the FlexAttention module in PyTorch, specifically related to a custom mask function that appears to involve data-dependent control flow with a Tensor. This is not currently supported by the vmap implementation, leading to confusion as the user believes they are only using a list for masking.
- Test Failure in test_dtensor_save_load_import: The test_dtensor_save_load_import test fails due to an autoloader that imports torch._dynamo, causing indirect imports of torch.distributed.tensor and preventing the expected exception from being raised in the negative test path. A revision of the test is suggested to account for such import dependencies.
- Uninformative NotImplementedError in PyTorch Functions: Several torch.* functions in PyTorch raise uninformative NotImplementedErrors when called with an integer dtype. It is suggested that these functions should provide clearer error messages or potentially use a more accurate exception type like ValueError, as the current errors are misleading and do not adequately inform users about the correct usage or alternatives.
- Missing cu128 aarch64 Wheels for PyTorch Nightly Builds: The nightly cu128 aarch64 wheels for PyTorch have not been built since June 13th, prompting a request for investigation into the cause. Some comments suggest a migration to CUDA version 12.9 as a potential reason.
- Incompatibility with NVIDIA RTX Pro 6000 in PyTorch Nightly: The latest PyTorch nightly build (2.9.0.dev20250701 + cu12.9) lacks support for the NVIDIA RTX Pro 6000 (Blackwell, SM122 / Compute Capability 12.2), resulting in immediate failures of any CUDA operations due to incompatibility with the current PyTorch installation, which only supports up to CUDA capabilities sm_90 compute_37.
- Discrepancy in Advanced Indexing Between PyTorch and NumPy: A discrepancy exists between PyTorch's torch.compile and NumPy's behavior, specifically related to advanced indexing. The assumption that both libraries handle advanced indices equivalently is incorrect, leading to differences in behavior when indices are separated by a slice.
- Precompilation Failure in PyTorch Inductor Backend: A precompilation failure occurs during the training of nanogpt using the PyTorch Inductor backend, where a KeyError is raised in the compute_loss function due to incomplete compilation of the backward pass. Attempts to autosave the DynamoCache have been made.
- Improving Communication Cost Model for DTensor in PyTorch: There is a need to improve the communication cost model for DTensor in PyTorch, as the current model does not accurately account for all communication costs, such as the additional all_gather operation required in certain redistribution strategies. Leveraging NVIDIA's NCCL cost model is suggested for a more reliable estimation.
- Migration of PT2E Quantization Code in PyTorch: The migration of PT2E quantization code and documentation from the PyTorch repository to the PyTorch AO repository is being tracked. This includes updating internal call sites, adding deprecation warnings, and revising documentation and tutorials to reflect these changes.
- Compatibility Problem with einops and PyTorch Dynamo Compiler: A potential compatibility problem exists between PyTorch version 2.7.1 and einops versions 0.8.2 or 0.9.0, where the PyTorch Dynamo compiler may fail to handle the einops.rearrange function due to the absence of an allow_in_graph for it. This leads to a possible error when attempting to inline through the function.
- Incorrect Inference of groups Parameter in channel_shuffle: A bug in the PyTorch project causes the groups parameter in the channel_shuffle function to be incorrectly inferred as a Tensor type instead of an int when compiling ShuffleNet using torch.jit.script, leading to a runtime error.
- Missing Collective Operations in PyTorch Backward Pass: A bug in the PyTorch project results in the last collective operations, specifically the backward allreduce for Tensor Parallelism (TP) and the backward reduce_scatter for Sharded Parallelism (SP), being missing during the backward pass in both the TP and SP examples and unit tests.
- Failure with torch.compile and torch.vdot() on Complex Tensors: A bug in the PyTorch library causes a failure when using torch.compile() on a model that utilizes torch.vdot() with complex tensors. This results in an unimplemented feature error, causing an AssertionError regardless of whether the code is executed on a CPU or CUDA.
- Runtime Error with torch.ops.prims.broadcast_in_dim.default: A bug in PyTorch occurs when using torch.compile() on a module that calls torch.ops.prims.broadcast_in_dim.default, resulting in a runtime error due to alias annotations on the output tensor. Potential solutions include removing the alias annotation or using .clone() to prevent storage sharing.
- NotImplementedError with flex_attention Module on CPU: A bug is encountered when using torch.compile with the flex_attention module, where a NotImplementedError is raised on CPU due to the query, key, and value being the same buffer. The code runs successfully in eager mode and on CUDA.
- Fixed Batch Size in ONNX Export of ResNet50 Model: A bug occurs when exporting a PyTorch ResNet50 model to ONNX format with a dynamic batch size for variable batch size inferencing. The model is instead exported with a fixed batch size of 1, contrary to the user's expectations.
- Errors with DTensors and TORCH_DISTRIBUTED_DEBUG=DETAIL: Setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL in PyTorch 2.8.0-rc1 causes errors with DTensors due to the NCCL ProcessGroup being wrapped in a ProcessGroupWrapper that does not override all necessary methods. This leads to unimplemented exceptions for certain collective operations.
- File Adapter Error on Windows in PyTorch: A bug in the PyTorch project causes the file_adapter component to fail to correctly read the file_path on Windows, resulting in an error due to the file_name being output as an empty string. This occurs specifically in the pytorch\caffe2\serialize\file_adapter.cc file at line 31.
- Feature Map Extraction in PyTorch Graphs: There is an inquiry about whether there is an official method in the PyTorch toolbox, particularly around pt2 produced by torch.export, to extract the feature map of each node in a graph. This is similar to the functionality provided by torchvision.models.feature_extraction.create_feature_extractor().
- Segmentation Faults in test_ops.py with GCC 13 on AArch64: Segmentation faults occur in specific tests within the test_ops.py file when using GCC 13 on AArch64 architecture, specifically on neoverse-v1, due to a recent pull request. Potential resolutions include reverting the pull request, downgrading GCC for AArch64 images, or disabling GCC's auto-vectorizer.
- Segmentation Fault in PyTorch with gather_object and destroy_process_group: A regression in the PyTorch library causes a script using torch.distributed.gather_object and destroy_process_group to successfully complete its main logic but crash with a segmentation fault upon exit. This problem is not present in an older version of the NVIDIA PyTorch container.
- Assertion Error with torch.einsum and DTensor Inputs: The torch.einsum function fails when used in inference mode with DTensor inputs, resulting in an assertion error due to the absence of a DeviceMesh from the DTensor arguments. A possible solution is to add a missing record for einsum to OpDispatcher.sharding_propagator.op_to_schema_info with the runtime_schema_info.needs_pytree=True property.
- RuntimeError with torchvision::nms Operator in PyTorch: A RuntimeError indicates that the operator torchvision::nms does not exist when using specific versions of PyTorch and Torchvision in a Python 3.12 environment on macOS. The issue persists even with the latest nightly builds and when using conda for installation, suggesting a potential problem with the integration or availability of certain operators in the specified setup.
- Metal4 Update and PyTorch MPS Backend Performance: There is a discussion on whether the Metal4 update, which has been optimized for machine learning and now natively supports tensors, will lead to significant performance improvements and enhanced compatibility for PyTorch's metal performance shader (mps) backend on Mac systems.
- Deadlock in OffsetBasedRNGTracker in PyTorch: A deadlock occurs in the OffsetBasedRNGTracker class's run_state_sync method within PyTorch's distributed tensor module. This is caused by an inconsistent order of the dist.broadcast operation across ranks, leading to synchronization failures when processes do not initialize the tracker consistently.
- Channel-Last Layout Support in PyTorch Convolution Operations: There is a need for PyTorch's convolution operations to support a channel-last layout (e.g., (N,L,C), (N,H,W,C), (N,D,H,W,C)) to improve performance and compatibility with other operations that assume channels in the last axis. Current workarounds are inefficient and cumbersome (see the sketch at the end of this section).
- Version Parsing Error in torch.utils.cpp_extension with Clang: The torch.utils.cpp_extension module fails to correctly parse the version string of Clang when it includes a suffix like '20.1.7+libcxx', leading to a ValueError during the compilation of extensions such as torchvision. This can be resolved by adjusting the version parser.
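For the RecompileError item above, a sketch of the suggested workaround of defining the @torch._dynamo.disable-decorated helper outside the compiled function (function names here are illustrative, not taken from the issue):

```python
import torch

# Defining the disabled helper at module level, outside the compiled function,
# avoids creating a fresh disabled wrapper on every call, which the issue reports
# can invalidate the cache and trigger repeated recompilation.
@torch._dynamo.disable
def eager_only_helper(x):
    return x.tolist()          # something deliberately kept out of the compiled graph

@torch.compile
def fn(x):
    y = x * 2
    eager_only_helper(y)       # graph break here, but no per-call recompilation
    return y + 1

print(fn(torch.arange(4)))
```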
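For the nn.Parameter graph-break item above, a sketch of the advised pattern of constructing parameters at module initialization time rather than inside a compiled region (the module here is illustrative):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Parameters are created once at init time, outside any compiled region.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Avoid constructing nn.Parameter here: building parameters inside a compiled
        # forward is the fragile pattern the issue wants to guard with a graph break.
        return x * self.weight

model = torch.compile(Scale(8))
print(model(torch.randn(2, 8)).shape)
```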
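For the channel-last layout item above, a sketch of the kind of transpose-based workaround currently required when data arrives as (N, L, C) but Conv1d expects (N, C, L); the shapes are illustrative:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

x_nlc = torch.randn(8, 128, 16)           # batch, length, channels (channel-last)

# Current workaround: transpose to (N, C, L), convolve, transpose back.
# The extra transposes and copies are what the issue calls inefficient and cumbersome.
y = conv(x_nlc.transpose(1, 2)).transpose(1, 2)
print(y.shape)                             # torch.Size([8, 128, 32])
```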
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 14
Summarized Issues:
- Bugs in PyTorch operations: Several issues highlight bugs in PyTorch operations, such as incorrect results in Conv2d-unsqueeze-AdaptiveAvgPool3d with Triton backend and stride==0 tensors, and matrix multiplication on MPS backend not checking data type mismatches. These bugs can lead to incorrect results and silent failures, affecting model performance and reliability.
- PyTorch version inconsistencies: Issues have been raised about inconsistencies between PyTorch versions, such as torch.export tests failing in version 2.8.0 but passing in 2.7.1, and a regression identified between versions 2.6 and 2.7. These inconsistencies can cause confusion and require developers to adjust their code or wait for fixes.
- CUDA and PyPI package concerns: There are concerns about the size difference between CUDA 12.6 and 12.9 wheel packages and the availability of only the 12.6 package on PyPI. The size increase is due to added support for the Blackwell architecture, and PyPI's limitations in hosting multiple CUDA versions are noted.
- MPS support on Apple Silicon: Confusion arose over the absence of a dedicated MPS-compatible build for PyTorch 2.7.1 on Apple Silicon M4, though it was clarified that all macOS builds support MPS. This suggests a need for documentation updates to prevent misunderstandings about GPU acceleration support.
- Model compilation and execution errors: Issues with model compilation and execution include a graph break error in YOLOv5 due to time.time function tracing and a failure in the vision_maskrcnn model on H100 and MI300 GPUs. These errors require troubleshooting and guidance to resolve, impacting model deployment.
- AMP training and GradScaler issues: A potential bug in amp.GradScaler during AMP training causes the scale value to drop unexpectedly on multiple GPUs. This behavior contradicts expectations and could affect training stability and performance.
- Typographical errors in documentation and code: Minor typographical errors have been identified in PyTorch documentation and code, such as "see" instead of "seen" and "paramter" instead of "parameter." These errors, while minor, can affect readability and should be corrected.
- Continuous integration and compiler warnings: A CI job was temporarily disabled, requiring a reason and manual updates, and a compiler warning about an unused variable "threshold" was noted. Addressing these issues ensures smoother development and integration processes.
- Torch.compile and user code bugs: An issue with Torch.compile Dynamo failing to execute an FX node with fake tensors was identified as a bug in the user's code. This highlights the importance of debugging user code to ensure proper execution of PyTorch features.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 144
Key Open Pull Requests
1. [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/torch.py: This pull request aims to address part of issue #147913 by replacing the unimplemented function with unimplemented_v2 in the torch/_dynamo/variables/torch.py file, as part of the PyTorch project, and involves multiple updates and contributions, including co-authored commits by William Wen.
- URL: pull/157344
- Merged: No
- Associated Commits: 20d1f, c7c70, 53b5c, b402f, fd8e5, 7d53e, 8b88e, f6854, a0dc3, a7eca, dcfbe, fc5ec, e70be, 0987a, 96080, 4bdd3, 80763, 08c72, a42df, 1673c, 43f68, c023d, f40d3, f0387, b832a, 4f764, 37379, 7ef8b, 51447, 4b6e8, 5168d, 641c9, 474d1, ec5d5, 37d44, a8aaa, aacfe, c9f62, 49dab, 00135, d0c69, d03d5, 17a4e, 335e9, 3df8c, 2ec26, 1f808, 8c0c3, f120d, 69a11, 7a1a5, be7ff, f41d6, 9a3a7, 54714, 61a19, db537
2. [build] make SDist buildable: bootstrap git repo and submodules: This pull request aims to make the source distribution (SDist) of the PyTorch project buildable by bootstrapping the Git repository and its submodules, as part of a stack of changes managed by ghstack, and involves multiple updates across several commits.
- URL: pull/157432
- Merged: No
- Associated Commits: 5efc8, 348da, af9b9, ea88a, ba698, 8e9b5, 1eb1d, 05ae8, f98d9, 7f970, 250f5, 05951, 3ea44, 3156d, af210
3. Fix init CUDA preload: get correct versions (#147001): This pull request addresses the issue of initializing CUDA preload by ensuring the correct versions are selected, primarily by updating the cuda_libs dictionary to search for specific library versions first and using less-specific patterns as a backup, while also implementing a sorting mechanism in _preload_cuda_deps to prioritize newer library versions, and it includes several commits to fix related issues such as type hints and merge conflicts.
- URL: pull/157264
- Merged: No
Other Open Pull Requests
- Code Refactoring and Hook Finalization: This topic involves refactoring code to ensure all registered template hooks are finalized before accessing the template's code object. A new function RenderPartial.finalize_remaining was introduced to finalize any remaining active hooks, and a test was included to verify proper finalization by the scheduler.
- Continuous Integration and Testing Enhancements: Several pull requests focus on improving CI and testing processes. These include re-enabling the ET test, addressing CI failures for CUDA versions greater than 12.9, and resolving NVSHMEM compilation issues by updating CI docker environments.
- Precision and Performance Improvements: Enhancements in computational efficiency and precision handling are addressed by allowing BF16 and TF32 as internal precision for various operations and enabling TF32 for matrix multiplication, linear, and convolution operations in the MKL-DNN backend.
- Documentation and Typo Corrections: Updates to documentation dependencies and typo corrections in backend code are covered. These changes involve fixing build errors, adding new documentation files, and correcting a typo from "inpt" to "input."
- Removal of 'allow-untyped-defs' Directives: A series of pull requests aim to remove the 'allow-untyped-defs' directive from various files in the PyTorch project. These changes are part of a series managed through the ghstack tool and include multiple commits marked as "[ghstack-poisoned]."
- Feature Implementations and Enhancements: New features and enhancements include a backward pass for max_pool3d for MPS, AI-generated inductor lowerings for pooling functions, and a work-in-progress feature for saving and loading compiled models with torch._dynamo.save/load().
- Bug Fixes and Issue Resolutions: Bug fixes include addressing an integer overflow in the FlexAttention kernel and resolving an issue with einops and torch.compile interaction by reverting code to a previous state.
- Experimental and Diagnostic Efforts: Experimental attempts to diagnose and resolve issues include trying different approaches on the CI system for an issue that cannot be reproduced locally and ensuring consistent behavior of integer attributes within [dynamo][fsdp] components.
- Optimization and Code Efficiency: Efforts to optimize code include avoiding unnecessary slices and using reduceOpSum for scenarios where the world size is 1, as well as updating scripts to utilize torch.accelerator and ensuring compatibility with device count checks (see the sketch below).
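A minimal sketch of a device-agnostic check in the spirit of the torch.accelerator update above, assuming the torch.accelerator module exposes device_count() and current_accelerator() as in recent PyTorch releases; this is not code from the pull request itself:

```python
import torch

# Pick an accelerator if one is present, otherwise fall back to CPU.
if torch.accelerator.device_count() > 0:
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

x = torch.ones(4, device=device)
print(device, x.device)
```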
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 124
Key Closed Pull Requests
1. Fix docs issue 153531: This pull request addresses a documentation issue by providing a more detailed explanation of the Short-Time Fourier Transform (STFT) function in torch/functional.py, specifically clarifying the use of n_fft instead of win_length in the STFT equation exponent, with changes made between lines 598 and 614 (see the formula sketch after this list).
- URL: pull/157561
- Merged: No
- Associated Commits: 54aca, 24397, 99718, 12bcf, 0863b, 44d11, e57f0, 15185, 1b702, 8be26, 7f55e, 6ca19, da4bf, e8ebe, 5a4f1, 2ad9c, f9e2b, 32e18, df3ca, bc244, 9fd51, be254, 596bb, 444e1, ce29e, 0288d, 953c9, ab750, 1a3e3, 7e97e, 4cf10, 738b4, 96d2d, 1c8ba, 3a44b, 24903, 8ac9b, 44ab7, 0cd06, 55d10, 574f4, e9956, a412d, d65d0, b126b, 6a3a3, b9814, bac09, 4ae86, 7b436, 2304d, bbfcf, 95ea4, 445b0, e80c8, 4f882, dcaee, 24e47, 94035, eef51, 0aa3f, 3eaae, 07779, f00f0, 6c8c5, f9386, 56a20, 3184b, d37ef, da3f5, 5ba8a, 49022, abe17, c1f8e, 13a51, 9e6f4, 39901, 1fee5
2. [Distributed] Add check to verify if local shards in FSDP have tensors before accessing: This pull request addresses an issue in the PyTorch project where a test in the Fully Sharded Data Parallel (FSDP) module was causing a "list index out of range" error by attempting to access local_shards[0] without verifying the existence of tensors on all ranks, and it introduces a check to ensure that local shards contain tensors before accessing them, thereby resolving the error across all cases.
- URL: pull/157275
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55, 7dade, 580aa, c0f57, 6dedb, 20d07, 124ff, 0bea1, 7409a, 636cb, 0d5a8, 3826e, cb711, d6cd1, 624be, 8d8c5, 1cf78, 41475, 0628c, b0d93, 58eb8, e558e, 83ac5, 0e7a7, a2b2f, 8de00, 91f5d, 39e6c, 06d6c, a0590, 31ddf, 9a6df, 50cb9, 08559, f6a8c, dc0be, 9fd7e, cce98, cf3bf, e3957, cd29a, 7b81c, 42401, 1dee9, c3f9f
3. Update _linux-test to support B200 runner: This pull request aims to update the _linux-test workflow to support the B200 runner by enabling OIDC for ECR access and S3 stats upload, disabling sccache connection to S3 on B200, and making necessary adjustments to permissions and environment variables across related workflows.
- URL: pull/157341
- Merged: No
- Associated Commits: 8f291, 9cae9, b1172, 7a318, 7a56b, b9d5f, d0cd6, c7316, 170ea, db0af, 6615d, 20963, 6c17f, a23be, 3df91, 45d6a
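Key closed pull request 1 above concerns the exponent of the STFT equation in torch/functional.py. As a sketch that paraphrases the documented definition rather than quoting the updated docs verbatim, the clarified point is that the exponent divides by n_fft, not win_length:

X[\omega, m] = \sum_{k=0}^{\text{win\_length}-1} \text{window}[k] \, \text{input}[m \cdot \text{hop\_length} + k] \, \exp\!\left(-j \, \frac{2\pi\, \omega k}{\text{n\_fft}}\right)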
Other Closed Pull Requests
- CUDA and Triton Updates: This topic covers updates and fixes related to CUDA and Triton in the PyTorch project. One pull request addresses critical fixes for CUDA 12.9 aarch64 GPU builds by updating the CUDA_STABLE variable and incorporating a missed Triton change. Another pull request addresses an ImportError issue by updating the triton_key import to be compatible with the latest version of Triton.
- Code Refactoring and Maintenance: Several pull requests focus on code refactoring and maintenance within the PyTorch project. One pull request aims to refactor the setup.py file by replacing os.path.* functions with pathlib.Path for better readability (see the sketch after this list). Another pull request involves relocating functions to improve code organization and maintainability.
- Testing and CI Improvements: This topic includes pull requests aimed at improving testing and continuous integration (CI) processes. One pull request introduces a new CI job to test multiple versions of the einops library with torch.compile. Another pull request addresses a CI failure by fixing the dependency order in the CMake build for AOT inductor.
- Performance and Optimization: Pull requests under this topic focus on performance enhancements and optimizations. One pull request addresses slow performance in CUDA-11.3 by adding an exit condition for NaN values in special operations. Another pull request involves porting passes to bucket all_gathers to optimize performance and enhance memory optimizations.
- Type and Serialization Enhancements: This topic covers enhancements related to type handling and serialization. One pull request aims to remove the 'allow-untyped-defs' option from specific files as part of a series of changes. Another pull request makes the "DUPLICATED_INPUT" guard serializable by ensuring it can be safely serialized and reconstructed.
- Inductor and Kernel Improvements: Pull requests in this category focus on improvements to the Inductor component and kernel definitions. One pull request addresses an issue where Triton kernel definitions can break if they contain triple quotes. Another pull request enhances the NVSHMEM discovery process by extending the search to include system locations.
- Experimental Features and Enhancements: This topic includes pull requests introducing experimental features and enhancements. One pull request involves running an experiment to enable a "keep going" feature on the main branch. Another pull request proposes the addition of a progressive compile mode, although it was not merged.
- Bug Fixes and Issue Resolutions: Pull requests under this topic address bug fixes and issue resolutions. One pull request addresses a bug related to the dict(mapping_proxy) functionality within the Dynamo component. Another pull request involves additional testing of Python arithmetic operators when applied between tensors and scalars.
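For the setup.py refactor noted above, a small sketch of the general os.path-to-pathlib pattern involved; the paths are illustrative and not taken from the pull request:

```python
import os.path
from pathlib import Path

# Old style: string-based path manipulation with os.path.* helpers.
root_old = os.path.dirname(os.path.realpath(__file__))
build_old = os.path.join(root_old, "build", "lib")

# New style: the same locations expressed with pathlib.Path objects.
root_new = Path(__file__).resolve().parent
build_new = root_new / "build" / "lib"

assert str(build_new) == build_old   # both spell the same filesystem location
print(build_new)
```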
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
XuehaiPan | 508 | 48 | 3 | 29 |
bobrenjc93 | 231 | 50 | 0 | 15 |
malfet | 160 | 23 | 10 | 102 |
atalman | 104 | 12 | 14 | 20 |
guilhermeleobas | 106 | 24 | 1 | 1 |
williamwen42 | 24 | 8 | 5 | 92 |
svekars | 47 | 3 | 0 | 72 |
Skylion007 | 18 | 5 | 3 | 93 |
davidberard98 | 79 | 12 | 6 | 12 |
guangyey | 75 | 7 | 3 | 23 |