Weekly GitHub Report for PyTorch: June 16, 2025 - June 23, 2025 (12:06:29)

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.

Table of Contents

I. News
1.1. Recent Version Releases
1.2. Other Noteworthy Updates

II. Issues
2.1. Top 5 Active Issues
2.2. Top 5 Stale Issues
2.3. Open Issues
2.4. Closed Issues
2.5. Issue Discussion Insights

III. Pull Requests
3.1. Open Pull Requests
3.2. Closed Pull Requests
3.3. Pull Request Discussion Insights

IV. Contributors
4.1. Contributors

I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for Python 3.13 with torch.compile, a new performance-related feature torch.compiler.set_stance, and FP16 support on x86 CPUs. Notably, the release also marks a shift away from publishing on Conda, with a focus on using Manylinux 2.28 for Linux binaries, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load.

II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

Inductor error with Torch XPU optimizations to StableDiffusion3 Pipeline: This issue involves a bug where the torch.compile function fails when used with the StableDiffusion3 Pipeline on Intel XPUs, resulting in an InductorError related to the SYCL directory not being found. The error occurs despite attempts to resolve it by disabling max-autotune and reinstalling PyTorch, with the problem persisting across different environment setups.

The comments discuss troubleshooting steps, including disabling max-autotune and verifying the installation of the intel-sycl-rt package. Despite these efforts, the issue persists, with suggestions to avoid using the uv tool on Windows and instead use pip for installation. The conversation also highlights differences in behavior between Windows and Ubuntu environments, with the latter working correctly.
Number of comments this week: 10

Deprecation notice of torch.norm and Tensor.norm across the documentation: This issue highlights a discrepancy in the PyTorch documentation where torch.norm is marked as deprecated in some sections, but still referenced in others, such as Tensor.norm and the MaskedTensor quasi-tutorial. The issue raises questions about whether the documentation should be updated to reflect a shift towards using torch.linalg.norm, torch.linalg.vector_norm, and torch.linalg.matrix_norm, and suggests potential changes to the documentation to address this inconsistency.

A user expresses interest in working on the documentation issue, and another user confirms it is a documentation task, suggesting adding warnings where missing. A separate issue with torch.norm's typing overload is noted. The interested user requests assignment to the issue, and is advised to start working on it while waiting for official assignment, with a reminder to contact someone with merge access for PRs.
Number of comments this week: 7

[compile][torchtune] Full model compiled Qwen3 is 4x slower than eager: This issue highlights a performance discrepancy where the fully compiled Qwen3 model in the torchtune project is running four times slower than when executed in eager mode. The problem is significant as it affects a popular model, and the discussion suggests that the torchtune implementation may have inefficiencies compared to other implementations like vllm.

The comments discuss the need for triage due to the model's popularity, with suggestions to compare implementations for performance differences. It is noted that torchtune's approach involves per-layer compilation, which is faster than eager mode, but full model compilation is slower. Benchmark results are shared, showing different speeds for eager, per-layer, and full model compilation, with a suggestion to explore hierarchical compilation to optimize performance.
Number of comments this week: 7

Error shm.dll: This issue is about a user encountering an error related to the shm.dll file when trying to import PyTorch on a system running Windows 7 with Python 3.13.1 and Intel HD Graphics 4600. The user reports that the error persists across multiple versions of PyTorch, specifically 2.6.0, 2.7.0, and 2.7.1, and seeks assistance from the community by tagging several contributors.

The comments reveal a request for more environment information, which the user provides, showing a lack of support for Windows 7. A discussion follows about the compatibility of PyTorch with Windows 7, with one commenter suggesting that Windows 10 is the minimum supported version, and another explaining that PyTorch's components, including shm.dll, are written in C++ and not just Python.
Number of comments this week: 5

functorch_maml_omniglot is a bad CPU performance smoketest model: This issue discusses the inadequacy of the functorch_maml_omniglot model as a CPU performance smoketest in the PyTorch project, highlighting its susceptibility to significant performance shifts with seemingly unrelated changes. The issue points out that the benchmark does not provide actionable insights for optimization, as evidenced by the lack of updates to expected results following performance changes in various pull requests.

The comments discuss the stability issues of benchmarks in CI, with one user noting a regression caused by a specific PR and suggesting a new PR to fix it. Another user explains that the regression was not detected due to the one-sided nature of the performance test, which does not require updates for performance improvements. There is also a mention of the benchmark's random performance fluctuations and a user's attempt to debug their performance problem further.
Number of comments this week: 5
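
The one-sided check mentioned in the comments above can be illustrated with a small sketch. This is not PyTorch's CI code: the function name, the tolerance, and the speedup numbers are all hypothetical and chosen only to show why a regression can slip through when the expected value is not raised after an improvement.

```python
def one_sided_check(measured_speedup: float, expected_speedup: float,
                    tolerance: float = 0.05) -> bool:
    """Pass unless the measured speedup is more than `tolerance` below the expected value."""
    return measured_speedup >= expected_speedup * (1.0 - tolerance)


# Hypothetical history for one benchmark; the expected value is never raised.
expected = 1.00
history = [1.00, 1.40, 1.05]  # baseline, unrelated improvement, later regression

results = [one_sided_check(m, expected) for m in history]
print(results)  # every run passes, including the drop from 1.40 to 1.05
```

Because improvements never force an update of the expected value, the later drop is compared against the stale baseline rather than against the improved level, so it passes.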

2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a script that utilizes the OotdPipeline with specific configurations, including the use of PyTorch's torch.compile function to optimize the unet_garm and unet_vton components, and is likely related to compatibility or versioning issues with the Triton library.

Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method.

cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a model in a Docker environment with a tmpfs permission set to 1777, where the execution of a cached cuda_utils.so file in the /tmp directory fails due to the absence of the execution bit, despite the directories having the correct permissions. The error occurs during the execution of a PyTorch model, specifically when using the torch.compile function, and is related to the inability to map a segment from the shared object, resulting in an ImportError.

Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.

[JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature request is to reduce the size of model files, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's size from 6.7MB to 5.6MB by manually removing these debug files.
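
The equivalence behind the MaxPool2D proposal above can be checked with a minimal pure-Python sketch. It assumes stride 1 and no padding, and the helper max_pool2d is a hypothetical stand-in written for this illustration, not PyTorch's implementation or the issue author's test code.

```python
import random


def max_pool2d(grid, k):
    """Max pooling with a k x k window, stride 1, and no padding."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(k) for dc in range(k))
            for c in range(cols - k + 1)
        ]
        for r in range(rows - k + 1)
    ]


random.seed(0)
image = [[random.random() for _ in range(12)] for _ in range(10)]

# One 5x5 pool versus two stacked 3x3 pools.
direct_5 = max_pool2d(image, 5)
stacked_5 = max_pool2d(max_pool2d(image, 3), 3)

# One 7x7 pool versus three stacked 3x3 pools.
direct_7 = max_pool2d(image, 7)
stacked_7 = max_pool2d(max_pool2d(max_pool2d(image, 3), 3), 3)

print(direct_5 == stacked_5, direct_7 == stacked_7)
```

In this naive formulation each output element of a 5x5 window reads 25 inputs, while two 3x3 passes read 9 inputs each, which is one way to see where a CPU saving could come from. Whether that carries over to PyTorch's optimized kernels is the question the issue raises.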

2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository. 
Issues Opened This Week: 71
Summarized Issues:

Accuracy Failures in PyTorch Tests: This issue involves accuracy failures in a PyTorch test related to matrix multiplication with persistent memory and the new TMA API. A significant number of tensor elements do not match expected values, indicating a regression introduced in a recent version of Triton.
issues/156028

Build System Migration: This issue involves migrating PyTorch's build system from deprecated setup.py commands to modern Python build tools and standards. The migration aims to ensure compatibility and future-proofing, while considering the potential use of scikit-build for better handling of CMake and custom commands.
issues/156029

Custom Backend Device Guard Registration: This issue is about enabling the registration of a device guard for a custom PyTorch backend entirely in Python. Currently, this requires a workaround involving C++ code, and the goal is to expose the necessary functionality to Python.
issues/156052

Performance Impact of Guard Evaluation Latency: This issue highlights a significant performance impact caused by increased guard evaluation latency when using torch.distributed.tensor.parallel.style.ColwiseParallel with nn.Linear. The latency rises from 0.598 microseconds to 5.819 microseconds due to the addition of approximately 30 extra guards, leading to a substantial cumulative delay during dynamo cache lookup.
issues/156054
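
The figures reported in the guard latency issue above can be put in perspective with a short worked calculation. The two latencies and the guard count come from the issue summary; the number of cache lookups is a hypothetical value chosen only for illustration.

```python
baseline_us = 0.598      # guard evaluation latency without ColwiseParallel (from the issue)
with_guards_us = 5.819   # guard evaluation latency with ColwiseParallel (from the issue)
extra_guards = 30        # approximate number of added guards (from the issue)

slowdown = with_guards_us / baseline_us
per_guard_us = (with_guards_us - baseline_us) / extra_guards

# Hypothetical workload: the lookup count is an assumption, not a figure from the issue.
lookups = 1_000_000
added_seconds = (with_guards_us - baseline_us) * lookups / 1e6

print(f"slowdown: {slowdown:.1f}x")
print(f"cost per extra guard: {per_guard_us:.3f} us")
print(f"added time over {lookups:,} lookups: {added_seconds:.2f} s")
```

Under these assumptions, the extra guards make each evaluation roughly ten times slower, at a little under 0.2 microseconds per added guard.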

Bugs in PyTorch Library: Several issues describe bugs in the PyTorch library, including torch.equal causing compilation to revert to eager mode, incorrect reference count output in CPU-only builds, and inconsistent results between CPU and GPU for conv_transpose3d. These bugs affect model compilation, reference counting, and result consistency across different hardware.
issues/156057, issues/156059, issues/156062

Documentation Errors and Feature Requests: Issues highlight documentation errors in PyTorch functions like torch.max and torch.logsumexp, which are incorrectly stated to accept dim=None. Additionally, there are requests for adding support for dim=None in torch.logsumexp to improve usability.
issues/156071, issues/156072, issues/156075

Software Bill of Materials (SBOM) Integration: This issue proposes the integration of automatic SBOM generation and optional vulnerability scanning into the PyTorch project using Anchore's open-source tools. The goal is to enhance supply chain transparency, security, and trust by providing detailed insights into the project's dependencies and potential vulnerabilities.
issues/156085

Disabled Tests and Instability in Test Environments: Issues involve disabled tests due to failures on specific platforms and instability in ROCm jobs on the linux-jammy-rocm-py3.10 test environment. These issues require attention from specific contributors and involve multiple stakeholders for resolution.
issues/156089, issues/156098

Performance Problems in Model Compilation: This issue highlights a performance problem where the fully compiled Qwen3 model using torchtune is four times slower than when run in eager mode. Discussions suggest potential discrepancies in implementation compared to vllm's Qwen3 and the need for optimization in model compilation time.
issues/156103

Graph Break Stack Traces in Dynamo: This issue highlights the need to incorporate carets in graph break stack traces within the Dynamo project. The goal is to enhance precision in identifying problematic code locations, similar to how regular Python stack traces utilize carets.
issues/156127

Unit Test for DDP and FSDP2 Mixed Precision: This issue involves adding a unit test to demonstrate the use of Distributed Data Parallel (DDP) mixed precision in conjunction with Fully Sharded Data Parallel (FSDP2) mixed precision. The focus is on the ignore_params feature by enhancing the test_fully_shard_ignore_params.py to cover this scenario.
issues/156130

Tensor Creation in Compiled Regions: This issue describes a problem where using the .item() method on a Tensor created within a compiled region in PyTorch fails. The operation returns a non-Tensor, which cannot be traced into the Dynamo FX graph output, whereas moving the tensor creation outside the compiled region resolves the issue.
issues/156135

Discrepancy in torch.rsqrt Function Output: This issue highlights a discrepancy in the torch.rsqrt function's output for a complex128 tensor with a value of (0+infj). The CPU backend returns nan+nanj while the CUDA backend correctly returns 0+0j, suggesting that the CPU's result may be due to separate computation steps for the square root and reciprocal.
issues/156152
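
How a two-step computation could turn this input into NaN can be sketched in plain Python. This is not PyTorch's CPU kernel. The helper naive_reciprocal is a hypothetical textbook formula, used here only to show that taking the square root first and then a reciprocal without special handling of infinities loses the mathematically expected zero.

```python
import cmath
import math


def naive_reciprocal(z: complex) -> complex:
    """Textbook 1/z = conj(z) / |z|^2 with no special handling of infinities."""
    denom = z.real * z.real + z.imag * z.imag
    return complex(z.real / denom, -z.imag / denom)


z = complex(0.0, math.inf)

root = cmath.sqrt(z)               # both components of the root are infinite
two_step = naive_reciprocal(root)  # inf / inf in each component gives nan

print(root, two_step)
```

Since the magnitude of the square root is infinite, the reciprocal should tend to zero, which matches the 0+0j reported for CUDA. The intermediate inf/inf division is what produces NaN in this sketch.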

Error Related to shm.dll on Windows 7: This issue involves a user encountering an error related to the shm.dll file or its dependencies when importing PyTorch on a system running Windows 7 with Python 3.13. The discussion suggests that the problem may be due to PyTorch's lack of support for Windows 7, as it typically requires Windows 10 or later.
issues/156159

Gradient Calculation Discrepancy in SDPA: This issue reports a discrepancy in gradient calculations during the backward pass of the scaled dot product attention (SDPA) operation on an NVIDIA RTX 5080 GPU compared to a CPU. The discrepancy occurs particularly when the standard deviation of the 'key' tensor is large and the sequence length is increased, resulting in abnormal gradients on the GPU while remaining normal on the CPU.
issues/156160

AttributeError with NumPy Array in torch.compile: This issue involves a bug in PyTorch version 2.7.1+cu126 where using torch.compile with a function that multiplies a NumPy array results in an AttributeError. The NumPy array lacks a 'mul' attribute, causing a failure when running FX nodes with fake tensors on a CUDA device.
issues/156162

Feature Request for C++ API in LibTorch: This issue is a feature request for the PyTorch project, specifically asking for the development of a native C++ API in LibTorch to enable direct ONNX model export without relying on Python. This is currently a significant obstacle for projects that are purely C++.
issues/156168

Bug in torch.nn.Module.load_state_dict Method: This issue highlights a bug in the PyTorch library where the torch.nn.Module.load_state_dict method always calls the internal _load_from_state_dict function with strict=True. This prevents hooks registered with register_state_dict_pre_hook from adapting to a strict=False scenario, affecting the ability to load state dictionaries of lazily initialized submodules.
issues/156177

Stack Overflow Error in Windows Wheel Builds: This issue involves a stack overflow error occurring during the build process of Windows Wheel builds for CUDA 12.9.1 in the PyTorch project. The build fails with an "LLVM ERROR: out of memory" message, particularly when compiling the SegmentReduce.cu file, despite attempts to mitigate the problem by using a worker with a larger memory footprint and reducing the number of maximum jobs.
issues/156181

Graph Breaks Due to Logging Functions: This issue describes a problem where using logging functions in a PyTorch training script causes graph breaks due to the inability of the torch.compile function to trace the __len__ method of an unknown type. This significantly impacts performance despite a recent change intended to ignore certain logging functions.
issues/156191

Upgrade for SM100 Architecture Support: This issue involves upgrading the torch._grouped_mm module to support the SM100 architecture by updating the CUTLASS submodule. The upgrade includes modifying architecture tags and kernel parameters for Blackwell, adapting to potential CUTLASS API changes, and ensuring the build and testing processes are compatible with the new architecture.
issues/156202

Performance Regression in AOTI-Compiling Submodules: This issue highlights a significant performance regression when using the torch.export export/save/load/unflatten workflow for selectively AOTI-compiling submodules within a larger neural network. The regression introduces substantial runtime overhead compared to direct submodule replacement, particularly when deploying on edge devices without the nn.Module definition.
issues/156206

Comprehensive Feature Request for DTensor: This issue is a comprehensive feature request for the DTensor component in a GitHub project, focusing on enabling AutoParallel by addressing known issues. The request includes auditing existing OpStrategy, replacing internal use of register_prop_rule with register_op_strategy, improving OpStrategy to meet AutoParallel's requirements, and proposing a new source_strategy for tensor input distribution.
issues/156217

Upgrade for f8f8bf16_grouped_mm Function: This issue involves upgrading the f8f8bf16_grouped_mm function in PyTorch to support NVIDIA's SM100 architecture by adapting a block-wise scaling approach from a similar kernel in the sglang repository. The upgrade includes analyzing architectural differences, defining SM100-specific configurations, implementing core components, and finalizing dispatch logic.
issues/156238

Translation Validation Failure in PyTorch: This issue involves a translation validation failure in a PyTorch project when using the fake_tensor_propagate_real_tensors configuration. A function compiled with torch.compile and specific configurations leads to a ValidationException due to incorrect handling of dynamic shapes and scalar outputs, resulting in runtime assertion failures and incorrect tensor computations.
issues/156251

Tutorial for Exporting Hugging Face Transformers Model: This issue involves creating a tutorial for exporting a Hugging Face transformers model to ONNX format, incorporating features like dynamic caching. The issue mentions several contributors for collaboration.
issues/156258

Illegal CUDA Memory Access in torch._foreach_copy_: This issue reports a bug in the PyTorch library where the torch._foreach_copy_ function causes an illegal CUDA memory access under certain conditions. The issue occurs when attempting to copy data between tensors on a CUDA device, resulting in a runtime error and suggesting the use of TORCH_USE_CUDA_DSA for device-side assertions.
issues/156261

Failure in torch.compile on Intel XPUs: This issue involves a bug where the torch.compile function fails when using the StableDiffusion3 Pipeline on Intel XPUs, resulting in an InductorError due to a missing SYCL directory. The error is suspected to be related to the environment setup and installation method of PyTorch, particularly when using the uv tool on Windows.
issues/156303

Behavior of Nested torch.compile Calls: This issue highlights the lack of clearly defined behavior for nested torch.compile calls with differing arguments in PyTorch. The expected inlining of the innermost compile call within the outermost one does not occur, leading to potential performance issues and the need for a mechanism to determine the active compilation mode when such nesting is present.
issues/156308

Tracing Custom Operators in Dynamo: This issue addresses the need for a mechanism to enable PyTorch's Dynamo to trace into operators defined with torch.library.custom_op. Currently, such operators are not traceable by Dynamo, which limits their usability in APIs requiring OpOverload and prevents the performance benefits of torch.compile from being realized.
issues/156322

Memory Layout Assertion Error in torch.compile: This issue involves a bug in the PyTorch project where the torch.compile function throws a memory layout assertion error for the "B" tensor in a scaled grouped matrix multiplication operation. The error occurs despite prior assertions confirming the correct memory layout, potentially due to recent changes involving autotuner support and fusion processes.
issues/156325

Instability of MI250 Nodes in Test Environment: This issue pertains to the instability of MI250 nodes in the 'linux-jammy-rocm-py3.10' test environment due to the rocminfo command hanging. The team has marked these nodes as unstable while they work on resolving the problem, as evidenced by a specific example of the hang provided in the linked GitHub Actions run.
issues/156327

Recompilation in transformers Models: This issue discusses a bug related to recompilation in PyTorch's transformers models when using mark_static_address with cudagraphs. The recompilation occurs due to ID matching on tensors, and the issue proposes exploring the possibility of avoiding recompilation by reusing existing logic for multiple cudagraphs to improve compile time efficiency.
issues/156377

NaN Values in DataParallel Module: This issue describes a bug in PyTorch's DataParallel module where the gathered results across multiple GPUs show NaN values under certain configurations. The behavior is inconsistent across different machines and reboots, particularly when fix_dp_output_device is set to False.
issues/156392

TorchRuntimeError in FX IR Export: This issue involves a bug encountered when attempting to export the FX IR for a model using PyTorch, where a TorchRuntimeError occurs during the call_function aten.lift_fresh_copy.default operation. The error is resolved by disabling the optimizer, suggesting a potential compiler bug since the input remains unchanged.
issues/156411

Segmentation Fault in torch.compile: This issue reports a segmentation fault occurring on an x86 CPU when running a PyTorch script with the @torch.compile(mode='max-autotune') decorator. The fault occurs specifically when executing a batched matrix multiplication using torch.bmm, and provides detailed environment and version information for troubleshooting.
issues/156412

Potential Bug in HistogramObserver._combine_histograms(): This issue is about a potential bug in the HistogramObserver._combine_histograms() function of the PyTorch library. The user questions the use of torch.sum(update_hist) instead of torch.sum(orig_hist) for counting samples in the original histogram, and also seeks clarification on the use of min_val_neg and max_val_pos for computing quantization parameters in the UniformQuantizationObserverBase class, particularly in the context of asymmetric quantization.
issues/156414

ShardedTensor Class Bug: This issue describes a bug in the PyTorch library where the ShardedTensor class does not support the is_cuda property. This property is used by the _cycles.py script's is_cuda_tensor() function for cycle detection, resulting in a RuntimeError when attempting to access this property.
issues/156417

Integration of cuda-bindings Package: This issue involves exploring the integration of the lightweight package cuda-bindings, which has no Python dependencies, into the core of PyTorch. The focus is on understanding its applications, trade-offs, and feasibility, while acknowledging that it may not lead to immediate changes.
issues/156424

Hanging CI Job in GitHub Project: This issue describes a problem where the "Update viable/strict" continuous integration (CI) job in a GitHub project occasionally hangs indefinitely after being scheduled. This causes subsequent jobs to wait and eventually cancel themselves, which prevents updates for several days despite the CI status being green, and the only current solution is to manually cancel the hung job.
issues/156425

Incorrect Results on MPS Backend: This issue involves a bug in the PyTorch project where a function using torch.var_mean and an epilogue produces incorrect results on the Metal Performance Shaders (MPS) backend. A specific Python script reproduces the discrepancy between expected and actual outputs; the problem has been traced to the generated Metal code, and a proposed fix corrects it at the cost of reduced efficiency.
issues/156426

C++ Documentation Update: This issue suggests updating the C++ documentation for PyTorch to recommend using the torch CMake target instead of the ${TORCH_LIBRARIES} variable for linking, as indicated by a specific commit in the PyTorch repository.
issues/156434

Disabled Test in TestTEFuserDynamic Suite: This issue concerns a disabled test named 'test_skip_grad_in_check' from the 'main.TestTEFuserDynamic' suite on the Linux platform, which started failing on the main branch after an unrelated change was made. Despite a revert attempt, it remains unresolved, prompting a triage review to evaluate the necessity of running these tests given that TorchScript is in maintenance mode and TEFuser is not widely utilized.
issues/156436

Accuracy Minifier Failure: This issue describes a problem with the accuracy minifier in a PyTorch project, where the tool fails to further minify a script intended to diagnose accuracy issues with the MPSInductor backend. Despite correctly identifying initial accuracy test failures, it encounters a runtime exception unrelated to the accuracy problem.
issues/156437

Disabled Test in TestTEFuserDynamic Suite: This issue pertains to a disabled test named 'test_inlined_optimized_graph' within the 'main.TestTEFuserDynamic' suite, which was deactivated due to failures on the main branch. The issue is referenced in a related issue and recent failure examples, and involves contributors such as @EikanWang, @jgong5, @wenzhe-nrv, and @sanchitintel.
issues/156438

Runtime Error with is_pinned() Method: This issue describes a bug where accessing the is_pinned() method on a CPU-based PyTorch tensor raises a runtime error after renaming the privateuseone backend. The error occurs due to the lack of a registered implementation for the renamed backend, which is expected behavior but suggests a need for a Python-registerable solution.
issues/156444

UnicodeDecodeError in torch.compile(): This issue involves a UnicodeDecodeError occurring when using torch.compile() on a PyTorch model that injects extreme values like NaN and Inf. The error results in a failure due to a non-UTF-8 byte in the code, potentially related to non-unicode comments in the codebase.
issues/156451
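
The class of failure described above, a non-UTF-8 byte inside source text, can be reproduced in isolation with the standard library. The sample line below is hypothetical and is not taken from the issue. It only shows what a strict UTF-8 decode does with a Latin-1 encoded comment and how a lenient decode behaves instead.

```python
# A comment containing 'é' encoded as Latin-1 becomes the single byte 0xE9,
# which is not valid UTF-8 in this position.
source_bytes = "x = 1  # café\n".encode("latin-1")

failure = None
try:
    source_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)
    print("strict decode failed at byte offset", exc.start)

# Lenient decoding substitutes U+FFFD for the undecodable byte instead of raising.
recovered = source_bytes.decode("utf-8", errors="replace")
print(recovered)
```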

ONNX Shape Inference Parameter Issue: This issue highlights a problem where the parameter "onnx_shape_inference" cannot be successfully passed to the "_export" interface in "torch/onnx/utils.py" when using the "torch.onnx.export" interface. The shape type inference always runs with its default value, which is problematic for certain models that fail during this process.
issues/156480

AttributeError in Dynamo Benchmark Test: This issue reports a failure in the Dynamo benchmark test for PyTorch, where an AttributeError occurs because the 'torch.dtype' object lacks the 'name' attribute. The error leads to a runtime error during the execution of a model from the Hugging Face library.
issues/156482

NaN Values in SDPA FLASH_ATTENTION Backend: This issue reports that the SDPA FLASH_ATTENTION backend is generating NaN values when used with the Intel Extension for PyTorch (IPEX) on Intel CPUs. The issue is demonstrated by the provided Python code snippet.
issues/156487

Missing Stub for mypy-torch._C._jit_tree_views: This issue involves adding a missing stub for mypy-torch._C._jit_tree_views to achieve full mypy compliance. The addition is part of the effort to create stub files for PyTorch's extension modules, specifically mentioned in the project's type annotation guide.
issues/156488

Deprecation Warning in Mypy Type Checking: This issue is about a deprecation warning encountered during mypy type checking, indicating that the numpy.typing.mypy_plugin is deprecated and will be removed in a future release. The warning prompts the need to remove plugins = numpy.typing.mypy_plugin from the mypy configuration to avoid future issues.
issues/156489

Suitability of functorch_maml_omniglot Model: This issue highlights concerns about the suitability of the functorch_maml_omniglot model as a CPU performance smoketest in the PyTorch project. The benchmark is prone to significant and unexplained performance shifts, lacks actionable insights for optimization, and has not been effectively updated or flagged for performance changes in the continuous integration process.
issues/156511

Incorrect Gradients in BFloat16 Mixed BatchNorm: This issue reports that the Native NCHW BFloat16 Mixed BatchNorm training in PyTorch produces incorrect gradients for weight, bias, and input when compared to MIOpen and CPU results. The discrepancies are particularly highlighted when using the native backend on CUDA.
issues/156513

Convenient Method for torch.device Creation: This issue is about finding a more convenient method to create a torch.device with a specific device index using the torch.accelerator API. The current approach is considered cumbersome, and the user suggests potential alternatives for improvement.
issues/156519

Compile Time Optimization for transformers Models: This issue addresses the challenge of optimizing compile time for transformers models by exploring the potential of moving torch.compile to a repeated block rather than the full model. The approach has shown significant compile time reduction in diffusion models but encounters multiple problems such as excessive recompilations, cudagraph issues, and static address marking complications, which are detailed in the provided test case and error logs.
issues/156520

Unstable Job in PyTorch Project: This issue pertains to marking a specific job as unstable in the PyTorch project due to a failure in the functorch_maml_omniglot test. The failure is indicated in the test inductor_torchbench_cpu_smoketest_perf for the linux-jammy-cpu-py3.9-gcc11-inductor configuration.
issues/156521

DTensor Dispatch Logic Improvement: This issue addresses the need for improving the DTensor dispatch logic in PyTorch to allow operations that modify only a specific shard without requiring an all_gather operation first. The current approach results in inefficiencies and unnecessary replication of output.
issues/156523

Attention Mechanism Tests in ONNX Module: This issue pertains to updating the tests for the attention mechanism in the ONNX module of the PyTorch project. The update is referenced by the pull request at https://github.com/pytorch/pytorch/pull/156431.
issues/156524

Incorrect Functionality of << and >> Operators: This issue highlights a problem where the << and >> operators do not function correctly for DTensor operands. The operators have no effect and result in incorrect outcomes, unlike when using scalar operands, necessitating a workaround using torch.bitwise_left_shift and torch.bitwise_right_shift functions.
issues/156533

Tensor Incompatibility in FSDP2 Forward Pass: This issue involves a bug encountered during the forward pass of a PyTorch model using FSDP2, where an error arises due to a tensor incompatibility between torch.Tensor and DTensor when converting tokens into embeddings. The issue is likely because the embedding layer is not sharded and remains an ordinary torch.Tensor, causing a conflict with distributed operators.
issues/156535

NotImplementedError in SparseMPS Backend: This issue involves a NotImplementedError encountered when using the SparseMPS backend for the Whisper model in PyTorch. The error is specifically related to the operation 'aten::_sparse_coo_tensor_with_dims_and_tensors', which is not supported by this backend and may require a custom build or alternative backend to resolve.
issues/156540

TorchScript Method Access Failure: This issue highlights a bug in TorchScript where it fails to access methods specific to nested tensors, such as offsets. The failure results in runtime errors when attempting to script or trace functions that utilize these methods.
issues/156544

Inconsistent Results in Neural Time Series Models: This issue reports inconsistent and unreliable results when running neural time series models, particularly the Darts TCNModel, on Windows using PyTorch with CUDA. The GPU builds produce higher errors and random crashes, whereas CPU-only builds give stable and reproducible results. The reporter seeks guidance on ensuring result parity and on identifying a potential bug in the CUDA/cuDNN backend or in Darts' PyTorch interface.
issues/156547

Segmentation Fault in GroupNorm Backward Operation: This issue involves a failure in the ConvertTritonGPUToLLVM pass during a fused GroupNorm backward operation on a system with SM 89 architecture when using the Inductor backend in PyTorch's torch.compile() API. The failure results in segmentation faults and Triton compilation crashes, potentially due to environment configuration issues, and was temporarily resolved by modifying fusion options and clearing the Triton cache.
issues/156549

2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable. 
Issues Closed This Week: 32
Summarized Issues:

Runtime Errors on Specific GPUs: This issue involves a runtime error where the function 'MmBackward0' returns NaN values in its 0th output when running a PyTorch model on an RTX 5080 GPU, despite the same code running successfully on an RTX 2070 or CPU. The problem is potentially due to a gradient explosion caused by the SDPA (Scaled Dot-Product Attention) mechanism.
pytorch/pytorch/issues/156015

Discrepancies in Function Outputs Between CPU and CUDA: Several issues highlight discrepancies in function outputs between CPU and CUDA implementations in PyTorch, such as torch.cumprod, torch.histc, and torch.ops.aten.index_put. These discrepancies are often attributed to differences in numerical precision and execution order, particularly affecting operations with float16 data type.
pytorch/pytorch/issues/156018, pytorch/pytorch/issues/156019, pytorch/pytorch/issues/156173
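
As a rough illustration of the "numerical precision and execution order" point, the sketch below emulates float16 accumulation in plain Python and adds the same three numbers in two different orders. It does not reproduce any of the reported issues; the helper names to_fp16 and fp16_sum are hypothetical, and it assumes only that each intermediate result is rounded to half precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total


# Same three numbers, two accumulation orders.
large_first = fp16_sum([2048.0, 1.0, 1.0])  # each +1 is lost to rounding
small_first = fp16_sum([1.0, 1.0, 2048.0])  # the two 1s combine before the big value

print(large_first, small_first)  # 2048.0 2050.0
```

Above 2048, float16 can only represent every second integer, so the order in which small and large terms meet changes the result. A parallel reduction on a GPU and a sequential loop on a CPU can differ in a similar way.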

Performance and Compilation Issues: Issues related to performance regressions and compilation failures in PyTorch include a 6% increase in latency for LLM int8 due to a specific pull request and a failure in Ahead-Of-Time (AOT) compilation with the SYCL extension. These issues discuss potential solutions and adjustments to mitigate the impact while maintaining performance.
pytorch/pytorch/issues/156037, pytorch/pytorch/issues/156249

Test Failures on Specific Platforms: Some issues involve the disabling of tests within the PrecompileContextTests suite on the xpu platform due to consistent failures on the main branch. Contributors are notified for further attention to address these failures.
pytorch/pytorch/issues/156063, pytorch/pytorch/issues/156146

Non-Deterministic and Unexpected Behavior in PyTorch Layers: Issues describe non-deterministic behavior in PyTorch layers, such as a bug in the BatchNorm1d layer when track_running_stats=False and discrepancies in results between nn.Conv2d and nn.Linear. These issues highlight unexpected behavior and seek solutions to align with expected outcomes.
pytorch/pytorch/issues/156051, pytorch/pytorch/issues/156154

Export and Serialization Issues: Several issues involve problems with exporting and serializing models in PyTorch, such as a KeyError in the torch.export function and a schema version mismatch when using torch.export.load. These issues suggest potential improvements in state tracking and error messaging.
pytorch/pytorch/issues/156167, pytorch/pytorch/issues/156354

Installation and Compatibility Challenges: Users face challenges with installing PyTorch, such as conflicts with manylinux requirements on older Ubuntu versions and difficulties with pip installations. These issues request continued support for older systems and suggest potential solutions for installation errors.
pytorch/pytorch/issues/156215, pytorch/pytorch/issues/156413

Optimizer and Training Issues: Issues with optimizers and training in PyTorch include difficulties with the L-BFGS optimizer and unexpected behavior with DTensor parameters and torch.optim.Adam. These issues explore potential bugs or incorrect usage and seek guidance for resolution.
pytorch/pytorch/issues/156501, pytorch/pytorch/issues/156453

CI/CD and Infrastructure Problems: Problems in the PyTorch CI/CD pipeline include the unavailability of Windows Runners and disk space issues on MI300 runners. These issues were resolved by reverting infrastructure changes and implementing mitigation steps.
pytorch/pytorch/issues/156352, pytorch/pytorch/issues/156360

ONNX Export Compatibility Issues: An issue with torch.onnx.export involves the inclusion of dropout nodes in the exported ONNX model when using opset_version = 22, causing compatibility issues with CoreML. Setting the opset version to 20 resolves the problem.
pytorch/pytorch/issues/156542

2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment. 
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week. 

III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 196
Key Open Pull Requests
1. [CUTLASS] [CUDA] SM100 GroupMM: This pull request introduces Blackwell support for GroupMM on CUDA SM100, reusing much of the existing code for SM90 while adjusting the kernel schedule as per NVIDIA's documentation, and includes preliminary benchmarking results comparing H200 and B200 performance.

URL: pull/156203

Merged: No

Associated Commits: aea5b, c7727, 8bafa, 45dd0, 790a0, 1e837, d4b59, 356c8, 9e4c2, 21c55, 70c41, 3eea5, e2ae6, 6a8ef, 8fd22, 19ef6, 70006, 485f4, 16d51, 77a93, b5eab, a46d4, fe454, c85df

2. Deprecate CUDAAllocatorConfig, use AllocatorConfig instead: This pull request aims to deprecate the CUDAAllocatorConfig in favor of using AllocatorConfig within the PyTorch project, as part of a series of changes tracked by the ghstack tool, and involves multiple updates and contributions from various collaborators.

URL: pull/156165

Merged: No

Associated Commits: 85fbc, d2405, 41664, a92a4, 1c3bf, 50bb4, 9051e, 29277, 1fac5, 56d2c, dd9f8, b87c4, 79e7b, 2fb05, f2e38, ceabc, 5e2f5, f8f73, d80ec, ec38b, 51798, cd55a, e9298

3. [build] modernize build-frontend: python setup.py develop/install -> [uv ]pip install[ -e] .: This pull request aims to modernize the development installation process by replacing the deprecated python setup.py develop/install commands with the more current [uv] pip install [-e] . approach, in line with the changes in setuptools>=80.0, to avoid future deprecation warnings and errors.

URL: pull/156027

Merged: No

Associated Commits: bbe70, 278b3, 7e10c, 6386d, 08226, 6582e, a242b, 845c3, d4caa, bd377, 1ebc8, 7b63c, 862de, 954db, 722e1, fb275, f14a9, 56720, a6d97, 4f736

Other Open Pull Requests

Setuptools Version Limit Removal: This pull request proposes removing the upper version limit for setuptools<80.0 in the build configuration of the PyTorch project. The change is aimed at ensuring compatibility with newer versions of setuptools, as indicated by the title and the associated commits.
pull/156049

Windows CUDA 12.9.1 Support: This pull request aims to add support for building with Windows CUDA 12.9.1 in the PyTorch project. It addresses an unspecified issue and includes multiple commits that adjust memory limits, fix out-of-memory errors, resolve merge conflicts, and update build scripts.
pull/156179

GBID Weblink Generation: This pull request addresses the generation of a GBID weblink when the unimplemented_v2() function is called. It includes the necessary JSON file in setup.py for CI packaging and resolves an issue with hardcoded error messages that incorrectly strip a '/' from 'https'.
pull/156033

New API for Accelerators: This pull request is a work in progress aimed at introducing a new API for setting allocators specifically for accelerators in the PyTorch project. It is indicated by its title and the series of commits linked through the ghstack tool.
pull/156175

AOTriton Library Update: This pull request proposes an update to the AOTriton library to version 0.10b, introducing significant enhancements for SDPA operators on AMD systems. It includes official support for new architectures, a substantial reduction in the binary size of libaotriton.so, and the addition of sliding window attention support.
pull/156499

Error Handling Enhancements: This pull request aims to enhance the error handling in the PyTorch project by raising exceptions when invalid method calls are made on lists. It involves multiple commits that are part of a stack of changes managed by ghstack.
pull/156148

Dtype Documentation Revamp: This pull request aims to revamp the dtype documentation for 2025 by consolidating duplicated documentation into tensor_attributes.rst. It reorganizes the dtype table into separate sections for floating point and integer dtypes, introduces definitions for shell dtypes and various dtype suffixes, and removes outdated quantized dtypes.
pull/156087

List Method Implementations: This pull request aims to implement the list.remove function in the PyTorch project, as part of a stack of related changes. It involves multiple updates and contributions from various collaborators.
pull/156242

CPU Log Softmax Kernel Extraction: This pull request involves extracting CPU log_softmax kernels into a header file to facilitate their sharing with ExecuTorch. It is part of a series of commits aimed at improving code modularity and reusability within the PyTorch project.
pull/156243

Triton Compile Worker Pool Quiescence: This pull request introduces a mechanism to quiesce the Triton compile worker pool after each dynamo compile. It addresses performance and resource usage challenges by shutting down worker subprocesses using the SubprocPool implementation.
pull/156187

Kernel Template Configuration: This pull request introduces a system where kernel templates are used to determine configuration heuristics. It allows for a unified interface across different backends by passing a KernelTemplate to obtain configuration heuristics alongside KernelInputs.
pull/156282

FlexAttn Configuration Refactor: This pull request refactors the FlexAttn configuration to align with the GEMM/Conv configuration approach. It introduces an exhaustive tuning mode for performance benchmarking and updates ROCm flex autotune configurations to enhance performance.
pull/156307

Project Collaboration and Updates: This pull request, titled "[br][mk] attempt 1," involves multiple updates and commits related to a project on the PyTorch GitHub repository. It includes collaboration and review from several contributors, as indicated by the numerous mentions in the body of the pull request.
pull/156428

JIT Documentation Conversion: This pull request involves converting the documentation file for the Just-In-Time (JIT) compiler from reStructuredText format to Markdown. It addresses issue #155024 and includes several commits for renaming the file, updating its content, and fixing indentation.
pull/156094

List Addition Methods: This pull request aims to implement the list.__add__ and list.__iadd__ methods in the PyTorch project. It is part of a series of related changes tracked by ghstack and involves multiple contributors and reviewers.
pull/156270
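
The summary does not describe how the two methods are implemented. For reference, the sketch below shows only their ordinary Python behavior, which any reimplementation would presumably be expected to match: list.__add__ builds a new list, while list.__iadd__ extends the existing one in place.

```python
original = [1, 2]
alias = original  # second name for the same list object

# list.__add__: returns a new list and leaves the operands untouched.
combined = original + [3]
alias_after_add = list(alias)  # snapshot: still [1, 2]

# list.__iadd__: extends the existing object, so the alias sees the change.
original += [3]
same_object = alias is original

# __iadd__ accepts any iterable; __add__ only accepts another list.
original += (4, 5)
try:
    [1, 2] + (3,)
    add_rejects_tuple = False
except TypeError:
    add_rejects_tuple = True

print(combined, alias, same_object, add_rejects_tuple)
```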

List Multiplication Methods: This pull request proposes the addition of the list.__mul__ and list.__imul__ methods to the project. It is part of a series of changes managed through the ghstack tool and involves multiple contributors and reviewers.
pull/156271
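
As with the addition methods, the summary gives no implementation detail. The sketch below only illustrates the standard Python semantics of list.__mul__ and list.__imul__: repetition copies references rather than elements, and the in-place form mutates the existing list.

```python
row = [0, 0]

# list.__mul__: a new outer list holding the SAME inner object three times.
grid = [row] * 3
row[0] = 7  # visible through every slot of grid
shared = all(r is row for r in grid)

# list.__imul__: repeats the contents of the existing list in place.
items = [1, 2]
alias = items
items *= 2

# Zero or negative counts produce an empty list.
empty = [1, 2] * 0
negative = [1, 2] * -1

print(grid, alias, empty, negative)
```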

Triton Template Consolidation: This pull request aims to simplify the import and dependency structure of the PyTorch project by breaking out the TritonTemplate, TritonTemplateKernel, and TritonTemplateCaller from the select_algorithm.py file. It is part of a two-part effort to consolidate all Triton template-related components and streamline the codebase.
pull/156280, pull/156281

Typographical Error Corrections: This pull request addresses the correction of typographical errors within the torch/csrc/ directory of the PyTorch project. It is part of a series of related changes managed through the ghstack tool and involves multiple commits with updates marked as "[ghstack-poisoned]."
pull/156319, pull/156321

Dynamo Warning Mechanism Update: This pull request addresses issue #155352 by modifying the warning mechanism in PyTorch's Dynamo to use torch._dynamo.utils.warn_once. It ensures that warnings about functools.lru_cache-wrapped functions are emitted only once and adds user stack traces in debug mode.
pull/156463
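
One common way to get warn-once behavior in Python is to cache the warning function on its message, so that repeated calls with the same text become no-ops. The sketch below is a hypothetical stand-in built on that idea, using only the standard library. It is not a copy of torch._dynamo.utils.warn_once, and it omits the debug-mode stack traces the pull request mentions.

```python
import functools
import warnings


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Emit `message` the first time it is seen; later calls hit the cache."""
    warnings.warn(message, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # switch off Python's own de-duplication
    warn_once("lru_cache-wrapped function called")
    warn_once("lru_cache-wrapped function called")  # cached: no second warning
    warn_once("a different message")

messages = [str(w.message) for w in caught]
print(messages)
```

Because the cache key is the message string, two call sites that produce identical text share one warning, while any change in the text triggers a new one.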

Torchbench Environment Setup Update: This pull request updates the environment setup script for torchbench by replacing the outdated Makefile. It allows developers to set up the environment for specific models in just 1-2 minutes, significantly reducing setup time and improving usability.
pull/156465

Composable Kernel Library Integration: This pull request aims to integrate the Composable Kernel library into the ROCm backend for PyTorch's Inductor. It transitions from a submodule to commit pin control to facilitate its inclusion in the 2.8 release, enabling its use with both inductor and AOT inductor.
pull/156192

Compile Kernel Function Testing: This pull request aims to ensure that the compile_kernel function integrates effectively with custom operations by adding new tests. It does not introduce any new code functionality and is a continuation of work from a previous pull request (#151484).
pull/156332

List Deletion Method: This pull request introduces the list.__delitem__ method to the project. It is part of a series of changes managed through the ghstack tool and involves multiple contributors and reviewers for its development and review process.
pull/156339

3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 213
Key Closed Pull Requests
1. [release/2.6] Numpy Dtype: This pull request targets the release/2.6 branch and relates to the NumPy data type in the PyTorch project. Its description still contains the template placeholder "Fixes #ISSUE_NUMBER", so the issue it addresses is unspecified, and its body tags an extensive list of contributors and reviewers.

URL: pull/156440

Merged: No

Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 93864, 88b97, bbd00, 1f32b, d33dd, f1481, ea546, ac7d6, ed487, 66dfe, 3783d, 8adc1, 5c4fa, 639ee, b445b, 8d72c, 374e5, 6a3b5, e607b, aafc7, d5947, 1b753, ba1ba, 70f30, 3398f, 8354d, 737cf, 4202f, 7c27e, 2e2c7, 3a818, 53ad2, 8eb5d, dbe8c, fcdff, 92b55, f6789, 2e1ed, 13339, 82ac2, 3608e, bfb23, 86b0a, 03714, 34caa, ac032, 5dd61, 7d528, d9a03, 7c072, 73dd0, b08d9, d70a9, 7ad5a, 2fd46, ed8c6, bf084, 20ad8, 2fb0a, 8cfa9, 50a04, 45896, 9d0a4, 1a808, 6fe84, a3632, 68180, fb24f, e53a9, 2cda1, 9d566, a7044, c7ba8, cbd7b, c3733, faf90, a87c9, ce6b7, dc41a, 469ce, 8ccfc, 1290e, 93693, 75628, f4c96, 9cf15, 5c42a, 1a150, 1ded2, 2ff80, b6e5f, 50924, 4642c, 5de86, f0b4a, 96c61, 751e4, 684f6, 0afd4, 93ff7, e22ae, 2eff7, 2a634, f3092, eb37e, d1c90, 01546, dae14

2. Fix profiler dynamic toggle: This pull request addresses an issue with profiling CUDA code on ROCm devices by modifying the profiler to check for the presence of "hip" (indicating hipified code) instead of "cuda" in the profiler event names, thereby fixing the problem referenced in issue #480147.

URL: pull/156293

Merged: No

Associated Commits: 2e0b1, 1f8eb, 8a7fd, 97f3d, 550bc, e7cb7, f61af, 0fd19, 167b4, 06da6, 0412e, 123a1, 2ee3a, a95ad, f070d, bef73, 95105, 0036d, 6c0d1, 0dd76, eb265, c20a8, 6894b, baf34, 51916, 3d6ba, 1a5a7, c113e, 78867, be308, ab8a9, cc13b, 63cbb, 5286c, 9d8f0, 79fa0, 1dea6, dec5b, 81e75, a771d, 2fbd2, 15f91, f7b26, 73cf3, 222ae, ec0c5, 45e1d, bb655, e4c1c, 45985, d37c4, 3a570, 46344, be4f8, befce, 1aa5d, aef0f, 5b344, dc726, b345d, bbae9, fa9fa, 9e184, 08da4, 0b79e, f1ad4, cf324, 13a86, 3057d, a0a9d, ecce5

3. [ONNX] Implement Attention-23: This pull request aims to implement the Attention-23 operation using the scaled dot-product attention (sdpa) method in the ONNX framework, while also updating the conversion logic to eliminate trailing None inputs, as part of the PyTorch project, although it was ultimately not merged.

URL: pull/156431

Merged: No

Associated Commits: 4af17, d44b6, b02b9, cf000, d0bad, a1ebd, f0e46, 430c9, 25a90, d188f, 27f34, 4405e, 4d63b, 6eee5, 86d46, 4a133, ee28e, ff1ea, e37f7, a8e02, 08363, 853c7, a7a67, 0796b, f6385, 8488e, af854, 30eb5, dea1e, e1f05, 1d691, 8ceaf, 0dd5e, cff49, 140cd

Other Closed Pull Requests

Pipeline Schedule Updates: This pull request updates the pipeline schedules in the PyTorch project to utilize the new _PipelineScheduleRuntime. It ensures that the ScheduleGPipe and Schedule1F1B classes now generate pipeline_order IR and operate with the unified execution engine while maintaining backward compatibility.
pull/156013

Typographical Error Corrections: Multiple pull requests aim to correct typographical errors across various directories in the PyTorch project, including functorch/, c10/, CMake files, and more. Despite their efforts, these pull requests were not merged into the main codebase.
pull/156080, pull/156081, pull/156078, pull/156079, pull/156082, pull/156083, pull/156077, pull/156069

MPS Backend Enhancements: This pull request adds forward and backward operations for the nearest_3d function in the MPS backend. It introduces a generalizable UpsampleParams structure and replaces the existing upsample_nearest3d MPS fallback with a proper shader.
pull/156090

AOTriton Library Update: This pull request updates the AOTriton library to version 0.10b, introducing significant enhancements for SDPA operators on AMD systems. It includes official support for new architectures, a substantial reduction in the binary size of libaotriton.so, and the addition of sliding window attention support.
pull/156290

CUDA Implementation Transition: This pull request transitions the CUDA implementation to use the runtime driver API for the cuStreamWriteValue32 function. It addresses issue #154073 and includes multiple commits for refactoring, symbol exportation, and code linting, although it remains unmerged.
pull/156097

Metal Kernels for Scan Operations: This pull request implements Metal kernels for scan operations by migrating cumsum and cumprod from the MPSGraph implementation to Metal. It also adds MPS backend support for cummin and cummax, addressing issue #154881.
pull/156100

PyTorch Setup Optimization: This pull request optimizes the setup process for the PyTorch development environment by downloading the torch package via pip and installing other dependencies using uv. It significantly reduces the setup time from 70 seconds to 17.4 seconds while ensuring the correct installation of pinned NVIDIA dependencies.
pull/156409

Torchscript Exporter Update: This pull request updates the default opset for the torchscript exporter to version 18 to align with the dynamo exporter. It removes the hard limit on the torchscript exporter to allow for greater flexibility in future updates.
pull/156023

Inter-Process Communication Enhancement: This pull request aims to enhance the inter-process communication (IPC) for expandable segments in the PyTorch project by utilizing the fabric handle when feasible. It builds upon a previous pull request and was inspired by a specific issue comment, although it was ultimately not merged.
pull/156074

NVSHMEM Runtime Detection: This pull request introduces runtime detection of NVSHMEM in the PyTorch project to allow the selection of the default backend for SymmetricMemory. It adds a Python API torch.distributed._symmetric_memory.is_nvshmem_available() to check the availability of NVSHMEM.
pull/156291

Memory Layout Mismatch Fix: This pull request addresses a memory layout mismatch in the fft_r2c XPU implementation by aligning its Inductor meta deduction with the CPU fallback. It updates the torch-xpu-ops commit and ensures the XPU performs the R2C transform on the last dimension followed by iterative C2C transforms on the remaining dimensions.
pull/156048

3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment. 

[wip]
Toxicity Score: 0.55 (Defensive responses, unresolved frustration, mediation attempts.)
This GitHub conversation involves username1 expressing frustration over username2's proposed solution, which did not resolve the issue at hand. Username2 responds defensively, leading to a tense exchange. Username3 attempts to mediate by suggesting alternative approaches, but the tone remains strained as username1 continues to express dissatisfaction.

IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month. 
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor	Commits	Pull Requests	Issues	Comments
XuehaiPan	318	52	2	36
malfet	183	26	10	148
bobrenjc93	303	35	3	3
svekars	55	5	30	130
Skylion007	50	15	0	124
guilhermeleobas	141	31	1	0
guangyey	92	7	1	43
laithsakka	97	19	1	25
davidberard98	108	10	9	12
justinchuby	65	6	3	58