Weekly GitHub Report for Pytorch: April 28, 2025 - May 05, 2025 (12:01:49)

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.

Table of Contents

I. News
1.1. Recent Version Releases
1.2. Other Noteworthy Updates

II. Issues
2.1. Top 5 Active Issues
2.2. Top 5 Stale Issues
2.3. Open Issues
2.4. Closed Issues
2.5. Issue Discussion Insights

III. Pull Requests
3.1. Open Pull Requests
3.2. Closed Pull Requests
3.3. Pull Request Discussion Insights

IV. Contributors
4.1. Contributors

I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks a shift away from publishing on Conda, with a focus on using official wheel packages or conda-forge, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load to enhance security.

II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

Illegal Instruction Caused by grid_sample Under Windows: This issue involves a bug in PyTorch 2.7.0+cu118 on Windows 10, where using grid_sample with float64 tensors causes an "illegal instruction" error, leading to crashes, particularly affecting CI/CD processes. The problem is specific to this PyTorch version and does not occur with previous versions or when using float32 tensors.

The comments discuss linking the issue to a previous one, but it is determined to be a new problem specific to PyTorch 2.7, possibly due to a switch to VS2022 for building. Debugging reveals the issue is related to AVX512 instructions running on AVX2 machines, and a rollback to VS2019 is suggested. A PR is drafted to revert to VS2019, and testing confirms that this resolves the issue.
Number of comments this week: 14

NCCL out of memory error after updating to PyTorch 2.7: This issue describes a problem encountered after updating to PyTorch 2.7, where using the init_process_group with NCCL and calling DDP(model, device_ids=[rank]) results in an out-of-memory error, despite using minimal memory. The error did not occur before the update, and the same code worked fine with previous versions of PyTorch.

The comments discuss potential causes and solutions for the out-of-memory error, including running with NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALLOC for more detailed logs. It is confirmed that the error does not occur when using a single GPU or when using the gloo backend instead of NCCL. A suggestion to run with NCCL_CUMEM_HOST_ENABLE=0 resolves the issue, indicating a potential problem with cuMem host allocations, especially under WSL. The discussion also mentions a forthcoming NCCL patch that might address similar issues.
Number of comments this week: 11
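
The environment variables mentioned in this thread can be set from Python before the process group is created. The following is a minimal sketch, not code from the issue: the helper name nccl_debug_env is hypothetical, and only the variable names and values are taken from the discussion above.

```python
import os

# Variables named in the issue discussion. The helper itself is hypothetical.
NCCL_DIAGNOSTIC_VARS = {
    "NCCL_DEBUG": "INFO",          # verbose NCCL logging
    "NCCL_DEBUG_SUBSYS": "ALLOC",  # limit the log output to the allocation subsystem
}
NCCL_WORKAROUND_VARS = {
    "NCCL_CUMEM_HOST_ENABLE": "0",  # setting reported in the thread to resolve the error
}


def nccl_debug_env(base=None, apply_workaround=True):
    """Return a copy of `base` (default: os.environ) with the NCCL variables set."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_DIAGNOSTIC_VARS)
    if apply_workaround:
        env.update(NCCL_WORKAROUND_VARS)
    return env
```

To apply the settings to the current process, call os.environ.update(nccl_debug_env()) before torch.distributed.init_process_group is invoked, since NCCL reads its configuration from the environment when it is initialized.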

Newly added lint-urls jobs are very flaky: This issue reports that the newly added lint-urls jobs in the PyTorch project are intermittently failing on pull requests and the main branch, causing instability in the continuous integration process. The issue is being actively discussed and addressed by the development team, with suggestions to mark the job as unstable and investigate potential causes such as rate limiting by external services.

The comments discuss various examples of the job failures and potential solutions, including marking the job as unstable, investigating rate limits, and applying fixes. There is a suggestion to disable the job temporarily, and discussions on how to handle false positives in the script. The team is working on ensuring stability and considering using tags to skip checks, with some fixes already implemented and under observation.
Number of comments this week: 10

setup.py develop command is disappearing soon from setuptools: This issue highlights the impending removal of the setup.py develop command from setuptools, which PyTorch currently relies on for development and continuous integration processes. The deprecation of this command necessitates urgent action to transition to alternative methods, such as using pip install -e . -v --no-build-isolation, and potentially adopting new developer tools to manage builds more effectively.

The comments discuss the broader impact on related PyTorch projects and suggest limiting the setuptools version as a short-term solution. There is a proposal to adopt a dedicated developer CLI tool for better user experience, and a deadline is noted for transitioning away from the setup.py interface. Additionally, a related issue for dynamic dependency pinning is mentioned.
Number of comments this week: 7
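
For scripts that currently shell out to setup.py develop, the replacement command quoted in the issue can be wrapped as follows. This is a sketch under the assumption that pip is available in the active interpreter; the function names are hypothetical and the flags are exactly those listed above.

```python
import subprocess
import sys


def editable_install_command(project_dir="."):
    """Build the pip command the issue suggests in place of `setup.py develop`."""
    return [
        sys.executable, "-m", "pip", "install",
        "-e", project_dir,       # editable (development) install
        "-v",                    # verbose build output
        "--no-build-isolation",  # build against the already-installed environment
    ]


def run_editable_install(project_dir="."):
    """Run the install; raises CalledProcessError if pip exits with a nonzero status."""
    subprocess.run(editable_install_command(project_dir), check=True)
```

Invoking pip through sys.executable keeps the install tied to the interpreter that runs the script, which avoids picking up a different pip from PATH.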

Update torch/nn/modules/conv.py to use Literal for support padding modes: This issue proposes updating the torch/nn/modules/conv.py file in the PyTorch project to use typing.Literal for specifying supported padding modes, instead of using a generic str type, to enhance type checking and catch potential bugs. The goal is to improve code reliability by explicitly defining the supported padding modes, such as "valid" and "same", which can be verified by the type checker before the code is executed.

Multiple contributors expressed interest in working on the issue, with some seeking approval to proceed and others offering to resolve related errors. There was a concern about coordination to avoid duplicate efforts, and a contributor highlighted additional files with similar issues, suggesting further improvements.
Number of comments this week: 7
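
The idea behind the proposal can be illustrated with a small, self-contained sketch. It is not the code in torch/nn/modules/conv.py: the alias PaddingStr and the function padding_per_side are hypothetical, and the sketch only covers stride-1, dilation-1 convolutions with odd kernel sizes.

```python
from typing import Literal, Union, get_args

# String padding modes named in the issue summary.
PaddingStr = Literal["valid", "same"]


def padding_per_side(padding: Union[PaddingStr, int], kernel_size: int) -> int:
    """Resolve padding for a stride-1, dilation-1 convolution with an odd kernel."""
    if isinstance(padding, int):
        return padding
    # Runtime check that mirrors what a type checker enforces statically.
    if padding not in get_args(PaddingStr):
        raise ValueError(
            f"padding must be one of {get_args(PaddingStr)}, got {padding!r}"
        )
    if padding == "valid":
        return 0
    if kernel_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes for 'same'")
    return (kernel_size - 1) // 2
```

With the Literal annotation, a call such as padding_per_side("sane", 3) is reported by a static type checker before the code runs, whereas a plain str annotation would accept it and fail only at runtime.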

2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend.
Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a speedup in processing time, as demonstrated by the provided testing code and performance comparisons.
cuda_utils.so: failed to map segment from shared object: This issue involves a problem encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, cuda_utils.so, fails due to a missing execution permission despite being run as the root user. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the file in question lacks the execution bit, leading to an ImportError when attempting to map a segment from the shared object.
Enable UFMT on all files in PyTorch: This issue involves enabling the UFMT (Universal Format) tool on all files within the PyTorch codebase, which currently has approximately 1,500 files that are not formatted according to UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
[JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the size of exported models, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's size from 6.7MB to 5.6MB by manually removing these debug files.
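
The MaxPool2D proposal in the list above rests on the fact that, at stride 1, stacking two small max-pooling windows covers the same region as one larger window. The sketch below checks this in plain Python for a 5x5 window versus two 3x3 windows. It is an illustration only, not the issue's testing code, and the function name max_pool2d_stride1 is hypothetical.

```python
def max_pool2d_stride1(x, k):
    """Max pooling over a 2D list with a k x k window, stride 1, no padding."""
    rows, cols = len(x), len(x[0])
    return [
        [
            max(x[i + di][j + dj] for di in range(k) for dj in range(k))
            for j in range(cols - k + 1)
        ]
        for i in range(rows - k + 1)
    ]


# Deterministic 7x7 input with no particular structure.
grid = [[(7 * i + 3 * j * j) % 11 for j in range(7)] for i in range(7)]

direct = max_pool2d_stride1(grid, 5)                          # one 5x5 pool
stacked = max_pool2d_stride1(max_pool2d_stride1(grid, 3), 3)  # two 3x3 pools

assert direct == stacked
```

Each 5x5 window examines 25 values, while each 3x3 window examines 9, which is roughly where the saving described in the issue would come from. Whether that yields a real speedup depends on the implementation and hardware.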

2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository. 
Issues Opened This Week: 98
Summarized Issues:

Forward Compatibility in torch.export: The lack of forward compatibility in torch.export is problematic for users converting models to CoreML and LiteRT, as these conversions require specific and conflicting PyTorch versions. TorchScript, by contrast, allows exporting in one environment and converting in another, and users are asking whether there are plans to address this once torch.export exits beta.  
pytorch/pytorch/issues/152283

Runtime Errors and Compatibility Issues: Various runtime errors and compatibility issues have been reported, such as a runtime error with the transformers library on Windows (arm64) due to a missing module, and a bug in PyTorch nightly build PT2.8 causing incorrect outputs on Intel GPUs. These issues highlight the challenges in maintaining compatibility across different systems and versions.  
pytorch/pytorch/issues/152285, pytorch/pytorch/issues/152290

Type Errors and Regression Bugs: Upgrading to newer PyTorch versions has led to type errors in nn.Module.dtype when using torch.autocast, and discrepancies in torch.sparse.log_softmax outputs between CPU and CUDA. These issues suggest potential regressions in type handling and execution consistency.  
pytorch/pytorch/issues/152292, pytorch/pytorch/issues/152293

Overflow and Precision Issues: Unexpected overflow behavior in torch.addcmul with mixed precision tensors and incorrect results in torch.compile() for asinh_() operation highlight precision and overflow challenges in PyTorch. These issues emphasize the need for careful handling of precision in tensor operations.  
pytorch/pytorch/issues/152294

Distributed and Parallel Processing Bugs: Bugs in distributed processing, such as dist.all_reduce with dist.ReduceOp.SUM and torch.xpu.is_bf16_supported() returning incorrect values, indicate challenges in distributed and parallel processing. These issues affect the reliability of distributed operations in PyTorch.  
pytorch/pytorch/issues/152300

Memory and Performance Issues: Out-of-memory errors with NCCL backend in PyTorch 2.7 and recompilation errors with FP8 support highlight memory and performance challenges. These issues underscore the importance of efficient memory management and compatibility in high-performance computing environments.  
pytorch/pytorch/issues/152302

Attribute and Error Handling Bugs: Bugs such as AttributeError in custom module classes and incorrect error messages in torch.compile with torch.nn.functional.multi_head_attention_forward indicate issues in attribute handling and error reporting. These issues affect the usability and debugging experience in PyTorch.  
pytorch/pytorch/issues/152308

Checkpointing and Compilation Errors: Distributed checkpointing issues and compilation errors in torch._dynamo module highlight challenges in checkpointing and code compilation. These issues affect the stability and reliability of PyTorch in distributed and compiled environments.  
pytorch/pytorch/issues/152310

Compiler and Code Generation Inefficiencies: Inefficiencies in code generation, such as unnecessary kernel creation and outdated documentation, highlight challenges in optimizing PyTorch's compiler. These issues affect the performance and maintainability of PyTorch's codebase.  
pytorch/pytorch/issues/152323

Documentation and Test Failures: Lack of documentation for functions like torch.nonzero_static and test failures on specific platforms indicate challenges in documentation and testing. These issues affect the accessibility and reliability of PyTorch's features and tests.  
pytorch/pytorch/issues/152347

Debugging and Logging Issues: Inconsistent debug log generation and outdated Dynamo overview documents highlight challenges in debugging and documentation. These issues affect the ability to effectively troubleshoot and understand PyTorch's internals.  
pytorch/pytorch/issues/152374

Instruction and Build Environment Errors: "Illegal instruction" errors on Windows 10 and test failures due to specific commits highlight challenges in build environments and instruction set compatibility. These issues affect the stability and compatibility of PyTorch across different systems.  
pytorch/pytorch/issues/152385

Tensor and Execution Errors: Bugs in tensor operations, such as .item() on DTensor and torch.nn.functional.ctc_loss, highlight challenges in tensor handling and execution. These issues affect the correctness and reliability of tensor operations in PyTorch.  
pytorch/pytorch/issues/152406

Quantile and Compilation Discrepancies: Inconsistent handling of NaN values in torch.quantile and incorrect tensor outputs in static compilation highlight challenges in numerical accuracy and compilation consistency. These issues affect the reliability of numerical computations in PyTorch.  
pytorch/pytorch/issues/152423

Build and Dependency Management: Migrating Docker containers to newer GCC versions and managing optional dependencies like "optree" highlight challenges in build and dependency management. These issues affect the maintainability and compatibility of PyTorch's build system.  
pytorch/pytorch/issues/152426

Memory Leaks and Optimization Issues: Memory leaks with MPS backend and inefficiencies in Profile-Guided Optimization highlight challenges in memory management and optimization. These issues affect the performance and resource utilization of PyTorch.  
pytorch/pytorch/issues/152550

Compilation Time Variance and Kernel Errors: Variance in compilation times and incorrect results from binary kernel operations highlight challenges in compilation efficiency and kernel execution. These issues affect the performance and correctness of PyTorch's compiled code.  
pytorch/pytorch/issues/152566

Build and CUDA Detection Issues: Difficulties in building PyTorch from source and CUDA detection errors highlight challenges in build configuration and hardware compatibility. These issues affect the ability to build and run PyTorch on different systems.  
pytorch/pytorch/issues/152592

Performance Discrepancies and Regression Issues: Performance discrepancies in scaled dot-product attention and regressions in AOTI for specific models highlight challenges in performance optimization and regression testing. These issues affect the efficiency and stability of PyTorch's features.  
pytorch/pytorch/issues/152595

Unexpected Output and CI Testing: Unexpected output logits with hooks and loops, and running CI on Triton pin updates highlight challenges in output consistency and continuous integration testing. These issues affect the reliability and testing of PyTorch's features.  
pytorch/pytorch/issues/152607

Python-less Environment and Profiler Crashes: Extending torch.compile to Python-less environments and profiler crashes with PyTorch Lightning highlight challenges in environment compatibility and profiling. These issues affect the usability and debugging of PyTorch in different environments.  
pytorch/pytorch/issues/152612

Distributed Operations and Release Tracking: Intermittent hangs with NCCL and release tracking for PyTorch 2.7.1 highlight challenges in distributed operations and release management. These issues affect the stability and coordination of PyTorch's distributed features and releases.  
pytorch/pytorch/issues/152623

Strides and Assertion Errors: Incorrect strides in nonzero_static and assertion errors in TestFlexAttentionCUDA highlight challenges in tensor handling and test reliability. These issues affect the correctness and robustness of PyTorch's tensor operations and tests.  
pytorch/pytorch/issues/152634

Segmentation Faults and Similarity Checks: Segmentation faults in ProcessGroupGloo.allgather_into_tensor_coalesced and proposals for similarity checks highlight challenges in distributed operations and testing utilities. These issues affect the reliability and testing of PyTorch's distributed features.  
pytorch/pytorch/issues/152645

Docker Caching and Error Messages: Docker caching issues on MI300 runners and unclear error messages in infer_size(a, b) highlight challenges in caching and error reporting. These issues affect the efficiency and usability of PyTorch's build and error handling processes.  
pytorch/pytorch/issues/152655

Kernel and Utility Package Proposals: Bugs in torch.__group_gemm and proposals for a cuda_tools utility package highlight challenges in kernel execution and device management. These issues affect the performance and usability of PyTorch's kernel operations and device management.  
pytorch/pytorch/issues/152668

Documentation and String Support: Lack of documentation for FlexAttention and string support in torch.library.custom_op highlight challenges in documentation and feature support. These issues affect the accessibility and extensibility of PyTorch's features.  
pytorch/pytorch/issues/152683

Architecture Support and Foreach Operations: Adding SASS support for NVIDIA architectures and bugs in torch._foreach_pow highlight challenges in architecture compatibility and foreach operations. These issues affect the compatibility and correctness of PyTorch's operations on different architectures.  
pytorch/pytorch/issues/152690

CI Workflow and Dependency Issues: CI workflows not triggering and dependency issues in Triton Windows build highlight challenges in continuous integration and dependency management. These issues affect the reliability and maintainability of PyTorch's CI processes and builds.  
pytorch/pytorch/issues/152697

Test Failures and Runtime Errors: Test failures on MI200 platforms and runtime errors with torch.func.jacfwd highlight challenges in test reliability and runtime execution. These issues affect the stability and correctness of PyTorch's tests and runtime operations.  
pytorch/pytorch/issues/152700

Gradient and Configuration Issues: Gradient backpropagation issues with Categorical distribution and NO_SHARD configuration proposals highlight challenges in gradient handling and configuration optimization. These issues affect the differentiability and performance of PyTorch's operations and configurations.  
pytorch/pytorch/issues/152703

Distributed API and Masking Errors: Disappearance of model parameters in distributed APIs and errors with boolean masks on sharded DTensors highlight challenges in distributed processing and tensor masking. These issues affect the reliability and flexibility of PyTorch's distributed features.  
pytorch/pytorch/issues/152712

Mergebot and Fake Implementation Issues: Enhancements to mergebot functionality and bugs with fake implementations highlight challenges in automation and custom operation registration. These issues affect the efficiency and extensibility of PyTorch's automation and custom operations.  
pytorch/pytorch/issues/152718

Overflow and Accuracy Issues: Discrepancies in float16 overflow handling and accuracy issues with opmath_t highlight challenges in numerical accuracy and overflow handling. These issues affect the consistency and precision of PyTorch's numerical operations.  
pytorch/pytorch/issues/152731

Optimization and Stride Mismatch Errors: Proposals for optimizing all-gather operations and stride mismatch errors in F.scaled_dot_product_attention highlight challenges in optimization and tensor handling. These issues affect the performance and correctness of PyTorch's operations.  
pytorch/pytorch/issues/152746

Padding and Error Message Discrepancies: Runtime errors with zero-size tensor padding and unclear error messages in checkpoint_sequential highlight challenges in tensor operations and error reporting. These issues affect the usability and reliability of PyTorch's tensor operations and error handling.  
pytorch/pytorch/issues/152750

Tensor Method Discrepancies and Compilation Errors: Discrepancies in torch.Tensor.put_ method and compilation errors with Cuda-12.9 highlight challenges in tensor method consistency and build compatibility. These issues affect the correctness and compatibility of PyTorch's tensor methods and builds.  
pytorch/pytorch/issues/152755

2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable. 
Issues Closed This Week: 47
Summarized Issues:

Quantization Aware Training (QAT) Performance Issues: Users have reported unexpected performance degradation when using QAT with MobileNetV2 for conversion to TFLite, where the quantized model performs worse than the original. The issue may be related to the conversion of the QAT model into ATen operators, which are not optimized for specific devices.
issues/150746

Inductor CUDA Graph Tree Buffer Allocation: The Inductor CUDA Graph Tree implementation fails to capture buffer allocations in multi-stream programs on side CUDA streams, leading to runtime errors. This is due to storage data pointers not being allocated in the expected CUDA graph pool.
issues/151199

PyTorch Export Functionality Bugs: There are issues with PyTorch's export functionality, including the loss of custom metadata for constant tensors during export. Additionally, the torch.export function fails with an IndexError when exporting models using InstanceNorm1d unless the strict parameter is set to True.
issues/151476, issues/152467

Dynamic Axis Specialization Confusion: Users experience confusion when specifying an axis as dynamic in PyTorch, only to find it specialized without clear indication. A warning is suggested to advise users to set example dimension sizes greater than one, especially for dynamic batch sizes.
issues/151582

ONNX Export Process Simplification: The ONNX export process needs streamlining by primarily using TorchExportNonStrictStrategy and falling back to the TorchScript-based exporter if necessary. Additional strategies do not significantly enhance export coverage and complicate error messaging.
issues/151703

Circular Dependency in PyTorch Build Process: A circular dependency problem exists in the PyTorch project where building magma tarballs for ROCm or CUDA requires a manylinux image that itself needs the magma tarball. This necessitates a sequence of pull requests to bypass the dependency temporarily.
issues/151707

Dynamo Compiler Hang with _scaled_grouped_mm: The Dynamo compiler hangs during the processing of the _scaled_grouped_mm operator when use_fast_accum is set to True. This issue appears to be related to a cudaStreamSynchronize() call and is not associated with the auto-tuning feature.
issues/151743

MPS Backend aten::col2im Operation Support: The lack of support for the aten::col2im operation on the MPS backend in PyTorch causes it to default to CPU execution, potentially affecting performance. While the forward version aten::im2col is implemented, the backward version is not.
issues/151820

torch.distributed.tensor.debug.visualize_sharding Enhancement: There is a proposal to enhance the torch.distributed.tensor.debug.visualize_sharding function to produce colorized visualizations similar to jax.debug.visualize_array_sharding. A hybrid approach using the rich library is suggested.
issues/151857

Peak Memory Usage in torch.compile: A bug is reported where the peak memory usage of a torch.compiled model is higher during the first run compared to subsequent fresh runs. This raises questions about expected behavior and potential caching mechanisms at the inductor or dynamo level.
issues/151995

Regression Error in torch-xpu-ops: An update to the torch-xpu-ops caused a test case for the interpolate_bilinear function on XPU with float32 to fail due to significant discrepancies in tensor values. This resulted in an assertion error during testing.
issues/152020

FX Graph Cache Utilization Issues: There is difficulty in determining why the FX graph cache is not being utilized or created in the PyTorch project, despite using logging tools. More detailed logging is suggested to diagnose the problem effectively.
issues/152065

torch.fliplr Function Crash: A bug in PyTorch causes the program to crash with an "Aborted (core dumped)" error when using the torch.fliplr function with invalid data. There is a request to convert potentially unsafe .pkl files into a safer format for reproduction and debugging.
issues/152085

torch.log1p Gradient Computation Bug: The torch.log1p function's gradient computation incorrectly results in NaN for an excluded element due to masking, instead of the expected zero. This occurs when using a tensor with a condition that should exclude certain elements from contributing to the gradient calculation.
issues/152088
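
The issue's reproducer is not included in this summary. One common way a masked-out element can still produce a NaN gradient is a 0 * inf product in the backward pass: the derivative of log1p(x) is 1 / (1 + x), which is infinite at x = -1. The sketch below models that chain-rule step by hand in plain Python. It is an assumption about the mechanism, not PyTorch's autograd code, and all function names are hypothetical.

```python
import math


def dlog1p(x):
    """Derivative of log1p(x), 1 / (1 + x), with the IEEE-style value inf at x = -1."""
    return math.inf if x == -1.0 else 1.0 / (1.0 + x)


def masked_grad_naive(x, keep):
    """Chain rule as reverse-mode autodiff applies it: upstream * local derivative."""
    upstream = 1.0 if keep else 0.0  # the mask zeroes the upstream gradient
    return upstream * dlog1p(x)      # 0.0 * inf evaluates to nan


def masked_grad_safe(x, keep):
    """Replace masked-out inputs with a harmless value before differentiating."""
    safe_x = x if keep else 0.0
    upstream = 1.0 if keep else 0.0
    return upstream * dlog1p(safe_x)
```

Under this model, masking the output alone is not enough. The excluded input also has to be replaced with a value where the derivative is finite, so that the zero upstream gradient is multiplied by a finite number.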

MacOS CI Build and Test Script Updates: The CI build and test scripts for MacOS in the PyTorch project need updating to eliminate the dependency on Anaconda. This is part of a broader effort to address a related issue and involves modifying several specific scripts and workflows.
issues/152113

Torch Profiler Stream Count Bug: A bug in the Torch Profiler shows 40 streams in the trace file despite using only two streams to overlap operations in a CUDA graph. This suggests that each loop iteration creates a new stream, leading to unnecessary overhead and concerns about stream reuse best practices.
issues/152114

Anaconda-Related Benchmark Removal: The PyTorch project is removing Anaconda-related benchmarks, specifically addressing files such as benchmarks/dynamo/Makefile and others. This is related to a previous issue (#138506).
issues/152123

Utility Script and Workflow Bugs: There are bugs in the PyTorch project related to utility scripts and workflows, involving files such as python_doc_push_script.sh, upload-test-stats-while-running.yml, and others. These issues are linked to a previous issue (#138506).
issues/152124, issues/152126

torch.lobpcg() Performance Bug: The torch.lobpcg() function experiences a performance bug where changing the tolerance parameter tol from 1e-07 to 1e-08 causes the function to hang for over 40 minutes. This suggests a problem with achieving higher precision beyond the default float32 capabilities.
issues/152154
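
The float32 limit mentioned above can be checked without PyTorch. The sketch below uses only the standard library to round values to float32. The helper name to_float32 is hypothetical, and whether this limit is the actual cause of the hang is the issue's suggestion, not something the sketch establishes.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


FLOAT32_EPS = 2.0 ** -23  # spacing between 1.0 and the next float32, about 1.19e-07

# A relative change of 1e-07 is still visible next to 1.0 in float32 ...
assert to_float32(1.0 + 1e-07) != 1.0
# ... but a relative change of 1e-08 rounds away entirely.
assert to_float32(1.0 + 1e-08) == 1.0
```

A convergence criterion tighter than what float32 can resolve may never be satisfied, which would be consistent with an iteration that keeps running instead of terminating.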

torch.distributions Documentation Update: There is a need to update the documentation for several functions in the torch.distributions module to include a description of the validate_args parameter. This parameter is currently used in the code but not documented.
issues/152165

C++ Compilation Error in CI Pipeline: A C++ compilation error is encountered in the CI pipeline of a GitHub project following the release of Torch 2.7. This error is inconsistent across different systems and appears to be related to AVX-512 support.
issues/152172

torch._dynamo.exc.BackendCompilerFailed Error: Upgrading PyTorch nightlies results in a torch._dynamo.exc.BackendCompilerFailed error due to a missing expecttest module. This can be temporarily resolved by installing the expecttest package, although this dependency should not be required.
issues/152225

PyTorch 2.7 py_limited_api Compilation Bug: A bug in PyTorch 2.7 occurs when setting py_limited_api=True during the compilation of torch extensions, resulting in build errors due to undeclared identifiers in the pybind11 library. This can be resolved by setting py_limited_api=False.
issues/152243

torch.flipud Function Segmentation Fault: A bug in the PyTorch library causes a segmentation fault (core dumped) error when using the torch.flipud function with certain arguments. This is likely due to incompatibility with quantized tensors, and a workaround is suggested by dequantizing the tensor before applying the function.
issues/152253

FSDP Peak Memory Usage Spike: The FullyShardedDataParallel (FSDP) implementation in PyTorch experiences an unexpected peak memory usage spike during the initialization phase when training the Llama 4 model. This prevents the model from being loaded due to out-of-memory (OOM) errors.
issues/152263

scaler.step(optimizer) Enhancement Proposal: There is a proposal to enhance the scaler.step(optimizer) function in PyTorch to return a value indicating whether a step was skipped due to underflow or overflow. This would help users identify when necessary loss functions are not being applied during model training.
issues/152279

Windows Compilation with CUDA 12.6 Error: A build failure occurs when compiling PyTorch from source on Windows using CUDA 12.6 and MSVC 2019. The compilation of the file cuda_vectorized_test.cu fails due to an ambiguous reference to the std namespace.
issues/152291

Inductor Unit Test Accuracy Failure: An accuracy failure in the Inductor unit tests related to chunk_cat on XPU is caused by a change that began supporting non-contiguous inputs. This broke the previous assumption that all inputs were contiguous, leading to failures in the torch-xpu-ops implementation.
issues/152296

MPS Backend Memory Leak: A memory leak occurs in the Metal Performance Shaders (MPS) backend when using scaled dot product attention (SDPA) with float32 tensors in PyTorch. This involves increasing memory usage over iterations and may be due to a bug in the MetalPerformanceShaderGraph framework.
issues/152344

MPS Backend Memory Allocation Problem: A memory allocation problem is encountered during model training using PyTorch on a Mac, where the MPS backend runs out of memory. The use of the environment variable PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 is suggested to disable the upper memory limit.
issues/152351

Compiling PyTorch on WSL for RTX 5070 Ti: A user seeks advice on compiling PyTorch from source on Windows Subsystem for Linux (WSL) to support an NVIDIA GeForce RTX 5070 Ti. Challenges include unsupported GPU architecture sm_129 and memory issues, with a suggestion to use sm_90 as a fallback.
issues/152400

torch.scatter_reduce ONNX Conversion Bug: The torch.scatter_reduce function with a "max" reduction operation fails to convert correctly to ONNX format when the input is a two-dimensional tensor. This results in shape verification errors during ONNX runtime execution.
issues/152419

test_host_memory_stats Failure: A failing test named 'test_host_memory_stats' in the PyTorch project version 2.7.0+cu118 encounters an error when running test_cuda.py locally. This is potentially due to improper cleanup in some tests and has been resolved by merging an associated pull request.
issues/152422

Disabling Long List Print Truncation: A feature is requested to disable the truncation of long list prints to facilitate accurate bandwidth calculations for uneven all-to-all operations. All numbers are necessary for these calculations and subsequent post-processing tasks.
issues/152427

test_inductor_debug Disabling: The test_inductor_debug within the LoggingTests suite is disabled due to its failure on the main branch. The issue has been closed after identifying the cause of the problem.
issues/152511

CUDA 12.1 Support Request: A request is made for the release of the PyTorch library with support for CUDA 12.1 for version 2.6 and potentially newer versions. This highlights the need for compatibility with the specified CUDA version.
issues/152524

DTensor Dtype Transmute Operator Bug: A bug in the PyTorch library causes the dtype transmute operator to fail for DTensor due to the absence of a registered sharding strategy for the operator aten.view.dtype. This results in a NotImplementedError.
issues/152530

Inductor Component Compatibility Problem: The Inductor component of PyTorch uses an outdated API due to recent changes in the Triton project, specifically the removal and renaming of the launch_enter_hook function. This has been addressed by a subsequent fix.
issues/152531

torch.randint Function Overflow Bug: A bug in the PyTorch library causes the torch.randint function to fail for large values of the high argument, particularly when requesting the full range of torch.uint64. This leads to runtime errors due to overflow, and documentation updates are suggested.
issues/152564
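
The arithmetic behind this kind of overflow can be sketched in plain Python. The assumption here, not confirmed by the issue summary, is that the bound is handled as a signed 64-bit integer somewhere along the way; the helper `fits_in_int64` is hypothetical.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_in_int64(value: int) -> bool:
    """Return True if `value` is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


# An exclusive upper bound that covers every uint64 value is 2**64.
full_range_high = UINT64_MAX + 1

print(fits_in_int64(INT64_MAX))        # True
print(fits_in_int64(UINT64_MAX))       # False
print(fits_in_int64(full_range_high))  # False
```

Under that assumption, any bound above 2**63 - 1 cannot be passed through unchanged, which is consistent with the failures described for the upper half of the uint64 range.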

CUDAGraphs and Mixed Operations: A user inquires about capturing a model containing both CUDA and CPU operations as a single CUDAGraph in PyTorch. The response indicates that CUDAGraphs are only applicable to CUDA operations, with a suggestion to use the PyTorch forum for further questions.
issues/152584

Torchtitan CI Failure: A failure in the torchtitan CI is caused by a static CUDA launcher resulting in a RuntimeError: CUDA driver error: invalid device context. This can be resolved by manually disabling the static CUDA launcher.
issues/152639

test_nvshmem Disabling: The test_nvshmem in the PyTorch project is disabled because NVSHMEM is not yet installed on the continuous integration (CI) machines. Plans are in place to enable it by building PyTorch with NVSHMEM by default and installing NVSHMEM wheels in CI workflows.
issues/152649

Inductor Component Periodic Failures: Periodic failures related to the Inductor component occur in the PyTorch project: the hf_BigBird model transitions from a fail_to_run state to fail_accuracy. This indicates silent incorrectness and an increase in graph breaks.
issues/152691

Modded-nanogpt Performance Regression: A minor performance regression is reported in the modded-nanogpt project between nightly builds from February 9th to March 1st. There is a slight increase in runtime, with plans for further investigation and code analysis.
issues/152761

2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment. 
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week. 

III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 185
Key Open Pull Requests
1. [Do not merge] poke CI with FX IR always on: This pull request tests the continuous integration (CI) system by enabling FX IR conversion at all times to surface potential bugs. Its commit history includes code refactoring, bug fixes, and testing enhancements related to FX IR and its integration into the PyTorch project.

URL: pull/152405

Merged: No

Associated Commits: d3a36, 6e176, c55fd, f299c, 403b9, fd03d, 2cd01, bd78b, df2f5, 0dfd0, 40011, 20b1d, 91a39, e7335, be417, fc5f0, 6c1cb, c94d1, 2b978, 2b6a4, 5e33e, 6e09c, b4c7a, 41cab, 2a839, 43773, 30cb1, 46e84, e6a12, 1f33f, 04bbe, 8dac9, 69979, a4a87, c5769, 51ed2, 46fed, 32130, 81831, 55f51, 0ca29, 50fca, d4be6, 07ab6, 2fab6, 12b40, 13710, c3e64, 05cf7, aa7a2, 76b12, 2831c, 8678c, fc9ff, 2f17b, fb252, d4ecc, c9a37, feda8, 5fe4e, f5e56, f64ca, 20f97, 6bbee, dcd78, 61388, e2546, 0c1db, 002e9, 3cb46, 43192, 27009, 8f948, 23da2, 328b4, bd632, 6bfc4, 16cb3, 66fad, 597b5, a7d36, c6051, b164d, 81e54, 34736, 7a343, bed61, 2a9ec, 4fbd7, 9a8d1, f4469, 9de81, bcc51, ede4a, 61b83, a13a0, f21d2, 36de7, 72dda, 7cdd2, addc7, a3609, d00c7, 12de0, eb216, fa8c1, a5e72, 0080b, c765c, bb405, 103bc, f8af0, 8d665, 59aaf, b35fb, 5eba4, ac33f, 522fb, e25a1, 079fc, 26b33, 5a0fe, c53a9, b2158, fd88b

2. [inductor][dynamo] Include operator name in size/stride/alignment assertion: This pull request updates the assert_size_stride and assert_alignment functions in the PyTorch project to include an optional op_name argument for enhanced error messaging, modifies the corresponding type stubs, extracts operator names from the FX graph for better debugging in Triton code, and adds unit tests to ensure that both successful and failing assertions include the operator name.

URL: pull/152353

Merged: No

Associated Commits: 73efa, 825ff, 38329, 7a6e6, 61f6c, 5176d, da6e9, 724a5, 57926, 33d19, bbed0
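
The idea of attaching an operator name to a layout assertion can be illustrated with a small stand-alone helper. This is a sketch only: `check_size_stride` is a hypothetical stand-in that works on plain tuples, and it does not reproduce the signature or the message format of the real assert_size_stride function.

```python
def check_size_stride(actual_size, actual_stride,
                      expected_size, expected_stride, op_name=None):
    """Raise AssertionError if size or stride differ, naming the operator if given."""
    size_ok = tuple(actual_size) == tuple(expected_size)
    stride_ok = tuple(actual_stride) == tuple(expected_stride)
    if size_ok and stride_ok:
        return
    # The optional operator name is prepended so the failing op is easy to find.
    prefix = f"[{op_name}] " if op_name else ""
    raise AssertionError(
        f"{prefix}expected size {tuple(expected_size)} and stride "
        f"{tuple(expected_stride)}, got size {tuple(actual_size)} and "
        f"stride {tuple(actual_stride)}"
    )


# A matching layout passes silently.
check_size_stride((2, 3), (3, 1), (2, 3), (3, 1), op_name="aten.mm.default")

# A mismatched stride produces a message that names the operator.
try:
    check_size_stride((2, 3), (1, 2), (2, 3), (3, 1), op_name="aten.mm.default")
except AssertionError as exc:
    message = str(exc)
    print(message)
```

Keeping the argument optional means existing call sites that pass no operator name keep working unchanged.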

3. [CI] Use cmake from pip instead of conda in CI docker images: This pull request proposes changing the continuous integration (CI) process for PyTorch's Docker images by using CMake installed via pip instead of conda, as indicated by the title and supported by multiple commits with the message "tc."

URL: pull/152537

Merged: No

Associated Commits: 1d14e, 14465, 3d942, 0e87a, a3557, 294a0, 975d1, c516a, 03b5d, 0b627

Other Open Pull Requests

Conda Removal from PyTorch Documentation and Development: Multiple pull requests focus on removing Conda references from the PyTorch project. These changes aim to streamline the installation and development processes by eliminating Conda dependencies and optimizing the build process.
pull/152546, pull/152713

Enhancements to PyTorch's Autotuning and Profiling: Pull requests introduce additional profiling events and code generation for GEMM kernels to improve performance analysis and autotuning capabilities. These updates align with existing features and add new templates for code generation.
pull/152449, pull/152341

Backend and Compilation Improvements: Several pull requests enhance backend specializations and hierarchical compilation processes. These changes involve incorporating mutation dependencies and introducing new keyword arguments to improve flexibility and performance.
pull/152601, pull/152410, pull/152597

Vectorization and Quantization Enhancements: Pull requests introduce vectorized operations for FP8 formats and enable vectorized code generation for quantization processes. These updates aim to enhance performance and efficiency in handling specific data types.
pull/152417, pull/152418, pull/152706

Fault Recovery and Vendor Neutrality: Efforts to make the Fault Recovery code vendor-neutral are evident in multiple pull requests. These changes remove dependencies on specific backends, allowing broader compatibility while maintaining backward compatibility.
pull/152563, pull/152614

PyTorch Functionality and Interoperability Enhancements: Pull requests focus on improving the functionality and interoperability of PyTorch's features. These include making Functorch interpreters serializable and enhancing tree functions to accept various object types.
pull/152616, pull/152624

CUDA and ROCm Integration: Pull requests introduce new mechanisms for CUDA graph launches and integrate AITER for ROCm. These changes aim to improve robustness and performance by utilizing advanced features and refining integration paths.
pull/152622, pull/152630

Miscellaneous Fixes and Updates: Various pull requests address issues such as dead links, numerical instability, and code generation failures. These updates ensure documentation accuracy, consistent results across platforms, and robust code generation processes.
pull/152734, pull/152373, pull/152579

3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 269
Key Closed Pull Requests
1. Add scripts to check xrefs and urls: This pull request involves adding and updating scripts to traverse the documentation and codebase of the PyTorch project to identify and address any broken cross-references and URLs, although it was ultimately not merged.

URL: pull/151844

Merged: No

Associated Commits: 1d738, dcd6e, 72d59, 9605a, 2a07f, a37df, 150c5, 48c4e, ee06f, 83ed6, 8037b, e963c, 0a041, 90093, a0cbb, 8f72f, 8ad07, 55b02, 4edf5

2. [conda] Remove conda from lint-autoformat.yml: This pull request aims to remove the use of conda from the lint-autoformat.yml file, address a missing setuptools module error by installing it, switch from Python 3.10 to the system's Python 3.9, and utilize a virtual environment to manage dependencies not included in the base setup.

URL: pull/152433

Merged: No

Associated Commits: 6a8ff, 0df51, 153fa, 96e92, d02a7, dc71d, 72e33, 05a46, c0743, edef7, 73a94, fef8f, 1dca9, 179f9, 2cdb5, f0227

3. [conda] Remove conda usage from TD llm retriever job: This pull request aims to eliminate the use of conda in the TD llm retriever job for the PyTorch project, addressing concerns about managing different Python versions, as the base currently uses Python 3.9.

URL: pull/152338

Merged: No

Associated Commits: ff3f5, 764b9, f4167, f00db, 46fb7, ab12e, f821f, 3e4b5, ab8df, b107f, 24e82, b5447

Other Closed Pull Requests

Subgraph Output Code Simplification: This topic involves simplifying the resulting output code for subgraphs in the PyTorch project. The changes are documented with comparisons of output code before and after the modifications, highlighting the improvements made.
pull/152383, pull/152490

Assertion Checks Removal: This pull request removes the assertion checks on the outputs of the invoke_subgraph function. The input assertions are considered sufficient, making output assertions unnecessary.
pull/152384

TensorIterator Migration: This topic covers the migration of the multiplication operation to use TensorIterator in PyTorch. It includes a refactor of binary operation tensor generators to handle mixed data types correctly.
pull/152515

Cutlass Epilogue Visitor Code Generation: This involves implementing a Python code generator for the Cutlass epilogue visitor. The pull request is part of a larger stack of related changes in the PyTorch project.
pull/150905, pull/151405

Incorrect Typing in Inductor Module: This topic addresses incorrect typing issues in the cuda_kernel and cuda_template.py components of the Inductor module. Both pull requests were ultimately not merged.
pull/150908, pull/150909

Throwaway Changes: This pull request consists of non-mergeable, experimental commits intended for testing or temporary purposes, as indicated by the repeated "throwaway" label.
pull/150910

AOTAutogradCache Issue: This pull request addresses an issue with the AOTAutogradCache by saving the bw_module in the cache after stripping it of unserializable metadata. It ensures that both the lowered backward and the bw_module are cached to prevent crashes.
pull/151860

Enhancing Visualize Sharding: This topic involves enhancing the torch.distributed.tensor.debug.visualize_sharding functionality by adding rich support. It addresses issue #151857 and includes updates for compatibility with systems running on at least 4 GPUs.
pull/152027

Export Functionality Enhancement: This pull request enhances the export functionality by supporting the export of hops with function schema arguments. It simplifies the implementation by using pytree.register_constant.
pull/152073

Memory Allocation in CUDA Graph Trees: This topic addresses memory allocation on side streams in CUDA Graph Trees by implementing a safer approach for multithreading. It fixes the error reported in issue #151199.
pull/152472

Test Configuration Monitoring: This pull request enhances the monitoring capabilities of test configurations by adding parameters for log intervals and incorporating an upload step for all test YAML files.
pull/152541

Link Linting Process Enhancement: This topic involves enhancing the link linting process by configuring it to run on modified files only or on all files when scheduled. It includes updates to the linting configuration files.
pull/152377

Metal Kernel Migration: This pull request migrates all addition and subtraction operations to Metal kernels. It addresses a bug related to improper downcasting of CPU double/complexDouble scalars to floats.
pull/152510

XLA Issue Resolution: This topic aims to address and resolve an issue related to XLA in the PyTorch project. Despite multiple updates and commits, the pull request was ultimately not merged.
pull/152456

Memory Management Optimization: This pull request optimizes memory management by ensuring that buffers are freed before invoking a subgraph call, as described in its title and in multiple updates to the commit messages.
pull/152494

Profile-Guided Optimization Enhancement: This topic enhances the PGO process by making code state identification independent of file paths. It uses a hash of the file content along with the function name and line number.
pull/152628
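
A path-independent identifier of the kind described can be sketched as follows. The function name `code_state_key`, the choice of SHA-256, and the key layout are all assumptions made for illustration; the summary does not say which hash or format the pull request uses.

```python
import hashlib


def code_state_key(source_text: str, func_name: str, first_lineno: int) -> str:
    """Build an identifier from file content, function name, and line number.

    The file path is deliberately not part of the key, so the same source
    checked out in two different directories yields the same identifier.
    """
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return f"{digest}:{func_name}:{first_lineno}"


source = "def forward(x):\n    return x + 1\n"

key_a = code_state_key(source, "forward", 1)  # e.g. read from /home/alice/model.py
key_b = code_state_key(source, "forward", 1)  # e.g. read from /tmp/build/model.py
key_c = code_state_key(source + "# edited\n", "forward", 1)

print(key_a == key_b)  # True: same content, same key
print(key_a == key_c)  # False: content changed, key changed
```

One consequence of hashing the content is that any edit to the file changes the key, even an edit outside the function in question.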

Gather Operation Data Type Support: This pull request enhances the PyTorch library by supporting additional data types for both input and indices in the gather operation. It is part of a series of changes tracked through the ghstack tool.
pull/151822

Static Input Indexes Collection: This topic addresses an issue related to the collection of static_input_idxs in the cudagraphs component. It is part of a stack of changes tracked via ghstack.
pull/152287

Tangent Metadata Caching: This pull request implements caching on tangent metadata and enables retracing if necessary. It is part of a series of related changes managed through the ghstack tool.
pull/152357

Unpacked Operands in Subgraph: This pull request involves unpacking operands within a subgraph. It is part of a stack of related changes managed through the ghstack tool.
pull/152547

Intermediate Representation Line for Symbolic Call Arguments: This pull request introduces a new wrapper IR line to handle symbolic call arguments. It aims to streamline code between Python and C++ backends.
pull/152587

Log Message Length Reduction: This topic addresses the issue of generating unnecessarily long log messages for suppressed data-dependent errors. It aims to prevent logging of such errors when recording is not enabled.
pull/151023

3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment. 

try something

Toxicity Score: 0.55 (Defensive responses, persistent tension, lack of resolution.)
This GitHub conversation involves username1 and username2, where username1 initially proposes a solution that username2 critiques for not addressing the issue effectively. Username1 responds with a defensive tone, expressing frustration over the feedback, which triggers a tense exchange. Username2 attempts to clarify their point, but the conversation remains strained, with both parties showing signs of impatience.

[caffe2] Make c10::str works with scoped enum

Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange.)
This GitHub conversation involves username1 providing a solution, which username2 critiques for not addressing the issue effectively. Username1 responds defensively, leading to a tense exchange. Username3 attempts to mediate by suggesting alternative approaches, but the conversation remains strained.

IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month. 
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
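
The activity rule above can be written as a small predicate. This is a sketch of the stated definition only: the `Activity` record and `is_active` function are hypothetical names, the counts are assumed to cover the last month, and the ranking used to pick the top 10 is not specified in the report, so it is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Per-contributor counts, assumed to cover the last month."""
    commits: int = 0
    pull_requests: int = 0
    issues: int = 0
    comments: int = 0


def is_active(activity: Activity) -> bool:
    """Apply the rule: any commit, issue, or pull request, or more than 2 comments."""
    return (
        activity.commits >= 1
        or activity.issues >= 1
        or activity.pull_requests >= 1
        or activity.comments > 2
    )


print(is_active(Activity(commits=1)))   # True
print(is_active(Activity(comments=2)))  # False: exactly 2 comments is not enough
print(is_active(Activity(comments=3)))  # True
```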

Contributor	Commits	Pull Requests	Issues	Comments
malfet	223	31	8	134
anijain2305	177	19	2	14
mlazos	160	21	0	10
FFFrog	138	9	0	2
laithsakka	78	21	8	29
swolchok	112	10	0	2
zou3519	40	9	16	44
Skylion007	12	1	1	92
bobrenjc93	85	13	2	5
guangyey	82	7	0	16
