Weekly GitHub Report for PyTorch: March 24, 2025 - March 31, 2025 (12:06:11)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change to the default value of the weights_only parameter in torch.load.
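For readers affected by the weights_only change, here is a minimal sketch of the new default behavior (the file name and dictionary contents are illustrative, not taken from the release notes):

```python
import torch

# In PyTorch 2.6 the default of `weights_only` in torch.load flips to True,
# restricting unpickling to tensors and other allow-listed types.
torch.save({"step": 1, "weights": torch.randn(2, 2)}, "checkpoint.pt")

state = torch.load("checkpoint.pt")                       # weights_only=True by default
legacy = torch.load("checkpoint.pt", weights_only=False)  # opt back into full unpickling
```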
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- SDPA (EFFICIENT_ATTENTION) slower than torch.compile decomposition: This issue highlights a performance regression where the SDPA efficient attention mechanism results in slower training times compared to a manually compiled attention implementation, despite showing improvements in memory usage. The user provides a detailed code example and profiling data, indicating that the runtime regression is more pronounced in their actual codebase, and seeks insights into the potential causes and solutions.
- The comments discuss various profiling attempts and comparisons between SDPA and manual attention implementations, with suggestions to test on different versions and configurations. The user identifies that enabling TF32 significantly impacts performance, with SDPA not utilizing tensor cores effectively, leading to slower runtimes. The discussion includes requests for further benchmarking and insights into whether TF32 should affect SDPA performance, with contributors suggesting potential areas for investigation and improvement.
- Number of comments this week: 17
- RuntimeError: UR error with XPU: This issue involves a RuntimeError: UR error encountered when attempting to use XPU with PyTorch versions 2.6.0+xpu and 2.8.0.dev20250321+xpu, despite torch.xpu.is_available() returning True. The user has tried various solutions, including updating conda and removing libstdc++.so.6, but continues to face compatibility issues with oneAPI and PyTorch-XPU.
- The comments discuss potential causes and solutions for the error, including environment mismatches and broken nightly builds. Suggestions include using the stable PyTorch 2.6 release, ensuring the correct oneAPI version, and possibly building PyTorch from source. The user provides additional information about their setup, including device properties, and attempts various troubleshooting steps, but the issue persists.
- Number of comments this week: 13
- [XPU] XPU build has been broken: This issue reports a problem with the XPU build process in a GitHub project, which is failing due to errors introduced by a specific pull request. The error messages indicate issues with the C++ configuration, particularly related to the _GLIBCXX_USE_CXX11_ABI macro, which is causing the build to crash.
- The comments discuss the build failure, with contributors suggesting adding checks to ensure the XPU build is correctly configured. There is a discussion about reverting the problematic pull request as a temporary workaround, and a fix is proposed in a related repository. The conversation also touches on the complexity of the proposed solution and the need for a simple release-specific change to address the issue.
- Number of comments this week: 10
- Concatenating CSR matrices fails: This issue highlights a problem with the PyTorch library where functions like cat, stack, vstack, and hstack do not work on CSR (Compressed Sparse Row) matrices, despite the documentation suggesting otherwise. The error encountered is a RuntimeError indicating that sparse CSR tensors do not have the is_contiguous attribute, which prevents these operations from being executed as expected.
- The comments discuss the discrepancy between the documentation and actual functionality, suggesting workarounds using the COO (Coordinate) format for concatenation (see the sketch after this list). Users share their use cases for CSR matrices and discuss the complexity of implementing direct concatenation for CSR tensors. Some propose using intermediate COO tensors as a solution, while others share insights from related research and existing implementations in other libraries like SciPy.
- Number of comments this week: 9
- Auto-selective activation checkpointing is not optimal for speed (issue with min_cut_rematerialization_partition): This issue highlights a performance inefficiency in PyTorch's selective activation checkpointing, specifically with the min_cut_rematerialization_partition implementation. The user provides a minimal code example demonstrating that the current approach stores certain variables unnecessarily, leading to additional computations during the backward pass, which could be avoided for better performance.
- The comments discuss the runtime differences and the rationale behind the current implementation, with some users suggesting that recomputing certain operations like add should be free due to their nature. There is a debate on the interpretation of the activation_memory_budget setting and its impact on recomputation, with code snippets provided to exclude the add operator for testing purposes.
- Number of comments this week: 9
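As a minimal sketch of the COO workaround mentioned in the CSR concatenation issue above (shapes and values are illustrative and not taken from the issue):

```python
import torch

# torch.cat currently rejects sparse CSR inputs, so round-trip through COO:
a = torch.eye(3).to_sparse_csr()
b = torch.ones(2, 3).to_sparse_csr()

stacked = torch.cat([a.to_sparse_coo(), b.to_sparse_coo()], dim=0).to_sparse_csr()
print(stacked.shape)  # torch.Size([5, 3])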
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a significant speedup in processing time, as demonstrated by the provided testing code and results (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached cuda_utils.so file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the inability to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue addresses the need to apply uniform formatting (UFMT) to approximately 1,500 files in the PyTorch codebase that are currently not formatted according to the project's standards. The process involves removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to ensure all files adhere to the desired formatting, with additional preparatory work required to resolve known issues in certain files before applying the UFMT changes.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in the PyTorch library to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models. The motivation behind this feature request is to reduce the file size of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by a reduction from 6.7MB to 5.6MB in a test case.
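To make the MaxPool2D decomposition idea above concrete, here is a small sketch (kernel sizes are illustrative; the equivalence holds because stride-1 max windows compose):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)

# A single 5x5 stride-1 max pool equals two chained 3x3 stride-1 max pools,
# since taking the max over overlapping windows composes into a larger window.
big = F.max_pool2d(x, kernel_size=5, stride=1)
chained = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)
print(torch.equal(big, chained))  # True
```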
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 121
Summarized Issues:
- Memory Leaks in PyTorch: Memory leaks are a recurring issue in PyTorch, affecting various functionalities such as @torch.compile and torch.save. These leaks lead to increased memory usage and potential out-of-memory errors, as seen in cases where memory is not released after saving tensors or when activations are not cleared after backward passes.
- PyTorch Compilation and Graph Breaks: Compilation issues in PyTorch often result in graph breaks, affecting the execution of models. Problems such as unsupported operations, incorrect path handling, and unexpected graph elements like FunctionCtx disrupt the compilation process, leading to errors and inconsistent behavior (a short diagnostic sketch follows this list).
- Performance and Efficiency Issues: PyTorch faces several performance-related challenges, including inefficient attention mechanisms and excessive recompilations. These issues result in slower training times and skewed performance profiles, necessitating optimizations and profiling improvements.
- Autograd and Custom Function Limitations: Limitations in PyTorch's autograd functionality, particularly with custom functions, pose challenges for users. Issues such as failure to handle lists of tensors in C++ and unexpected graph elements hinder the seamless execution of models.
- Device Compatibility and Execution Errors: PyTorch users encounter various device-related issues, including runtime errors on specific GPUs and compatibility problems with CUDA and XPU devices. These issues often require environment adjustments or indicate potential bugs in device handling.
- Documentation and Usability Concerns: PyTorch's documentation sometimes lacks clarity or contains errors, affecting user understanding and implementation of functions. Issues such as missing explanations for tensor operations and incorrect behavior descriptions highlight the need for documentation improvements.
- Inductor Backend and Numerical Discrepancies: The inductor backend in PyTorch exhibits numerical discrepancies and miscompilation issues, leading to significant differences in model outputs compared to eager execution. These discrepancies raise concerns about the reliability of the inductor backend for accurate model inference.
- CI/CD and Build Failures: Continuous integration and build processes in PyTorch face disruptions due to various factors, including incompatible software versions and infrastructure issues. These failures impact the stability and reliability of the build environment, necessitating adjustments and fixes.
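As a generic way to surface the graph breaks mentioned in the compilation bullet above (a hedged sketch, not a reproduction of any specific issue; the trigger function is made up for illustration):

```python
import torch

def f(x):
    x = x.sin()
    print("side effect")  # a Python side effect like print typically forces a graph break
    return x.cos()

# torch._dynamo.explain reports how many graphs Dynamo produced and why it broke.
report = torch._dynamo.explain(f)(torch.randn(8))
print(report.graph_break_count, report.break_reasons)
```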
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 152
Summarized Issues:
- Compilation and Runtime Errors: Compilation and runtime errors are prevalent in PyTorch, often due to device compatibility issues or specific function implementations. For instance, errors like "HIPBLAS_STATUS_NOT_SUPPORTED" and "illegal hardware instruction" occur due to unsupported tensor shapes or environment configurations. These issues highlight the need for careful management of device-specific operations and configurations to ensure smooth execution.
- Performance Regressions: Performance regressions are a common concern, particularly when new versions of PyTorch or its dependencies are released. For example, a 40% regression in aten::mm and aten::bmm operations on AMD GPUs was linked to the enabling of hipblaslt, and reverting a specific commit restored performance. Such regressions necessitate thorough testing and validation to maintain optimal performance across updates.
- Dynamic Shape and Export Issues: Dynamic shape handling and model export processes in PyTorch can lead to errors, such as "Pending unbacked symbols" or failures in ONNX export with dynamic axes. These issues underscore the complexity of managing dynamic dimensions and the need for robust export mechanisms to handle varying input sizes and configurations (a minimal export sketch follows this list).
- Device-Specific Bugs: Device-specific bugs, particularly on newer or less common platforms like Apple's MPS or Intel's XPU, can result in incorrect computations or unsupported operations. For instance, complex conjugations on MPS yield incorrect results, and LayerNorm on XPU produces NaN values, indicating the need for platform-specific optimizations and testing.
- Graph Breaks and Unsupported Operations: Graph breaks and unsupported operations in PyTorch's dynamo and export functions can hinder model compilation and execution. Issues like unsupported use of dict.update or torch.vmap with certain configurations highlight the challenges in maintaining seamless graph transformations and the need for comprehensive support for common operations.
- Documentation and Usability Concerns: Documentation discrepancies and usability issues, such as outdated web pages or missing docstrings, can impede user understanding and adoption of PyTorch features. Ensuring up-to-date and comprehensive documentation is crucial for user support and effective utilization of the library's capabilities.
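For context on the dynamic-axes export problems grouped above, here is a minimal sketch of the usual export call (the model, file name, and axis names are illustrative, and it assumes the onnx package is installed):

```python
import torch

class Scale(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Mark the batch dimension as dynamic so the exported model accepts variable batch sizes.
torch.onnx.export(
    Scale(), (torch.randn(3, 4),), "scale.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
)
```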
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 144
Key Open Pull Requests
1. Enable XPU distributed test for PT2.8: This pull request aims to enable XPU (Intel's accelerator technology) support for distributed testing in PyTorch version 2.8 by incorporating various updates such as adding XPU support for distributed data parallel (DDP) and pipeline test cases, porting fully sharded data parallel (FSDP) tests, and fixing backend mapping errors, while also involving multiple merges and reverts to refine the implementation.
- URL: pull/149916
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55
2. S390x: update more tests: This pull request focuses on updating and enabling more tests for the s390x architecture in the PyTorch project, addressing specific issues related to s390x, marking certain tests as failing or skipped, and making various improvements such as fixing byte order constants, adding necessary build tools, switching to a newer GCC version, and handling new warnings, while also ensuring that some tests are either fixed or appropriately marked for the s390x platform.
- URL: pull/150116
- Merged: No
- Associated Commits: 280c8, e1293, 6406e, 939a3, 91b71, 534f3, 19619, 86ce3, b19c8, a8847, b92b7, 8f01a, 77215, 682f0, fb409, b849a, 7258b
3. [WIP] rewrite pad_nd with guard_or_false: This pull request aims to rewrite the pad_nd function using guard_or_false in the PyTorch project, replacing the existing guard_size_oblivious approach to improve code readability, reduce complexity, and minimize data-dependent errors, as detailed in the associated commits and documentation.
- URL: pull/149998
- Merged: No
- Associated Commits: 94167, 3f56d, 31e50, 1e0ba, 8ed91, 76154, 41959, cfb21, 13325, f2c92, 25c1a, 53c14, 56554, 1ca8f, 7f4ba, 66864
Other Open Pull Requests
- GitHub Runner Behaviors and Workflow Configuration: This topic involves testing GitHub Runner behaviors by combining Windows x64 and ARM64 YAML template files, adjusting build scripts, and modifying workflow steps to ensure compatibility and efficiency across different platforms. Additionally, it includes streamlining the workflow configuration by combining the Windows x64 and Arm64 YAML template files into a single file, thereby eliminating the separate win-arm64 template.
- Reshape Decomposition and Guard Mechanisms: This topic covers the rewrite of the reshape decomposition and the infer_size function for wildcard dimensions, utilizing the guard_or_false mechanism to prevent data-dependent errors. It also includes the introduction of C++ bindings for the guard_or_false/true functionality, aiming to implement the base version despite challenges in finding a suitable location.
- Tensor Parallelism and Activation Checkpointing: This topic addresses the issue of fusing matmul-reduce-scatters in asynchronous tensor parallelism when reduce-scatter operations have multiple users. It involves implementing additional pattern matching logic to accommodate reduce-scatter nodes with two users and ensuring that the fused node is saved for backward passes instead of the reduce-scatter node.
- Triton Lowering and Matrix Multiplication: This topic aims to consolidate the Triton lowering of scaled_mm (FP8 matrix multiplication) into the existing mm template by adding an epilogue to handle scale multiplication. This will facilitate the development of future scaled variants of batched matrix multiplication (BMM) and grouped general matrix multiplication (GEMM) in the inductor.
- Gradient Scaler for MPS Backend: This topic addresses the implementation of a gradient scaler for the Metal Performance Shaders (MPS) backend in PyTorch, aiming to fix issue #142397. It involves handling different dtype/device tensors in TensorList, optimizing the foreach kernel grouping, and enabling tests for the MPS device.
- FSDP2 Pre-forward Function Logic: This topic addresses an issue where the pre_forward function in FSDP2 is called twice when using checkpoint(), leading to errors due to incorrect training state and argument casting. The solution involves reordering the pre_forward logic to align with FSDP1's handling and ensuring proper casting when the training state is pre_backward.
- Binary Build Matrix Generation: This topic introduces a new script designed to generate a binary build matrix for the PyTorch project, aiming to refactor and improve upon an existing script. It ensures each binary build is a distinct object with explicit metadata, focusing on CPU architecture and accelerator specifications.
- Torch Export Initialization Process: This topic addresses an issue by modifying the behavior of the torch.export initialization process to ensure that a no-GPU warning is only triggered when attempting to enable the tf32 setting in a CPU-only environment. This change aligns the warning mechanism with the correct usage context.
- Inductor Component and Symbolic Integers: This topic addresses an issue where the Inductor component fails to handle non-trivial tile ranges with unbacked symbolic integers (symints). A fallback mechanism is implemented to provide size hints, preventing errors related to the inability to convert symbols to integers.
- AOT Autograd and Saved Tensors Hooks: This topic aims to enhance the aot_autograd functionality in PyTorch by introducing support for saved tensors hooks. It involves implementing features like dynamo guards for recompilation when hooks change and handling saved tensors hooks that pack into subclasses.
- Continuous Integration Optimization: This topic aims to optimize the continuous integration (CI) process by using the system-installed NCCL in the build. It involves installing NCCL in the Docker image, setting USE_SYSTEM_NCCL=1 in CI builds to reduce build time, and unifying various NCCL version pins across different installation scripts and Docker files.
- FP8 Data Types in Assert Close Function: This topic adds support for fp8 data types in the assert_close function by comparing them bitwise with zero absolute and relative tolerances. It addresses issue #135998 and includes a new unit test to cover the updated code paths.
- FlexAttention Module Enhancements: This topic aims to enhance the FlexAttention module by enabling it to dispatch to SAC (selective activation checkpointing) for flexible operations. It is part of a series of updates tracked through the ghstack tool and involves multiple contributors for review and collaboration.
- Invoke Subgraph Function Support: This topic aims to enhance the PyTorch project by adding support for None values in the forward output of the invoke_subgraph function. It is part of a series of related changes tracked through the ghstack tool.
- ROCm TunableOp Unit Tests: This topic introduces stricter unit tests for both online and offline tuning in the ROCm TunableOp, enhancing the comparison criteria by including both OpSig and ParamSig. It ensures comprehensive testing across different transposition combinations and adds warnings for unsupported tensor shapes during offline tuning.
- Documentation Build Errors: This topic addresses and resolves documentation build errors in the PyTorch project caused by unsupported section titles. It ensures successful HTML generation and improves the rendering of the documentation.
- Docker Build Failures for Executorch and Halide: This topic addresses the issue of failing docker builds for executorch and halide due to a CMake update. It involves setting the CMAKE_POLICY_VERSION_MINIMUM environment variable, which can be removed once executorch and halide update their builds and the hash is updated.
- Torch Accelerator Device Count Adaptation: This topic aims to adapt torch.accelerator.device_count for multi-process usage by delegating its functionality to torch.xxx.device_count. It ensures alignment with the behavior of torch.get_device_module(device).device_count to avoid issues like fork poisoning.
- CI System Testing for Origin/Main Branch: This topic is aimed at testing whether the continuous integration (CI) system for the 'origin/main' branch of the PyTorch project is malfunctioning. It includes testing changes and merging updates from the main branch.
- PyPI Package Validation for CUDA Binaries: This topic aims to disable the PyPI package validation for binaries that include CUDA libraries in the smoke test process. It addresses a specific issue where these binaries do not install packages via PyPI, as evidenced by a runtime error indicating the absence of the 'cudnn' package in PyPI for a specific Torch version.
- CUDA and ROCm Stream Handling in DLPack: This topic addresses the logic for handling CUDA and ROCm streams in the creation of DLPack capsules from tensors. It ensures the use of the legacy default stream when tensor.__dlpack__(stream=None) is called for a CUDA tensor and introduces error handling for unsupported stream values in both CUDA and ROCm contexts (a basic DLPack round-trip sketch follows this list).
- DLPack Keyword Arguments Support: This topic adds support for the missing keyword arguments dl_device and copy introduced in DLPack version 2023.12. It updates the C++ implementation of to_dlpack(...) to handle these arguments and introduces a new Python API torchDeviceToDLDevice().
- Global State Dictionary Loading: This topic introduces a strict check when loading a global state dictionary into a local one in the PyTorch project. It ensures that if the 'strict' option is set to true, only matching keys are loaded, while if set to false, additional keys from the global state are also included.
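As background for the two DLPack items above, here is a minimal round-trip sketch (a CPU tensor for simplicity; the stream handling discussed in the PR only applies to CUDA/ROCm tensors):

```python
import torch

src = torch.arange(4, dtype=torch.float32)

# __dlpack__ produces a capsule that torch.from_dlpack consumes without copying.
capsule = src.__dlpack__()
dst = torch.from_dlpack(capsule)
print(torch.equal(src, dst))  # True; both tensors share the same memory
```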
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 257
Key Closed Pull Requests
1. Brister/always tiled reduction: This pull request aims to test the continuous integration (CI) system by enabling tiled reductions by default, which could potentially identify bugs, and includes numerous commits addressing various issues such as fixing non-dense strides, adding unit tests, and refining the tiling and reduction logic.
- URL: pull/144008
- Merged: No
- Associated Commits: b88c1, 6fa6c, 7de6d, 1fbf8, 1b68e, 462c8, cfc0d, 11420, 81837, 83620, 5f0c2, 29bdd, 762ed, bac7d, 6e502, 4e1b3, 85432, a313b, 607a2, 6d0af, 64191, 1ab9c, 1b5cb, 4a77b, bf01a, 273aa, 350a8, d8381, 0a003, 60be7, 27bd2, 0a010, 84590, d882e, 379e7, 4cc2b, 6534d, 1d552, 8ba68, 6edc4, 24a1a, 6aabd, 783c7, dc2b0, 71716, 704b1, 9a397, f92eb, 1edcd, 0062e, 29298, 53a99, 103fe, 1ed28, e81c3, 6e999, 3e9e4, e19b7, 35ee8, b5eda, e8216, 71eaf, b5e88, 8d5d6, 81eba, a5a7f, 34d78, 824e3, c46da, a6e9d, ad113, 55a01, 72dab, 7aa94, 8c4b8, 309f7, e6931, 6d979, 4e8a0, ee832, e7356, b0d5c, eaa48, 80c99, e1596, 87bfd, b7005, 80838, 0244b, 99be2, 3f1a1, 8fc5a, fe55c, 4b59d, a5d56, a07f1, d2e05, 157f2, 41f14, 8dc05, 21c54, 8fe5e, bc404, 7f4fb, 04bed, 4e205, 02b45, 96e57, 9186c, 3b538, cf573, d75b9, 5d57b, 81609, 9e6a2, b99f2, 131ee, 07089, bc42b, 8802d, 2dc7f, 89861, 874d0, aa7a3, f5ad6, 3107b, 02365, 0a465, 1da5d, 36624, 17825, 4ba2b, 89106, ab074, eac98, 26c5e, 574a8, f1169, a5aa8, 720d2, cb021, f60ab, e8b23, 1bbf8, f9c0f, 0a958, 6504b, 920b0, da951, 87ee2, 0ab2a, 70961, e4e22, c79b7, c7913, 5fd2e, f94f7, d88e6, 42311, e7cc7, e3471, ca2e8, 3a431, ea561, 7c731, 32c3c, ed79c, d79a0, 69577, e9ced, 0fc56, 6e320, 8ad92, 6d15d, a4c1b, cc8ed, 93071, 1042a, 59343, d3146, 4b073, 714ba, 79987, 942a7, a5bbd, 01ded, 50e45, bb464, ae132, 258cf, 3d2f0, 38ce2, e1407, 53c08, 72b21, bbf00, 12bb7, 67738, 8c9ef, 95704, 39a10, 1ee6b, 7c3c6, 417bc, 7cd2f, be966, 60c3c, b80b9, edb88, 6d169, 3faed, 9573d, 3e133, 30eda, d5334, fbf34, 6da70, 29495, 55100, b885f, 76ed9, 16d15, 0dbc5, 98740, 931ec, 04b7a, 47565, a3b1b, a4623, 8223c, 0def4, a960f, c324e, 305c4, 7b395
2. Combine win and win-arm64 templates: This pull request aims to combine the Windows and Windows ARM64 templates in the PyTorch project, as indicated by the title and the initial commit message, although it was ultimately not merged.
- URL: pull/149613
- Merged: No
- Associated Commits: 1c988, 93670, a0912, 496bb, d67c1, 22271, 80dfc, 9b112, 2c4bc, 62374, 1d940, c99ef, 7bb9c, b99fc, 88a26, 6285a, 1d221, ffa08, aae4c, 44e64, 18435, 6e843, 4a4a7, 24176, f17ae, 406d4, a7031, b07b8, a268c, f64c3, ce5ad, 1d3c5, bf34e, f47aa, 90543, 29756, 1099c, c2ada, 5ebc2, e4816, 66dd0, 362b4, 06923, a39bf, 64bd8, 732f9, ee6a0, bf662, 53278, ccd5d, 4ea58, 0a396, 0ed34, cfc08, 34743, 68dfd, d0722, 2b90e, 5d4b5, 1eab8, bdc13, e35ef, 64d22, 70026, fa5f5, 99a4f, f7d1b, 1b08a, b0a5d, 19b76, 46dd2, 0eb3a, 09aa6, c5dea, 85f6d, 842d5, 5757a, fb07f, ff020, d46c1, 1c6b5, 7f836, d320a, 27370, b238e, c73a5, 9d02b, 021b3, b9a5e, 01b1d, 51fa8, 8f7fb, 621c8, 6bbe8, abf0e, 2b848, 9367f, 539db, fe954, 85027, c201d, 8bece, 2dccd, de3ac, d5ce5, 24848, 21ab4, d13a1, 63f9f, afaa0, 24972, 15f8d, f6cbb, bdd89
3. cpp_wrapper: persist autotune example tensors until last use: This pull request addresses an issue in the PyTorch project where randomly generated example tensors could cause kernel autotuning to fail by ensuring that these tensors persist until their last use, thereby fixing a specific test failure related to compile-time autotuning.
- URL: pull/146706
- Merged: No
- Associated Commits: a56a4, bd891, 391e6, b0944, 913e8, 25078, e1958, e5083, 0026e, 0f41e, ce975, e7962, d43be, a6b57, 73360, 02a13, 4bb0f, 49d49, 3084e, e0f0d, 328c3, 0d19a, caf65, 073cf, 37e4e, 74f9d, e05c2, 1475d, b60db, b41a6, b41df, 3b50b, 9843a, 6d848, 2db66, eb4b7, 266b6
Other Closed Pull Requests
- CUDA Kernel Enhancements: This pull request introduces a new CUDA kernel to improve the performance of the backward pass for gamma and beta calculations in layer normalization. It shows significant speed improvements for input dimensions that are powers of two, despite a slight increase in binary size and compile time.
- AOT Autograd Cache: This pull request implements caching for the AC HOP in the AOT Autograd Cache within PyTorch. It is part of a series of changes managed through the ghstack tool but was not merged.
- Graph Partitioning for Custom Operations: This pull request introduces support for graph partitioning on custom operations in PyTorch. It provides a new API to register or unregister custom operations for graph partitioning, with example usage and tests, although it is not yet merged.
- Input Aliasing and Mutation Checks: This pull request implements input aliasing and mutation checks within the Dynamo component of PyTorch. It focuses on using versioning to manage these checks in the invoke_subgraph function, involving multiple contributors.
- Profile-Guided Optimization (PGO) Cache Misses: This pull request addresses cache misses in internal models due to PGO by using source hashing to generate consistent symbolic IDs. It ensures stable assignment and prevents catastrophic symbol collisions through linear probing.
- TorchTune Fixes: This pull request, titled "[dont review][dont merge] All fixes to make TorchTune work," was created to implement fixes for TorchTune. It was not intended for review or merging and was closed without being merged.
- Flash-Attention Integration: This pull request transforms the integration of flash-attention into a third-party submodule. It addresses changes in Cuda-graph RNG handling and dependencies on a related Flash PR, while dealing with backward compatibility issues.
- Type Annotations in _inductor/ir.py: This pull request enhances type annotations in the _inductor/ir.py file by removing all # type: ignore comments. It addresses resulting type failures while avoiding changes to existing behavior.
- cuBLAS nvfp4 Kernel Integration: This pull request integrates the torch._scaled_mm function with the cuBLAS nvfp4 kernel for matrix multiplication. It allows the operation to utilize the specialized fp4 gemm kernel for improved performance.
- Asynchronous Tensor Parallelism: This pull request addresses fusing matmul-reduce-scatters in asynchronous tensor parallelism. It implements pattern matching logic to accommodate reduce-scatter nodes with multiple users, preventing memory leaks.
- Intermediate Node Name Normalization: This pull request normalizes intermediate node names to ensure isomorphic graphs produce the same outputted graph. It improves cache utilization by performing an alpha renaming of intermediate variables.
- Dilation Support in max_pool2d: This pull request enhances PyTorch by adding support for dilation in the lowering process of the max_pool2d operation, as indicated by the title and the series of associated commits.
- Fake Tensors in foreach_copy: This pull request adds support for fake tensors in the foreach_copy function within PyTorch. It addresses issue #149111 and includes various commits for adding test cases and fixing lint errors.
- StaticCudaLauncher Modifications: This pull request modifies the StaticCudaLauncher to support any number of kernel arguments. It implements a fallback mechanism for arguments exceeding a predefined maximum and addresses a specific issue with zero arguments.
- Shared Memory Allocations in StaticCudaLauncher: This pull request enhances the StaticCudaLauncher by enabling support for shared memory allocations exceeding 48KB. It involves special handling by querying the device for maximum memory.
- Dynamic Shapes Code Generation: This pull request makes code generation for dynamic shapes more device agnostic. It addresses the assumption that devices are either CPU with Cpp codegen or GPU with Triton codegen, allowing more flexibility.
- Autograd Key Graph Tracing: This pull request addresses not tracing the forward and backward graphs in the autograd key within PyTorch. It was ultimately not merged, as indicated by the title and multiple updates in the commit messages.
- Fake Tensor Prop Caching: This pull request, titled "[invoke_subgraph] Fake tensor prop caching," aimed to reintroduce changes from a previous pull request. It focused on caching properties of fake tensors within subgraph invocations but was not merged.
- PendingUnbackedSymbolNotFound Error: This pull request addresses the "PendingUnbackedSymbolNotFound" error by allowing the intentional creation of unbacked symbols. It provides a method to bypass this error using fake_mode.shape_env.ignore_fresh_unbakced_symbols().
- Non-Contiguous Operations Performance: This pull request enhances performance for non-contiguous operations on larger tensors by replacing the indexed approach with a strided flavor. It significantly reduces execution time for operations like fmax on 1000x1000 strided tensors.
- Prologue Fusion with constant_pad_nd: This pull request includes constant_pad_nd in prologue fusion within PyTorch. Benchmarking revealed occasional speedups, prompting the change along with a fix for creating a single, contiguous dependency for prologues.
- Row-Wise Scaled MM Refactoring: This pull request refactors the row-wise scaled matrix multiplication (MM) by adding configuration selection for SM89.2. It ensures kernels are only built when compiling for the specified architecture.
- Tensors with requires_grad=True Warning: This pull request addresses tensors with requires_grad=True being converted to scalars without warning. It introduces a user warning to alert developers of potential unexpected behavior when using operations like math.pow (a short sketch of the behavior follows this list).
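To illustrate the silent scalar conversion the warning above targets, a small sketch (assumed semantics; the exact warning text comes from the PR and is not shown here):

```python
import math
import torch

x = torch.tensor(2.0, requires_grad=True)

y = math.pow(x, 3)   # implicitly converts x to a Python float: returns 8.0, no autograd
z = torch.pow(x, 3)  # stays a tensor and keeps the autograd graph
z.backward()
print(y, x.grad)     # 8.0 tensor(12.)
```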
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Add XPU and SYCL Merge Patterns
- Toxicity Score: 0.55 (Frustration expressed, defensive responses, mediation attempts, escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 249 | 78 | 15 | 242 |
guilhermeleobas | 354 | 7 | 2 | 28 |
justinchuby | 123 | 19 | 6 | 133 |
XuehaiPan | 218 | 10 | 0 | 34 |
zou3519 | 16 | 4 | 16 | 222 |
jamesjwu | 170 | 15 | 12 | 50 |
laithsakka | 107 | 24 | 8 | 97 |
atalman | 130 | 25 | 16 | 65 |
cyyever | 138 | 39 | 0 | 30 |
jansel | 112 | 18 | 0 | 71 |