Weekly GitHub Report for PyTorch: March 24, 2025 - March 31, 2025 (12:06:11)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change to the default value of the weights_only parameter in torch.load.
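For readers affected by the weights_only change, here is a minimal sketch of the new default behavior (the file name and dictionary contents are illustrative, not taken from the release notes):

```python
import torch

# In PyTorch 2.6 the default of `weights_only` in torch.load flips to True,
# restricting unpickling to tensors and other allow-listed types.
torch.save({"step": 1, "weights": torch.randn(2, 2)}, "checkpoint.pt")

state = torch.load("checkpoint.pt")                       # weights_only=True by default
legacy = torch.load("checkpoint.pt", weights_only=False)  # opt back into full unpickling
```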
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- SDPA (EFFICIENT_ATTENTION) slower than torch.compile decomposition: This issue highlights a performance regression where the SDPA efficient attention mechanism results in slower training times compared to a manually compiled attention implementation, despite showing improvements in memory usage. The user provides a detailed code example and profiling data, indicating that the runtime regression is more pronounced in their actual codebase, and seeks insights into the potential causes and solutions.
- The comments discuss various profiling attempts and comparisons between SDPA and manual attention implementations, with suggestions to test on different versions and configurations. The user identifies that enabling TF32 significantly impacts performance, with SDPA not utilizing tensor cores effectively, leading to slower runtimes. The discussion includes requests for further benchmarking and insights into whether TF32 should affect SDPA performance, with contributors suggesting potential areas for investigation and improvement.
- Number of comments this week: 17
- RuntimeError: UR error with XPU: This issue involves a RuntimeError: UR error encountered when attempting to use XPU with PyTorch versions 2.6.0+xpu and 2.8.0.dev20250321+xpu, despite torch.xpu.is_available() returning True. The user has tried various solutions, including updating conda and removing libstdc++.so.6, but continues to face compatibility issues with oneAPI and PyTorch-XPU.
- The comments discuss potential causes and solutions for the error, including environment mismatches and broken nightly builds. Suggestions include using the stable PyTorch 2.6 release, ensuring the correct oneAPI version, and possibly building PyTorch from source. The user provides additional information about their setup, including device properties, and attempts various troubleshooting steps, but the issue persists.
- Number of comments this week: 13
- [XPU] XPU build has been broken: This issue reports a problem with the XPU build process in a GitHub project, which is failing due to errors introduced by a specific pull request. The error messages indicate issues with the C++ configuration, particularly related to the _GLIBCXX_USE_CXX11_ABI macro, which is causing the build to crash.
- The comments discuss the build failure, with contributors suggesting adding checks to ensure the XPU build is correctly configured. There is a discussion about reverting the problematic pull request as a temporary workaround, and a fix is proposed in a related repository. The conversation also touches on the complexity of the proposed solution and the need for a simple release-specific change to address the issue.
- Number of comments this week: 10
- Concatenating CSR matrices fails: This issue highlights a problem with the PyTorch library where functions like cat, stack, vstack, and hstack do not work on CSR (Compressed Sparse Row) matrices, despite the documentation suggesting otherwise. The error encountered is a RuntimeError indicating that sparse CSR tensors do not have the is_contiguous attribute, which prevents these operations from being executed as expected.
- The comments discuss the discrepancy between the documentation and actual functionality, suggesting workarounds using the COO (Coordinate) format for concatenation (see the sketch after this list). Users share their use cases for CSR matrices and discuss the complexity of implementing direct concatenation for CSR tensors. Some propose using intermediate COO tensors as a solution, while others share insights from related research and existing implementations in other libraries like SciPy.
- Number of comments this week: 9
- Auto-selective activation checkpointing is not optimal for speed (issue with min_cut_rematerialization_partition): This issue highlights a performance inefficiency in PyTorch's selective activation checkpointing, specifically with the min_cut_rematerialization_partition implementation. The user provides a minimal code example demonstrating that the current approach stores certain variables unnecessarily, leading to additional computations during the backward pass, which could be avoided for better performance.
- The comments discuss the runtime differences and the rationale behind the current implementation, with some users suggesting that recomputing certain operations like add should be free due to their nature. There is a debate on the interpretation of the activation_memory_budget setting and its impact on recomputation, with code snippets provided to exclude the add operator for testing purposes.
- Number of comments this week: 9
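As a minimal sketch of the COO workaround mentioned in the CSR concatenation issue above (shapes and values are illustrative and not taken from the issue):

```python
import torch

# torch.cat currently rejects sparse CSR inputs, so round-trip through COO:
a = torch.eye(3).to_sparse_csr()
b = torch.ones(2, 3).to_sparse_csr()

stacked = torch.cat([a.to_sparse_coo(), b.to_sparse_coo()], dim=0).to_sparse_csr()
print(stacked.shape)  # torch.Size([5, 3])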
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a significant speedup in processing time, as demonstrated by the provided testing code and results (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached cuda_utils.so file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by the inability to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue addresses the need to apply uniform formatting (UFMT) to approximately 1,500 files in the PyTorch codebase that are currently not formatted according to the project's standards. The process involves removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to ensure all files adhere to the desired formatting, with additional preparatory work required to resolve known issues in certain files before applying the UFMT changes.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in the PyTorch library to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models. The motivation behind this feature request is to reduce the file size of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by a reduction from 6.7MB to 5.6MB in a test case.
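To make the MaxPool2D decomposition idea above concrete, here is a small sketch (kernel sizes are illustrative; the equivalence holds because stride-1 max windows compose):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)

# A single 5x5 stride-1 max pool equals two chained 3x3 stride-1 max pools,
# since taking the max over overlapping windows composes into a larger window.
big = F.max_pool2d(x, kernel_size=5, stride=1)
chained = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)
print(torch.equal(big, chained))  # True
```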
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 121
Summarized Issues:
- Memory Leaks in PyTorch: Memory leaks are a recurring issue in PyTorch, affecting various functionalities such as @torch.compile and torch.save. These leaks lead to increased memory usage and potential out-of-memory errors, as seen in cases where memory is not released after saving tensors or when activations are not cleared after backward passes.
- PyTorch Compilation and Graph Breaks: Compilation issues in PyTorch often result in graph breaks, affecting the execution of models. Problems such as unsupported operations, incorrect path handling, and unexpected graph elements like FunctionCtx disrupt the compilation process, leading to errors and inconsistent behavior (a short diagnostic sketch follows this list).
- Performance and Efficiency Issues: PyTorch faces several performance-related challenges, including inefficient attention mechanisms and excessive recompilations. These issues result in slower training times and skewed performance profiles, necessitating optimizations and profiling improvements.
- Autograd and Custom Function Limitations: Limitations in PyTorch's autograd functionality, particularly with custom functions, pose challenges for users. Issues such as failure to handle lists of tensors in C++ and unexpected graph elements hinder the seamless execution of models.
- Device Compatibility and Execution Errors: PyTorch users encounter various device-related issues, including runtime errors on specific GPUs and compatibility problems with CUDA and XPU devices. These issues often require environment adjustments or indicate potential bugs in device handling.
- Documentation and Usability Concerns: PyTorch's documentation sometimes lacks clarity or contains errors, affecting user understanding and implementation of functions. Issues such as missing explanations for tensor operations and incorrect behavior descriptions highlight the need for documentation improvements.
- Inductor Backend and Numerical Discrepancies: The inductor backend in PyTorch exhibits numerical discrepancies and miscompilation issues, leading to significant differences in model outputs compared to eager execution. These discrepancies raise concerns about the reliability of the inductor backend for accurate model inference.
- CI/CD and Build Failures: Continuous integration and build processes in PyTorch face disruptions due to various factors, including incompatible software versions and infrastructure issues. These failures impact the stability and reliability of the build environment, necessitating adjustments and fixes.
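As a generic way to surface the graph breaks mentioned in the compilation bullet above (a hedged sketch, not a reproduction of any specific issue; the trigger function is made up for illustration):

```python
import torch

def f(x):
    x = x.sin()
    print("side effect")  # a Python side effect like print typically forces a graph break
    return x.cos()

# torch._dynamo.explain reports how many graphs Dynamo produced and why it broke.
report = torch._dynamo.explain(f)(torch.randn(8))
print(report.graph_break_count, report.break_reasons)
```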
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 152
Summarized Issues:
- Compilation and Runtime Errors: Compilation and runtime errors are prevalent in PyTorch, often due to device compatibility issues or specific function implementations. For instance, errors like "HIPBLAS_STATUS_NOT_SUPPORTED" and "illegal hardware instruction" occur due to unsupported tensor shapes or environment configurations. These issues highlight the need for careful management of device-specific operations and configurations to ensure smooth execution.
- Performance Regressions: Performance regressions are a common concern, particularly when new versions of PyTorch or its dependencies are released. For example, a 40% regression in aten::mm and aten::bmm operations on AMD GPUs was linked to the enabling of hipblaslt, and reverting a specific commit restored performance. Such regressions necessitate thorough testing and validation to maintain optimal performance across updates.
- Dynamic Shape and Export Issues: Dynamic shape handling and model export processes in PyTorch can lead to errors, such as "Pending unbacked symbols" or failures in ONNX export with dynamic axes. These issues underscore the complexity of managing dynamic dimensions and the need for robust export mechanisms to handle varying input sizes and configurations (a minimal export sketch follows this list).
- Device-Specific Bugs: Device-specific bugs, particularly on newer or less common platforms like Apple's MPS or Intel's XPU, can result in incorrect computations or unsupported operations. For instance, complex conjugations on MPS yield incorrect results, and LayerNorm on XPU produces NaN values, indicating the need for platform-specific optimizations and testing.
- Graph Breaks and Unsupported Operations: Graph breaks and unsupported operations in PyTorch's dynamo and export functions can hinder model compilation and execution. Issues like unsupported use of dict.update or torch.vmap with certain configurations highlight the challenges in maintaining seamless graph transformations and the need for comprehensive support for common operations.
- Documentation and Usability Concerns: Documentation discrepancies and usability issues, such as outdated web pages or missing docstrings, can impede user understanding and adoption of PyTorch features. Ensuring up-to-date and comprehensive documentation is crucial for user support and effective utilization of the library's capabilities.
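For context on the dynamic-axes export problems grouped above, here is a minimal sketch of the usual export call (the model, file name, and axis names are illustrative, and it assumes the onnx package is installed):

```python
import torch

class Scale(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Mark the batch dimension as dynamic so the exported model accepts variable batch sizes.
torch.onnx.export(
    Scale(), (torch.randn(3, 4),), "scale.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
)
```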
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 144
Key Open Pull Requests
1. Enable XPU distributed test for PT2.8: This pull request aims to enable XPU (Intel's accelerator technology) support for distributed testing in PyTorch version 2.8 by incorporating various updates such as adding XPU support for distributed data parallel (DDP) and pipeline test cases, porting fully sharded data parallel (FSDP) tests, and fixing backend mapping errors, while also involving multiple merges and reverts to refine the implementation.
- URL: pull/149916
- Merged: No
- Associated Commits: d0d82, c791d, f5cbd, 4d944, 68441, 5051e, e2aa9, 9e830, 06dd2, a4a73, 6e3f6, 5f473, 345d7, 4a5a5, 20a44, a90a6, 5b1af, 44d55
2. S390x: update more tests: This pull request focuses on updating and enabling more tests for the s390x architecture in the PyTorch project, addressing specific issues related to s390x, marking certain tests as failing or skipped, and making various improvements such as fixing byte order constants, adding necessary build tools, switching to a newer GCC version, and handling new warnings, while also ensuring that some tests are either fixed or appropriately marked for the s390x platform.
- URL: pull/150116
- Merged: No
- Associated Commits: 280c8, e1293, 6406e, 939a3, 91b71, 534f3, 19619, 86ce3, b19c8, a8847, b92b7, 8f01a, 77215, 682f0, fb409, b849a, 7258b
3. [WIP] rewrite pad_nd with guard_or_false: This pull request aims to rewrite the pad_nd function using guard_or_false in the PyTorch project, replacing the existing guard_size_oblivious approach to improve code readability, reduce complexity, and minimize data-dependent errors, as detailed in the associated commits and documentation.
- URL: pull/149998
- Merged: No
- Associated Commits: 94167, 3f56d, 31e50, 1e0ba, 8ed91, 76154, 41959, cfb21, 13325, f2c92, 25c1a, 53c14, 56554, 1ca8f, 7f4ba, 66864
Other Open Pull Requests
- GitHub Runner Behaviors and Workflow Configuration: This topic involves testing GitHub Runner behaviors by combining Windows x64 and ARM64 YAML template files, adjusting build scripts, and modifying workflow steps to ensure compatibility and efficiency across different platforms. Additionally, it includes streamlining the workflow configuration by combining the Windows x64 and Arm64 YAML template files into a single file, thereby eliminating the separate win-arm64 template.
- Reshape Decomposition and Guard Mechanisms: This topic covers the rewrite of the reshape decomposition and the infer_size function for wildcard dimensions, utilizing the guard_or_false mechanism to prevent data-dependent errors. It also includes the introduction of C++ bindings for the guard_or_false/true functionality, aiming to implement the base version despite challenges in finding a suitable location.
- Tensor Parallelism and Activation Checkpointing: This topic addresses the issue of fusing matmul-reduce-scatters in asynchronous tensor parallelism when reduce-scatter operations have multiple users. It involves implementing additional pattern matching logic to accommodate reduce-scatter nodes with two users and ensuring that the fused node is saved for backward passes instead of the reduce-scatter node.
- Triton Lowering and Matrix Multiplication: This topic aims to consolidate the Triton lowering of scaled_mm (FP8 matrix multiplication) into the existing mm template by adding an epilogue to handle scale multiplication. This will facilitate the development of future scaled variants of batched matrix multiplication (BMM) and grouped general matrix multiplication (GEMM) in the inductor.
- Gradient Scaler for MPS Backend: This topic addresses the implementation of a gradient scaler for the Metal Performance Shaders (MPS) backend in PyTorch, aiming to fix issue #142397. It involves handling different dtype/device tensors in TensorList, optimizing the foreach kernel grouping, and enabling tests for the MPS device.
- FSDP2 Pre-forward Function Logic: This topic addresses an issue where the pre_forward function in FSDP2 is called twice when using checkpoint(), leading to errors due to incorrect training state and argument casting. The solution involves reordering the pre_forward logic to align with FSDP1's handling and ensuring proper casting when the training state is pre_backward.
- Binary Build Matrix Generation: This topic introduces a new script designed to generate a binary build matrix for the PyTorch project, aiming to refactor and improve upon an existing script. It ensures each binary build is a distinct object with explicit metadata, focusing on CPU architecture and accelerator specifications.
- Torch Export Initialization Process: This topic addresses an issue by modifying the behavior of the torch.export initialization process to ensure that a no-GPU warning is only triggered when attempting to enable the tf32 setting in a CPU-only environment. This change aligns the warning mechanism with the correct usage context.
- Inductor Component and Symbolic Integers: This topic addresses an issue where the Inductor component fails to handle non-trivial tile ranges with unbacked symbolic integers (symints). A fallback mechanism is implemented to provide size hints, preventing errors related to the inability to convert symbols to integers.
- AOT Autograd and Saved Tensors Hooks: This topic aims to enhance the aot_autograd functionality in PyTorch by introducing support for saved tensors hooks. It involves implementing features like dynamo guards for recompilation when hooks change and handling saved tensors hooks that pack into subclasses.
- Continuous Integration Optimization: This topic aims to optimize the continuous integration (CI) process by using the system-installed NCCL in the build. It involves installing NCCL in the Docker image, setting USE_SYSTEM_NCCL=1 in CI builds to reduce build time, and unifying various NCCL version pins across different installation scripts and Docker files.
- FP8 Data Types in Assert Close Function: This topic adds support for fp8 data types in the assert_close function by comparing them bitwise with zero absolute and relative tolerances. It addresses issue #135998 and includes a new unit test to cover the updated code paths.
- FlexAttention Module Enhancements: This topic aims to enhance the FlexAttention module by enabling it to dispatch to SAC (selective activation checkpointing) for flexible operations. It is part of a series of updates tracked through the ghstack tool and involves multiple contributors for review and collaboration.
- Invoke Subgraph Function Support: This topic aims to enhance the PyTorch project by adding support for None values in the forward output of the invoke_subgraph function. It is part of a series of related changes tracked through the ghstack tool.
- ROCm TunableOp Unit Tests: This topic introduces stricter unit tests for both online and offline tuning in the ROCm TunableOp, enhancing the comparison criteria by including both OpSig and ParamSig. It ensures comprehensive testing across different transposition combinations and adds warnings for unsupported tensor shapes during offline tuning.
- Documentation Build Errors: This topic addresses and resolves documentation build errors in the PyTorch project caused by unsupported section titles. It ensures successful HTML generation and improves the rendering of the documentation.
- Docker Build Failures for Executorch and Halide: This topic addresses the issue of failing docker builds for executorch and halide due to a CMake update. It involves setting the CMAKE_POLICY_VERSION_MINIMUM environment variable, which can be removed once executorch and halide update their builds and the hash is updated.
- Torch Accelerator Device Count Adaptation: This topic aims to adapt torch.accelerator.device_count for multi-process usage by delegating its functionality to torch.xxx.device_count. It ensures alignment with the behavior of torch.get_device_module(device).device_count to avoid issues like fork poisoning.
- CI System Testing for Origin/Main Branch: This topic is aimed at testing whether the continuous integration (CI) system for the 'origin/main' branch of the PyTorch project is malfunctioning. It includes testing changes and merging updates from the main branch.
- PyPI Package Validation for CUDA Binaries: This topic aims to disable the PyPI package validation for binaries that include CUDA libraries in the smoke test process. It addresses a specific issue where these binaries do not install packages via PyPI, as evidenced by a runtime error indicating the absence of the 'cudnn' package in PyPI for a specific Torch version.
- CUDA and ROCm Stream Handling in DLPack: This topic addresses the logic for handling CUDA and ROCm streams in the creation of DLPack capsules from tensors. It ensures the use of the legacy default stream when tensor.__dlpack__(stream=None) is called for a CUDA tensor and introduces error handling for unsupported stream values in both CUDA and ROCm contexts (a basic DLPack round-trip sketch follows this list).
- DLPack Keyword Arguments Support: This topic adds support for the missing keyword arguments dl_device and copy introduced in DLPack version 2023.12. It updates the C++ implementation of to_dlpack(...) to handle these arguments and introduces a new Python API torchDeviceToDLDevice().
- Global State Dictionary Loading: This topic introduces a strict check when loading a global state dictionary into a local one in the PyTorch project. It ensures that if the 'strict' option is set to true, only matching keys are loaded, while if set to false, additional keys from the global state are also included.
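As background for the two DLPack items above, here is a minimal round-trip sketch (a CPU tensor for simplicity; the stream handling discussed in the PR only applies to CUDA/ROCm tensors):

```python
import torch

src = torch.arange(4, dtype=torch.float32)

# __dlpack__ produces a capsule that torch.from_dlpack consumes without copying.
capsule = src.__dlpack__()
dst = torch.from_dlpack(capsule)
print(torch.equal(src, dst))  # True; both tensors share the same memory
```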
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 257
Key Closed Pull Requests
1. Brister/always tiled reduction: This pull request aims to test the continuous integration (CI) system by enabling tiled reductions by default, which could potentially identify bugs, and includes numerous commits addressing various issues such as fixing non-dense strides, adding unit tests, and refining the tiling and reduction logic.
- URL: pull/144008
- Merged: No
- Associated Commits: b88c1, 6fa6c, 7de6d, 1fbf8, 1b68e, 462c8, cfc0d, 11420, 81837, 83620, 5f0c2, 29bdd, 762ed, bac7d, 6e502, 4e1b3, 85432, a313b, 607a2, 6d0af, 64191, 1ab9c, 1b5cb, 4a77b, bf01a, 273aa, 350a8, d8381, 0a003, 60be7, 27bd2, 0a010, 84590, d882e, 379e7, 4cc2b, 6534d, 1d552, 8ba68, 6edc4, 24a1a, 6aabd, 783c7, dc2b0, 71716, 704b1, 9a397, f92eb, 1edcd, 0062e, 29298, 53a99, 103fe, 1ed28, e81c3, 6e999, 3e9e4, e19b7, 35ee8, b5eda, e8216, 71eaf, b5e88, 8d5d6, 81eba, a5a7f, 34d78, 824e3, c46da, a6e9d, ad113, 55a01, 72dab, 7aa94, 8c4b8, 309f7, e6931, 6d979, 4e8a0, ee832, e7356, b0d5c, eaa48, 80c99, e1596, 87bfd, b7005, 80838, 0244b, 99be2, 3f1a1, 8fc5a, fe55c, 4b59d, a5d56, a07f1, d2e05, 157f2, 41f14, 8dc05, 21c54, 8fe5e, bc404, 7f4fb, 04bed, 4e205, 02b45, 96e57, 9186c, 3b538, cf573, d75b9, 5d57b, 81609, 9e6a2, b99f2, 131ee, 07089, bc42b, 8802d, 2dc7f, 89861, 874d0, aa7a3, f5ad6, 3107b, 02365, 0a465, 1da5d, 36624, 17825, 4ba2b, 89106, ab074, eac98, 26c5e, 574a8, f1169, a5aa8, 720d2, cb021, f60ab, e8b23, 1bbf8, f9c0f, 0a958, 6504b, 920b0, da951, 87ee2, 0ab2a, 70961, e4e22, c79b7, c7913, 5fd2e, f94f7, d88e6, 42311, e7cc7, e3471, ca2e8, 3a431, ea561, 7c731, 32c3c, ed79c, d79a0, 69577, e9ced, 0fc56, 6e320, 8ad92, 6d15d, a4c1b, cc8ed, 93071, 1042a, 59343, d3146, 4b073, 714ba, 79987, 942a7, a5bbd, 01ded, 50e45, bb464, ae132, 258cf, 3d2f0, 38ce2, e1407, 53c08, 72b21, bbf00, 12bb7, 67738, 8c9ef, 95704, 39a10, 1ee6b, 7c3c6, 417bc, 7cd2f, be966, 60c3c, b80b9, edb88, 6d169, 3faed, 9573d, 3e133, 30eda, d5334, fbf34, 6da70, 29495, 55100, b885f, 76ed9, 16d15, 0dbc5, 98740, 931ec, 04b7a, 47565, a3b1b, a4623, 8223c, 0def4, a960f, c324e, 305c4, 7b395
2. Combine win and win-arm64 templates: This pull request aims to combine the Windows and Windows ARM64 templates in the PyTorch project, as indicated by the title and the initial commit message, although it was ultimately not merged.
- URL: pull/149613
- Merged: No
- Associated Commits: 1c988, 93670, a0912, 496bb, d67c1, 22271, 80dfc, 9b112, 2c4bc, 62374, 1d940, c99ef, 7bb9c, b99fc, 88a26, 6285a, 1d221, ffa08, aae4c, 44e64, 18435, 6e843, 4a4a7, 24176, f17ae, 406d4, a7031, b07b8, a268c, f64c3, ce5ad, 1d3c5, bf34e, f47aa, 90543, 29756, 1099c, c2ada, 5ebc2, e4816, 66dd0, 362b4, 06923, a39bf, 64bd8, 732f9, ee6a0, bf662, 53278, ccd5d, 4ea58, 0a396, 0ed34, cfc08, 34743, 68dfd, d0722, 2b90e, 5d4b5, 1eab8, bdc13, e35ef, 64d22, 70026, fa5f5, 99a4f, f7d1b, 1b08a, b0a5d, 19b76, 46dd2, 0eb3a, 09aa6, c5dea, 85f6d, 842d5, 5757a, fb07f, ff020, d46c1, 1c6b5, 7f836, d320a, 27370, b238e, c73a5, 9d02b, 021b3, b9a5e, 01b1d, 51fa8, 8f7fb, 621c8, 6bbe8, abf0e, 2b848, 9367f, 539db, fe954, 85027, c201d, 8bece, 2dccd, de3ac, d5ce5, 24848, 21ab4, d13a1, 63f9f, afaa0, 24972, 15f8d, f6cbb, bdd89
3. cpp_wrapper: persist autotune example tensors until last use: This pull request addresses an issue in the PyTorch project where randomly generated example tensors could cause kernel autotuning to fail by ensuring that these tensors persist until their last use, thereby fixing a specific test failure related to compile-time autotuning.
- URL: pull/146706
- Merged: No
- Associated Commits: a56a4, bd891, 391e6, b0944, 913e8, 25078, e1958, e5083, 0026e, 0f41e, ce975, e7962, d43be, a6b57, 73360, 02a13, 4bb0f, 49d49, 3084e, e0f0d, 328c3, 0d19a, caf65, 073cf, 37e4e, 74f9d, e05c2, 1475d, b60db, b41a6, b41df, 3b50b, 9843a, 6d848, 2db66, eb4b7, 266b6
Other Closed Pull Requests
- CUDA Kernel Enhancements: This pull request introduces a new CUDA kernel to improve the performance of the backward pass for gamma and beta calculations in layer normalization. It shows significant speed improvements for input dimensions that are powers of two, despite a slight increase in binary size and compile time.
- AOT Autograd Cache: This pull request implements caching for the AC HOP in the AOT Autograd Cache within PyTorch. It is part of a series of changes managed through the ghstack tool but was not merged.
- Graph Partitioning for Custom Operations: This pull request introduces support for graph partitioning on custom operations in PyTorch. It provides a new API to register or unregister custom operations for graph partitioning, with example usage and tests, although it is not yet merged.
- Input Aliasing and Mutation Checks: This pull request implements input aliasing and mutation checks within the Dynamo component of PyTorch. It focuses on using versioning to manage these checks in the invoke_subgraph function, involving multiple contributors.
- Profile-Guided Optimization (PGO) Cache Misses: This pull request addresses cache misses in internal models due to PGO by using source hashing to generate consistent symbolic IDs. It ensures stable assignment and prevents catastrophic symbol collisions through linear probing.
- TorchTune Fixes: This pull request, titled "[dont review][dont merge] All fixes to make TorchTune work," was created to implement fixes for TorchTune. It was not intended for review or merging and was closed without being merged.
- Flash-Attention Integration: This pull request transforms the integration of flash-attention into a third-party submodule. It addresses changes in Cuda-graph RNG handling and dependencies on a related Flash PR, while dealing with backward compatibility issues.
- Type Annotations in _inductor/ir.py: This pull request enhances type annotations in the _inductor/ir.py file by removing all # type: ignore comments. It addresses resulting type failures while avoiding changes to existing behavior.
- cuBLAS nvfp4 Kernel Integration: This pull request integrates the torch._scaled_mm function with the cuBLAS nvfp4 kernel for matrix multiplication. It allows the operation to utilize the specialized fp4 gemm kernel for improved performance.
- Asynchronous Tensor Parallelism: This pull request addresses fusing matmul-reduce-scatters in asynchronous tensor parallelism. It implements pattern matching logic to accommodate reduce-scatter nodes with multiple users, preventing memory leaks.
- Intermediate Node Name Normalization: This pull request normalizes intermediate node names to ensure isomorphic graphs produce the same outputted graph. It improves cache utilization by performing an alpha renaming of intermediate variables.
- Dilation Support in max_pool2d: This pull request enhances PyTorch by adding support for dilation in the lowering process of the max_pool2d operation, as indicated by the title and the series of associated commits.
- Fake Tensors in foreach_copy: This pull request adds support for fake tensors in the foreach_copy function within PyTorch. It addresses issue #149111 and includes various commits for adding test cases and fixing lint errors.
- StaticCudaLauncher Modifications: This pull request modifies the StaticCudaLauncher to support any number of kernel arguments. It implements a fallback mechanism for arguments exceeding a predefined maximum and addresses a specific issue with zero arguments.
- Shared Memory Allocations in StaticCudaLauncher: This pull request enhances the StaticCudaLauncher by enabling support for shared memory allocations exceeding 48KB. It involves special handling by querying the device for maximum memory.
- Dynamic Shapes Code Generation: This pull request makes code generation for dynamic shapes more device agnostic. It addresses the assumption that devices are either CPU with Cpp codegen or GPU with Triton codegen, allowing more flexibility.
- Autograd Key Graph Tracing: This pull request addresses not tracing the forward and backward graphs in the autograd key within PyTorch. It was ultimately not merged, as indicated by the title and multiple updates in the commit messages.
- Fake Tensor Prop Caching: This pull request, titled "[invoke_subgraph] Fake tensor prop caching," aimed to reintroduce changes from a previous pull request. It focused on caching properties of fake tensors within subgraph invocations but was not merged.
- PendingUnbackedSymbolNotFound Error: This pull request addresses the "PendingUnbackedSymbolNotFound" error by allowing the intentional creation of unbacked symbols. It provides a method to bypass this error using fake_mode.shape_env.ignore_fresh_unbakced_symbols().
- Non-Contiguous Operations Performance: This pull request enhances performance for non-contiguous operations on larger tensors by replacing the indexed approach with a strided flavor. It significantly reduces execution time for operations like fmax on 1000x1000 strided tensors.
- Prologue Fusion with constant_pad_nd: This pull request includes constant_pad_nd in prologue fusion within PyTorch. Benchmarking revealed occasional speedups, prompting the change along with a fix for creating a single, contiguous dependency for prologues.
- Row-Wise Scaled MM Refactoring: This pull request refactors the row-wise scaled matrix multiplication (MM) by adding configuration selection for SM89.2. It ensures kernels are only built when compiling for the specified architecture.
- Tensors with requires_grad=True Warning: This pull request addresses tensors with requires_grad=True being converted to scalars without warning. It introduces a user warning to alert developers of potential unexpected behavior when using operations like math.pow (a short sketch of the behavior follows this list).
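To illustrate the silent scalar conversion the warning above targets, a small sketch (assumed semantics; the exact warning text comes from the PR and is not shown here):

```python
import math
import torch

x = torch.tensor(2.0, requires_grad=True)

y = math.pow(x, 3)   # implicitly converts x to a Python float: returns 8.0, no autograd
z = torch.pow(x, 3)  # stays a tensor and keeps the autograd graph
z.backward()
print(y, x.grad)     # 8.0 tensor(12.)
```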
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Add XPU and SYCL Merge Patterns
- Toxicity Score: 0.55 (Frustration expressed, defensive responses, mediation attempts, escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 249 | 78 | 15 | 242 |
guilhermeleobas | 354 | 7 | 2 | 28 |
justinchuby | 123 | 19 | 6 | 133 |
XuehaiPan | 218 | 10 | 0 | 34 |
zou3519 | 16 | 4 | 16 | 222 |
jamesjwu | 170 | 15 | 12 | 50 |
laithsakka | 107 | 24 | 8 | 97 |
atalman | 130 | 25 | 16 | 65 |
cyyever | 138 | 39 | 0 | 30 |
jansel | 112 | 18 | 0 | 71 |