Weekly GitHub Report for Pytorch: April 14, 2025 - April 21, 2025 (12:02:13)
Weekly GitHub Report for Pytorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for Python 3.13 with `torch.compile`, a new `torch.compiler.set_stance` feature for dynamic compilation control, and FP16 support on X86 CPUs. Notably, the release also marks a shift away from publishing on Conda, focuses on Manylinux 2.28 for Linux builds, and introduces a backward-incompatible change by setting `weights_only=True` as the default for `torch.load`.
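For readers upgrading, here is a minimal sketch of what the new default means in practice (the checkpoint file name is an arbitrary assumption):

```python
import torch

state = {"weight": torch.randn(3, 3)}
torch.save(state, "checkpoint.pt")

# In PyTorch 2.6, torch.load defaults to weights_only=True, which restricts
# unpickling to tensors and other allow-listed types.
loaded = torch.load("checkpoint.pt")

# Loading arbitrary pickled Python objects now requires opting out explicitly
# (only do this for checkpoints from a trusted source):
loaded = torch.load("checkpoint.pt", weights_only=False)
```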
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- index_select performance: This issue discusses the performance differences between the `index_select` and `gather` functions in PyTorch, with benchmarking results indicating that `gather` is consistently faster and more expressive than `index_select`. The issue suggests that if further benchmarking supports these findings, the `gather` function should be expanded to cover all data types supported by `index_select`, and the `index_select` kernels should be removed in favor of calling `gather` (a minimal equivalence sketch follows this list).
  - The comments discuss various aspects of the performance differences, including specific benchmarking results, the impact of recent code optimizations, and the historical context of `index_select` as a legacy function. Participants also explore the potential for `index_select` to call `gather` directly and clarify the differences between various indexing methods in PyTorch.
  - Number of comments this week: 12
- [ONNX] exported nodes of Multi-head attention can be simplified: This issue involves the export of the `nn.MultiheadAttention` layer from PyTorch to ONNX, where the user observes unexpected additional operations in the exported model. The user is questioning whether these additional operations are a bug or a feature of the export process.
  - The comments discuss the user's expectations and provide a code snippet of the wrapper used around `nn.MultiheadAttention`. A request is made for a reproducible script, which the user provides via a GitHub repository. It is explained that the additional operations are due to PyTorch's implementation, and suggestions are made for potential optimizations, including using `torch.onnx.export(..., dynamo=True)` and considering graph rewrite rules. An optimized version is shared in the user's repository.
  - Number of comments this week: 7
- Sparse tensor conversion performance issues (CPU/GPU): This issue highlights performance concerns related to the conversion of sparse tensors in PyTorch, specifically when converting from dense to sparse formats like COO and CSR on both CPU and GPU. The user reports significant differences in memory usage and processing time between these two conversion methods, with CSR showing unexpectedly high memory consumption and time delays.
  - The comments discuss potential causes for the performance spikes, with contributors suggesting optimizations and sharing code modifications to address the issues. A proposed fix involves optimizing memory usage by altering the code responsible for generating indices, and there is a consensus that the memory characteristics of COO and CSR conversions should be comparable. A write-up on the topic has been shared for further discussion.
  - Number of comments this week: 7
- Compatibility with SymPy 1.14.0: This issue is about ensuring compatibility between the new prerelease SymPy 1.14.0rc1 and the current release of PyTorch, specifically torch==2.6.0, to prevent any potential problems when the final version of SymPy 1.14.0 is released. The user is seeking confirmation on whether the new SymPy version will cause any issues with PyTorch and is asking whether PyTorch's continuous integration (CI) tests this prerelease version.
  - The comments discuss running tests to check compatibility, with initial tests passing locally and in CI. There is a concern about potential issues with the current release torch==2.6.0, but further tests indicate no significant problems. The discussion also mentions the upcoming release of torch==2.7.0, which is expected to be similar to the main branch, providing confidence in compatibility.
  - Number of comments this week: 6
- [export] Warn users when 0/1 specialization happens: This issue addresses the confusion users experience when an axis specified as dynamic is unexpectedly specialized, particularly when a dynamic batch size is intended but an example with batch_size=1 is provided. The proposal is to emit a warning suggesting users change the example dimension size to greater than one to avoid this specialization.
  - The comments discuss the use of `Dim.AUTO` versus `Dim.DYNAMIC` in ONNX export, with `Dim.AUTO` being used due to constraints with `Dim.DYNAMIC`. There is a conversation about whether falling back to static dimensions can sometimes allow a model to export correctly, and it is noted that users might set more axes to dynamic than necessary.
  - Number of comments this week: 6
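As a rough illustration of the equivalence discussed in the index_select issue above, here is a minimal sketch (tensor shapes are arbitrary assumptions) of expressing a 1-D `index_select` along dim 0 with `gather`:

```python
import torch

x = torch.randn(1000, 64)
idx = torch.randint(0, 1000, (256,))

a = torch.index_select(x, 0, idx)
# gather requires an index tensor with the same number of dimensions as x,
# so the 1-D index is expanded across the non-indexed dimension
b = torch.gather(x, 0, idx.unsqueeze(1).expand(-1, x.size(1)))
assert torch.equal(a, b)
```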
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for other kernel sizes. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time (a small demonstration follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a script in a Docker environment with a `tmpfs` permission set to `1777`, where the execution of a cached `cuda_utils.so` file in the `/tmp` directory fails due to the absence of the execution bit, despite the directories having the correct permissions. The error occurs during the execution of a PyTorch model, specifically when attempting to map a segment from the shared object, resulting in an `ImportError` and a `BackendCompilerFailed` exception, which suggests a problem with the execution rights of the compiled CUDA utilities.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently approximately 1,500 files are excluded from this formatting process. The task requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to address known issues such as import cycles and misplaced annotations before the UFMT changes are committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the size of JIT archives, particularly for small models with quantization, to facilitate more efficient deployment on mobile devices where storage space is limited.
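As a quick check of the MaxPool2D decomposition proposed above (the input shape is an arbitrary assumption): with stride 1, a 5x5 max window is the max of overlapping 3x3 maxima, so two stacked 3x3 pools reproduce a single 5x5 pool:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

pool5 = nn.MaxPool2d(kernel_size=5, stride=1)
pool3 = nn.MaxPool2d(kernel_size=3, stride=1)

# with stride 1, max over a 5x5 window equals max over 3x3 windows of 3x3 maxima
assert torch.equal(pool5(x), pool3(pool3(x)))
```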
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 106
Summarized Issues:
- Compilation and Execution Errors: Compilation and execution errors are prevalent in PyTorch, affecting various functionalities. Users report issues such as assertion errors, runtime errors, and incorrect outputs when using features like `torch.compile`, `torch.export`, and `torch.vmap`, often due to device mismatches or unsupported operations. These errors hinder the successful execution of models and require workarounds or fixes to ensure compatibility across different backends and devices (a minimal `torch.vmap` sketch follows this list).
- Backend and Device Discrepancies: Discrepancies between different backends and devices are a common issue in PyTorch, leading to inconsistent results and errors. Users experience problems with functions like `torch.nn.PairwiseDistance`, `torch.outer`, and `torch.linalg.inv`, where outputs vary across backends such as Triton, CPP, and Inductor, often due to precision differences or unsupported operations.
- Export and Serialization Issues: Exporting models and handling serialized data across different architectures and formats pose challenges in PyTorch. Users report issues with `torch.export`, ONNX export, and TorchScript models, where metadata is not preserved and models fail to load correctly on different architectures, leading to errors and incorrect outputs.
- Performance and Optimization Concerns: Performance issues and optimization challenges are frequently reported in PyTorch, affecting both training and inference. Users encounter problems with high memory usage, slow processing times, and performance regressions, particularly when using features like `torch.vmap`, `torch.compile`, and dynamic shapes, prompting requests for optimizations and improvements.
- Documentation and Usability Enhancements: Users frequently request improvements in PyTorch's documentation and usability, highlighting unclear or incorrect information. Issues include misleading documentation for functions like `torch.nn.utils.clip_grads_with_norm_` and `torch.export`, as well as requests for better error messages and user guidance to enhance the overall user experience.
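For context on the `torch.vmap` reports grouped above, a minimal sketch of the API (the function and shapes are illustrative assumptions):

```python
import torch

def dot(a, b):
    return (a * b).sum()

# vmap maps dot over the leading batch dimension of both inputs
batched_dot = torch.vmap(dot)
x = torch.randn(10, 3)
y = torch.randn(10, 3)
print(batched_dot(x, y).shape)  # torch.Size([10])
```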
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 42
Summarized Issues:
- PyTorch Functionality Issues: This category includes various issues related to the functionality of PyTorch features and operations. For instance, the `torch.set_flush_denormal` function does not affect float16 on x86_64 architecture, raising questions about its limitations. Additionally, the `torch.compile` function fails to handle certain operations and data types, such as `quantize_activation` on CPU and `rms_norm` on MPS, leading to errors and incorrect outputs.
- Error Messages and Debugging: Several issues highlight the need for improved error messages and debugging support in PyTorch. For example, vague error messages when running ROCm-specific tests on non-ROCm machines and unclear dynamic shape constraint violations make troubleshooting difficult. Enhancements in error clarity and guidance are suggested to aid users in resolving these problems.
- Backend and Device Compatibility: Issues in this category focus on compatibility problems with different backends and devices. The MPS backend has several bugs, such as incorrect dtype handling in `torch.isin()` and unimplemented operators like `aten::_linalg_solve_ex.result`. Additionally, there are concerns about CUDA version support and device compatibility, such as the lack of support for CUDA 12.1 in PyTorch 2.6.0.
- Compilation and Export Issues: This group includes issues related to the compilation and export processes in PyTorch. Problems such as the failure of `torch.onnx.export` to handle dynamic input sizes and the inability to export certain models due to dtype mismatches are highlighted. These issues suggest the need for improvements in the export functionality to handle various scenarios more robustly.
- Performance and Optimization: Performance-related issues are also prevalent, such as the significant degradation in Triton operator performance for `scaled_dot_product_attention`. Suggestions include using alternative operators to improve execution times, particularly in cross-attention scenarios, indicating a need for optimization in PyTorch's compilation strategies (a minimal cross-attention call is sketched after this list).
- Testing and Validation: Issues in this category focus on the need for better testing and validation processes. For instance, the disabling of certain tests due to gradient accuracy issues and the migration of ONNX exporter tests to a new framework highlight the ongoing efforts to ensure the reliability and correctness of PyTorch's features.
- Documentation Discrepancies: Several issues point out discrepancies between PyTorch's documentation and its actual behavior. These include the behavior of functions like `torch.cdist()` and `torch.nn.Upsample()`, where the documentation does not accurately reflect the implementation, leading to confusion and the need for documentation updates.
- Infrastructure and Build Concerns: This category includes issues related to the infrastructure and build processes of PyTorch. Problems such as build errors with specific compilers and the use of deprecated RPATH in the aarch64 CPU wheel suggest the need for updates and maintenance in PyTorch's build system to ensure compatibility and stability.
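For reference, the cross-attention call pattern named in the performance item above, sketched with illustrative sizes:

```python
import torch
import torch.nn.functional as F

# cross-attention: query and key/value sequence lengths differ
q = torch.randn(2, 8, 128, 64)   # (batch, heads, q_len, head_dim)
k = torch.randn(2, 8, 512, 64)   # (batch, heads, kv_len, head_dim)
v = torch.randn(2, 8, 512, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```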
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 175
Key Open Pull Requests
1. Implement avg_pool3d for MPS backend: This pull request implements the `avg_pool3d` operation for the MPS backend in PyTorch using a custom Metal shader, enabling users with Apple Silicon GPUs to perform 3D average pooling without falling back to the CPU. It includes a C++ interface, support for forward and backward passes, comprehensive test cases, and fixes for issues related to Metal command buffer handling and non-contiguous tensors, addressing issues #141287 and #141044 (a basic usage sketch follows this subsection).
- URL: pull/151742
- Merged: No
- Associated Commits: 795a7, 4b6c1, 1295e, 82ec0, 45378, 1cd86, 24dd4, 97e0f, e6830, abdad, e18f8, d7bcd, fcea8, fa0cb, 8e959, e45e6, 62abb, 42217, 760f5, c96cc, 9320d, 50975, a835c, 7dcf3, ca868, e6541, b915a, af53d, aed0a, f8ea1, 00eed, 5651b, 25e76, b060f, 47f18, adf75, 598ab
2. [Easy] Fix the function signature of torch.Event: This pull request addresses a discrepancy between the declaration and implementation of the `torch.Event` function signature in the PyTorch library, proposing a decision on whether to set the `enable_timing` parameter to `False` for consistency with `torch.cuda.Event` or to `True` to avoid breaking backward compatibility.
- URL: pull/151221
- Merged: No
- Associated Commits: 25eb3, b2acb, 2fcd3, e5a14, 0f9d1, 0fa2b, be90c, 7cfc4, 9e957, e2f73, 5f5c4, 0df2b, dabbf, bf985
3. Broken Links GHA: This pull request introduces a GitHub Action that runs monthly to check for broken links within the repository, and if any are found, it automatically creates an issue listing the problematic links.
- URL: pull/151454
- Merged: No
- Associated Commits: 12318, 22c4f, 6cb4b, 8887f, 73a25, 49982, 261ab, 41ccd, f1f70, c6d60, 7618b, 5329e, d59d2, 00b3f
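For context on the first pull request above, a minimal `avg_pool3d` call; whether it runs natively on MPS depends on the PR, so the MPS line is hypothetical:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 16, 16)  # (batch, channels, depth, height, width)
out = F.avg_pool3d(x, kernel_size=2)
print(out.shape)  # torch.Size([1, 4, 4, 8, 8])

# With the PR applied, Apple Silicon users could run the same op on MPS
# (hypothetical until merged):
# out = F.avg_pool3d(x.to("mps"), kernel_size=2)
```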
Other Open Pull Requests
- Documentation Enhancements for torch.Event: This topic involves improving the documentation of `torch.Event` by adding detailed function or class signatures and correcting the display of `torch.Event.wait` and `torch.Event.record`. The pull request aims to fix and enhance the documentation to provide clearer and more accurate information for users.
- Performance Improvements for CK Gemm on ROCm: The pull request introduces initial changes to improve the performance of CK Gemm on ROCm by reorganizing the CK Gemm code into a dedicated folder. It also adds logic to call CK Gemm with specific templates based on input tensor sizes, adapting the gemm selection logic from the FBGEMM project.
- Dynamic Shapes and Symbolic Shapes Enhancements: This topic covers enhancements to the PyTorch project by experimenting with the `bound_sympy` tool to enable size-oblivious maximum reasoning for dynamic shapes. It addresses compile-time regressions and involves multiple updates to the `symbolic_shapes.py` file.
- Test Skipping Decorators and Class-Level Support: The pull requests address issues with the `skipIfXpu` and `skipIfHpu` decorators incorrectly disabling tests when applied to a class. They enhance the functionality by enabling class-level skipping, as part of a series of changes tracked through the ghstack tool.
- Infrastructure for Built-in Operations: This topic involves reapplying a previous update to implement infrastructure for handling built-in operations such as `min`, `max`, and `math.pow`. The pull request is part of a stack of changes managed by ghstack, with multiple updates and revisions to ensure non-strict behavior for these operations.
- Deprecation of Legacy Host Allocator APIs: The pull requests aim to deprecate the legacy host allocator APIs in favor of a unified API, `getHostAllocator(device_type)`, providing a more streamlined and consistent interface for memory allocation and management tasks. They also plan to move the `is_pinned` function from `AcceleratorHookInterface` to `HostAllocator` and deprecate `getPinnedMemoryAllocator`.
- CUDAAllocator Simplification: This pull request simplifies and reduces redundancy in the `CUDAAllocator` by removing the custom `raw_alloc` and `raw_delete` methods. It uses the existing `raw_allocate` and `raw_deallocate` methods from `c10::Allocator`, which are now virtual to allow for customization by other allocators.
- Cutlass Component Enhancements: The pull requests address fixes for end-to-end (e2e) compilation issues related to argument rendering in the Cutlass component. They also enhance the Cutlass library by adding epilogue inputs and outputs to the `def_kernel` function, as part of a series of related updates tracked through the ghstack tool.
- Caching and Fake Tensors: This pull request introduces a feature for caching fake tensors when the output is None, as part of a series of changes in the PyTorch project. It includes multiple updates and commits refining the implementation.
- torch.arange() Precision Fix: This pull request addresses a corner case in the `torch.arange()` function where casting start, end, or step values to `int64_t` could lead to precision loss. It implements a workaround using double arithmetic for values within the exact representable range of double, for consistency across devices (a small illustration follows this list).
- MixtureSameFamily Distribution Bug Fix: This pull request addresses a bug in the PyTorch library related to the `MixtureSameFamily` distribution by ensuring that sample validation occurs after padding in the `log_prob` method. It corrects the support to match the component distribution with the first event dimension removed.
- ROCm CI Environment Upgrade: This pull request aims to upgrade the ROCm Continuous Integration (CI) environment to ROCm version 6.4. It involves updates to all ROCm GitHub workflows to use the Jammy distribution and modifications to the `install_rocm.sh` script.
- Test Skipping and SM89 Tests: This pull request addresses the need to skip Triton tests for MPS and modifies the reason for skipping SM89 tests to not rely on the IS_BIG_GPU condition. It combines improvements from two previous pull requests.
- Guard Checking Logic Refactor: This pull request refactors the guard checking logic by lifting it into the AOTAutogradCache. It involves creating a new GuardedCache class and adding a `check_guard_hit` lambda to FXGraphCache.
- Generalized Installation Process: This pull request aims to generalize the installation process to accommodate inputs that are neither explicitly defined nor capable of being flattened by pytree. It is part of a series of updates and commits in the PyTorch project.
- Enhancements for Static Value Detection: This pull request introduces enhancements to the PyTorch project by allowing the use of `statically_known_true` in user code. It adds a new function `has_static_value` to determine whether an input has a static boolean, float, or integer value.
- Compile-Time Traces for invoke_subgraph: This pull request introduces compile-time traces for the "invoke_subgraph" feature in the PyTorch project. It is part of a series of related changes managed through the ghstack tool.
- DTensor HOP Dispatch Feature: This pull request introduces a feature called "DTensor HOP dispatch" to the PyTorch project. It involves multiple updates and is part of a stack of related changes, with testing focused on distributed tensor attention functionality.
- Autocast Context Manager Handling: This pull request addresses the handling of the autocast context manager within the hierarchical compile process in the PyTorch project. It is part of a series of related changes tracked by ghstack.
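As a small illustration of the `torch.arange()` corner case described above (the values are arbitrary assumptions), the element count follows ceil((end - start) / step), the quantity the fix computes in double arithmetic:

```python
import math
import torch

start, end, step = 0.0, 10.0, 0.3
t = torch.arange(start, end, step)

# the length matches the double-precision computation of the bound
assert len(t) == math.ceil((end - start) / step)  # 34 elements
```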
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 238
Key Closed Pull Requests
1. Implement fexp for avx2 and avx512: This pull request implements a fast exponential computation (fexp) for AVX2 and AVX512 architectures to optimize flash attention on X86 with FP16 support, based on a 2015 paper by Malossi et al. It achieves up to 20% faster performance for mixed-precision flash attention compared to the current implementation, with precision valid in hybrid mode (fp32 -> fp16) due to casting during store operations, as demonstrated by benchmarks on a Xeon 6972P machine.
- URL: pull/151434
- Merged: No
- Associated Commits: ffbd8, b5fa1, a52fd, f1ab9, df83f, fed42, 6f522, 72bcf, 3ef90, ddc41, 9503d, e30d7, 72e61, 756ed, 86968, 15111, b8ac9, 063c9, ad359, 23ba1, 63ff9, baed5, 20ad3, 767b6, 992b2, b74dd, 32a7e, 555e7, 7b71e, af779, 17fba, db9fa, d4955, 144fd, 3de4b, b3b50, 5920b, 17b07, 76a16, cfba6, 100d1, 3ce4f, ba7e3, 3beb3, 9db67, 24ead, 601f9, 4c7c6, 7eb42, 1a71d, 05a99, 11c49, 710a9, 37d34, efadc, 89359, d4e86, 2b00c, 84107, a8ec5, cd7ab, 95fcf, e11f7, cce99, 56bc3, ad432, 433f1, 7ec0e, 33439, 767a1, f7eb1, 93fb6, 913a7, 504b4, dca15, dc656, 20771, 34daf, 40ba5, 1eec6, 23612, 9f133, adf1a, 14f2b, 4dc1e, 34046, 32207, 12bc1, 0bb49, 27a22, de67c, 92aad, a63a2, b9a66, b4da5, 9a918, d5213, 757ae, f2f52, 90b50, fbb55, 13a85, b6387, b303d, 40535, d7d47, 34a56, 4f4a8, 4ed7e, 2412a, 0ec38, d6ec8, 80c60, 0316d, 82566, 6e949, 8c787, 29fe9, dd59e, 51449, 2f66c, b7ec7, 3dcb5, 69507, 5b445, 836bd, 36c5d, ba255, 2ff8b, fd214, 8d740, 2265b, 71670, f361b, ea571, 83b88, 73ccc, 3301c, c9d43, e15c8, 5e8bb, 46fc5, ef561, 8447e, 4b3b8, 4318e, c4b1b, 6aaeb, 20cfc, 7a592, 9f0cb, f4f6e, d4eb2, 4874c, 3c33f, 95fe6, a9274, 81dd4, d388f, 2b90f, 84764, f5fd7, c1983, 9b183, 382e1, a3ce8, 714df, 5578b, 3c46a, 43c32, 042fc, cbdd4, 96280, 9bb11, 11904, 27354, 54295, 815a3, 9b64e, 16b07, 11c0c, e3bc4, 1d3ae, db0af, 17e36, b4ce4, f413a, 3aafc, 837ed, d2b00, 950de, cf99c, 52142, 19dd7, bcc56, 157c7, 9f200, 7be16, c7962, 4a017, 1380d, 2536b, 848c7, 69508, 85cf8, a46fe, 60469, 9553e, 18ada, 2b229, ddeaf, 94479, 7c71f, 74797, dc848, 0351c, 328a0, 9b7d8, b9865, b679c, fc5f8, 03818, 41ecf, 7d4be, 2c780, 4e4fc, 07c9b, 43f92, 89bb4, 58af4, 5600b, ed199, 433d7, ba083, f0ced, 2e490, 91661, b6536, 8d1e5, 24a75, 3d3bb, 40868, c8a27, fc72f, 8cc68, 5c334, 8875d, 88b9f, 3b768, f1612
2. Maxpool Perf Improvement targeting resnet scenarios: This pull request aims to improve the performance of the max pooling operation, specifically targeting scenarios involving ResNet architectures.
- URL: pull/151720
- Merged: No
- Associated Commits: e1306, 4a3da, ba275, cea56, 612fc, 828d6, 6b14e, 9e315, 98870, 813e0, a889c, 4b5bf, 1db2a, 39209, 19544, b7de7, fb276, becdf, ffed7, c5e52, dd732, 4e6a9, 7a007, 7c550, 6e867, 9b80d, 4b030, cbe47, 2e2c0, 17157, ecd33, c2578, 39641, 6101a, 24bd8, aa574, bc421, 051df, 550ed, 57717, d80f5, 69ed7, 70298, 17d25, 058d3, 8a71e, 8af31, f8c4c, 3a541, 1a0b1, 0b45a, 0b1b6, 1de13, 53752, 783a6, 119e7, 417a0, 32f58, 7d26c, 4be8e, ff940, f6ad6, da863, 50eb2, bb7fd, 8fa58, d8002, ba7c2, 00314, dc956, b253b, 757cb, 8a12b, db943, 5e412, 6efce, 79740, b654f, 362be, 2acd2, aaa31, ed475, 09af6, 088b8, dca35, 02220, 6df27, dca53, 08c07, d88da, feade, 5b76f, 8ec01, 06b6a, f0fb4, 2096c, 75e26, 0e2b4, 9ba9a, 6a4c4, 17dc4, 49c8b, 72908, 7d7ec, 5d212, c3ba1, f0927, abbfe, f069c, f8544, 1ed41, 826ee, b0ea6, e814e, f6389, 78a47, f0207, c5667, d8a7a, 45946, 61ba0, 01137, 0a0be, 78426, a4935, f929e, 33911, 4bed2, 60cb6, dacf5, ff48a, 80f18, 47074, b5380, 5d018, 6a281, 8b752, 8b59e, 38c82, a6f13, 4b515, 8e47c, 23d1a, f27cc, a07b6, 80e18, 1550e, 8ecf0, 4ed5c, 595a2, 8b7ad, 58e54, a02ca, 0e782, c040e, dda59, e481a, a1efa, 4b826, 5eaa4, 0e8bd, 2b906, 01a0b, fb716, 94e61, 94412, fe82c, 973af, cb954, 5a980, 45e62, f2ca4, a92b4, b2e4f, 0119c, 6d856, a0985, a0aaf, 5ca0b, 64033, fc456, 84209, 6c84d, 74fb9
3. Propagate callable parameter types using ParamSpec (#142306): This pull request aims to enhance the PyTorch codebase by propagating callable parameter types using `ParamSpec`, addressing partial issues related to type annotations and mypy compatibility. It includes various commits that reorder function parameters, adjust return types, and make formatting changes to satisfy linters and avoid type errors (a minimal `ParamSpec` sketch follows this subsection).
- URL: pull/151014
- Merged: No
- Associated Commits: b83b6, f7218, ce461, fccfb, 5af2d, 9bdb2, 7a6f6, f36a4, 0a002, ebef9, 42666, 27d84, 65c28, 81d59, ef2dc, 003b8
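For readers unfamiliar with the typing feature used in the third pull request, a minimal self-contained `ParamSpec` sketch (the decorator is a hypothetical example, not PyTorch code); it requires Python 3.10+:

```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def logged(fn: Callable[P, R]) -> Callable[P, R]:
    # the wrapper advertises the same parameter types as the wrapped function,
    # so mypy can check call sites of the decorated function precisely
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@logged
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor

print(scale(3.0))  # prints "calling scale", then 6.0
```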
Other Closed Pull Requests
- Event Handling Enhancements: This topic covers improvements in PyTorch's event handling, addressing issues with `event_id` always being 0 and adding checks for `elapsedTime`. These changes aim to reduce user confusion and ensure robustness through additional tests.
- GEMM and Matrix Operations: Enhancements in matrix operations include support for submatrices in GEMM and ScaledGEMM within the ROCm framework and an Aten GEMM overload for FP32 output from FP16/BF16 inputs. These updates improve functionality and efficiency, although the latter was not merged.
- Memory Management and Optimization: The introduction of a `HostAllocator` class standardizes host memory management across backends, while optimizations in graph partitioning and dispatching mechanisms enhance performance. These changes aim to improve maintainability and reduce computational overhead.
- Symbolic Shape Handling: The addition of `sym_and` and `sym_or` functions allows for variadic arguments, simplifying symbolic expressions. This enhancement preserves symbolic expressions for better runtime assertions and branch preservation.
- Compile Time and Tracing Improvements: Enhancements in compile time tracing within the AOT autograd component include logging and timing mechanisms. These changes aim to optimize the compilation process by addressing significant missing gaps.
- Tiling and Kernel Optimization: The removal of unnecessary singleton tiling splits optimizes Triton kernel generation by eliminating superfluous dimensions. This change reduces computational overhead and improves kernel fusion efficiency.
- CI and Testing Enhancements: Continuous integration for the "openreg" component is enabled by relocating test files and updating documentation. These changes ensure better testing coverage and integration.
- Compilation Warnings and Functionality: Addressing compilation warnings in the BlasKernel component involves removing unused functions, while changes to avoid specializing min/max functions improve code quality. These updates enhance maintainability and functionality.
- User Experience and Error Handling: Enhancements in error messages for relaxed constraints and validation of `inputs` in `torch.autograd.backward` improve user experience. These changes involve multiple updates and contributions (a short usage sketch follows this list).
- Inductor and Libdevice Operations: The removal of unnecessary libdevice operations in the inductor component optimizes code generation. This change eliminates extra operations that were previously needed for dispatching.
- ONNX Export and Data Type Handling: Fixes in the ONNX export process address incorrect conversion of bfloat16 initializers. These changes ensure proper handling of bfloat16 data types in PyTorch models.
- No-Operation Elimination: Enhancements in noop elimination for `slice` and `slice_scatter` operations improve efficiency. These changes include tests and improvements, although they were not merged.
- Standalone Compile Function Improvements: Updates in the `standalone_compile` function ensure correct handling of multiple return values and prevent mutations. These changes enhance the function's reliability and integration with custom backends.
- Flash Attention and Tensor Handling: Saving Q, K, V tensors in flash attention processes involves debugging and adding annotated kernels. These changes focus on improving the attention mechanism's efficiency.
- Tensor Release and Memory Management: Addressing tensor release issues with `pin_memory` involves multiple updates. These changes improve memory management and are supported by related pull requests.
- MPS Backend Benchmarking: Initiating benchmarking for the MPS backend assesses compile results for pass rates and speedup. These changes involve a series of commits and discussions to enhance performance evaluation.
- ROCm and CI/CD Process: The creation of ROCm 6.4 images as part of CI/CD omits the magma tarball and includes updates like switching to Ubuntu 22.04. These changes aim to streamline the integration and delivery process, although they were not merged.
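For context on the `inputs` validation item above, a minimal sketch of the argument being validated (the tensors are illustrative):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
loss = (x * y).sum()

# restrict gradient accumulation to x; y.grad stays None
torch.autograd.backward(loss, inputs=[x])
print(x.grad is not None, y.grad is None)  # True True
```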
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- [WIP] Move is_pinned to host allocator
- Toxicity Score: 0.55 (Frustration expressed, Defensive tone, Unresolved tension.)
- This GitHub conversation involves multiple users discussing a work-in-progress pull request. User1 initially provides a solution, but User2 expresses frustration over its ineffectiveness, leading to a tense exchange. User3 attempts to mediate by suggesting alternative approaches, but User1's defensive tone exacerbates the situation. The conversation remains unresolved, with underlying tension due to differing opinions on the implementation strategy.
-
- Toxicity Score: 0.55 (Defensive responses, Frustration expressed, Lack of clarity)
- This GitHub conversation involves username1 and username2, where username1 initially provides feedback on a proposed change, and username2 responds with a defensive tone. The conversation escalates as username1 expresses frustration over the lack of clarity in username2's explanations, leading to a tense exchange.
- [fake tensor cache] Support index with non bool/int8 indices
- Toxicity Score: 0.55 (Frustration expressed, defensive responses, mediation attempts, unresolved dissatisfaction.)
- This GitHub conversation involves several users discussing a proposed change, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction triggers further tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 188 | 27 | 12 | 139 |
FFFrog | 128 | 15 | 0 | 14 |
anijain2305 | 106 | 19 | 18 | 10 |
mlazos | 126 | 18 | 0 | 4 |
pianpwk | 99 | 20 | 2 | 25 |
laithsakka | 85 | 17 | 7 | 32 |
guilhermeleobas | 118 | 13 | 1 | 0 |
guangyey | 91 | 9 | 0 | 31 |
justinchuby | 49 | 6 | 6 | 69 |
StrongerXi | 81 | 6 | 16 | 23 |