Weekly GitHub Report for PyTorch: February 18, 2025 - February 25, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default behavior of torch.load.
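As a quick illustration of the new stance control, here is a minimal sketch of how torch.compiler.set_stance can be used alongside torch.compile; the stance name "force_eager" follows the 2.6 release notes, and the function and shapes are made up for demonstration only:

```python
import torch

@torch.compile
def fn(x):
    return x.sin() + x.cos()

x = torch.randn(8)
fn(x)  # compiles on first call, then runs the optimized artifact

# Temporarily bypass compilation and run the original Python, e.g. while
# debugging (stance names here follow the 2.6 release notes; illustrative).
with torch.compiler.set_stance("force_eager"):
    fn(x)  # runs eagerly without triggering a new compile
```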
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [compile] Modularize very long compilation: This issue addresses the problem of a lengthy compilation process during model export/compile in a GitHub project, where a single generated C++ file with over 78,000 lines takes more than an hour to compile using only one core. The user suggests modularizing and parallelizing the compilation process to improve efficiency, as the current method lacks intermediate progress and is time-consuming.
- The comments discuss the potential causes of the issue, including the generation of a large Triton kernel and the need for modularization. Contributors suggest splitting the file into smaller parts for parallel compilation, though this may be challenging with the current architecture. There is also a mention of testing with lower optimization levels and the possibility of using subgraph handling to manage large models.
- Number of comments this week: 11
- [Export AOTI] dynamic_shapes export and compile degraded output: This issue involves a bug in exporting and compiling a model with dynamic shapes using PyTorch, where the output is degraded when dynamic width (W) and height (H) are used, compared to when they are fixed. The problem seems to be related to the use of torch.export.Dim for dynamic shapes, which causes runtime errors during inference unless the dimensions are aligned with the inference resolution (see the export sketch after this list).
- The comments discuss the difficulty in debugging the issue without a reproducible example, suggest testing subparts of the model, and mention a tool for minimizing accuracy issues. A runtime error is identified when using dynamic shapes, which is masked when using AOTI compile and package. A minimal reproduction is provided, and it is noted that the core issue might be an invalid graph produced during export, with AOTI and compile errors being secondary.
- Number of comments this week: 10
- [RFC] Test Cases Enabling for Accelerators: This issue addresses the challenge of enabling existing PyTorch test cases for new device backends, such as accelerators, by proposing a mechanism that dynamically determines which tests to run, skip, or adapt based on a device's specific capabilities. The proposed approach involves creating a unified device-capability abstraction, dynamic capability registration, and capability-based decorators to refine the test suite for handling multiple backends efficiently.
- The comments discuss extending OpInfo for device capabilities, aligning the proposal with ongoing work, and the potential benefits for both in-tree and out-of-tree backends. Questions are raised about the primary use case, adoption challenges, and compatibility across hardware. The proposal is seen as beneficial for third-party vendors, with plans to integrate device capabilities into existing test infrastructure.
- Number of comments this week: 9
- Triton pin update for PyTorch 2.7 / Triton 3.3: Upgrading PyTorch-Triton to a version that Supports Blackwell: This issue involves updating the PyTorch-Triton integration to support the Blackwell architecture by upgrading to a version of Triton that includes necessary optimizations and features. The update aims to address various technical challenges and ensure compatibility with the upcoming PyTorch 2.7 release, while also tracking related issues and potential improvements.
- The comments discuss the urgency of updating Triton to support Blackwell, with concerns about unresolved issues and the timing of the update relative to the PyTorch 2.7 release. Contributors highlight specific test failures and compatibility issues, propose solutions, and track additional related issues, emphasizing the need for coordination and careful planning to ensure a smooth transition.
- Number of comments this week: 8
- PyTorch VS2022 official build Windows binary illegal instruction on AVX2(max ISA level) CPU: This issue concerns a bug in the PyTorch official build for Windows using Visual Studio 2022, where an illegal instruction error occurs on CPUs with a maximum ISA level of AVX2 due to the generation of AVX512 instructions. The problem does not affect current PyTorch official binaries built with VS2019, and it is challenging to reproduce locally, suggesting it might be specific to the official build environment.
- The comments discuss potential solutions, including involving Microsoft, understanding the issue's scope across platforms, and maintaining AVX2 support due to its prevalence in client CPUs. There is a suggestion to revert to VS2019 if the issue persists, and a clarification that the proposal was not to drop AVX2 but to make it the new base architecture.
- Number of comments this week: 7
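For readers unfamiliar with the export API referenced in the dynamic-shapes issue above, the following minimal sketch shows the general torch.export.Dim call pattern; the module, dimension names, and bounds are hypothetical and unrelated to the reporter's model:

```python
import torch
from torch.export import Dim, export

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

# Hypothetical bounds chosen only to illustrate the API shape.
H = Dim("H", min=32, max=1024)
W = Dim("W", min=32, max=1024)

ep = export(
    Net(),
    (torch.randn(1, 3, 256, 256),),
    dynamic_shapes={"x": {2: H, 3: W}},
)

# The exported program can then be run at other resolutions within the bounds.
out = ep.module()(torch.randn(1, 3, 512, 384))
```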
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- DISABLED test_transformer_training_is_seq_parallel_False (__main__.DistTensorParallelExampleTest): This issue pertains to a disabled test, test_transformer_training_is_seq_parallel_False, within the DistTensorParallelExampleTest suite, which is failing on the main branch of a project using the ROCm platform. The failure is suspected to be caused by changes introduced in one of the pull requests #122995, #122996, or #122997, and several contributors and maintainers have been tagged for further investigation and resolution.
- [NestedTensor] multiply batch and ragged dimension to get shape of values tensor: This issue discusses a proposed feature for the PyTorch library that involves manipulating the dimensions of a NestedTensor by multiplying the batch and ragged dimensions to reshape the values tensor. The suggestion aims to enhance the flexibility of tensor operations by allowing users to collapse the first two dimensions of a NestedTensor, thereby facilitating more complex tensor manipulations.
- Error: command buffer exited with error status.: This issue describes a problem encountered while training a model using llama2.c on an iMac with an AMD Radeon Pro 5700 XT GPU, where the user experienced a "command buffer exited with error status" during the training process. The error, which occurred at epoch 11,580, was associated with significantly increased epoch times and GPU timeout errors, potentially linked to garbage collection or other factors, although the user was able to resume training without further issues.
- scalar_tensor call with symbolic bool input does not work in inductor: This issue involves a bug in the PyTorch library where the scalar_tensor function fails when called with a symbolic boolean input while using the Inductor backend. The error occurs during the execution of a compiled function, resulting in a TypeError due to an unexpected object type, which prevents the function from running successfully.
- Support AOT Autograd level Caching: This issue addresses the need for caching in the torch.compile process when using an aot-autograd enabled backend, as the current compilation time for models like Llama2 7B is significantly long, impacting development speed. The problem is particularly pronounced in the integration of PyTorch/XLA with VLLM, where the lack of support for dynamic shapes results in repeated compilations for different input shape combinations, causing delays in the warm-up phase (a brief sketch of this shape-specialization behavior follows this list).
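To illustrate the recompilation cost motivating the caching request above, here is a small hedged sketch; the backend choice, function, and shapes are arbitrary and only demonstrate that static-shape compilation specializes per input shape while dynamic shapes avoid it:

```python
import torch

def mlp(x, w):
    return torch.relu(x @ w)

w = torch.randn(64, 64)

# With static shapes, each distinct input shape can trigger a fresh
# compilation, which is the cost that AOT-Autograd-level caching targets.
compiled = torch.compile(mlp, backend="aot_eager", dynamic=False)
for seq_len in (8, 16, 32):          # three shapes -> up to three compiles
    compiled(torch.randn(seq_len, 64), w)

# Marking shapes dynamic avoids shape-specialized recompiles for this case.
compiled_dyn = torch.compile(mlp, backend="aot_eager", dynamic=True)
for seq_len in (8, 16, 32):
    compiled_dyn(torch.randn(seq_len, 64), w)
```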
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 96
Summarized Issues:
- API and Export Issues in PyTorch: The need for a robust API in torch.export is highlighted due to issues with treating certain inputs as constants, leading to ValueError during export. Additionally, dynamic shape export failures occur due to division by zero errors, complicating the decomposition path and raising questions about handling real-tensor and fake-tensor tracing paths effectively.
- Bugs in PyTorch's Dynamo and Compilation Process: PyTorch's Dynamo faces issues with constant tensors not recompiling correctly with device guards, leading to CUDA device failures. Furthermore, the torch.compile function fails with dict_items iteration, and dynamic shapes in export and compile processes result in degraded outputs and runtime errors.
- Precision and Performance Issues in PyTorch: PyTorch faces precision discrepancies with the polygamma function and NaN values during backward passes in neural networks. Performance regressions are noted in specific projects, with slower execution times in newer PyTorch versions.
- Backend and Device-Specific Bugs in PyTorch: Bugs are reported in PyTorch's MPS backend with scaled_dot_product_attention and clamp_ operations, causing crashes and inconsistent behavior. Additionally, issues with the ROCm backend and CUDA device errors are highlighted.
- Gradient and Memory Management Issues in PyTorch: PyTorch's gradient checkpointing feature does not reduce memory usage as expected, and memory allocator lock contention in templated GEMMs leads to performance degradation. These issues highlight the need for better memory management strategies (a minimal checkpointing sketch follows this list).
- ONNX Export and Conversion Issues in PyTorch: Significant precision drops and errors occur when exporting models to ONNX format, particularly with the sigmoid function and Graph Attention Networks. These issues suggest a need for improved conversion processes.
- Documentation and API Inconsistencies in PyTorch: Several documentation errors and API inconsistencies are noted, such as incorrect references in method docstrings and discrepancies in default behavior descriptions. These issues necessitate updates for clarity and accuracy.
- Sharding and Distributed Training Challenges in PyTorch: Issues with sharding strategies and distributed tensor operations are reported, including missing strategies for specific operators and challenges with asynchronous communication in NCCL process groups.
- Model Export and Compilation Errors in PyTorch: Errors occur during model export and compilation, such as torch.jit.trace failing with retinanet_resnet50_fpn() and torch.export.export encountering guard conditions. These issues highlight the need for robust export mechanisms.
- Feature Requests and Enhancements in PyTorch: Requests for new features and enhancements include implementing the L-BFGS-B algorithm, enhancing Dim.AUTO functionalities, and exposing NCCL API for runtime estimation. These requests aim to expand PyTorch's capabilities and improve user experience.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 57
Summarized Issues:
- Test Failures and Disabling Tests: This topic covers multiple issues related to the disabling of tests in the PyTorch project due to failures on the main branch. The tests test_real_imag_view_lazy_complex128 and test_flatten_nonview_xla were disabled in their respective suites due to consistent failures, with references to recent failure examples provided in the issues.
- Compilation and Export Errors: Several issues highlight problems with compilation and export processes in PyTorch. Users encountered errors when exporting models to ONNX, including a RuntimeError due to a tensor requiring gradients and an AttributeError related to dynamic shapes, complicating deployment on platforms like NVIDIA Triton.
- Bugs in PyTorch Functions: Various issues report bugs in PyTorch functions, such as torch.cholesky_solve triggering an internal assertion error and torch.randn producing identical values across dimensions on macOS. These bugs affect the expected functionality and require fixes to align with documentation.
- Performance and Optimization Concerns: Issues in this category discuss performance discrepancies and optimization needs in PyTorch. For instance, the torch.compile() function was found to be slower than expected, and there were suggestions to optimize certain functions to improve performance.
- Platform-Specific Errors: Some issues are specific to certain platforms, such as ROCm or macOS, where users encountered errors like core dumps during matrix multiplication or identical random values in tensors. These platform-specific issues require targeted solutions to ensure compatibility.
- Configuration and Regression Issues: Several issues involve configuration problems and regressions in PyTorch, where previously working functions fail due to recent updates. These issues often require reverting changes or adjusting configurations to restore functionality.
- Documentation and API Consistency: Some issues highlight discrepancies between documentation and actual API behavior, such as the torch.cuda.clock_rate() function's return value. Ensuring consistency and clarity in documentation is crucial for user understanding and correct usage.
- Security and Compliance: An issue addresses the need for FIPS compliance in PyTorch by modifying the hashlib.md5() function to prevent errors on systems enforcing FIPS modules, highlighting the importance of security compliance in software development (an illustrative snippet follows this list).
- Debugging and Logging Enhancements: Enhancements to debugging capabilities in PyTorch are discussed, such as introducing new logging options to facilitate easier debugging of intermediate representations, replacing older mechanisms for improved developer experience.
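For the FIPS compliance item above, a minimal illustrative snippet of the usual mitigation pattern; the exact change made in PyTorch may differ, and the cache-key data here is hypothetical:

```python
import hashlib

data = b"cache-key-material"

# On FIPS-enforcing systems, plain hashlib.md5(data) can raise because MD5 is
# not an approved algorithm. Declaring the non-security use keeps it working
# (the usedforsecurity flag exists since Python 3.9).
digest = hashlib.md5(data, usedforsecurity=False).hexdigest()
print(digest)
```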
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 163
Key Open Pull Requests
1. [test] 2: This pull request, titled "[test] 2," aims to address and fix an unspecified issue in the PyTorch project, as indicated by the placeholder "#ISSUE_NUMBER," and includes a series of 16 commits, each with the commit message "tc," which suggests a focus on testing or test-related changes, although it has not yet been merged.
- URL: pull/147470
- Merged: No
- Associated Commits: 03e29, 17679, 7b5b1, bd6cd, 1e50c, a324c, 2de74, 0a2e0, ed111, bcad8, 29c02, 32dc2, 4e88f, 15c40, 573b2, b8b97
2. cpp_wrapper: reduce memory usage by removing unneeded temporaries: This pull request aims to reduce memory usage in the cpp_wrapper by refactoring reinterpret_view calls to return temporary RAII tensor objects, thereby making the function's callers responsible for saving the handle when necessary, and eliminating unnecessary temporary tensor handles to align memory usage with the default inductor mode.
- URL: pull/147403
- Merged: No
- Associated Commits: 01424, 67582, eb4f8, 20c1a, a6f57, 1c1b4, aae0f, 3806a, ebf67, 91ceb, 4d5ed, 35f9d, b6bf5, 7deb4, 4ebac
3. [ONNX] Add draft_export as a strategy: This pull request introduces a new strategy called draft_export to the ONNX export process in PyTorch, which is positioned as the third fallback option, activated by setting the TORCH_ONNX_ENABLE_DRAFT_EXPORT environment variable, and is designed to specialize tensors without being less robust than the existing JIT trace strategy (a hedged usage sketch follows below).
- URL: pull/147529
- Merged: No
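A hedged sketch of how the draft_export fallback described above might be opted into, based solely on the environment variable named in the pull request; the tiny model and the dynamo-based exporter call are illustrative assumptions, not the PR's test plan:

```python
import os
import torch

# Opt into the proposed draft_export fallback before calling the exporter
# (hypothetical usage inferred from the PR description).
os.environ["TORCH_ONNX_ENABLE_DRAFT_EXPORT"] = "1"

model = torch.nn.Linear(4, 4)
example = (torch.randn(2, 4),)

# The dynamo-based exporter tries its strategies in order; with the flag set,
# draft_export would be available as an additional fallback.
onnx_program = torch.onnx.export(model, example, dynamo=True)
onnx_program.save("linear.onnx")
```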
Other Open Pull Requests
- MXFP8 and MXFP4 Support in PyTorch: This topic covers the introduction of blockwise MXFP8 support to the torch._scaled_mm function for CUDA devices, allowing dispatch to a blockwise kernel from cuBLAS. The pull requests also include enhancements for MX-FP8 matrix multiplications on AMD gfx950 devices with ROCm 6.5+, with plans for future updates to address MXFP4 support.
- Runtime and SACEstimator Modifications: The pull requests involve modifications to the RuntimeEstimator and SACEstimator in the PyTorch project, addressing issues such as fixing default arguments, binding issues, and linting problems. They also include testing fake utilities and collectives with memory trackers.
- Graph Break Hints in Dynamo: This topic introduces generic graph break hints to the Dynamo component of the PyTorch project, as part of a stack of changes. The pull requests include multiple updates and contributions from various collaborators.
- ROCm CK Kernel Updates: The pull requests update the ck_conv_template code generation for ROCm CK kernels by parameterizing previously hardcoded convolution parameters. This enhances flexibility and maintainability while reducing the number of generated templates.
- Handling Mismatched Outputs in PyTorch Inductor: The pull requests introduce support for handling mismatched outputs in the PyTorch inductor by extracting codegen_unbacked_symbol_defs from the FallbackKernel into a new method. This is specifically for conditional operations, with plans for future updates to extend this support to other operations like while_loop.
- ONNX Operations in PyTorch: The pull requests introduce the ability for users to utilize ONNX operations directly through torch.ops.onnx.*. They demonstrate an implementation for RotaryEmbedding with native PyTorch operators that integrate seamlessly with the existing ecosystem.
- Experimental Features in PyTorch: The pull requests introduce an experimental feature for delayed compilation in the PyTorch project, involving multiple updates and revisions. It is currently not merged.
- CUDA Graph Partition Feature: The pull requests implement a CUDA graph partition feature, building on a previous inductor graph partition PR. They include several commits such as recording mappings from partition input/output indices to graph indices, merging branches, and handling metadata partitioning.
- CacheBench Component Testing: The pull requests introduce a new test for the CacheBench component by adding a "ciflow/trunk" test to the PyTorch project. It is part of a stack of changes managed by ghstack and is currently not merged.
- Tensor Slice Overflow Fix: The pull requests address the issue of tensor slice overflow when the step value is near INT64_MAX by implementing a fix to prevent overflow in the calculation of slice output length. This is detailed in the commits and discussed in relation to issue #147071.
- ROCm Backend Boolean Value Fix: The pull requests address an issue in the ROCm backend of the PyTorch project by converting non-standard boolean values into standard boolean values. This ensures correct sorting operations and includes several commits for linting, unskipping unit tests, and fixing typos.
- Magma-Cuda References Removal: The pull requests involve removing references to magma-cuda from the readme.md file and refactoring the magma_conda installation process. This follows the migration of the magma-cuda build from Anaconda to AWS.
- Register Constant Usability in Exportz: The pull requests address the issue of making the register constant usable in the "exportz" functionality of the PyTorch project. They involve multiple updates and revisions as indicated by the series of commits and differential revision link provided.
- Unbacked Renamings in Export Process: The pull requests aim to eliminate the use of unbacked renamings in the export process by introducing a new pass in _produce_aten_artifact to recompute unbacked bindings. This ensures that unbacked binding keys remain synchronized with example values and improves compatibility with de/serialization.
- Test Submission Using Ghstack: The pull requests are a test submission, not intended for actual merging, created using the ghstack tool. They involve multiple commits with placeholder messages, while tagging several contributors for notification.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: The pull requests introduce a new templatized CUDA kernel designed to replace three existing non-ROCM CUDA kernels for the GammaBeta backwards pass. They address performance issues by optimizing warp shuffles, coalesced loads, and parallelism across the M dimension.
- ONNX Operation Decomposition Migration: The pull requests aim to migrate ONNX operation decomposition functions from onnxscript to PyTorch. This decouples torch.onnx from implementations in onnxscript, with necessary refactoring and test scaffolding provided in related pull requests.
- Cutlass Backend Matrix Multiplication Tests: The pull requests introduce main tests for matrix multiplication (mm), addition and multiplication (addmm), and batch matrix multiplication (bmm) within the Cutlass backend of the PyTorch project.
- Dynamo Error Message Enhancements: The pull requests aim to enhance the error messages in the Dynamo component of the PyTorch project. They are part of a series of improvements and include multiple commits with updates and contributions from various collaborators.
- Poison Fork Documentation for Accelerator APIs: The pull requests aim to document a note regarding "poison fork" for accelerator APIs in the PyTorch project. They are part of a stack of changes managed by ghstack and involve multiple updates marked as "[ghstack-poisoned]".
- GaussianNLLLoss Variance Input Size Fix: The pull requests address issue #147521 by modifying the GaussianNLLLoss function to allow any size of variance input as long as it is broadcastable to the input or target's size. This ensures that the demo code in the issue results in the expected behavior and correct output (an illustrative call sketch appears at the end of this list).
- Continuous Integration Optimization: The pull requests aim to optimize the continuous integration process by utilizing more CPU processes during the checkout phase. They include testing changes from the main branch to a specific branch identified by the commit hash 249a936998e66cc0d6ad8664e0e93ec1b9432a8b.
- ReplicationPad Bool Data Type Handling: The pull requests address the issue of aligning the replicationpad function's handling of the bool data type with the eager execution mode in the PyTorch project. This is part of fixing issue #143779.
- Gather and Scatter Object List Fixes: The pull requests address a fix for the gather_object and scatter_object_list functions in the PyTorch project. They ensure that the destination and source ranks are correctly based on the global process group, regardless of the group argument.
- DeviceMesh.get_group Argument Support: The pull requests introduce support for passing arguments to the DeviceMesh.get_group function in the PyTorch project. They include adding tests and updating relevant files like test_dtensor_compile.py and distributed.py.
- Sparse Tensor Validation: The pull requests address the validation of sparse tensors constructed via a legacy constructor in PyTorch. They highlight issues such as size inconsistencies and storage size calculation overflows during the torch.load process.
- FSDP Tests on XPU Device: The pull requests aim to enable Fully Sharded Data Parallel (FSDP) tests on the XPU device within the PyTorch project. They involve multiple commits such as the implementation of an abstracted API to retrieve the backend and adjustments based on review comments.
- Torch.isin Function Decomposition Fixes: The pull requests address two decomposition issues in the torch.isin function within the PyTorch project. They specifically fix the lack of support for scalar test_element and resolve discrepancies in results produced by Inductor compared to eager mode.
- Collective Recomputations in Partitioner: The pull requests propose to always disable the compiler-driven recomputation of collectives by default in the partitioner. This prevents inconsistencies and potential hangs in distributed jobs, with future plans to introduce an spmd_mode flag for safe collective recomputation.
- Export Method Introduction: The pull requests introduce an "export method" to the PyTorch project, as part of a stack of changes managed by ghstack. They include multiple commits refining the implementation, although it has not yet been merged.
- Myst_nb Compile Tutorial Demonstration: The pull requests introduce a demonstration of using myst_nb with a compile tutorial in the PyTorch project. They are indicated by the title and multiple commits refining the demonstration.
- Log2 and PowByNatural Printing Issues: The pull requests address issues related to the printing functionality of the log2 and PowByNatural operations in the PyTorch project. They include multiple commits with updates, although they have not yet been merged.
- Dynamic Indices Type in Torch.sort: The pull requests introduce an optimization to the torch.sort function by implementing a dynamic_indices_type option. This dynamically determines the data type of indices to reduce excessive memory usage, supporting data types such as Byte, UInt16, UInt32, and UInt64.
- Backwards Indexing Enhancements: The pull requests aim to enhance the functionality of backwards indexing in the PyTorch project specifically for cases where the stride is not equal to one. They involve collaboration with several contributors from the ROCm team.
- Inductor C++ Code Generation Bug Fix: The pull requests address a bug in the inductor C++ code generation for a custom operation in PyTorch. They ensure that a list containing a single tensor with an unbacked symbolic integer shape does not result in a type/value mismatch error during template parameter deduction.
- Mixed Precision Fused Adam Optimizer: The pull requests propose an implementation of a mixed precision fused Adam optimizer for the PyTorch project. They are part of a stack of changes managed by ghstack, although they have not yet been merged.
- CachingHostAllocator Memory Statistics: The pull requests introduce an initial implementation of host memory statistics for the CachingHostAllocator in the PyTorch project. They aim to facilitate the diagnosis of performance slowdowns by gathering memory allocation data without significantly altering the allocator's original design.
- OneDNN Primitive Cache for Int4 GEMM: The pull requests introduce an enhancement to the PyTorch project by adding a oneDNN primitive cache specifically for int4 GEMM operations on XPU. They include an example of int4 GEMM migrated from IPEX.
- Sparse Tensor Validation in Torch.load: The pull requests aim to enhance the PyTorch library by adding sparse tensors constructed via a legacy constructor to the _sparse_tensors_to_validate list. This ensures they are validated at the end of the torch.load process.
- Mutation Analysis in Triton Compiler: The pull requests address an issue in the mutation analysis of scf.if and scf.for operations within the Triton compiler. They introduce separate scf.yield operations for each yield argument, preventing the incorrect marking of all yield arguments as mutated.
- Storage Offset Overflow Checks: The pull requests address issue #145259 by adding two overflow checks to the storage offset calculation in aten/src/ATen/native/Resize.h. This prevents crashes and incorrect tensor returns when using large storage offsets in PyTorch's as_strided function.
- Scheduler Code Refactoring: The pull requests involve minor refactoring of the scheduler code in the PyTorch project. They include changes such as using a default dictionary and cleaning up the log fusion function as part of ongoing code improvements.
- GuardManagers Reference Change: The pull requests propose a change in the PyTorch project to maintain a reference to the parent instead of the root within GuardManagers. They are part of a stack of changes managed by ghstack and are currently unmerged.
- Dict_tag Optimization Disabling: The pull requests propose to disable the dict_tag optimization in ancestor nodes when the ancestor is not common. They are part of a stack of changes in the PyTorch project and include two commits with updates marked as "[ghstack-poisoned]".
- Flip Operation Memory Corruption Fix: The pull requests address a memory corruption issue in the flip operation for torch.quint4x2 and torch.quint2x4 inputs. They implement a runtime error check for these deprecated data types and include a test plan to verify the change.
- Triton Autotune Configuration Heuristic: The pull requests aim to reintroduce a previously reverted change that introduces a new template heuristic for Triton autotune configurations. They remove additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py to address compile time regressions.
- Torch.polygamma() Function Consistency: The pull requests address an issue with the torch.polygamma() function when n == 1 by ensuring consistency with the CPU kernel. They include two commits aimed at resolving this problem.
- Sym_not Function in ONNX Module: The pull requests aim to implement the sym_not function in the ONNX module of the PyTorch project. They address issue #136572 and are part of a stack of changes managed by ghstack.
- CUDA Device Index Guard Mechanism: The pull requests introduce a guard mechanism for the CUDA device index in the PyTorch project. They ensure that operations are correctly managed across different CUDA devices.
- Device Check Logic Consistency: The pull requests address a bug reported in issue #144748 by modifying the device check logic in the PyTorch codebase. They ensure consistency between eager mode and inductor mode by aligning the behavior of the find-common-device method in fake_tensor.py with the device check in adaption.h.
- Outer Loop Fusion Heuristics Optimization: The pull requests aim to enhance the performance of the PyTorch project by optimizing the heuristics used in outer loop fusion. They are indicated by the title and the associated commits.
- Torch.utils.tensorboard Export Fix: The pull requests address the issue of certain classes not being exported from the torch.utils.tensorboard module. They define the __all__ attribute to explicitly specify the public interface, ensuring that classes like FileWriter, RecordWriter, and SummaryWriter are properly recognized and accessible.
- Import Fix in torch/_inductor/debug.py: The pull requests address an issue with the import of getArtifactLogger in torch/_inductor/debug.py for ir_pre_fusion and ir_post_fusion. They ensure the import is complete and set the logging to off_by_default to minimize excessive logging.
- Precompile Cache Utilization Check: The pull requests address the issue of ensuring that the system checks if the force_disable_caches flag is set before utilizing the precompile cache. They are part of a series of commits in the PyTorch project and involve multiple contributors for review and collaboration.
- Inductor Cache Selection Algorithm: The pull requests introduce a new algorithm for selecting caches in the fresh inductor cache. They are part of a stack of changes and include discussions and reviews from multiple contributors.
- LazyLinear Module Abnormal Behavior Fix: The pull requests address the abnormal behavior of the LazyLinear module in PyTorch when it is used in conjunction with load_state. They update the logic of the initialize_parameters function and add new test cases.
- End-to-End Control Plane Flex Attention: The pull requests are an experimental attempt to implement end-to-end control plane (cp) flex_attention within the PyTorch project. They involve multiple commits and collaboration among several contributors.
- Third-Party ONNX Build Process Enhancement: The pull requests aim to enhance the build process of third-party ONNX by removing unnecessary options and addressing a missing dependency. They are indicated by the commits and their association with a specific issue in the PyTorch project.
- Triton XPU Build Process on Windows: The pull requests aim to enable the Triton XPU build process on Windows for the PyTorch project. They are indicated by the title and the commits associated with it.
- AOTD System Output Classification Bug Fix: The pull requests address a bug reported by an internal user in the PyTorch project, where the AOTD (Ahead Of Time Dispatch) system incorrectly classified outputs that are aliases of intermediates in a computational graph. They propose a solution by adding runtime unwrapping to ensure that the base of a detached alias is consistently tracked back to its original tensor.
- Block Radix Sort Performance Enhancement: The pull requests aim to enhance the performance of block radix sort for certain shapes in the ROCm backend by reducing the items processed per thread to 8. This increases the thread block size and achieves higher occupancy.
- Matmul Small Brute Force Tunableop Test Speedup: The pull requests aim to speed up the unit test for the matmul_small_brute_force_tunableop by reducing its execution time by over 20 minutes. They include refactoring such as moving a hipBLASLt version check to a different test for simplicity.
- Ruff Rule S324 Enablement: The pull requests aim to enable the ruff rule S324 by adding it to the pyproject.toml file. They address issue #147627 and include running a lintrunner check across all files to clean up warnings.
- Manual Dynamism Whitelist Introduction: The pull requests introduce a "manual dynamism whitelist" to the PyTorch project. They involve multiple commits and contributors and are part of a stack of changes managed through the ghstack tool.
- Broken Link Fix in PyTorch Documentation: The pull requests address a broken link issue in the PyTorch documentation by updating a reference to the NumPy documentation. They ensure it correctly redirects to the current NumPy documentation site.
- RandomBatchSampler Performance Enhancement: The pull requests propose a performance enhancement by merging RandomSampler and BatchSampler into a new RandomBatchSampler. They utilize slicing instead of iteration to output indices, resulting in significant speed improvements.
- Intel GPU TestCommon::test_dtypes Skipping: The pull requests aim to skip the Intel GPU TestCommon::test_dtypes test for the bmm and addbmm operations due to the lack of complex64 support. They also extend the DecorateInfo to accommodate a list of device types.
- Process Group Without Parameters Fix: The pull requests address and fix an issue in the PyTorch project where a process group (PG) without parameters was causing problems. They are referenced in issue #143828 and include updates tracked through the ghstack tool.
- Normal Classes as Dataclasses in Pytree: The pull requests address a discussion from a previous pull request by modifying the PyTorch codebase to allow normal classes to be registered as dataclasses within the pytree module. They are indicated by the commits and the linked discussion.
- NCCL Memory Pool Use Condition Restriction: The pull requests introduce a restriction on the use condition of the NCCL memory pool by adding a check to determine if the CUDA driver supports multicast. This is similar to the implementation in Symmetric Memory and is part of a stack of changes managed by ghstack.
- FlexAttention Module Error Messaging: The pull requests address the issue of inadequate error messaging in the FlexAttention module by adding explicit error messages for cases where the embedding size is less than 16. This aids users who are experimenting with small tensor sizes.
- FSDP Wrapped Module Zero Argument Bug Fix: The pull requests address a bug in the Fully Sharded Data Parallel (FSDP) wrapped module related to a zero argument. They implement a fix and add a unit test, while also removing the skip_if_lt_x_gpu condition.
- Inductor Component Casting Logic Rework: The pull requests rework the casting logic in the Inductor component of the PyTorch project to avoid illegal bitcasts. They address issues introduced by Triton's checks on bitcasts where the casted value does not fit into the casted type.
- PT2 Compiler Boolean Type Handling: The pull requests address issues with the PT2 compiler's handling of boolean types in wrapped functions. They add explicit tests to determine if data is of type i1 and include a test added to test_triton_kernels.py to ensure compatibility with existing infrastructure.
- NVTX3 Include Directory Hints: The pull requests address the issue of CMake struggling to locate NVTX3 by adding hints to the USE_SYSTEM_NVTX configuration for the NVTX3 include directory. They are detailed in the commit found at https://github.com/pytorch/pytorch/commit/a3c4572bf250ccdde8bdcdcbf642a1cb16bdd113.
- NCCL Communication for Uint64 Tensor Types: The pull requests aim to modify the PyTorch library by enabling NCCL communication to support uint64 tensor types. This is particularly important for applications in cryptography and privacy computing.
- MKLDNN Backend Availability API: The pull requests aim to introduce an is_available API for torch.backends.mkldnn, similar to existing APIs for torch.backends.mkl and torch.backends.openmp. This allows users to check the availability of the MKLDNN backend in PyTorch.
- Test_transformers.py File Splitting: The pull requests propose splitting the existing test_transformers.py file into two separate files, test_transformers.py and test_transformers_privateuser1.py. This addresses the issue of skipped privateuse1 test cases that currently conflict with CUDA test cases.
- Once Flag Removal and Static Initialization: The pull requests aim to enhance the PyTorch project by removing the unnecessary usage of the "once flag" and replacing it with static initialization. They are part of a series of changes and are linked to a specific issue for tracking and collaboration with multiple contributors.
- XPU Build Process with Visual Studio 2019: The pull requests aim to modify the build process for XPU by enabling the use of Visual Studio 2019. They are part of an effort to address a specific issue referenced in the project and involve collaboration with multiple contributors.
- 128-bit Vectorization Reversion Draft: The pull requests are a draft that aims to revert a previous commit related to the implementation of 128-bit vectorization in the ATen CUDA component of the PyTorch project. They address an unspecified issue.
- RNN Example Code Correction: The pull requests address an issue in the PyTorch documentation by correcting the RNN example code to properly handle multiple layers. They ensure that only the first layer takes the input vector while subsequent layers use the hidden state from the previous layer.
- Ptxas Warnings Resolution: The pull requests address and resolve numerous ptxas warnings during the build process by aligning the thread count for sm_120 with the CUDA C programming guide's specification of a maximum of 1536 threads per SM.
- Cub Iterators Replacement with Thrust Iterators: The pull requests aim to update the PyTorch project by replacing deprecated cub iterators with thrust iterators. This is due to recent changes in the CCCL (cub) development, while acknowledging potential impacts on ROCM usability.
- Dynamo Methods Type Annotations: The pull requests aim to enhance the type annotations for dynamo methods in the PyTorch project. They are indicated by the title and commit message and involve several contributors mentioned in the body.
- UBSAN Test Enablement: The pull requests aim to enable the Undefined Behavior Sanitizer (UBSAN) test in the PyTorch project. They address a specific issue referenced as #ISSUE_NUMBER and include a single commit with the message "Enable UBSAN test."
- ASAN Support for CUDA: The pull requests aim to enable AddressSanitizer (ASAN) support for CUDA in the PyTorch project. They are indicated by the title and commit message and involve collaboration with several contributors mentioned in the body.
- Aten.as_strided.default Operation Introduction: The pull requests introduce the aten.as_strided.default operation to address the FakeTensor propagation error identified in issue #145353. They demonstrate an alternative approach in pull request #147517.
- Flex Attention Function Registration: The pull requests attempt to register the flex_attention function to a custom function on DTensor within the PyTorch project. They encounter a runtime error related to the use of FunctionalTensor without a corresponding FunctionalTensorMode.
- Flex Attention Custom Dispatch Mode: The pull requests attempt to register the flex_attention function to a custom function within a custom dispatch mode in the PyTorch project. They encounter a NotImplementedError due to the absence of a registered rule for handling the flex_attention operation with the DTensor subclass.
- Flex Attention Custom Function Dispatch: The pull requests aim to experiment with registering the flex_attention function to a custom function on DTensor within a custom dispatch mode. They allow for successful dispatch of flex_attention in a given context to the custom CP flex_attention.
- Pybind11 Submodule Update Test: The pull requests update the pybind11 submodule to version 3.0.0-dev as a test. They address an unspecified issue and include a single commit with the message "Update pybind11 submodule to 3.0.0-dev test."
- Setattr Function KeyError Proposal: The pull requests propose raising a KeyError when the setattr function is called on a Module instance in PyTorch and a class attribute already exists. They address a silent error where users might incorrectly assume that the setattr operation was successful when it was not.
- MPS Integer Matmul Kernel Optimization: The pull requests aim to optimize the integer matrix multiplication (matmul) kernel for Metal Performance Shaders (MPS) on macOS. They improve performance through reduced global memory accesses, with a focus on enhancing efficiency for large matrices.
- Elementwise Kernel Input Vectorization: The pull requests introduce input vectorization in elementwise kernels for tensors with heterogeneous types. They specifically demonstrate its application for input tensors with types (float, bfloat16) when the functor type is float(float, float).
- Torch.compile Fullgraph Models API: The pull requests introduce a new API for the torch.compile function that allows for the compilation of fullgraph models using a C++ wrapper. They enable the saving and loading of compiled artifacts to disk through a "sticky cache" mechanism.
- Attention Mechanism for Tensors with More Dimensions: The pull requests address issue #147443 by fixing the attention mechanism for tensors with more than four dimensions. They include the addition of relevant tests to ensure functionality.
- SymPy Floating-Point Number Printing: The pull requests address an issue with the printing of floating-point numbers in the SymPy library within the PyTorch project. They are part of a stack of changes managed by ghstack and are linked to a previous pull request #147261.
- Unused-Value Issue in CUDAHooks.cpp: The pull requests address an unused-value issue in the file caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp. They modify the code to eliminate unnecessary values that trigger the -Wunused-value warning in LLVM.
- Pattern Matcher Guard Replacement: The pull requests propose replacing the use of guard_size_oblivious with statically_known_true in the pattern matcher. They aim to avoid adding unnecessary guards, as detailed in the commit and supported by an internal discussion link.
- ShapeAsConstantBuffer Transfer Mechanism: The pull requests involve implementing a mechanism to transfer a ShapeAsConstantBuffer from a subgraph to the main graph output in the PyTorch project. They handle a symbolic integer returned by the inner subgraph and subsequently by the forward graph after partitioning.
- System Random State Handling: The pull requests aim to improve the handling of the system's random state in the PyTorch project by carefully saving and restoring it. They mark the third attempt to address the issue outlined in a previous discussion on GitHub.
- Compile_tasks.py Unused Functions Removal: The pull requests aim to clean up the codebase by removing unused functions from the compile_tasks.py file in the PyTorch project. They are indicated by the non-functional change (NFC) label in the commit message.
- Mark_traceable Feature on Class Methods: The pull requests introduce support for the mark_traceable feature on class methods in the PyTorch project. They include a new test called test_mark_traceable_on_method and additional comments explaining the necessity for special handling of methods.
- Global or Captured Tensors in Mark_traceable: The pull requests address the issue of supporting reads to global or captured tensors within functions marked as mark_traceable. They introduce a global FakeTensorTLS with an allow_non_fake_inputs_override flag to temporarily adjust the flag during execution.
- CachingAutotuner on Meta Device: The pull requests address issue #146018 by improving the handling of the CachingAutotuner on the meta device. They fix size inference issues, ensuring that dynamic shape handling functions correctly when multiple calls with different tensor sizes are made.
- Function Signatures Refactoring with ParamSpec: The pull requests refactor function signatures in the PyTorch project by replacing *args: Any and **kwargs: Any with ParamSpec. They enhance type safety, improve static type checking with tools like mypy, and maintain code quality by preserving argument information.
- Triton Kernel Grid Handling Simplification: The pull requests aim to simplify grid handling in Triton kernel calls by removing the need to pass the grid as a callable argument. They incorporate grid computation directly within the kernel launcher, enhancing performance by reducing function calls.
- Unique User Kernel Names in Triton: The pull requests introduce a feature called unique_user_kernel_names to provide unique naming support for user-defined Triton kernels. They enhance control over naming and generation processes, primarily for debugging purposes.
- XCCL Backend Build Definitions: The pull requests introduce the definitions of USE_C10D_XCCL and USE_XCCL in PyTorch to enable the building of the XCCL backend. They are similar to the existing support for NCCL, with the default setting for USE_XCCL being OFF unless explicitly set to ON.
- OffsetBasedRNGTracker Default Device Fix: The pull requests address the issue of setting the default device type to CUDA when the OffsetBasedRNGTracker is called without arguments. They pass the device information explicitly as part of a fix for issue #147584 in the PyTorch project.
- OpDTypes.any_common_cpu_cuda_one Documentation: The pull requests introduce documentation for the OpDTypes.any_common_cpu_cuda_one feature in the PyTorch project. They are indicated by the commit message and the title and are linked to a specific issue for resolution.
- CUDA 12.8 Binaries sm_70 Architecture Deprecation: The pull requests propose the deprecation of the sm_70 architecture for CUDA 12.8 binaries in the PyTorch project. They are part of a follow-up to a previous pull request due to the feature-complete status and impending freeze of architecture support for Maxwell, Pascal, and Volta.
- Intel Gaudi Devices Support in test_misc.py: The pull requests adapt the test_misc.py file to support Intel Gaudi devices (HPUs) by extending CUDA tests to operate on these devices. They ensure compatibility without affecting existing CUDA tests and include the use of a skipIfHPU decorator.
- Replace_pattern Function Docstring Correction: The pull requests address a minor mistake in the docstring of the replace_pattern function within the PyTorch project. They are referenced in issue #147610 and include a single commit updating the subgraph_rewriter.py file.
- SDPA on XPU Backend Enablement: The pull requests aim to enable SDPA on the XPU backend as part of the OneDNN Upstreaming plan. They involve the addition of an Attention.cpp file and a Graph.h for OneDNN graph utilities, along with modifications to test cases in test/xpu/test_transformers.py.
- OneDNN Component Merge Rules Update: The pull requests aim to update the merge rules for the oneDNN component in the PyTorch project. They are part of a stack of changes managed by ghstack and are currently open and not yet merged.
- Podman Build Process Documentation: The pull requests document the automated build process of Podman with upstream patches applied to address specific issues encountered on s390x runners. They are detailed in the commit found at https://github.com/pytorch/pytorch/commit/5e4db89b85d6ee086582d2dfae5af2a004345458.
- ROCm Split_scan Support Enablement: The pull requests enable split_scan support for ROCm builds in the PyTorch project. They address issue #133228 by removing the condition that previously prevented this support.
- Triton Tests Force_shape_pad Option: The pull requests enable the force_shape_pad option for Triton tests in the test_kernel_benchmark. They address issues where padding paths are slower on ROCm architectures, ensuring that the tests focus on verifying the correctness of padding.
- Set_driver_to_gpu Code Update: The pull requests update the set_driver_to_gpu code in the PyTorch project to prevent backend re-initialization issues when using the new Triton. They are indicated by the commit signed by Anatoly Myachev.
- Hugging Face Checkpoints Storage Reader/Writer: The pull requests aim to build a storage reader/writer to enable writing checkpoints in the Hugging Face (HF) format for non-distributed use cases. They address previous lint errors by explicitly ignoring them due to the intentional absence of certain library installations.
- Backend_type_map Removal from Backend: The pull requests aim to remove the backend_type_map from the Backend in the PyTorch project. It is no longer used for determining the default device for object collectives or barriers, and the author is awaiting continuous integration (CI) test results to ensure that this change does not introduce any issues.
- Torch/_inductor/ir.py Unnecessary Changes Reversion: The pull requests revert unnecessary changes made to the torch/_inductor/ir.py file in a previous update (#146917). They address issues with CUDA tests not passing due to an oversight in syncing environments across different machines.
- Test_halide.py Script Enhancement: The pull requests aim to enhance the test_halide.py script by adding functionality to report the command needed to re-run any failed tests. They improve the debugging process for developers working on the PyTorch project.
- MPS Binary Operations Metal Kernel: The pull requests aim to implement a metal kernel for MPS binary operations using TensorIterator. They update and reimplement a previous pull request to help resolve a specific issue in the PyTorch project.
- Rowwise Scaling Tests Skipping on SM100+: The pull requests propose to temporarily skip the rowwise scaling tests on SM100+ architectures in the PyTorch project. They are due to the current lack of implementation and are further discussed with several contributors in the body of the request.
- TCPStore Error Handling Enhancements: The pull requests aim to enhance error handling in the TCPStore and TCPStoreLibUvBackend components of the PyTorch project. They replace generic TORCH_CHECK calls with typed exceptions, improving the specificity of error messages that are raised as RuntimeErrors in Python.
- C++ Pytree Compile Time Assessment: The pull requests are an experimental change aimed at assessing the compile time when using C++ pytree in the PyTorch project. They are indicated by the title and commit message and have not yet been merged.
- Unbacked Bindings in .module() Result: The pull requests aim to ensure that the .module() result in the PyTorch project does not contain unbacked bindings. They are associated with Differential Revision D70022208.
- Addmm Tests Input Range Restriction: The pull requests aim to restrict the input range for addmm tests in the cuBLAS library. They address cancellation issues with larger sizes, enabling testing with tighter tolerances.
- Windows CUDA Wheel and Libtorch CI Testing: The pull requests are focused on testing the continuous integration (CI) process for Windows CUDA wheel and libtorch in the PyTorch project. They are indicated by the title and the associated commit message.
- FIPS Compliance with RUFF Linter: The pull requests aim to enforce full FIPS compliance by adding rule S324 to the RUFF linter in the PyTorch project. They are indicated by the title and the commit message and include a command for testing the changes.
- Pdist_forward Function Error Checking: The pull requests address issue #145064 by adding error checking to the _pdist_forward function in PyTorch. They prevent segmentation faults when iterating over an empty tensor, verified through updated test cases that now raise a RuntimeError instead of causing a crash.
- Use_relative_path Option Renaming: The pull requests involve refactoring by renaming the option from use_absolute_path to use_relative_path, which more accurately reflects its function of compiling a C++ file using its basename rather than its full path.
- CppBuilder.build Function Consolidation: The pull requests refactor the code by replacing the run_command_and_check function with CppBuilder.build. They consolidate the C++ compilation action within the PyTorch project.
- Triton_heuristics.py Grid Overwrite Bug Fix: The pull requests address a bug in the triton_heuristics.py file where args_with_constexprs incorrectly overwrites the grid. They add a check to ensure the correct number of arguments are passed to the launcher, enhancing error handling and preventing unexpected failures during Triton kernel execution.
- Dynamo Dictionary Tag Optimization Disabling: The pull requests propose to disable the dictionary tag optimization in the Dynamo project when the guard manager has child accessors. They are indicated by the title and commit message.
- Cpp_extensions Module Ninja Build Error Messaging: The pull requests aim to enhance the error messaging related to missing Ninja build system in the cpp_extensions module of the PyTorch project. They are indicated by the commit message and the associated URL.
- JIT Version Checking Bug Fix: The pull requests address a bug in the version checking mechanism for JIT in Python 3.10. They ensure that a feature is only enabled for version 3.11, as identified by a static linter, and include a commit to correct this issue.
- Gen_patterns.py Script TypeError Fix: The pull requests address a crash issue encountered when running the gen_patterns.py script in the PyTorch project. They specifically fix a TypeError related to the issubclass() function, resolving the error.
- OpenBLAS Multiple OpenMP Runtimes Issue: The pull requests address an issue in PyTorch where building with OpenBLAS support could lead to multiple OpenMP runtimes being linked in libtorch_cpu.so. They ensure that libomp.so is not linked if OpenBLAS is already linked against libgomp.so.
- Intel Triton Component Update for Release 2.7: The pull requests aim to update the Intel Triton component within the PyTorch project to be compatible with the upcoming release 2.7. They are indicated by the work-in-progress status and the associated commit.
- Triplet_margin_loss Test Tolerance Update: The pull requests update the CPU tolerance levels for the nn.functional.triplet_margin_loss test in test_torchinductor_opinfo. They prevent failures on AArch64 by increasing the acceptable absolute and relative tolerances (ATOL and RTOL) for F16.
- Torch-xpu-ops Commit Update: The pull requests update the torch-xpu-ops commit to a specific commit hash, 306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7. They incorporate a bug fix for LayerNorm and Nonzeros, as well as an update to the AOT target, and are currently not merged.
- Torchgen Tool Enhancement for C Shim Files: The pull requests propose an enhancement to the torchgen tool by enabling it to automatically update C shim files with a version number and a list of new arguments for modified operations. They address the backward compatibility issue that arises when adding new arguments with default values to fallback operations in Python.
- Inductor Test_kernel_benchmark.py Script Fix: The pull requests address an issue in the PyTorch project by fixing the inductor/test_kernel_benchmark.py script. They accommodate changes in the new Triton version by preventing the duplication of parameters in the _dump_launch_params function.
- P1 INT16 Full Quantization Target MVP: The pull requests introduce a minimum viable product (MVP) for the P1 INT16 Full quantization target. They involve quantizing the input to int16 as part of the PyTorch project.
- Partitioner Component Print Statements Removal: The pull requests involve removing print statements from the partitioner component of the PyTorch project. They are part of a series of changes tracked through the ghstack tool and are currently open and not yet merged.
- Search Survey Link Removal: The pull requests aim to remove a link to a search survey from the PyTorch project. They are indicated by the commit message and the involvement of contributors tagged in the discussion.
- Distributed Checkpointing Protocol Rank Local Checkpointing: The pull requests are a work in progress aimed at demonstrating rank local checkpointing in the Distributed Checkpointing Protocol (DCP) for the PyTorch project. They are not yet ready for review.
- Compile_fx_aot Logging Context Managers: The pull requests introduce context managers in the compile_fx_aot function to enhance logging. They add a toplevel Chromium event (tlparse) and a single dynamo_compile log entry, improving traceability and visibility of events in both Scuba and Perfetto trace tools.
- Gfx1102 Architecture Support in Wheel Builds: The pull requests aim to add support for the gfx1102 architecture to the wheel builds in the PyTorch project. They utilize the --offload-compress option to accommodate another graphics target, as the relevant code objects have been included since ROCm 5.5.
- Torch.compile Path Priority Order Respect: The pull requests address the issue where the torch.compile path in the PyTorch project was not respecting the priority_order setting of sdpa_kernel. They ensure that the context manager handling within torch.compile now properly acknowledges and applies this configuration (see the sketch after this list).
- Torch.float8_e8m0fnu Feature Testing: The pull requests are focused on testing the implementation of the torch.float8_e8m0fnu feature in PyTorch. They are indicated by the title and the involvement of multiple reviewers and subscribers.
- Cpp_builder Clang++ Detection Bug Fix: The pull requests address a bug in the cpp_builder where the detection mechanism incorrectly identifies clang++ as g++. They include a fix to ensure proper differentiation between the two compilers (see the sketch after this list).
- CK Backend for Memory-Efficient Attention in ROCm: The pull requests introduce the CK backend for memory-efficient attention in ROCm. They enable the use of attention bias while noting that it is still activated via torch.backends.cuda.preferred_rocm_fa_library("ck") and does not support Nested Tensors.
- Non-Functional Collectives Support in FakeTensorMode: The pull requests aim to enhance the PyTorch project by adding support for non-functional collectives under FakeTensorMode and fake_pg. They improve memory tracking capabilities.
- Layernorm CUDA Backwards Pass Test: The pull requests introduce a new test to ensure the accuracy of the layernorm CUDA backwards pass. They serve as a foundational step towards future performance improvements (see the sketch below).
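To illustrate the FIPS/RUFF item above, the snippet below is a generic sketch of the kind of code a bandit-style insecure-hash rule such as S324 flags, along with FIPS-friendlier alternatives. It is not taken from the pull requests, and the exact scope of the rule is as documented by RUFF.

```python
import hashlib

data = b"example payload"

# The kind of call an insecure-hash rule flags: MD5 used without declaring it
# non-security-relevant. On a FIPS-enforcing build this can also fail at runtime.
weak_digest = hashlib.md5(data).hexdigest()

# FIPS-friendlier alternatives: a stronger algorithm, or (Python 3.9+) marking
# the call as not security-relevant so the linter and FIPS policy can allow it.
strong_digest = hashlib.sha256(data).hexdigest()
checksum_only = hashlib.md5(data, usedforsecurity=False).hexdigest()

print(strong_digest, checksum_only, weak_digest, sep="\n")
```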
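For the _pdist_forward error-checking item, the sketch below probes the behavior the updated tests are described as asserting: a problematic empty input should now be rejected with a RuntimeError rather than crashing. The triggering shape used here is an assumption; on builds where this shape was never problematic, the call may simply return an empty result, which the snippet also handles.

```python
import torch

# An empty input; the exact shape that triggered the original crash is assumed.
x = torch.empty(0, 4)
try:
    out = torch.nn.functional.pdist(x)
    print("no error raised; result shape:", tuple(out.shape))
except RuntimeError as err:
    print("rejected with RuntimeError:", err)
```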
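For the JIT version-checking fix, the snippet below shows a common way such version gates go wrong; it is a generic illustration, not the actual code from the pull requests.

```python
import sys

# On Python 3.10.x, sys.version_info is e.g. (3, 10, 6, "final", 0), so a strict
# "greater than (3, 10)" comparison is already True and a gate meant for 3.11+
# also fires on 3.10. Comparing against (3, 11) avoids the off-by-one.
enabled_buggy = sys.version_info > (3, 10)   # True on 3.10.x
enabled_fixed = sys.version_info >= (3, 11)  # False until 3.11
print(enabled_buggy, enabled_fixed)
```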
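For the triplet_margin_loss tolerance update, the sketch below shows the knob being adjusted: torch.testing.assert_close accepts explicit atol/rtol values, and loosening them accepts the slightly larger float16 error that reduced-precision kernels can produce. The values shown are placeholders, not the ones used in the test suite.

```python
import torch

# Two float16 results that differ by a small amount, as compiled and eager
# outputs often do at reduced precision.
expected = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
actual = expected + 2e-3

# The default float16 tolerances can flag a difference of this size; passing
# looser atol/rtol accepts it, which is the kind of adjustment the test makes.
torch.testing.assert_close(actual, expected, atol=1e-2, rtol=1e-2)
print("within relaxed tolerance")
```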
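For the torch.compile / sdpa_kernel item, the sketch below shows the context manager the fix concerns: backend selection made through torch.nn.attention.sdpa_kernel should now be honored inside compiled code (including any configured priority ordering) just as it is in eager mode. The exact flag used to enable priority ordering is not reproduced here.

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def attn(q, k, v):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

compiled_attn = torch.compile(attn)

if torch.cuda.is_available():
    q = k = v = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)
    # The backend selection made here should be respected inside the compiled
    # region as well, not only in eager execution.
    with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]):
        out = compiled_attn(q, k, v)
        print(out.shape)
```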
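For the cpp_builder compiler-detection fix, the snippet below shows one way such a misidentification can happen: "g++" is a substring of "clang++", so a naive substring test matches both. Whether this is the exact mechanism of the bug is an assumption; the checks here are illustrative, not the cpp_builder code.

```python
def is_gxx_naive(compiler: str) -> bool:
    # Also matches "clang++", because "g++" is a substring of "clang++".
    return "g++" in compiler

def is_gxx_safer(compiler: str) -> bool:
    # Compare against the executable name itself (allowing versioned names).
    name = compiler.rsplit("/", 1)[-1]
    return name == "g++" or name.startswith("g++-")

for cc in ("g++", "g++-12", "/usr/bin/clang++"):
    print(cc, is_gxx_naive(cc), is_gxx_safer(cc))
```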
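Finally, for the layernorm CUDA backwards test, the sketch below is a minimal version of the kind of accuracy check described, comparing the CUDA backward pass against a double-precision CPU reference. The shapes and tolerances are placeholders, not those of the actual test.

```python
import torch

if torch.cuda.is_available():
    # Forward and backward of layer_norm on CUDA in float32.
    x = torch.randn(4, 8, device="cuda", requires_grad=True)
    out = torch.nn.functional.layer_norm(x, (8,))
    grad_out = torch.randn_like(out)
    (grad_cuda,) = torch.autograd.grad(out, x, grad_out)

    # Reference backward computed in float64 on CPU.
    x_ref = x.detach().double().cpu().requires_grad_()
    out_ref = torch.nn.functional.layer_norm(x_ref, (8,))
    (grad_ref,) = torch.autograd.grad(out_ref, x_ref, grad_out.double().cpu())

    torch.testing.assert_close(grad_cuda.double().cpu(), grad_ref, atol=1e-4, rtol=1e-4)
    print("CUDA layer_norm backward matches the float64 reference")
```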
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 241
Key Closed Pull Requests
1. Fix SEGFAULT when None arg was passed in GraphContext.op(..): This pull request addresses a segmentation fault (SEGFAULT) in the PyTorch project by fixing a bug in the GraphContext.op(..) function that occurred when a None argument was passed, as indicated by the title and the body referencing issue #145261.
- URL: pull/145265
- Merged: No
- Associated Commits: 74c59, 1ebec, 529cf, 28b7c, 5548d, 5c678, d8fd5, c0e93, d3bc9, ff743, f695f, 72d56, 3056c, bb8dd, 36ae1, cd8a4, 999d7, 89145, fa7cc, d2b0e, f843f, 260e6, 8330e, 7b85b, cb52b, 55531, 959bd, fdd05, 16bec, 52681, 1944a, ff503, 38b54, b88c6, d8d27, cd367, 16a76, a4bf8, a2cc4, 276b2, 107be, 05d80, 18247, fab8f, e1111, 2b999, 3e0f8, 2b120, c30f6, 2898b, a48fe, 8cb85, 7ad87, 40680, 6d4df, e6b00, 3377f, 3d2ca, 025c4, 4f554, 82566, f4e69, 6e7d1, 9cf68, 41fdc, 84727, 9c315, b6188, 93627, 38bf3, 83443, 121b1, b798a, 90e49, fbb02, faf69, c4544, 81bad, 2fb90, a853f, 1c063, d6f20, da502, a8ce3, acda0, ee3b4, f8a46, 9a94c, 33161, 7a30a, 53508, 91ddd, ac0a7, 295a4, f11ef, ed521, 6c1d9, bbf5f, faa7c, 58f63, 891b0, 73d37, 65098, c232f, 63561, a8618, bfa9f, 4fd18, 94b71, 500ec, 9badf, 6c31a, d134e, 0fd7b, e2b74, 00117, 2b207, 55887, 1f9e4, b5517, 54c66, cb4ab, 9debd, ff69f, 3c5dd, 69305, af032, 509d4, 3da27, ecd55, d63f2, 6c91b, 6c856, c06f4, ff383, d822c, cad53, bc4aa, 916c7, e9aaf, 70702, b6995, c1257, 81eed, 6749c, a9685, 478a3, ce891, 8c0dd, 9a84b, ecb3b, d79d1, f96ff, 093ed, 9b8cb, ef7b0, 27342, a80b6, d6846, b643f, b4ee3, 5cf3a, dd503, ffd74, 85ffe, dee02, 924bb, c18d6, 4aed3, 31439, e360f, dd4e4, fb5b0, 9ec0f, b253d, 4d7e0, 7cb38, 9340f, b0fd1, 9493d, 38c6e, aba14, ff1e3, c7f0e, 4386e, 346fa, 8094a, 6aa28, 18028, e3f20, 404b7, 197de, a5af7, c2239, 8d4ec, b8f87, 0f48e, 921c2, 0b1dc, 7bdb5, 71c54, 8dd8f, d9c1e, 89609, d3059, dd1ba, 3838b, a3f90, d4752, 5a8ab, 6ae9b, 1c22e, 65852, c89d5, 772a2, 5962d, c2fee, 5218d, c0e14, 78ec8, cfb0d, a55c0, aae36, 92fae, 12f27, e84fa, 5c3c6, a8d32, 7601d, 71b7f, 292ff, 1e3b9, a7266, aa47f, 06742, 06203, 24fc3, 49383, 1f736, f6443, abe8c, 45211, 3959b, fb130, b4519, 340aa, 217e8, 513b1, ce51a, f2e7a
2. [ONNX] Bump torchlib opset to 22: This pull request aims to update the ONNX torchlib opset to version 22, involving multiple commits that address various aspects such as migrating torchlib into PyTorch, updating tests, fixing issues, and making several code improvements, although it was ultimately not merged.
- URL: pull/146510
- Merged: No
- Associated Commits: d3904, 837c0, caea1, 31fc5, 8c2c5, 77ed8, 00179, 3cf2e, 45049, f61eb, 07c7c, c263b, 4927f, c6737, 40153, 98e32, 702cc, f8f12, fcb0f, 1be91, 60094, b8a63, f590b, 24b7d, ad46e, 4289e, 1d488, 98579, c9ce7, 8cfef, e592a, ed814, 95b33, 0f7ac, 8e574, 1575d, 50f0c, cdeac, 90973, 13bcf, b3aab, fc259, 80506, 125c9, da3d0, bba25, 2b5bb, 61c98, de421, d44c0, 713e5, 67a32, 0637e, 4c0ad, 79838
3. [Intel GPU] qconv_pointwise.binary XPU support: This pull request aims to enable support for quantized fusion operations, specifically qconv+add and qconv+add+relu, on the Intel GPU backend by registering the operation via a specific schema and allowing the signed int8 data type during operation lowering. It also reuses existing code for pattern matching and provides unit test verification and runtime examples.
- URL: pull/135189
- Merged: No
- Associated Commits: 5f711, 08cde, e8098, eb1e5, 2fc0e, e9e3d, 9e906, d7500, 1dfc4, 0a39e, 26312, 2e6d7, e8221, c90b2, 97b78, fef7f, 0b227, eefe9, 281d9, a3391, d468e, b128d, ec34e, cb30e, 7c4de, 83fb6, ac9fd, e339f, 89086, 2574b, 3a1f7, 4f05b, e6115, 5a848, 76852, 346d6, 43978, b27a8, ea400, 65c14, e6322, 12234, 4a36f, c5edf, b44bf, 3b1ee, ed8ad, 2fa83, 5011d, c34e6, 161c1, 34d44, a1a32, 6a0e3
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
-
- Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange.)
- This GitHub conversation involves multiple users discussing a series of commits related to an 'export method'. User1 initially provides a solution, which User2 critiques, expressing dissatisfaction with its effectiveness. User3 attempts to mediate by suggesting improvements, but User1 responds defensively, leading to a tense exchange. The tone shifts from collaborative to confrontational, with User2 and User1 exchanging terse comments.
- [pytree] Register normal class to register_dataclass
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Repeated misunderstandings.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone is tense, with username3 attempting to mediate and suggest solutions. The conversation is marked by repeated misunderstandings and a lack of consensus, leading to increased tension.
-
- Toxicity Score: 0.55 (Frustration expressed, defensive responses, mediation attempts, escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 197 | 61 | 2 | 216 |
anijain2305 | 273 | 59 | 3 | 77 |
guilhermeleobas | 337 | 16 | 2 | 34 |
jansel | 208 | 26 | 2 | 119 |
zou3519 | 60 | 19 | 20 | 246 |
justinchuby | 141 | 23 | 8 | 142 |
benjaminglass1 | 241 | 14 | 0 | 41 |
Skylion007 | 46 | 20 | 3 | 205 |
eellison | 96 | 9 | 7 | 160 |
cyyever | 138 | 49 | 0 | 48 |