Weekly GitHub Report for PyTorch: February 18, 2025 - February 25, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default behavior of torch.load.
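As a quick illustration of the new stance control, here is a minimal sketch of how torch.compiler.set_stance can be used alongside torch.compile; the stance name "force_eager" follows the 2.6 release notes, and the function and shapes are made up for demonstration only:

```python
import torch

@torch.compile
def fn(x):
    return x.sin() + x.cos()

x = torch.randn(8)
fn(x)  # compiles on first call, then runs the optimized artifact

# Temporarily bypass compilation and run the original Python, e.g. while
# debugging (stance names here follow the 2.6 release notes; illustrative).
with torch.compiler.set_stance("force_eager"):
    fn(x)  # runs eagerly without triggering a new compile
```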
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [compile] Modularize very long compilation: This issue addresses the problem of a lengthy compilation process during model export/compile in a GitHub project, where a single generated C++ file with over 78,000 lines takes more than an hour to compile using only one core. The user suggests modularizing and parallelizing the compilation process to improve efficiency, as the current method lacks intermediate progress and is time-consuming.
- The comments discuss the potential causes of the issue, including the generation of a large Triton kernel and the need for modularization. Contributors suggest splitting the file into smaller parts for parallel compilation, though this may be challenging with the current architecture. There is also a mention of testing with lower optimization levels and the possibility of using subgraph handling to manage large models.
- Number of comments this week: 11
- [Export AOTI] dynamic_shapes export and compile degraded output: This issue involves a bug in exporting and compiling a model with dynamic shapes using PyTorch, where the output is degraded when dynamic width (W) and height (H) are used, compared to when they are fixed. The problem seems to be related to the use of torch.export.Dim for dynamic shapes, which causes runtime errors during inference unless the dimensions are aligned with the inference resolution (see the export sketch after this list).
- The comments discuss the difficulty in debugging the issue without a reproducible example, suggest testing subparts of the model, and mention a tool for minimizing accuracy issues. A runtime error is identified when using dynamic shapes, which is masked when using AOTI compile and package. A minimal reproduction is provided, and it is noted that the core issue might be an invalid graph produced during export, with AOTI and compile errors being secondary.
- Number of comments this week: 10
- [RFC] Test Cases Enabling for Accelerators: This issue addresses the challenge of enabling existing PyTorch test cases for new device backends, such as accelerators, by proposing a mechanism that dynamically determines which tests to run, skip, or adapt based on a device's specific capabilities. The proposed approach involves creating a unified device-capability abstraction, dynamic capability registration, and capability-based decorators to refine the test suite for handling multiple backends efficiently.
- The comments discuss extending OpInfo for device capabilities, aligning the proposal with ongoing work, and the potential benefits for both in-tree and out-of-tree backends. Questions are raised about the primary use case, adoption challenges, and compatibility across hardware. The proposal is seen as beneficial for third-party vendors, with plans to integrate device capabilities into existing test infrastructure.
- Number of comments this week: 9
- Triton pin update for PyTorch 2.7 / Triton 3.3: Upgrading PyTorch-Triton to a version that Supports Blackwell: This issue involves updating the PyTorch-Triton integration to support the Blackwell architecture by upgrading to a version of Triton that includes necessary optimizations and features. The update aims to address various technical challenges and ensure compatibility with the upcoming PyTorch 2.7 release, while also tracking related issues and potential improvements.
- The comments discuss the urgency of updating Triton to support Blackwell, with concerns about unresolved issues and the timing of the update relative to the PyTorch 2.7 release. Contributors highlight specific test failures and compatibility issues, propose solutions, and track additional related issues, emphasizing the need for coordination and careful planning to ensure a smooth transition.
- Number of comments this week: 8
- PyTorch VS2022 official build Windows binary illegal instruction on AVX2(max ISA level) CPU: This issue concerns a bug in the PyTorch official build for Windows using Visual Studio 2022, where an illegal instruction error occurs on CPUs with a maximum ISA level of AVX2 due to the generation of AVX512 instructions. The problem does not affect current PyTorch official binaries built with VS2019, and it is challenging to reproduce locally, suggesting it might be specific to the official build environment.
- The comments discuss potential solutions, including involving Microsoft, understanding the issue's scope across platforms, and maintaining AVX2 support due to its prevalence in client CPUs. There is a suggestion to revert to VS2019 if the issue persists, and a clarification that the proposal was not to drop AVX2 but to make it the new base architecture.
- Number of comments this week: 7
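For readers unfamiliar with the export API referenced in the dynamic-shapes issue above, the following minimal sketch shows the general torch.export.Dim call pattern; the module, dimension names, and bounds are hypothetical and unrelated to the reporter's model:

```python
import torch
from torch.export import Dim, export

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

# Hypothetical bounds chosen only to illustrate the API shape.
H = Dim("H", min=32, max=1024)
W = Dim("W", min=32, max=1024)

ep = export(
    Net(),
    (torch.randn(1, 3, 256, 256),),
    dynamic_shapes={"x": {2: H, 3: W}},
)

# The exported program can then be run at other resolutions within the bounds.
out = ep.module()(torch.randn(1, 3, 512, 384))
```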
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- DISABLED test_transformer_training_is_seq_parallel_False (__main__.DistTensorParallelExampleTest): This issue pertains to a disabled test, test_transformer_training_is_seq_parallel_False, within the DistTensorParallelExampleTest suite, which is failing on the main branch of a project using the ROCm platform. The failure is suspected to be caused by changes introduced in one of the pull requests #122995, #122996, or #122997, and several contributors and maintainers have been tagged for further investigation and resolution.
- [NestedTensor] multiply batch and ragged dimension to get shape of values tensor: This issue discusses a proposed feature for the PyTorch library that involves manipulating the dimensions of a NestedTensor by multiplying the batch and ragged dimensions to reshape the values tensor. The suggestion aims to enhance the flexibility of tensor operations by allowing users to collapse the first two dimensions of a NestedTensor, thereby facilitating more complex tensor manipulations.
- Error: command buffer exited with error status.: This issue describes a problem encountered while training a model using llama2.c on an iMac with an AMD Radeon Pro 5700 XT GPU, where the user experienced a "command buffer exited with error status" during the training process. The error, which occurred at epoch 11,580, was associated with significantly increased epoch times and GPU timeout errors, potentially linked to garbage collection or other factors, although the user was able to resume training without further issues.
- scalar_tensor call with symbolic bool input does not work in inductor: This issue involves a bug in the PyTorch library where the scalar_tensor function fails when called with a symbolic boolean input while using the Inductor backend. The error occurs during the execution of a compiled function, resulting in a TypeError due to an unexpected object type, which prevents the function from running successfully.
- Support AOT Autograd level Caching: This issue addresses the need for caching in the torch.compile process when using an aot-autograd enabled backend, as the current compilation time for models like Llama2 7B is significantly long, impacting development speed. The problem is particularly pronounced in the integration of PyTorch/XLA with VLLM, where the lack of support for dynamic shapes results in repeated compilations for different input shape combinations, causing delays in the warm-up phase (a brief sketch of this shape-specialization behavior follows this list).
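To illustrate the recompilation cost motivating the caching request above, here is a small hedged sketch; the backend choice, function, and shapes are arbitrary and only demonstrate that static-shape compilation specializes per input shape while dynamic shapes avoid it:

```python
import torch

def mlp(x, w):
    return torch.relu(x @ w)

w = torch.randn(64, 64)

# With static shapes, each distinct input shape can trigger a fresh
# compilation, which is the cost that AOT-Autograd-level caching targets.
compiled = torch.compile(mlp, backend="aot_eager", dynamic=False)
for seq_len in (8, 16, 32):          # three shapes -> up to three compiles
    compiled(torch.randn(seq_len, 64), w)

# Marking shapes dynamic avoids shape-specialized recompiles for this case.
compiled_dyn = torch.compile(mlp, backend="aot_eager", dynamic=True)
for seq_len in (8, 16, 32):
    compiled_dyn(torch.randn(seq_len, 64), w)
```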
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 96
Summarized Issues:
- API and Export Issues in PyTorch: The need for a robust API in torch.export is highlighted due to issues with treating certain inputs as constants, leading to ValueError during export. Additionally, dynamic shape export failures occur due to division by zero errors, complicating the decomposition path and raising questions about handling real-tensor and fake-tensor tracing paths effectively.
- Bugs in PyTorch's Dynamo and Compilation Process: PyTorch's Dynamo faces issues with constant tensors not recompiling correctly with device guards, leading to CUDA device failures. Furthermore, the torch.compile function fails with dict_items iteration, and dynamic shapes in export and compile processes result in degraded outputs and runtime errors.
- Precision and Performance Issues in PyTorch: PyTorch faces precision discrepancies with the polygamma function and NaN values during backward passes in neural networks. Performance regressions are noted in specific projects, with slower execution times in newer PyTorch versions.
- Backend and Device-Specific Bugs in PyTorch: Bugs are reported in PyTorch's MPS backend with scaled_dot_product_attention and clamp_ operations, causing crashes and inconsistent behavior. Additionally, issues with the ROCm backend and CUDA device errors are highlighted.
- Gradient and Memory Management Issues in PyTorch: PyTorch's gradient checkpointing feature does not reduce memory usage as expected, and memory allocator lock contention in templated GEMMs leads to performance degradation. These issues highlight the need for better memory management strategies (a minimal checkpointing sketch follows this list).
- ONNX Export and Conversion Issues in PyTorch: Significant precision drops and errors occur when exporting models to ONNX format, particularly with the sigmoid function and Graph Attention Networks. These issues suggest a need for improved conversion processes.
- Documentation and API Inconsistencies in PyTorch: Several documentation errors and API inconsistencies are noted, such as incorrect references in method docstrings and discrepancies in default behavior descriptions. These issues necessitate updates for clarity and accuracy.
- Sharding and Distributed Training Challenges in PyTorch: Issues with sharding strategies and distributed tensor operations are reported, including missing strategies for specific operators and challenges with asynchronous communication in NCCL process groups.
- Model Export and Compilation Errors in PyTorch: Errors occur during model export and compilation, such as torch.jit.trace failing with retinanet_resnet50_fpn() and torch.export.export encountering guard conditions. These issues highlight the need for robust export mechanisms.
- Feature Requests and Enhancements in PyTorch: Requests for new features and enhancements include implementing the L-BFGS-B algorithm, enhancing Dim.AUTO functionalities, and exposing NCCL API for runtime estimation. These requests aim to expand PyTorch's capabilities and improve user experience.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 57
Summarized Issues:
- Test Failures and Disabling Tests: This topic covers multiple issues related to the disabling of tests in the PyTorch project due to failures on the main branch. The tests test_real_imag_view_lazy_complex128 and test_flatten_nonview_xla were disabled in their respective suites due to consistent failures, with references to recent failure examples provided in the issues.
- Compilation and Export Errors: Several issues highlight problems with compilation and export processes in PyTorch. Users encountered errors when exporting models to ONNX, including a RuntimeError due to a tensor requiring gradients and an AttributeError related to dynamic shapes, complicating deployment on platforms like NVIDIA Triton.
- Bugs in PyTorch Functions: Various issues report bugs in PyTorch functions, such as torch.cholesky_solve triggering an internal assertion error and torch.randn producing identical values across dimensions on macOS. These bugs affect the expected functionality and require fixes to align with documentation.
- Performance and Optimization Concerns: Issues in this category discuss performance discrepancies and optimization needs in PyTorch. For instance, the torch.compile() function was found to be slower than expected, and there were suggestions to optimize certain functions to improve performance.
- Platform-Specific Errors: Some issues are specific to certain platforms, such as ROCm or macOS, where users encountered errors like core dumps during matrix multiplication or identical random values in tensors. These platform-specific issues require targeted solutions to ensure compatibility.
- Configuration and Regression Issues: Several issues involve configuration problems and regressions in PyTorch, where previously working functions fail due to recent updates. These issues often require reverting changes or adjusting configurations to restore functionality.
- Documentation and API Consistency: Some issues highlight discrepancies between documentation and actual API behavior, such as the torch.cuda.clock_rate() function's return value. Ensuring consistency and clarity in documentation is crucial for user understanding and correct usage.
- Security and Compliance: An issue addresses the need for FIPS compliance in PyTorch by modifying the hashlib.md5() function to prevent errors on systems enforcing FIPS modules, highlighting the importance of security compliance in software development (an illustrative snippet follows this list).
- Debugging and Logging Enhancements: Enhancements to debugging capabilities in PyTorch are discussed, such as introducing new logging options to facilitate easier debugging of intermediate representations, replacing older mechanisms for improved developer experience.
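For the FIPS compliance item above, a minimal illustrative snippet of the usual mitigation pattern; the exact change made in PyTorch may differ, and the cache-key data here is hypothetical:

```python
import hashlib

data = b"cache-key-material"

# On FIPS-enforcing systems, plain hashlib.md5(data) can raise because MD5 is
# not an approved algorithm. Declaring the non-security use keeps it working
# (the usedforsecurity flag exists since Python 3.9).
digest = hashlib.md5(data, usedforsecurity=False).hexdigest()
print(digest)
```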
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 163
Key Open Pull Requests
1. [test] 2: This pull request, titled "[test] 2," aims to address and fix an unspecified issue in the PyTorch project, as indicated by the placeholder "#ISSUE_NUMBER," and includes a series of 16 commits, each with the commit message "tc," which suggests a focus on testing or test-related changes, although it has not yet been merged.
- URL: pull/147470
- Merged: No
- Associated Commits: 03e29, 17679, 7b5b1, bd6cd, 1e50c, a324c, 2de74, 0a2e0, ed111, bcad8, 29c02, 32dc2, 4e88f, 15c40, 573b2, b8b97
2. cpp_wrapper: reduce memory usage by removing unneeded temporaries: This pull request aims to reduce memory usage in the cpp_wrapper by refactoring reinterpret_view calls to return temporary RAII tensor objects, thereby making the function's callers responsible for saving the handle when necessary, and eliminating unnecessary temporary tensor handles to align memory usage with the default inductor mode.
- URL: pull/147403
- Merged: No
- Associated Commits: 01424, 67582, eb4f8, 20c1a, a6f57, 1c1b4, aae0f, 3806a, ebf67, 91ceb, 4d5ed, 35f9d, b6bf5, 7deb4, 4ebac
3. [ONNX] Add draft_export as a strategy: This pull request introduces a new strategy called draft_export to the ONNX export process in PyTorch, which is positioned as the third fallback option, activated by setting the TORCH_ONNX_ENABLE_DRAFT_EXPORT environment variable, and is designed to specialize tensors without being less robust than the existing JIT trace strategy (a hedged usage sketch follows below).
- URL: pull/147529
- Merged: No
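A hedged sketch of how the draft_export fallback described above might be opted into, based solely on the environment variable named in the pull request; the tiny model and the dynamo-based exporter call are illustrative assumptions, not the PR's test plan:

```python
import os
import torch

# Opt into the proposed draft_export fallback before calling the exporter
# (hypothetical usage inferred from the PR description).
os.environ["TORCH_ONNX_ENABLE_DRAFT_EXPORT"] = "1"

model = torch.nn.Linear(4, 4)
example = (torch.randn(2, 4),)

# The dynamo-based exporter tries its strategies in order; with the flag set,
# draft_export would be available as an additional fallback.
onnx_program = torch.onnx.export(model, example, dynamo=True)
onnx_program.save("linear.onnx")
```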
Other Open Pull Requests
- MXFP8 and MXFP4 Support in PyTorch: This topic covers the introduction of blockwise MXFP8 support to the torch._scaled_mm function for CUDA devices, allowing dispatch to a blockwise kernel from cuBLAS. The pull requests also include enhancements for MX-FP8 matrix multiplications on AMD gfx950 devices with ROCm 6.5+, with plans for future updates to address MXFP4 support.
- Runtime and SACEstimator Modifications: The pull requests involve modifications to the RuntimeEstimator and SACEstimator in the PyTorch project, addressing issues such as fixing default arguments, binding issues, and linting problems. They also include testing fake utilities and collectives with memory trackers.
- Graph Break Hints in Dynamo: This topic introduces generic graph break hints to the Dynamo component of the PyTorch project, as part of a stack of changes. The pull requests include multiple updates and contributions from various collaborators.
- ROCm CK Kernel Updates: The pull requests update the ck_conv_template code generation for ROCm CK kernels by parameterizing previously hardcoded convolution parameters. This enhances flexibility and maintainability while reducing the number of generated templates.
- Handling Mismatched Outputs in PyTorch Inductor: The pull requests introduce support for handling mismatched outputs in the PyTorch inductor by extracting codegen_unbacked_symbol_defs from the FallbackKernel into a new method. This is specifically for conditional operations, with plans for future updates to extend this support to other operations like while_loop.
- ONNX Operations in PyTorch: The pull requests introduce the ability for users to utilize ONNX operations directly through torch.ops.onnx.*. They demonstrate an implementation for RotaryEmbedding with native PyTorch operators that integrate seamlessly with the existing ecosystem.
- Experimental Features in PyTorch: The pull requests introduce an experimental feature for delayed compilation in the PyTorch project, involving multiple updates and revisions. It is currently not merged.
- CUDA Graph Partition Feature: The pull requests implement a CUDA graph partition feature, building on a previous inductor graph partition PR. They include several commits such as recording mappings from partition input/output indices to graph indices, merging branches, and handling metadata partitioning.
- CacheBench Component Testing: The pull requests introduce a new test for the CacheBench component by adding a "ciflow/trunk" test to the PyTorch project. It is part of a stack of changes managed by ghstack and is currently not merged.
- Tensor Slice Overflow Fix: The pull requests address the issue of tensor slice overflow when the step value is near INT64_MAX by implementing a fix to prevent overflow in the calculation of slice output length. This is detailed in the commits and discussed in relation to issue #147071.
- ROCm Backend Boolean Value Fix: The pull requests address an issue in the ROCm backend of the PyTorch project by converting non-standard boolean values into standard boolean values. This ensures correct sorting operations and includes several commits for linting, unskipping unit tests, and fixing typos.
- Magma-Cuda References Removal: The pull requests involve removing references to magma-cuda from the readme.md file and refactoring the magma_conda installation process. This follows the migration of the magma-cuda build from Anaconda to AWS.
- Register Constant Usability in Exportz: The pull requests address the issue of making the register constant usable in the "exportz" functionality of the PyTorch project. They involve multiple updates and revisions as indicated by the series of commits and differential revision link provided.
- Unbacked Renamings in Export Process: The pull requests aim to eliminate the use of unbacked renamings in the export process by introducing a new pass in _produce_aten_artifact to recompute unbacked bindings. This ensures that unbacked binding keys remain synchronized with example values and improves compatibility with de/serialization.
- Test Submission Using Ghstack: The pull requests are a test submission, not intended for actual merging, created using the ghstack tool. They involve multiple commits with placeholder messages, while tagging several contributors for notification.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: The pull requests introduce a new templatized CUDA kernel designed to replace three existing non-ROCM CUDA kernels for the GammaBeta backwards pass. They address performance issues by optimizing warp shuffles, coalesced loads, and parallelism across the M dimension.
- ONNX Operation Decomposition Migration: The pull requests aim to migrate ONNX operation decomposition functions from onnxscript to PyTorch. This decouples torch.onnx from implementations in onnxscript, with necessary refactoring and test scaffolding provided in related pull requests.
- Cutlass Backend Matrix Multiplication Tests: The pull requests introduce main tests for matrix multiplication (mm), addition and multiplication (addmm), and batch matrix multiplication (bmm) within the Cutlass backend of the PyTorch project.
- Dynamo Error Message Enhancements: The pull requests aim to enhance the error messages in the Dynamo component of the PyTorch project. They are part of a series of improvements and include multiple commits with updates and contributions from various collaborators.
- Poison Fork Documentation for Accelerator APIs: The pull requests aim to document a note regarding "poison fork" for accelerator APIs in the PyTorch project. They are part of a stack of changes managed by ghstack and involve multiple updates marked as "[ghstack-poisoned]".
- GaussianNLLLoss Variance Input Size Fix: The pull requests address issue #147521 by modifying the GaussianNLLLoss function to allow any size of variance input as long as it is broadcastable to the input or target's size. This ensures that the demo code in the issue results in the expected behavior and correct output (an illustrative call sketch appears at the end of this list).
- Continuous Integration Optimization: The pull requests aim to optimize the continuous integration process by utilizing more CPU processes during the checkout phase. They include testing changes from the main branch to a specific branch identified by the commit hash 249a936998e66cc0d6ad8664e0e93ec1b9432a8b.
- ReplicationPad Bool Data Type Handling: The pull requests address the issue of aligning the replicationpad function's handling of the bool data type with the eager execution mode in the PyTorch project. This is part of fixing issue #143779.
- Gather and Scatter Object List Fixes: The pull requests address a fix for the gather_object and scatter_object_list functions in the PyTorch project. They ensure that the destination and source ranks are correctly based on the global process group, regardless of the group argument.
- DeviceMesh.get_group Argument Support: The pull requests introduce support for passing arguments to the DeviceMesh.get_group function in the PyTorch project. They include adding tests and updating relevant files like test_dtensor_compile.py and distributed.py.
- Sparse Tensor Validation: The pull requests address the validation of sparse tensors constructed via a legacy constructor in PyTorch. They highlight issues such as size inconsistencies and storage size calculation overflows during the torch.load process.
- FSDP Tests on XPU Device: The pull requests aim to enable Fully Sharded Data Parallel (FSDP) tests on the XPU device within the PyTorch project. They involve multiple commits such as the implementation of an abstracted API to retrieve the backend and adjustments based on review comments.
- Torch.isin Function Decomposition Fixes: The pull requests address two decomposition issues in the torch.isin function within the PyTorch project. They specifically fix the lack of support for scalar test_element and resolve discrepancies in results produced by Inductor compared to eager mode.
- Collective Recomputations in Partitioner: The pull requests propose to always disable the compiler-driven recomputation of collectives by default in the partitioner. This prevents inconsistencies and potential hangs in distributed jobs, with future plans to introduce an spmd_mode flag for safe collective recomputation.
- Export Method Introduction: The pull requests introduce an "export method" to the PyTorch project, as part of a stack of changes managed by ghstack. They include multiple commits refining the implementation, although it has not yet been merged.
- Myst_nb Compile Tutorial Demonstration: The pull requests introduce a demonstration of using myst_nb with a compile tutorial in the PyTorch project. They are indicated by the title and multiple commits refining the demonstration.
- Log2 and PowByNatural Printing Issues: The pull requests address issues related to the printing functionality of the log2 and PowByNatural operations in the PyTorch project. They include multiple commits with updates, although they have not yet been merged.
- Dynamic Indices Type in Torch.sort: The pull requests introduce an optimization to the torch.sort function by implementing a dynamic_indices_type option. This dynamically determines the data type of indices to reduce excessive memory usage, supporting data types such as Byte, UInt16, UInt32, and UInt64.
- Backwards Indexing Enhancements: The pull requests aim to enhance the functionality of backwards indexing in the PyTorch project specifically for cases where the stride is not equal to one. They involve collaboration with several contributors from the ROCm team.
- Inductor C++ Code Generation Bug Fix: The pull requests address a bug in the inductor C++ code generation for a custom operation in PyTorch. They ensure that a list containing a single tensor with an unbacked symbolic integer shape does not result in a type/value mismatch error during template parameter deduction.
- Mixed Precision Fused Adam Optimizer: The pull requests propose an implementation of a mixed precision fused Adam optimizer for the PyTorch project. They are part of a stack of changes managed by ghstack, although they have not yet been merged.
- CachingHostAllocator Memory Statistics: The pull requests introduce an initial implementation of host memory statistics for the CachingHostAllocator in the PyTorch project. They aim to facilitate the diagnosis of performance slowdowns by gathering memory allocation data without significantly altering the allocator's original design.
- OneDNN Primitive Cache for Int4 GEMM: The pull requests introduce an enhancement to the PyTorch project by adding a oneDNN primitive cache specifically for int4 GEMM operations on XPU. They include an example of int4 GEMM migrated from IPEX.
- Sparse Tensor Validation in Torch.load: The pull requests aim to enhance the PyTorch library by adding sparse tensors constructed via a legacy constructor to the _sparse_tensors_to_validate list. This ensures they are validated at the end of the torch.load process.
- Mutation Analysis in Triton Compiler: The pull requests address an issue in the mutation analysis of scf.if and scf.for operations within the Triton compiler. They introduce separate scf.yield operations for each yield argument, preventing the incorrect marking of all yield arguments as mutated.
- Storage Offset Overflow Checks: The pull requests address issue #145259 by adding two overflow checks to the storage offset calculation in aten/src/ATen/native/Resize.h. This prevents crashes and incorrect tensor returns when using large storage offsets in PyTorch's as_strided function.
- Scheduler Code Refactoring: The pull requests involve minor refactoring of the scheduler code in the PyTorch project. They include changes such as using a default dictionary and cleaning up the log fusion function as part of ongoing code improvements.
- GuardManagers Reference Change: The pull requests propose a change in the PyTorch project to maintain a reference to the parent instead of the root within GuardManagers. They are part of a stack of changes managed by ghstack and are currently unmerged.
- Dict_tag Optimization Disabling: The pull requests propose to disable the dict_tag optimization in ancestor nodes when the ancestor is not common. They are part of a stack of changes in the PyTorch project and include two commits with updates marked as "[ghstack-poisoned]".
- Flip Operation Memory Corruption Fix: The pull requests address a memory corruption issue in the flip operation for torch.quint4x2 and torch.quint2x4 inputs. They implement a runtime error check for these deprecated data types and include a test plan to verify the change.
- Triton Autotune Configuration Heuristic: The pull requests aim to reintroduce a previously reverted change that introduces a new template heuristic for Triton autotune configurations. They remove additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py to address compile time regressions.
- Torch.polygamma() Function Consistency: The pull requests address an issue with the torch.polygamma() function when n == 1 by ensuring consistency with the CPU kernel. They include two commits aimed at resolving this problem.
- Sym_not Function in ONNX Module: The pull requests aim to implement the sym_not function in the ONNX module of the PyTorch project. They address issue #136572 and are part of a stack of changes managed by ghstack.
- CUDA Device Index Guard Mechanism: The pull requests introduce a guard mechanism for the CUDA device index in the PyTorch project. They ensure that operations are correctly managed across different CUDA devices.
- Device Check Logic Consistency: The pull requests address a bug reported in issue #144748 by modifying the device check logic in the PyTorch codebase. They ensure consistency between eager mode and inductor mode by aligning the behavior of the find-common-device method in fake_tensor.py with the device check in adaption.h.
- Outer Loop Fusion Heuristics Optimization: The pull requests aim to enhance the performance of the PyTorch project by optimizing the heuristics used in outer loop fusion. They are indicated by the title and the associated commits.
- Torch.utils.tensorboard Export Fix: The pull requests address the issue of certain classes not being exported from the torch.utils.tensorboard module. They define the __all__ attribute to explicitly specify the public interface, ensuring that classes like FileWriter, RecordWriter, and SummaryWriter are properly recognized and accessible.
- Import Fix in torch/_inductor/debug.py: The pull requests address an issue with the import of getArtifactLogger in torch/_inductor/debug.py for ir_pre_fusion and ir_post_fusion. They ensure the import is complete and set the logging to off_by_default to minimize excessive logging.
- Precompile Cache Utilization Check: The pull requests address the issue of ensuring that the system checks if the force_disable_caches flag is set before utilizing the precompile cache. They are part of a series of commits in the PyTorch project and involve multiple contributors for review and collaboration.
- Inductor Cache Selection Algorithm: The pull requests introduce a new algorithm for selecting caches in the fresh inductor cache. They are part of a stack of changes and include discussions and reviews from multiple contributors.
- LazyLinear Module Abnormal Behavior Fix: The pull requests address the abnormal behavior of the LazyLinear module in PyTorch when it is used in conjunction with load_state. They update the logic of the initialize_parameters function and add new test cases.
- End-to-End Control Plane Flex Attention: The pull requests are an experimental attempt to implement end-to-end control plane (cp) flex_attention within the PyTorch project. They involve multiple commits and collaboration among several contributors.
- Third-Party ONNX Build Process Enhancement: The pull requests aim to enhance the build process of third-party ONNX by removing unnecessary options and addressing a missing dependency. They are indicated by the commits and their association with a specific issue in the PyTorch project.
- Triton XPU Build Process on Windows: The pull requests aim to enable the Triton XPU build process on Windows for the PyTorch project. They are indicated by the title and the commits associated with it.
- AOTD System Output Classification Bug Fix: The pull requests address a bug reported by an internal user in the PyTorch project, where the AOTD (Ahead Of Time Dispatch) system incorrectly classified outputs that are aliases of intermediates in a computational graph. They propose a solution by adding runtime unwrapping to ensure that the base of a detached alias is consistently tracked back to its original tensor.
- Block Radix Sort Performance Enhancement: The pull requests aim to enhance the performance of block radix sort for certain shapes in the ROCm backend by reducing the items processed per thread to 8. This increases the thread block size and achieves higher occupancy.
- Matmul Small Brute Force Tunableop Test Speedup: The pull requests aim to speed up the unit test for the matmul_small_brute_force_tunableop by reducing its execution time by over 20 minutes. They include refactoring such as moving a hipBLASLt version check to a different test for simplicity.
- Ruff Rule S324 Enablement: The pull requests aim to enable the ruff rule S324 by adding it to the pyproject.toml file. They address issue #147627 and include running a lintrunner check across all files to clean up warnings.
- Manual Dynamism Whitelist Introduction: The pull requests introduce a "manual dynamism whitelist" to the PyTorch project. They involve multiple commits and contributors and are part of a stack of changes managed through the ghstack tool.
- Broken Link Fix in PyTorch Documentation: The pull requests address a broken link issue in the PyTorch documentation by updating a reference to the NumPy documentation. They ensure it correctly redirects to the current NumPy documentation site.
- RandomBatchSampler Performance Enhancement: The pull requests propose a performance enhancement by merging RandomSampler and BatchSampler into a new RandomBatchSampler. They utilize slicing instead of iteration to output indices, resulting in significant speed improvements.
- Intel GPU TestCommon::test_dtypes Skipping: The pull requests aim to skip the Intel GPU TestCommon::test_dtypes test for the bmm and addbmm operations due to the lack of complex64 support. They also extend the DecorateInfo to accommodate a list of device types.
- Process Group Without Parameters Fix: The pull requests address and fix an issue in the PyTorch project where a process group (PG) without parameters was causing problems. They are referenced in issue #143828 and include updates tracked through the ghstack tool.
- Normal Classes as Dataclasses in Pytree: The pull requests address a discussion from a previous pull request by modifying the PyTorch codebase to allow normal classes to be registered as dataclasses within the pytree module. They are indicated by the commits and the linked discussion.
- NCCL Memory Pool Use Condition Restriction: The pull requests introduce a restriction on the use condition of the NCCL memory pool by adding a check to determine if the CUDA driver supports multicast. This is similar to the implementation in Symmetric Memory and is part of a stack of changes managed by ghstack.
- FlexAttention Module Error Messaging: The pull requests address the issue of inadequate error messaging in the FlexAttention module by adding explicit error messages for cases where the embedding size is less than 16. This aids users who are experimenting with small tensor sizes.
- FSDP Wrapped Module Zero Argument Bug Fix: The pull requests address a bug in the Fully Sharded Data Parallel (FSDP) wrapped module related to a zero argument. They implement a fix and add a unit test, while also removing the skip_if_lt_x_gpu condition.
- Inductor Component Casting Logic Rework: The pull requests rework the casting logic in the Inductor component of the PyTorch project to avoid illegal bitcasts. They address issues introduced by Triton's checks on bitcasts where the casted value does not fit into the casted type.
- PT2 Compiler Boolean Type Handling: The pull requests address issues with the PT2 compiler's handling of boolean types in wrapped functions. They add explicit tests to determine if data is of type i1 and include a test added to test_triton_kernels.py to ensure compatibility with existing infrastructure.
- NVTX3 Include Directory Hints: The pull requests address the issue of CMake struggling to locate NVTX3 by adding hints to the USE_SYSTEM_NVTX configuration for the NVTX3 include directory. They are detailed in the commit found at https://github.com/pytorch/pytorch/commit/a3c4572bf250ccdde8bdcdcbf642a1cb16bdd113.
- NCCL Communication for Uint64 Tensor Types: The pull requests aim to modify the PyTorch library by enabling NCCL communication to support uint64 tensor types. This is particularly important for applications in cryptography and privacy computing.
- MKLDNN Backend Availability API: The pull requests aim to introduce an is_available API for torch.backends.mkldnn, similar to existing APIs for torch.backends.mkl and torch.backends.openmp. This allows users to check the availability of the MKLDNN backend in PyTorch.
- Test_transformers.py File Splitting: The pull requests propose splitting the existing test_transformers.py file into two separate files, test_transformers.py and test_transformers_privateuser1.py. This addresses the issue of skipped privateuse1 test cases that currently conflict with CUDA test cases.
- Once Flag Removal and Static Initialization: The pull requests aim to enhance the PyTorch project by removing the unnecessary usage of the "once flag" and replacing it with static initialization. They are part of a series of changes and are linked to a specific issue for tracking and collaboration with multiple contributors.
- XPU Build Process with Visual Studio 2019: The pull requests aim to modify the build process for XPU by enabling the use of Visual Studio 2019. They are part of an effort to address a specific issue referenced in the project and involve collaboration with multiple contributors.
- 128-bit Vectorization Reversion Draft: The pull requests are a draft that aims to revert a previous commit related to the implementation of 128-bit vectorization in the ATen CUDA component of the PyTorch project. They address an unspecified issue.
- RNN Example Code Correction: The pull requests address an issue in the PyTorch documentation by correcting the RNN example code to properly handle multiple layers. They ensure that only the first layer takes the input vector while subsequent layers use the hidden state from the previous layer.
- Ptxas Warnings Resolution: The pull requests address and resolve numerous ptxas warnings during the build process by aligning the thread count for sm_120 with the CUDA C programming guide's specification of a maximum of 1536 threads per SM.
- Cub Iterators Replacement with Thrust Iterators: The pull requests aim to update the PyTorch project by replacing deprecated cub iterators with thrust iterators. This is due to recent changes in the CCCL (cub) development, while acknowledging potential impacts on ROCM usability.
- Dynamo Methods Type Annotations: The pull requests aim to enhance the type annotations for dynamo methods in the PyTorch project. They are indicated by the title and commit message and involve several contributors mentioned in the body.
- UBSAN Test Enablement: The pull requests aim to enable the Undefined Behavior Sanitizer (UBSAN) test in the PyTorch project. They address a specific issue referenced as #ISSUE_NUMBER and include a single commit with the message "Enable UBSAN test."
- ASAN Support for CUDA: The pull requests aim to enable AddressSanitizer (ASAN) support for CUDA in the PyTorch project. They are indicated by the title and commit message and involve collaboration with several contributors mentioned in the body.
- Aten.as_strided.default Operation Introduction: The pull requests introduce the aten.as_strided.default operation to address the FakeTensor propagation error identified in issue #145353. They demonstrate an alternative approach in pull request #147517.
- Flex Attention Function Registration: The pull requests attempt to register the flex_attention function to a custom function on DTensor within the PyTorch project. They encounter a runtime error related to the use of FunctionalTensor without a corresponding FunctionalTensorMode.
- Flex Attention Custom Dispatch Mode: The pull requests attempt to register the flex_attention function to a custom function within a custom dispatch mode in the PyTorch project. They encounter a NotImplementedError due to the absence of a registered rule for handling the flex_attention operation with the DTensor subclass.
- Flex Attention Custom Function Dispatch: The pull requests aim to experiment with registering the flex_attention function to a custom function on DTensor within a custom dispatch mode. They allow for successful dispatch of flex_attention in a given context to the custom CP flex_attention.
- Pybind11 Submodule Update Test: The pull requests update the pybind11 submodule to version 3.0.0-dev as a test. They address an unspecified issue and include a single commit with the message "Update pybind11 submodule to 3.0.0-dev test."
- Setattr Function KeyError Proposal: The pull requests propose raising a KeyError when the setattr function is called on a Module instance in PyTorch and a class attribute already exists. They address a silent error where users might incorrectly assume that the setattr operation was successful when it was not.
- MPS Integer Matmul Kernel Optimization: The pull requests aim to optimize the integer matrix multiplication (matmul) kernel for Metal Performance Shaders (MPS) on macOS. They improve performance through reduced global memory accesses, with a focus on enhancing efficiency for large matrices.
- Elementwise Kernel Input Vectorization: The pull requests introduce input vectorization in elementwise kernels for tensors with heterogeneous types. They specifically demonstrate its application for input tensors with types (float, bfloat16) when the functor type is float(float, float).
- Torch.compile Fullgraph Models API: The pull requests introduce a new API for the torch.compile function that allows for the compilation of fullgraph models using a C++ wrapper. They enable the saving and loading of compiled artifacts to disk through a "sticky cache" mechanism.
- Attention Mechanism for Tensors with More Dimensions: The pull requests address issue #147443 by fixing the attention mechanism for tensors with more than four dimensions. They include the addition of relevant tests to ensure functionality.
- SymPy Floating-Point Number Printing: The pull requests address an issue with the printing of floating-point numbers in the SymPy library within the PyTorch project. They are part of a stack of changes managed by ghstack and are linked to a previous pull request #147261.
- Unused-Value Issue in CUDAHooks.cpp: The pull requests address an unused-value issue in the file caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp. They modify the code to eliminate unnecessary values that trigger the -Wunused-value warning in LLVM.
- Pattern Matcher Guard Replacement: The pull requests propose replacing the use of guard_size_oblivious with statically_known_true in the pattern matcher. They aim to avoid adding unnecessary guards, as detailed in the commit and supported by an internal discussion link.
- ShapeAsConstantBuffer Transfer Mechanism: The pull requests involve implementing a mechanism to transfer a ShapeAsConstantBuffer from a subgraph to the main graph output in the PyTorch project. They handle a symbolic integer returned by the inner subgraph and subsequently by the forward graph after partitioning.
- System Random State Handling: The pull requests aim to improve the handling of the system's random state in the PyTorch project by carefully saving and restoring it. They mark the third attempt to address the issue outlined in a previous discussion on GitHub.
- Compile_tasks.py Unused Functions Removal: The pull requests aim to clean up the codebase by removing unused functions from the compile_tasks.py file in the PyTorch project. They are indicated by the non-functional change (NFC) label in the commit message.
- Mark_traceable Feature on Class Methods: The pull requests introduce support for the mark_traceable feature on class methods in the PyTorch project. They include a new test called test_mark_traceable_on_method and additional comments explaining the necessity for special handling of methods.
- Global or Captured Tensors in Mark_traceable: The pull requests address the issue of supporting reads to global or captured tensors within functions marked as mark_traceable. They introduce a global FakeTensorTLS with an allow_non_fake_inputs_override flag to temporarily adjust the flag during execution.
- CachingAutotuner on Meta Device: The pull requests address issue #146018 by improving the handling of the CachingAutotuner on the meta device. They fix size inference issues, ensuring that dynamic shape handling functions correctly when multiple calls with different tensor sizes are made.
- Function Signatures Refactoring with ParamSpec: The pull requests refactor function signatures in the PyTorch project by replacing *args: Any and **kwargs: Any with ParamSpec. They enhance type safety, improve static type checking with tools like mypy, and maintain code quality by preserving argument information.
- Triton Kernel Grid Handling Simplification: The pull requests aim to simplify grid handling in Triton kernel calls by removing the need to pass the grid as a callable argument. They incorporate grid computation directly within the kernel launcher, enhancing performance by reducing function calls.
- Unique User Kernel Names in Triton: The pull requests introduce a feature called unique_user_kernel_names to provide unique naming support for user-defined Triton kernels. They enhance control over naming and generation processes, primarily for debugging purposes.
- XCCL Backend Build Definitions: The pull requests introduce the definitions of USE_C10D_XCCL and USE_XCCL in PyTorch to enable the building of the XCCL backend. They are similar to the existing support for NCCL, with the default setting for USE_XCCL being OFF unless explicitly set to ON.
- OffsetBasedRNGTracker Default Device Fix: The pull requests address the issue of setting the default device type to CUDA when the OffsetBasedRNGTracker is called without arguments. They pass the device information explicitly as part of a fix for issue #147584 in the PyTorch project.
- OpDTypes.any_common_cpu_cuda_one Documentation: The pull requests introduce documentation for the OpDTypes.any_common_cpu_cuda_one feature in the PyTorch project. They are indicated by the commit message and the title and are linked to a specific issue for resolution.
- CUDA 12.8 Binaries sm_70 Architecture Deprecation: The pull requests propose the deprecation of the sm_70 architecture for CUDA 12.8 binaries in the PyTorch project. They are part of a follow-up to a previous pull request due to the feature-complete status and impending freeze of architecture support for Maxwell, Pascal, and Volta.
- Intel Gaudi Devices Support in test_misc.py: The pull requests adapt the test_misc.py file to support Intel Gaudi devices (HPUs) by extending CUDA tests to operate on these devices. They ensure compatibility without affecting existing CUDA tests and include the use of a skipIfHPU decorator.
- Replace_pattern Function Docstring Correction: The pull requests address a minor mistake in the docstring of the replace_pattern function within the PyTorch project. They are referenced in issue #147610 and include a single commit updating the subgraph_rewriter.py file.
- SDPA on XPU Backend Enablement: The pull requests aim to enable SDPA on the XPU backend as part of the OneDNN Upstreaming plan. They involve the addition of an Attention.cpp file and a Graph.h for OneDNN graph utilities, along with modifications to test cases in test/xpu/test_transformers.py.
- OneDNN Component Merge Rules Update: The pull requests aim to update the merge rules for the oneDNN component in the PyTorch project. They are part of a stack of changes managed by ghstack and are currently open and not yet merged.
- Podman Build Process Documentation: The pull requests document the automated build process of Podman with upstream patches applied to address specific issues encountered on s390x runners. They are detailed in the commit found at https://github.com/pytorch/pytorch/commit/5e4db89b85d6ee086582d2dfae5af2a004345458.
- ROCm Split_scan Support Enablement: The pull requests enable split_scan support for ROCm builds in the PyTorch project. They address issue #133228 by removing the condition that previously prevented this support.
- Triton Tests Force_shape_pad Option: The pull requests enable the force_shape_pad option for Triton tests in the test_kernel_benchmark. They address issues where padding paths are slower on ROCm architectures, ensuring that the tests focus on verifying the correctness of padding.
- Set_driver_to_gpu Code Update: The pull requests update the set_driver_to_gpu code in the PyTorch project to prevent backend re-initialization issues when using the new Triton. They are indicated by the commit signed by Anatoly Myachev.
- Hugging Face Checkpoints Storage Reader/Writer: The pull requests aim to build a storage reader/writer to enable writing checkpoints in the Hugging Face (HF) format for non-distributed use cases. They address previous lint errors by explicitly ignoring them due to the intentional absence of certain library installations.
- Backend_type_map Removal from Backend: The pull requests aim to remove the backend_type_map from the Backend in the PyTorch project. It is no longer used for determining the default device for object collectives or barriers, and the author is awaiting continuous integration (CI) test results to ensure that this change does not introduce any issues.
- Torch/_inductor/ir.py Unnecessary Changes Reversion: The pull requests revert unnecessary changes made to the torch/_inductor/ir.py file in a previous update (#146917). They address issues with CUDA tests not passing due to an oversight in syncing environments across different machines.
- Test_halide.py Script Enhancement: The pull requests aim to enhance the test_halide.py script by adding functionality to report the command needed to re-run any failed tests. They improve the debugging process for developers working on the PyTorch project.
- MPS Binary Operations Metal Kernel: The pull requests aim to implement a metal kernel for MPS binary operations using TensorIterator. They update and reimplement a previous pull request to help resolve a specific issue in the PyTorch project.
- Rowwise Scaling Tests Skipping on SM100+: The pull requests propose to temporarily skip the rowwise scaling tests on SM100+ architectures in the PyTorch project. They are due to the current lack of implementation and are further discussed with several contributors in the body of the request.
- TCPStore Error Handling Enhancements: The pull requests aim to enhance error handling in the TCPStore and TCPStoreLibUvBackend components of the PyTorch project. They replace generic TORCH_CHECK calls with typed exceptions, improving the specificity of error messages that are raised as RuntimeErrors in Python.
- C++ Pytree Compile Time Assessment: The pull requests are an experimental change aimed at assessing the compile time when using C++ pytree in the PyTorch project. They are indicated by the title and commit message and have not yet been merged.
- Unbacked Bindings in .module() Result: The pull requests aim to ensure that the .module() result in the PyTorch project does not contain unbacked bindings. They are associated with Differential Revision D70022208.
- Addmm Tests Input Range Restriction: The pull requests aim to restrict the input range for addmm tests in the cuBLAS library. They address cancellation issues with larger sizes, enabling testing with tighter tolerances.
- Windows CUDA Wheel and Libtorch CI Testing: The pull requests are focused on testing the continuous integration (CI) process for Windows CUDA wheel and libtorch in the PyTorch project. They are indicated by the title and the associated commit message.
- FIPS Compliance with RUFF Linter: The pull requests aim to enforce full FIPS compliance by adding rule S324 to the RUFF linter in the PyTorch project. They are indicated by the title and the commit message and include a command for testing the changes.
- Pdist_forward Function Error Checking: The pull requests address issue #145064 by adding error checking to the _pdist_forward function in PyTorch. They prevent segmentation faults when iterating over an empty tensor, verified through updated test cases that now raise a RuntimeError instead of causing a crash.
- Use_relative_path Option Renaming: The pull requests involve refactoring by renaming the option from use_absolute_path to use_relative_path, which more accurately reflects its function of compiling a C++ file using its basename rather than its full path.
- CppBuilder.build Function Consolidation: The pull requests refactor the code by replacing the run_command_and_check function with CppBuilder.build. They consolidate the C++ compilation action within the PyTorch project.
- Triton_heuristics.py Grid Overwrite Bug Fix: The pull requests address a bug in the triton_heuristics.py file where args_with_constexprs incorrectly overwrites the grid. They add a check to ensure the correct number of arguments are passed to the launcher, enhancing error handling and preventing unexpected failures during Triton kernel execution.
- Dynamo Dictionary Tag Optimization Disabling: The pull requests propose to disable the dictionary tag optimization in the Dynamo project when the guard manager has child accessors. They are indicated by the title and commit message.
- Cpp_extensions Module Ninja Build Error Messaging: The pull requests aim to enhance the error messaging related to missing Ninja build system in the cpp_extensions module of the PyTorch project. They are indicated by the commit message and the associated URL.
- JIT Version Checking Bug Fix: The pull requests address a bug in the version checking mechanism for JIT in Python 3.10. They ensure that a feature is only enabled for version 3.11, as identified by a static linter, and include a commit to correct this issue.
- Gen_patterns.py Script TypeError Fix: The pull requests address a crash issue encountered when running the gen_patterns.py script in the PyTorch project. They specifically fix a TypeError related to the issubclass() function, resolving the error.
- OpenBLAS Multiple OpenMP Runtimes Issue: The pull requests address an issue in PyTorch where building with OpenBLAS support could lead to multiple OpenMP runtimes being linked in libtorch_cpu.so. They ensure that libomp.so is not linked if OpenBLAS is already linked against libgomp.so.
- Intel Triton Component Update for Release 2.7: The pull requests aim to update the Intel Triton component within the PyTorch project to be compatible with the upcoming release 2.7. They are indicated by the work-in-progress status and the associated commit.
- Triplet_margin_loss Test Tolerance Update: The pull requests update the CPU tolerance levels for the nn.functional.triplet_margin_loss test in test_torchinductor_opinfo. They prevent failures on AArch64 by increasing the acceptable absolute and relative tolerances (ATOL and RTOL) for F16.
- Torch-xpu-ops Commit Update: The pull requests update the torch-xpu-ops commit to a specific commit hash, 306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7. They incorporate a bug fix for LayerNorm and Nonzeros, as well as an update to the AOT target, and are currently not merged.
- Torchgen Tool Enhancement for C Shim Files: The pull requests propose an enhancement to the torchgen tool by enabling it to automatically update C shim files with a version number and a list of new arguments for modified operations. They address the backward compatibility issue that arises when adding new arguments with default values to fallback operations in Python.
- Inductor Test_kernel_benchmark.py Script Fix: The pull requests address an issue in the PyTorch project by fixing the inductor/test_kernel_benchmark.py script. They accommodate changes in the new Triton version by preventing the duplication of parameters in the _dump_launch_params function.
- P1 INT16 Full Quantization Target MVP: The pull requests introduce a minimum viable product (MVP) for the P1 INT16 Full quantization target. They involve quantizing the input to int16 as part of the PyTorch project.
- Partitioner Component Print Statements Removal: The pull requests involve removing print statements from the partitioner component of the PyTorch project. They are part of a series of changes tracked through the ghstack tool and are currently open and not yet merged.
- Search Survey Link Removal: The pull requests aim to remove a link to a search survey from the PyTorch project. They are indicated by the commit message and the involvement of contributors tagged in the discussion.
- Distributed Checkpointing Protocol Rank Local Checkpointing: The pull requests are a work in progress aimed at demonstrating rank local checkpointing in the Distributed Checkpointing Protocol (DCP) for the PyTorch project. They are not yet ready for review.
- Compile_fx_aot Logging Context Managers: The pull requests introduce context managers in the compile_fx_aot function to enhance logging. They add a toplevel Chromium event (tlparse) and a single dynamo_compile log entry, improving traceability and visibility of events in both Scuba and Perfetto trace tools.
- Gfx1102 Architecture Support in Wheel Builds: The pull requests aim to add support for the gfx1102 architecture to the wheel builds in the PyTorch project. They utilize the --offload-compress option to accommodate another graphics target, as the relevant code objects have been included since ROCm 5.5.
- Torch.compile Path Priority Order Respect: The pull requests address the issue where the torch.compile path in the PyTorch project was not respecting the priority_order setting of sdpa_kernel. They ensure that the context manager handling within torch.compile now properly acknowledges and applies this configuration (see the sketch after this list).
- Torch.float8_e8m0fnu Feature Testing: The pull requests are focused on testing the implementation of the torch.float8_e8m0fnu feature in PyTorch. They are indicated by the title and the involvement of multiple reviewers and subscribers.
- Cpp_builder Clang++ Detection Bug Fix: The pull requests address a bug in the cpp_builder where the detection mechanism incorrectly identifies clang++ as g++. They include a fix to ensure proper differentiation between the two compilers (see the sketch after this list).
- CK Backend for Memory-Efficient Attention in ROCm: The pull requests introduce the CK backend for memory-efficient attention in ROCm. They enable the use of attention bias while noting that it is still activated via torch.backends.cuda.preferred_rocm_fa_library("ck") and does not support Nested Tensors.
- Non-Functional Collectives Support in FakeTensorMode: The pull requests aim to enhance the PyTorch project by adding support for non-functional collectives under FakeTensorMode and fake_pg. They improve memory tracking capabilities.
- Layernorm CUDA Backwards Pass Test: The pull requests introduce a new test to ensure the accuracy of the layernorm CUDA backwards pass. They serve as a foundational step towards future performance improvements (see the sketch below).
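To illustrate the FIPS/RUFF item above, the snippet below is a generic sketch of the kind of code a bandit-style insecure-hash rule such as S324 flags, along with FIPS-friendlier alternatives. It is not taken from the pull requests, and the exact scope of the rule is as documented by RUFF.

```python
import hashlib

data = b"example payload"

# The kind of call an insecure-hash rule flags: MD5 used without declaring it
# non-security-relevant. On a FIPS-enforcing build this can also fail at runtime.
weak_digest = hashlib.md5(data).hexdigest()

# FIPS-friendlier alternatives: a stronger algorithm, or (Python 3.9+) marking
# the call as not security-relevant so the linter and FIPS policy can allow it.
strong_digest = hashlib.sha256(data).hexdigest()
checksum_only = hashlib.md5(data, usedforsecurity=False).hexdigest()

print(strong_digest, checksum_only, weak_digest, sep="\n")
```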
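For the _pdist_forward error-checking item, the sketch below probes the behavior the updated tests are described as asserting: a problematic empty input should now be rejected with a RuntimeError rather than crashing. The triggering shape used here is an assumption; on builds where this shape was never problematic, the call may simply return an empty result, which the snippet also handles.

```python
import torch

# An empty input; the exact shape that triggered the original crash is assumed.
x = torch.empty(0, 4)
try:
    out = torch.nn.functional.pdist(x)
    print("no error raised; result shape:", tuple(out.shape))
except RuntimeError as err:
    print("rejected with RuntimeError:", err)
```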
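For the JIT version-checking fix, the snippet below shows a common way such version gates go wrong; it is a generic illustration, not the actual code from the pull requests.

```python
import sys

# On Python 3.10.x, sys.version_info is e.g. (3, 10, 6, "final", 0), so a strict
# "greater than (3, 10)" comparison is already True and a gate meant for 3.11+
# also fires on 3.10. Comparing against (3, 11) avoids the off-by-one.
enabled_buggy = sys.version_info > (3, 10)   # True on 3.10.x
enabled_fixed = sys.version_info >= (3, 11)  # False until 3.11
print(enabled_buggy, enabled_fixed)
```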
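For the triplet_margin_loss tolerance update, the sketch below shows the knob being adjusted: torch.testing.assert_close accepts explicit atol/rtol values, and loosening them accepts the slightly larger float16 error that reduced-precision kernels can produce. The values shown are placeholders, not the ones used in the test suite.

```python
import torch

# Two float16 results that differ by a small amount, as compiled and eager
# outputs often do at reduced precision.
expected = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
actual = expected + 2e-3

# The default float16 tolerances can flag a difference of this size; passing
# looser atol/rtol accepts it, which is the kind of adjustment the test makes.
torch.testing.assert_close(actual, expected, atol=1e-2, rtol=1e-2)
print("within relaxed tolerance")
```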
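For the torch.compile / sdpa_kernel item, the sketch below shows the context manager the fix concerns: backend selection made through torch.nn.attention.sdpa_kernel should now be honored inside compiled code (including any configured priority ordering) just as it is in eager mode. The exact flag used to enable priority ordering is not reproduced here.

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def attn(q, k, v):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

compiled_attn = torch.compile(attn)

if torch.cuda.is_available():
    q = k = v = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)
    # The backend selection made here should be respected inside the compiled
    # region as well, not only in eager execution.
    with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]):
        out = compiled_attn(q, k, v)
        print(out.shape)
```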
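For the cpp_builder compiler-detection fix, the snippet below shows one way such a misidentification can happen: "g++" is a substring of "clang++", so a naive substring test matches both. Whether this is the exact mechanism of the bug is an assumption; the checks here are illustrative, not the cpp_builder code.

```python
def is_gxx_naive(compiler: str) -> bool:
    # Also matches "clang++", because "g++" is a substring of "clang++".
    return "g++" in compiler

def is_gxx_safer(compiler: str) -> bool:
    # Compare against the executable name itself (allowing versioned names).
    name = compiler.rsplit("/", 1)[-1]
    return name == "g++" or name.startswith("g++-")

for cc in ("g++", "g++-12", "/usr/bin/clang++"):
    print(cc, is_gxx_naive(cc), is_gxx_safer(cc))
```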
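Finally, for the layernorm CUDA backwards test, the sketch below is a minimal version of the kind of accuracy check described, comparing the CUDA backward pass against a double-precision CPU reference. The shapes and tolerances are placeholders, not those of the actual test.

```python
import torch

if torch.cuda.is_available():
    # Forward and backward of layer_norm on CUDA in float32.
    x = torch.randn(4, 8, device="cuda", requires_grad=True)
    out = torch.nn.functional.layer_norm(x, (8,))
    grad_out = torch.randn_like(out)
    (grad_cuda,) = torch.autograd.grad(out, x, grad_out)

    # Reference backward computed in float64 on CPU.
    x_ref = x.detach().double().cpu().requires_grad_()
    out_ref = torch.nn.functional.layer_norm(x_ref, (8,))
    (grad_ref,) = torch.autograd.grad(out_ref, x_ref, grad_out.double().cpu())

    torch.testing.assert_close(grad_cuda.double().cpu(), grad_ref, atol=1e-4, rtol=1e-4)
    print("CUDA layer_norm backward matches the float64 reference")
```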
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 241
Key Closed Pull Requests
1. Fix SEGFAULT when None arg was passed in GraphContext.op(..): This pull request addresses a segmentation fault (SEGFAULT) in the PyTorch project by fixing a bug in the GraphContext.op(..) function that occurred when a None argument was passed, as indicated by the title and the body referencing issue #145261.
- URL: pull/145265
- Merged: No
- Associated Commits: 74c59, 1ebec, 529cf, 28b7c, 5548d, 5c678, d8fd5, c0e93, d3bc9, ff743, f695f, 72d56, 3056c, bb8dd, 36ae1, cd8a4, 999d7, 89145, fa7cc, d2b0e, f843f, 260e6, 8330e, 7b85b, cb52b, 55531, 959bd, fdd05, 16bec, 52681, 1944a, ff503, 38b54, b88c6, d8d27, cd367, 16a76, a4bf8, a2cc4, 276b2, 107be, 05d80, 18247, fab8f, e1111, 2b999, 3e0f8, 2b120, c30f6, 2898b, a48fe, 8cb85, 7ad87, 40680, 6d4df, e6b00, 3377f, 3d2ca, 025c4, 4f554, 82566, f4e69, 6e7d1, 9cf68, 41fdc, 84727, 9c315, b6188, 93627, 38bf3, 83443, 121b1, b798a, 90e49, fbb02, faf69, c4544, 81bad, 2fb90, a853f, 1c063, d6f20, da502, a8ce3, acda0, ee3b4, f8a46, 9a94c, 33161, 7a30a, 53508, 91ddd, ac0a7, 295a4, f11ef, ed521, 6c1d9, bbf5f, faa7c, 58f63, 891b0, 73d37, 65098, c232f, 63561, a8618, bfa9f, 4fd18, 94b71, 500ec, 9badf, 6c31a, d134e, 0fd7b, e2b74, 00117, 2b207, 55887, 1f9e4, b5517, 54c66, cb4ab, 9debd, ff69f, 3c5dd, 69305, af032, 509d4, 3da27, ecd55, d63f2, 6c91b, 6c856, c06f4, ff383, d822c, cad53, bc4aa, 916c7, e9aaf, 70702, b6995, c1257, 81eed, 6749c, a9685, 478a3, ce891, 8c0dd, 9a84b, ecb3b, d79d1, f96ff, 093ed, 9b8cb, ef7b0, 27342, a80b6, d6846, b643f, b4ee3, 5cf3a, dd503, ffd74, 85ffe, dee02, 924bb, c18d6, 4aed3, 31439, e360f, dd4e4, fb5b0, 9ec0f, b253d, 4d7e0, 7cb38, 9340f, b0fd1, 9493d, 38c6e, aba14, ff1e3, c7f0e, 4386e, 346fa, 8094a, 6aa28, 18028, e3f20, 404b7, 197de, a5af7, c2239, 8d4ec, b8f87, 0f48e, 921c2, 0b1dc, 7bdb5, 71c54, 8dd8f, d9c1e, 89609, d3059, dd1ba, 3838b, a3f90, d4752, 5a8ab, 6ae9b, 1c22e, 65852, c89d5, 772a2, 5962d, c2fee, 5218d, c0e14, 78ec8, cfb0d, a55c0, aae36, 92fae, 12f27, e84fa, 5c3c6, a8d32, 7601d, 71b7f, 292ff, 1e3b9, a7266, aa47f, 06742, 06203, 24fc3, 49383, 1f736, f6443, abe8c, 45211, 3959b, fb130, b4519, 340aa, 217e8, 513b1, ce51a, f2e7a
2. [ONNX] Bump torchlib opset to 22: This pull request aims to update the ONNX torchlib opset to version 22, involving multiple commits that address various aspects such as migrating torchlib into PyTorch, updating tests, fixing issues, and making several code improvements, although it was ultimately not merged.
- URL: pull/146510
- Merged: No
- Associated Commits: d3904, 837c0, caea1, 31fc5, 8c2c5, 77ed8, 00179, 3cf2e, 45049, f61eb, 07c7c, c263b, 4927f, c6737, 40153, 98e32, 702cc, f8f12, fcb0f, 1be91, 60094, b8a63, f590b, 24b7d, ad46e, 4289e, 1d488, 98579, c9ce7, 8cfef, e592a, ed814, 95b33, 0f7ac, 8e574, 1575d, 50f0c, cdeac, 90973, 13bcf, b3aab, fc259, 80506, 125c9, da3d0, bba25, 2b5bb, 61c98, de421, d44c0, 713e5, 67a32, 0637e, 4c0ad, 79838
3. [Intel GPU] qconv_pointwise.binary XPU support: This pull request aims to enable support for quantized fusion operations, specifically qconv+add and qconv+add+relu, on the Intel GPU backend by registering the operation via a specific schema and allowing the signed int8 data type during operation lowering. It also reuses existing code for pattern matching and provides unit test verification and runtime examples.
- URL: pull/135189
- Merged: No
- Associated Commits: 5f711, 08cde, e8098, eb1e5, 2fc0e, e9e3d, 9e906, d7500, 1dfc4, 0a39e, 26312, 2e6d7, e8221, c90b2, 97b78, fef7f, 0b227, eefe9, 281d9, a3391, d468e, b128d, ec34e, cb30e, 7c4de, 83fb6, ac9fd, e339f, 89086, 2574b, 3a1f7, 4f05b, e6115, 5a848, 76852, 346d6, 43978, b27a8, ea400, 65c14, e6322, 12234, 4a36f, c5edf, b44bf, 3b1ee, ed8ad, 2fa83, 5011d, c34e6, 161c1, 34d44, a1a32, 6a0e3
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
-
- Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange.)
- This GitHub conversation involves multiple users discussing a series of commits related to an 'export method'. User1 initially provides a solution, which User2 critiques, expressing dissatisfaction with its effectiveness. User3 attempts to mediate by suggesting improvements, but User1 responds defensively, leading to a tense exchange. The tone shifts from collaborative to confrontational, with User2 and User1 exchanging terse comments.
- [pytree] Register normal class to register_dataclass
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Repeated misunderstandings.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone is tense, with username3 attempting to mediate and suggest solutions. The conversation is marked by repeated misunderstandings and a lack of consensus, leading to increased tension.
-
- Toxicity Score: 0.55 (Frustration expressed, defensive responses, mediation attempts, escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 197 | 61 | 2 | 216 |
anijain2305 | 273 | 59 | 3 | 77 |
guilhermeleobas | 337 | 16 | 2 | 34 |
jansel | 208 | 26 | 2 | 119 |
zou3519 | 60 | 19 | 20 | 246 |
justinchuby | 141 | 23 | 8 | 142 |
benjaminglass1 | 241 | 14 | 0 | 41 |
Skylion007 | 46 | 20 | 3 | 205 |
eellison | 96 | 9 | 7 | 160 |
cyyever | 138 | 49 | 0 | 48 |