Weekly GitHub Report for PyTorch: July 07, 2025 - July 14, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default value for the `weights_only` parameter in `torch.load`.
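Of the changes above, the new `weights_only` default in `torch.load` is the one most likely to break existing code. A minimal sketch of the new behavior, with the checkpoint path purely illustrative:

```python
import torch

# Since PyTorch 2.6, torch.load defaults to weights_only=True, which
# restricts unpickling to tensors and other allowlisted types.
state = torch.load("checkpoint.pt")  # now equivalent to weights_only=True

# Loading arbitrary pickled objects requires opting out explicitly;
# only do this for checkpoints from a trusted source.
full = torch.load("checkpoint.pt", weights_only=False)
```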
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [inductor] grouped_mm is autotuning under torch.compile default mode: This issue is about a bug encountered when compiling the `grouped_mm` from the MoE layer in the llama4/deepseekv3 torchtitan, where the process hangs during compilation with specific arguments. The issue suggests a workaround by avoiding autotuning outside of max-autotune and falling back to the aten implementation for torch.compile's default mode, while keeping the issue open to address max-autotuning concerns (a short sketch of the two compile modes follows this list).
  - The comments discuss the reproduction of the issue, with some users unable to replicate it due to changes in autotuning behavior. There is a debate on whether to merge a proposed fix, with some suggesting that the default mode should not autotune, while others argue against it. The discussion also touches on the use of `use_triton_template` for consistency across kernels and the need for a systematic approach to autotuning configurations.
  - Number of comments this week: 13
- [RFC] Replace setuptools build backend with scikit-build-core: This issue proposes replacing the current setuptools-based build backend for PyTorch with scikit-build-core to modernize the packaging process and improve interoperability and build performance. The motivation for this change is driven by upcoming deprecations in setuptools, which necessitate a reevaluation of the build system, with scikit-build-core being favored due to its compatibility with PyTorch's existing CMake-based build system.
  - The comments discuss the feasibility and scope of replacing `setup.py` with scikit-build-core, including potential changes to CMake files and the separation of libtorch. There is a consensus on the benefits of the proposed change, but concerns are raised about the complexity and the need for a gradual transition. Suggestions include starting with scikit-build before moving to scikit-build-core and considering the implications for standalone libtorch and its integration with ExecuTorch.
  - Number of comments this week: 11
- [RFC]: PyTorch Low-Precision GEMMs Public API: This issue proposes the introduction of public APIs for low-precision matrix multiplications in PyTorch, aiming to replace the current reliance on private and undocumented APIs like `_scaled_mm` and `_scaled_grouped_mm`. The proposal outlines two potential approaches: creating new dedicated functions for scaled operations or extending existing functions with optional scaling parameters, each with its own trade-offs in terms of API clarity, type safety, and user experience.
  - The comments discuss the value of exposing low-level, low-precision kernels quickly without waiting for high-level constructs, with some users supporting the proposal's approach of not defining end-to-end workflows. There is a preference for Approach 1 due to its future-proof nature, and suggestions include using an "algo_hints" dictionary for additional arguments and considering a "scale format" enum for specific hardware requirements. A beginner expresses interest in contributing by writing test coverage or documentation.
  - Number of comments this week: 9
- BC-breaking change to symint range constraints from 2.7 -> 2.8: This issue concerns a backward compatibility-breaking change in the PyTorch library from version 2.7 to 2.8, where a specific code snippet involving dynamic marking and tensor operations errors out in version 2.8 but passes in version 2.7. The problem seems to be related to changes in the handling of symbolic integer range constraints, which affects certain test cases and requires further investigation and potential documentation updates.
  - The comments discuss the need to investigate the issue, with some suggesting it might require documentation updates. There are mentions of related test failures and internal bug investigations, with a specific pull request identified as a potential cause. The discussion includes technical details about cache keys and specialization points, and there is consensus not to revert a specific pull request despite its involvement in the issue.
  - Number of comments this week: 8
- Drop SSE4 support in oneDNN: This issue discusses the proposal to remove SSE4 support from the oneDNN library and seeks feedback on whether this change is acceptable for PyTorch, particularly concerning support for older platforms that do not have AVX capabilities. The issue also inquires about the availability of reference implementations for non-AVX platforms after the removal of SSE4 support.
  - The comments discuss the need for details on reference implementations for non-AVX platforms and confirm that PyTorch should work on older hardware, though performance expectations are low. It is clarified that reference implementations will remain, but jit-optimized implementations for certain operations will be removed. There is a suggestion to drop SSE4 support due to its age and to encourage AVX support for VMs. A question about solving an IPO issue on MSVC is raised and answered negatively.
  - Number of comments this week: 6
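As referenced in the grouped_mm item above, a minimal sketch of the two compile modes at issue: the default mode, which the proposed workaround would keep from autotuning, and max-autotune, which opts into kernel benchmarking. The `moe_like_matmul` function is a hypothetical stand-in, not the torchtitan reproducer.

```python
import torch

def moe_like_matmul(a, b):
    return a @ b  # stand-in for the grouped_mm call in the MoE layer

# Default mode: under the proposed fix, this falls back to the aten
# implementation instead of autotuning Triton kernels.
default_fn = torch.compile(moe_like_matmul)

# Opt-in autotuning: Inductor benchmarks candidate kernel configurations.
tuned_fn = torch.compile(moe_like_matmul, mode="max-autotune")

a, b = torch.randn(64, 128), torch.randn(128, 32)
print(torch.allclose(default_fn(a, b), tuned_fn(a, b), atol=1e-5))
```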
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment configured with specific model and pipeline settings. The error occurs in a setup using PyTorch version 2.4.0 with CUDA 12.1 on an Ubuntu 22.04.3 LTS system, and it affects the compilation of certain components of a pipeline, specifically the 'unet_garm' and 'unet_vton' models, using the 'torch.compile' function.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method (a sketch of the equivalence follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a problem encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, `cuda_utils.so`, fails due to a missing execution permission despite being run as the root user. The error occurs in a temporary filesystem with specific permissions, and the user reports that the file lacks the execution bit, leading to an ImportError when attempting to map a segment from the shared object.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standard. The process requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to address known issues such as type annotations and import cycles before the UFMT changes are committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in the PyTorch library to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the file size of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by the user's experience where removing these files manually resulted in a substantial reduction in model size without affecting functionality.
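As noted in the MaxPool2D item above, the proposed decomposition is easy to verify numerically; this minimal sketch checks the math only, and any CPU speedup would still need benchmarking on a given backend.

```python
import torch
import torch.nn as nn

# With stride 1, two 3x3 max pools cover the same 5x5 window as one
# 5x5 max pool: a max of maxes over overlapping windows.
x = torch.randn(1, 3, 32, 32)

pool5 = nn.MaxPool2d(kernel_size=5, stride=1)
pool3 = nn.MaxPool2d(kernel_size=3, stride=1)

print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```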
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 78
Summarized Issues:
- Compilation Errors in PyTorch: Compilation errors in PyTorch can arise from unsupported operations or configurations, such as when using the TorchInductor backend with complex64 tensors, leading to C++ compile errors. Another instance is the failure during CUDA tests due to undefined variables in generated C++ code, highlighting the need for better error handling and support for complex operations (an illustrative complex64 pattern appears after this list).
- Segmentation Faults and Crashes: Segmentation faults and crashes in PyTorch can occur due to various reasons, such as running specific unit tests on macOS or mutating tensors while serializing them in a multi-threaded environment. These issues often require community assistance or workarounds to resolve, as they can affect a wide range of users and configurations.
- Runtime and Import Errors: PyTorch users may encounter runtime errors, such as a `RuntimeError` in `torch.randint` with large upper bounds, or import errors due to outdated system libraries. These issues highlight the importance of handling large data types and ensuring compatibility with system dependencies.
- CUDA and GPU Compatibility Issues: Compatibility issues with CUDA and GPUs can arise when porting projects to different hardware, such as Intel GPUs, or when using specific PyTorch features like `torch.xpu`, leading to errors or unexpected behavior. These issues often require detailed troubleshooting and community input to resolve.
- PyTorch Compile and Dynamo Bugs: Bugs in PyTorch's `torch.compile` and Dynamo backend can lead to errors when handling unsupported data types or operations, such as object dtypes or slicing results of `torch.linalg.svd`. These issues highlight the need for improved support and error handling in PyTorch's compilation features.
- Graph Breaks and Meta Kernel Issues: Graph breaks and meta kernel issues in PyTorch can occur when using certain operators or configurations, such as the absence of a meta kernel for `aten::quantize_per_tensor.tensor_qparams`, leading to graph breaks instead of errors. These issues necessitate the addition of fake implementations or other solutions to maintain functionality.
- Continuous Integration and Build Process Problems: Problems in the continuous integration and build process, such as timeouts in C++ documentation builds or proposals to modernize the packaging process, can impact the development workflow. These issues often require discussions on improvements and temporary solutions to maintain efficiency.
- Performance Discrepancies and Regression: Performance discrepancies, such as slower backward passes with certain data types or regressions in execution time, can affect the efficiency of PyTorch operations. These issues highlight the need for consistent performance across different configurations and data types.
- Fully Sharded Data Parallel (FSDP) Bugs: Bugs in the Fully Sharded Data Parallel (FSDP) module, such as incorrect argument handling or inconsistencies with Distributed Data Parallel (DDP), can lead to unexpected behavior and require careful handling to ensure consistent model performance.
- CUDA and NCCL Integration Issues: Integration issues with CUDA and NCCL, such as invalid memory access or problems with CUDA runtime detection, can cause errors during execution. These issues often require workarounds or fixes to ensure proper functionality in distributed and parallel computing environments.
- PyTorch Documentation and API Discrepancies: Discrepancies between PyTorch documentation and actual functionality, such as incorrect descriptions of function arguments or missing methods, can lead to confusion and errors. These issues highlight the importance of accurate documentation to guide users effectively.
- TorchDynamo and Python Compatibility: Compatibility issues between TorchDynamo and Python features, such as `sys.monitoring`, can cause graph breaks and affect debugging processes. Addressing these issues is crucial for seamless integration with Python's evolving features.
- Distributed and Parallel Computing Enhancements: Proposals for enhancements in distributed and parallel computing, such as introducing a new CUDA Unified Memory backend or improving checkpoint serialization, aim to support larger models and improve performance. These proposals often involve discussions on implementation and potential benefits.
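As referenced in the compilation-errors item above, a simplified sketch of the kind of program those reports involve: `torch.compile` with the Inductor backend over complex64 inputs. The exact failing reproducers live in the issues themselves; this hypothetical version may or may not trigger the C++ compile error on a given build.

```python
import torch

def complex_op(x):
    # Elementwise work on complex tensors, the configuration the
    # reported Inductor C++ compile errors involve.
    return (x * x.conj()).real

compiled = torch.compile(complex_op, backend="inductor")
x = torch.randn(8, 8, dtype=torch.complex64)
print(compiled(x).shape)
```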
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 19
Summarized Issues:
- UnicodeDecodeError in torch.compile on Windows with MSVC: This issue involves a `UnicodeDecodeError` that occurs when compiling a simple model using `torch.compile` on Windows with MSVC in a non-English environment. The error arises because Inductor attempts to decode the MSVC compiler's output as UTF-8, which fails because the output is in the system's local encoding.
- Warnings and Errors in CUDA and PyTorch Compilation: Users encounter warnings and errors when compiling CUDA and PyTorch projects, such as warnings about extern declarations being treated as static definitions and unnecessary integer comparison warnings. These issues can degrade the user experience and require guidance or fixes to resolve.
- Flex Attention and torch.compile Issues in PyTorch: Limitations and bugs in PyTorch's Flex Attention and `torch.compile` features are highlighted, including the inability to create custom attention masks and an `AttributeError` due to missing attributes. These issues restrict functionality and are partially resolved in newer versions.
- Test Failures and Disabling in PyTorch Projects: Several tests in PyTorch projects are failing or have been disabled due to platform-specific issues, such as failures on XPU platforms and runtime errors on multi-GPU machines. These failures highlight the need for platform-specific adjustments and better test management.
- Build and Compilation Errors on aarch64 Architecture: Users face build and compilation errors when working with the aarch64 architecture, including delays in Docker image builds and GCC-related errors. These issues require workarounds and better coordination in build processes to ensure smooth operation.
- Discrepancies in Function Behavior and Precision Loss: Discrepancies in function behavior between different modes and platforms, such as in `F.conv_transpose2d` and `torch.nn.TransformerEncoderLayer`, lead to unexpected outputs and precision loss. These issues necessitate further investigation to align behavior across environments.
- torch.compile and Quantization Challenges: The `torch.compile` feature struggles with models using `torch.quantization.QuantStub` and `DeQuantStub`, as these functions are skipped during compilation. This raises questions about design choices and the need for alternative workflows compatible with `torch.compile`.
- Advanced Indexing Discrepancies Between Numpy and PyTorch: A discrepancy in advanced indexing behavior between Numpy and PyTorch is discussed, particularly with boolean masks. While PyTorch generally aligns with expectations, specific indexing patterns can lead to unexpected results, highlighting the need for careful handling of edge cases.
- RuntimeError and Export Issues on MPS Devices: A `RuntimeError` occurs when using `torch.export.export` on MPS devices due to unallocated placeholder storage. This issue can be resolved by importing `torch._dynamo` before executing the export function, indicating a need for better documentation or error handling (see the sketch after this list).
- PyTorch on Apple Silicon GPUs: The implementation of `torch.compile` aims to enhance machine learning pipelines on Apple Silicon GPUs, specifically targeting macOS 14+ with version 2.8.0. This development promises speedups for large language model inference, showcasing PyTorch's adaptability to new hardware.
- TypeError with NumPy Scalars in PyTorch Tests: A TypeError is encountered in PyTorch tests when using NumPy versions greater than 2.0, due to the inability to convert `numpy.bool_` scalar objects to tensors. This issue affects test reliability and requires updates to handle new NumPy scalar types.
- Unimplemented Dynamo Guard Source in PyTorch: The unimplemented Dynamo guard source due to integer specialization is a concern when implementing frozen dataclass sources. This issue is documented with a reproduction and error trace, indicating areas for future development and bug fixes.
- Incorrect Path in CONTRIBUTING.md for PyTorch Documentation: An incorrect path in the CONTRIBUTING.md file for installing requirements.txt led to confusion and discussion about the correct path. The issue appears to be specific to the original poster's setup, suggesting a need for clearer documentation.
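Following the MPS export item above, a minimal sketch of the import-order workaround described there; `TinyModel` is a hypothetical stand-in, and the snippet requires an Apple Silicon machine with MPS available.

```python
import torch
import torch._dynamo  # workaround: import before export to avoid the
                      # unallocated-placeholder RuntimeError on MPS

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

model = TinyModel().to("mps")
example_args = (torch.randn(4, 4, device="mps"),)

# torch.export.export traces the module into an ExportedProgram.
exported = torch.export.export(model, example_args)
print(exported)
```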
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 175
Key Open Pull Requests
1. [WIP] WindowsArm64 CI changes: This pull request is a work-in-progress draft aimed at testing and implementing changes to the continuous integration (CI) process for Windows Arm64 builds in the PyTorch project, involving numerous commits that add and modify scripts, workflows, and configurations to support and optimize the build and test processes for this architecture.
- URL: pull/157935
- Merged: No
- Associated Commits: a8e6d, 0d998, 23071, df086, 3380c, 2743b, 28b9d, dcff5, e7bb4, 9ef11, a9ee4, a6e04, b9a19, d7922, 64cd9, 78a40, e5f0b, 6aa19, bb4ec, d03be, 4ee65, 94513, e063c, 1c745, 037cc, 16fab, d0102, 1f1de, 5d227, 24710, 4a6ff, 33091, 20a45, 3e266, 136b2, 69113, b1545, acd76, 2787f, 127fa, 56e33, 5ed58, 2c165, bf215, d1d01, 435f3, b5749, cd8e4, c5bf2, fa363, 080a5, daa19, b06a4, 0ad2a, 55cd5, 515f8, 305c7, 8ee40, 28bbe, 51666, a84ff, af487, e923a, d377d, 11230, e5918, 903b9, 0f2a0, 9cbe1, 49a03, 538b3, e428e, 7713b, 91ab8, 9cba2, a0d52, cb75d, 0d223, b2079, 4c2fc, 390a4, 7619b, 14a3a, 9b072, ee16d, 9a084, bdf7e, 07a42, ee9b5, 8149f, 44f95, 30298, 5c758, 26a8a, a3f5a, 9c376, e89ae, 09de9, fa240, 77872, a803f, 2ae77, a1aea, 23ff7, 9824c, 1a099, 2ae9b, bc612, d182e, 51af0, 400ea, 2eda5, 9cd3a, f3454, 1a5ad, 20fa1, f11a7, ca621, a1c81, 629b7, 9eca5, a13cd, 4badb, 91f69, c487e, 95c29, 36c21, fa9f9, 87a2d, 0a943, 27a35, f2710, c8949, 4ffcc, 0b604, 16c00, e59ea, 5535e, 6f7c2, 408b5, 4118d, bb092, 834da, c008d, c4b37, 6595c, 71a0b, 51a78, 965f0, 2128d, 8c9f9, bd9b0, a5bfd, 5b518, f297b, c62fe, 61e2d, dd205, a87ab, ca5e5, e5e12, bdedb, 21db5, 71eec, d04fe, 3d35f, 40e5e, 0fe6a, 4a163, 9ed0c, 7ea71, d5dd0, 16a26, 3a3d5, 21f65, 6a301, e65e2, f337a, 0758b, 56cda, d36cc, 568e5, 75e5c
2. [inductor] initial triton static config lookup table: This pull request introduces the initial version of a static lookup table for Triton configurations, supporting operations like mm, addmm, bmm, and mm_plus_mm, with the aim of enabling a broader internal benchmarking and adoption, while also providing a foundation for expected behavior testing for future expansions involving more backends and functions.
- URL: pull/157699
- Merged: No
- Associated Commits: 09ac7, 01fa0, af299, 7b98f, a85f1, 25dfc, e0a13, ad3c3, ae254, b9a52, d01a0, e2222, 22f51, 124a6, 3c6ff, 66c2f, e76b0, 2e0bd, 2e8b6, c2a83, 95baf
3. Reproduce issue from #156097: This pull request aims to reproduce an issue from a previous report (#156097) by utilizing an older driver version (`525.105.17`) from a specific test branch, while incorporating various commits that include changes such as using the runtime driver API for `cuStreamWriteValue32`, refactoring and reverting driver API changes, and adding version checks, among other updates.
- URL: pull/158181
- Merged: No
- Associated Commits: 1484e, 0c71b, 263a3, 2084b, 96084, 77c01, 7e9f7, cd530, a91b9, d539a, 55581, 9e4be, 31557, c72d3, 21763, dd75c, 063e2, fe844, c1182, 36ea7
Other Open Pull Requests
- TMA Compatibility and Flex-Attention Updates: This topic covers updates to the Tensor Memory Allocation (TMA) compatibility requirements for the Inductor and Triton components, as well as enabling TMA for flex-attention on supported devices. The pull requests involve multiple updates and commits to address issues with Flex Attention kernel TMA block descriptors and loads, ensuring compatibility and performance improvements.
- SHFMT Linter and Shell Script Formatting: The addition of the `SHFMT` linter to format shell scripts is proposed, involving multiple contributors and stakeholders. This series of pull requests, managed through the `ghstack` tool, aims to format shell scripts across various directories in the PyTorch project.
- GCC Version Migration: This pull request focuses on migrating the PyTorch continuous integration jobs from GCC version 11 to GCC version 13. The change ensures compatibility and addresses regression issues, aligning with the binary builds that have already transitioned to GCC 13.
- MoviePy Version 2.x Support: This pull request adds support for MoviePy version 2.x in the PyTorch project, addressing issue #147317. It includes updates to the `summary.py` file within the `torch/utils/tensorboard` directory to ensure compatibility with the new MoviePy version.
- CUDA 12.4 Integration in CI: This pull request integrates a CUDA 12.4 build into the continuous integration process for the PyTorch project. It addresses issues from a previous GitHub issue and involves multiple commits for testing, installation, and skipping failing tests specific to CUDA 12.4.
- PrivateUse1 Key Configuration: Functions to configure the PrivateUse1 key as a Python backend device in PyTorch are introduced. This allows users to subclass tensors to hold arbitrary Python data as "device data" and to register operations with `torch.library`, addressing issues related to device guards and tensor subclassing for non-CUDA/CPU devices (see the sketch after this list).
- AOTAutogradCache Normalization: This pull request adds a pass to the `sanitize_gm_for_cache` function that normalizes placeholder names across input dynamo graphs for the `AOTAutogradCache`. This ensures consistency and safety, as the original dynamo graph's node names are not used by `AOTAutograd`, making this change effectively a no-op except for cache key checks.
- Inductor and Triton Matmul Enhancements: The initial implementation of native `tl.dot` support in Inductor is introduced to generate Triton matmul kernels directly. This includes a new configuration flag, a new `ops.dot` IR node, suitable tiling for matmul, code generation for `ops.dot`, and Triton autotuning heuristics.
- Float8 Data Type Support: This pull request addresses an assertion issue by adding the `float8_e4m3fn` data type to the assertion dtype list in the PyTorch project. It involves multiple commits for supporting this data type in lowering and meta functions, refining code, fixing lint errors, and adding unit tests.
- Reduction Configuration Lookup Table: A work-in-progress lookup table for reduction configurations in the PyTorch project is introduced. This builds upon a previous differential revision and involves multiple contributors for review and collaboration.
- Error Handling in torch.compile: This pull request enhances error handling in the `torch.compile` function by explicitly raising a `torch._dynamo.exc.Unsupported` exception with a clear explanation. It improves the user experience by replacing the previous, less intuitive `TorchRuntimeError` with a more direct and actionable error message.
- Composable Kernel Generation Reduction: Changes to reduce the number of composable kernel (ck) kernels generated in the PyTorch project are made. This includes updates such as modifying the ck kernel generation process, updating specific header files, changing APIs, and testing wrappers.
- Build Configuration Update for PEP 639: The build configuration of the PyTorch project is updated by pinning the `setuptools` version to 77 or higher to enable support for PEP 639. This is part of a series of changes managed through the ghstack tool.
- ideep Library Update Testing: This pull request tests the update of the ideep library to a more recent oneDNN commit. It involves multiple commits such as updating OpenBLAS and fixing a segmentation fault, and is not intended for merging.
- B200 Platform Benchmarking: Work on enabling the nightly PT2 benchmark on the B200 platform is resumed and completed. This includes several commits for testing and refining the benchmark process.
- Loop Iteration Simplification in Triton GEMM: This pull request simplifies loop iteration and centralizes modifications by extracting common logic for converting BaseConfig objects into keyword arguments for Triton GEMM templates. This ensures consistent updates across all templates and reduces the risk of missing changes.
- Indexing for Large Tensors: Indexing for large tensors is enabled by converting int32 indices to int64. This is part of a stack of changes tracked via ghstack and involves multiple contributors for review and collaboration.
- Compiled Mode Broadcasting Modification: The compiled mode broadcasting in PyTorch is modified to use zero strides for new dimensions of length one. This addresses inconsistencies in stride behavior compared to eager mode and is dependent on another pull request for testing.
- Intermediate Representation Node Reordering: A helper function to reorder intermediate representation nodes for pre-fetching bucketed AG/RS nodes is introduced. This optimizes the forward and backward pass by adjusting the sequence of operations to improve efficiency in the PyTorch project.
- Group Split Function Prototype: A prototype implementation of the `group_split` function is introduced as part of the dist2 project. This follows the proposal outlined in a shared Google document and involves multiple updates and collaboration with contributors.
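To make the PrivateUse1 item above concrete, a hedged sketch using existing public APIs (`torch.utils.rename_privateuse1_backend` and `torch.library.custom_op`); the helper functions the pull request itself introduces may look different, and `mylib::scale` is purely illustrative.

```python
import torch
from torch.library import custom_op

# Expose the PrivateUse1 dispatch key under a custom backend name.
torch.utils.rename_privateuse1_backend("my_device")

# Register a custom op; a real out-of-tree backend would add a
# "my_device"-specific kernel, while this sketch uses a CPU reference.
@custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    # Shape/dtype propagation for compilation and meta tensors.
    return torch.empty_like(x)

print(scale(torch.ones(3), 2.0))  # tensor([2., 2., 2.])
```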
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 194
Key Closed Pull Requests
1. [inductor][triton] Add experimental use_tensor_descriptor config option: This pull request introduces an experimental configuration option to use tensor descriptors in the PyTorch project, refactoring the code to support TMA (Tensor Memory Access) descriptors in general code generation, ensuring compatibility with Triton's requirements for minimum block sizes, and updating tests to accommodate these changes, while also requiring an upgrade to Triton version 3.4.0.
- URL: pull/157906
- Merged: No
- Associated Commits: 0458c, ac074, 81e29, 90684, 594d5, 8f794, 15fb0, a925c, f31d3, 417ea, 7a0b9, ac8f5, a8bdc, 5b51a, e8d67, 86046, cee8b, 36ea5, 615a2, 0eb37, 803ee, 7dcf7, 0e916, 494ef, 55de3
2. #IS157973/numpy version issue: This pull request addresses issue #157973 by modifying the `THPUtils_unpackNumberAsBool` function to explicitly recognize `numpy.bool_` scalars using `torch::utils::is_numpy_bool`, ensuring that if the object is a NumPy boolean, its truth value is retrieved via `PyObject_IsTrue` to avoid the previous error-prone path that treated it as an integer.
- URL: pull/158036
- Merged: No
- Associated Commits: ae85a, 51556, a359e, dabab, 5bd5c, a0792, b7cb0, c2c95, b8f9b, 292c9, 58cd6, a66b7, 428e3, ba14a, 4de53, 82aa0, 093f6, e7569, cae9e, f2d48, 709bb, 0687a, dbc98
3. [WIP] [Inductor][Intel GPU] Always use channel last for only freezing mode: This pull request aims to ensure that the PyTorch project consistently uses the channel-last memory format specifically in freezing mode for Intel GPUs, as indicated by the title and the involvement of multiple contributors and reviewers from various organizations.
- URL: pull/157827
- Merged: No
- Associated Commits: 3e6f0, 8e00a, 2b0f8, 3db12, d91af, 0cb3b, b1040, 8b879, 7228e, 9fd94, 34725, 30a9a, f6166, 90d16, a83d7, 65508
Other Closed Pull Requests
- Setuptools and Wheel Management: This topic involves managing the build requirements in the PyTorch project, specifically pinning the `setuptools` version to ensure compatibility and removing the `wheel` package from the build requirements. These changes are part of a stack of related updates tracked through the `ghstack` tool, although the removal of the `wheel` package was not merged.
- Dynamo and Compilation Enhancements: Enhancements to the PyTorch dynamo component include introducing a recompilation hook for logging custom compile-related information and implementing the `expand_hints` function to execute and expand `graph_break_hints`. These updates aim to improve the flexibility and functionality of the dynamo component, although the `expand_hints` function was not merged.
- Cost Coverage and Strategy Improvements: The PyTorch project has improved cost coverage by filling in missing `redistribute_cost` for operations like `cat` and `slice_scatter` and expanding the `cat` strategy. This enhancement increases the number of strategies based on input tensors and operation specifications, thereby optimizing the placement of each input tensor.
- Quantization Documentation Transition: The PyTorch project is transitioning its quantization documentation to use TorchAO, involving multiple updates to ensure correct generation from the continuous integration process. This transition marks a significant shift in how quantization documentation is managed within the project.
- CUDA and Architecture Support: Enhancements to the PyTorch project include adding support for the sm_80 architecture to the CUDA 12.9 SBSA build on the aarch64 platform and reintroducing the PTX microarchitecture into CUDA 12.8. These updates aim to expand the project's compatibility with various architectures, although the sm_80 support was not merged.
- Library and Dependency Management: The PyTorch project has addressed issues with library paths by replacing `find_path` with `find_library` for NVSHMEM libraries and has removed the unused `astunparse` dependency. These changes streamline the build process and ensure compatibility with system installation locations.
- Device Module and Indexing Improvements: Tracing through the `torch.get_device_module` function and addressing advanced indexing discrepancies with NumPy are key improvements in the PyTorch project. These updates enhance the project's functionality and ensure compatibility with NumPy's dimension ordering.
- Index Kernel Performance and Bug Fixes: The PyTorch project has improved the `index_kernel` function's performance for large tensors and fixed issues with `index_put_accumulate` for boolean types, nearly doubling performance for large tensors while addressing specific bugs in the function (see the sketch after this list).
- ONNX Exporter Enhancements: Enhancements to the ONNX exporter in PyTorch include adding support for symbolic arguments, allowing for more flexible model input handling. This update addresses previous limitations related to constant arguments in the export process.
- Symbolic Integer and Collective Enhancements: The PyTorch project has added support for unbacked symbolic integers in the Static Data Parallel Function Application and improved the "reorder_collectives_preserve_peak" feature. These updates enhance the project's functionality, although the symbolic integer support was not merged.
- Inductor Collectives and Validation Checks: Improvements in the PyTorch project include handling sink waits in inductor collectives and adding validation checks for the `jagged_dim` parameter. These updates ensure better handling of collective operations and parameter validation, although the validation check was not merged.
- CPython Tests and Serialization Protocol: The PyTorch project has added several CPython tests for various operators and dunder methods, and enhanced the serialization protocol of the Cutlass backend. These updates improve testing coverage and significantly reduce loading time for instantiation.
- Configuration and Communication Cleanup: The PyTorch project has introduced a mechanism to raise errors for unrecognized keys in the accelerator allocator configuration and removed unnecessary global rank parameters for NCCL communication. These changes streamline configuration management and communication processes.
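As referenced in the index-kernel item above, a small sketch of accumulate-mode indexing through the public `Tensor.index_put_` API, which is the entry point for the `index_put_accumulate` path; shapes and values are illustrative.

```python
import torch

t = torch.zeros(5)
idx = torch.tensor([1, 1, 3])
vals = torch.tensor([1.0, 2.0, 5.0])

# accumulate=True sums all writes at repeated indices instead of
# letting one write overwrite the others.
t.index_put_((idx,), vals, accumulate=True)
print(t)  # tensor([0., 3., 0., 5., 0.])
```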
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- add ir node bucket helper function
- Toxicity Score: 0.55 (Defensive responses, perceived dismissiveness, unresolved frustration.)
- This GitHub conversation involves a series of interactions where username1 initially proposes a change, and username2 provides feedback that is perceived as dismissive by username1, leading to a defensive response. Username3 attempts to mediate by offering a compromise, but username1 remains frustrated, feeling their efforts are undervalued. The tone shifts from collaborative to tense, with username1 expressing dissatisfaction with the process.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
XuehaiPan | 568 | 57 | 3 | 28 |
bobrenjc93 | 194 | 53 | 0 | 1 |
malfet | 112 | 19 | 7 | 103 |
coconutruben | 104 | 18 | 0 | 49 |
Skylion007 | 16 | 11 | 3 | 123 |
guilhermeleobas | 109 | 29 | 0 | 1 |
atalman | 94 | 12 | 12 | 17 |
williamwen42 | 36 | 10 | 5 | 82 |
jansel | 26 | 8 | 1 | 67 |
guangyey | 60 | 8 | 4 | 24 |