Weekly GitHub Report for PyTorch: July 07, 2025 - July 14, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward compatibility-breaking change in the default value for the `weights_only` parameter in `torch.load`.
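Of the changes above, the new `weights_only` default in `torch.load` is the one most likely to break existing code. A minimal sketch of the new behavior, with the checkpoint path purely illustrative:

```python
import torch

# Since PyTorch 2.6, torch.load defaults to weights_only=True, which
# restricts unpickling to tensors and other allowlisted types.
state = torch.load("checkpoint.pt")  # now equivalent to weights_only=True

# Loading arbitrary pickled objects requires opting out explicitly;
# only do this for checkpoints from a trusted source.
full = torch.load("checkpoint.pt", weights_only=False)
```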
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [inductor] grouped_mm is autotuning under torch.compile default mode: This issue is about a bug encountered when compiling the `grouped_mm` from the MoE layer in the llama4/deepseekv3 torchtitan, where the process hangs during compilation with specific arguments. The issue suggests a workaround by avoiding autotuning outside of max-autotune and falling back to the aten implementation for torch.compile's default mode, while keeping the issue open to address max-autotuning concerns (a short sketch of the two compile modes follows this list).
  - The comments discuss the reproduction of the issue, with some users unable to replicate it due to changes in autotuning behavior. There is a debate on whether to merge a proposed fix, with some suggesting that the default mode should not autotune, while others argue against it. The discussion also touches on the use of `use_triton_template` for consistency across kernels and the need for a systematic approach to autotuning configurations.
  - Number of comments this week: 13
- [RFC] Replace setuptools build backend with scikit-build-core: This issue proposes replacing the current setuptools-based build backend for PyTorch with scikit-build-core to modernize the packaging process and improve interoperability and build performance. The motivation for this change is driven by upcoming deprecations in setuptools, which necessitate a reevaluation of the build system, with scikit-build-core being favored due to its compatibility with PyTorch's existing CMake-based build system.
  - The comments discuss the feasibility and scope of replacing `setup.py` with scikit-build-core, including potential changes to CMake files and the separation of libtorch. There is a consensus on the benefits of the proposed change, but concerns are raised about the complexity and the need for a gradual transition. Suggestions include starting with scikit-build before moving to scikit-build-core and considering the implications for standalone libtorch and its integration with ExecuTorch.
  - Number of comments this week: 11
- [RFC]: PyTorch Low-Precision GEMMs Public API: This issue proposes the introduction of public APIs for low-precision matrix multiplications in PyTorch, aiming to replace the current reliance on private and undocumented APIs like `_scaled_mm` and `_scaled_grouped_mm`. The proposal outlines two potential approaches: creating new dedicated functions for scaled operations or extending existing functions with optional scaling parameters, each with its own trade-offs in terms of API clarity, type safety, and user experience.
  - The comments discuss the value of exposing low-level, low-precision kernels quickly without waiting for high-level constructs, with some users supporting the proposal's approach of not defining end-to-end workflows. There is a preference for Approach 1 due to its future-proof nature, and suggestions include using an "algo_hints" dictionary for additional arguments and considering a "scale format" enum for specific hardware requirements. A beginner expresses interest in contributing by writing test coverage or documentation.
  - Number of comments this week: 9
- BC-breaking change to symint range constraints from 2.7 -> 2.8: This issue concerns a backward compatibility-breaking change in the PyTorch library from version 2.7 to 2.8, where a specific code snippet involving dynamic marking and tensor operations errors out in version 2.8 but passes in version 2.7. The problem seems to be related to changes in the handling of symbolic integer range constraints, which affects certain test cases and requires further investigation and potential documentation updates.
  - The comments discuss the need to investigate the issue, with some suggesting it might require documentation updates. There are mentions of related test failures and internal bug investigations, with a specific pull request identified as a potential cause. The discussion includes technical details about cache keys and specialization points, and there is consensus not to revert a specific pull request despite its involvement in the issue.
  - Number of comments this week: 8
- Drop SSE4 support in oneDNN: This issue discusses the proposal to remove SSE4 support from the oneDNN library and seeks feedback on whether this change is acceptable for PyTorch, particularly concerning support for older platforms that do not have AVX capabilities. The issue also inquires about the availability of reference implementations for non-AVX platforms after the removal of SSE4 support.
  - The comments discuss the need for details on reference implementations for non-AVX platforms and confirm that PyTorch should work on older hardware, though performance expectations are low. It is clarified that reference implementations will remain, but jit-optimized implementations for certain operations will be removed. There is a suggestion to drop SSE4 support due to its age and to encourage AVX support for VMs. A question about solving an IPO issue on MSVC is raised and answered negatively.
  - Number of comments this week: 6
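As referenced in the grouped_mm item above, a minimal sketch of the two compile modes at issue: the default mode, which the proposed workaround would keep from autotuning, and max-autotune, which opts into kernel benchmarking. The `moe_like_matmul` function is a hypothetical stand-in, not the torchtitan reproducer.

```python
import torch

def moe_like_matmul(a, b):
    return a @ b  # stand-in for the grouped_mm call in the MoE layer

# Default mode: under the proposed fix, this falls back to the aten
# implementation instead of autotuning Triton kernels.
default_fn = torch.compile(moe_like_matmul)

# Opt-in autotuning: Inductor benchmarks candidate kernel configurations.
tuned_fn = torch.compile(moe_like_matmul, mode="max-autotune")

a, b = torch.randn(64, 128), torch.randn(128, 32)
print(torch.allclose(default_fn(a, b), tuned_fn(a, b), atol=1e-5))
```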
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment configured with specific model and pipeline settings. The error occurs in a setup using PyTorch version 2.4.0 with CUDA 12.1 on an Ubuntu 22.04.3 LTS system, and it affects the compilation of certain components of a pipeline, specifically the 'unet_garm' and 'unet_vton' models, using the 'torch.compile' function.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method (a sketch of the equivalence follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a problem encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, `cuda_utils.so`, fails due to a missing execution permission despite being run as the root user. The error occurs in a temporary filesystem with specific permissions, and the user reports that the file lacks the execution bit, leading to an ImportError when attempting to map a segment from the shared object.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standard. The process requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to address known issues such as type annotations and import cycles before the UFMT changes are committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in the PyTorch library to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the file size of models, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by the user's experience where removing these files manually resulted in a substantial reduction in model size without affecting functionality.
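As noted in the MaxPool2D item above, the proposed decomposition is easy to verify numerically; this minimal sketch checks the math only, and any CPU speedup would still need benchmarking on a given backend.

```python
import torch
import torch.nn as nn

# With stride 1, two 3x3 max pools cover the same 5x5 window as one
# 5x5 max pool: a max of maxes over overlapping windows.
x = torch.randn(1, 3, 32, 32)

pool5 = nn.MaxPool2d(kernel_size=5, stride=1)
pool3 = nn.MaxPool2d(kernel_size=3, stride=1)

print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```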
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 78
Summarized Issues:
- Compilation Errors in PyTorch: Compilation errors in PyTorch can arise from unsupported operations or configurations, such as when using the TorchInductor backend with complex64 tensors, leading to C++ compile errors. Another instance is the failure during CUDA tests due to undefined variables in generated C++ code, highlighting the need for better error handling and support for complex operations (an illustrative complex64 pattern appears after this list).
- Segmentation Faults and Crashes: Segmentation faults and crashes in PyTorch can occur due to various reasons, such as running specific unit tests on macOS or mutating tensors while serializing them in a multi-threaded environment. These issues often require community assistance or workarounds to resolve, as they can affect a wide range of users and configurations.
- Runtime and Import Errors: PyTorch users may encounter runtime errors, such as a `RuntimeError` in `torch.randint` with large upper bounds, or import errors due to outdated system libraries. These issues highlight the importance of handling large data types and ensuring compatibility with system dependencies.
- CUDA and GPU Compatibility Issues: Compatibility issues with CUDA and GPUs can arise when porting projects to different hardware, such as Intel GPUs, or when using specific PyTorch features like `torch.xpu`, leading to errors or unexpected behavior. These issues often require detailed troubleshooting and community input to resolve.
- PyTorch Compile and Dynamo Bugs: Bugs in PyTorch's `torch.compile` and Dynamo backend can lead to errors when handling unsupported data types or operations, such as object dtypes or slicing results of `torch.linalg.svd`. These issues highlight the need for improved support and error handling in PyTorch's compilation features.
- Graph Breaks and Meta Kernel Issues: Graph breaks and meta kernel issues in PyTorch can occur when using certain operators or configurations, such as the absence of a meta kernel for `aten::quantize_per_tensor.tensor_qparams`, leading to graph breaks instead of errors. These issues necessitate the addition of fake implementations or other solutions to maintain functionality.
- Continuous Integration and Build Process Problems: Problems in the continuous integration and build process, such as timeouts in C++ documentation builds or proposals to modernize the packaging process, can impact the development workflow. These issues often require discussions on improvements and temporary solutions to maintain efficiency.
- Performance Discrepancies and Regression: Performance discrepancies, such as slower backward passes with certain data types or regressions in execution time, can affect the efficiency of PyTorch operations. These issues highlight the need for consistent performance across different configurations and data types.
- Fully Sharded Data Parallel (FSDP) Bugs: Bugs in the Fully Sharded Data Parallel (FSDP) module, such as incorrect argument handling or inconsistencies with Distributed Data Parallel (DDP), can lead to unexpected behavior and require careful handling to ensure consistent model performance.
- CUDA and NCCL Integration Issues: Integration issues with CUDA and NCCL, such as invalid memory access or problems with CUDA runtime detection, can cause errors during execution. These issues often require workarounds or fixes to ensure proper functionality in distributed and parallel computing environments.
- PyTorch Documentation and API Discrepancies: Discrepancies between PyTorch documentation and actual functionality, such as incorrect descriptions of function arguments or missing methods, can lead to confusion and errors. These issues highlight the importance of accurate documentation to guide users effectively.
- TorchDynamo and Python Compatibility: Compatibility issues between TorchDynamo and Python features, such as `sys.monitoring`, can cause graph breaks and affect debugging processes. Addressing these issues is crucial for seamless integration with Python's evolving features.
- Distributed and Parallel Computing Enhancements: Proposals for enhancements in distributed and parallel computing, such as introducing a new CUDA Unified Memory backend or improving checkpoint serialization, aim to support larger models and improve performance. These proposals often involve discussions on implementation and potential benefits.
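As referenced in the compilation-errors item above, a simplified sketch of the kind of program those reports involve: `torch.compile` with the Inductor backend over complex64 inputs. The exact failing reproducers live in the issues themselves; this hypothetical version may or may not trigger the C++ compile error on a given build.

```python
import torch

def complex_op(x):
    # Elementwise work on complex tensors, the configuration the
    # reported Inductor C++ compile errors involve.
    return (x * x.conj()).real

compiled = torch.compile(complex_op, backend="inductor")
x = torch.randn(8, 8, dtype=torch.complex64)
print(compiled(x).shape)
```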
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 19
Summarized Issues:
- UnicodeDecodeError in torch.compile on Windows with MSVC: This issue involves a `UnicodeDecodeError` that occurs when compiling a simple model using `torch.compile` on Windows with MSVC in a non-English environment. The error arises because Inductor attempts to decode the MSVC compiler's output as UTF-8, which fails because the output is in the system's local encoding.
- Warnings and Errors in CUDA and PyTorch Compilation: Users encounter warnings and errors when compiling CUDA and PyTorch projects, such as warnings about extern declarations being treated as static definitions and unnecessary integer comparison warnings. These issues can degrade the user experience and require guidance or fixes to resolve.
- Flex Attention and torch.compile Issues in PyTorch: Limitations and bugs in PyTorch's Flex Attention and `torch.compile` features are highlighted, including the inability to create custom attention masks and an `AttributeError` due to missing attributes. These issues restrict functionality and are partially resolved in newer versions.
- Test Failures and Disabling in PyTorch Projects: Several tests in PyTorch projects are failing or have been disabled due to platform-specific issues, such as failures on XPU platforms and runtime errors on multi-GPU machines. These failures highlight the need for platform-specific adjustments and better test management.
- Build and Compilation Errors on aarch64 Architecture: Users face build and compilation errors when working with the aarch64 architecture, including delays in Docker image builds and GCC-related errors. These issues require workarounds and better coordination in build processes to ensure smooth operation.
- Discrepancies in Function Behavior and Precision Loss: Discrepancies in function behavior between different modes and platforms, such as in `F.conv_transpose2d` and `torch.nn.TransformerEncoderLayer`, lead to unexpected outputs and precision loss. These issues necessitate further investigation to align behavior across environments.
- torch.compile and Quantization Challenges: The `torch.compile` feature struggles with models using `torch.quantization.QuantStub` and `DeQuantStub`, as these functions are skipped during compilation. This raises questions about design choices and the need for alternative workflows compatible with `torch.compile`.
- Advanced Indexing Discrepancies Between Numpy and PyTorch: A discrepancy in advanced indexing behavior between Numpy and PyTorch is discussed, particularly with boolean masks. While PyTorch generally aligns with expectations, specific indexing patterns can lead to unexpected results, highlighting the need for careful handling of edge cases.
- RuntimeError and Export Issues on MPS Devices: A `RuntimeError` occurs when using `torch.export.export` on MPS devices due to unallocated placeholder storage. This issue can be resolved by importing `torch._dynamo` before executing the export function, indicating a need for better documentation or error handling (see the sketch after this list).
- PyTorch on Apple Silicon GPUs: The implementation of `torch.compile` aims to enhance machine learning pipelines on Apple Silicon GPUs, specifically targeting macOS 14+ with version 2.8.0. This development promises speedups for large language model inference, showcasing PyTorch's adaptability to new hardware.
- TypeError with NumPy Scalars in PyTorch Tests: A TypeError is encountered in PyTorch tests when using NumPy versions greater than 2.0, due to the inability to convert `numpy.bool_` scalar objects to tensors. This issue affects test reliability and requires updates to handle new NumPy scalar types.
- Unimplemented Dynamo Guard Source in PyTorch: The unimplemented Dynamo guard source due to integer specialization is a concern when implementing frozen dataclass sources. This issue is documented with a reproduction and error trace, indicating areas for future development and bug fixes.
- Incorrect Path in CONTRIBUTING.md for PyTorch Documentation: An incorrect path in the CONTRIBUTING.md file for installing requirements.txt led to confusion and discussion about the correct path. The issue appears to be specific to the original poster's setup, suggesting a need for clearer documentation.
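Following the MPS export item above, a minimal sketch of the import-order workaround described there; `TinyModel` is a hypothetical stand-in, and the snippet requires an Apple Silicon machine with MPS available.

```python
import torch
import torch._dynamo  # workaround: import before export to avoid the
                      # unallocated-placeholder RuntimeError on MPS

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

model = TinyModel().to("mps")
example_args = (torch.randn(4, 4, device="mps"),)

# torch.export.export traces the module into an ExportedProgram.
exported = torch.export.export(model, example_args)
print(exported)
```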
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 175
Key Open Pull Requests
1. [WIP] WindowsArm64 CI changes: This pull request is a work-in-progress draft aimed at testing and implementing changes to the continuous integration (CI) process for Windows Arm64 builds in the PyTorch project, involving numerous commits that add and modify scripts, workflows, and configurations to support and optimize the build and test processes for this architecture.
- URL: pull/157935
- Merged: No
- Associated Commits: a8e6d, 0d998, 23071, df086, 3380c, 2743b, 28b9d, dcff5, e7bb4, 9ef11, a9ee4, a6e04, b9a19, d7922, 64cd9, 78a40, e5f0b, 6aa19, bb4ec, d03be, 4ee65, 94513, e063c, 1c745, 037cc, 16fab, d0102, 1f1de, 5d227, 24710, 4a6ff, 33091, 20a45, 3e266, 136b2, 69113, b1545, acd76, 2787f, 127fa, 56e33, 5ed58, 2c165, bf215, d1d01, 435f3, b5749, cd8e4, c5bf2, fa363, 080a5, daa19, b06a4, 0ad2a, 55cd5, 515f8, 305c7, 8ee40, 28bbe, 51666, a84ff, af487, e923a, d377d, 11230, e5918, 903b9, 0f2a0, 9cbe1, 49a03, 538b3, e428e, 7713b, 91ab8, 9cba2, a0d52, cb75d, 0d223, b2079, 4c2fc, 390a4, 7619b, 14a3a, 9b072, ee16d, 9a084, bdf7e, 07a42, ee9b5, 8149f, 44f95, 30298, 5c758, 26a8a, a3f5a, 9c376, e89ae, 09de9, fa240, 77872, a803f, 2ae77, a1aea, 23ff7, 9824c, 1a099, 2ae9b, bc612, d182e, 51af0, 400ea, 2eda5, 9cd3a, f3454, 1a5ad, 20fa1, f11a7, ca621, a1c81, 629b7, 9eca5, a13cd, 4badb, 91f69, c487e, 95c29, 36c21, fa9f9, 87a2d, 0a943, 27a35, f2710, c8949, 4ffcc, 0b604, 16c00, e59ea, 5535e, 6f7c2, 408b5, 4118d, bb092, 834da, c008d, c4b37, 6595c, 71a0b, 51a78, 965f0, 2128d, 8c9f9, bd9b0, a5bfd, 5b518, f297b, c62fe, 61e2d, dd205, a87ab, ca5e5, e5e12, bdedb, 21db5, 71eec, d04fe, 3d35f, 40e5e, 0fe6a, 4a163, 9ed0c, 7ea71, d5dd0, 16a26, 3a3d5, 21f65, 6a301, e65e2, f337a, 0758b, 56cda, d36cc, 568e5, 75e5c
2. [inductor] initial triton static config lookup table: This pull request introduces the initial version of a static lookup table for Triton configurations, supporting operations like mm, addmm, bmm, and mm_plus_mm, with the aim of enabling a broader internal benchmarking and adoption, while also providing a foundation for expected behavior testing for future expansions involving more backends and functions.
- URL: pull/157699
- Merged: No
- Associated Commits: 09ac7, 01fa0, af299, 7b98f, a85f1, 25dfc, e0a13, ad3c3, ae254, b9a52, d01a0, e2222, 22f51, 124a6, 3c6ff, 66c2f, e76b0, 2e0bd, 2e8b6, c2a83, 95baf
3. Reproduce issue from #156097: This pull request aims to reproduce an issue from a previous report (#156097) by utilizing an older driver version (`525.105.17`) from a specific test branch, while incorporating various commits that include changes such as using the runtime driver API for `cuStreamWriteValue32`, refactoring and reverting driver API changes, and adding version checks, among other updates.
- URL: pull/158181
- Merged: No
- Associated Commits: 1484e, 0c71b, 263a3, 2084b, 96084, 77c01, 7e9f7, cd530, a91b9, d539a, 55581, 9e4be, 31557, c72d3, 21763, dd75c, 063e2, fe844, c1182, 36ea7
Other Open Pull Requests
- TMA Compatibility and Flex-Attention Updates: This topic covers updates to the Tensor Memory Allocation (TMA) compatibility requirements for the Inductor and Triton components, as well as enabling TMA for flex-attention on supported devices. The pull requests involve multiple updates and commits to address issues with Flex Attention kernel TMA block descriptors and loads, ensuring compatibility and performance improvements.
- SHFMT Linter and Shell Script Formatting: The addition of the `SHFMT` linter to format shell scripts is proposed, involving multiple contributors and stakeholders. This series of pull requests, managed through the `ghstack` tool, aims to format shell scripts across various directories in the PyTorch project.
- GCC Version Migration: This pull request focuses on migrating the PyTorch continuous integration jobs from GCC version 11 to GCC version 13. The change ensures compatibility and addresses regression issues, aligning with the binary builds that have already transitioned to GCC 13.
- MoviePy Version 2.x Support: This pull request adds support for MoviePy version 2.x in the PyTorch project, addressing issue #147317. It includes updates to the `summary.py` file within the `torch/utils/tensorboard` directory to ensure compatibility with the new MoviePy version.
- CUDA 12.4 Integration in CI: This pull request integrates a CUDA 12.4 build into the continuous integration process for the PyTorch project. It addresses issues from a previous GitHub issue and involves multiple commits for testing, installation, and skipping failing tests specific to CUDA 12.4.
- PrivateUse1 Key Configuration: Functions to configure the PrivateUse1 key as a Python backend device in PyTorch are introduced. This allows users to subclass tensors to hold arbitrary Python data as "device data" and to register operations with `torch.library`, addressing issues related to device guards and tensor subclassing for non-CUDA/CPU devices (see the sketch after this list).
- AOTAutogradCache Normalization: This pull request adds a pass to the `sanitize_gm_for_cache` function that normalizes placeholder names across input dynamo graphs for the `AOTAutogradCache`. This ensures consistency and safety, as the original dynamo graph's node names are not used by `AOTAutograd`, making this change effectively a no-op except for cache key checks.
- Inductor and Triton Matmul Enhancements: The initial implementation of native `tl.dot` support in Inductor is introduced to generate Triton matmul kernels directly. This includes a new configuration flag, a new `ops.dot` IR node, suitable tiling for matmul, code generation for `ops.dot`, and Triton autotuning heuristics.
- Float8 Data Type Support: This pull request addresses an assertion issue by adding the `float8_e4m3fn` data type to the assertion dtype list in the PyTorch project. It involves multiple commits for supporting this data type in lowering and meta functions, refining code, fixing lint errors, and adding unit tests.
- Reduction Configuration Lookup Table: A work-in-progress lookup table for reduction configurations in the PyTorch project is introduced. This builds upon a previous differential revision and involves multiple contributors for review and collaboration.
- Error Handling in torch.compile: This pull request enhances error handling in the `torch.compile` function by explicitly raising a `torch._dynamo.exc.Unsupported` exception with a clear explanation. It improves the user experience by replacing the previous, less intuitive `TorchRuntimeError` with a more direct and actionable error message.
- Composable Kernel Generation Reduction: Changes to reduce the number of composable kernel (ck) kernels generated in the PyTorch project are made. This includes updates such as modifying the ck kernel generation process, updating specific header files, changing APIs, and testing wrappers.
- Build Configuration Update for PEP 639: The build configuration of the PyTorch project is updated by pinning the `setuptools` version to 77 or higher to enable support for PEP 639. This is part of a series of changes managed through the ghstack tool.
- ideep Library Update Testing: This pull request tests the update of the ideep library to a more recent oneDNN commit. It involves multiple commits such as updating OpenBLAS and fixing a segmentation fault, and is not intended for merging.
- B200 Platform Benchmarking: Work on enabling the nightly PT2 benchmark on the B200 platform is resumed and completed. This includes several commits for testing and refining the benchmark process.
- Loop Iteration Simplification in Triton GEMM: This pull request simplifies loop iteration and centralizes modifications by extracting common logic for converting BaseConfig objects into keyword arguments for Triton GEMM templates. This ensures consistent updates across all templates and reduces the risk of missing changes.
- Indexing for Large Tensors: Indexing for large tensors is enabled by converting int32 indices to int64. This is part of a stack of changes tracked via ghstack and involves multiple contributors for review and collaboration.
- Compiled Mode Broadcasting Modification: The compiled mode broadcasting in PyTorch is modified to use zero strides for new dimensions of length one. This addresses inconsistencies in stride behavior compared to eager mode and is dependent on another pull request for testing.
- Intermediate Representation Node Reordering: A helper function to reorder intermediate representation nodes for pre-fetching bucketed AG/RS nodes is introduced. This optimizes the forward and backward pass by adjusting the sequence of operations to improve efficiency in the PyTorch project.
- Group Split Function Prototype: A prototype implementation of the `group_split` function is introduced as part of the dist2 project. This follows the proposal outlined in a shared Google document and involves multiple updates and collaboration with contributors.
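To make the PrivateUse1 item above concrete, a hedged sketch using existing public APIs (`torch.utils.rename_privateuse1_backend` and `torch.library.custom_op`); the helper functions the pull request itself introduces may look different, and `mylib::scale` is purely illustrative.

```python
import torch
from torch.library import custom_op

# Expose the PrivateUse1 dispatch key under a custom backend name.
torch.utils.rename_privateuse1_backend("my_device")

# Register a custom op; a real out-of-tree backend would add a
# "my_device"-specific kernel, while this sketch uses a CPU reference.
@custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    # Shape/dtype propagation for compilation and meta tensors.
    return torch.empty_like(x)

print(scale(torch.ones(3), 2.0))  # tensor([2., 2., 2.])
```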
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 194
Key Closed Pull Requests
1. [inductor][triton] Add experimental use_tensor_descriptor config option: This pull request introduces an experimental configuration option to use tensor descriptors in the PyTorch project, refactoring the code to support TMA (Tensor Memory Access) descriptors in general code generation, ensuring compatibility with Triton's requirements for minimum block sizes, and updating tests to accommodate these changes, while also requiring an upgrade to Triton version 3.4.0.
- URL: pull/157906
- Merged: No
- Associated Commits: 0458c, ac074, 81e29, 90684, 594d5, 8f794, 15fb0, a925c, f31d3, 417ea, 7a0b9, ac8f5, a8bdc, 5b51a, e8d67, 86046, cee8b, 36ea5, 615a2, 0eb37, 803ee, 7dcf7, 0e916, 494ef, 55de3
2. #IS157973/numpy version issue: This pull request addresses issue #157973 by modifying the `THPUtils_unpackNumberAsBool` function to explicitly recognize `numpy.bool_` scalars using `torch::utils::is_numpy_bool`, ensuring that if the object is a NumPy boolean, its truth value is retrieved via `PyObject_IsTrue` to avoid the previous error-prone path that treated it as an integer.
- URL: pull/158036
- Merged: No
- Associated Commits: ae85a, 51556, a359e, dabab, 5bd5c, a0792, b7cb0, c2c95, b8f9b, 292c9, 58cd6, a66b7, 428e3, ba14a, 4de53, 82aa0, 093f6, e7569, cae9e, f2d48, 709bb, 0687a, dbc98
3. [WIP] [Inductor][Intel GPU] Always use channel last for only freezing mode: This pull request aims to ensure that the PyTorch project consistently uses the channel-last memory format specifically in freezing mode for Intel GPUs, as indicated by the title and the involvement of multiple contributors and reviewers from various organizations.
- URL: pull/157827
- Merged: No
- Associated Commits: 3e6f0, 8e00a, 2b0f8, 3db12, d91af, 0cb3b, b1040, 8b879, 7228e, 9fd94, 34725, 30a9a, f6166, 90d16, a83d7, 65508
Other Closed Pull Requests
- Setuptools and Wheel Management: This topic involves managing the build requirements in the PyTorch project, specifically pinning the `setuptools` version to ensure compatibility and removing the `wheel` package from the build requirements. These changes are part of a stack of related updates tracked through the `ghstack` tool, although the removal of the `wheel` package was not merged.
- Dynamo and Compilation Enhancements: Enhancements to the PyTorch dynamo component include introducing a recompilation hook for logging custom compile-related information and implementing the `expand_hints` function to execute and expand `graph_break_hints`. These updates aim to improve the flexibility and functionality of the dynamo component, although the `expand_hints` function was not merged.
- Cost Coverage and Strategy Improvements: The PyTorch project has improved cost coverage by filling in missing `redistribute_cost` for operations like `cat` and `slice_scatter` and expanding the `cat` strategy. This enhancement increases the number of strategies based on input tensors and operation specifications, thereby optimizing the placement of each input tensor.
- Quantization Documentation Transition: The PyTorch project is transitioning its quantization documentation to use TorchAO, involving multiple updates to ensure correct generation from the continuous integration process. This transition marks a significant shift in how quantization documentation is managed within the project.
- CUDA and Architecture Support: Enhancements to the PyTorch project include adding support for the sm_80 architecture to the CUDA 12.9 SBSA build on the aarch64 platform and reintroducing the PTX microarchitecture into CUDA 12.8. These updates aim to expand the project's compatibility with various architectures, although the sm_80 support was not merged.
- Library and Dependency Management: The PyTorch project has addressed issues with library paths by replacing `find_path` with `find_library` for NVSHMEM libraries and has removed the unused `astunparse` dependency. These changes streamline the build process and ensure compatibility with system installation locations.
- Device Module and Indexing Improvements: Tracing through the `torch.get_device_module` function and addressing advanced indexing discrepancies with NumPy are key improvements in the PyTorch project. These updates enhance the project's functionality and ensure compatibility with NumPy's dimension ordering.
- Index Kernel Performance and Bug Fixes: The PyTorch project has improved the `index_kernel` function's performance for large tensors and fixed issues with `index_put_accumulate` for boolean types, nearly doubling performance for large tensors while addressing specific bugs in the function (see the sketch after this list).
- ONNX Exporter Enhancements: Enhancements to the ONNX exporter in PyTorch include adding support for symbolic arguments, allowing for more flexible model input handling. This update addresses previous limitations related to constant arguments in the export process.
- Symbolic Integer and Collective Enhancements: The PyTorch project has added support for unbacked symbolic integers in the Static Data Parallel Function Application and improved the "reorder_collectives_preserve_peak" feature. These updates enhance the project's functionality, although the symbolic integer support was not merged.
- Inductor Collectives and Validation Checks: Improvements in the PyTorch project include handling sink waits in inductor collectives and adding validation checks for the `jagged_dim` parameter. These updates ensure better handling of collective operations and parameter validation, although the validation check was not merged.
- CPython Tests and Serialization Protocol: The PyTorch project has added several CPython tests for various operators and dunder methods, and enhanced the serialization protocol of the Cutlass backend. These updates improve testing coverage and significantly reduce loading time for instantiation.
- Configuration and Communication Cleanup: The PyTorch project has introduced a mechanism to raise errors for unrecognized keys in the accelerator allocator configuration and removed unnecessary global rank parameters for NCCL communication. These changes streamline configuration management and communication processes.
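As referenced in the index-kernel item above, a small sketch of accumulate-mode indexing through the public `Tensor.index_put_` API, which is the entry point for the `index_put_accumulate` path; shapes and values are illustrative.

```python
import torch

t = torch.zeros(5)
idx = torch.tensor([1, 1, 3])
vals = torch.tensor([1.0, 2.0, 5.0])

# accumulate=True sums all writes at repeated indices instead of
# letting one write overwrite the others.
t.index_put_((idx,), vals, accumulate=True)
print(t)  # tensor([0., 3., 0., 5., 0.])
```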
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- add ir node bucket helper function
- Toxicity Score: 0.55 (Defensive responses, perceived dismissiveness, unresolved frustration.)
- This GitHub conversation involves a series of interactions where username1 initially proposes a change, and username2 provides feedback that is perceived as dismissive by username1, leading to a defensive response. Username3 attempts to mediate by offering a compromise, but username1 remains frustrated, feeling their efforts are undervalued. The tone shifts from collaborative to tense, with username1 expressing dissatisfaction with the process.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
XuehaiPan | 568 | 57 | 3 | 28 |
bobrenjc93 | 194 | 53 | 0 | 1 |
malfet | 112 | 19 | 7 | 103 |
coconutruben | 104 | 18 | 0 | 49 |
Skylion007 | 16 | 11 | 3 | 123 |
guilhermeleobas | 109 | 29 | 0 | 1 |
atalman | 94 | 12 | 12 | 17 |
williamwen42 | 36 | 10 | 5 | 82 |
jansel | 26 | 8 | 1 | 67 |
guangyey | 60 | 8 | 4 | 24 |