Weekly Project News


Weekly GitHub Report for PyTorch: September 08, 2025 - September 15, 2025 (12:03:46)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, a security-focused backward-incompatible change flipping the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.
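
The two user-facing changes called out above are easy to exercise directly. The snippet below is a minimal sketch (the file name and tensor shapes are illustrative) of the new torch.load default and the torch.compiler.set_stance control:

```python
import torch

# torch.load now defaults to weights_only=True, restricting unpickling to
# tensors and other safe primitive types.
state = {"weight": torch.randn(4, 4)}
torch.save(state, "checkpoint.pt")      # "checkpoint.pt" is an illustrative path
loaded = torch.load("checkpoint.pt")    # weights_only=True is the 2.6 default

# torch.compiler.set_stance switches torch.compile behavior at runtime,
# e.g. forcing eager execution without editing the model code.
@torch.compile
def f(x):
    return torch.sin(x) + x

f(torch.randn(8))                        # compiled path
torch.compiler.set_stance("force_eager")
f(torch.randn(8))                        # same function, runs eagerly
torch.compiler.set_stance("default")
```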

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. torch.compile generated wrong graph when tracing torch.use_deterministic_algorithms: This issue reports a bug where using torch.compile with the "inductor" backend generates an incorrect computation graph when tracing code that calls torch.use_deterministic_algorithms(True). The problem manifests as an assertion failure during compilation because the deterministic algorithms reset call is misplaced in the graph, causing an illegal graph structure; switching to the aot_eager backend avoids the crash but produces incorrect results, indicating incomplete support for deterministic algorithm toggling in compiled graphs (a minimal sketch of the reported pattern appears after this list).

    • The discussion reveals that wrapping torch.use_deterministic_algorithms in a context manager also triggers the same error with the inductor backend, while aot_eager compiles but yields wrong outputs. Commenters note that Dynamo currently does not properly handle these deterministic mode toggles in the graph, leading to silent incorrectness or crashes, and emphasize the importance of supporting mixed deterministic and non-deterministic operations in full graph compilation, especially for use cases like LLM training. Additionally, a separate bug is identified regarding the misuse of torch.use_deterministic_algorithms as a context manager, which is not officially supported.
    • Number of comments this week: 9
  2. bmm and baddbmm perf regression on CPU: This issue addresses a significant performance regression observed on CPU for the bmm and baddbmm operations with bfloat16 data type, as identified by benchmarking workflows. The discussion focuses on verifying the regression, investigating its cause, considering the timing of a fix relative to the 2.9 release, and the potential for expanding benchmarking coverage and publishing results for broader visibility and contribution.

    • The comments include commands and timing results confirming the regression from 55 microseconds in version 2.8 to over 4500 microseconds in 2.9 nightly builds, speculation about related code changes, suggestions for publishing benchmark results on official PyTorch sites, and confirmation that a recent pull request likely resolves the issue.
    • Number of comments this week: 9
  3. Compilation time debugging: construction sympy.Min/Max can be very slow: This issue addresses the significant slowdown in compilation time caused by constructing sympy.Min/Max objects during expression substitution, particularly affecting timm models with pooling operations. The author proposes a monkey patch to bypass certain sympy optimizations that are costly but often ineffective when free symbols are involved, demonstrating a 10% to 20% reduction in compilation time for the inception_v3 model.

    • The discussion includes requests for example code and clarification, with the author providing a command to reproduce the issue and enable the patch. Commenters acknowledge the patch’s effectiveness in reducing compilation time and suggest it could benefit other timm models as well.
    • Number of comments this week: 7
  4. Make torch.random.get_rng_state take a device argument: This issue proposes adding a device argument to the torch.random.get_rng_state function to allow it to accept any device type and dispatch to the appropriate backend for retrieving the random number generator state. The user notes that this API was suggested by ChatGPT and seems reasonable, inviting feedback on the idea.

    • The comments discuss prior implementations in torch.utils.checkpoint for managing RNG states across devices, highlight the complexity and cost of gathering RNG states from multiple devices, and suggest that CPU RNG state should be handled separately due to internal computations possibly occurring on the CPU even when inputs are CUDA tensors. There is also a suggestion to extend similar functionality to torch.accelerator with support for device indices and symmetric setters.
    • Number of comments this week: 7
  5. [CD] Windows all CUDA nightlies failed since Sept 12: This issue reports that all Windows CUDA nightly builds have been failing since September 12 due to out-of-memory errors during the compilation of flash attention CUDA kernels. The problem appears linked to a recent pull request that introduced changes causing excessive memory usage in the CUDA compiler, and although a fix was pushed, some unrelated CI failures remain.

    • The comments discuss the out-of-memory compilation errors, initially questioning why a prior commit did not cause this issue. It was clarified that the root cause was a specific pull request, which was subsequently fixed, but some CI failures unrelated to this fix were also noted, including a separate problem related to fbgemm_genai.
    • Number of comments this week: 7
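
As a companion to item 1, here is a minimal sketch of the reported pattern: toggling deterministic algorithms inside a function handed to torch.compile. The operation and shapes are illustrative, not taken from the issue's reproducer.

```python
import torch

def toggle_and_compute(x):
    # Toggling deterministic mode inside traced code is what the issue says
    # Dynamo/Inductor currently mishandle.
    torch.use_deterministic_algorithms(True)
    y = x.cumsum(0)
    torch.use_deterministic_algorithms(False)
    return y

x = torch.randn(16)
eager_out = toggle_and_compute(x)

compiled = torch.compile(toggle_and_compute, backend="inductor", fullgraph=True)
compiled_out = compiled(x)  # per the report: crashes here, or silently diverges with aot_eager
print(torch.allclose(eager_out, compiled_out))
```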

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Ubuntu 22.04, and shares a code snippet demonstrating the error triggered while compiling parts of a pipeline with torch.compile.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, potentially yielding a significant speedup in performance (a short equivalence check is sketched after this list).
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and directory permissions being correct.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling UFMT (a formatting tool) on all files within the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT formatting. It outlines the process for removing files from the exclusion list, running the formatter, handling known formatting-related problems, and organizing the work by directory to facilitate manageable and reviewable pull requests.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce file size. The motivation stems from observations that these debug files significantly increase the archive size without affecting model correctness, which is particularly important for deploying smaller, quantized models on mobile devices.
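
The equivalence behind item 2 is easy to check numerically. The sketch below (shapes chosen arbitrarily) confirms that, with stride 1 and no padding, one kernel-5 max pool matches two stacked kernel-3 max pools, since max is associative and the composed receptive fields cover the same 5x5 window.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
pool5 = nn.MaxPool2d(kernel_size=5, stride=1)
pool3 = nn.MaxPool2d(kernel_size=3, stride=1)

# Both produce (1, 3, 28, 28) outputs with identical values.
print(torch.equal(pool5(x), pool3(pool3(x))))  # True
```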

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 109

Summarized Issues:

  • Automatic Differentiation and Compilation Errors: Several issues report runtime errors and failures related to PyTorch's automatic differentiation and compilation features. These include inplace modification errors during backward automatic differentiation with native softmax (issue 162350), failures in torch.compile involving RNN pipelines and fake tensor allocation errors (issues 162374, 162375), bugs with custom operators causing backward pass failures in torch.compile (issue 162687), and numerical inconsistencies or crashes when using torch.compile on various models such as Transformers, Boltzmann Machines, and Local Attention (issues 162722, 162723, 162728).
  • issues/162350, issues/162374, issues/162375, issues/162687, issues/162722, issues/162723, issues/162728
  • Inductor Backend and Unit Test Failures on Windows: Multiple issues describe failures and test errors related to the Inductor backend on Windows platforms. These include failing unit tests due to data-dependent symbolic shape guards, dtype consistency, and stride order mismatches (issues 162365, 162366), as well as performance regressions and numerical mismatches in Inductor compiled models (issue 162725).
  • issues/162365, issues/162366, issues/162725
  • Linker and Build System Issues: Several issues highlight problems with the build and linking process. These include the need to modify the version_script.lds linker script to hide LLVM symbols and add macOS support (issue 162352), missing shared libraries causing CUDA kernel compilation failures (issue 162367), MSVC Ninja build failures with missing object files and linker errors (issue 162786), and undefined references to dynamic loading functions on s390x architecture (issue 162824).
  • issues/162352, issues/162367, issues/162786, issues/162824
  • Numerical and Memory Errors in Core Operations: Various core PyTorch operations exhibit numerical instability, memory errors, or incorrect outputs. Examples include heap-buffer-overflow in torch.linalg.eig and quantized max pooling (issues 162358, 162476), NaN outputs from BatchNorm1d on GPU with large inputs (issue 162489), floating point exceptions on empty CUDA tensors (issue 162473), and incorrect results from DTensor.mean on uneven sharding (issue 162692).
  • issues/162358, issues/162476, issues/162489, issues/162473, issues/162692
  • Distributed and Parallelism Test Failures: Several issues report failures in distributed training and parallelism tests. These include tensor mismatches in distributed CUDA tests on B200 GPUs (issue 162429), failures in distributed data parallel RNN tests (issue 162745), and problems with pipeline parallelism backward stage initialization (issue 162822). Additionally, tests requiring exact world sizes fail on larger GPU setups (issues 162746, 162748, 162755, 162871).
  • issues/162429, issues/162745, issues/162822, issues/162746, issues/162748, issues/162755, issues/162871
  • Profiling and Performance Regressions: Issues report profiling inconsistencies and performance regressions. The PyTorch profiler shows inconsistent CUDA kernel activity counts without manual synchronization (issue 162481), and significant slowdowns are observed in bmm and baddbmm operations with bfloat16 on CPU (issue 162553) and in torch.matmul on CPU between versions 2.9.0 and 2.10.0 (issue 162683). Additionally, adding batch dimensions unexpectedly slows down scaled dot product attention (issue 162592).
  • issues/162481, issues/162553, issues/162683, issues/162592
  • TorchDynamo and Graph Break Handling: Multiple issues discuss problems and proposals related to TorchDynamo's graph break behavior and error handling. Bugs include incorrect handling of next(..., default_obj) causing graph breaks (issue 162835), proposals to allow toggling error or resume on graph breaks (issue 162832), and suggestions to replace assert statements with graph breaks to avoid hard crashes (issue 162852). Enhancements for better debugging information and source attribution in Dynamo are also proposed (issues 162857, 162858, 162860).
  • issues/162835, issues/162832, issues/162852, issues/162857, issues/162858, issues/162860
  • CUDA and GPU Compatibility Issues: Several issues report CUDA and GPU compatibility problems, including failure to detect CUDA GPUs on Windows WSL2 with NVIDIA Blackwell architecture (issue 162403), runtime errors due to unsupported Nvidia Jetson Thor GPU architecture in torch.compile (issue 162402), and dropped support for GPUs with compute capability less than 7.5 causing unclear errors (issue 162574). Also, Windows CUDA nightly builds fail due to out-of-memory errors compiling flash-attention kernels (issue 162881).
  • issues/162403, issues/162402, issues/162574, issues/162881
  • ONNX Export and Model Export Failures: Issues report failures and regressions in exporting models to ONNX or TorchScript. These include LSTM models with weight normalization aborting during ONNX export in PyTorch 2.8.0 (issue 162376), GPT-2 model export failures due to missing batching rules in AOT Inductor (issue 162838), and runtime errors when exporting models with unsupported output types like transformers.cache_utils.DynamicCache (issue 162741).
  • issues/162376, issues/162838, issues/162741
  • Documentation and Usability Concerns: Some issues highlight documentation inaccuracies and usability problems. These include a typo in the documentation example comparing tensor sums (issue 162882), lack of clear documentation on supported index data types for scatter and gather (issue 162711), and poor search results due to the default Google Search setting in the docs UI (issue 162512).
  • issues/162882, issues/162711, issues/162512
  • Miscellaneous Bugs and Feature Requests: Other issues cover a variety of bugs and feature requests such as the need for built-in gradient clipping in pipeline parallelism (issue 162638), adding device argument support to torch.random.get_rng_state (issue 162812; today's per-backend calls are sketched after this list), and requests for prebuilt PyTorch wheels for AWS g5g instances with T4G GPUs (issue 162780). Bugs include segmentation faults on CUDA 13.0 with Numba integration (issue 162878), deadlocks on macOS MPS backend (issue 162872), and incorrect flattening behavior when converting lists of 1D tensors (issue 162790).
  • issues/162638, issues/162812, issues/162780, issues/162878, issues/162872, issues/162790
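
For the get_rng_state request in the last group, the sketch below shows the backend-specific calls that exist today; the proposed device argument would fold these into one dispatching entry point (the unified form itself is not shown, since its signature is still under discussion).

```python
import torch

# Today, RNG state is fetched and restored through backend-specific APIs.
cpu_state = torch.get_rng_state()              # CPU generator state
torch.set_rng_state(cpu_state)

if torch.cuda.is_available():
    cuda_state = torch.cuda.get_rng_state(0)   # per-device CUDA generator state
    torch.cuda.set_rng_state(cuda_state, 0)
```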

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 31

Summarized Issues:

  • CUDA and Device-Specific Numerical Issues: Several issues report numerical discrepancies and incorrect results on CUDA or other specific hardware backends. These include incorrect cumulative sums with bfloat16 dtype on CPU and CUDA, incorrect sorting of large floats on CUDA, differing outputs of torch.as_strided_scatter on CPU vs CUDA, and incorrect results from torch.nn.functional.linear on Apple MPS with non-contiguous weights. These problems highlight inconsistencies in numerical computations across devices. A simple precision check for the bfloat16 case is sketched after this list.
  • issues/162408, issues/162602, issues/162604, issues/162730
  • Test Disabling Due to Platform Failures: Multiple tests have been disabled on specific platforms due to consistent failures on the main branch. This includes tests in the TestFlexAttentionXPU suite and WhileLoopTests on XPU platforms, as well as a CudaReproTests suite test on ROCm due to timeouts after a Triton release. These disabled tests indicate ongoing stability issues on these hardware platforms.
  • issues/162390, issues/162435, issues/162436, issues/162452
  • Torchgen YAML Parsing Bugs: There are bugs in the torchgen YAML parsing related to missing validation and incorrect parsing of tag entries. These cause runtime errors such as ValueError and TypeError during YAML file processing, indicating robustness issues in the code generation pipeline.
  • issues/162395, issues/162397
  • Documentation Gaps and Issues: Several issues highlight missing examples or outdated documentation. Missing examples were noted for torch.autograd.grad and torch.is_storage, while the main branch documentation was not updated for a long time due to a failing GitHub Actions workflow. Some feedback on documentation was closed due to insufficient information.
  • issues/162379, issues/162613, issues/162598, issues/162710, issues/162714
  • ONNX Export and Runtime Discrepancies: Issues were reported regarding ONNX export behavior and runtime inconsistencies. Exporting torch.atan2 to ONNX produces NaN for zero inputs unlike PyTorch, and the default fallback behavior in torch.onnx.export causes confusing errors, prompting a proposal to change the default. Additionally, exposing ONNX testing utilities was requested.
  • issues/162570, issues/162697, issues/162456
  • Runtime Errors and Crashes in Tensor Operations: Several runtime errors and crashes occur in tensor operations under specific conditions. These include assertion failures in torch.Tensor.bernoulli with invalid inputs, runtime errors in torch.utils.data.random_split due to device mismatches, heap-buffer-overflow in torch.equal with quantized tensors, and crashes when calling torch.Tensor.tolist on CUDA quantized tensors.
  • issues/162378, issues/162486, issues/162801, issues/162800
  • Build and Compilation Failures: There are build failures caused by missing linker flags on AArch64, unexpected compiler termination during CUDA Blackwell builds, and runtime errors when compiling modules with variable arguments in torch.export.export. These issues block successful builds or module execution.
  • issues/162640, issues/162769, issues/162599
  • Functionality Bugs in Core Operations: Bugs affecting core PyTorch functions include torch.linalg.eigh failing on large CPU tensors, torch.nanmedian returning 0 instead of NaN on MPS backend for empty inputs, and torch.compile failing with NameError when combo_kernels is enabled due to missing helper functions. These bugs impact correctness and compilation reliability.
  • issues/162805, issues/162798, issues/162756
  • Performance Optimization and Caching: A proposed intrusive caching mechanism for DLPack conversions in the C++ tensor implementation aimed to reduce overhead but was found to have limited performance benefits, questioning the complexity tradeoff.
  • issues/162630
  • Benchmark and Metric Measurement Failures: The arange_test in operator benchmarks failed due to an AttributeError caused by an integer lacking a 'device' attribute during metric measurement, indicating issues in benchmarking infrastructure.
  • issues/162708
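
Reports like the bfloat16 cumulative-sum discrepancy above usually come down to a precision comparison. A small sketch of how such drift is typically measured, using a float64 reference (values and sizes are illustrative):

```python
import torch

# Compare a low-precision cumulative sum against a float64 reference to
# quantify the drift; the tensor size here is illustrative.
x = torch.randn(10_000)
ref = x.double().cumsum(0)
bf16 = x.to(torch.bfloat16).cumsum(0).double()
print("max abs error:", (bf16 - ref).abs().max().item())
```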

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 200

Key Open Pull Requests

1. Add operator benchmarking run to CI nightly: This pull request adds a new "operator microbenchmark" continuous integration workflow and GitHub Actions to run comprehensive operator benchmarks nightly, enhancing test scripts and job matrices to support multiple data types, larger tensor shapes, and gradient tests, and focusing benchmark configurations on CUDA and mixed precision to improve coverage and extensibility.

  • URL: pull/162530
  • Merged: No
  • Associated Commits: d257e, 81915, 461c7, ada9c, 162e7, 6357d, 9b226, 19931, eab7b, c193e, 9c701, cf31d, 54c95, c6f1a, c021d, 2fe66, 71ae2, 629de, 8d0ca, 05bb4, 4c257, 7f5b0, 9eca4, d683f, 6b8cc, 49e5e, f47e5, 7b16f, cc2b1, 056bc, 86e38, 98a71, 36357

2. [DeviceMesh] Make CuTe layout as mesh layout to be ready for using in DeviceMesh: This pull request introduces a wrapper class named "_MeshLayout" to adapt the CuTe layout as a mesh layout specifically for DeviceMesh, enabling the addition of DeviceMesh-specific methods while preserving the core CuTe manipulation logic within the pycute module, and sets up the foundational code for subsequent implementation and unit testing.

  • URL: pull/162414
  • Merged: No
  • Associated Commits: 4a046, 0ec96, df82f, 08541, 79f8d, cdc3d, df37b, d74be, 08dad, 7d792, 1e248, 2b39e, 31bae, 5102a, 5423b

3. add support for hint_override in mark_unbacked: This pull request adds support for the hint_override feature in the mark_unbacked.cc component, consolidating the approach to use the hint override solely for size hints, similar to a previous change made for a related component.

  • URL: pull/162652
  • Merged: No
  • Associated Commits: 11a96, 406c8, 6ed56, a9243, 3bca5, 05473, 150b0, a6e67, c9a02, a5b64, e6df7

Other Open Pull Requests

  • Distributed Environment Initialization and DeviceMesh Enhancements: Multiple pull requests improve distributed environment initialization by introducing a new dist.init() API that removes the need for init_process_group and enables retrieval of world size and rank (the setup boilerplate it aims to remove is sketched after this list). They also enhance DeviceMesh by allowing creation of communication sub-groups without a world group, removing implicit global process group creation, and clarifying participation requirements for collective calls versus direct construction.
    [pull/162529, pull/162545, pull/162549, pull/162571]
  • DTensor and Debugging Improvements: Several pull requests address DTensor functionality by fixing sharding propagation bugs related to cached spec mutation and introducing a lightweight DTensorDebugMode to track and display redistribution and tensor operations hierarchically. These changes facilitate debugging and improve internal DTensor behavior understanding.
    [pull/162702, pull/162665]
  • Performance and Compiler Optimizations: A set of pull requests focus on performance improvements including relanding vectorized load/store operations for concatenation, adding a configuration to reduce warps in inner reductions, moving LOAF loop reordering to a final fusion round, increasing persistent reduction block parameters, and optimizing iteration calls to reduce tracing time. These collectively aim to enhance runtime efficiency and compilation.
    [pull/162440, pull/162447, pull/162354, pull/162355, pull/162754]
  • Testing and Stability Fixes: Pull requests improve test stability by replacing sys.exit calls with unittest.skip decorators in NCCL tests to avoid unnecessary subprocesses and fixing inconsistent tests by introducing a global tracer configuration. They also address strict mode issues with gradient state propagation in the new tracer.
    [pull/162706, pull/162542, pull/162558, pull/162559]
  • Memory Format and Tensor Behavior Adjustments: One pull request updates the behavior of are_strides_like_channels_last to return False when memory format is undecided, which affects suggest_memory_format for unbacked tensors. Another changes pycute manipulation operations from co-lexicographic to lexicographic order to align with PyTorch and NumPy conventions, fixing related bugs and adding tests.
    [pull/162863, pull/162690]
  • New APIs and Feature Additions: New APIs and features include torch.xpu.can_device_access_peer for Intel GPUs to check peer device access, support for non-node outputs in the hop framework, and a compact visualization tool for profiling model traces with detailed kernel metrics and comparisons.
    [pull/162705, pull/162610, pull/162580]
  • FlexAttention and Parallelism Enhancements: A pull request introduces a flex_attention_wrapper and a ContextParallel plan to enable composable dispatching of FlexAttention and SDPA calls with context parallelism, simplifying user integration and addressing previous composability issues.
    [pull/162521]
  • Rotary Embedding Fixes: One pull request corrects the implementation of the rotary_embedding_23 function for 3D inputs and adds tests to ensure proper functionality.
    [pull/162865]
  • Build Configuration Updates: A pull request modifies build configuration files to enable compile-time detection of the SVE128 instruction set and adds appropriate C++ compile flags, with testing planned via the nightly pipeline.
    [pull/162523]
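
The dist.init() proposal in the first group above targets the setup boilerplate shown below. This is a sketch of the existing torch.distributed API (backend choice and launch method are illustrative), not of the proposed call, whose final shape is defined in the pull requests.

```python
import torch.distributed as dist

# Current pattern: explicitly create the process group before querying
# rank and world size. Launch with torchrun so RANK/WORLD_SIZE are set.
def setup():
    dist.init_process_group(backend="gloo")  # "nccl" on GPU hosts
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = setup()
    print(f"rank {rank} of {world_size}")
    dist.destroy_process_group()
```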

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 161

Key Closed Pull Requests

1. [release/2.4] add triton name: This pull request is a release/2.4 branch cherry-pick that adds support for using a pinned Triton version with a specific Triton name change based on the ROCm version, including various fixes and updates related to Triton integration and ROCm compatibility for PyTorch.

  • URL: pull/162603
  • Merged: No
  • Associated Commits: c85e2, 1cd41, 62417, 24a38, 6be02, 459e2, 082c4, ed624, 9ad8a, e7dde, d71de, 86271, b1d53, 562cd, 4af50, 0e0a9, edcc7, 50e57, 434bf, 0233f, d2e4c, 67a81, 93c51, 165e0, 49d2e, 699c0, 04e98, ec190, 491e9, 2bf37, 56086, 22a4d, 04339, 4d83b, 1f845, 072d9, 5f7de, 3d7d7, ca8d4, d0831, 80277, fadd3, 95336, 1164d, 12ad7, b26cd, 705e3, e5bda, 49962, 9afe4, e4ee3, d990d, 14ab5, f6fb8, 26735, d6199, 6ba64, 58ab9, 7e0ef, 24e04, d3d93, 10028, 18736, 30faa, dab23, 362a6, 346e0, 4ebe5, 2213c, eba4d, bd92f, c469d, 847a0, 59203, 920c0, 3675f, 9c1f7, 314f0, e0ddb, 6a79d, b84e8, 38b96, 79c88, ee1b6, 28471, 1814b, 0497f, 2e99f, 29ce1, d2d9c, 633cd, d6538, e780f, a3e26, ed765, b0099, 6f165, cd53a, 8c46f, 3b2a3, b1319, 6cbb9, f87c4, 80e24, af93a, 9204c, 000cf, a51e4, dca9b, 658d1, 47e9e, e4fa7, 2ee89, d8be1, 82bdb, a9ddf, d4828, 7f749, 822b3, a78ba, 30e79, ebec4, 98d72, 4c0be, f67dd, 2b777, 985a5, 0e3d8, b3d0a, f207c, b3dea, de3a5, ba26a, eccf8, ff91c, ab1ab, 74162, 37580, edf98, 27a9f, f9999, 84389, 15f43, 4dafe, 4ca9d, ef215, 74818, 519dd, 52e6d, b92a1, 87792, 8b903, 6eda8, 86dd9, df500, 1c91d, e348b, 0d291, fd84d, ad05a, 01efe, 1b312, ecd8b, d8592, def7d, d18dc, 92087, 8c37c, 7ab42, d2983, 0092d, f69f5, 58b4b, e8e52, 75f5c, 72ed3, 0b61a, 3fce1, 9b223, cf1ee, fef4c, c32b2, 462e0, a45af, 565ee, 968f1, bdd2f, a3a75, df8f5, 3e46d, 1a31a, 4f49f, 8cfa3, 8042b, 891cd, c0c52, b2399, aeb2a, 2f488, 5cd12, 1de1c, b3cbe, 39a0d, dce78, 912d4, 631e6, 5b2ff, cedd5, 4d5ad, 88661, 50aed, 8ae84, 02c55, 7d57f, a0157, 2b568, b72d3, 2f00d, 3f1f7, 85e3f, b3030, c1569, 15f86, 4e7ae, b632f, fdea1, c2bcc, 33231, d660f, f1615, a0e57, b7d9d, d670b, c6699, 24163, e7bc5, ea5c9, 8ecb6, 783fa, 6a5e1, f3715, 3795c, 61531

2. Export d80005963: This pull request, titled "Export d80005963", pulls in the latest changes from multiple prior commits covering improvements and fixes to PyTorch's inductor backend, distributed optimizer metadata handling, memory leaks, export loading, and various other areas, but it was ultimately not merged.

  • URL: pull/162431
  • Merged: No
  • Associated Commits: 9bdce, 89d41, 0d71a, b04e9, 1ec2c, 0d84f, 09be1, 3a207, 9499c, c7e41, d2d4c, 5c674, 29280, 73eb4, be5b0, b67c4, 3bbc2, 49487, 5da57, 5c473, bffc7, a7144, 2dd52, 06da7, f3ceb, b2c7b, c2a30, 98374, 261a8, d711f, b18bb, 2ef66, 96025, 4902c, af590, a301d, 031d7, d63ad, e02e9, 9a8d4, c3211, 88d94, 70f86, adae7, 37713, a3c7f, 6087e, de893, 01edc, 2fa05, 92a43, 771f3, c1019, a00cd, 70d36, 79fcd, 01ab3, e0a62, 8d503, 9c03d, 4d4ab, 486b2, 081ca, 14637, 4f72d, 0f45a, 7f4ff, 291cd, 145a3, c3cec, 20629, b2b4a, c0983, a3e54, da4db, 20b47, aac1a, bc505, c98dd, 28f4a, 0ff8e, 9aedb, 5985e, b6d0a, ae0ed, 04760, 541aa, 1a588, 5927a, 48e3b, 2b8a8, 5211f, e3068, b9195, 2a458, fea20, eac3d, 104f2, 93fb2, ada43, 7a83c, 9ad5e, 4348d, 093ab, df59c, e246a, 8235c, ff2de, ec2e3, eb907, 5babb, 103f7, c9ac8, 29e09, 1e065, fb0af, 31d5c, 5b90e, 32911, e1014, 3f599, 25c17, 53297, a9277, f044f, 8e076, 49c44, 5793d, ebd29, 72e67, de5dc, bc417, 314d4, fbcab, d8029, a0d02, 4e506, 9c991, 8ec01, ec2c1, 8f114, 26a1b, 01542, 5d819, dd44f, fecd9, 85fe9, 2c538, 711c8, ac9cc, 5fd6b, 189a0, 07f07

3. [WIP] Port test_nn.py to Intel GPU: This pull request aims to port the unit tests in test/test_nn.py to Intel GPU. It enables Intel GPU execution by replacing CUDA-only test decorators with ones supporting both CUDA and XPU, adding allow_xpu=True in test parameterizations, using torch.accelerator to extend CUDA-specific tests to XPU, adding a skipIfXPU decorator for tests with known Intel GPU issues, enabling 'xpu' device types in test paths, and generalizing device type retrieval for accelerators, all while striving to maintain the original code style.

  • URL: pull/162369
  • Merged: No
  • Associated Commits: 16cf9, b2287, e0922, bfb42, 56dcc, e4dfe, 8ce57, 72734, b38a8, c953c, 2e63d, d30b5, 6250f, 5309f, 610c0, 53544, 2ab81, 60d16, da822, 24e66, 7a816, bd35b, 0e5af, ba5cb, 189ca, 35954, 99700, 8db36, f7db0, 17ac3, d3088, 6b5a8, 1ac12, dab35, d7d93, a1a9b, 91b6b, cd5d7, 8b39c, 2fcc5, 7e312, 82a58, ff758, 5c5f7, 9f7e0, f5ab8, 0ca45, bb2c5, 4b04f, 1e532, cdb2e, 2443e, 4fa24, 3de6c, 7b83b, 9f213, d1df6, 06a73, fa979, 6b90b, d2e87, 695e5, eb62c, 3aff9, e7604, 1412f, 0e9e9, d514f, 2e6fa, 8ab7e, 83c9f, e6f13, 443f0, 41e32, bece0, 2e33b, 06985, 2e267, feeb7, f5214, 01b9b, 3748f, 4c4e1, 2f394, 3e308, 17253, 17010, 12dac, 364ea, b5183, 42a44, 5680a, 32147, c6e45, f64ef, ccb8a, 3b3a6

Other Closed Pull Requests

  • SVE128 Support and Optimization: Multiple pull requests introduce and enable SVE128-specific vectorized template layers and performance kernels for various data types in PyTorch and Caffe2. These changes optimize performance on SVE128 CPUs by differentiating SVE128 implementations from general SVE and mixing NEON and SVE instructions, including compile-time detection and specialized operations.
    [pull/162433, pull/162524]
  • vLLM Wheel Build and Packaging Fixes: Several pull requests address building, publishing, and repackaging vLLM nightly wheels for the aarch64 platform, fixing dependency conflicts, version mismatches, and import errors caused by missing libraries or renaming issues. These efforts aim to ensure proper installation and compatibility of vLLM wheels with related packages like xformers.
    [pull/162664, pull/162371, pull/162747]
  • Dynamo Guard Mechanism Optimization: Pull requests focus on optimizing the Dynamo guard mechanism by avoiding unnecessary construction or conversion of the framelocals dictionary, especially for the LAMBDA_GUARD case. These changes improve efficiency by preventing redundant operations in guard handling.
    [pull/162525, pull/162509]
  • CUDA Kernel Compilation Enhancements: Multiple pull requests improve the compile_kernel function by adding support for C++ outer templates, fixing large shared memory usage issues, adding CUDA include directories, and handling Python floats as C doubles. These updates enhance kernel compilation flexibility and correctness.
    [pull/162818, pull/162647, pull/162632, pull/162626]
  • Device and Backend Agnosticism Improvements: A pull request proposes replacing CUDA-specific APIs with torch.accelerator and device modules in DataParallel to support custom backends beyond CUDA, enhancing device agnosticism in PyTorch (the general torch.accelerator pattern is sketched after this list).
    [pull/162573]
  • Distributed Backend and Mixed Backend Fixes: A pull request addresses an issue in the distributed backend where process group options were incorrectly propagated across multiple backends, implementing a fallback and warning mechanism to prevent assertion errors in mixed-backend groups.
    [pull/162424]
  • PyCute Device Mesh Module Integration: One pull request copies the device mesh bookkeeping module and its unit tests from NVIDIA CuTe into PyTorch, adjusting formatting and linting to integrate with PyTorch's CI system.
    [pull/162413]
  • DTensor aten.unbind.int Support and Refactoring: A pull request adds support for the aten.unbind.int operation in DTensor, handling view returns and error cases, while refactoring tensor operations with new utility functions for shard dimension management.
    [pull/162560]
  • MPS Sparse Multiplication and Additional Ops: A pull request implements MPS sparse multiplication and enables several other operations such as copy_, division, sum, floor, power, subtraction, and floor division, although it was not merged.
    [pull/162349]
  • Smoke Test Platform Restriction: A pull request restricts smoke tests that run the nvshmem library to Linux x86_64 and aarch64 platforms, preventing unsupported execution on Windows.
    [pull/162646]
  • FlexAttention TMA Feature Disable Fix: A pull request disables the TMA feature in the FlexAttention module when it cannot be used, cleaning up noop configurations and related code.
    [pull/162569]
  • Type Annotations and Linter Improvements: A cosmetic pull request uses Claude to add type annotations, remove linter suppressions, and introduce type checkers, type aliases, and selective mypy ignores to copied PyCute code for better linting compliance.
    [pull/162534]
  • torch.is_storage Example Addition: A pull request adds an example for the function torch.is_storage, addressing a specific issue, but it was not merged.
    [pull/162614]
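
The DataParallel change above points at the torch.accelerator module. A brief sketch of the device-agnostic pattern it enables (the DataParallel internals themselves are not reproduced here):

```python
import torch

# Query the active accelerator instead of hard-coding torch.cuda, so the
# same code path works for CUDA, XPU, MPS, or a custom backend.
if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
    count = torch.accelerator.device_count()
else:
    device, count = torch.device("cpu"), 0

x = torch.randn(4, 4, device=device)
print(device, count, x.device)
```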

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor    Commits  Pull Requests  Issues  Comments
malfet              99             12       9       123
coconutruben       204             17       0         5
guangyey           108             13       0        46
swolchok           116             22       1        19
huydhn             126              9       2        20
kwen2501           101             19       0        24
ezyang              35             17       8        83
yangw-dev          127              5       0         7
Skylion007           7              3       0       120
fduwjj              66              9       0        51
