Weekly GitHub Report for PyTorch: November 17, 2025 - November 24, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda packages.
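As a quick, hedged illustration of two of those changes (a minimal sketch; "model.pt" is a placeholder checkpoint path):

```python
import torch

# Since PyTorch 2.6, torch.load defaults to weights_only=True, which restricts
# unpickling to tensors and other allow-listed types; loading checkpoints that
# contain arbitrary Python objects now requires an explicit opt-out.
state = torch.load("model.pt")                        # weights_only=True by default
# state = torch.load("model.pt", weights_only=False)  # pre-2.6 behavior, opt-in only

# torch.compiler.set_stance adjusts torch.compile behavior at runtime, for
# example forcing eager execution without editing the compiled code paths.
torch.compiler.set_stance("force_eager")
```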
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [CI][CUDA][Flex Attention] Regression (Illegal Memory Access) with Flex Attention: This issue reports a regression causing illegal memory access errors when running a Flex Attention test on CUDA platforms, specifically on H100/H200 GPUs. The problem appears linked to a recent commit that changed block pointer usage to manual offset calculations, which may not correctly handle the complex, non-contiguous memory layouts in paged attention with stride ordering and permutation.
- The comments discuss debugging details including cuda-gdb outputs and kernel indexing types, confirming the use of int32 indexing rather than int64. Analysis suggests the manual offset calculations introduced in the commit are error-prone for paged attention scenarios, likely causing the illegal memory access, and recommend reverting the commit or carefully reviewing the offset logic to fix the issue.
- Number of comments this week: 7
- Insufficient documentation about the batching logic of torch.linalg.solve: This issue addresses the insufficient and potentially confusing documentation regarding the batching logic of the `torch.linalg.solve` function, specifically how batch dimensions are handled differently when `B` is a batch of vectors versus a batch of matrices (see the first sketch after this list). The user suggests clarifying the documentation to explicitly distinguish these cases and to explain the precedence of interpreting `B` as a batch of vectors over a zero-dimensional batch of matrices, aiming to reduce user confusion and prevent incorrect assumptions about broadcasting behavior.
- The comments discuss the complexity of the batching logic and the challenges of implementing consistent broadcasting support without causing backward incompatibility. Contributors debate whether to fully support broadcasting in both cases or to clearly document the current limitations, ultimately favoring improved documentation clarity over changing the existing behavior. A suggestion is made to submit a minimal documentation update reflecting the actual implemented logic.
- Number of comments this week: 5
- BF16 activation precision mismatch between eager ATen and compiled Triton: This issue reports a precision mismatch for activation functions like sigmoid and tanh when using the bf16 data type between eager ATen execution and compiled Triton kernels. Specifically, eager mode computes activations directly in bf16, whereas the compiled Triton kernel upcasts inputs to fp32, performs the activation, and then downcasts back to bf16, leading to accuracy differences for the same model (see the second sketch after this list). The user questions why this upcast-then-downcast strategy is currently employed given the observed discrepancy.
- The comments discuss a related issue and suggest an environment variable to emulate precision casts, but the original poster clarifies that their case differs because it involves a single activation without fused ops and that eager mode actually computes in bf16 directly. The conversation requests further explanation on the rationale behind the upcast-then-downcast approach, and the issue is marked for discussion in an upcoming internal meeting.
- Number of comments this week: 5
- Dynamo overguards on checkpoint policy functions that are closed over by compiled region: This issue reports a bug where Dynamo over-guards on checkpoint policy functions that close over data structures like OrderedSet when used within compiled regions, causing unnecessary recompilations. The problem involves Dynamo tracking the identity and contents of sets in a way that leads to spurious recompiles, while using dictionaries instead avoids this issue.
- The comments discuss examples demonstrating over-guarding on irrelevant set contents and the set’s identity, note that dictionaries do not trigger recompiles, and include observations about Dynamo’s current lack of support for OrderedSet in VariableBuilder. Contributors also highlight issues with nondeterministic set ordering affecting guard behavior and propose creating support for OrderedSet to address the problem.
- Number of comments this week: 5
- Dynamo: Incorrect ObservedTypeError: ConstantVariable(str: "unhashable type: "): This issue reports a bug where a program that runs correctly in eager mode fails under Dynamo compilation due to an incorrect ObservedTypeError related to an unhashable user-defined object variable. The user points out that user-defined classes should be hashable by default via object identity, indicating a potential flaw in Dynamo's handling of such objects.
- The comments include a developer stating they are working on a fix and plan to submit a PR, followed by a caution against working on issues not marked as actionable. Another comment expresses skepticism about using automated code generation for the fix, and a suggestion is made to add a check in Dynamo for user-defined objects to verify that their `__hash__` method is not None, reflecting the expected default hashability by identity.
- Number of comments this week: 4
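Two of the items above lend themselves to short illustrations.

For the torch.linalg.solve batching question, the following minimal sketch shows the two interpretations being discussed; the shapes are chosen only for illustration:

```python
import torch

A = torch.randn(2, 3, 3)       # a batch of two 3x3 systems

B_vec = torch.randn(2, 3)      # interpreted as a batch of vectors (one per system)
B_mat = torch.randn(2, 3, 4)   # interpreted as a batch of matrices

print(torch.linalg.solve(A, B_vec).shape)  # torch.Size([2, 3])
print(torch.linalg.solve(A, B_mat).shape)  # torch.Size([2, 3, 4])
```

For the bf16 activation mismatch, this sketch reproduces in plain eager code the two computation orders described in the issue; it does not invoke a Triton kernel, it only emulates the upcast-then-downcast sequence the compiled path is reported to use:

```python
import torch

x = torch.randn(1024, dtype=torch.bfloat16)

eager_style = torch.sigmoid(x)                                # computed directly in bf16
compiled_style = torch.sigmoid(x.float()).to(torch.bfloat16)  # fp32 compute, then downcast

# The two results may differ slightly, which is the discrepancy reported above.
print((eager_style - compiled_style).abs().max())
```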
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and includes a code snippet demonstrating the error triggered by compiling parts of a pipeline with torch.compile.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing the MaxPool2D operation when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, potentially yielding a significant speedup in performance.
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted `/tmp` directory set to permission mode `1777`. Although the model compiles successfully, execution fails with an error indicating that the shared object `cuda_utils.so` cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase by removing approximately 1,500 files currently excluded from UFMT and applying consistent formatting to them. It outlines the process for updating the `.lintrunner.toml` configuration, running the formatter, handling known edge cases that require preparatory fixes, and organizing the work by directory to facilitate manageable and reviewable pull requests.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function that allows users to exclude debug files, such as `.debug_pkl`, from the JIT archive to reduce file size. The motivation stems from observations that these debug files significantly increase the archive size without affecting model correctness, which is particularly important for deploying smaller models on mobile devices.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 575
Summarized Issues:
- Inductor Compilation and Backend Issues: Multiple issues report failures and bugs in PyTorch's Inductor compiler related to assertion errors, device mixing in kernel fusion, thread safety, combo kernel crashes, and precision mismatches. These problems cause compilation failures, runtime errors, and numeric divergences especially when using features like combo kernels, multi-threaded compilation, and bfloat16 precision.
- Documentation and Usability Improvements: There are requests for clearer documentation on batching and broadcasting rules in `torch.linalg.solve`, better warning stack traces in Dynamo, and a leaner documentation webpage layout. These aim to reduce user confusion and improve debugging and interface simplicity.
- TorchDynamo and TorchCompile Errors: Several issues highlight errors and limitations in TorchDynamo and torch.compile, including incorrect hashability errors for user-defined classes, excessive guard generation, inability to capture certain operations, and problems with dynamic attribute access and nested views. These cause unexpected exceptions and recompilation issues.
- Distributed Training and Communication Failures: Issues report hangs in multi-node distributed training, failures in distributed CUDA convolution tests, incorrect reduce_scatter_tensor results on single rank, and failing elastic distributed tests undetected by CI. These problems affect scalability and correctness in distributed environments.
- Test Failures and ROCm Skips: A large number of tests across various modules and platforms are failing or skipped on ROCm, including inductor, distributed, dynamo, nn, cuda, linalg, sparse, and many others. These skips and failures indicate significant ROCm compatibility and stability issues affecting test coverage and reliability.
- issues/168399 through issues/168885
- Error Messages and Debugging Enhancements: Issues report misleading or overly long error messages, such as device mismatch errors in conv2d, graph break messages in dynamo, and missing stack traces in warnings. These hinder effective debugging and user understanding.
- Build and Packaging Problems: Problems include silent dependency failures in build scripts, missing shared libraries in manywheel packages, and ROCm manywheel build linker errors. These cause build failures and packaging issues that affect installation and deployment.
- Memory and Performance Issues: Reports include performance regressions on NVIDIA H200 GPUs with bfloat16, slow FSDP root pre-forward on large MoE models, and memory inefficiencies due to delayed CUDA event recording. These impact training speed and memory usage.
- ONNX Export and Model Conversion Bugs: Issues describe failures and incorrect outputs when exporting models to ONNX using Dynamo-based exporters, including name conflicts and unsupported input shapes. These cause broken or inaccurate exported models.
- CUDA and ROCm Compatibility Issues: Problems include default CPU-only installs on aarch64, ROCm build failures due to CUDA version mismatches, and missing multi-architecture support in Docker images. These limit hardware and platform compatibility.
- Test Infrastructure and CI Problems: Some tests fail silently without CI detection, and environment variables cause tests to pass incorrectly. These issues reduce test reliability and visibility of regressions.
- Miscellaneous Bugs and Feature Requests: Includes issues like hash caching conflicts with Python 3.14, infinite recursion in tracing torch._ops.py, requests for new APIs like Tensor.record_event, and support for dataclasses in tracing and compilation. These affect correctness, usability, and feature completeness.
- Installation and Platform-Specific Issues: Problems installing PyTorch on Python 3.13 macOS and lint failures on MacBook Air M4 due to unknown NEON types are reported, affecting developer experience and platform support.
- CUDA Graphs and Control Flow Support: A feature request proposes adding support for capturing control flow with `torch.cond()` in CUDA Graphs to enable efficient conditional execution on CUDA devices, addressing challenges with eager mode and autograd compatibility.
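For context on the torch.cond() request above, here is a minimal sketch of how torch.cond expresses data-dependent control flow today (outside CUDA Graph capture); the predicate, branch functions, and shapes are illustrative only:

```python
import torch

def true_fn(x):
    return torch.sin(x)

def false_fn(x):
    return torch.cos(x)

@torch.compile
def f(pred, x):
    # torch.cond traces both branches and selects one at runtime based on pred.
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4)
print(f(torch.tensor(True), x))
```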
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 27
Summarized Issues:
- Segmentation faults and crashes on specific hardware or backends: Multiple issues report crashes or segmentation faults occurring under specific conditions such as using Apple’s MPS backend with sliced tensors, moving tensors created with `from_blob` to GPU on Jetson Nano, and mismatched input shapes in Inductor compilation. These faults highlight stability problems tied to hardware-specific or backend-specific operations in PyTorch.
- issues/167924, issues/168151, issues/168210
- DTensor sharding and propagation bugs: Several issues describe problems in DTensor’s sharding propagation and caching mechanisms, including assertion errors when handling tensor parameters in kwargs, incorrect local tensor slices due to uneven strided sharding, and failure to cache shard propagation results for list-based operations. These bugs cause incorrect data distribution and inefficient repeated computations in distributed tensor operations.
- issues/167977, issues/168134, issues/168255
- torch.compile and Inductor compilation inconsistencies: Issues report that `torch.compile` sometimes produces different output shapes or types compared to eager execution, causes unnecessary recompilations due to backend function recreation, and lacks features like ignoring cache loading on first compilation. Additionally, Inductor backend bugs include missing CPPScheduling support causing assertion failures and incorrect results when combining Triton kernel outputs with `.cpu()`. These problems affect compilation correctness, performance, and usability.
- issues/167927, issues/168239, issues/168251, issues/168278, issues/168303, issues/168373, issues/168181
- Incorrect behavior or bugs in PyTorch functions and modules: Several issues highlight incorrect behavior such as `EmbeddingBag` not including the last offset when expected, `torch.autograd.grad` setting `requires_grad=True` incorrectly for unused inputs, `torch.nn.functional.pad` raising errors on zero-sized dimensions, and `torch.export.export` producing mismatched exported models. These bugs cause unexpected outputs or errors in common PyTorch operations.
- issues/167974, issues/168059, issues/168071, issues/168240
- Build, packaging, and environment issues: Some issues describe problems related to build and packaging such as missing openBLAS library in AArch64 wheels due to test environment masking, incompatibility of a CI script with certain Python environment variable formats, and a Jupyter kernel restart caused by OpenMP runtime conflicts. These issues affect the reliability of PyTorch builds and runtime environments.
- issues/168000, issues/168076, issues/167980
- Documentation and test suite problems: There are reports of missing tutorial steps in official documentation and disabling of consistently failing tests in the test suite, indicating gaps in documentation completeness and test stability.
- issues/168238, issues/168346
- Type annotation and code correctness improvements: One issue addresses narrowing a return type annotation in the VariableTracker code to better reflect actual return values, improving code clarity and correctness.
- issues/167982
- External support and compatibility clarifications: One issue clarifies the official ROCm version supported by PyTorch 2.9.1, noting that ROCm 7 support is only in nightly builds while stable releases align with ROCm 6.4.
- issues/168268
- Unexplained or minimal description issues: One issue references a problem in the PyTorch C++ extension utility without further details, indicating a lack of information for diagnosis.
- issues/168388
- GitHub infrastructure related failures: One issue reports CI test failures caused by GitHub server-side 500 errors, which were temporary and resolved by GitHub.
- issues/168106
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 202
Key Open Pull Requests
1. Triton backward convolution kernel: This pull request adds Triton-based backward convolution kernels with layout optimization to the PyTorch inductor backend, including support for conv2d backward input and weight templates with autotuning and consistent NHWC layout handling.
- URL: pull/168338
- Merged: No
- Associated Commits: d94ea, cdd7a, abebb, b1940, 7bab7, 7173a, 48398, 6926f, c236b, 1b8f4, b23bf, c632e, 4cc43, 89b09, 3b87b, fb027, 644fd, 8c7db, bf727, 3a8e6, f63de, 1f612, 46443, 12a6d, 4268b, 84210, d80af, b04d8, 8d218, d29e4, 9b4f0, 5bed3, ecd43, 18a92, 6b27e, 1b84f, 64ca7, 697cd, 2b73f, e691e, 1a6c1, 71fa7, f2b3b, 60ddc, 5745d, 53a13, d10ff, c4b98, 85229, a3cd7, b766c, dfd39, f2ee3, 7ad8b, 79126, 5416d, 65695, c2cca, 8b6bc, 3b61d, 06c6a, 28ca4, 1cc51, a6321, 35f1e, 3f236, ef2b1, 89490, c7ff7, 0c236, 07391, 13417, cd603, 2dc4b, 24b0c, 99847, cd885, 20d62, dab81, 27e9c, 1a316, f7721, fa982, 3bfe0, c1423, 800aa, 378a5, 9ebc6, 5beaf, 8af99, b8d92, 0d98f, ab54c, 70518, 1d1c7, 92d32, 0073e, 6f2f4, a1599, bdec1, e8f8a, ff4dd, 4c731, 4a815, 1ae99, 306ba, 769d5, 94173, 62ea9, 790cc, 12141, e2d14, b0a22
2. Support AC materializing in forward+loss+bwd graph: This pull request introduces an optimization pass called rematerialize_ac_nodes() that reduces memory usage during inference-mode compilation by reordering activation checkpoint (AC) nodes in a single forward+loss+backward graph: AC nodes used only in backward are deferred to the backward region, nodes used in both forward and backward are duplicated so they are recomputed just-in-time, and checkpoint inputs are reused. Compared to the naive approach, this significantly lowers peak memory consumption while maintaining execution order; the pass is controlled by a config flag and demonstrated on synthetic large models.
- URL: pull/168082
- Merged: No
- Associated Commits: b424c, b052f, 609b3, 81e27, 4b1d5, 7d1cd, 741bf, b025b, 423ed, bd095, 2aaba, b1c78, 26814, e30fb, 7eb36, d9978, 4db67, bc15b, 6aaf8, 16716, 423ac, b967d, 910e3, 81ee1
3. [user-streams] Handle the record_stream problem: This pull request implements a mechanism to properly synchronize and delay the deallocation of tensors allocated on one CUDA stream but last used on a side stream during the backward pass by estimating stream runtimes, recording events on the side stream, and inserting a synchronized deallocation operation on the allocating stream to ensure memory is not prematurely reused.
- URL: pull/168061
- Merged: No
- Associated Commits: e76e4, ee00a, e5207, 7f954, 00296, ff26e, e93d7, 90bd9, e6832, 34a8a, 25632, d50a5, 1457f, b04b8, 4ad26
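For background on the record_stream problem named in the third pull request above, the following minimal eager-mode sketch shows the cross-stream lifetime issue the PR targets; it uses the existing Tensor.record_stream API for illustration and is not the mechanism the PR itself implements:

```python
import torch

# A tensor allocated on the current stream but last used on a side stream must not
# have its memory reused by the caching allocator until the side-stream work is done.
side = torch.cuda.Stream()

x = torch.randn(1 << 20, device="cuda")   # allocated on the current (default) stream
with torch.cuda.stream(side):
    y = x * 2                             # last use of x happens on the side stream
    x.record_stream(side)                 # inform the allocator of that use

torch.cuda.synchronize()
```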
Other Open Pull Requests
- Inductor combo_kernels enablement and stability: Multiple pull requests focus on enabling the combo_kernels feature by default in the Inductor backend, improving its reliability and fixing crashes related to GroupedSchedulerNode. These changes include strengthening tests and addressing Triton compilation failures to ensure stable and optimized kernel fusion.
- Inductor fusion optimizations: A pull request introduces a fusion optimization that replaces ReLU or GELU activations following addmm operations with a combined `_addmm_activation` call, simplifying pattern matching and leveraging extended cuBLASLt support to improve performance across tensor sizes and data types.
- DTensor compilation and sharding improvements: Several pull requests enhance DTensor compilation by supporting uneven sharding for unbacked matrix multiplications, introducing zero-cost strategies for no-redistribute cases, and improving unbacked symbol allocation. These changes also include proxy tracking during shard propagation and replacing `torch.chunk()` with more unbacked-friendly collectives to support strategy selection and testing on 2D meshes.
- Memory and event management enhancements: Updates include adding a configurable `skip_actions` flag to filter memory snapshot events and reduce trace file size, refactoring `cuda::EventPool` to `CUDAEventPool` with pre-allocation support, and assigning streams to epilogue copies to avoid unnecessary synchronization.
- Build system and platform support improvements: Pull requests update the PyTorch build system by enabling sccache in the XPU official build to reduce build times, upgrading the C++ standard to C++20 with necessary code adjustments, integrating ROCm platform support in nightly builds, and fixing Intel compiler issues on Windows related to SYCL kernel assertions.
- Distributed and tensor operation optimizations: New functions and optimizations are introduced for distributed operations, including `reduce_scatter_tensor_out` for efficient reduce_scatter decomposition and using reduce_scatter_out in Inductor's decompose_mm to avoid concatenation and resolve dependency issues.
- Benchmarking and testing enhancements: Comprehensive benchmarks for various PyTorch optimizers are added to measure performance across parameter sizes, and tests are improved to verify FP8 rowwise-scaling correctness and prevent overflow issues, especially avoiding FP16 output types due to infinite values in tests.
- Code quality and performance improvements: A cached property decorator is applied to reduce guard build time significantly, and a benchmark compares original and expanded strategy generation approaches, showing trade-offs in pruning irrelevant strategies versus generating many more strategies with slowdowns for add operations.
- Scatter add operation optimization: A partitioned buffer approach replaces the index_put operation with a scatter algorithm that partitions operations and buffers to reduce atomic contention and improve performance, managing increased memory usage with heuristics and caps on tensor sizes.
- OpenReg CI refactoring: The OpenReg continuous integration setup is refactored to isolate tests to CPU environments across Linux, macOS, and Windows, removing them from global test sets and adding dedicated CI workflows to improve test independence and functionality.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 165
Key Closed Pull Requests
1. Gate PyObject preservation for torch.compile cudagraphs: This pull request proposes adding a mechanism to conditionally preserve Python objects associated with tensors in torch.compile cudagraphs by introducing TensorImpl::requires_pyobject_preservation(), improving thread safety and correctness for Inductor cudagraph workloads while keeping nogil safety opt-in, and ensuring consistent tensor wrapper behavior with detailed documentation of related regressions.
- URL: pull/167930
- Merged: No
- Associated Commits: 9f47d, b7cc8, 37451, a6baa, df413, 34e1f, a2d00, cc3fb, ed39f, e7029, 0545d, 57e2f, 7f2c7, 46c96, 3c27d, b9eef, 2b0e1, 7a5f2, c8fb8, 76f73, 7f62b, b7239, 282a4, da5df, ce513, f9703
2. [DTensor] compute shape and offset for arbitrary _StridedShard: This pull request aims to extend the computation of local shape and global offset in PyTorch's DTensor system to support arbitrary _StridedShard configurations, ensuring correct local shapes for DTensor views that use complex _StridedShard stacking beyond the previously supported cases.
- URL: pull/168146
- Merged: No
- Associated Commits: a23f7, d4700, 3fc0b, 6c0ff, 9058b, d195d, 6219a, 8dd07, c39a5, b8f1d, a3f7e, be746, 3fc07, 3fe04, 5887f, 1a1ee
3. [FlexFlash] Specify lowering w/ new BACKEND kernel option: This pull request proposes specifying the lowering process in the FlexFlash component by introducing a new BACKEND kernel option to align with naming conventions, although it was not merged.
- URL: pull/168017
- Merged: No
- Associated Commits: 7c765, e5da9, bce0b, 35612, d1e30, a3bf4, 1180c, abd6b, 92fd5, a6fa1, 2ef70, f17a0
Other Closed Pull Requests
- Profiler event warning: This pull request introduces a warning in the Profiler to inform users that events are cleared at the end of each cycle and that only current cycle events are reported by default, advising users to set `acc_events=True` to retain events across cycles, thereby improving transparency and preventing confusion in profiling analysis. This change helps users better understand the Profiler's behavior during performance monitoring.
- ROCm CI and test coverage improvements: These pull requests propose adding a nightly continuous integration workflow for PyTorch ROCm builds on the therock platform and expanding ROCm test coverage in the trunk.yml CI workflow by including the full list of unit tests, swapping labels between rocm-mi300.yml and periodic-rocm-mi300.yml workflows, and disabling the now-unnecessary trunk-rocm-mi300.yml shadow workflow. Additionally, a pull request fixes libtorch agnostic tests related to ROCm in the PyTorch CI to ensure compatibility with CUDA and ROCm environments.
- GELU fusion and decomposition optimizations: These pull requests propose moving the unconditional decomposition of the GELU activation function to the fx_passes/post_grad stage to enable better pattern capture and fusion opportunities, specifically for the GELU(Addmm) pattern, and mapping the `GELU(Addmm)` pattern to the `_addmm_activation` function to enable fusion of bias addition and activation operations in the Inductor backend. These changes aim to improve graph optimization and backend fusion efficiency.
- Distributed RNG correction in DTensor: This draft pull request aims to correct the distributed random number generation in DTensor by introducing parameters such as local_shape, global_offset, global_shape, and global_strides to ensure that random numbers generated across multiple devices are both different and consistent with those generated on a single device. This fix addresses consistency and uniqueness in distributed RNG scenarios.
- Dynamic shape mix order reduction: This pull request enables mix order reduction to function correctly with dynamic shapes in internal models such as RMS and layer normalization, which would otherwise bypass this optimization without the changes introduced. This enhancement allows more effective optimization in models with dynamic shapes.
- Compile time improvements via dependency precomputation: This pull request strictly improves compile time by precomputing dependencies between start, hiding nodes, and wait nodes to minimize repeated graph searches in the bucketing process. This optimization reduces redundant computations during compilation.
- Dynamo tree_map and pytree function specializations: These pull requests propose special case handling for the tree_map function in the dynamo module to achieve approximately a 20x speedup in related microbenchmarks, compile-time special case handling for torch.utils._pytree._get_node_type, and specializing the `tree_is_leaf` function within the Dynamo and PyTree components at compile time to improve performance. These changes collectively enhance the efficiency of tree and pytree operations in Dynamo (a small usage sketch of tree_map appears after this list).
- FlexFlash backwards wiring support: This pull request proposes adding wiring support for backwards functionality in the FlexFlash component but was not merged.
- linalg.norm update for NumPy compatibility: This pull request proposes updating the linalg.norm function in PyTorch to align its behavior with NumPy's handling of degenerate inputs, as referenced in the linked NumPy pull request. This update improves consistency between PyTorch and NumPy.
- Improved memory tracking and prefetch blocking: This pull request proposes an improved memory tracking mechanism that simulates memory usage across the entire model, records peak memory per compute node, accumulates prefetched memory, and blocks prefetching when it would exceed user-defined absolute or relative memory increase thresholds, thereby enabling more aggressive and precise memory management. This enhancement allows better control over memory consumption during model execution.
- MSVC C++20 compilation fixes: This pull request addresses compilation errors caused by dependent name disambiguation issues in C++20 when using MSVC by adding the `template` keyword where necessary and replacing pointer arithmetic with `static_cast` for safer type casting in the PyTorch backend code. These fixes improve compatibility with modern C++ standards.
- torch.Size hash fix for Python 3.14+: This pull request addresses an issue in PyTorch where the `hash` function for `torch.Size` objects behaves incorrectly on Python 3.14 and later due to changes in tuple hash caching, by resetting the hash cache before delegating to the CPython tuple hash implementation. This fix ensures correct hashing behavior on newer Python versions.
- aotriton update and gfx1103 GPU support: This pull request proposes updating the aotriton version to 0.11.1b and enabling support for the gfx1103 GPU architecture to allow the use of flash attention in certain situations. This update expands hardware compatibility for advanced attention mechanisms.
- ABI stable custom ops string support: This pull request proposes adding support for string data types in ABI stable custom operations within the PyTorch project. This addition broadens the data types supported in custom operations.
- MPS backend test precision adjustment: This pull request proposes removing the expected failure designation for a test related to the MPS backend while adjusting the test's precision tolerance to 1e-4. This change reflects improved test stability or accuracy.
- Stable ABI deprecation warnings fix and audio test addition: This pull request aims to fix stable ABI deprecation warnings related to to/from conversions and adds a my_shape test to reproduce a specific issue in the PyTorch audio project. These changes improve ABI stability and test coverage.
- MemoryFormat and Layout header-only attempt and error handling retention: This pull request attempts to move the MemoryFormat and Layout components to be header-only but ultimately abandons changing the error handling semantics of the stream extraction operator (>>) by retaining the use of TORCH_CHECK instead of STD_TORCH_CHECK, and removes the related operator code to preserve desired traceback and error information.
- Inductor benchmark deterministic test fix (unmerged): This pull request addresses the issue of a skipped deterministic unit test in the inductor benchmark by locating the necessary benchmark scripts for the r2r determinism test, although it was ultimately not merged.
- Estimator utility functions modularization: This pull request proposes moving certain estimator utility functions out of the distributed module in the user-streams component to improve code organization and modularity.
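As context for the tree_map specialization noted above, here is a minimal sketch of the torch.utils._pytree.tree_map API whose Dynamo handling those pull requests speed up; the nested structure is illustrative only:

```python
import torch
from torch.utils._pytree import tree_map

# tree_map applies a function to every leaf of an arbitrarily nested container.
nested = {"weights": torch.randn(2, 2), "stats": [torch.zeros(3), torch.ones(4)]}
doubled = tree_map(lambda t: t * 2, nested)

print(doubled["stats"][1])  # tensor([2., 2., 2., 2.])
```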
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| jeffdaily | 35 | 1 | 494 | 3 |
| malfet | 130 | 22 | 3 | 87 |
| cyyever | 174 | 36 | 0 | 15 |
| ezyang | 74 | 15 | 22 | 83 |
| williamwen42 | 127 | 17 | 11 | 34 |
| Skylion007 | 7 | 2 | 2 | 167 |
| guangyey | 119 | 16 | 2 | 38 |
| mlazos | 146 | 13 | 2 | 10 |
| mikaylagawarecki | 115 | 23 | 0 | 26 |
| pianpwk | 120 | 20 | 0 | 19 |