Weekly Project News

Archives

Weekly GitHub Report for Pytorch: March 16, 2026 - March 23, 2026 (19:48:52)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, a backward-incompatible security improvement changing the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [MODULE: PERFORMANCE] [TRIAGED] [ENHANCEMENT] [ONCALL: PT2] [VLLM-COMPILE] [MODULE: VLLM] [BOT-TRIAGED] [BOT-MISLABELED] Significant runtime overhead for standalone_compile: This issue addresses significant runtime overhead observed in the standalone_compile process within the vLLM serving framework, particularly focusing on the CPU-side work before the first kernel call in the Inductor output code. The discussion explores profiling results, identifies costly operations such as assert_size_stride and copy_misaligned_inputs, and considers potential optimizations including refactoring hot-path functions, reducing assertion overhead, and possibly disabling certain asserts during runtime serving to improve performance.

    • The comments primarily focus on diagnosing the overhead sources, proposing and testing optimizations like a new assert_size_stride_grouped function that reduces launch overhead, debating when to enable assertions, and suggesting longer-term improvements such as codegen for Python bindings and tensor arena allocators; overall, the interaction is collaborative with shared profiling insights and incremental solution development.
    • Number of comments this week: 8
  2. [MODULE: CRASH] [MODULE: MEMORY USAGE] [TRIAGED] [MODULE: XPU] [BOT-TRIAGED] XPU: Fatal Level Zero RuntimeErrors instead of recoverable OutOfMemoryError: This issue addresses a problem in PyTorch's XPU backend where exceeding physical memory triggers fatal Level Zero runtime errors instead of recoverable out-of-memory exceptions, causing the entire process to crash. The reporter suggests a workaround by setting the per-process memory fraction to 1.0 to prevent crashes and proposes integrating this into the default runtime, while discussion in the comments reveals that the issue is recognized as a driver bug and a proper fix is underway from the driver team.

    • The comments confirm the problem is a driver-level bug affecting certain Intel GPUs, with the workaround being effective but temporary; testing on other GPUs did not reproduce the issue, and the driver team has since fixed the problem in a recent build, with ongoing confirmation about the specific driver release containing the fix.
    • Number of comments this week: 8
  3. [TRIAGED] [MODULE: FLOP COUNTER] [MODULE: SDPA] [BOT-TRIAGED] [2.10] [repro] torch.ops.aten._flash_attention_forward FlopCounterMode uses wrong permute, gets count wrong: This issue reports that the flop counting for torch.ops.aten._flash_attention_forward is incorrect because it assumes the input tensor dimensions are ordered as (batch, heads, seq, dim), whereas the actual required order is (batch, seq, heads, dim). This discrepancy causes the flop counts to differ significantly from those of the SDPA implementation despite producing allclose outputs and gradients, and the issue includes a repro and discussion of a custom fix involving permuting tensor dimensions in the flop counting functions to achieve parity.

    • The comments provide a detailed fix by overriding the flop counting functions with custom versions that permute tensor dimensions appropriately, discuss edge cases like variable length sequences and grouped query attention (GQA), and consider integration and API design implications, ultimately achieving matching flop counts between the two attention implementations.
    • Number of comments this week: 8
  4. [TRIAGE REVIEW] [MODULE: MACOS] [MODULE: ADVANCED INDEXING] [MODULE: MPS] [BOT-TRIAGED] [BOT-MISLABELED] [MPS] Intermittent indexing out of bounds AcceleratorError: This issue reports an intermittent indexing out of bounds error occurring on the MPS backend when performing advanced indexing operations, which does not reproduce on CPU or CUDA. The problem appears to be related to a bug in the at::count_nonzero function and has been difficult to root cause due to its sporadic nature, with reproductions varying in frequency across different Apple Silicon devices.

    • The comments confirm the issue is reproducible but rare and varies by system load, with attempts to bisect the problem revealing it existed silently before a recent PR made it visible; further investigation points to a bug in at::count_nonzero, and the reporter is working towards a fix while others offer support and discuss potential causes.
    • Number of comments this week: 8
  5. [ONCALL: DISTRIBUTED] [BOT-TRIAGED] DDP init_sync=True does not sync buffers when broadcast_buffers=False: This issue reports that when using Distributed Data Parallel (DDP) with init_sync=True and broadcast_buffers=False, buffers are not synchronized during initialization as the documentation suggests they should be. This discrepancy causes problems for models with frozen buffers that need to be synced once at initialization but not during training, and the current implementation does not allow this behavior.

    • The comments confirm the behavior is inconsistent with the documentation and discuss whether to fix the code or update the docs to clarify the behavior. A proposal to add a new argument to handle buffer syncing more explicitly was well received, and a contributor volunteered to work on the fix.
    • Number of comments this week: 7
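The dimension-order bug in item 3 above comes down to simple arithmetic. Below is a pure-Python sketch (not PyTorch's actual FlopCounterMode implementation; the shapes are illustrative) of how misreading flash-attention's (batch, seq, heads, dim) layout as (batch, heads, seq, dim) skews the count:

```python
# Toy FLOP count for scaled dot-product attention: Q @ K^T plus attn @ V,
# each contributing 2 * b * h * s * s * d multiply-adds.
def sdpa_flops(b, h, s, d):
    return 2 * b * h * s * s * d + 2 * b * h * s * s * d

b, s, h, d = 4, 1024, 16, 64       # flash-attention layout: (batch, seq, heads, dim)
correct = sdpa_flops(b, h, s, d)   # permute to (batch, heads, seq, dim) first
wrong = sdpa_flops(b, s, h, d)     # dims consumed positionally: seq/heads swapped
print(correct, wrong)              # the two counts differ by a factor of s / h
```

With seq=1024 and heads=16, the swapped reading is off by a factor of 64, which is why the issue's repro shows counts that differ "significantly" despite allclose outputs; the exact factor depends on the real model shapes.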

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 99

Summarized Issues:

  • torch.compile numerical stability and correctness issues: Multiple issues report that using torch.compile with the Inductor backend causes severe numerical stability problems, including NaN and Inf outputs on valid float32 inputs, discrepancies in floating-point accumulation order, and inconsistent results compared to eager execution. These problems affect layers such as Conv2d, LayerNorm, BatchNorm2d, and GELU fusion, leading to catastrophic error propagation and incorrect outputs.
  • [issues/177657, issues/178047, issues/178055, issues/178084, issues/178096, issues/178134]
  • torch.compile dtype and input validation inconsistencies: Several issues highlight that torch.compile silently allows operations with mismatched or invalid dtypes, such as implicit promotion in torch.bmm and torch.matmul with mixed float16 and float32 inputs, or accepting float tensor indices in nn.Embedding. Additionally, torch.compile permits invalid argument types like nn.Parameter as fill_value in torch.full(), causing behavioral inconsistencies with eager mode.
  • [issues/177629, issues/177630, issues/178042, issues/178046]
  • torch.compile crashes and compilation failures: There are multiple reports of crashes or compilation failures when using torch.compile with the Inductor backend, including errors triggered by index_put and sort operations, stride inference mismatches in attention outputs, failure to trace builtin Python operators, and missing member functions for certain data types. These issues cause runtime errors or prevent successful compilation of valid models.
  • [issues/177631, issues/178039, issues/178041, issues/178095]
  • Dynamo tracing and mutation tracking bugs: Issues in PyTorch Dynamo include incorrect handling of f-string evaluation with mutable objects, double mutation tracking causing crashes when autotuning Triton kernels, incomplete tracing of exception-object interfaces leading to broken graphs, and import failures due to unrecognized module subclasses. These bugs affect program correctness and graph stability.
  • [issues/177582, issues/177600, issues/177633, issues/177682]
  • Distributed and multi-stream execution problems: Problems arise in distributed training and multi-stream CUDA execution, such as failure to trace stream and event information causing runtime errors, fatal Level Zero errors on XPU when out of memory, and hangs in distributed kernel autotuning with nvshmem backend. These issues impact stability and correctness in distributed and multi-stream contexts.
  • [issues/177691, issues/177714, issues/177910]
  • DTensor and sharding validation bugs: DTensor-related issues include an outdated cost penalty favoring all_gather over all_to_all communication despite updated support, a dead code path caused by incorrect type comparison in propagate_shape_and_sharding, and a failing test due to dynamo graph breaks involving backward passes with generators. These bugs affect distributed tensor sharding and compilation correctness.
  • [issues/177663, issues/177751, issues/177972]
  • ONNX and quantization graph optimization errors: The ONNX program optimizer incorrectly folds DequantizeLinear nodes during quantization-aware training model optimization, resulting in loss of quantization nodes and abnormal exported graphs. Additionally, the TorchScript ONNX exporter inserts unnecessary NaN-sanitizing operations after Softmax, degrading backend fusion and inference performance.
  • [issues/177611, issues/177892]
  • Memory safety and allocator crashes: Several bugs report crashes due to invalid inputs or allocator errors, including segmentation faults in torch.lu_unpack with empty pivot tensors, assertion failures in CUDA memory allocator with negative allocation sizes, and memory corruption in sparse diagonal matrix operations due to missing bounds checks.
  • [issues/177827, issues/177829, issues/178089]
  • Random number generation and reproducibility discrepancies: There is an inconsistency in random number generator states between torch.compile with Inductor and eager execution on non-contiguous tensors, leading to different outputs despite fixed seeds. This affects reproducibility of stochastic computations.
  • [issues/177652]
  • Performance overhead and runtime assertion costs: Significant runtime overhead is observed in standalone compilation during model serving, with suggestions to reduce costly assertions and refactor hot-path functions. Additionally, disabling Inductor's runtime correctness assertions during inference is proposed to improve performance by avoiding unnecessary checks.
  • [issues/177655, issues/177719]
  • Test suite failures and disabled tests: Multiple tests are failing or disabled on the main branch, including fusion tests, serialization tests, and DTensor requires_grad tests, indicating ongoing stability and regression issues in the test infrastructure.
  • [issues/177653, issues/177751, issues/178077]
  • CUDA and device-specific runtime errors: Issues include CUDA driver errors during backward passes after FFT and subtraction, illegal memory access on MPS devices with padded matrices, floating point exceptions on CUDA torch.dot calls, and CUDA caching allocator conflicts causing incorrect all-reduce results. These device-specific errors cause crashes or incorrect computations.
  • [issues/177561, issues/178056, issues/178038, issues/178138]
  • API and internal function bugs: Bugs in internal PyTorch APIs include incorrect learning rate scheduler validation allowing zero or negative step sizes, incorrect output padding validation in convolution transpose meta implementations, and bugs in convolution backward meta functions using wrong inputs for memory format determination. These cause runtime errors or inconsistent behavior.
  • [issues/177833, issues/178125, issues/178092]
  • Compilation and build issues with specific compilers and architectures: Compilation failures occur with Apple Clang 17 due to ambiguous return types in pybind11 lambdas, and ARM Neon vectorizer errors arise from missing conversions during code generation, hindering builds on certain platforms.
  • [issues/178044, issues/178136]
  • Miscellaneous bugs in tensor operations and utilities: Other issues include silent ignoring of size= in torch.Tensor.expand when importing functorch.dim, misleading error messages in sparse matrix multiplication, unexpected behavior differences between NumPy and PyTorch flatten functions, and repeated print outputs on GPU calculations.
  • [issues/177654, issues/177951, issues/177957, issues/178049]
  • Proposals for new features and API improvements: Suggestions include adding a new residual connection module with block-level attention, support for Predictive Coding as an alternative to Backpropagation, a cross-backend API for current device name, kernel-level profiling support for PrivateUse1 backends, and a deterministic CTC Loss implementation to improve reproducibility.
  • [issues/177537, issues/177623, issues/177935, issues/177978, issues/178052]
  • Git and tooling enhancements: Proposals to add PreToolUse hooks to prevent bypassing git hooks by agents and to create a bot notifying maintainers when CI pin updater bot fails to merge updates aim to improve development workflow and code quality enforcement.
  • [issues/177880, issues/177889, issues/177896]
  • Profiling and debugging improvements: Requests to expose C++ KinetoEvent methods to Python for better profiler data access and to replace Python object-based stream/event tracking with opaque objects for Ahead-Of-Time Interpreter support seek to enhance debugging and profiling capabilities.
  • [issues/178087, issues/178132]
  • Distributed group and process group API limitations: The lack of an API to return all replica groups for a process group at compile time breaks SPMD compilation and causes inefficient multiple compilations in distributed training backends, motivating a proposal for a new function to address this limitation.
  • [issues/177815]
  • Error handling and message clarity improvements: Several issues call for clearer error messages, such as improving the error when dense tensors are passed to sparse matrix functions, and better handling of invalid inputs to prevent silent failures or misleading errors.
  • [issues/177951, issues/177829]
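The accumulation-order discrepancies in the first bullet group above have a simple scalar analogue. This hedged sketch uses plain Python floats (no torch) to show why a backend that reorders a reduction can diverge from eager execution:

```python
# Floating-point addition is not associative: reordering an accumulation,
# as a fusing compiler backend may do, changes the rounded result.
vals = [1e16, 1.0, -1e16, 1.0]

eager_order = 0.0
for v in vals:
    eager_order += v                # the first 1.0 is absorbed next to 1e16

reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])  # large terms cancel first
print(eager_order, reordered)
```

Left-to-right summation loses the first 1.0 (it is below the spacing of doubles near 1e16) and yields 1.0, while the reordered sum yields 2.0. Neither is "wrong" in IEEE-754 terms, which is what makes such discrepancies silent.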

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 45

Summarized Issues:

  • torch.compile numerical and correctness issues: Multiple issues report numerical discrepancies and incorrect outputs when using torch.compile with various backends and configurations. These include large max absolute errors in fused operations, incorrect output shapes with dynamic keys, failure to recompile on mutated class attributes, and discrepancies in float16 floor_divide on CUDA, all leading to silent correctness errors or unstable model behavior.
  • [issues/175952, issues/176851, issues/176862, issues/177740, issues/178094]
  • torch.compile crashes and backend errors: Several crashes occur during torch.compile usage due to bugs in compiler assumptions or backend codegen. These include heap corruption from buffer aliasing in CPU inductor, AttributeError from Triton kernel dtype handling, and stride inference errors causing view failures in transformer models with index_put_.
  • [issues/176711, issues/176712, issues/178043]
  • Inductor backend and dynamic shape bugs: The Inductor backend exhibits bugs with dynamic shapes, such as incorrect backward code generation causing AssertionErrors, and errors when using unbacked 0-D SymInt in repeat_interleave. These issues break expected dynamic shape support and cause runtime failures.
  • [issues/176910, issues/177252]
  • Test suite failures and disabled tests on XPU platform: Multiple tests are disabled due to consistent failures on the XPU platform, affecting tests like test_div7_dynamic_shapes_xpu, test_adam_ema_update_scalar_precision_beta2_0_9999, test_additive_rnumel, test_comprehensive_masked_cumprod_xpu_float16, and test_max_pool2d_with_indices_backward5.
  • [issues/176134, issues/176391, issues/177218, issues/177483, issues/178124]
  • FSDP and nn.utils.parametrize incompatibility: Using nn.utils.parametrize dynamically during forward pass with FullyShardedDataParallel causes assertion and runtime errors related to parameter presence checks and in-place view modifications, indicating a fundamental incompatibility between these features.
  • [issues/177017]
  • ONNX export and opset issues: Exporting models with dynamo ONNX exporter has bugs such as incorrect cubic_coeff_a attribute in Resize nodes for bicubic interpolation with antialiasing, and requests to make the default ONNX opset constant publicly accessible for easier downstream use.
  • [issues/177138, issues/178053]
  • Platform-specific binary and compatibility concerns: macOS wheel files claim compatibility with macOS 11.0+ but contain binaries targeting macOS 14.0, raising questions about compatibility accuracy.
  • [issues/177140]
  • Autograd and differentiation bugs: Bugs in autograd include silent incorrect zero results from torch.func transforms under torch.inference_mode(), and incorrect backward gradients for complex64/128 tensors on MPS backend causing silent divergence in training.
  • [issues/177318, issues/177734]
  • Memory and resource management issues: Out-of-memory errors occur during benchmarks traced to specific commits, and CUDA OOM errors happen on RTX 3090 due to fragmented memory despite free memory availability.
  • [issues/177426, issues/178057]
  • Build and CI failures: Build failures arise from conditional compilation issues with ROCm and flash attention flags, and CUTLASS CI jobs fail due to uninitialized third_party directories blocking PRs.
  • [issues/177485, issues/177945]
  • Race conditions and flaky tests: Flaky hanging tests in inductor FP8 test suite on sm89 GPUs are caused by race conditions introduced by test instantiation changes, mitigated by environment variable settings.
  • [issues/177651]
  • DTensor dispatch regression: Moving DTensor dispatch logic from Python to C++ breaks subclass overrides of __torch_dispatch__, causing all subclass instances to be dispatched through the base DTensor dispatcher and ignoring subclass logic.
  • [issues/177716]
  • Attention and mask handling bugs: cuDNN backend for scaled dot product attention ignores custom attention masks causing unexpected outputs, and meta kernel guards break strict export for dynamic sequence lengths on MPS.
  • [issues/177842, issues/177603]
  • Parameter creation and type bugs in FSDP: Creating nn.Parameter with non-floating-point tensors like uint8 crashes due to default requires_grad=True, requiring explicit requires_grad setting to fix.
  • [issues/177844]
  • Loop and functorch transform bugs: torch.while_loop fails with UncapturedHigherOrderOpError when combining scalar and tensor predicates with AND, and functorch's popDynamicLayerStackToDepth incorrectly pops two layers per iteration causing stack corruption.
  • [issues/177517, issues/177581]
  • AOT compilation serialization bug: Ahead-Of-Time compilation fails due to missing __builtins_dict__ in serialized global scope, causing runtime errors when deserializing functions using built-in objects.
  • [issues/177556]
  • Docker and SSL issues: Docker image builds fail installing torchvision from PyTorch wheels due to SSL certificate verification errors despite no Dockerfile changes.
  • [issues/177638]
  • Inference warm-up and performance optimization: Users seek advice on optimizing inference warm-up for compiled models with dynamic input lengths, and a proposal suggests replacing a tracing state call to improve module call performance by ~15%.
  • [issues/177634, issues/178070]
  • NCCL communication errors: NCCL 2.29.7 causes ncclDevCommCreate failures due to GIN barrier changes, with fixes involving shifting devComm creation responsibility to Op developers for compatibility.
  • [issues/177379]
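The torch.while_loop failure above (combining scalar and tensor predicates with AND) reflects a general tracing constraint: Python's `and` forces a call to `__bool__`, which a tracer cannot capture, while `&` remains overloadable. A toy stand-in, not torch's actual machinery:

```python
class TracedPred:
    """Toy symbolic predicate: its truth value is unknown at trace time."""
    def __bool__(self):
        raise RuntimeError("data-dependent branch: traced value has no concrete bool")
    def __and__(self, other):
        return TracedPred()          # elementwise AND stays symbolic

p, q = TracedPred(), TracedPred()
combined = p & q                     # fine: never asks for a concrete bool
try:
    p and q                          # Python `and` calls __bool__ on p
except RuntimeError as exc:
    failure = str(exc)
```

This is why mixed-predicate conditions inside traced control-flow ops typically need `&` (or torch.logical_and) rather than the short-circuiting keyword.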

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 289

Key Open Pull Requests

1. [Native DSL] Add De-registration logic: This pull request introduces backend changes to add de-registration logic for operator overrides in the native DSL, enabling safe and non-breaking disabling of overrides (such as globally disabling Triton at runtime) by storing graphs of (operator, dispatch_key) overrides to selectively remove and rebuild only necessary graphs, implementing filtered re-enabling to avoid reactivating just-disabled operators, and includes tests to verify the expanded registry functionality.

  • URL: pull/177550
  • Associated Commits: a1096, ce547, fd270, 9be2c, 6bbf5, b614d, 74638, e6260, 3cc54, d24b2, 90470, dff01, 56fdf, 0b038, 87a8c, f315f, e7e18, 211bf, c39bc, 95678

2. [wip][dynamo] Implement generic_richcompare unified comparison protocol: This pull request implements a generic_richcompare function that unifies the comparison protocol by fully adopting CPython's PyObject_RichCompare algorithm—including subclass priority, forward and reflected operations, and identity fallback for equality checks—and integrates this into VariableTracker to correctly handle comparison operators, while also fixing several related bugs in variable comparison behavior.

  • URL: pull/177943
  • Associated Commits: ec6e2, e771f, 8b088, 6bf98, b8d7c, a6ce4, c791f, 845f5, 419a0, b0a99, 524c4, 47e4c, 0998a, 56137, af271, a221a, 348d7, 2aa05, 73dca, c8404

3. Fix Inductor bmm mixed-dtype error handling: This pull request fixes the Inductor aten.bmm lowering to enforce strict same-dtype validation consistent with eager bmm and batched matmul, preventing silent normalization of mixed low-precision inputs by adding explicit dtype checks and regression tests to ensure correct error handling and maintain consistent behavior across related matrix multiplication operations.

  • URL: pull/177858
  • Associated Commits: e8365, 6ae49, b16bb, 53dda, 4139e, c7996, c7e41, 6c563, 9bb8c, e562b, 7a616, a4791, 94f47, 94232, 0608c, 8914c
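Key pull request 2 above mirrors CPython's rich-comparison protocol inside Dynamo. A simplified pure-Python sketch of that algorithm for `==` (subclass priority, forward and reflected operations, identity fallback), under the assumption that methods are looked up on the type as CPython does; it omits CPython's check that the subclass actually overrides the method:

```python
def rich_eq(a, b):
    """Simplified mirror of CPython's PyObject_RichCompare for ==."""
    a_t, b_t = type(a), type(b)
    tried_reflected = False
    if a_t is not b_t and issubclass(b_t, a_t):
        tried_reflected = True                 # subclass priority: reflected op first
        r = b_t.__eq__(b, a)
        if r is not NotImplemented:
            return r
    r = a_t.__eq__(a, b)                       # forward operation
    if r is not NotImplemented:
        return r
    if not tried_reflected:
        r = b_t.__eq__(b, a)                   # reflected operation
        if r is not NotImplemented:
            return r
    return a is b                              # identity fallback for equality
```

A tracer that reimplements this ordering (rather than special-casing `==`) handles subclass overrides and NotImplemented chains the same way the interpreter would.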

Other Open Pull Requests

  • Backend support and integration improvements: Multiple pull requests enhance backend capabilities by adding hipDNN support for batch normalization on ROCm, sparse CSR tensor format and unary operators on MPS, and improving sparse semi-structured matrix multiplication caching for cuSPARSELt. These changes introduce new ops, dispatch registrations, and performance optimizations across different hardware and backend systems.
    • pull/177534, pull/177757, pull/177866
  • Inductor and XPU backend enhancements: Pull requests add an Inductor pattern matcher for fused Llama MLP operations on Intel XPU devices and consolidate multiple commits improving the XPU sycl-tla backend, including GEMM kernel generation, scheduling, and build improvements. These updates enable fused kernel execution and better compilation and caching support for XPU hardware.
    • pull/177612, pull/177539
  • Activation offloading improvements: Several pull requests introduce custom asynchronous CPU offloading operations, make the partitioner offload-aware to prevent recomputation failures, and remove obsolete stream operation code in favor of new offload/reload/wait operations. These changes streamline activation offloading, improve performance, and enhance correctness in offload scenarios.
    • pull/177621, pull/177627, pull/177628
  • Dynamo component refactoring and feature updates: Multiple pull requests simplify and refactor variable tracker classes, improve support for __slots__, remove redundant exception argument copies, and unify setattr paths. They also update representations of __dict__ attributes in various variable classes to reduce direct __dict__ handling, enhancing maintainability and correctness.
    • pull/177658, pull/177659, pull/177668, pull/177807, pull/177808, pull/177948, pull/177593, pull/177659
  • Profiling and testing infrastructure enhancements: A pull request adds a profiler Chrome trace validator enforcing correctness rules derived from production issues, while another introduces a script to run openreg backend tests with retries, timeouts, and JSON reporting for CI integration. These improvements increase reliability and observability of profiling and backend testing.
    • pull/177947, pull/177565
  • Build and CI workflow updates: Pull requests migrate Linux build and test jobs to OSDC ARC runners with new job variants and update the _linux-build GitHub Actions workflow to support OSDC ARC runner variants, removing EC2-specific steps and improving environment setup. These changes modernize and diversify CI infrastructure for better resource utilization.
    • pull/177992, pull/177953
  • Performance and algorithmic improvements: A CuteDSL implementation of RMSnorm is introduced with benchmarks showing throughput improvements, and the SymmMem module gains a reduce_scatter_offset collective operation optimized for NVLink multicast hardware. These contributions enhance computational efficiency and collective communication capabilities.
    • pull/177553, pull/177791
  • Memory and parameter handling improvements: A boxed calling convention for autograd backward is added to reduce peak memory usage by freeing tangent tensors earlier, and support for non-float parameters is added to the Fully Sharded Data Parallel v2 implementation, broadening parameter type compatibility.
    • pull/177837, pull/177948
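The CuteDSL RMSnorm pull request above targets throughput, not semantics. As a reference for what such a kernel computes, here is a plain-Python sketch of RMSNorm; the eps value and elementwise weight are standard choices but assumptions here:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square (no mean subtraction, unlike
    # LayerNorm), then apply a learned elementwise scale.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]

out = rmsnorm([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
```

Skipping the mean subtraction is what makes RMSNorm cheaper to fuse than LayerNorm: one reduction instead of two.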

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 315

Key Closed Pull Requests

1. [Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion: This pull request adds a unit test for the fusion of ExternalTritonTemplateKernel in the Helion backend integrated with torch.compile, specifically testing prologue and epilogue fusion scenarios using a mock external template buffer to verify correct setup and handling of fusion hooks and extra inputs.

  • URL: pull/177065
  • Associated Commits: 4dede, 90a70, 297ef, fc8dd, bee95, 8d16a, 067cb, ad866, 75368, aec2a, 9e91b, 0a268, e3da9, 54db4, ac045, 92077, 44dcf, 10743, 62f03, 1f7ba, fdd61, cc6e4, 1dd01, 47382, 4ec4a, df941, a3e53, fc15a, 28ce0, 7f90b, 2e3bc, d2624, 2b4ff

2. Handle SymPy boolean graph inputs in Inductor: This pull request addresses the issue of Inductor not supporting SymPy boolean graph inputs by extending its input handling to accept SymPy boolean relations like SymBool backed by sympy.Eq, materializing them as torch.bool when needed, and adding regression tests to ensure consistent symbolic scalar support across graph inputs and wrapper codegen.

  • URL: pull/177326
  • Associated Commits: f245c, 2781c, 9dfdc, 76515, 08e9f, cb301, 27d9f, def5e, 96b84, 6bfd6, 909f3, 0abd4, 255e0, b4897, a914d, 9b0eb, 9bd9c

3. [macOS wheel] Add wheel platform tag vs dylib minos validation: This pull request adds a validation check to ensure that the macOS wheel platform tag correctly matches the minimum OS version specified in the Mach-O binaries by inspecting the LC_BUILD_VERSION or LC_VERSION_MIN_MACOSX fields using otool.

  • URL: pull/177609
  • Associated Commits: bd5a7, 83120, 3997b, bcf2f, bce91, 266b4, 219f6, baee9, 1cef1, a7b74, 015e0, 8f480, 024a2, 488a9, 2b334, 42c5c, b9da7
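Key pull request 3 effectively automates the comparison behind the macOS wheel compatibility concern in section 2.4. A hypothetical sketch of the check; the real PR inspects Mach-O LC_BUILD_VERSION/LC_VERSION_MIN_MACOSX fields via otool, and the helper names below are invented:

```python
import re

def wheel_macos_min(platform_tag):
    # Wheel platform tags encode the claimed minimum macOS,
    # e.g. "macosx_11_0_arm64" claims 11.0.
    m = re.match(r"macosx_(\d+)_(\d+)_", platform_tag)
    return (int(m.group(1)), int(m.group(2)))

def tag_matches_dylib(platform_tag, dylib_minos):
    # The tag must not promise more than the binaries deliver: a wheel
    # tagged 11.0 must not ship dylibs whose minos is 14.0.
    return wheel_macos_min(platform_tag) >= dylib_minos

ok = tag_matches_dylib("macosx_14_0_arm64", (14, 0))
mismatch = tag_matches_dylib("macosx_11_0_arm64", (14, 0))
```

The second call models the exact mismatch issue 177140 reported: an 11.0 tag over 14.0 binaries.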

Other Closed Pull Requests

  • DTensor dispatch bug fixes: This set of pull requests restores the subclass-specific __torch_dispatch__ fallback mechanism in DTensor, ensuring correct dispatch behavior for subclasses without losing the C++ fast path optimization for the base DTensor type. It also includes fixes to Python fallback handling, custom handlers, subclass dispatch inheritance, and regression tests to verify the changes.
    • pull/177741
  • AArch64 test suite stability improvements: These pull requests add expected failure (xfail) or skip markers for all known unit test failures on AArch64 architecture across inductor, nn, jit, and linalg test suites. They also include these test files in the linux-aarch64 unit test suite to enable running all unit tests on AArch64 CPUs without reported failures in a future follow-up.
    • pull/177584
  • FunctionalTensorMode and fake mode bug fixes: This pull request fixes a bug in the out_dtype fake mode for functional tensor wrappers by ensuring proper reuse or temporary creation of FunctionalTensorMode during fake-mode execution. It prevents redispatch errors and preserves tracing state, with updated tests and fixes to proxy tracking and tracing paths.
    • pull/177679
  • AOTAutograd in-place true division and gradient hook fixes: These pull requests address bugs in AOTAutograd related to in-place true division on integral tensors causing silent incorrect mutations by introducing graph breaks, and preserve eager execution priority ordering for post-accumulate gradient hooks by enabling backward edge traversal order specification. They also fix cache saving issues for lazy backward computations by eagerly materializing and saving the backward during cache preparation.
    • pull/177681, pull/177683, pull/177684
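The in-place true-division hazard above can be sketched in plain Python (torch-free; all names here are hypothetical, not the actual AOTAutograd code): true division always produces a floating-point result, so writing it back in place into an integer buffer silently truncates, and a tracer can avoid the silent mutation by refusing to trace through the operation (a "graph break") when the destination is integral.

```python
# Hypothetical sketch: true division of integers yields floats, so an
# in-place write back into an int buffer would silently truncate.
class GraphBreak(Exception):
    """Raised when a tracer must fall back to eager execution."""

def inplace_true_divide(buf, divisor):
    """In-place true division over a list-backed 'tensor' (illustrative only)."""
    if all(isinstance(x, int) for x in buf):
        # Mirroring the fix described above: break out of tracing instead of
        # silently truncating e.g. 7 / 2 == 3.5 down to 3 inside an int buffer.
        raise GraphBreak("in-place true division on integral buffer")
    for i, x in enumerate(buf):
        buf[i] = x / divisor
    return buf

print(inplace_true_divide([7.0, 2.0], 2))  # floats are safe: [3.5, 1.0]
try:
    inplace_true_divide([7, 2], 2)         # integral buffer: graph break
except GraphBreak as e:
    print(e)
```

The design choice mirrored here is that correctness beats fusion: falling back to eager for the offending op is cheaper than debugging a silently wrong result.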
  • Dynamo nested graph break and decorator interaction fixes: These pull requests fix bugs in PyTorch's Dynamo component related to nested graph breaks and their improper interaction with decorators applied within compiled regions. They ensure correct behavior in nested compilation scenarios and fix global scope bugs in nested closures.
    • pull/176906, pull/177090
  • Dynamo support for requires_grad_() tracing: This pull request enables the PyTorch Dynamo compiler to correctly trace and support the use of requires_grad_() on intermediate tensors followed by backward calls within compiled regions. This allows gradient computations that previously caused graph breaks to be executed seamlessly.
    • pull/176984
  • Inductor synchronize_event operation support and testing: These pull requests add lowering and control dependency support for the synchronize_event operation in Inductor by implementing a fallback mechanism and updating synchronization handling. They also add an end-to-end test for the Event.synchronize() method involving non-blocking device-to-host memory copies to verify compiled output matches eager execution.
    • pull/177614, pull/177615, pull/177613
  • Linear algebra sharding strategies and ROCm Kineto updates: These pull requests implement sharding strategies for linear algebra operations restricted to batch dimensions and validate them on CUDA with distributed tensors. Additionally, they update the Kineto submodule for PyTorch 2.7 by integrating rocprofiler-sdk for ROCm support, fixing test failures, optimizing performance, and applying cherry-picked fixes for stability.
    • pull/176955, pull/177787
  • Performance improvements in pad_mm and MobileBert inference: These pull requests fix bugs in the pad_mm function of the Inductor backend that caused padded strides to leak into outputs, and improve handling of "unbacked" mode. The changes result in a 1.73x to 2x performance improvement for MobileBertForMaskedLM inference using the Inductor backend with unbacked-batch-only mode.
    • pull/175824, pull/177546
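As a rough illustration of the padding idea (not the actual Inductor pad_mm code; all names below are made up): matmul inputs are padded so the inner dimension hits a hardware-friendly alignment, but those padded strides must not leak into the kernel's output layout, which should stay plainly contiguous.

```python
def pad_dim(n, align=8):
    """Round n up to the next multiple of `align` (illustrative only)."""
    return -(-n // align) * align

def padded_input_strides(rows, cols, align=8):
    """Row-major strides for a (rows, cols) matrix padded to alignment."""
    padded_cols = pad_dim(cols, align)
    return (padded_cols, 1)

def output_strides(rows, cols):
    """The output must stay contiguous: input padding must not leak here."""
    return (cols, 1)

# A (17, 13) @ (13, 5) matmul: the inputs may carry padded strides...
print(padded_input_strides(17, 13))  # (16, 1): 13 rounded up to 16
# ...but the (17, 5) result keeps plain contiguous strides.
print(output_strides(17, 5))         # (5, 1)
```

The bug class the summary describes is exactly the case where a function like `output_strides` accidentally returned the padded value instead of the logical one.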
  • Dynamo subgraph reuse and user-defined class fixes: These pull requests add subgraph reuse functionality for the invoke_subgraph operation to optimize execution by reusing computed subgraphs. They also fix issues in the polyfill implementation of instantiate_user_defined_class_object within Dynamo.
    • pull/176644, pull/177155
  • WrapperUserFunctionVariable inheritance and functools.wraps fix: This pull request fixes an issue in the Dynamo module by having WrapperUserFunctionVariable inherit from BaseUserFunctionVariable instead of VariableTracker. This enables proper handling of special attributes and fixes breakage caused by functools.wraps when applied to lru_cache-wrapped functions during tracing.
    • pull/176934
  • Meta/FakeTensor as_strided argument validation: This pull request fixes argument validation for the Meta/FakeTensor aten::as_strided function by factoring eager mode validation into a shared helper. It rejects invalid negative strides and storage offsets early in the unchecked Meta/FakeTensor path, preventing crashes in compiled code and ensuring consistent runtime error behavior between eager and traced executions.
    • pull/177678
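A minimal sketch of the early validation described above (a hypothetical helper, not PyTorch's actual shared implementation): both the strides and the storage offset must be non-negative, and checking them up front in the meta/fake path produces the same runtime error eager mode raises, instead of a crash later in compiled code.

```python
def validate_as_strided_args(size, stride, storage_offset=0):
    """Reject invalid as_strided-style arguments early (illustrative sketch)."""
    if len(size) != len(stride):
        raise RuntimeError("size and stride must have the same length")
    if storage_offset < 0:
        raise RuntimeError(
            f"storage_offset must be non-negative, got {storage_offset}")
    for d, s in enumerate(stride):
        if s < 0:
            raise RuntimeError(f"stride[{d}] must be non-negative, got {s}")

validate_as_strided_args((2, 3), (3, 1))        # valid: passes silently
try:
    validate_as_strided_args((2, 3), (3, -1))   # negative stride: rejected
except RuntimeError as e:
    print(e)
```

Factoring the checks into one helper, as the summary describes, is what keeps the eager and meta/fake paths from drifting apart.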
  • torch.compile ConvTranspose mixed-dtype tracing fixes: This pull request addresses incorrect tracing of mixed-dtype ConvTranspose operations by adding missing dtype and device validation checks to fake/meta convolution implementations. It ensures invalid convolution signatures are rejected during tracing as in eager mode, preventing incorrect compiled kernels and aligning the meta path with eager execution.
    • pull/177685
  • CUDA 13.2 support and related build improvements: This pull request adds support for CUDA 13.2 by including x86 and sbsa binaries unified into the same workflow. It also introduces a libtorch Docker image with direct binary testing and addresses related installation and build issues.
    • pull/177316

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
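The activity criteria above can be written as a simple predicate (the field names are our own, not the newsletter's internal schema):

```python
def is_active(commits=0, pull_requests=0, issues=0, comments=0):
    """Active = >=1 commit, >=1 PR, >=1 issue, or more than 2 comments."""
    return commits >= 1 or pull_requests >= 1 or issues >= 1 or comments > 2

print(is_active(commits=1))    # True: a single commit qualifies
print(is_active(comments=3))   # True: more than 2 comments qualifies
print(is_active(comments=2))   # False: exactly 2 comments is not enough
```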

Contributor     | Commits | Pull Requests | Issues | Comments
anijain2305     | 325     | 29            | 1      | 11
bobrenjc93      | 249     | 49            | 0      | 0
anshul-si       | 188     | 7             | 0      | 0
mlazos          | 163     | 17            | 2      | 11
malfet          | 152     | 11            | 1      | 28
Skylion007      | 48      | 14            | 0      | 70
aorenste        | 95      | 17            | 0      | 15
IvanKobzarev    | 91      | 23            | 0      | 7
guilhermeleobas | 92      | 13            | 0      | 15
NikhilAPatel    | 106     | 6             | 1      | 1

Access Last Week's Newsletter:

  • Link