Weekly GitHub Report for Pytorch: March 30, 2026 - April 06, 2026 (18:22:38)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda package publishing.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
[ONCALL: DISTRIBUTED] [TRIAGED] [MODULE: C10D] [BOT-TRIAGED]
device_id not honored when creating process_group with $TORCH_DISTRIBUTED_DEBUG=DETAIL: This issue describes a bug where the device_id parameter is not properly honored when creating a process_group with the environment variable $TORCH_DISTRIBUTED_DEBUG set to DETAIL, because the ProcessGroupWrapper sets the device only on the wrapper and not on the underlying real process group. This leads to potential hangs in distributed training with the NCCL and CUDA backends because the device ID is guessed incorrectly, causing warnings and synchronization issues.
- The comments discuss the root cause: ProcessGroupWrapper does not forward the setBoundDeviceId call to the wrapped backend. They propose making setBoundDeviceId a virtual method that the wrapper overrides to forward the call properly. They also consider making the Backend class pure virtual to force wrappers to implement all methods, but note this could break backward compatibility, leading to a preference for a virtual method with a default implementation and an override in the wrapper, which fixes the problem without breaking existing backends. - Number of comments this week: 9
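The proposed fix is a standard delegation pattern. Below is a minimal Python sketch of the idea; the real fix lives in C10D's C++ code, and the class and method names here are simplified stand-ins, not PyTorch's actual API:

```python
class Backend:
    """Simplified stand-in for the C++ Backend class."""
    def __init__(self):
        self.bound_device_id = None

    # A virtual method with a default implementation, so existing
    # backends keep working without any changes.
    def set_bound_device_id(self, device_id):
        self.bound_device_id = device_id


class ProcessGroupWrapper(Backend):
    """Debug wrapper used under TORCH_DISTRIBUTED_DEBUG=DETAIL.
    Before the fix it stored the device only on itself, never on
    the wrapped backend."""
    def __init__(self, wrapped):
        super().__init__()
        self.wrapped = wrapped

    def set_bound_device_id(self, device_id):
        super().set_bound_device_id(device_id)
        # The fix: forward the call to the real process group.
        self.wrapped.set_bound_device_id(device_id)


real = Backend()
pg = ProcessGroupWrapper(real)
pg.set_bound_device_id(3)
print(real.bound_device_id)  # 3 — the real backend now sees the device
```

Without the forwarding call in the override, `real.bound_device_id` would stay `None` and the real process group would have to guess a device, which is exactly the hang-inducing behavior described above.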
-
[MODULE: TESTS] [TRIAGED] [MODULE: MPS] [BOT-TRIAGED]
test_torch.py does not run any tests on MPS despite a number of skipIfMPS decorators: This issue reports that the test suite test_torch.py does not run any tests on the MPS device despite the presence of skipIfMPS decorators, because allow_mps=True is missing from the instantiate_device_type_tests call for TestTorchDeviceType. The discussion reveals that adding this flag enables test discovery and execution on MPS, but further work is needed to properly mark tests as skipped or expected to fail based on their MPS compatibility to ensure accurate test results.
- The comments confirm the issue was fixed by adding allow_mps=True to the test instantiation, enabling MPS tests to run; users verified the fix locally and submitted a PR. However, the conversation clarifies that the goal is not just to run tests with failures but to refine decorators so only supported tests run while unsupported ones are skipped or marked as expected failures, improving test suite accuracy for MPS. - Number of comments this week: 7
-
[ONCALL: QUANTIZATION] [BOT-TRIAGED] The smallest value of E4M3 for two-level quantization is incorrect: This issue addresses an incorrect definition of the smallest positive value for the E4M3 floating-point format used in two-level quantization within the PyTorch codebase, where the current implementation uses the smallest positive normal number instead of the true smallest positive subnormal number. This mistake causes scale factors in quantization to be clamped too high, leading to an 8x coarser quantization step size and potential loss of precision, especially in networks with small activation values.
- The comments confirm the bug and explain the difference between smallest normal and subnormal values in E4M3, propose a fix by correctly calculating the smallest subnormal value, discuss the broader impact on quantization accuracy, and suggest a robust, generalizable approach to derive subnormal minimums programmatically rather than hardcoding constants.
- Number of comments this week: 5
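The 8x figure follows directly from the format's parameters: E4M3 has 4 exponent bits (bias 7) and 3 mantissa bits, so the smallest subnormal is 2^-3 times the smallest normal. A quick check in plain Python:

```python
# E4M3 (FP8): 4 exponent bits with bias 7, 3 mantissa bits.
EXP_BIAS = 7
MANT_BITS = 3

# Smallest positive normal: exponent field = 1 -> 2^(1 - bias).
smallest_normal = 2.0 ** (1 - EXP_BIAS)                    # 2^-6 = 0.015625

# Smallest positive subnormal: one mantissa ULP below the
# smallest normal's binade -> 2^(1 - bias) * 2^(-mant_bits).
smallest_subnormal = smallest_normal * 2.0 ** -MANT_BITS   # 2^-9 = 0.001953125

print(smallest_normal / smallest_subnormal)  # 8.0 — the 8x step-size error
```

Clamping scale factors to the smallest normal instead of the smallest subnormal therefore makes the effective quantization step 8x coarser for values near zero.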
-
[TRIAGE REVIEW] [NEEDS REPRODUCTION] [MODULE: CRASH] [MODULE: WINDOWS] [MODULE: CUDA] [MODULE: PICKLE] [MODULE: SERIALIZATION] [BOT-TRIAGED] [Windows] torch.save triggers 0xC0000005 Access Violation on RTX 4090 Laptop (WDDM Driver Conflict): This issue describes a fatal access violation error occurring on Windows 11 with an NVIDIA RTX 4090 laptop GPU when calling torch.save() during the training of a YOLO model, specifically during the serialization step that moves tensors from GPU to CPU. The problem appears to be related to a conflict between PyTorch's memory handling during serialization and the Windows WDDM driver model, potentially involving pointer invalidation or race conditions in the GPU-to-CPU memory copy process.
- The comments confirm the crash occurs inside the persistent_id function during serialization, provide additional stack traces, note similar system instability possibly linked to Windows updates, and request a reproducible example while suggesting the issue may be related to Windows WDDM Timeout Detection and Recovery (TDR) mechanisms.
- Number of comments this week: 4
-
[TRIAGE REVIEW] [MODULE: CRASH] [MODULE: REGRESSION] [MODULE: XPU] [BOT-TRIAGED] XPU: Segmentation Fault when using newer drivers: This issue reports a segmentation fault occurring when calling torch.xpu.get_device_properties() after upgrading to newer versions of the Intel compute-runtime driver, which previously worked without error. The user provides detailed environment information, confirms that the fault appears starting from a specific driver release, and seeks guidance on compatible driver versions and support for Windows 11.
- The comments clarify that the segmentation fault begins with a particular compute-runtime version, emphasize that PyTorch XPU support is guaranteed only for certain Intel drivers per official guidelines, and include a request for updated driver information for Windows 11 due to availability issues on Intel's website.
- Number of comments this week: 4
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 76
Summarized Issues:
- torch.compile and Inductor Backend Numerical and Functional Issues: Multiple issues report that torch.compile with the Inductor backend causes numerical inaccuracies, silent errors, or crashes. These include incorrect output for fused operations, silent bypass of runtime bounds checks, numerical divergences in reduction and quantization models, and crashes due to layout or kernel fusion problems, indicating instability and precision problems in compiled models.
- [issues/178820, issues/178845, issues/178869, issues/178871, issues/178875, issues/178879, issues/178880, issues/178881, issues/178882, issues/178883, issues/179135, issues/179232, issues/179233, issues/179368, issues/179418]
- Distributed and Backend Process Group Bugs: Several issues highlight bugs in distributed backends such as GLOO and process group wrappers, including lack of support for tensor stacking with different dimensions, race conditions during shutdown, and device_id not properly applied causing hangs. These problems affect distributed training stability and correctness.
- [issues/178798, issues/178977, issues/179238]
- ROCm and MIOpen Convolution Solver Failures: Issues report that MIOpen convolution solvers on ROCm GPUs return zero workspace sizes causing solvers to be skipped silently, leading to failed backward passes or fallback to slower algorithms on specific GPUs like Radeon 890M and gfx1150.
- [issues/178839, issues/178934]
- PyTorch on macOS Apple Silicon Qt6 Plugin Crash: Importing torch before creating a QApplication on macOS Apple Silicon causes Qt6 to fail to find the cocoa platform plugin due to hidden file flags and interference from torch's C++ runtime, resulting in application crashes that require manual clearing of the hidden flags.
- [issues/178841]
- torch.multinomial CUDA Input Validation Inconsistency: The CUDA implementation of torch.multinomial triggers a device-side assert on zero-sum probability inputs instead of raising a user-facing error like the CPU version does, showing inconsistent input validation between CPU and CUDA.
- [issues/178864]
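For illustration, here is a pure-Python sketch of the kind of host-side check the CPU path performs before sampling; the function name and error message are hypothetical and not PyTorch's actual implementation:

```python
import random

def multinomial_checked(probs, rng):
    """Draw one index from unnormalized probabilities, with the
    zero-sum validation applied up front as the CPU path does."""
    total = sum(probs)
    if total <= 0:
        # The CPU path raises a user-facing error here; the CUDA path
        # instead fires a device-side assert — that asymmetry is the
        # inconsistency reported in the issue.
        raise ValueError("invalid multinomial distribution "
                         "(sum of probabilities <= 0)")
    r = rng.random() * total
    acc = 0.0
    for i, p in enumerate(probs):     # inverse-CDF sampling
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

rng = random.Random(0)
print(multinomial_checked([0.0, 1.0, 0.0], rng))  # 1
```

Doing the validation on the host before launching any kernel is what lets the error surface as a normal Python exception rather than a process-fatal device assert.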
- ONNX Export and torch.export Serialization Bugs: Exporting models with torch.onnx.export or torch.export.save fails due to issues with buffer registration combined with in-place assignments, or with unsupported tensor dtypes like torch.uint32, causing errors such as ValueError or KeyError.
- [issues/178868, issues/179433]
- torch.compile Graph Breaks and Guard Failures: Using certain operations like dist.record_comm inside torch.compile causes graph breaks because internal pybind11 functions are skipped, and tensor subclass metadata guards crash with confusing errors when metadata contains non-scalar tensors, indicating incomplete support for some dynamic or complex operations in compilation.
- [issues/178820, issues/178903]
- Windows and CUDA GPU Crashes and Resource Errors: Fatal access violations occur on Windows 11 with RTX 4090 GPUs during
torch.save(), and CUDA IPC tensor sharing fails with bad file descriptor errors due to fabric handle mismatches, while CTC loss backward pass errors occur on RTX 5090 GPUs with CUDA 13.0 due to resource exhaustion, highlighting GPU and driver interaction issues. - [issues/178892, issues/179214, issues/179220]
- Inductor Scheduler and Autotuner Bugs: The Inductor scheduler incorrectly attempts kernel epilogue fusion without considering buffer reuse, causing assertion errors, and the autotuner needs to optionally ignore host launch latency to improve kernel selection accuracy, especially for memory-bound kernels and CUDA Graphs workloads.
- [issues/179232, issues/179233, issues/179236]
- Triton and FP8 Data Type Support Requests and Updates: Requests include adding the TLX Triton Extension as an out-of-tree package, removing outdated fp8 dtype handling in Triton utils, and adding native support for packed FP8 x4 data types to align with hardware like Trainium3.
- [issues/178917, issues/178938, issues/179109]
- Documentation and Build Failures: Building PyTorch documentation with Python 3.14 fails due to a missing torch module and incompatible matplotlib versions, and Windows builds fail for TorchVision and CUDA projects with CUDA Toolkit versions above 13.0 due to invalid type specifiers.
- [issues/178988, issues/179005]
- Privateuse1 Backend Tutorial and Utility Updates: Requests to update the privateuse1 tutorial to include instructions for registering the HooksInterface and to document the new Python backend utility function _setup_privateuseone_for_python_backend, in order to prevent runtime errors and simplify backend addition.
- [issues/179008, issues/179010]
- Intel XPU Device Properties Segmentation Fault: Calling torch.xpu.get_device_properties() after upgrading Intel compute-runtime drivers causes segmentation faults, indicating driver compatibility issues.
- [issues/179030]
- Dynamo Dispatch Latency and Guard Drop Issues: Even after dropping all guards, dynamo dispatch latency remains about 4% higher on vLLM, showing persistent performance overhead in dispatch mechanisms.
- [issues/179056]
- MPS Backend Functional and Testing Issues: The MPS backend has multiple problems including lack of test execution by default, incorrect sum operation behavior due to saturated casting, scaled dot product attention producing incorrect results due to 32-bit overflow, and requests for optimized safetensors loader API to improve model loading performance on MacOS.
- [issues/179259, issues/179415, issues/179294, issues/179352, issues/179190]
- torch.compile Output Layout and Contiguity Differences: Compiled functions produce outputs with different memory layouts and contiguity than eager mode for operations like upsample_nearest3d and F.pad, causing discrepancies in stride and contiguity despite identical numerical results.
- [issues/179272, issues/179442]
- Transformer Model Performance and Backward Pass Slowdowns: Backward pass of transformer models compiled with PyTorch 2.11 is slower than previous versions due to Inductor's over-fusion of kernels, negatively impacting training performance despite unchanged forward pass times.
- [issues/179423]
- Stable C Shim API Error Reporting Enhancement Proposal: A proposal to add API support for programmatically retrieving detailed error messages and backtraces from Stable C Shim API failures to improve debugging beyond current stderr printing.
- [issues/179427]
- AdamTR Optimizer Proposal for Token-Routed MoE Architectures: Introduction of AdamTR, a variant of Adam optimizer with per-expert learning rate scaling, expert-aware weight decay, and gradient normalization by token count, designed to improve training dynamics for Token-Routed Mixture-of-Experts models.
- [issues/179143]
- Security Vulnerabilities in ONNX Submodule: The onnx submodule in PyTorch requires updating from version 1.18.0 to 1.21+ to fix multiple high and critical security vulnerabilities, including path traversal and race conditions.
- [issues/179340]
- Quantile Functions Failing Under torch.compile with Dynamic Shapes: torch.quantile and torch.nanquantile work in eager mode but fail with runtime errors related to symbolic tensor sizes or strides when used under torch.compile with the aot_eager_decomp_partition backend and dynamic shape support.
- [issues/179383]
- Bug in Scatter Operations on Zero-Stride Tensors: Scatter operations on tensors with zero strides cause runtime errors during backward passes due to aliased memory layouts; forcing contiguous clones works around the issue.
- [issues/178995]
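The aliasing problem can be illustrated without PyTorch: a zero-stride view makes many logical elements share one storage location, so scattered writes collide. A rough plain-Python analogy, assuming a (4, 3) view expanded along dim 0:

```python
# Emulate a (4, 3) "expanded" view with stride 0 along dim 0:
# every row references the *same* storage, like tensor.expand(4, -1).
storage = [0.0, 0.0, 0.0]
view = [storage] * 4          # four aliases of one row, not four copies

view[0][1] = 5.0              # a scatter-style write into "row 0"
print(view[3][1])             # 5.0 — rows 1..3 see the write too

# The workaround from the issue: force a contiguous clone first,
# so each row owns distinct memory before scattering into it.
clone = [row[:] for row in view]
clone[0][1] = 9.0
print(clone[3][1])            # still 5.0 — no aliasing after the clone
```

In the real tensor case the backward pass assumes distinct destinations per index, which the zero-stride layout violates; cloning to a contiguous tensor restores that assumption.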
- Bug in DTensor Indexing with Large Sharded Tensors: Indexing a large sharded DTensor by a replicated index produces inconsistent output placements, returning sharded outputs instead of replicated ones depending on tensor size, indicating a bug in DTensor indexing semantics.
- [issues/179448]
- Missing Upload Date Metadata in PyTorch Package Index: The PyTorch CPU wheel index lacks upload date metadata, preventing dependency management tools from enforcing supply-chain safety policies that exclude recently uploaded packages, compromising security.
- [issues/178980, issues/179374]
- Bug in max(dim=1) with Inductor Backend on Transposed Tensors: Compiling a function containing (a + b).max(dim=1) with the Inductor backend produces incorrect all-zero indices when a is transposed, while other backends and non-compiled code behave correctly.
- [issues/178964]
- Bug in torch.compile with Deterministic Algorithms and Scatter Add: Enabling deterministic algorithms inside compiled functions causes graph breaks, particularly with scatter_add operations in fullgraph compilation mode.
- [issues/179194]
- Bug in BlockMask Pytree Registration Causing Crashes: Registering the BlockMask class as a pytree node causes crashes during eager activation checkpointing due to AttributeErrors when unflatten receives NoneType for expected tensor indices.
- [issues/179189]
- Bug in torch.backends FP32 Precision and cudnn API Usage: Setting the top-level torch.backends.fp32_precision does not propagate correctly to cudnn sub-backends, and accessing torch.backends.cudnn.allow_tf32 raises RuntimeErrors due to inconsistent internal state from mixing the new and legacy APIs.
- [issues/179445]
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 50
Summarized Issues:
- Torch.compile and Inductor Backend Crashes and Failures: Multiple issues report crashes and failures when using torch.compile with the Inductor backend, including tracing errors with builtin Python operators, improper exception initialization during code generation, and assertion errors caused by calling-convention mismatches in guard evaluation. These problems often occur despite models running correctly in eager mode and involve dynamic symbolic shapes, reshape operations, and dynamic shape handling regressions.
- issues/178041, issues/178676, issues/179026
- Dynamic Shapes and Performance Regressions in Inductor: There is a reported performance regression in multi-thread inference speed for the visformer_small model using AMP/fp32 dynamic shapes with the Inductor CPU backend, alongside a regression causing a ZeroDivisionError when processing empty tensors with dynamic shapes. These issues highlight challenges in handling dynamic symbolic integer dimensions and maintaining performance in recent nightly releases.
- issues/177374, issues/178530
- Torch.compile Numerical and Functional Accuracy Issues: Several issues describe numerical mismatches and incorrect results when using
torch.compile, including output discrepancies between eager and compiled modes, bugs in linear operations on MPS backend for AMD GPUs, and incorrect comparison results due to dtype casting errors. These inaccuracies affect model correctness and inference reliability across different backends. - issues/178247, issues/178697, issues/178716
- Dynamo Tracing and Guard Installation Bugs: Dynamo tracing has bugs such as failure to install guards on function defaults causing stale compiled graphs, and loss of source tracking when using Triton heuristics and autotune decorators, leading to assertion errors. These issues cause silent correctness errors and compilation failures during graph capture and tracing.
- issues/178179, issues/178365
- Memory Leaks and Garbage Collection Performance Degradation: A memory leak during training with FSDP2 causes Python garbage collector-tracked tuple objects containing DTensorSpec data to grow linearly, increasing gc.collect() times and degrading performance over long runs. This issue impacts training stability and resource management.
- issues/178276
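Growth of gc-tracked objects like the DTensorSpec-holding tuples described above can be observed directly with Python's gc module; the snippet below is a diagnostic sketch for spotting such leaks, not the fix:

```python
import gc

gc.collect()
gc.disable()          # keep the collector from untracking tuples mid-count
try:
    before = len(gc.get_objects())
    # Stand-in for the leaked (DTensorSpec, ...) tuples from the issue:
    leaked = [(i, "spec") for i in range(1000)]
    after = len(gc.get_objects())
finally:
    gc.enable()

# Each new tuple is gc-tracked, so the tracked-object count grows with
# the leak — and every gc.collect() pass must walk all of them, which is
# why collection times increase linearly over a long training run.
print(after - before >= 1000)  # True
```

Sampling `len(gc.get_objects())` (or `gc.collect()` wall time) periodically during training is a cheap way to confirm this kind of linear growth before digging for the owning reference.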
- DTensor Indexing and Backward Pass Failures: DTensor indexing has bugs where repeated indexing modifies the original DTensor spec due to aliasing, and backward pass failures occur when mixing slice and dimension indices, causing assertion errors in sharding propagation logic. These bugs affect correctness and were fixed on main but not backported.
- issues/179136, issues/179290
- XPU Platform Test Failures and Disabled Tests: Numerous tests, including test_add_complex4_xpu_gpu_wrapper, test_addmm_gpu_wrapper, and others, are consistently failing on the XPU platform and have been disabled on the main branch. This widespread disabling indicates ongoing instability and compatibility issues with the XPU backend.
- issues/178575, issues/178753, issues/178761, issues/178762, issues/178854, issues/178855, issues/178857, issues/178971, issues/178974, issues/178984, issues/178992, issues/179117, issues/179213
- PyTorch Build, Packaging, and Import Errors: Issues include a SyntaxError in setup.py due to misplaced from __future__ imports preventing editable installs, missing Inductor kernel template files causing import errors, and a build-process error that removes critical header files from CPU-only wheels, leading to runtime failures with torch.compile. These problems affect developer workflows and package usability.
- issues/179197, issues/179245, issues/179414
- Distributed and Communication Operation Bugs: Bugs in distributed communication include intermittent segmentation faults when using torch.compile with distributed ops, incorrect all_reduce results for bfloat16 scalars causing silent corruption, and misleading warnings about destroy_process_group() calls despite correct usage. These issues impact distributed training reliability and debugging clarity.
- issues/178859, issues/178865, issues/178758
- ROCm Backend and Autotune Mode Crashes: A crash occurs in the ROCm backend when using torch.compile with max-autotune mode on models with biased linear layers, caused by conflicting PRs that led to invalid stride assumptions during Inductor lowering. This regression affects ROCm users relying on autotuning for performance.
- issues/179023
- Documentation and Warning Message Issues: There are documentation errors such as incorrect examples for torch.argsort and misleading warning messages about process group destruction, which can confuse users and developers.
- issues/178967, issues/178758
- Numerical and Functional Bugs in Core Operations: Bugs include incorrect Cholesky decomposition results on CPU for large matrices, torch.linalg.cond returning inf instead of nan for NaN inputs, and a typo in the CUDA AveragePool3d implementation affecting shape checks. These affect mathematical correctness and kernel reliability.
- issues/178769, issues/178773, issues/178719
- Activation Checkpointing and Tracing Inconsistencies: Selective activation checkpointing fails to recompute certain matrix multiplications during backward passes as expected, causing inconsistent behavior compared to vanilla checkpointing and suggesting the need for warnings or early returns in non-strict tracing modes.
- issues/178935
- Graph Capture and Functionalization Assertion Errors: A minimal model using permute followed by index_fill fails under torch.compile due to an assertion error related to AOTAutograd functionalization involving a copy_ operation during graph capture, despite running correctly in eager mode.
- issues/178952
- Cache Directory Ignored in Inductor Module: The fresh_cache() function in Inductor ignores the user-defined TORCHINDUCTOR_CACHE_DIR environment variable and always creates a temporary cache directory, causing unexpected recompilations during tests.
- issues/178858
- Non-Deterministic Behavior in CUDA Grid Sampler Backward: The grid_sampler_2d_backward_cuda operation exhibits non-deterministic behavior when deterministic algorithms are enabled, leading to inconsistent training results and warnings during model training.
- issues/179123
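The usual source of this kind of non-determinism is that the backward pass accumulates gradients with atomic adds whose ordering varies between runs, and floating-point addition is not associative, so different accumulation orders give different sums. This is offered as general background, not a confirmed diagnosis of this specific issue. A minimal demonstration of the non-associativity:

```python
# Floating-point addition is not associative: the same three terms
# summed in different orders produce different results, which is why
# unordered atomic accumulation is non-deterministic.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```

Deterministic implementations avoid this by fixing the reduction order (for example, per-output serial accumulation), typically at some performance cost.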
- CI Workflow Improvement Proposal: A proposal suggests having the PyTorch bot notify the AI reviewer Claude to assess whether CI check failures are related to code changes, aiming to improve merge decision efficiency based on AI confidence.
- issues/178837
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 316
Key Open Pull Requests
1. [Inductor][RFC] Symbolic Analysis of User-Defined Triton Kernels: This pull request implements symbolic index expression extraction from user-defined Triton kernels via TTIR traversal to enable formal reasoning about read/write access patterns, improve fusion scoring and cost modeling, and support prologue fusion, while introducing a new dependency type UserTritonDep and addressing related scheduler fusion bugs.
- URL: pull/179149
- Associated Commits: 3e2b5, 5967b, 7c232, 83e69, 8f256, 0adc8, 0fd72, 8a5d0, 74c7a, 86408, 567f1, 1162e, 561be, eb8c1, 8c3d4, 77c0b, 04b8f, 7dcfd, 1f67f, d2efa, 6cfef
2. Triton backward convolution kernel: This pull request implements Triton kernels for 2D convolution backward operations (input and weight gradients) in TorchInductor, replacing the previous ATen-only fallback approach, and includes layout computation, backend selection logic, comprehensive parameterized tests, and achieves up to 20% performance improvement on MI325 GPUs for certain small-batch workloads.
- URL: pull/178945
- Associated Commits: 354ac, c3a73, 3f6ae, 05bc0, 2d8b9, 6b874, 75402, 43a2e, b35d1, f7c59, 34a30, ebd04, 8417d, a46a6, 84c06, 4e6bf, b011a
3. Fix FxGraphCachePickler crash on unpicklable pybind11: This pull request fixes a crash in PyTorch's FxGraphCachePickler when serializing FX graphs containing unpicklable pybind11 extension types by implementing a generic reducer override that detects such types and serializes them via a stable deterministic key, and modifies the AOTAutogradCachePickler to properly fall through to this handler, thereby enabling robust caching without requiring exhaustive type registration.
- URL: pull/178853
Other Open Pull Requests
- Dynamo improvements and fixes: Multiple pull requests enhance PyTorch's Dynamo by fixing deferred default f-string formatting to restore correct Python ordering semantics, reducing special casing for namedtuple and struct sequence objects, adding bytecode source location attribution for better graph-break diagnostics, and refactoring set tracking to decouple it from dictionary tracking. These changes improve tracing accuracy, error reporting, and internal organization without affecting existing functionality.
- Autograd and backward pass enhancements: A pull request introduces a _patch_autograd_grad context manager that patches autograd.grad to preserve stacktraces and automatically tag backward nodes, while updating the rematerialization pass to detect these nodes and enforce user annotations when multiple backward regions exist. Another pull request improves the standalone autograd_cache_key by rejecting unsupported graph shapes to prevent incorrect keys.
- Profiler metadata improvements: Enhancements to the PyTorch profiler add CPU operation and Python function event metadata to the events() output, including support for strides, dtypes, and improved shape handling for TensorList arguments, as well as Python module and function launch event metadata. These additions provide more detailed profiling information for performance analysis.
- Batch matrix multiplication optimization: A pull request adds support for stamping out batch matrix multiplication (bmm) outer product operations using a native API, improving performance and correctness across various data types, tensor layouts, and shapes, with detailed benchmarking against cuBLAS results.
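The outer-product formulation referenced above writes each matrix product as a sum over the shared dimension of rank-1 updates, C = Σ_k A[:, k] ⊗ B[k, :]. A plain-Python sketch of that decomposition, batched over the leading dimension (the PR's native API itself is not shown here):

```python
def matmul_outer(A, B):
    """C = sum over k of outer(A[:, k], B[k, :])."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for kk in range(k):                 # one rank-1 update per k
        for i in range(m):
            a_ik = A[i][kk]
            for j in range(n):
                C[i][j] += a_ik * B[kk][j]
    return C

def bmm_outer(As, Bs):
    """Batched matmul: the outer-product matmul applied per batch entry."""
    return [matmul_outer(A, B) for A, B in zip(As, Bs)]

print(bmm_outer([[[1.0, 2.0], [3.0, 4.0]]],
                [[[5.0, 6.0], [7.0, 8.0]]]))
# [[[19.0, 22.0], [43.0, 50.0]]]
```

Stamping out the k loop as explicit rank-1 updates is what lets a kernel trade the usual inner-product inner loop for outer-product accumulation, which can map better onto some hardware matrix units.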
- Muon optimizer enhancements: The Muon optimizer is improved by adding dynamic coefficient presets such as "jordan" and "polar_express," multiple normalization methods for Newton–Schulz orthogonalization, updated API and documentation, and comprehensive tests to increase adaptability and performance based on recent research.
- StableIValue memory management: New C shim functions are added to allocate and deallocate StableIValue objects on the heap, replacing direct C++ new and delete usage to ensure proper memory management and Rust interoperability, accompanied by unit tests validating correct behavior.
- Cache key testing improvements: A cache key equivalence mixin is added to the AOTAutogradCacheTests to improve testing of cache key equivalence behavior in PyTorch.
- Module output handling: The split_module function gains a tuple_return option to ensure submodule outputs are consistently wrapped in a tuple, aligning with the compile_fx convention and simplifying output node generation by removing redundant handling of empty tuples.
- Refactoring of higher-order ops: The monolithic higher_order_ops.py module is refactored into a well-organized package structured by higher-order operator families, moving shared helpers and base machinery into a common module to improve maintainability and scalability without breaking existing functionality.
- Docker image consolidation: Two pull requests consolidate multiple nearly identical Docker images by combining Python versions into a single multi-environment container with separate conda environments and by merging gcc11 and clang18 images with compiler selection via environment variables, simplifying maintenance and build processes.
- Oink CuTeDSL CUDA kernel registrations: Several pull requests register oink's CuTeDSL kernels as CUDA dispatch overrides for RMSNorm, softmax, and LayerNorm on SM100 (Blackwell) devices, enabling optimized handling of 2D bf16/fp16 inputs with fallback to original CUDA kernels, including gating logic, kernel implementations, and tests.
- Compile options improvements: A DynamoCompileOptions dataclass is introduced to bundle compile options into a single object passed through the torch.compile call chain, simplifying the addition of new options without modifying multiple functions. Another pull request adds a nested_compile_regions option to DynamoCompileOptions with minimal code changes.
- Test suite updates and fixes: Multiple pull requests address failing test cases by marking them as expected failures or asserting raises due to unsupported features, define expected test failures for the XPU device aligned with CUDA, and ensure capture error mode is respected when using accelerator.Graph as a context manager.
- Redirect to Claude implementation: A pull request updates the PyTorch project to implement a redirection to Claude, indicated by the title and a series of incremental "[ghstack-poisoned]" commits.
- OIDC integration debugging: A pull request focuses on debugging the OpenID Connect (OIDC) integration within the PyTorch project through multiple commits aimed at testing and troubleshooting the CI environment without requiring formal review.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 370
Key Closed Pull Requests
1. Fix nn.Dropout accuracy discrepancies between triton and torch implementations [new]: This pull request proposes a PyTorch update to fix accuracy discrepancies in nn.Dropout between Triton-based Torch Compile and eager mode by introducing a compiler switch (torch._inductor.config.align_random_eager) that aligns dropout mask random number generation with eager execution while preserving performance through fusion with adjacent kernels and maintaining compatibility across various GPU architectures.
- URL: pull/178843
- Associated Commits: bcb9f, 5a711, 0de41, 0e443, 6dd7c, a897a, 2a4ec, f274c, d8ebb, 7549a, 14aea, 0c68a, a6732, 3a485, 09c78, 32c25, e498d, a8611, ef9bb, fb078, a1fea, 40a9d, 8868f, 3a955, beb8b, 39665, 66dad, d123b, 9d7d9, 91c75, 01289, f25ad, 43399, 98a0a, aa57d, 53c80, b92cf, dfaed, 03028, 93440, 230b2, 35dbd, 396ef, b2f68, 8348e, 088bc, ed974, 3f550, 4fa0b, 0ab6b, 4109f, ac48b, 7caa7, 4a94f, 860d3, 362c8, ee20c, f0eb6, ad195, be3d0, 8bc3d, 5d4bd, 39f68, bb406, 2aaaf, 7a44e, 7c447, e2f8f, 05273, ddb83, 51420, 81478, dc5e0, e85b7, 11579, c3c8f, 23b1f, 4d0f6, 82cb6, f8f6a, d882e, c5e2a, fab6a, 05d90, 9cb75, 46354, eeab5, 95826, e03f6, 7bc0c, 456f9, bf0e1, fad66, c7c5a, 4cab2, 80e15, 1426d, f9b91, cbf8f, 3c854, 072c2, fb5db, a6b46, c39ac, a39df, a79d1, bb4cc, 7f90f, f32ea, bcac2, 234d3, 67f4b, 0a84c, 127b5, 91a0e, af4c4, 9d21d, e390f, d11e4, 60b31, 8b8c1, 3b736, 632b5, df94b, 273d5, 4b293
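The fix above hinges on making the compiled dropout kernel draw its mask from the same random stream as eager mode, so both paths produce bit-identical results. The following is a minimal pure-Python sketch of that idea, not the actual PyTorch implementation; the function names and the use of `random.Random` are illustrative assumptions.

```python
import random


def dropout_mask(n, p, seed):
    """Generate a Bernoulli keep-mask from an explicitly seeded stream.

    Any two implementations that share this generator produce the
    same mask, and therefore the same dropout output -- the property
    the PR's align_random_eager switch is meant to restore between
    eager mode and the Triton-compiled kernel.
    """
    rng = random.Random(seed)
    return [1 if rng.random() >= p else 0 for _ in range(n)]


def dropout(xs, p, seed):
    # Standard inverted dropout: kept values are scaled by 1/(1-p)
    # so the expected value of each element is unchanged.
    mask = dropout_mask(len(xs), p, seed)
    scale = 1.0 / (1.0 - p)
    return [x * m * scale for x, m in zip(xs, mask)]


xs = [0.5, -1.0, 2.0, 3.5]
# Two independent calls that share the seeded stream match exactly.
out_a = dropout(xs, p=0.5, seed=42)
out_b = dropout(xs, p=0.5, seed=42)
assert out_a == out_b
# Each output element is either dropped (0.0) or scaled by 2.
assert all(y == 0.0 or y == 2 * x for x, y in zip(xs, out_a))
```

The real change is more subtle because the compiled kernel must consume the RNG stream in the same order as eager mode while still fusing with adjacent kernels, but the determinism contract is the same.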
2. [DTensor] fix _StridedShard handling in reduction/normalization ops: This pull request addresses multiple silent correctness bugs in the handling of the _StridedShard sharding property within DTensor's reduction and normalization operations, such as softmax and layer_norm. It ensures proper replication of shard dimensions overlapping with reduction/normalization dims, fixes lost split_factor values during dimension remapping and swaps, corrects shard-dimension detection in slicing and concatenation operations, prevents unnecessary all-gather operations in binary ops, normalizes backward redistribution from _StridedShard to Replicate, and improves TP recomputation redistribution. It also introduces a _is_shard_like() TypeGuard helper to centralize checks and prevent future regressions.
- URL: pull/178785
- Associated Commits: bf88a, 04b90, 6ad2b, 1adc2, b5f10, f4732, c6015, fd46e, 0c8dd, cfecc, ef7a0, a7058, 251ce, ea1a2, 039af, 0e030, a0f6f, dc261, 472ee, d866e, e1e5b, 593ae, 3c184, 01084, 1d97a, e5f6c, 8eb27, 1bc7e, 9807e, fe96a, 0942c, ad345, b1691, fdd46, 39e56, 1a7a8, b8d95
3. [dynamo] Trace locals()/vars() as ConstDictVariable snapshot: This pull request addresses an issue where locals() calls in Dynamo caused graph breaks and incorrect dead-local pruning. It enhances the liveness analysis to conservatively treat locals() as reading all in-scope locals, preserving frame local variables across graph breaks and ensuring correct shape restoration, with added regression tests validating this behavior under torch.compile(backend="eager").
- URL: pull/178819
- Associated Commits: c0085, a3502, 85239, 98117, 49230, d647f, 78147, 701dd, d15ca, 0672e, 19837, 4d3bc, 52b2c, 06b81, 54a8e, 68e49, 2c7b2, 55d2c, d4b9b, 45151, 04673, 94d93, 57deb, 3690c, 306e6, 057a6, a8ebf, 0d68d, e4e1e, 6951d, e74b2, 6605e, 189fd
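The conservative rule above can be illustrated with a toy backward liveness analysis over a straight-line instruction list: a `locals()` call is modeled as reading every in-scope local, so none of them can be pruned as dead at that point. This is a hypothetical mini-IR for illustration only; Dynamo's real analysis operates on Python bytecode.

```python
def live_before(instructions, all_locals):
    """Return the set of locals live before the first instruction.

    Each instruction is a (reads, writes) pair of sets, or the
    sentinel string "LOCALS" for a locals()/vars() call.
    Standard backward dataflow: live = (live - writes) | reads.
    """
    live = set()
    for ins in reversed(instructions):
        if ins == "LOCALS":
            # Conservative rule from the PR: locals() observes
            # every in-scope local, so all of them become live.
            live |= set(all_locals)
        else:
            reads, writes = ins
            live = (live - set(writes)) | set(reads)
    return live


prog = [
    (set(), {"x"}),   # x = ...
    (set(), {"y"}),   # y = ...
    "LOCALS",         # print(locals())
    ({"x"}, {"z"}),   # z = f(x)
]

# With the LOCALS instruction, y is kept alive across the call.
assert live_before(prog[2:], all_locals={"x", "y", "z"}) == {"x", "y", "z"}
# Without it, only x is read downstream and y would be pruned as dead.
assert live_before(prog[3:], all_locals={"x", "y", "z"}) == {"x"}
```

The second assertion shows exactly the bug class being fixed: without the conservative rule, `y` is considered dead before the `locals()` call and its value is lost when the frame is restored after a graph break.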
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| bobrenjc93 | 525 | 75 | 0 | 4 |
| anijain2305 | 226 | 23 | 0 | 29 |
| malfet | 154 | 30 | 3 | 41 |
| mlazos | 205 | 9 | 0 | 4 |
| anshul-si | 143 | 17 | 0 | 0 |
| frgossen | 143 | 14 | 0 | 2 |
| yangw-dev | 141 | 10 | 2 | 6 |
| weifengpy | 119 | 11 | 2 | 10 |
| guilhermeleobas | 90 | 12 | 0 | 36 |
| huydhn | 111 | 21 | 0 | 3 |
Access Last Week's Newsletter: