Weekly GitHub Report for Pytorch: January 25, 2026 - February 01, 2026 (21:36:39)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Anaconda channel packages.
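The weights_only=True default mentioned above is a response to the fact that unpickling untrusted data can run arbitrary callables. The sketch below illustrates the general allow-list idea using only the Python standard library; it is not PyTorch's implementation, and the names AllowListUnpickler, restricted_loads, Payload, and the contents of _ALLOWED are hypothetical, chosen for illustration.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allow-list for this sketch; the real list used by torch.load differs.
_ALLOWED = {("collections", "OrderedDict")}


class AllowListUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not on the allow-list."""

    def find_class(self, module, name):
        if (module, name) in _ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Deserialize bytes while only permitting allow-listed globals."""
    return AllowListUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stand-in for a hostile object: unpickling it would call print(...)."""

    def __reduce__(self):
        return (print, ("side effect during load",))


# Plain containers and numbers round-trip normally.
state = restricted_loads(pickle.dumps(OrderedDict(weight=[1.0, 2.0])))

# The payload is rejected before its callable is ever resolved.
try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

A plain pickle.loads on the second payload would have invoked print; the restricted loader raises instead, which is the behaviour a "weights only" mode aims for.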
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [MODULE: ONNX] [TRIAGED] [ONCALL: PT2] [ONCALL: EXPORT] Exporting ONNX model with captum heatmaps generation: This issue describes a problem with exporting a PyTorch model to ONNX format when the model generates Captum heatmaps as outputs; the exported heatmaps remain constant regardless of the input, likely due to some parameters being converted to constants during export. The user also encounters errors when attempting to use dynamic shape export with torch.export.export, particularly related to the use of torchvision.transforms.v2 and issues with tracing these transforms, suggesting that certain dynamic export features or transform modules may not yet be fully supported.
- The comments discuss attempts to resolve the issue by enabling dynamo and dynamic shapes during export, revealing errors related to argument naming and module tracing failures, especially with torchvision.transforms.v2; a runnable repro code snippet was shared, and it was confirmed that the problem likely stems from current limitations or bugs in torch dynamo and dynamic export support for certain transform modules, with suggestions to avoid certain calls or await further support.
- Number of comments this week: 7
- [HIGH PRIORITY] [MODULE: DOCS] [TRIAGED] [MODULE: REGRESSION] [docs] torch.cuda.Stream link broken on 2.10+: This issue reports that the documentation link for torch.cuda.Stream is broken starting from version 2.10, causing users to encounter non-working URLs when trying to access this part of the PyTorch docs. The reporter requests either fixing the broken link or preventing users from being routed to the invalid URL to improve documentation accessibility.
- The comments discuss identifying the new correct documentation page and consider setting up redirects from the old broken links to the new ones. They also explore the potential impact of broken links on users and check for any recent spikes in 404 errors, concluding that the issue might be isolated but monitoring will continue.
- Number of comments this week: 6
- [MODULE: CUDA] [TRIAGED] [MODULE: NANS AND INFS] [MODULE: LINEAR ALGEBRA] [MODULE: CORRECTNESS (SILENT)] torch.linalg.slogdet does not propagate NaN in CUDA: This issue reports a discrepancy in the behavior of torch.linalg.slogdet between CPU and CUDA implementations when the input tensor contains NaN values: the CPU version correctly propagates NaN in the output, while the CUDA version treats NaNs as zeros and produces a finite result, which can mask data corruption. The problem appears to stem from the underlying cuSOLVER library used in CUDA, affecting other linear algebra operations like LU factorization and matrix inversion, and has been confirmed by the cuSOLVER team internally.
- The comments discuss evidence showing that CUDA's LU factorization treats NaNs as zeros, leading to incorrect results, and confirm that this behavior is consistent across other related operations; an internal bug report has been filed with the cuSOLVER team who can reproduce the issue, and users share additional examples and documentation highlighting the unexpected behavior.
- Number of comments this week: 6
- [ONCALL: DISTRIBUTED] [MODULE: DEVICEMESH] [RFC] Abort a DeviceMesh: This issue proposes adding a method to the DeviceMesh class that allows aborting all associated process groups concurrently when a rank exits prematurely, preventing the system from hanging. The discussion centers around whether the abort should affect only the submesh's process groups or all process groups in the mesh universe, with suggestions to default to aborting all to avoid inconsistent states and hanging.
- The comments debate the scope of the abort operation, weighing the ambiguity of submesh-specific aborts versus global aborts, and consider design options including an abort_all parameter or restricting aborts to disjoint DeviceMeshes, ultimately emphasizing the need for a clear, user-friendly API that prevents hanging and aligns with existing abort semantics.
- Number of comments this week: 6
- [MODULE: CUDA] [TRIAGED] [CUDA] MaxPool2d CUDA kernel lacks int64 support → one-side error: This issue reports that the MaxPool2d CUDA kernel in PyTorch does not support int64 (torch.int64) input types, causing a runtime error when attempting to run pooling operations on CUDA devices with integer tensors, while the same operation works on CPU. The problem arises because the CUDA implementation only supports floating-point types for MaxPool operations, and integer support is missing, leading to a one-sided error that affects MaxPool1d, 2d, and 3d on CUDA for int64 inputs.
- The comments discuss the historical lack of 64-bit atomic operations on GPUs as a likely cause, confirm that MaxPool is not supported for integer types on GPU at all, and debate whether to add support despite the drawbacks of increased binary size, non-differentiability, and performance issues; some suggest removing integer support from CPU implementations if GPU support is not feasible.
- Number of comments this week: 5
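The slogdet report above turns on whether a NaN in the input survives to the output. As a rough illustration in plain Python (no GPU or PyTorch involved), the sketch below computes a sign and log-absolute-determinant for a 2x2 matrix under ordinary IEEE 754 arithmetic, where NaN propagates naturally, and adds a small guard that flags the "NaN in, finite out" pattern described in the issue. The function names slogdet_2x2 and nan_was_masked are hypothetical helpers, not PyTorch APIs.

```python
import math


def slogdet_2x2(a, b, c, d):
    """Return (sign, log|det|) for the matrix [[a, b], [c, d]]."""
    det = a * d - b * c
    if math.isnan(det):
        # Any NaN operand makes det NaN, so both outputs are NaN.
        return math.nan, math.nan
    if det == 0:
        return 0.0, -math.inf
    return math.copysign(1.0, det), math.log(abs(det))


def nan_was_masked(inputs, outputs):
    """True when some input is NaN but every output is finite."""
    has_nan_input = any(math.isnan(x) for x in inputs)
    all_finite_output = all(math.isfinite(y) for y in outputs)
    return has_nan_input and all_finite_output


# NaN propagates through the reference computation, so nothing is masked.
entries = [math.nan, 1.0, 2.0, 3.0]
result = slogdet_2x2(*entries)
masked_here = nan_was_masked(entries, result)

# A finite result for the same input is the pattern the issue describes.
masked_if_finite = nan_was_masked(entries, (1.0, 0.0))
```

A check of this shape can be dropped into a cross-backend comparison test to catch silent masking without depending on exact numerical agreement.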
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 104
Summarized Issues:
- Test Failures and Disabling on XPU and ROCm Platforms: Several tests including test_skip_non_tf32, test_mm_plus_mm3, test_mm_plus_mm3_gpu_wrapper, test_codegen_with_custom_heuristics_module, test_weight_norm_conv2d_xpu, and multiple SDPA and FSDP related tests have been disabled due to consistent failures on the main branch for XPU and ROCm platforms. These issues affect test stability and CI status, prompting temporary skips until fixes or pull requests are merged.
- Numerical Inconsistencies and Divergences in LSTM and LayerNorm: Multiple issues report severe numerical discrepancies in nn.LSTM and nn.LayerNorm operations across CPU, CUDA, and MPS backends, including NaN outputs, 100% to 200% relative error divergences, and collapsed output logits when moving models between devices. These inconsistencies undermine inference reliability and highlight backend-specific bugs or compiler optimization issues.
- Segmentation Faults and Crashes in CUDA and CPU Operators: Several segmentation faults occur in CUDA and CPU operators such as torch.ops.aten.lstm, torch.ops.aten.gru, and torch.fbgemm_linear_quantize_weight, often triggered by invalid input shapes, device placement errors, or unsupported backend usage. These crashes cause process termination without clear error messages, complicating debugging and usage.
- issues/173476, issues/173623, issues/173944, issues/173946, [issues/173495](https://github.com/issues/173495)
- Compilation and Build Failures Related to Triton, ROCm, and CPU Inductor: Multiple issues report compilation errors including nvcc exit status 255 during Triton kernel builds, stricter Clang visibility defaults causing undefined symbols with ROCm 7.2, and C++ compilation errors in CPU Inductor backend with PyTorch 2.10. These failures block builds and tests, requiring workarounds or fixes in build scripts and source code.
- Memory Leaks and Inefficient Memory Management in Distributed and CUDA Contexts: Issues include tensors wrapped by torch.distributed._coalescing_manager not being released properly causing memory leaks, excessive CUDA memory usage during matrix multiplication with high-dimensional tensors, and requests to improve torch.cuda.empty_cache() to forcibly clear GPU cache without manual tensor deletion. These problems lead to out-of-memory errors and inefficient resource utilization.
- issues/173772, issues/173904, [issues/173382](https://github.com/issues/173382)
- Inconsistent Behavior and Bugs in torch.compile and Inductor Backend: Several issues describe bugs in torch.compile and the Inductor backend including silent failures in torch.jit.script with inplace bitwise AND, numerical instability producing NaNs for LayerNorm with large inputs, incorrect pin_memory flag preservation, and redundant cloning due to aliasing detection failures. These affect correctness, performance, and user experience during model compilation.
- issues/173492, issues/173793, issues/173939, [issues/173781](https://github.com/issues/173781)
- Distributed and Parallelism Issues Including Deadlocks and Synchronization Problems: Problems include dist.reduce_scatter failing on non-contiguous outputs, deadlocks in CUDA LSTM calls, asynchronous all_reduce on XPU backend not synchronizing properly, and stuck calls in torch.distributed.new_group with local synchronization enabled. These issues cause hangs, errors, or undefined behavior in distributed training setups.
- issues/173362, issues/173476, issues/173897, [issues/173608](https://github.com/issues/173608)
- Test and Runtime Failures Related to Triton Backend and Kernel Limits: Several tests in the Triton backend fail due to kernel shared memory limits being exceeded, illegal memory access errors after Flash Attention submodule upgrade, and compilation errors in Triton kernel tests after updating to the latest Triton trunk. These failures impact GPU kernel execution and test stability.
- issues/173765, issues/173953, [issues/173795](https://github.com/issues/173795)
- Documentation and Usability Improvements Requested: Requests include clarifying symmetric memory documentation, adding support for forward hooks with torch.compile(fullgraph=True), improving torch.compile autotuning to avoid redundant compilations, and enabling dynamic registration of components in TORCH_LOGS for third-party backends. These aim to enhance developer experience and framework extensibility.
- issues/173514, issues/173452, issues/173642, [issues/173759](https://github.com/issues/173759)
- Numerical and API Bugs in Special Functions and Quantized Models: Issues include incorrect outputs from torch.special APIs with uint16 on CUDA, mode calculation errors in torch.distributions.Kumaraswamy causing NaNs, and quantized TorchScript VisionTransformer models failing on ARM due to missing operator implementations. These bugs affect model correctness and deployment on specific hardware.
- issues/173636, issues/173912, [issues/173907](https://github.com/issues/173907)
- DeviceMesh and DTensor Functionality Bugs: Bugs include slicing multiple DeviceMesh instances producing identical hashes causing mapping overwrites, missing batching rules for aten::_weight_norm causing performance drops, and missing propagation of grad_dtype in DTensor layout transformations, all impacting distributed tensor operations and performance.
- issues/173789, issues/173802, [issues/173990](https://github.com/issues/173990)
- Miscellaneous Bugs Including Typos, Missing Modules, and API Errors: Various issues report a typo in function naming, missing setuptools module on Python <3.12 causing test failures, NameError due to undefined variables, and silent failures in torch.export due to data-dependent control flow, highlighting maintenance and compatibility challenges.
- issues/173643, issues/173823, issues/173924, [issues/173915](https://github.com/issues/173915)
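The DeviceMesh item above mentions distinct slices hashing identically and overwriting one another in a mapping. The general mechanism can be sketched in plain Python: when two objects compare equal and share a hash, a dict treats them as the same key. The classes below are hypothetical stand-ins and do not reflect how DeviceMesh is actually implemented; the field names dim_names and parent_id are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CollidingSlice:
    """Hypothetical key whose equality and hash ignore the parent mesh."""

    dim_names: tuple
    parent_id: int = field(compare=False)  # excluded from __eq__ and __hash__


@dataclass(frozen=True)
class DistinctSlice:
    """Same fields, but the parent mesh participates in equality and hash."""

    dim_names: tuple
    parent_id: int


# Two slices taken from different parent meshes collapse into one dict entry.
colliding = {}
colliding[CollidingSlice(("dp",), parent_id=1)] = "group-A"
colliding[CollidingSlice(("dp",), parent_id=2)] = "group-B"

# Including the parent in the key keeps the entries separate.
distinct = {}
distinct[DistinctSlice(("dp",), parent_id=1)] = "group-A"
distinct[DistinctSlice(("dp",), parent_id=2)] = "group-B"
```

In the first mapping the second assignment silently replaces the first, which is the kind of overwrite the issue summary describes.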
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 63
Summarized Issues:
- Torch.compile and Vmap Runtime Failures: Using torch.compile with torch.vmap on CUDA tensors causes runtime failures due to unsupported .item() calls within the vmap context, leading to data-dependent errors. This issue highlights limitations in handling indexing operations under these combined features.
- Cudagraph Safety and Graph Partitioning Bugs: An unbacked symbolic integer from a cudagraph-unsafe custom op incorrectly propagates into cudagraph-safe partitions during graph partitioning, causing data-dependent shapes that violate cudagraph safety guarantees. This bug undermines the correctness of graph partitioning when torch._inductor.config.graph_partition is enabled.
- Test Instantiation Failures for PrivateUse1 Backend: Using instantiate_device_type_tests() with only_for or except_for parameters for PrivateUse1 backends results in only the first call working correctly, while subsequent calls fail to instantiate test classes. This causes incomplete test coverage for PrivateUse1 devices.
- ONNX Export Metadata Loss Regression: Exporting models with torch.onnx.export using dynamo=True and opset version 20+ causes GridSample nodes to lose all metadata properties, unlike earlier opset versions where metadata was preserved. This regression results in loss of important source information during export.
- Inconsistent Default Buffer Sizes: There is an inconsistency in default buffer size values between the Flight Recorder and ProcessGroupNCCL components, leading to potential mismatches in buffer management. This discrepancy may affect performance or correctness in distributed communication.
- Inductor Backend FakeTensor Reshape Failure: torch.compile with Inductor backend fails during FakeTensor meta evaluation when reshaping with view(b, -1), raising a stride-related ValueError. This contrasts with eager mode where the operation succeeds, indicating a backend-specific limitation.
- Compiler Error Due to Missing Semicolon: A missing semicolon at the end of a TORCH_CHECK statement in activation.cpp causes potential compiler errors, preventing successful C++ backend builds. This is a straightforward syntax issue affecting build stability.
- DTensor Gradient Inconsistency with Partial Placement: Using DTensor.to_local() followed by DTensor.from_local() on a DTensor with Partial placement produces correct forward results but inconsistent gradients in backward passes compared to using the DTensor directly. This raises concerns about gradient correctness and suggests the need for additional implementation checks.
- Level Zero Backend Out of Memory Error on Intel GPU: A RuntimeError occurs when calling .item() on an Intel Arc Pro B50 GPU due to a UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY during device-to-host memory copy with the Level Zero backend. This indicates memory management issues on this hardware.
- PrivateUse1 Tensor Assignment Limitations: Assigning a privateuse1 tensor to a CPU tensor's data attribute is problematic due to limitations of the deprecated set_data method. Discussions include making VariableHooks non-final to allow overriding set_data for better support.
- Inductor Regional Backend Fails on Effectful Higher-Order Operators: The regional inductor backend does not support higher-order operators with effects, causing errors due to multiple passes through aot_autograd and lack of handling for effectful operations like with_effect. This limits compilation of effectful code.
- ONNX Exporter Produces Incorrect Input Names with Dynamo: Exporting a trivial model to ONNX using the Dynamo exporter results in incorrect input names ("output_samples"), whereas the legacy exporter produces correct names. This causes confusion and potential downstream issues with exported models.
- Documentation Typo in torch_compiler Export: The phrase "different outputs" in the export.md documentation should be "different inputs" to match the code in torch/export/__init__.py. This typo may mislead readers about the exporter's behavior.
- Inconsistent Random Tensor Generation in Distributed Setup: Using torch.distributed.tensor.randn with fixed seeds produces inconsistent random tensors between single-GPU and multi-GPU setups, breaking reproducibility and complicating debugging.
- Inductor Unit Test Fudge Factors Need Adjustment: Recent improvements in math backend precision for float16 attention cause existing Inductor unit tests to fail, necessitating readjustment of fudge factors originally calibrated for lower precision.
- Foreach Norm and Max Functions Produce Incorrect Results: The foreach_norm function with ord=inf returns unexpected tensor values on empty inputs instead of errors, and foreach_max produces incorrect maximum values due to improper initialization on CUDA tensors. Both issues affect correctness of batched tensor operations.
- AOTAutograd Graph Compilation Crashes with Effectful Ops: Combining custom effectful operations with flex_attention causes crashes during AOTAutograd graph compilation due to failure in unlifting effect tokens for subgraphs without tokens. This blocks compilation of certain effectful models.
- Dynamo Error Messages Misleading for Nested Contexts: Dynamo error messages incorrectly point to the context manager line rather than the actual error line inside nested functions, making debugging difficult when errors occur deeply nested within contexts.
- PyTorch Memory Visualizer Zoom Broken by D3 Update: The Memory Visualizer's zoom functionality is broken due to a JavaScript TypeError caused by removal of d3.event in newer D3 versions, rendering the tool partially unusable after a CDN source change.
- Missing SHA256 Hashes for Some Artifacts: Several artifacts, including xpu builds, lack sha256 hashes on the PyTorch download index, extending a previously fixed problem that affected only cpu builds for versions >= 2.9.1. This impacts artifact verification.
- ONNX Export Fails with Custom Autograd Functions Using Saved Variables: Exporting models with custom torch.autograd.Function classes that access ctx.saved_variables in backward pass fails with unsupported operation errors when using torch.onnx.export(..., dynamo=True). This blocks exporting certain video restoration models.
- Multiple XPU Tests Disabled Due to Consistent Failures: Numerous tests on the XPU platform, including various nn.functional conv, rms_norm, max_unpool2d, and NLLLoss tests, are disabled due to consistent failures on the main branch, indicating stability issues on XPU hardware.
- Incorrect Results from torch.foreach_copy with Mixed Dtypes: The torch._foreach_copy_ function produces incorrect results when copying tensors to a list of destination tensors with mixed data types on CUDA, due to only checking the first destination tensor's dtype. This causes data corruption in batched copy operations.
- MacOS Torch Wheel Tag Mismatch in Release 2.10: MacOS torch wheels in release 2.10 have incorrect wheel tags causing installation problems, with a proposed fix to correct tags and republish binaries with revised naming.
- Torch 2.9.0 CUDA 12.9 Installation Fails Due to CDN Cache: Installing torch 2.9.0 for CUDA 12.9 using uv fails with a hash mismatch error caused by CDN caching issues after binaries were rebuilt and republished; the problem was resolved by invalidating the CDN cache.
- torch.addr CPU and GPU Implementations Diverge on Overflow: The CPU implementation of torch.addr returns infinity due to intermediate overflow with int64 and float16 inputs, while the GPU returns correct finite float32 results, causing inconsistent behavior across backends.
- CUDA Error on Unsupported NVIDIA GeForce RTX 5070 Ti: PyTorch crashes with a CUDA error indicating no kernel image is available for the NVIDIA GeForce RTX 5070 Ti (sm_120), requesting added support for this GPU in CUDA builds.
- torch._stack Crashes on Empty Tensor Input: Calling torch._stack with an empty tensor causes a segmentation fault (SIGSEGV) crash instead of raising a catchable exception, leading to instability.
- torch.floor_divide Crashes on int64 Min Divided by -1: Using torch.floor_divide on CPU with int64 inputs crashes the Python interpreter with a Floating Point Exception (SIGFPE) when dividing the minimum 64-bit integer by -1 due to unhandled integer overflow.
- Numerical Overflow Differences in nn.Conv2d CPU vs CUDA: CUDA and CPU implementations of nn.Conv2d differ in overflow behavior near float32 limits; CUDA outputs partially finite values while CPU outputs overflow to infinity, causing numerical inconsistency.
- Typo in Test Name for Distributed Tensor Debug Mode: The test name test_hash_empty_tenor should be corrected to test_hash_empty_tensor in test/distributed/tensor/debug/test_debug_mode.py.
- nn.Conv2d CUDA Backend Produces NaNs for Near-Limit Inputs: The CUDA backend of nn.Conv2d produces NaN and infinite values for inputs near float32 limits, while the CPU backend produces valid finite results with identical inputs and weights, indicating numerical stability issues.
- Torch Package Missing Numpy Dependency Without torchvision: Installing the torch package without torchvision does not automatically install numpy, causing torch to fail loading due to missing numpy module.
- Segmentation Fault in matrix_exp_backward with Scalar Input: torch.ops.aten.matrix_exp_backward crashes with a segmentation fault when given a scalar tensor instead of a matrix tensor, lacking proper error handling.
- Regression in AOTInductor Model Loading in PyTorch 2.10.0: Loading model packages with AOTInductor fails due to an AttributeError from missing or inaccessible 'codecache' attribute in torch._inductor, a regression from version 2.9.1.
- FSDP2 Backward Pass Runtime Error with Scaled Dot Product Attention: Using Fully Sharded Data Parallel v2 with torch.nn.functional.scaled_dot_product_attention causes a storage size mismatch error during backward pass when loss depends only on inputs, with a workaround to disable resharding after forward.
- torch.reshape Crashes on Very Large Negative Dimension: torch.reshape crashes with a runtime error when given a very large negative input dimension due to unexpected shape argument type.
- HF Cache on B200 Causes vLLM Job Issues: Enabling the HF cache on B200 causes vLLM jobs to automatically detect and use the cache, requiring a forward fix to tests to avoid rollback complications.
- Lack of Documentation for _lazy_clone C++ API Method: The non-public C++ API method _lazy_clone lacks documentation despite its presumed role in deferring tensor cloning operations, raising questions about its usage.
- InstanceNorm ONNX Export Warning with track_running_stats=False: Exporting models with InstanceNorm in eval mode and track_running_stats=False triggers warnings about training mode during ONNX export, unlike when track_running_stats=True, indicating export flag handling issues.
- torch.linalg.cholesky_ex CPU and CUDA Backend Discrepancy: Given infinite inputs, CPU backend returns infinity with success, while CUDA silently produces NaNs but also indicates success, causing inconsistent and unsafe numerical results.
- torch.nn.functional.pdist CPU and CUDA Output Inconsistency: For p=0 and inputs with infinite values, CPU backend propagates NaNs from inf - inf operations, but CUDA returns finite integer values, leading to inconsistent outputs.
- c10d Distributed Operators Fail on MPS CPU Fallback: Using CPU fallback on MPS devices with PYTORCH_ENABLE_MPS_FALLBACK=1 causes silent failures in distributed operators due to asynchronous operation handling and improper CPU-MPS tensor copying, resulting in incorrect broadcast results.
- MI300 CI Node Migration Causes Job Queue Delays: Migration of MI300 continuous integration nodes to a new cloud provider temporarily increased queue times and prevented job runs until migration completed.
- TCPStore Binds to All IPv6 Addresses Instead of Specified IPv4: TCPStore listens on all IPv6 addresses (::) rather than the specified IPv4 localhost (127.0.0.1), potentially causing unintended security risks by exposing services on all interfaces.
- Duplicate Jinja2 Package Entry Removed from CI Requirements: A redundant duplicate Jinja2 package entry with different casing was removed from .ci/docker/requirements-ci.txt to eliminate unnecessary CI dependency redundancy.
- torchInductor and Eager Backend Output Discrepancy on fractional_max_pool2d: Significant output mismatches occur between torchInductor and eager backends when using aten.fractional_max_pool2d in PyTorch 2.6.0, causing assertion failures.
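The torch.floor_divide crash listed above comes from a well-known corner of two's-complement arithmetic: negating the smallest int64 gives a value one larger than the largest int64, so the quotient cannot be represented. Python integers are unbounded, which makes the arithmetic easy to inspect. The sketch below shows the out-of-range result and one possible guard; checked_floor_divide_int64 is a hypothetical helper and not the fix that landed in PyTorch.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def checked_floor_divide_int64(a: int, b: int) -> int:
    """Floor-divide two int64-range values, rejecting unrepresentable results."""
    if not (INT64_MIN <= a <= INT64_MAX and INT64_MIN <= b <= INT64_MAX):
        raise ValueError("operands must fit in int64")
    if b == 0:
        raise ZeroDivisionError("division by zero")
    if a == INT64_MIN and b == -1:
        # The exact quotient is 2**63, one past INT64_MAX.
        raise OverflowError("INT64_MIN // -1 does not fit in int64")
    return a // b


# With unbounded integers the true quotient is visible and out of range.
true_quotient = INT64_MIN // -1
fits = INT64_MIN <= true_quotient <= INT64_MAX

# The guarded version turns the hardware trap into a catchable exception.
try:
    checked_floor_divide_int64(INT64_MIN, -1)
    raised = False
except OverflowError:
    raised = True
```

Raising a catchable exception in this one case is the behaviour the summary implies users expected instead of a SIGFPE.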
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 301
Key Open Pull Requests
1. Handle List/Dict Comprehension Graph Breaks for Python3.12+: This pull request addresses the changes in Python 3.12 where list and dict comprehensions are inlined into their surrounding functions by enhancing PyTorch Dynamo to correctly handle graph breaks within these comprehensions through bytecode analysis, checkpointing, and selective tracing, thereby ensuring accurate tracing and resumption of execution across various complex comprehension scenarios and edge cases.
- URL: pull/173558
- Associated Commits: 56724, c55d7, 5757c, a0eb3, 084e5, 13b84, 9a755, 7ac14, 48d2c, 98374, 9a9db, b31da, dd72d, b4f59, 74fe6, 6c567, 63fd0, a4d89, 8547f, 2129b, c4b09, 4ce41, 3a27d, 8895e, ba25a, d3811, 6cb43, 00a98, 2ee74, 960e9, 1342d, ee671, cf5bd, 411d5, 89182, 3909a, 2b02f, d5fdd, d3015, cb56d, a4524, 8cc33, edc3f, 29785
2. Add interactive glossary with hover tooltips: This pull request adds an interactive glossary to the PyTorch documentation featuring hover tooltips for over 15 PyTorch-specific terms, integrates these tooltips across multiple documentation files, updates dependencies and configuration files to support the new functionality, and incorporates the glossary into the main documentation navigation.
- URL: pull/173390
- Associated Commits: ad6b2, 3ddcb, 385c7, 19981, b489d, b0659, ed27a, 60f68, 4d439, 404f8, 0c5d0, dad4b, eeb35, c0d5e, b16da, 6276e, 864a4, 2bc3d, e7fd0, 2282a, 112c0, f9ee7, 361b3, 09845, eca66, b8207, 93919, 260cc, 21e6e, 2f8bf, f06c5, f07c3, 03fd9, e08d2, 36790
3. [dynamo][claude] Dynamo profiler: This pull request introduces a Dynamo-native profiler that operates at the tracing layer to measure the time spent by Dynamo while tracing individual Python functions, providing improved visibility into expensive user functions and polyfill invocations to better diagnose and optimize compile-time performance issues, while maintaining compatibility with existing Python profiling tools like pstats and snakeviz.
- URL: pull/173942
- Associated Commits: e9010, fa138, f1a90, 53e34, aa637, 4dc4c, f1038, e59ad, 07c75, 7a249, 202f4, d568e, e0e22, 7bf23, baa5d, 085db, 2c419, 2e2a3, 7aedb, a02e3, 57b9e, ac310
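The profiler pull request above mentions compatibility with pstats-style tooling. For readers unfamiliar with that format, the sketch below uses only the standard library's cProfile to collect a profile of an ordinary Python function and pstats to sort and print it. It does not use or represent the Dynamo profiler itself; slow_sum is a hypothetical workload chosen for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately simple workload so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect timing data for a single call.
profiler = cProfile.Profile()
profiler.enable()
slow_sum(10_000)
profiler.disable()

# Render the five most expensive entries by cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Any tool that emits data loadable by pstats.Stats can be sorted and filtered the same way, which is presumably the appeal of keeping the new profiler compatible with it.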
Other Open Pull Requests
- Consolidation of VariableTracker Construction: Multiple pull requests consolidate the construction of VariableTracker objects across various PyTorch modules by routing direct variable creation through centralized builders like SourcelessBuilder.create() or VariableBuilder when transaction context is available. These changes use local imports to avoid circular dependencies and address step 1 of a related issue, while leaving some static handler methods unchanged where transaction context is unavailable.
- Inductor Backend and Test Updates: Pull requests re-enable Inductor X86 backend test cases removed during PT2E migration by updating them to avoid the PT2E API and propose allocating bucket memory from the process group to improve overlap handling. These changes restore test functionality and aim to optimize backend memory management.
- Platform-Specific and Hardware Support Enhancements: Several pull requests improve support for specific hardware and platforms, including enabling dlpack tests for Intel GPU with XPU support, adding rocSHMEM support on ROCm, introducing lazy Intel Level Zero dependency for XPU builds, and addressing ROCm MI350 graph break debugging. These changes enhance compatibility and stability across diverse hardware environments.
- Performance and Compilation Improvements: Pull requests propose allowing eager evaluation of certain Dynamo functions to reduce compile time by about 1.8 seconds and add profiler utilization annotations with FLOPS and bandwidth metrics to the Inductor backend. These enhancements improve compilation efficiency and provide detailed performance analysis capabilities.
- API and Type Handling Enhancements: A pull request enables device-specific Event classes to accept both generic and device-specific Stream inputs, resolving stricter API type requirements and conversion issues between generic and backend-specific events. This change improves API flexibility and usability across devices.
- CI and Build Infrastructure Updates: Pull requests add blockwise FP8 support for scaled_mm_v2 on XPU, introduce ppc64le wheel building support in CI/CD pipelines, and add the torchfuzz test to the CI pipeline with ongoing updates for test management. These changes expand hardware support and improve testing infrastructure.
- Code Quality and Compiler Warning Fixes: A pull request fixes MSVC compiler warning C4267 by adding explicit static casts to ensure type consistency when all warnings are enabled. This improves code robustness and compiler compliance.
- Dynamic Shape Support in Linear Algebra Operations: A pull request fixes dimension-dependent errors in 18 linear algebra operations by replacing direct dimension comparisons with runtime validation and handling unbacked symbolic dimensions properly. This enables these operations to support dynamic shapes effectively.
- Environment Configuration Refactor: One pull request converts environment variable configuration logic from a shell script to a Python-based EnvironmentConfig within the Lumen CLI, improving management, display, export, and verification of environment variables for PyTorch test builds.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 336
Key Closed Pull Requests
1. Skip the distributed tests which were previously disabled for 2.8: This pull request proposes skipping the distributed tests that were previously disabled for the PyTorch 2.8 release branch, addressing related issues and coordinating with multiple contributors.
- URL: pull/173365
- Associated Commits: 71a30, 8fe04, 18a50, 19367, 4b463, 1befb, fad6b, 2f824, 1963d, 30252, 3d102, cb987, 85ac5, 62c67, 86e58, 2074e, 2b25d, ca125, 96009, d568c, b26dd, 53829, 7b590, 61c07, 730c7, fb814, eb343, 9f118, ecc20, 1b442, cdfe1, 2d72f, a0ffd, 22d02, ed0d0, d010d, 9c429, ccdb1, ad6b8, 77a67, ade02, e96dc, 2975e, b4af4, 1d7b9, 2067a, eb471, d2d97, c3d28, 4febb, 419fb, ab27a, 0def0, 75c80, c03be, 64359, b2fb6, 8d179, fd4b1, c1404, 1a9ca, b2d45, 7b2a4, 6aaab, 0e570, 9a46f, 9596b, 9ea02, 675f8, db3ba, aeb64, a20c7, 66514, 0b82d, bd740, dfd38, 245bf, 2cd73, cbd27, fe1f5, b2b16, 336f2, 7a520, 71347, d631b, 2ce89, 330f5, b067d, a3546, 36586, 93dd5, 57296, 63e52, cba8b, 05f24, 1a24a, fa544, cc1d0, 9c53f, cbaa7, 393ae, bf943, 05fef, fd9c5, 07086
2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request proposes and implements the InputObserver feature to automatically infer dynamic shapes for torch.export.export and torch.onnx.export by analyzing multiple input sets with varying dimensions, addressing the complexity of handling nested input structures like DynamicCache.
- URL: pull/172838
- Associated Commits: 39f8f, 066b8, 1d998, c0fbd, 21b9f, 9b5cf, 52209, 50b08, f66ac, e3c59, 3bf4f, 70e3f, a7524, fa4bb, 5cccd, a74c0, a1e3a, 9324a, 7ab9d, d6760, 00fd4, 04685, ecf27, 46db0, 9f3a9, edd6c
3. [DTensor] Optimize redistribute comms using flattened meshes: This pull request optimizes DTensor's redistribution communications by detecting and using flattened device meshes when available to reduce costly sequential collective operations, particularly improving reduce communications to avoid divergent results from different reduction orders, while also adding support for comms beyond all_reduce, banning mixed partial placements, simplifying implementation through grouping and merging transform infos, issuing warnings for missing flattened meshes, and addressing various limitations and edge cases to enhance performance and correctness.
- URL: pull/172610
- Associated Commits: 0133e, f46a1, 30e46, cf97d, 9c878, fa535, f7336, 36655, 4d734, 6e501, 27ec0, 5e38d, c5f81, 5bf75, 25d94
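The idea behind the second key pull request above, guessing dynamic dimensions by comparing several sets of inputs, can be sketched in plain Python. The helper names below are hypothetical and this is not the `InputObserver` implementation: it handles only flat name-to-shape mappings (no nested structures such as DynamicCache) and simply marks a dimension as dynamic when its size differs across observations.

```python
def guess_dynamic_dims(observed_shapes):
    """Return the set of dimension indices whose size varied across observations."""
    if not observed_shapes:
        return set()
    rank = len(observed_shapes[0])
    if any(len(shape) != rank for shape in observed_shapes):
        raise ValueError("all observations must have the same rank")
    return {
        dim
        for dim in range(rank)
        if len({shape[dim] for shape in observed_shapes}) > 1
    }


def guess_all(observations):
    """observations: list of dicts mapping argument name -> shape tuple."""
    names = observations[0].keys()
    return {
        name: guess_dynamic_dims([obs[name] for obs in observations])
        for name in names
    }
```

A dimension that happens to keep the same size in every observed input set is treated as static by this approach, so the result is only as good as the variety of inputs observed.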
Other Closed Pull Requests
- Norm computation strategy updates: This set of pull requests introduces a new S->P(sum) strategy for the linalg_vector_norm function when skip_root=True and updates norm strategies for inf/-inf/0/1 norms to use Partial(max/min/sum) placements instead of NormPartial with a reduce_op. These changes remove the reduce_op field and simplify norm computations to avoid problematic sqrt→pow→sqrt cycles.
- MPS backend test adjustments: These pull requests propose skipping or marking as expected failures certain tests in the MPS backend, including test_non_standard_bool_values and some OpInfo tests, due to inconsistent results across different platforms. The goal is to improve test reliability and remove associated expected failure markers (xfails).
- NCCL and communication improvements: This pull request implements NCCL 2.29 one-sided APIs for symmetric memory, including updates to nccl_extension.cu, signal methods, and test additions, while addressing feedback and fixing compilation errors. Another related pull request forces saving of torchcomms outputs in the functorch partitioner to ensure backward operations have access to forward tensor outputs, preventing invalid partition dependencies.
- Size hint and optimization hint migration: This pull request migrates remaining calls to size_hint that already pass fallback to use optimization_hint by applying the size hint atomically at those call sites without changing the handling of unbacked cases. This prepares for further handling of unbacked call sites in subsequent updates.
- ProcessGroup and FakeScriptObject improvements: This pull request modifies the ProcessGroup class to use an abstract base class (ABC) metaclass, enabling the registration of FakeScriptObject as a virtual subclass. This allows correct behavior of isinstance checks for tracing when dealing with FakeScriptObjectStack instances.
- Serialization enhancements for GraphModule: This pull request replaces the __reduce__-based serialization in SerializedGraphModule with GraphPickler to directly serialize and reconstruct graph structures. This enables support for HigherOrderOperators that the FX tracer cannot handle properly by adding specialized pickling support and updating serialization methods accordingly.
- CUDA memory snapshot speedup: This pull request adds an option to the CUDA memory snapshot functionality to skip collecting the full trace entry history while still capturing the current memory state. This results in significant speedups—up to thousands of times faster—when taking snapshots with large numbers of trace entries.
- DTensor debug and redistribution fixes: These pull requests enhance the DTensor debug mode by enabling it to print optimized transform information and fix a crash caused by an assertion failure in the DTensor redistribution planner. The fix allows non-participating ranks to safely exit early during redistribution cost computation, ensuring consistent DTensor property queries across all ranks.
- Circular dependency bug fix in constant folding: This pull request fixes a circular dependency bug in the constant_fold_uniform_value function that caused stable_topological_sort to fail with an assertion error. It adds a check to skip replacements that would create cycles involving sym_size_int nodes and full() nodes.
- Automated triage workflow for GitHub issues: This pull request introduces an automated triage workflow using a skill-based system that applies predefined labels and canned responses to GitHub issues via GitHub Actions. It leverages a static label list and the sonnet-4.5 model to improve issue classification and management.
- ONNX exporter dynamic shape inference update: This pull request adds a parameter to the InputObserver.infer_dynamic_shapes method in the ONNX exporter to allow forcing the first dimension of input tensors to be treated as dynamic. This improves flexibility in dynamic shape inference even when the dimension is not present in a given set of inputs.
- Inductor backend NVGEMM support: These pull requests add support for scaled matrix multiplication (mm) and Groupgemm operations using NVGEMM within the Inductor backend. These enhancements improve the backend's capability to handle various matrix multiplication scenarios.
- MAGMA backend deprecation for SVD: This pull request deprecates the MAGMA backend for singular value decomposition (svd) and unconditionally dispatches the operation to the cuSOLVER backend instead.
- Hugging Face cache enablement in CI: This pull request proposes enabling the Hugging Face (HF) cache across all continuous integration (CI) jobs to locally store HF content for faster access. It includes a mechanism to refresh the cache via a special PR label and daily updates tied to the vLLM pin update process.
- Shallow copy support for privateuse1 backend: This pull request introduces a new function to enable shallow copying between CPU and the privateuse1 backend. It enhances documentation and provides an example to support this previously unsupported operation.
- Intel Triton commit update and fixes: This pull request updates the Intel Triton commit pin within the [xpu][inductor] components, including related fixes such as unskipping a specific test and addressing lint errors.
- Unbacked tensor dimension testing: This pull request adds a new test suite file, test_ops_unbacked.py, which marks tensor dimensions of size two or greater as unbacked for all OpInfo entries and attempts full graph compilation. It detects framework data-dependent errors while maintaining a list of known failing operations due to such errors.
- DeviceContext mode stack invariant enforcement: This pull request ensures that the DeviceContext maintains the invariant of having only one mode on its stack at any given time.
- ROCm gfx950 GPU test fixes: This pull request addresses and implements test skips and fixes for unit test failures specific to the gfx950 GPU architecture in the ROCm CI environment. These changes ensure stable and accurate continuous integration results.
- Static Triton kernel launcher for XPU: This pull request proposes enabling the static Triton kernel launcher for the XPU backend and reusing the corresponding unit tests to support this feature.
- Global kernel cache for cutlass_api: This pull request implements a global kernel cache built at first use to avoid repeated expensive calls to cutlass_api.get_kernels(). This results in significant runtime improvements, including up to a 43% speedup in end-to-end latency benchmarks for workflows involving multiple GEMM operations.
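The "cache built at first use" pattern in the last item above can be illustrated with the standard library alone. In this sketch, `expensive_get_kernels` is a hypothetical stand-in for a costly enumeration call (it is not `cutlass_api.get_kernels()`), and the kernel names are made up; the pull request's actual cache may be structured differently.

```python
import functools

# Counts how often the expensive call really runs (for demonstration only).
CALLS = {"count": 0}


def expensive_get_kernels():
    """Hypothetical stand-in for an expensive kernel enumeration call."""
    CALLS["count"] += 1
    return ("gemm_a", "gemm_b")


@functools.cache
def get_kernels_cached():
    """Run the expensive call on first use, then return the stored result."""
    return expensive_get_kernels()
```

Every call after the first returns the same stored tuple, so a workflow that looks up kernels for many GEMM operations pays the enumeration cost only once per process.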
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| wconstab | 142 | 20 | 0 | 36 |
| malfet | 84 | 21 | 2 | 79 |
| NikhilAPatel | 141 | 35 | 0 | 0 |
| pianpwk | 144 | 18 | 0 | 9 |
| ydwu4 | 151 | 15 | 1 | 2 |
| laithsakka | 133 | 20 | 0 | 11 |
| bobrenjc93 | 121 | 31 | 0 | 7 |
| BenjaminDEMAILLE | 128 | 1 | 0 | 9 |
| kurtamohler | 112 | 3 | 1 | 3 |
| eellison | 54 | 17 | 1 | 30 |