Weekly GitHub Report for PyTorch: January 25, 2026 - February 01, 2026 (21:36:39)

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.

Table of Contents

I. News
1.1. Recent Version Releases
1.2. Version Information

II. Issues
2.1. Top 5 Active Issues
2.2. Top 5 Stale Issues
2.3. Open Issues
2.4. Closed Issues
2.5. Issue Discussion Insights

III. Pull Requests
3.1. Open Pull Requests
3.2. Closed Pull Requests
3.3. Pull Request Discussion Insights

IV. Contributors
4.1. Contributors

I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Anaconda channel packages.

II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be the issues that have been commented on most frequently within the last week. Bot comments are omitted.

[MODULE: ONNX] [TRIAGED] [ONCALL: PT2] [ONCALL: EXPORT] Exporting ONNX model with captum heatmaps generation: This issue describes a problem with exporting a PyTorch model to ONNX format when the model generates Captum heatmaps as outputs; the exported heatmaps remain constant regardless of the input, likely due to some parameters being converted to constants during export. The user also encounters errors when attempting to use dynamic shape export with torch.export.export, particularly related to the use of torchvision.transforms.v2 and issues with tracing these transforms, suggesting that certain dynamic export features or transform modules may not yet be fully supported.  

The comments discuss attempts to resolve the issue by enabling dynamo and dynamic shapes during export, revealing errors related to argument naming and module tracing failures, especially with torchvision.transforms.v2; a runnable repro code snippet was shared, and it was confirmed that the problem likely stems from current limitations or bugs in torch dynamo and dynamic export support for certain transform modules, with suggestions to avoid certain calls or await further support.
Number of comments this week: 7

[HIGH PRIORITY] [MODULE: DOCS] [TRIAGED] [MODULE: REGRESSION] [docs] torch.cuda.Stream link broken on 2.10+: This issue reports that the documentation link for torch.cuda.Stream is broken starting from version 2.10, causing users to encounter non-working URLs when trying to access this part of the PyTorch docs. The reporter requests either fixing the broken link or preventing users from being routed to the invalid URL to improve documentation accessibility.  

The comments discuss identifying the new correct documentation page and consider setting up redirects from the old broken links to the new ones. They also explore the potential impact of broken links on users and check for any recent spikes in 404 errors, concluding that the issue might be isolated but monitoring will continue.
Number of comments this week: 6

[MODULE: CUDA] [TRIAGED] [MODULE: NANS AND INFS] [MODULE: LINEAR ALGEBRA] [MODULE: CORRECTNESS (SILENT)] torch.linalg.slogdet does not propagate NaN in CUDA: This issue reports a discrepancy in the behavior of torch.linalg.slogdet between CPU and CUDA implementations when the input tensor contains NaN values: the CPU version correctly propagates NaN in the output, while the CUDA version treats NaNs as zeros and produces a finite result, which can mask data corruption. The problem appears to stem from the underlying cuSOLVER library used in CUDA, affecting other linear algebra operations like LU factorization and matrix inversion, and has been confirmed by the cuSOLVER team internally.  

The comments discuss evidence showing that CUDA's LU factorization treats NaNs as zeros, leading to incorrect results, and confirm that this behavior is consistent across other related operations; an internal bug report has been filed with the cuSOLVER team who can reproduce the issue, and users share additional examples and documentation highlighting the unexpected behavior.
Number of comments this week: 6

[ONCALL: DISTRIBUTED] [MODULE: DEVICEMESH] [RFC] Abort a DeviceMesh: This issue proposes adding a method to the DeviceMesh class that allows aborting all associated process groups concurrently when a rank exits prematurely, preventing the system from hanging. The discussion centers around whether the abort should affect only the submesh's process groups or all process groups in the mesh universe, with suggestions to default to aborting all to avoid inconsistent states and hanging.  

The comments debate the scope of the abort operation, weighing the ambiguity of submesh-specific aborts versus global aborts, and consider design options including an abort_all parameter or restricting aborts to disjoint DeviceMeshes, ultimately emphasizing the need for a clear, user-friendly API that prevents hanging and aligns with existing abort semantics.
Number of comments this week: 6

[MODULE: CUDA] [TRIAGED] [CUDA] MaxPool2d CUDA kernel lacks int64 support → one-side error: This issue reports that the MaxPool2d CUDA kernel in PyTorch does not support int64 (torch.int64) input types, causing a runtime error when attempting to run pooling operations on CUDA devices with integer tensors, while the same operation works on CPU. The problem arises because the CUDA implementation only supports floating-point types for MaxPool operations, and integer support is missing, leading to a one-sided error that affects MaxPool1d, 2d, and 3d on CUDA for int64 inputs.  

The comments discuss the historical lack of 64-bit atomic operations on GPUs as a likely cause, confirm that MaxPool is not supported for integer types on GPU at all, and debate whether to add support despite the drawbacks of increased binary size, non-differentiability, and performance issues; some suggest removing integer support from CPU implementations if GPU support is not feasible.
Number of comments this week: 5

2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week. 
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository. 
Issues Opened This Week: 104
Summarized Issues:

Test Failures and Disabling on XPU and ROCm Platforms: Several tests including test_skip_non_tf32, test_mm_plus_mm3, test_mm_plus_mm3_gpu_wrapper, test_codegen_with_custom_heuristics_module, test_weight_norm_conv2d_xpu, and multiple SDPA and FSDP related tests have been disabled due to consistent failures on the main branch for XPU and ROCm platforms. These issues affect test stability and CI status, prompting temporary skips until fixes or pull requests are merged.  
issues/173336, issues/173344, issues/173352, issues/173473, issues/173916, issues/173994, issues/173712, issues/173713, issues/173714, issues/173715, issues/173717, issues/173761

Numerical Inconsistencies and Divergences in LSTM and LayerNorm: Multiple issues report severe numerical discrepancies in nn.LSTM and nn.LayerNorm operations across CPU, CUDA, and MPS backends, including NaN outputs, 100% to 200% relative error divergences, and collapsed output logits when moving models between devices. These inconsistencies undermine inference reliability and highlight backend-specific bugs or compiler optimization issues.  
issues/173334, issues/173922, issues/173927, issues/173640, issues/173525, issues/174011, issues/173885

Segmentation Faults and Crashes in CUDA and CPU Operators: Several segmentation faults occur in CUDA and CPU operators such as torch.ops.aten.lstm, torch.ops.aten.gru, and torch.fbgemm_linear_quantize_weight, often triggered by invalid input shapes, device placement errors, or unsupported backend usage. These crashes cause process termination without clear error messages, complicating debugging and usage.  
issues/173476, issues/173623, issues/173944, issues/173946, issues/173495

Compilation and Build Failures Related to Triton, ROCm, and CPU Inductor: Multiple issues report compilation errors including nvcc exit status 255 during Triton kernel builds, stricter Clang visibility defaults causing undefined symbols with ROCm 7.2, and C++ compilation errors in CPU Inductor backend with PyTorch 2.10. These failures block builds and tests, requiring workarounds or fixes in build scripts and source code.  
issues/173800, issues/173707, issues/173626, issues/173871

Memory Leaks and Inefficient Memory Management in Distributed and CUDA Contexts: Issues include tensors wrapped by torch.distributed._coalescing_manager not being released properly causing memory leaks, excessive CUDA memory usage during matrix multiplication with high-dimensional tensors, and requests to improve torch.cuda.empty_cache() to forcibly clear GPU cache without manual tensor deletion. These problems lead to out-of-memory errors and inefficient resource utilization.  
issues/173772, issues/173904, issues/173382

Inconsistent Behavior and Bugs in torch.compile and Inductor Backend: Several issues describe bugs in torch.compile and the Inductor backend including silent failures in torch.jit.script with inplace bitwise AND, numerical instability producing NaNs for LayerNorm with large inputs, incorrect pin_memory flag preservation, and redundant cloning due to aliasing detection failures. These affect correctness, performance, and user experience during model compilation.  
issues/173492, issues/173793, issues/173939, issues/173781

Distributed and Parallelism Issues Including Deadlocks and Synchronization Problems: Problems include dist.reduce_scatter failing on non-contiguous outputs, deadlocks in CUDA LSTM calls, asynchronous all_reduce on XPU backend not synchronizing properly, and stuck calls in torch.distributed.new_group with local synchronization enabled. These issues cause hangs, errors, or undefined behavior in distributed training setups.  
issues/173362, issues/173476, issues/173897, issues/173608

Test and Runtime Failures Related to Triton Backend and Kernel Limits: Several tests in the Triton backend fail due to kernel shared memory limits being exceeded, illegal memory access errors after Flash Attention submodule upgrade, and compilation errors in Triton kernel tests after updating to the latest Triton trunk. These failures impact GPU kernel execution and test stability.  
issues/173765, issues/173953, issues/173795

Documentation and Usability Improvements Requested: Requests include clarifying symmetric memory documentation, adding support for forward hooks with torch.compile(fullgraph=True), improving torch.compile autotuning to avoid redundant compilations, and enabling dynamic registration of components in TORCH_LOGS for third-party backends. These aim to enhance developer experience and framework extensibility.  
issues/173514, issues/173452, issues/173642, issues/173759

Numerical and API Bugs in Special Functions and Quantized Models: Issues include incorrect outputs from torch.special APIs with uint16 on CUDA, mode calculation errors in torch.distributions.Kumaraswamy causing NaNs, and quantized TorchScript VisionTransformer models failing on ARM due to missing operator implementations. These bugs affect model correctness and deployment on specific hardware.  
issues/173636, issues/173912, issues/173907

DeviceMesh and DTensor Functionality Bugs: Bugs include slicing multiple DeviceMesh instances producing identical hashes causing mapping overwrites, missing batching rules for aten::_weight_norm causing performance drops, and missing propagation of grad_dtype in DTensor layout transformations, all impacting distributed tensor operations and performance.  
issues/173789, issues/173802, issues/173990

Miscellaneous Bugs Including Typos, Missing Modules, and API Errors: Various issues report a typo in function naming, missing setuptools module on Python <3.12 causing test failures, NameError due to undefined variables, and silent failures in torch.export due to data-dependent control flow, highlighting maintenance and compatibility challenges.  
issues/173643, issues/173823, issues/173924, issues/173915

2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable. 
Issues Closed This Week: 63
Summarized Issues:

Torch.compile and Vmap Runtime Failures: Using torch.compile with torch.vmap on CUDA tensors causes runtime failures due to unsupported .item() calls within the vmap context, leading to data-dependent errors. This issue highlights limitations in handling indexing operations under these combined features.  
issues/172364

Cudagraph Safety and Graph Partitioning Bugs: An unbacked symbolic integer from a cudagraph-unsafe custom op incorrectly propagates into cudagraph-safe partitions during graph partitioning, causing data-dependent shapes that violate cudagraph safety guarantees. This bug undermines the correctness of graph partitioning when torch._inductor.config.graph_partition is enabled.  
issues/172728

Test Instantiation Failures for PrivateUse1 Backend: Using instantiate_device_type_tests() with only_for or except_for parameters for PrivateUse1 backends results in only the first call working correctly, while subsequent calls fail to instantiate test classes. This causes incomplete test coverage for PrivateUse1 devices.  
issues/172764

ONNX Export Metadata Loss Regression: Exporting models with torch.onnx.export using dynamo=True and opset version 20+ causes GridSample nodes to lose all metadata properties, unlike earlier opset versions where metadata was preserved. This regression results in loss of important source information during export.  
issues/172784

Inconsistent Default Buffer Sizes: There is an inconsistency in default buffer size values between the Flight Recorder and ProcessGroupNCCL components, leading to potential mismatches in buffer management. This discrepancy may affect performance or correctness in distributed communication.  
issues/172811

Inductor Backend FakeTensor Reshape Failure: torch.compile with Inductor backend fails during FakeTensor meta evaluation when reshaping with view(b, -1), raising a stride-related ValueError. This contrasts with eager mode where the operation succeeds, indicating a backend-specific limitation.  
issues/172830

Compiler Error Due to Missing Semicolon: A missing semicolon at the end of a TORCH_CHECK statement in activation.cpp causes potential compiler errors, preventing successful C++ backend builds. This is a straightforward syntax issue affecting build stability.  
issues/172901

DTensor Gradient Inconsistency with Partial Placement: Using DTensor.to_local() followed by DTensor.from_local() on a DTensor with Partial placement produces correct forward results but inconsistent gradients in backward passes compared to using the DTensor directly. This raises concerns about gradient correctness and suggests the need for additional implementation checks.  
issues/172932

Level Zero Backend Out of Memory Error on Intel GPU: A RuntimeError occurs when calling .item() on an Intel Arc Pro B50 GPU due to a UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY during device-to-host memory copy with the Level Zero backend. This indicates memory management issues on this hardware.  
issues/172934

PrivateUse1 Tensor Assignment Limitations: Assigning a privateuse1 tensor to a CPU tensor's data attribute is problematic due to limitations of the deprecated set_data method. Discussions include making VariableHooks non-final to allow overriding set_data for better support.  
issues/173021

Inductor Regional Backend Fails on Effectful Higher-Order Operators: The regional inductor backend does not support higher-order operators with effects, causing errors due to multiple passes through aot_autograd and lack of handling for effectful operations like with_effect. This limits compilation of effectful code.  
issues/173024

ONNX Exporter Produces Incorrect Input Names with Dynamo: Exporting a trivial model to ONNX using the Dynamo exporter results in incorrect input names ("output_samples"), whereas the legacy exporter produces correct names. This causes confusion and potential downstream issues with exported models.  
issues/173076

Documentation Typo in torch_compiler Export: The phrase "different outputs" in the export.md documentation should be "different inputs" to match the code in torch/export/__init__.py. This typo may mislead readers about the exporter's behavior.  
issues/173097

Inconsistent Random Tensor Generation in Distributed Setup: Using torch.distributed.tensor.randn with fixed seeds produces inconsistent random tensors between single-GPU and multi-GPU setups, breaking reproducibility and complicating debugging.  
issues/173157

Inductor Unit Test Fudge Factors Need Adjustment: Recent improvements in math backend precision for float16 attention cause existing Inductor unit tests to fail, necessitating readjustment of fudge factors originally calibrated for lower precision.  
issues/173176

Foreach Norm and Max Functions Produce Incorrect Results: The foreach_norm function with ord=inf returns unexpected tensor values on empty inputs instead of errors, and foreach_max produces incorrect maximum values due to improper initialization on CUDA tensors. Both issues affect correctness of batched tensor operations.  
issues/173203, issues/173210
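
The "improper initialization" failure mode described for foreach_max can be illustrated with a plain running maximum. The sketch below is not the CUDA kernel; it assumes, purely for illustration, that the faulty accumulator starts at 0 rather than at negative infinity, and both function names are hypothetical.

```python
import math


def running_max_buggy(values):
    # Assumed bug for illustration: the accumulator starts at 0, which is
    # only a valid starting point when every input is non-negative.
    acc = 0.0
    for v in values:
        acc = max(acc, v)
    return acc


def running_max_fixed(values):
    # -inf is the identity element for max over floats, so any real input
    # replaces it on the first comparison.
    acc = -math.inf
    for v in values:
        acc = max(acc, v)
    return acc


all_negative = [-3.0, -1.5, -7.0]
buggy = running_max_buggy(all_negative)
fixed = running_max_fixed(all_negative)
print(buggy, fixed)
```

With all-negative inputs the wrongly seeded version reports a value that never appears in the data, which is the kind of silent error a reduction with a bad starting value produces.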

AOTAutograd Graph Compilation Crashes with Effectful Ops: Combining custom effectful operations with flex_attention causes crashes during AOTAutograd graph compilation due to failure in unlifting effect tokens for subgraphs without tokens. This blocks compilation of certain effectful models.  
issues/173222

Dynamo Error Messages Misleading for Nested Contexts: Dynamo error messages incorrectly point to the context manager line rather than the actual error line inside nested functions, making debugging difficult when errors occur deeply nested within contexts.  
issues/173231

PyTorch Memory Visualizer Zoom Broken by D3 Update: The Memory Visualizer's zoom functionality is broken due to a JavaScript TypeError caused by removal of d3.event in newer D3 versions, rendering the tool partially unusable after a CDN source change.  
issues/173310

Missing SHA256 Hashes for Some Artifacts: Several artifacts, including xpu builds, lack sha256 hashes on the PyTorch download index, extending a previously fixed problem that affected only cpu builds for versions >= 2.9.1. This impacts artifact verification.  
issues/173312
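
Published sha256 hashes matter because they let an installer or a user check a downloaded artifact before using it. A minimal sketch of that check follows, using only the Python standard library; the helper names and the temporary file standing in for a wheel are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    # Hash the file in chunks so large wheels do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Stand-in for a downloaded artifact; a real check would use the hash
# published on the download index.
payload = b"example wheel bytes"
fd, artifact_path = tempfile.mkstemp(suffix=".whl")
with os.fdopen(fd, "wb") as fh:
    fh.write(payload)

published_hash = hashlib.sha256(payload).hexdigest()
matches = verify_artifact(artifact_path, published_hash)
rejects_wrong_hash = not verify_artifact(artifact_path, "0" * 64)
os.remove(artifact_path)
print(matches, rejects_wrong_hash)
```

When the index publishes no hash, there is nothing to pass as expected_hex, so this check cannot be performed at all.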

ONNX Export Fails with Custom Autograd Functions Using Saved Variables: Exporting models with custom torch.autograd.Function classes that access ctx.saved_variables in backward pass fails with unsupported operation errors when using torch.onnx.export(..., dynamo=True). This blocks exporting certain video restoration models.  
issues/173316

Multiple XPU Tests Disabled Due to Consistent Failures: Numerous tests on the XPU platform, including various nn.functional conv, rms_norm, max_unpool2d, and NLLLoss tests, are disabled due to consistent failures on the main branch, indicating stability issues on XPU hardware.  
issues/173335, issues/173339, issues/173340, issues/173341, issues/173345, issues/173350, issues/173351, issues/173353, issues/173354, issues/173363, issues/173364, issues/173464, issues/173465, issues/173466, issues/173471, issues/173472

Incorrect Results from torch._foreach_copy_ with Mixed Dtypes: The torch._foreach_copy_ function produces incorrect results when copying tensors to a list of destination tensors with mixed data types on CUDA, because only the first destination tensor's dtype is checked. This causes data corruption in batched copy operations.  
issues/173383
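
The bug class reported for torch._foreach_copy_ (deciding the dtype once, from the first destination only) can be shown with a toy model. In the sketch below plain Python lists stand in for tensors and a string stands in for the dtype; none of this reflects the actual CUDA implementation, and every name is hypothetical.

```python
def cast(value, dtype):
    # Toy conversion: "int64" truncates to int, anything else stays float.
    return int(value) if dtype == "int64" else float(value)


def foreach_copy_buggy(dsts, srcs):
    # Looks at the first destination only and reuses its dtype for all.
    dtype = dsts[0]["dtype"]
    for dst, src in zip(dsts, srcs):
        dst["data"] = [cast(v, dtype) for v in src["data"]]


def foreach_copy_fixed(dsts, srcs):
    # Converts each source to the dtype of its own destination.
    for dst, src in zip(dsts, srcs):
        dst["data"] = [cast(v, dst["dtype"]) for v in src["data"]]


def make_dsts():
    return [{"dtype": "float32", "data": []}, {"dtype": "int64", "data": []}]


srcs = [
    {"dtype": "float32", "data": [1.5, 2.5]},
    {"dtype": "float32", "data": [1.5, 2.5]},
]

buggy_dsts = make_dsts()
foreach_copy_buggy(buggy_dsts, srcs)

fixed_dsts = make_dsts()
foreach_copy_fixed(fixed_dsts, srcs)

print(buggy_dsts[1], fixed_dsts[1])
```

In the buggy variant the second destination is labelled int64 but ends up holding float values, which is the toy equivalent of the corrupted destination data described above.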

MacOS Torch Wheel Tag Mismatch in Release 2.10: MacOS torch wheels in release 2.10 have incorrect wheel tags causing installation problems, with a proposed fix to correct tags and republish binaries with revised naming.  
issues/173475

Torch 2.9.0 CUDA 12.9 Installation Fails Due to CDN Cache: Installing torch 2.9.0 for CUDA 12.9 using uv fails with a hash mismatch error caused by CDN caching issues after binaries were rebuilt and republished; the problem was resolved by invalidating the CDN cache.  
issues/173486

torch.addr CPU and GPU Implementations Diverge on Overflow: The CPU implementation of torch.addr returns infinity due to intermediate overflow with int64 and float16 inputs, while the GPU returns correct finite float32 results, causing inconsistent behavior across backends.  
issues/173491

CUDA Error on Unsupported NVIDIA GeForce RTX 5070 Ti: PyTorch crashes with a CUDA error indicating no kernel image is available for the NVIDIA GeForce RTX 5070 Ti (sm_120), requesting added support for this GPU in CUDA builds.  
issues/173494

torch._stack Crashes on Empty Tensor Input: Calling torch._stack with an empty tensor causes a segmentation fault (SIGSEGV) crash instead of raising a catchable exception, leading to instability.  
issues/173498

torch.floor_divide Crashes on int64 Min Divided by -1: Using torch.floor_divide on CPU with int64 inputs crashes the Python interpreter with a Floating Point Exception (SIGFPE) when dividing the minimum 64-bit integer by -1 due to unhandled integer overflow.  
issues/173506
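
The overflow is easy to see with Python's arbitrary-precision integers: the mathematically exact quotient is one larger than the largest int64 value. On x86, the hardware signed-divide instruction traps in this case, which the operating system reports as SIGFPE. The guarded helper below is a hypothetical sketch of the kind of check that turns the crash into a catchable error; it is not the fix applied in PyTorch.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def checked_floor_divide_int64(a, b):
    # Hypothetical guard: reject the two inputs that have no int64 result.
    if b == 0:
        raise ZeroDivisionError("division by zero")
    if a == INT64_MIN and b == -1:
        raise OverflowError("INT64_MIN // -1 is not representable in int64")
    return a // b


# Python ints do not overflow, so the exact result can be computed directly.
exact = INT64_MIN // -1
print(exact, exact - INT64_MAX)

try:
    checked_floor_divide_int64(INT64_MIN, -1)
    raised = False
except OverflowError:
    raised = True
print(raised)
```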

Numerical Overflow Differences in nn.Conv2d CPU vs CUDA: CUDA and CPU implementations of nn.Conv2d differ in overflow behavior near float32 limits; CUDA outputs partially finite values while CPU outputs overflow to infinity, causing numerical inconsistency.  
issues/173520

Typo in Test Name for Distributed Tensor Debug Mode: The test name test_hash_empty_tenor should be corrected to test_hash_empty_tensor in test/distributed/tensor/debug/test_debug_mode.py.  
issues/173523

nn.Conv2d CUDA Backend Produces NaNs for Near-Limit Inputs: The CUDA backend of nn.Conv2d produces NaN and infinite values for inputs near float32 limits, while the CPU backend produces valid finite results with identical inputs and weights, indicating numerical stability issues.  
issues/173529

Torch Package Missing Numpy Dependency Without torchvision: Installing the torch package without torchvision does not automatically install numpy, causing torch to fail loading due to missing numpy module.  
issues/173532

Segmentation Fault in matrix_exp_backward with Scalar Input: torch.ops.aten.matrix_exp_backward crashes with a segmentation fault when given a scalar tensor instead of a matrix tensor, lacking proper error handling.  
issues/173624

Regression in AOTInductor Model Loading in PyTorch 2.10.0: Loading model packages with AOTInductor fails due to an AttributeError from missing or inaccessible 'codecache' attribute in torch._inductor, a regression from version 2.9.1.  
issues/173706

FSDP2 Backward Pass Runtime Error with Scaled Dot Product Attention: Using Fully Sharded Data Parallel v2 with torch.nn.functional.scaled_dot_product_attention causes a storage size mismatch error during backward pass when loss depends only on inputs, with a workaround to disable resharding after forward.  
issues/173709

torch.reshape Crashes on Very Large Negative Dimension: torch.reshape crashes with a runtime error when given a very large negative input dimension due to unexpected shape argument type.  
issues/173724

HF Cache on B200 Causes vLLM Job Issues: Enabling the HF cache on B200 causes vLLM jobs to automatically detect and use the cache, requiring a forward fix to tests to avoid rollback complications.  
issues/173777

Lack of Documentation for _lazy_clone C++ API Method: The non-public C++ API method _lazy_clone lacks documentation despite its presumed role in deferring tensor cloning operations, raising questions about its usage.  
issues/173780

InstanceNorm ONNX Export Warning with track_running_stats=False: Exporting models with InstanceNorm in eval mode and track_running_stats=False triggers warnings about training mode during ONNX export, unlike when track_running_stats=True, indicating export flag handling issues.  
issues/173782

torch.linalg.cholesky_ex CPU and CUDA Backend Discrepancy: Given infinite inputs, CPU backend returns infinity with success, while CUDA silently produces NaNs but also indicates success, causing inconsistent and unsafe numerical results.  
issues/173786

torch.nn.functional.pdist CPU and CUDA Output Inconsistency: For p=0 and inputs with infinite values, CPU backend propagates NaNs from inf - inf operations, but CUDA returns finite integer values, leading to inconsistent outputs.  
issues/173799
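
For p=0 the distance is the number of coordinates that differ, and inf - inf evaluates to NaN under IEEE 754 arithmetic. The two hypothetical functions below show how the same inputs can give either a finite count or NaN, depending on whether the NaN is tested for explicitly. The report does not say how either backend is implemented, so this is only an illustration of the two outcomes.

```python
import math


def zero_norm_counting(x, y):
    # Counts coordinates whose difference is not equal to zero.
    # NaN != 0 evaluates to True, so a NaN difference is counted as one.
    return float(sum(1 for a, b in zip(x, y) if (a - b) != 0))


def zero_norm_propagating(x, y):
    # Returns NaN as soon as any coordinate difference is NaN.
    diffs = [a - b for a, b in zip(x, y)]
    if any(math.isnan(d) for d in diffs):
        return math.nan
    return float(sum(1 for d in diffs if d != 0))


x = [math.inf, 1.0, 2.0]
y = [math.inf, 1.0, 5.0]

counted = zero_norm_counting(x, y)
propagated = zero_norm_propagating(x, y)
print(counted, propagated)
```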

c10d Distributed Operators Fail on MPS CPU Fallback: Using CPU fallback on MPS devices with PYTORCH_ENABLE_MPS_FALLBACK=1 causes silent failures in distributed operators due to asynchronous operation handling and improper CPU-MPS tensor copying, resulting in incorrect broadcast results.  
issues/173808

MI300 CI Node Migration Causes Job Queue Delays: Migration of MI300 continuous integration nodes to a new cloud provider temporarily increased queue times and prevented job runs until migration completed.  
issues/173851

TCPStore Binds to All IPv6 Addresses Instead of Specified IPv4: TCPStore listens on all IPv6 addresses (::) rather than the specified IPv4 localhost (127.0.0.1), potentially causing unintended security risks by exposing services on all interfaces.  
issues/173909
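
The difference between the two bind addresses can be checked with the standard socket module. This sketch does not use TCPStore; the listen_on helper is hypothetical and only shows what a listener bound to 127.0.0.1 reports compared with one bound to the IPv6 wildcard address.

```python
import socket


def listen_on(host, family):
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))  # port 0 lets the OS choose a free port
    sock.listen(1)
    return sock


# Loopback only: reachable from the local machine and nowhere else.
loopback = listen_on("127.0.0.1", socket.AF_INET)
bound_host, bound_port = loopback.getsockname()
loopback.close()
print(bound_host, bound_port)

# Wildcard: accepts connections on every interface. Skipped where the
# environment has no usable IPv6 support.
if socket.has_ipv6:
    try:
        wildcard = listen_on("::", socket.AF_INET6)
        print(wildcard.getsockname()[0])
        wildcard.close()
    except OSError:
        pass
```

Inspecting getsockname() on a listening socket is a quick way to confirm that a server honoured the address it was asked to bind to.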

Duplicate Jinja2 Package Entry Removed from CI Requirements: A redundant duplicate Jinja2 package entry with different casing was removed from .ci/docker/requirements-ci.txt to eliminate unnecessary CI dependency redundancy.  
issues/173918
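
Duplicates that differ only in casing are easy to miss by eye because package names are compared in normalized form (PEP 503). The following sketch flags them in a list of requirement lines; the helper names, the sample lines, and the version numbers are hypothetical and are not taken from .ci/docker/requirements-ci.txt.

```python
import re
from collections import defaultdict


def normalize(name):
    # PEP 503 normalization: lowercase, and collapse runs of "-", "_", ".".
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicate_requirements(lines):
    seen = defaultdict(list)
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            seen[normalize(match.group(0))].append(match.group(0))
    return {name: spellings for name, spellings in seen.items() if len(spellings) > 1}


sample = [
    "Jinja2==3.1.6",
    "numpy==1.26.4",
    "jinja2==3.1.6  # same package, different casing",
    "-r other-requirements.txt",
]
duplicates = find_duplicate_requirements(sample)
print(duplicates)
```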

TorchInductor and Eager Backend Output Discrepancy on fractional_max_pool2d: Significant output mismatches occur between the TorchInductor and eager backends when using aten.fractional_max_pool2d in PyTorch 2.6.0, causing assertion failures.  
issues/173985

2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment. 
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week. 

III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 301
Key Open Pull Requests
1. Handle List/Dict Comprehension Graph Breaks for Python3.12+: This pull request addresses the Python 3.12 change that inlines list and dict comprehensions into their surrounding functions. It enhances PyTorch Dynamo to handle graph breaks inside these comprehensions through bytecode analysis, checkpointing, and selective tracing, so that tracing and resumption of execution remain accurate across complex comprehension scenarios and edge cases.

URL: pull/173558

Associated Commits: 56724, c55d7, 5757c, a0eb3, 084e5, 13b84, 9a755, 7ac14, 48d2c, 98374, 9a9db, b31da, dd72d, b4f59, 74fe6, 6c567, 63fd0, a4d89, 8547f, 2129b, c4b09, 4ce41, 3a27d, 8895e, ba25a, d3811, 6cb43, 00a98, 2ee74, 960e9, 1342d, ee671, cf5bd, 411d5, 89182, 3909a, 2b02f, d5fdd, d3015, cb56d, a4524, 8cc33, edc3f, 29785
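
The inlining that this pull request responds to (introduced by PEP 709) can be observed from the standard library alone. The sketch below checks whether a list comprehension compiles to its own nested code object; it says nothing about how Dynamo handles the change, and the helper name is hypothetical.

```python
import sys
import types


def double_all(xs):
    return [x * 2 for x in xs]


def nested_code_objects(func):
    # A comprehension that is NOT inlined appears as a separate code object
    # among the constants of the enclosing function.
    return [
        const.co_name
        for const in func.__code__.co_consts
        if isinstance(const, types.CodeType)
    ]


nested = nested_code_objects(double_all)
inlined = sys.version_info >= (3, 12)
print(sys.version_info[:2], nested)
```

On interpreters older than 3.12 the list contains a single comprehension code object; on 3.12 and later it is empty because the comprehension's bytecode lives in the enclosing function, so a tracer no longer sees a separate frame for it.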

2. Add interactive glossary with hover tooltips: This pull request adds an interactive glossary to the PyTorch documentation featuring hover tooltips for over 15 PyTorch-specific terms, integrates these tooltips across multiple documentation files, updates dependencies and configuration files to support the new functionality, and incorporates the glossary into the main documentation navigation.

URL: pull/173390

Associated Commits: ad6b2, 3ddcb, 385c7, 19981, b489d, b0659, ed27a, 60f68, 4d439, 404f8, 0c5d0, dad4b, eeb35, c0d5e, b16da, 6276e, 864a4, 2bc3d, e7fd0, 2282a, 112c0, f9ee7, 361b3, 09845, eca66, b8207, 93919, 260cc, 21e6e, 2f8bf, f06c5, f07c3, 03fd9, e08d2, 36790

3. [dynamo][claude] Dynamo profiler: This pull request introduces a Dynamo-native profiler that operates at the tracing layer to measure the time spent by Dynamo while tracing individual Python functions, providing improved visibility into expensive user functions and polyfill invocations to better diagnose and optimize compile-time performance issues, while maintaining compatibility with existing Python profiling tools like pstats and snakeviz.

URL: pull/173942

Associated Commits: e9010, fa138, f1a90, 53e34, aa637, 4dc4c, f1038, e59ad, 07c75, 7a249, 202f4, d568e, e0e22, 7bf23, baa5d, 085db, 2c419, 2e2a3, 7aedb, a02e3, 57b9e, ac310
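
For readers unfamiliar with the pstats workflow mentioned above, here is a minimal standard-library sketch. It profiles an ordinary Python function with cProfile rather than Dynamo tracing, and it does not use the API proposed in the pull request; slow_function is a hypothetical stand-in workload.

```python
import cProfile
import os
import pstats
import tempfile


def slow_function(n):
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
slow_function(50_000)
profiler.disable()

# Write the profile in the binary format that pstats (and viewers built on
# it) can load, then read it back.
fd, profile_path = tempfile.mkstemp(suffix=".prof")
os.close(fd)
profiler.dump_stats(profile_path)

stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative")
total_calls = stats.total_calls
profiled_names = {key[2] for key in stats.stats}
os.remove(profile_path)
print(total_calls, "slow_function" in profiled_names)
```

A profiler that emits this same file format can be inspected with the same tooling, which is the practical meaning of staying compatible with pstats.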

Other Open Pull Requests

Consolidation of VariableTracker Construction: Multiple pull requests consolidate the construction of VariableTracker objects across various PyTorch modules by routing direct variable creation through centralized builders like SourcelessBuilder.create() or VariableBuilder when transaction context is available. These changes use local imports to avoid circular dependencies and address step 1 of a related issue, while leaving some static handler methods unchanged where transaction context is unavailable.  
pull/173442, pull/173439, pull/173441, pull/173449, pull/173450, pull/173451, pull/173458

Inductor Backend and Test Updates: Pull requests re-enable Inductor X86 backend test cases removed during PT2E migration by updating them to avoid the PT2E API and propose allocating bucket memory from the process group to improve overlap handling. These changes restore test functionality and aim to optimize backend memory management.  
pull/173349, pull/173386

Platform-Specific and Hardware Support Enhancements: Several pull requests improve support for specific hardware and platforms, including enabling dlpack tests for Intel GPU with XPU support, adding rocSHMEM support on ROCm, introducing lazy Intel Level Zero dependency for XPU builds, and addressing ROCm MI350 graph break debugging. These changes enhance compatibility and stability across diverse hardware environments.  
pull/173760, pull/173518, pull/173497, pull/173683, pull/173509

Performance and Compilation Improvements: Pull requests propose allowing eager evaluation of certain Dynamo functions to reduce compile time by about 1.8 seconds and add profiler utilization annotations with FLOPS and bandwidth metrics to the Inductor backend. These enhancements improve compilation efficiency and provide detailed performance analysis capabilities.  
pull/173746, pull/173551

API and Type Handling Enhancements: A pull request enables device-specific Event classes to accept both generic and device-specific Stream inputs, resolving stricter API type requirements and conversion issues between generic and backend-specific events. This change improves API flexibility and usability across devices.  
pull/173908

CI and Build Infrastructure Updates: Pull requests add blockwise FP8 support for scaled_mm_v2 on XPU, introduce ppc64le wheel building support in CI/CD pipelines, and add the torchfuzz test to the CI pipeline with ongoing updates for test management. These changes expand hardware support and improve testing infrastructure.  
pull/173630, pull/173519, pull/173857

Code Quality and Compiler Warning Fixes: A pull request fixes MSVC compiler warning C4267 by adding explicit static casts to ensure type consistency when all warnings are enabled. This improves code robustness and compiler compliance.  
pull/173325

Dynamic Shape Support in Linear Algebra Operations: A pull request fixes dimension-dependent errors in 18 linear algebra operations by replacing direct dimension comparisons with runtime validation and handling unbacked symbolic dimensions properly. This enables these operations to support dynamic shapes effectively.  
pull/173399

Environment Configuration Refactor: One pull request converts environment variable configuration logic from a shell script to a Python-based EnvironmentConfig within the Lumen CLI, improving management, display, export, and verification of environment variables for PyTorch test builds.  
pull/173424

3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 336
Key Closed Pull Requests
1. Skip the distributed tests which were previously disabled for 2.8: This pull request proposes skipping the distributed tests that were previously disabled for the PyTorch 2.8 release branch, addressing related issues and coordinating with multiple contributors.

URL: pull/173365

Associated Commits: 71a30, 8fe04, 18a50, 19367, 4b463, 1befb, fad6b, 2f824, 1963d, 30252, 3d102, cb987, 85ac5, 62c67, 86e58, 2074e, 2b25d, ca125, 96009, d568c, b26dd, 53829, 7b590, 61c07, 730c7, fb814, eb343, 9f118, ecc20, 1b442, cdfe1, 2d72f, a0ffd, 22d02, ed0d0, d010d, 9c429, ccdb1, ad6b8, 77a67, ade02, e96dc, 2975e, b4af4, 1d7b9, 2067a, eb471, d2d97, c3d28, 4febb, 419fb, ab27a, 0def0, 75c80, c03be, 64359, b2fb6, 8d179, fd4b1, c1404, 1a9ca, b2d45, 7b2a4, 6aaab, 0e570, 9a46f, 9596b, 9ea02, 675f8, db3ba, aeb64, a20c7, 66514, 0b82d, bd740, dfd38, 245bf, 2cd73, cbd27, fe1f5, b2b16, 336f2, 7a520, 71347, d631b, 2ce89, 330f5, b067d, a3546, 36586, 93dd5, 57296, 63e52, cba8b, 05f24, 1a24a, fa544, cc1d0, 9c53f, cbaa7, 393ae, bf943, 05fef, fd9c5, 07086

2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request proposes and implements the InputObserver feature to automatically infer dynamic shapes for torch.export.export and torch.onnx.export by analyzing multiple input sets with varying dimensions, addressing the complexity of handling nested input structures like DynamicCache.

URL: pull/172838

Associated Commits: 39f8f, 066b8, 1d998, c0fbd, 21b9f, 9b5cf, 52209, 50b08, f66ac, e3c59, 3bf4f, 70e3f, a7524, fa4bb, 5cccd, a74c0, a1e3a, 9324a, 7ab9d, d6760, 00fd4, 04685, ecf27, 46db0, 9f3a9, edd6c

3. [DTensor] Optimize redistribute comms using flattened meshes: This pull request optimizes DTensor's redistribution communications by detecting and using flattened device meshes when available to reduce costly sequential collective operations, particularly improving reduce communications to avoid divergent results from different reduction orders, while also adding support for comms beyond all_reduce, banning mixed partial placements, simplifying implementation through grouping and merging transform infos, issuing warnings for missing flattened meshes, and addressing various limitations and edge cases to enhance performance and correctness.

URL: pull/172610

Associated Commits: 0133e, f46a1, 30e46, cf97d, 9c878, fa535, f7336, 36655, 4d734, 6e501, 27ec0, 5e38d, c5f81, 5bf75, 25d94

Other Closed Pull Requests

Norm computation strategy updates: This set of pull requests introduces a new S->P(sum) strategy for the linalg_vector_norm function when skip_root=True and updates norm strategies for inf/-inf/0/1 norms to use Partial(max/min/sum) placements instead of NormPartial with a reduce_op. These changes remove the reduce_op field and simplify norm computations to avoid problematic sqrt→pow→sqrt cycles.  
pull/172604
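
To illustrate why these norms map onto simple partial reductions, here is a minimal plain-Python sketch. It uses no PyTorch or DTensor APIs; the shard lists and the helper names local_norm, combine and local_power_sum are hypothetical. It assumes a vector split across three shards and checks that the inf, -inf, 0 and 1 norms of the full vector can be recovered with a max, min or sum over per-shard results. The last part assumes that skipping the root means each shard contributes a plain sum of powers, so the root is taken only once at the end.

```python
import math

def local_norm(shard, p):
    """Norm of one shard for p in {inf, -inf, 0, 1} (hypothetical helper)."""
    mags = [abs(x) for x in shard]
    if p == math.inf:
        return max(mags)
    if p == -math.inf:
        return min(mags)
    if p == 0:
        return float(sum(1 for m in mags if m != 0))  # count of non-zeros
    if p == 1:
        return float(sum(mags))
    raise ValueError("only inf, -inf, 0 and 1 are handled in this sketch")

def combine(partials, p):
    """Reduce per-shard results: max for inf, min for -inf, sum for 0 and 1."""
    if p == math.inf:
        return max(partials)
    if p == -math.inf:
        return min(partials)
    return float(sum(partials))

def local_power_sum(shard, p):
    """Sum of |x|**p without the final root (assumed meaning of skipping the root)."""
    return sum(abs(x) ** p for x in shard)

shards = [[3.0, -4.0], [0.0, 1.5], [-2.0, 0.5]]
full = [x for shard in shards for x in shard]

# Each of these norms on the full vector equals a simple reduction of shard norms.
for p in (math.inf, -math.inf, 0, 1):
    assert combine([local_norm(s, p) for s in shards], p) == local_norm(full, p)

# 2-norm: sum the per-shard power sums, then take a single square root.
two_norm = math.sqrt(sum(local_power_sum(s, 2) for s in shards))
assert math.isclose(two_norm, math.sqrt(local_power_sum(full, 2)))
```

Because every shard result is combined with one associative reduction, no intermediate root has to be undone with a power before reducing.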

MPS backend test adjustments: These pull requests propose skipping or marking as expected failures certain tests in the MPS backend, including test_non_standard_bool_values and some OpInfo tests, due to inconsistent results across different platforms. The goal is to improve test reliability and remove associated expected failure markers (xfails).  
pull/173560, pull/173455

NCCL and communication improvements: This pull request implements NCCL 2.29 one-sided APIs for symmetric memory, including updates to nccl_extension.cu, signal methods, and test additions, while addressing feedback and fixing compilation errors. Another related pull request forces saving of torchcomms outputs in the functorch partitioner to ensure backward operations have access to forward tensor outputs, preventing invalid partition dependencies.  
pull/172425, pull/172889

Size hint and optimization hint migration: This pull request migrates remaining calls to size_hint that already pass fallback to use optimization_hint by applying the size hint atomically at those call sites without changing the handling of unbacked cases. This prepares for further handling of unbacked call sites in subsequent updates.  
pull/172533

ProcessGroup and FakeScriptObject improvements: This pull request modifies the ProcessGroup class to use an abstract base class (ABC) metaclass, enabling the registration of FakeScriptObject as a virtual subclass. This allows correct behavior of isinstance checks for tracing when dealing with FakeScriptObjectStack instances.  
pull/172566
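
The virtual-subclass mechanism mentioned above is a standard feature of Python's abc module. The sketch below uses only the standard library; ProcessGroupLike and FakeObject are hypothetical stand-ins, not the real PyTorch classes, and the real change may differ in detail.

```python
from abc import ABCMeta

class ProcessGroupLike(metaclass=ABCMeta):
    """Hypothetical stand-in for a class that tracing code checks with isinstance."""

    def rank(self):
        return 0

class FakeObject:
    """Hypothetical tracing-time placeholder; it does not inherit from ProcessGroupLike."""

# Without registration the isinstance check fails.
before = isinstance(FakeObject(), ProcessGroupLike)

# Registering FakeObject as a virtual subclass changes the isinstance result
# without changing FakeObject's inheritance chain.
ProcessGroupLike.register(FakeObject)
after = isinstance(FakeObject(), ProcessGroupLike)

print(before, after)
```

Registration only affects isinstance and issubclass; the registered class does not gain any methods from the abstract base class.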

Serialization enhancements for GraphModule: This pull request replaces the __reduce__-based serialization in SerializedGraphModule with GraphPickler to directly serialize and reconstruct graph structures. This enables support for HigherOrderOperators that the FX tracer cannot handle properly by adding specialized pickling support and updating serialization methods accordingly.  
pull/173767

CUDA memory snapshot speedup: This pull request adds an option to the CUDA memory snapshot functionality to skip collecting the full trace entry history while still capturing the current memory state. This results in significant speedups, up to thousands of times faster, when taking snapshots with large numbers of trace entries.  
pull/172672

DTensor debug and redistribution fixes: These pull requests enhance the DTensor debug mode by enabling it to print optimized transform information and fix a crash caused by an assertion failure in the DTensor redistribution planner. The fix allows non-participating ranks to safely exit early during redistribution cost computation, ensuring consistent DTensor property queries across all ranks.  
pull/173436, pull/172478

Circular dependency bug fix in constant folding: This pull request fixes a circular dependency bug in the constant_fold_uniform_value function that caused stable_topological_sort to fail with an assertion error. It adds a check to skip replacements that would create cycles involving sym_size_int nodes and full() nodes.  
pull/173444
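
The kind of check described above can be sketched on a toy dependency graph. This is not the Inductor implementation: the graph is a plain dict mapping each node name to the nodes it reads, and depends_on, would_create_cycle and all node names are hypothetical. The idea is that replacing every use of an old node with a new node creates a cycle whenever the new node already depends, directly or transitively, on one of the old node's users.

```python
def depends_on(inputs, node, target):
    """Return True if `node` reads `target` directly or transitively."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(inputs[current])
    return False

def would_create_cycle(inputs, old, new):
    """Would redirecting all users of `old` to `new` introduce a cycle?"""
    users = [n for n, deps in inputs.items() if old in deps and n != new]
    return any(depends_on(inputs, new, user) for user in users)

# Toy graph: "size" reads "a", and the candidate replacement "full" reads "size".
inputs = {
    "x": [],
    "a": ["x"],
    "size": ["a"],
    "full": ["size"],
    "const": [],
    "out": ["a"],
}

unsafe = would_create_cycle(inputs, "a", "full")   # "size" would read "full", which reads "size"
safe = would_create_cycle(inputs, "a", "const")    # "const" depends on nothing
print(unsafe, safe)
```

A pass built this way would skip the replacement when the check returns True and leave the original node in place.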

Automated triage workflow for GitHub issues: This pull request introduces an automated triage workflow using a skill-based system that applies predefined labels and canned responses to GitHub issues via GitHub Actions. It leverages a static label list and the sonnet-4.5 model to improve issue classification and management.  
pull/173530

ONNX exporter dynamic shape inference update: This pull request adds a parameter to the InputObserver.infer_dynamic_shapes method in the ONNX exporter to allow forcing the first dimension of input tensors to be treated as dynamic. This improves flexibility in dynamic shape inference even when the dimension is not present in a given set of inputs.  
pull/173533
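
As a rough illustration of inferring dynamic dimensions from observed inputs, the sketch below compares the shapes seen across several calls and marks every dimension whose size varies. It does not use the real InputObserver API; the function infer_dynamic_dims and its force_first_dim_dynamic parameter are hypothetical names chosen for this example.

```python
def infer_dynamic_dims(observed_shapes, force_first_dim_dynamic=False):
    """Return the set of dimension indices whose size differs across observations."""
    if not observed_shapes:
        raise ValueError("at least one observed shape is required")
    rank = len(observed_shapes[0])
    if any(len(shape) != rank for shape in observed_shapes):
        raise ValueError("all observed shapes must have the same rank")
    dynamic = {
        dim for dim in range(rank)
        if len({shape[dim] for shape in observed_shapes}) > 1
    }
    # Optionally treat dimension 0 as dynamic even if every observation agreed on it.
    if force_first_dim_dynamic and rank > 0:
        dynamic.add(0)
    return dynamic

# Two observed inputs with batch size 1 and different sequence lengths.
shapes = [(1, 8, 64), (1, 12, 64)]
inferred = infer_dynamic_dims(shapes)
forced = infer_dynamic_dims(shapes, force_first_dim_dynamic=True)
print(sorted(inferred), sorted(forced))
```

The forced variant covers the case where every sample input happens to share the same first dimension, so observation alone would wrongly treat it as static.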

Inductor backend NVGEMM support: These pull requests add support for scaled matrix multiplication (mm) and Groupgemm operations using NVGEMM within the Inductor backend. These enhancements improve the backend's capability to handle various matrix multiplication scenarios.  
pull/172525, pull/172417

MAGMA backend deprecation for SVD: This pull request deprecates the MAGMA backend for singular value decomposition (svd) and unconditionally dispatches the operation to the cuSOLVER backend instead.  
pull/172824

Hugging Face cache enablement in CI: This pull request proposes enabling the Hugging Face (HF) cache across all continuous integration (CI) jobs to locally store HF content for faster access. It includes a mechanism to refresh the cache via a special PR label and daily updates tied to the vLLM pin update process.  
pull/173477

Shallow copy support for privateuse1 backend: This pull request introduces a new function to enable shallow copying between CPU and the privateuse1 backend. It enhances documentation and provides an example to support this previously unsupported operation.  
pull/172564

Intel Triton commit update and fixes: This pull request updates the Intel Triton commit pin within the [xpu][inductor] components, including related fixes such as unskipping a specific test and addressing lint errors.  
pull/172943

Unbacked tensor dimension testing: This pull request adds a new test suite file, test_ops_unbacked.py, which marks tensor dimensions of size two or greater as unbacked for all OpInfo entries and attempts full graph compilation. It detects framework data-dependent errors while maintaining a list of known failing operations due to such errors.  
pull/173131

DeviceContext mode stack invariant enforcement: This pull request ensures that the DeviceContext maintains the invariant of having only one mode on its stack at any given time.  
pull/173537

ROCm gfx950 GPU test fixes: This pull request addresses and implements test skips and fixes for unit test failures specific to the gfx950 GPU architecture in the ROCm CI environment. These changes ensure stable and accurate continuous integration results.  
pull/173590

Static Triton kernel launcher for XPU: This pull request proposes enabling the static Triton kernel launcher for the XPU backend and reusing the corresponding unit tests to support this feature.  
pull/169938

Global kernel cache for cutlass_api: This pull request implements a global kernel cache built at first use to avoid repeated expensive calls to cutlass_api.get_kernels(). This results in significant runtime improvements, including up to a 43% speedup in end-to-end latency benchmarks for workflows involving multiple GEMM operations.  
pull/172402
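
A cache that is built on first use can be sketched with the standard library alone. The example below is not the code from the pull request: get_kernels_slow is a hypothetical stand-in for an expensive enumeration call, and kernel_cache and lookup are illustrative names.

```python
import functools

call_count = 0  # tracks how often the expensive call actually runs

def get_kernels_slow():
    """Hypothetical stand-in for an expensive kernel enumeration."""
    global call_count
    call_count += 1
    return {"gemm_f16": "kernel-a", "gemm_f32": "kernel-b"}

@functools.lru_cache(maxsize=None)
def kernel_cache():
    """Build the table on the first call and reuse it afterwards."""
    return get_kernels_slow()

def lookup(name):
    """Look up one kernel by name; returns None for unknown names."""
    return kernel_cache().get(name)

first = lookup("gemm_f16")
second = lookup("gemm_f32")
missing = lookup("gemm_f64")
print(first, second, missing, call_count)
```

Deferring construction to first use keeps import time unaffected, while every later lookup reuses the same table instead of repeating the enumeration.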

3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions in this project's open and closed pull requests from the past week. It aims to identify potentially heated exchanges and to help maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week. 

IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month. 
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor	Commits	Pull Requests	Issues	Comments
wconstab	142	20	0	36
malfet	84	21	2	79
NikhilAPatel	141	35	0	0
pianpwk	144	18	0	9
ydwu4	151	15	1	2
laithsakka	133	20	0	11
bobrenjc93	121	31	0	7
BenjaminDEMAILLE	128	1	0	9
kurtamohler	112	3	1	3
eellison	54	17	1	30
