Weekly GitHub Report for PyTorch: January 16, 2026 - January 23, 2026 (21:06:18)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda package publishing.
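Two of these release changes are easy to illustrate. The snippet below is a minimal sketch assuming PyTorch 2.6 or newer: torch.load now defaults to weights_only=True, so loading checkpoints that contain arbitrary pickled objects requires an explicit opt-out, and torch.compiler.set_stance switches compilation behavior at runtime.

```python
import torch

# PyTorch 2.6 flips torch.load's default to weights_only=True; checkpoints
# containing arbitrary pickled objects now need an explicit opt-out.
torch.save({"w": torch.randn(3)}, "ckpt.pt")
state = torch.load("ckpt.pt")                       # weights_only=True by default
legacy = torch.load("ckpt.pt", weights_only=False)  # old behavior, trusted files only

# torch.compiler.set_stance controls dynamic compilation behavior, e.g.
# temporarily forcing eager execution without recompiling.
@torch.compile
def f(x):
    return x * 2

torch.compiler.set_stance("force_eager")
f(torch.randn(4))   # runs eagerly
torch.compiler.set_stance("default")
f(torch.randn(4))   # compiled as usual
```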
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
[TRIAGED] [MODULE: CUDA GRAPHS] [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] [MODULE: INDUCTOR] [VLLM-COMPILE] [MODULE: VLLM] [torch.compile][Bug] Unbacked SymInt leaks into a cudagraph-safe partitions: This issue describes a bug where an unbacked symbolic integer (SymInt) returned from a cudagraph-unsafe custom operation is incorrectly propagated into a cudagraph-safe partition when using graph partitioning in PyTorch's Inductor backend. This causes the partition to use data-dependent shapes in CUDA allocation and pointer offsets, violating cudagraph safety guarantees and leading to incorrect graph captures.
- The comments discuss potential fixes including adding an API to mark specific SymInts as cudagraph-safe or unsafe, the challenges of distinguishing safe versus unsafe dynamic shapes, and the need for Inductor to automatically partition graphs to exclude operations dependent on unsafe SymInts; a pull request addressing the issue has been raised and further design considerations about API naming and behavior were debated.
- Number of comments this week: 15
-
[HIGH PRIORITY] [TRIAGE REVIEW] [MODULE: AUTOGRAD] [TRIAGED]
backward does not respect with torch.device(device): This issue reports that the backward pass in PyTorch does not respect the with torch.device(device) context manager because the device context is thread-local and the backward computation runs on a separate thread, causing tensors created during backward to default to the CPU instead of the intended device. The discussion clarifies that this behavior is not a regression from a recent change, highlights limitations of TorchFunctionMode in intercepting backward operations, and notes that TorchDispatchMode may be required for reliable device control during backward (a minimal repro sketch appears after this list).
- The comments discuss whether this is a regression and conclude it is not tied to a specific PR, explain the thread-local nature of device contexts and limitations of TorchFunctionMode in backward, mention related issues and use cases, and suggest that TorchDispatchMode is necessary for proper interception and device management during backward passes.
- Number of comments this week: 8
-
[TRIAGED] [MODULE: XPU] XPU: x.item() fails on Intel Arc Pro B50 with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY; UR shows urEnqueueUSMMemcpy(size=4): This issue reports a runtime error occurring on Intel Arc Pro B50 and B580 GPUs when calling the .item() method on an XPU tensor in PyTorch, resulting in a UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY from the Level Zero backend during a USM memory copy operation. The problem appears related to device-to-host memory copying, with attempts to reproduce and diagnose it revealing that even copying the tensor to CPU memory fails, and the issue may be linked to recent driver or runtime changes rather than PyTorch versions (a minimal repro sketch appears after this list).
- Commenters confirmed the error on multiple Intel Arc GPUs and suggested it might be caused by invalid USM pointers or driver issues; attempts to downgrade PyTorch did not resolve the problem, and testing with the latest Intel drivers on Fedora Rawhide showed the issue was fixed, indicating a driver update as a potential solution.
- Number of comments this week: 8
-
[ONCALL: PT2] [TRIAGED] [MODULE: FAKETENSOR] [MODULE: DECOMPOSITIONS] [MODULE: PT2-DISPATCHER] [MODULE: DYNAMIC SHAPES]
torch.compile fails in FakeTensor meta path: Cannot view ... strides ... as (1, 2048) while eager works: This issue describes a bug where a model using various layers and operations runs correctly in eager mode but fails during torch.compile with the "inductor" backend when evaluated with FakeTensor meta tensors, specifically failing on a view operation due to stride incompatibility. The root cause was identified as a mismatch between the C++ and Python implementations of a function detecting channels-last memory format from strides, which led to incorrect stride interpretation and a subsequent failure in the FakeTensor path; a fix was proposed and confirmed to resolve the issue.
- The comments include initial identification of the problem as a reshape/view decomposition error, detailed stride and shape comparisons between eager and FakeTensor modes, discovery of a mismatch in memory format detection logic between C++ and Python, a proposed fix adjusting stride checks, and confirmation from the original reporter that the fix resolves the issue with consistent outputs in both eager and compiled modes.
- Number of comments this week: 6
-
[ONCALL: DISTRIBUTED] [PIPELINE PARALLELISM] Pipeline communication blocks the execution of pipeline stages: This issue addresses a problem in the pipeline scheduling of distributed training where communication operations block the execution of pipeline stages, particularly on AMD hardware. The current greedy scheduling algorithm causes a communication receive operation to block computation, leading to pipeline bubbles, and the user requests an option to reorder these operations to avoid overlap and interference.
- The comments discuss testing a proposed patch that reorders receive operations to reduce blocking, showing improved performance in single-node tests but encountering deadlocks in multi-node setups; attempts to resolve deadlocks by swapping send/receive order on odd ranks were unsuccessful, and further investigation into better heuristics was promised.
- Number of comments this week: 5
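For the backward/torch.device issue above, a minimal sketch of the reported behavior (assuming a CUDA device is available; the tensor created inside backward is the illustrative part):

```python
import torch

class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad):
        # The autograd engine runs this on a worker thread, so the thread-local
        # `with torch.device(...)` context active on the caller's thread does
        # not apply here.
        extra = torch.ones(1)
        print("tensor created in backward lives on:", extra.device)  # reportedly cpu
        return grad

x = torch.randn(3, device="cuda", requires_grad=True)
with torch.device("cuda"):
    Double.apply(x).sum().backward()
```

For the XPU x.item() issue above, the reported failure reduces to roughly the following sketch (assuming an Intel GPU visible through the XPU backend; on affected driver versions the small device-to-host copy is what raises the error):

```python
import torch

# Reported repro shape: a tiny device-to-host copy triggered by .item()
# fails on Intel Arc Pro B50/B580 with certain driver/runtime versions.
assert torch.xpu.is_available()
x = torch.tensor([1.0], device="xpu")
print(x.item())   # reportedly raises UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
print(x.cpu())    # the plain copy to CPU was reported to fail as well
```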
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 74
Summarized Issues:
- CUDA and GPU Architecture Support: Several issues address CUDA compatibility and GPU architecture problems, including updating the CUDA support matrix to make CUDA 13.0 stable, stripping of accelerated suffixes from GPU architecture flags causing broken NVIDIA features, and a test failure on GB200 (aarch64) due to missing valid CUDATemplateCaller choices. These problems affect build configurations, feature support, and test reliability across different CUDA versions and GPU platforms.
- torch.compile and Inductor Backend Bugs: Multiple issues describe bugs in torch.compile and the Inductor backend, including incorrect tensor shape inference, NotImplementedErrors with MKLDNN tensors, dynamic compilation errors with strided tensors, dtype-changing view miscompilations, and failures involving complex operations or higher-order gradients. These bugs cause silent correctness errors, crashes, or recompilations, impacting model compilation and execution fidelity.
- Distributed and DTensor Issues: Several problems relate to distributed tensor operations and DTensor behavior, including pipeline scheduling blocking on AMD hardware, incorrect global shapes from unevenly sharded DTensors, leaking unbacked SymInt into cudagraph-safe partitions, gradient type mismatches for unused DTensor outputs, and errors with Shard(1) placement in tensor parallel layers. These issues cause performance bottlenecks, incorrect computations, and failures in distributed or parallel tensor workflows.
- Segmentation Faults and Crashes: Multiple reports describe segmentation faults and crashes occurring in various contexts, such as loading compiled AOT artifacts with constants, importing cudagraph_trees module, subclassing OpaqueBase, nightly build updates causing segfaults in ComfyUI, and unsafe crashes compiling rnn_tanh operator. These faults disrupt normal operation and require debugging to ensure stability.
- Randomness and Reproducibility Issues: There are issues with inconsistent random tensor generation and unexpected stride changes, including inconsistent random tensors between single- and multi-GPU setups despite fixed seeds, and unexpected stride changes when using torch.autograd.Function.apply. These problems affect reproducibility and tensor layout assumptions.
- ONNX and Exporting Problems: Issues include metadata loss when exporting models with torch.onnx.export using dynamo=True, incorrect input names in Dynamo ONNX exporter outputs, and failures exporting models due to symbolic length errors or deprecated torch.jit usage. These affect model portability and export correctness.
- Memory and Allocation Concerns: Problems include requests for FP8 support in symmetric memory to reduce memory inflation, leveraging unified memory for Apple Silicon MPS backend to reduce data transfer costs, and NCCL communicator requiring explicit all_reduce calls to avoid illegal memory access. These issues impact memory efficiency and correctness in distributed and device-specific contexts.
- Documentation and Build Issues: Several issues report documentation typos, broken links, inconsistent buffer size defaults, missing semicolons causing build errors, and CI testing gaps for install script changes. These affect developer experience and build reliability.
- Performance and Tuning Problems: Issues include slow torch.randn on CPU with bfloat16, the coordinate descent tuner performing unnecessary expensive operations despite a warm cache, and the need to readjust fudge factors in Inductor unit tests due to improved math backend precision. These affect runtime efficiency and test stability.
- API and Functionality Inconsistencies: Problems include torch.bmm not accepting the documented default out_dtype=None, the backward function ignoring device context due to thread-locality, inconsistent behavior of FP8 casting between CPU and CUDA backends, and torch.bucketize producing inconsistent results for NaNs between eager and Inductor modes (see the sketch after this list). These inconsistencies cause confusion and correctness issues.
- Model Loading and State Dict Enhancements: There is a feature request to add an optional parameter to model.load_state_dict to allow partial tensor filling during non-strict loading, enabling pretrained weights to be partially copied into expanded weight matrices while initializing the remaining parts. This would improve flexibility in model weight transfer; a sketch of the current manual workaround appears after this list.
- PrivateUse1 and Backend Compatibility: Issues discuss problems with PrivateUse1 backends including test instantiation failures, assignment of PrivateUse1 tensors to CPU tensor data attributes, and enabling FlexAttention device validation to allow PrivateUse1 devices. These affect experimental or third-party backend integration.
- Quantization and Numerical Accuracy Bugs: A bug in UniformQuantizationObserverBase causes incorrect scale and zero_point calculations, and TorchInductor produces incorrect results when combining aten.add with aten.hardswish_backward, leading to numerical errors and output mismatches.
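For the torch.bucketize NaN inconsistency noted in the API and Functionality Inconsistencies item, a minimal repro sketch (the exact inputs from the issue are not reproduced here; this only shows the comparison pattern):

```python
import torch

boundaries = torch.tensor([0.0, 1.0, 2.0])
values = torch.tensor([0.5, float("nan")])

eager = torch.bucketize(values, boundaries)
compiled_fn = torch.compile(lambda v, b: torch.bucketize(v, b))
inductor = compiled_fn(values, boundaries)

# The issue reports that the bucket assigned to the NaN element can differ
# between the eager and Inductor-compiled results.
print(eager, inductor)
```

For the load_state_dict feature request above, a sketch of the manual workaround that the proposed partial-fill parameter would replace (the sizes are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical case: a pretrained embedding (1000 x 64) copied into an expanded
# embedding (1200 x 64); the extra 200 rows keep their fresh initialization.
pretrained = nn.Embedding(1000, 64)
expanded = nn.Embedding(1200, 64)

with torch.no_grad():
    expanded.weight[: pretrained.num_embeddings].copy_(pretrained.weight)

# The requested option would let model.load_state_dict(..., strict=False)
# perform this partial copy itself instead of requiring manual slicing.
```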
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 43
Summarized Issues:
- Test Failures and Runtime Errors on Specific Platforms: Multiple issues report test failures and runtime errors related to platform-specific hardware or software configurations. These include test failures on the xpu platform, runtime errors with Inductor GEMM backend on NVIDIA B300 GPUs, segmentation faults due to device tensor pointer mismatches, and failures on CUDA-enabled H100 machines during memory efficient attention backward passes.
- Compilation and Recompilation Issues with torch.compile: Several issues describe bugs and assertion errors occurring during compilation or recompilation with torch.compile. Problems include excessive recompilations caused by forward hooks on submodules, failures with effectful operations during backward passes, and inconsistent behavior between eager and compiled modes.
- CUDA and GPU Compatibility Problems: There are multiple reports of compatibility issues involving CUDA versions and GPU architectures. These include failures of tests on SM90 GPUs with CUDA versions earlier than 12.9, user warnings for unsupported CUDA capabilities (12.1), missing operator implementations on MPS devices, and installation failures due to missing CUDA bindings or unsupported GPU architectures.
- Memory and Segmentation Fault Crashes: Several issues describe crashes caused by memory corruption, segmentation faults, or invalid memory accesses. These include out-of-bounds errors in flex_attention, memory corruption in torch.linalg.ldl_solve, segmentation faults in XPUPluggableAllocator, and crashes during large model exports or memory visualization.
- Documentation and Typographical Errors: Multiple issues address typos and documentation inaccuracies that could cause confusion or errors. These include misspellings in code and documentation files, incorrect environment variable names, misleading function signatures, and formatting problems in hardware prerequisite tables.
- Installation and Environment Configuration Issues: Some issues report problems related to package installation and environment setup. These include errors caused by mismatched Python environments leading to module not found errors, missing compatible wheels for macOS Python 3.14, and ninja version conflicts in CI environments.
- Release Management and Cherry-Pick Tracking: One issue outlines the process and criteria for managing cherry-picks to the PyTorch 2.10.0 release branch to maintain stability and quality during the release cycle.
- ONNX Export and Model Conversion Failures: Several issues describe errors encountered during exporting models to ONNX format, including unregistered custom types causing runtime errors and undefined variables during decomposition steps.
- API Behavior and Output Inconsistencies: Some issues report inconsistencies in API behavior or output shapes, such as discrepancies in ConvTranspose3d output size formulas and differences in torch.isin output between eager and compiled modes.
- Build and Platform Compatibility Problems: Issues include build failures for Triton XPU on specific Linux distributions due to GLIBC mismatches and missing hash fragments in PyTorch wheel URLs affecting supply chain security.
- User Feedback and Positive Comments: A few issues contain user feedback or positive remarks about documentation clarity without reporting bugs or problems.
- Usage and API Questions: One issue questions how to run exported artifacts from torch.export in C++ following recent release recommendations.
- Indexing and Dynamic Shape Handling: An issue addresses the need to disable 32-bit indexing assumptions in vLLM for dynamic shapes and to analyze unbacked models for appropriate indexing based on tensor size limits.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 249
Key Open Pull Requests
1. [reland][ROCm] remove caffe2 from hipify: This pull request relands the removal of caffe2 from the hipify tool in the ROCm project by eliminating all "MasqueradingAsCUDA" files and classes and avoiding renaming "CUDA" classes to "HIP," addressing previous infrastructure issues and incorporating multiple fixes, updates, and mapping improvements to ensure compatibility and build stability.
- URL: pull/172796
- Associated Commits: 4cb19, 5d694, c5a07, 7d7e3, c6f2c, e538b, f0fca, c55ca, 52f55, 7c7be, 6c7cb, 0f0a5, 43be9, fced1, 1bd02, 9cd91, b25bf, 84388, d21a7, fa4ea, bfc83, 3f702, dd3ca, 9b608, 03a40, be812, e7838, 9cf12, 3c4c1, b09bb, c3f73, ee1f7, 71e55, dca58, 5a319, b8641, 6b41f, e23ec, 1970c, 1bfb1, 11e1c, 64210, 7dcbb, 7d2af, e3652, a7c55, 3d3e4, 82a60, 6b226, 47962, f48f0, e259e, 7b449, a15de, 5a66c, 76c18, 64325, 6e55c, 8e093, 7a60c, d34f4, 68086, 61db1
2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request implements the InputObserver to automatically infer dynamic shapes for torch.export.export and torch.onnx.export by analyzing multiple input sets with varying dimensions, addressing challenges in handling nested input structures like DynamicCache.
- URL: pull/172838
- Associated Commits: 39f8f, 066b8, 1d998, c0fbd, 21b9f, 9b5cf, 52209, 50b08, f66ac, e3c59, 3bf4f, 70e3f, a7524, fa4bb, 5cccd, a74c0, a1e3a, 9324a, 7ab9d, d6760, 00fd4, 04685
3. [wip][dtensor][invoke_subgraph] Support Dtensor inputs to invoke subgraph: This pull request aims to add support for Dtensors as inputs to the invoke_subgraph function in PyTorch, enabling tensor subclass compatibility within subgraph invocation.
- URL: pull/172970
- Associated Commits: 2c824, d2cdc, b26a9, 9775e, 7dd0c, 462f6, 0b801, 8c228, 9950b, 992c8, 773d5, 5ca05, a5b01, 2ac89, bb88b, d0270, c4f81, 83b04, f8a04, 9e51e
Other Open Pull Requests
- Module tracing and hooks decoration: This pull request enables users to decorate the forward method of nn.Module with @leaf_function to prevent PyTorch Dynamo from tracing into it, and experiments with annotating module hooks similarly to avoid tracing. These changes help control tracing behavior in PyTorch's dynamic graph mode.
pull/172692
- ONNX export improvements: Support for exporting the torch.ops.higher_order.invoke_subgraph operator to ONNX was added by creating direct function call nodes to preserve nested compiled functions without inlining, including output renaming to avoid runtime conflicts. Additionally, a fix was made to prevent crashes when exporting torch.cdist with dynamic input shapes by using a safer fallback computation.
pull/172715, pull/172758
- New PyTorch operations for power sums: Introduced torch.linalg.powsum and torch._foreach_powsum operations that compute the sum of absolute values raised to a power without taking the final root, enabling efficient distributed power sum computations by allowing partial sums to be reduced across shards before applying the root (see the sketch after this list).
pull/172685
- DTensor sharding rule discovery and fallback: A comprehensive DTensor sharding rule discovery harness was introduced for PyTorch operators, including infrastructure for enumeration, validation, and documentation, plus new sharding rules for operators like cdist and kron. A fallback mechanism for the DTensor shard_dim_alltoall operation was also implemented using an allgather plus chunk approach to address the lack of native alltoall support.
pull/172779, pull/172890
- Fixes for dynamic dispatch and indexing errors: Addressed a dynamic dispatch error in the _exec_fft function by adding support for unbacked views in reshape-related functions, and fixed two bugs related to int64 indexing when dimension products exceed 65,000, correcting data type usage and overflow issues.
pull/172717, pull/172925
- Build system and CI improvements for ROCm and AMD: Updated the PyTorch build system to use CMake's native HIP support for ROCm, improving compiler separation and consistency with CUDA builds, and introduced testing for MI250 runners in shadow mode within ROCm CI workflows by updating configurations and fixing syntax.
pull/172775, pull/172977
- CUDA memory snapshot optimization: Added an option to speed up CUDA memory snapshots by excluding trace entries and annotations, significantly reducing snapshot time while still capturing current memory state, achieving speedups over 3000x with large trace histories.
pull/172672
- Lint job fixes and triage automation: Fixed the PyTorch lint job by applying patched linux_job updates, switching test versions, and stabilizing linter tests, and introduced automation to label PRs with reviewers as "triaged" to reduce triage workload without changing codebase behavior.
pull/172818, pull/172676
- Functorch partitioner and backward dependency fixes: Forced saving of torchcomms outputs in the functorch partitioner to ensure backward operations have access to forward tensor outputs, preventing invalid partition dependencies, and fixed the partitioner to correctly handle multi-output nodes returned as grad_output with new tests.
pull/172889, pull/172878
- Deprecation of MAGMA backend: Deprecated the MAGMA backend for both the singular value decomposition (svd) and ldl_factor functions, unconditionally dispatching these operations to the cuSOLVER backend, and added a deprecation warning triggered when the linear algebra backend is retrieved or set to MAGMA.
pull/172824, pull/172825, pull/172823
- Partial reinplacement support for mutating arguments: Added support for specifying multiple mutating arguments to enable partial reinplacement when at least one argument can be reinplaced in-place without requiring a clone, improving in-place operation flexibility.
pull/172782
- AMP gradient state enhancements: Introduced an updated state in the optimizer state and a new get_stage method in the GradScaler to enhance AMP gradient state management, along with related refactoring and unit tests to fix issue #67590.
pull/172718
- SymPy and Inductor crash prevention: Modified SymPy within PyTorch's Inductor to treat the Identity element as non-numeric in Min/Max operations to prevent comparability crashes during substitution, including a regression test for this case.
pull/172678
- Code cleanup in ParamsHash: Removed redundant code from ParamsHash.h by eliminating duplicate move operations, consolidating hashing logic duplicated in ParamsWrapperHash, and correcting memcmp usage to accept void* inputs instead of typed pointers.
pull/172707
- Data-dependent error fixes: Addressed and fixed a data-dependent error in the PyTorch codebase, including updates and corrections to related tests to ensure proper functionality.
pull/172786
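For the power sum operations described above, a sketch of the stated semantics using existing ops (the powsum helper below is illustrative, not the proposed API):

```python
import torch

def powsum(x, p):
    # Stated semantics: sum of absolute values raised to p, without the final root.
    return x.abs().pow(p).sum()

# Because the root is deferred, per-shard partial sums can be reduced first
# (e.g. all-reduced across ranks) and the root applied once at the end.
shards = [torch.randn(128) for _ in range(4)]   # stand-in for sharded data
partial = torch.stack([powsum(s, 2.0) for s in shards]).sum()
global_norm = partial.pow(1.0 / 2.0)

print(torch.allclose(global_norm, torch.linalg.vector_norm(torch.cat(shards))))
```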
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 249
Key Closed Pull Requests
1. [DRAFT] Fix gloo sync issue for non default stream: This pull request addresses a synchronization issue in the Gloo backend for PyTorch's distributed data parallel tests by capturing and synchronizing on the original non-default CUDA stream during the allreduce operation, thereby fixing a bug where the default stream was incorrectly used instead of the intended non-default stream.
- URL: pull/172952
- Associated Commits: cd2c9, ce928, c31a8, 15238, 0ac9f, 10769, 44baf, aebf4, 7f8ba, 9718a, baab5, ffa6f, bc158, 76beb, b1aae, 25d8c, a576d, 35c55, ddd50, f83cf, 7cf37, 57979, 4966d, 47cb4, 715dc, 6c058, 1dadb, 5322d, be29c, 7d024, 96f0c, 300ba, 9952b, c0577, b0dc9, 87c5d, 132d9, 0154c, 49dab, fc8bf, 824d5, 63da9, 57dc6, f9e49, 7cadf, 5340e, 99930, daa3d, d7a70, 37e26, 45e25, 11f77, 709f4, b64fc, a2c77, 20100, d1b63, 22d46, 21fec, a21a4, 72cf4, 005e3, e70d9, 71282, a5fea, f227c, 3abee, 3e8a0, 764f6, 881c2, 1cd83, 31c72, 10b50, d6e84, 017d8, 2f638, fd364, c74f0, 3b573, d4c43, b0154, 42f0c, 6f12b, 26e02, 0fabc, 4dca4, c1908, e9e3d, 9315f, cbe1a, 32e37, e2f6f, 3dead, e0c8f, 59337, d29de, fc561, 76335, 38e8b, 13f1b, 8e83e, f190b, ba863, 6bc3d, 4aca6, 49046, 5b9f0, a0614, 3d27d, 8f658, e6bcb, 9976b, 68772, f36c7, 5811a, b0025, d3816, 05532, 37029, 78c49, aa609, 57c7f, d5cdc, a955f, dd14b, 144ed, 3a2af, f49a1, 2a2ce, e5b4c, 8b238, db04f, 270fc, ec73c, 345bd, fbe15, f42bf, 08dbe, 90b73, 133f2, db4e3, caf34, 55b57, 82da5, d6fa3, 839c2, bf92b, d6727, ff68c, e1820, 8ed5a, 83a04, dd64d, c4f04, 69680, 80e0f, ddb0b, 2bfd9, ff9e1, a7bdc, 61872, 932a9, e7d49, b2f5d, aef08, a2b0f, a6a30, 6ecd7, f363a, 10201, b8d38, 84631, 4e285, 351ff, 9c8d5, e9d71, 2c520, 74f43, 4ff52, f6ba4, 79f4e, 34002, c8423, 0e8d4, 150de, b71cb, 7e194, 07b35
2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request proposes the implementation of an InputObserver to automatically infer dynamic shapes for torch.export.export and torch.onnx.export, addressing the complexity of setting dynamic shapes when inputs contain nested structures by computing them based on multiple input sets with varying dimensions.
- URL: pull/172835
- Associated Commits: 22dd1, 11523, e7ab7, 882d5, 249f4, 0b004, 44999, 0f250, 0caa1, ad6bb, 04f01, f169d, dd484, d90f9, 7b457, 45937, 402a3, 1da86, acc6d, ed383, f3aa4, bc23f, 4decd, 69e32, 773b7, 59f15, 082f4, 0880a, 23c60, c11eb, 7e57f, dd657, 0afd8, eb3f6, e3342, fba31, bf8cb, 1fb07, 9e440, bef07, 069b8, 9204d, fec5a, b7da8, 8944a, 326d0, 12a9c, 6a63b, 966c9, ea575, e0c1b, 75204
3. flex document masking respects block sizes: This pull request improves the document masking mechanism by adding offsets and limits to ensure each document mask is aligned to block sizes, making the masking batch invariant and enhancing performance by reducing interference between neighboring documents when sequence lengths exceed the block size.
- URL: pull/172464
- Associated Commits: b259b, dc3d3, 0b9a2, 9ad54, 25d3b, ee0c4, 2ab5d, 70bae, 9b5a3, d3d34, fccb2, 67c51, 92bde, d6bb6, 68aca, 0b53f, a15f6, 9a96a, 512f9, 3e4da
Other Closed Pull Requests
- Flash Attention 3 (fa3) wheel building and configuration: This topic covers pull requests focused on building and configuring Flash Attention 3 wheels for Linux, including adjustments to CUDA versions, compiler settings, installation scripts, and wheel naming to ensure compatibility and successful deployment. These changes enable proper packaging and installation of fa3 on supported systems.
- ROCm unit test fixes and GPU compatibility adjustments: These pull requests update and fix ROCm-specific unit tests by extending test skips across all ROCm architectures and correcting grid size expectations to differentiate between MI300 and other AMD GPUs. They also fix test failures by replacing hardcoded Nvidia-specific values with dynamic device properties to accommodate AMD ROCm GPU characteristics.
- Metal Performance Shaders (MPS) OpInfo skips and dtype specifications: Multiple pull requests add OpInfo skips and specify data types for MPS operations to improve testing and compatibility. These include adding expected failure skips for unsupported CPU dtypes, extending test_output matching to support upcasts, and replacing expected failures with appropriate markers for MPS-specific dtypes.
- DTensor enhancements and redistribution semantics: These pull requests improve DTensor functionality by enabling a single-dimension strategy for addmm and baddbmm operations, filtering out incompatible output strategies in expand_to_full_mesh_op_strategy, supporting redistribution of uneven _StridedShard tensors with device order through Replicate, and clarifying left-to-right ordering of partial redistributions with added tests and bug fixes.
- Element-wise tensor list operations with scalar tensors: This pull request implements new functions _foreach_add.ScalarTensorList and _foreach_sub.ScalarTensorList to efficiently support element-wise addition and subtraction of tensor lists with corresponding scalar tensors. These functions achieve bitwise equivalence with the standard in-place add_ operation and improve performance for optimizer patterns using per-parameter step sizes (see the sketch after this list).
- BF16 GEMM backend selection heuristic: This pull request evaluates BF16 GEMM shapes by comparing BGEMM and oneDNN, builds a ground-truth performance dataset, and trains a decision-tree model to create a simple rule-based heuristic. The heuristic achieves approximately 99.9% oracle accuracy, slightly improves average runtime, and significantly reduces worst-case slowdown from about 36× to 2.5×.
- Fused multiply-add (FMA) optimizations for addcmul: These pull requests propose adding FMA lowerings for the addcmul operations within the Inductor backend and improve torch.addcmul by using FMA when possible to ensure bitwise-identical results with torch.add for float32 tensors with alpha=1 (see the sketch after this list). One proposal was not merged, while the other addresses precision and testing considerations.
- varlen_attn module updates: These pull requests propose adding sliding window attention to the varlen_attn function with positional optional arguments for future compatibility and deprecate the is_causal flag in favor of using window_size = (-1, 0) to express causal attention.
- Dynamic tensor shape consistency with duck_shape_id: This pull request introduces an optional duck_shape_id parameter to the mark_unbacked function that groups unbacked tensor dimensions sharing the same identifier to use the same unbacked symbol. This enables runtime assertions to ensure consistent sizes for dynamic batch sizes, supporting use cases like vLLM's dynamic input lengths and torchbench benchmarks.
- CPython 3.13 test failure fixes: This pull request addresses and fixes test failures in CPython 3.13 reported in the dynamo-unittest job for the PyTorch project, ensuring compatibility with the new Python version.
- Typechecking enablement via suppression comment updates: This pull request enables typechecking for previously excluded files by updating suppression comment formats to be compatible with a recent usort upgrade, allowing these files to be checked without errors.
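For the element-wise tensor list operations described above, a sketch of the optimizer pattern they target, written as a plain loop (the fused _foreach_add.ScalarTensorList overload would replace this loop):

```python
import torch

# Per-parameter scalar step sizes stored as 0-d tensors, as in some optimizers.
params = [torch.randn(10), torch.randn(20)]
grads = [torch.randn(10), torch.randn(20)]
steps = [torch.tensor(0.01), torch.tensor(0.001)]

# Unfused reference: one in-place add_ per parameter. The new overloads aim to
# perform this in a single fused call while matching add_ bitwise.
for p, g, s in zip(params, grads, steps):
    p.add_(g, alpha=s.item())
```

For the addcmul FMA pull requests above, a sketch of the bitwise comparison they discuss (addcmul computes input + value * tensor1 * tensor2; the equality check below is the PRs' stated goal, not a guarantee on any particular build):

```python
import torch

a = torch.randn(1024)
b = torch.randn(1024)
c = torch.randn(1024)

out = torch.addcmul(a, b, c, value=1.0)
ref = torch.add(a, b * c)

# The PRs aim for bitwise equality (torch.equal), not just allclose, for
# float32 inputs with value/alpha equal to 1.
print(torch.equal(out, ref), torch.allclose(out, ref))
```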
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| BenjaminDEMAILLE | 87 | 53 | 0 | 6 |
| pianpwk | 107 | 18 | 0 | 2 |
| laithsakka | 77 | 14 | 1 | 26 |
| wconstab | 97 | 13 | 0 | 2 |
| Copilot | 12 | 2 | 0 | 97 |
| dolpm | 83 | 12 | 0 | 11 |
| guangyey | 91 | 7 | 0 | 3 |
| malfet | 55 | 8 | 1 | 37 |
| ezyang | 60 | 19 | 1 | 20 |
| kurtamohler | 88 | 11 | 0 | 0 |