Weekly GitHub Report for PyTorch: January 16, 2026 - January 23, 2026 (21:06:18)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda package publishing.
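Two of these release changes are easy to illustrate. The snippet below is a minimal sketch assuming PyTorch 2.6 or newer: torch.load now defaults to weights_only=True, so loading checkpoints that contain arbitrary pickled objects requires an explicit opt-out, and torch.compiler.set_stance switches compilation behavior at runtime.

```python
import torch

# PyTorch 2.6 flips torch.load's default to weights_only=True; checkpoints
# containing arbitrary pickled objects now need an explicit opt-out.
torch.save({"w": torch.randn(3)}, "ckpt.pt")
state = torch.load("ckpt.pt")                       # weights_only=True by default
legacy = torch.load("ckpt.pt", weights_only=False)  # old behavior, trusted files only

# torch.compiler.set_stance controls dynamic compilation behavior, e.g.
# temporarily forcing eager execution without recompiling.
@torch.compile
def f(x):
    return x * 2

torch.compiler.set_stance("force_eager")
f(torch.randn(4))   # runs eagerly
torch.compiler.set_stance("default")
f(torch.randn(4))   # compiled as usual
```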
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
-
[TRIAGED] [MODULE: CUDA GRAPHS] [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] [MODULE: INDUCTOR] [VLLM-COMPILE] [MODULE: VLLM] [torch.compile][Bug] Unbacked SymInt leaks into a cudagraph-safe partitions: This issue describes a bug where an unbacked symbolic integer (SymInt) returned from a cudagraph-unsafe custom operation is incorrectly propagated into a cudagraph-safe partition when using graph partitioning in PyTorch's Inductor backend. This causes the partition to use data-dependent shapes in CUDA allocation and pointer offsets, violating cudagraph safety guarantees and leading to incorrect graph captures.
- The comments discuss potential fixes including adding an API to mark specific SymInts as cudagraph-safe or unsafe, the challenges of distinguishing safe versus unsafe dynamic shapes, and the need for Inductor to automatically partition graphs to exclude operations dependent on unsafe SymInts; a pull request addressing the issue has been raised and further design considerations about API naming and behavior were debated.
- Number of comments this week: 15
-
[HIGH PRIORITY] [TRIAGE REVIEW] [MODULE: AUTOGRAD] [TRIAGED]
backward does not respect with torch.device(device): This issue reports that the backward pass in PyTorch does not respect the with torch.device(device) context manager because the device context is thread-local and the backward computation runs on a separate thread, causing tensors created during backward to default to the CPU instead of the intended device. The discussion clarifies that this behavior is not a regression from a recent change, highlights limitations of TorchFunctionMode in intercepting backward operations, and notes that TorchDispatchMode may be required for reliable device control during backward (a minimal repro sketch appears after this list).
- The comments discuss whether this is a regression and conclude it is not tied to a specific PR, explain the thread-local nature of device contexts and limitations of TorchFunctionMode in backward, mention related issues and use cases, and suggest that TorchDispatchMode is necessary for proper interception and device management during backward passes.
- Number of comments this week: 8
-
[TRIAGED] [MODULE: XPU] XPU: x.item() fails on Intel Arc Pro B50 with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY; UR shows urEnqueueUSMMemcpy(size=4): This issue reports a runtime error occurring on Intel Arc Pro B50 and B580 GPUs when calling the .item() method on an XPU tensor in PyTorch, resulting in a UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY from the Level Zero backend during a USM memory copy operation. The problem appears related to device-to-host memory copying, with attempts to reproduce and diagnose it revealing that even copying the tensor to CPU memory fails, and the issue may be linked to recent driver or runtime changes rather than PyTorch versions (a minimal repro sketch appears after this list).
- Commenters confirmed the error on multiple Intel Arc GPUs and suggested it might be caused by invalid USM pointers or driver issues; attempts to downgrade PyTorch did not resolve the problem, and testing with the latest Intel drivers on Fedora Rawhide showed the issue was fixed, indicating a driver update as a potential solution.
- Number of comments this week: 8
-
[ONCALL: PT2] [TRIAGED] [MODULE: FAKETENSOR] [MODULE: DECOMPOSITIONS] [MODULE: PT2-DISPATCHER] [MODULE: DYNAMIC SHAPES]
torch.compile fails in FakeTensor meta path: Cannot view ... strides ... as (1, 2048) while eager works: This issue describes a bug where a model using various layers and operations runs correctly in eager mode but fails during torch.compile with the "inductor" backend when evaluated with FakeTensor meta tensors, specifically failing on a view operation due to stride incompatibility. The root cause was identified as a mismatch between the C++ and Python implementations of a function detecting channels-last memory format from strides, which led to incorrect stride interpretation and a subsequent failure in the FakeTensor path; a fix was proposed and confirmed to resolve the issue.
- The comments include initial identification of the problem as a reshape/view decomposition error, detailed stride and shape comparisons between eager and FakeTensor modes, discovery of a mismatch in memory format detection logic between C++ and Python, a proposed fix adjusting stride checks, and confirmation from the original reporter that the fix resolves the issue with consistent outputs in both eager and compiled modes.
- Number of comments this week: 6
-
[ONCALL: DISTRIBUTED] [PIPELINE PARALLELISM] Pipeline communication blocks the execution of pipeline stages: This issue addresses a problem in the pipeline scheduling of distributed training where communication operations block the execution of pipeline stages, particularly on AMD hardware. The current greedy scheduling algorithm causes a communication receive operation to block computation, leading to pipeline bubbles, and the user requests an option to reorder these operations to avoid overlap and interference.
- The comments discuss testing a proposed patch that reorders receive operations to reduce blocking, showing improved performance in single-node tests but encountering deadlocks in multi-node setups; attempts to resolve deadlocks by swapping send/receive order on odd ranks were unsuccessful, and further investigation into better heuristics was promised.
- Number of comments this week: 5
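For the backward/torch.device issue above, a minimal sketch of the reported behavior (assuming a CUDA device is available; the tensor created inside backward is the illustrative part):

```python
import torch

class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad):
        # The autograd engine runs this on a worker thread, so the thread-local
        # `with torch.device(...)` context active on the caller's thread does
        # not apply here.
        extra = torch.ones(1)
        print("tensor created in backward lives on:", extra.device)  # reportedly cpu
        return grad

x = torch.randn(3, device="cuda", requires_grad=True)
with torch.device("cuda"):
    Double.apply(x).sum().backward()
```

For the XPU x.item() issue above, the reported failure reduces to roughly the following sketch (assuming an Intel GPU visible through the XPU backend; on affected driver versions the small device-to-host copy is what raises the error):

```python
import torch

# Reported repro shape: a tiny device-to-host copy triggered by .item()
# fails on Intel Arc Pro B50/B580 with certain driver/runtime versions.
assert torch.xpu.is_available()
x = torch.tensor([1.0], device="xpu")
print(x.item())   # reportedly raises UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
print(x.cpu())    # the plain copy to CPU was reported to fail as well
```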
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 74
Summarized Issues:
- CUDA and GPU Architecture Support: Several issues address CUDA compatibility and GPU architecture problems, including updating the CUDA support matrix to make CUDA 13.0 stable, stripping of accelerated suffixes from GPU architecture flags causing broken NVIDIA features, and a test failure on GB200 (aarch64) due to missing valid CUDATemplateCaller choices. These problems affect build configurations, feature support, and test reliability across different CUDA versions and GPU platforms.
- torch.compile and Inductor Backend Bugs: Multiple issues describe bugs in torch.compile and the Inductor backend, including incorrect tensor shape inference, NotImplementedErrors with MKLDNN tensors, dynamic compilation errors with strided tensors, dtype-changing view miscompilations, and failures involving complex operations or higher-order gradients. These bugs cause silent correctness errors, crashes, or recompilations, impacting model compilation and execution fidelity.
- Distributed and DTensor Issues: Several problems relate to distributed tensor operations and DTensor behavior, including pipeline scheduling blocking on AMD hardware, incorrect global shapes from unevenly sharded DTensors, leaking unbacked SymInt into cudagraph-safe partitions, gradient type mismatches for unused DTensor outputs, and errors with Shard(1) placement in tensor parallel layers. These issues cause performance bottlenecks, incorrect computations, and failures in distributed or parallel tensor workflows.
- Segmentation Faults and Crashes: Multiple reports describe segmentation faults and crashes occurring in various contexts, such as loading compiled AOT artifacts with constants, importing cudagraph_trees module, subclassing OpaqueBase, nightly build updates causing segfaults in ComfyUI, and unsafe crashes compiling rnn_tanh operator. These faults disrupt normal operation and require debugging to ensure stability.
- Randomness and Reproducibility Issues: There are issues with inconsistent random tensor generation and unexpected stride changes, including inconsistent random tensors between single- and multi-GPU setups despite fixed seeds, and unexpected stride changes when using torch.autograd.Function.apply. These problems affect reproducibility and tensor layout assumptions.
- ONNX and Exporting Problems: Issues include metadata loss when exporting models with torch.onnx.export using dynamo=True, incorrect input names in Dynamo ONNX exporter outputs, and failures exporting models due to symbolic length errors or deprecated torch.jit usage. These affect model portability and export correctness.
- Memory and Allocation Concerns: Problems include requests for FP8 support in symmetric memory to reduce memory inflation, leveraging unified memory for Apple Silicon MPS backend to reduce data transfer costs, and NCCL communicator requiring explicit all_reduce calls to avoid illegal memory access. These issues impact memory efficiency and correctness in distributed and device-specific contexts.
- Documentation and Build Issues: Several issues report documentation typos, broken links, inconsistent buffer size defaults, missing semicolons causing build errors, and CI testing gaps for install script changes. These affect developer experience and build reliability.
- Performance and Tuning Problems: Issues include slow torch.randn on CPU with bfloat16, the coordinate descent tuner performing unnecessary expensive operations despite a warm cache, and the need to readjust fudge factors in Inductor unit tests due to improved math backend precision. These affect runtime efficiency and test stability.
- API and Functionality Inconsistencies: Problems include torch.bmm not accepting the documented default out_dtype=None, the backward function ignoring device context due to thread-locality, inconsistent behavior of FP8 casting between CPU and CUDA backends, and torch.bucketize producing inconsistent results for NaNs between eager and Inductor modes (see the sketch after this list). These inconsistencies cause confusion and correctness issues.
- Model Loading and State Dict Enhancements: There is a feature request to add an optional parameter to model.load_state_dict to allow partial tensor filling during non-strict loading, enabling pretrained weights to be partially copied into expanded weight matrices while initializing the remaining parts. This would improve flexibility in model weight transfer; a sketch of the current manual workaround appears after this list.
- PrivateUse1 and Backend Compatibility: Issues discuss problems with PrivateUse1 backends including test instantiation failures, assignment of PrivateUse1 tensors to CPU tensor data attributes, and enabling FlexAttention device validation to allow PrivateUse1 devices. These affect experimental or third-party backend integration.
- Quantization and Numerical Accuracy Bugs: A bug in UniformQuantizationObserverBase causes incorrect scale and zero_point calculations, and TorchInductor produces incorrect results when combining aten.add with aten.hardswish_backward, leading to numerical errors and output mismatches.
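For the torch.bucketize NaN inconsistency noted in the API and Functionality Inconsistencies item, a minimal repro sketch (the exact inputs from the issue are not reproduced here; this only shows the comparison pattern):

```python
import torch

boundaries = torch.tensor([0.0, 1.0, 2.0])
values = torch.tensor([0.5, float("nan")])

eager = torch.bucketize(values, boundaries)
compiled_fn = torch.compile(lambda v, b: torch.bucketize(v, b))
inductor = compiled_fn(values, boundaries)

# The issue reports that the bucket assigned to the NaN element can differ
# between the eager and Inductor-compiled results.
print(eager, inductor)
```

For the load_state_dict feature request above, a sketch of the manual workaround that the proposed partial-fill parameter would replace (the sizes are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical case: a pretrained embedding (1000 x 64) copied into an expanded
# embedding (1200 x 64); the extra 200 rows keep their fresh initialization.
pretrained = nn.Embedding(1000, 64)
expanded = nn.Embedding(1200, 64)

with torch.no_grad():
    expanded.weight[: pretrained.num_embeddings].copy_(pretrained.weight)

# The requested option would let model.load_state_dict(..., strict=False)
# perform this partial copy itself instead of requiring manual slicing.
```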
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 43
Summarized Issues:
- Test Failures and Runtime Errors on Specific Platforms: Multiple issues report test failures and runtime errors related to platform-specific hardware or software configurations. These include test failures on the xpu platform, runtime errors with Inductor GEMM backend on NVIDIA B300 GPUs, segmentation faults due to device tensor pointer mismatches, and failures on CUDA-enabled H100 machines during memory efficient attention backward passes.
- Compilation and Recompilation Issues with torch.compile: Several issues describe bugs and assertion errors occurring during compilation or recompilation with torch.compile. Problems include excessive recompilations caused by forward hooks on submodules, failures with effectful operations during backward passes, and inconsistent behavior between eager and compiled modes.
- CUDA and GPU Compatibility Problems: There are multiple reports of compatibility issues involving CUDA versions and GPU architectures. These include failures of tests on SM90 GPUs with CUDA versions earlier than 12.9, user warnings for unsupported CUDA capabilities (12.1), missing operator implementations on MPS devices, and installation failures due to missing CUDA bindings or unsupported GPU architectures.
- Memory and Segmentation Fault Crashes: Several issues describe crashes caused by memory corruption, segmentation faults, or invalid memory accesses. These include out-of-bounds errors in flex_attention, memory corruption in torch.linalg.ldl_solve, segmentation faults in XPUPluggableAllocator, and crashes during large model exports or memory visualization.
- Documentation and Typographical Errors: Multiple issues address typos and documentation inaccuracies that could cause confusion or errors. These include misspellings in code and documentation files, incorrect environment variable names, misleading function signatures, and formatting problems in hardware prerequisite tables.
- Installation and Environment Configuration Issues: Some issues report problems related to package installation and environment setup. These include errors caused by mismatched Python environments leading to module not found errors, missing compatible wheels for macOS Python 3.14, and ninja version conflicts in CI environments.
- Release Management and Cherry-Pick Tracking: One issue outlines the process and criteria for managing cherry-picks to the PyTorch 2.10.0 release branch to maintain stability and quality during the release cycle.
- ONNX Export and Model Conversion Failures: Several issues describe errors encountered during exporting models to ONNX format, including unregistered custom types causing runtime errors and undefined variables during decomposition steps.
- API Behavior and Output Inconsistencies: Some issues report inconsistencies in API behavior or output shapes, such as discrepancies in ConvTranspose3d output size formulas and differences in torch.isin output between eager and compiled modes.
- Build and Platform Compatibility Problems: Issues include build failures for Triton XPU on specific Linux distributions due to GLIBC mismatches and missing hash fragments in PyTorch wheel URLs affecting supply chain security.
- User Feedback and Positive Comments: A few issues contain user feedback or positive remarks about documentation clarity without reporting bugs or problems.
- Usage and API Questions: One issue questions how to run exported artifacts from torch.export in C++ following recent release recommendations.
- Indexing and Dynamic Shape Handling: An issue addresses the need to disable 32-bit indexing assumptions in vLLM for dynamic shapes and to analyze unbacked models for appropriate indexing based on tensor size limits.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 249
Key Open Pull Requests
1. [reland][ROCm] remove caffe2 from hipify: This pull request relands the removal of caffe2 from the hipify tool in the ROCm project by eliminating all "MasqueradingAsCUDA" files and classes and avoiding renaming "CUDA" classes to "HIP," addressing previous infrastructure issues and incorporating multiple fixes, updates, and mapping improvements to ensure compatibility and build stability.
- URL: pull/172796
- Associated Commits: 4cb19, 5d694, c5a07, 7d7e3, c6f2c, e538b, f0fca, c55ca, 52f55, 7c7be, 6c7cb, 0f0a5, 43be9, fced1, 1bd02, 9cd91, b25bf, 84388, d21a7, fa4ea, bfc83, 3f702, dd3ca, 9b608, 03a40, be812, e7838, 9cf12, 3c4c1, b09bb, c3f73, ee1f7, 71e55, dca58, 5a319, b8641, 6b41f, e23ec, 1970c, 1bfb1, 11e1c, 64210, 7dcbb, 7d2af, e3652, a7c55, 3d3e4, 82a60, 6b226, 47962, f48f0, e259e, 7b449, a15de, 5a66c, 76c18, 64325, 6e55c, 8e093, 7a60c, d34f4, 68086, 61db1
2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request implements the InputObserver to automatically infer dynamic shapes for torch.export.export and torch.onnx.export by analyzing multiple input sets with varying dimensions, addressing challenges in handling nested input structures like DynamicCache.
- URL: pull/172838
- Associated Commits: 39f8f, 066b8, 1d998, c0fbd, 21b9f, 9b5cf, 52209, 50b08, f66ac, e3c59, 3bf4f, 70e3f, a7524, fa4bb, 5cccd, a74c0, a1e3a, 9324a, 7ab9d, d6760, 00fd4, 04685
3. [wip][dtensor][invoke_subgraph] Support Dtensor inputs to invoke subgraph: This pull request aims to add support for Dtensors as inputs to the invoke_subgraph function in PyTorch, enabling tensor subclass compatibility within subgraph invocation.
- URL: pull/172970
- Associated Commits: 2c824, d2cdc, b26a9, 9775e, 7dd0c, 462f6, 0b801, 8c228, 9950b, 992c8, 773d5, 5ca05, a5b01, 2ac89, bb88b, d0270, c4f81, 83b04, f8a04, 9e51e
Other Open Pull Requests
- Module tracing and hooks decoration: This pull request enables users to decorate the forward method of nn.Module with @leaf_function to prevent PyTorch Dynamo from tracing into it, and experiments with annotating module hooks similarly to avoid tracing. These changes help control tracing behavior in PyTorch's dynamic graph mode.
pull/172692
- ONNX export improvements: Support for exporting the torch.ops.higher_order.invoke_subgraph operator to ONNX was added by creating direct function call nodes to preserve nested compiled functions without inlining, including output renaming to avoid runtime conflicts. Additionally, a fix was made to prevent crashes when exporting torch.cdist with dynamic input shapes by using a safer fallback computation.
pull/172715, pull/172758
- New PyTorch operations for power sums: Introduced torch.linalg.powsum and torch._foreach_powsum operations that compute the sum of absolute values raised to a power without taking the final root, enabling efficient distributed power sum computations by allowing partial sums to be reduced across shards before applying the root (see the sketch after this list).
pull/172685
- DTensor sharding rule discovery and fallback: A comprehensive DTensor sharding rule discovery harness was introduced for PyTorch operators, including infrastructure for enumeration, validation, and documentation, plus new sharding rules for operators like cdist and kron. A fallback mechanism for the DTensor shard_dim_alltoall operation was also implemented using an allgather plus chunk approach to address the lack of native alltoall support.
pull/172779, pull/172890
- Fixes for dynamic dispatch and indexing errors: Addressed a dynamic dispatch error in the _exec_fft function by adding support for unbacked views in reshape-related functions, and fixed two bugs related to int64 indexing when dimension products exceed 65,000, correcting data type usage and overflow issues.
pull/172717, pull/172925
- Build system and CI improvements for ROCm and AMD: Updated the PyTorch build system to use CMake's native HIP support for ROCm, improving compiler separation and consistency with CUDA builds, and introduced testing for MI250 runners in shadow mode within ROCm CI workflows by updating configurations and fixing syntax.
pull/172775, pull/172977
- CUDA memory snapshot optimization: Added an option to speed up CUDA memory snapshots by excluding trace entries and annotations, significantly reducing snapshot time while still capturing current memory state, achieving speedups over 3000x with large trace histories.
pull/172672
- Lint job fixes and triage automation: Fixed the PyTorch lint job by applying patched linux_job updates, switching test versions, and stabilizing linter tests, and introduced automation to label PRs with reviewers as "triaged" to reduce triage workload without changing codebase behavior.
pull/172818, pull/172676
- Functorch partitioner and backward dependency fixes: Forced saving of torchcomms outputs in the functorch partitioner to ensure backward operations have access to forward tensor outputs, preventing invalid partition dependencies, and fixed the partitioner to correctly handle multi-output nodes returned as grad_output with new tests.
pull/172889, pull/172878
- Deprecation of MAGMA backend: Deprecated the MAGMA backend for both the singular value decomposition (svd) and ldl_factor functions, unconditionally dispatching these operations to the cuSOLVER backend, and added a deprecation warning triggered when the linear algebra backend is retrieved or set to MAGMA.
pull/172824, pull/172825, pull/172823
- Partial reinplacement support for mutating arguments: Added support for specifying multiple mutating arguments to enable partial reinplacement when at least one argument can be reinplaced in-place without requiring a clone, improving in-place operation flexibility.
pull/172782
- AMP gradient state enhancements: Introduced an updated state in the optimizer state and a new get_stage method in the GradScaler to enhance AMP gradient state management, along with related refactoring and unit tests to fix issue #67590.
pull/172718
- SymPy and Inductor crash prevention: Modified SymPy within PyTorch's Inductor to treat the Identity element as non-numeric in Min/Max operations to prevent comparability crashes during substitution, including a regression test for this case.
pull/172678
- Code cleanup in ParamsHash: Removed redundant code from ParamsHash.h by eliminating duplicate move operations, consolidating hashing logic duplicated in ParamsWrapperHash, and correcting memcmp usage to accept void* inputs instead of typed pointers.
pull/172707
- Data-dependent error fixes: Addressed and fixed a data-dependent error in the PyTorch codebase, including updates and corrections to related tests to ensure proper functionality.
pull/172786
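For the power sum operations described above, a sketch of the stated semantics using existing ops (the powsum helper below is illustrative, not the proposed API):

```python
import torch

def powsum(x, p):
    # Stated semantics: sum of absolute values raised to p, without the final root.
    return x.abs().pow(p).sum()

# Because the root is deferred, per-shard partial sums can be reduced first
# (e.g. all-reduced across ranks) and the root applied once at the end.
shards = [torch.randn(128) for _ in range(4)]   # stand-in for sharded data
partial = torch.stack([powsum(s, 2.0) for s in shards]).sum()
global_norm = partial.pow(1.0 / 2.0)

print(torch.allclose(global_norm, torch.linalg.vector_norm(torch.cat(shards))))
```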
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 249
Key Closed Pull Requests
1. [DRAFT] Fix gloo sync issue for non default stream: This pull request addresses a synchronization issue in the Gloo backend for PyTorch's distributed data parallel tests by capturing and synchronizing on the original non-default CUDA stream during the allreduce operation, thereby fixing a bug where the default stream was incorrectly used instead of the intended non-default stream.
- URL: pull/172952
- Associated Commits: cd2c9, ce928, c31a8, 15238, 0ac9f, 10769, 44baf, aebf4, 7f8ba, 9718a, baab5, ffa6f, bc158, 76beb, b1aae, 25d8c, a576d, 35c55, ddd50, f83cf, 7cf37, 57979, 4966d, 47cb4, 715dc, 6c058, 1dadb, 5322d, be29c, 7d024, 96f0c, 300ba, 9952b, c0577, b0dc9, 87c5d, 132d9, 0154c, 49dab, fc8bf, 824d5, 63da9, 57dc6, f9e49, 7cadf, 5340e, 99930, daa3d, d7a70, 37e26, 45e25, 11f77, 709f4, b64fc, a2c77, 20100, d1b63, 22d46, 21fec, a21a4, 72cf4, 005e3, e70d9, 71282, a5fea, f227c, 3abee, 3e8a0, 764f6, 881c2, 1cd83, 31c72, 10b50, d6e84, 017d8, 2f638, fd364, c74f0, 3b573, d4c43, b0154, 42f0c, 6f12b, 26e02, 0fabc, 4dca4, c1908, e9e3d, 9315f, cbe1a, 32e37, e2f6f, 3dead, e0c8f, 59337, d29de, fc561, 76335, 38e8b, 13f1b, 8e83e, f190b, ba863, 6bc3d, 4aca6, 49046, 5b9f0, a0614, 3d27d, 8f658, e6bcb, 9976b, 68772, f36c7, 5811a, b0025, d3816, 05532, 37029, 78c49, aa609, 57c7f, d5cdc, a955f, dd14b, 144ed, 3a2af, f49a1, 2a2ce, e5b4c, 8b238, db04f, 270fc, ec73c, 345bd, fbe15, f42bf, 08dbe, 90b73, 133f2, db4e3, caf34, 55b57, 82da5, d6fa3, 839c2, bf92b, d6727, ff68c, e1820, 8ed5a, 83a04, dd64d, c4f04, 69680, 80e0f, ddb0b, 2bfd9, ff9e1, a7bdc, 61872, 932a9, e7d49, b2f5d, aef08, a2b0f, a6a30, 6ecd7, f363a, 10201, b8d38, 84631, 4e285, 351ff, 9c8d5, e9d71, 2c520, 74f43, 4ff52, f6ba4, 79f4e, 34002, c8423, 0e8d4, 150de, b71cb, 7e194, 07b35
2. Implements InputObserver to guess the dynamic shapes for torch.export.export and torch.onnx.export: This pull request proposes the implementation of an InputObserver to automatically infer dynamic shapes for torch.export.export and torch.onnx.export, addressing the complexity of setting dynamic shapes when inputs contain nested structures by computing them based on multiple input sets with varying dimensions.
- URL: pull/172835
- Associated Commits: 22dd1, 11523, e7ab7, 882d5, 249f4, 0b004, 44999, 0f250, 0caa1, ad6bb, 04f01, f169d, dd484, d90f9, 7b457, 45937, 402a3, 1da86, acc6d, ed383, f3aa4, bc23f, 4decd, 69e32, 773b7, 59f15, 082f4, 0880a, 23c60, c11eb, 7e57f, dd657, 0afd8, eb3f6, e3342, fba31, bf8cb, 1fb07, 9e440, bef07, 069b8, 9204d, fec5a, b7da8, 8944a, 326d0, 12a9c, 6a63b, 966c9, ea575, e0c1b, 75204
3. flex document masking respects block sizes: This pull request improves the document masking mechanism by adding offsets and limits to ensure each document mask is aligned to block sizes, making the masking batch invariant and enhancing performance by reducing interference between neighboring documents when sequence lengths exceed the block size.
- URL: pull/172464
- Associated Commits: b259b, dc3d3, 0b9a2, 9ad54, 25d3b, ee0c4, 2ab5d, 70bae, 9b5a3, d3d34, fccb2, 67c51, 92bde, d6bb6, 68aca, 0b53f, a15f6, 9a96a, 512f9, 3e4da
Other Closed Pull Requests
- Flash Attention 3 (fa3) wheel building and configuration: This topic covers pull requests focused on building and configuring Flash Attention 3 wheels for Linux, including adjustments to CUDA versions, compiler settings, installation scripts, and wheel naming to ensure compatibility and successful deployment. These changes enable proper packaging and installation of fa3 on supported systems.
- ROCm unit test fixes and GPU compatibility adjustments: These pull requests update and fix ROCm-specific unit tests by extending test skips across all ROCm architectures and correcting grid size expectations to differentiate between MI300 and other AMD GPUs. They also fix test failures by replacing hardcoded Nvidia-specific values with dynamic device properties to accommodate AMD ROCm GPU characteristics.
- Metal Performance Shaders (MPS) OpInfo skips and dtype specifications: Multiple pull requests add OpInfo skips and specify data types for MPS operations to improve testing and compatibility. These include adding expected failure skips for unsupported CPU dtypes, extending test_output matching to support upcasts, and replacing expected failures with appropriate markers for MPS-specific dtypes.
- DTensor enhancements and redistribution semantics: These pull requests improve DTensor functionality by enabling a single-dimension strategy for addmm and baddbmm operations, filtering out incompatible output strategies in expand_to_full_mesh_op_strategy, supporting redistribution of uneven _StridedShard tensors with device order through Replicate, and clarifying left-to-right ordering of partial redistributions with added tests and bug fixes.
- Element-wise tensor list operations with scalar tensors: This pull request implements new functions _foreach_add.ScalarTensorList and _foreach_sub.ScalarTensorList to efficiently support element-wise addition and subtraction of tensor lists with corresponding scalar tensors. These functions achieve bitwise equivalence with the standard in-place add_ operation and improve performance for optimizer patterns using per-parameter step sizes (see the sketch after this list).
- BF16 GEMM backend selection heuristic: This pull request evaluates BF16 GEMM shapes by comparing BGEMM and oneDNN, builds a ground-truth performance dataset, and trains a decision-tree model to create a simple rule-based heuristic. The heuristic achieves approximately 99.9% oracle accuracy, slightly improves average runtime, and significantly reduces worst-case slowdown from about 36× to 2.5×.
- Fused multiply-add (FMA) optimizations for addcmul: These pull requests propose adding FMA lowerings for the addcmul operations within the Inductor backend and improve torch.addcmul by using FMA when possible to ensure bitwise-identical results with torch.add for float32 tensors with alpha=1 (see the sketch after this list). One proposal was not merged, while the other addresses precision and testing considerations.
- varlen_attn module updates: These pull requests propose adding sliding window attention to the varlen_attn function with positional optional arguments for future compatibility and deprecate the is_causal flag in favor of using window_size = (-1, 0) to express causal attention.
- Dynamic tensor shape consistency with duck_shape_id: This pull request introduces an optional duck_shape_id parameter to the mark_unbacked function that groups unbacked tensor dimensions sharing the same identifier to use the same unbacked symbol. This enables runtime assertions to ensure consistent sizes for dynamic batch sizes, supporting use cases like vLLM's dynamic input lengths and torchbench benchmarks.
- CPython 3.13 test failure fixes: This pull request addresses and fixes test failures in CPython 3.13 reported in the dynamo-unittest job for the PyTorch project, ensuring compatibility with the new Python version.
- Typechecking enablement via suppression comment updates: This pull request enables typechecking for previously excluded files by updating suppression comment formats to be compatible with a recent usort upgrade, allowing these files to be checked without errors.
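For the element-wise tensor list operations described above, a sketch of the optimizer pattern they target, written as a plain loop (the fused _foreach_add.ScalarTensorList overload would replace this loop):

```python
import torch

# Per-parameter scalar step sizes stored as 0-d tensors, as in some optimizers.
params = [torch.randn(10), torch.randn(20)]
grads = [torch.randn(10), torch.randn(20)]
steps = [torch.tensor(0.01), torch.tensor(0.001)]

# Unfused reference: one in-place add_ per parameter. The new overloads aim to
# perform this in a single fused call while matching add_ bitwise.
for p, g, s in zip(params, grads, steps):
    p.add_(g, alpha=s.item())
```

For the addcmul FMA pull requests above, a sketch of the bitwise comparison they discuss (addcmul computes input + value * tensor1 * tensor2; the equality check below is the PRs' stated goal, not a guarantee on any particular build):

```python
import torch

a = torch.randn(1024)
b = torch.randn(1024)
c = torch.randn(1024)

out = torch.addcmul(a, b, c, value=1.0)
ref = torch.add(a, b * c)

# The PRs aim for bitwise equality (torch.equal), not just allclose, for
# float32 inputs with value/alpha equal to 1.
print(torch.equal(out, ref), torch.allclose(out, ref))
```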
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| BenjaminDEMAILLE | 87 | 53 | 0 | 6 |
| pianpwk | 107 | 18 | 0 | 2 |
| laithsakka | 77 | 14 | 1 | 26 |
| wconstab | 97 | 13 | 0 | 2 |
| Copilot | 12 | 2 | 0 | 97 |
| dolpm | 83 | 12 | 0 | 11 |
| guangyey | 91 | 7 | 0 | 3 |
| malfet | 55 | 8 | 1 | 37 |
| ezyang | 60 | 19 | 1 | 20 |
| kurtamohler | 88 | 11 | 0 | 0 |