Weekly Project News

Archives

Weekly GitHub Report for Pytorch: February 16, 2026 - February 23, 2026 (17:35:31)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements, including torch.compile support for Python 3.13, the new performance tuning API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement that flips the default of torch.load to weights_only=True. The release also ships numerous bug fixes, performance optimizations, and deprecations, such as the discontinuation of official Conda packages.
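The weights_only flip is the change most likely to break existing code: checkpoints containing arbitrary pickled Python objects no longer load by default. A minimal sketch of the new default behavior (the file name and tensor names here are illustrative):

```python
import os
import tempfile

import torch

# A plain tensor state dict round-trips fine under the new default.
state = {"weight": torch.randn(2, 3), "bias": torch.zeros(3)}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(state, path)

# Since 2.6, torch.load defaults to weights_only=True, which restricts
# unpickling to tensors and a small allowlist of safe types.
loaded = torch.load(path, weights_only=True)
assert torch.equal(loaded["weight"], state["weight"])

# Checkpoints holding arbitrary objects need an explicit opt-out, which
# should only be used for trusted files:
#   torch.load(path, weights_only=False)
```

The weights_only keyword itself predates 2.6; the release only flips its default value.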

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [HIGH PRIORITY] [TRIAGE REVIEW] [MODULE: CRASH] [MODULE: ACTIVATION CHECKPOINTING] [ONCALL: PT2] [MODULE: DTENSOR] [MODULE: PT2-DISPATCHER] [MODULE: FLEX ATTENTION] [BOT-TRIAGED] Block mask caching for flex attention and SAC don't play nicely together (RuntimeError: Only Tensors of floating point and complex dtype can require gradients): This issue describes a runtime error occurring when block mask caching in flex attention interacts poorly with selective activation checkpointing (SAC), specifically causing a RuntimeError related to tensors of integer dtype incorrectly requiring gradients during SAC recomputation. The problem arises because a compiled create_block_mask produces a BlockMask with integer tensors that are cached and reused, leading aot_autograd's wrap_tensor_subclasses to attempt reconstructing a DTensor with requires_grad=True on an integer tensor, which is invalid.

    • The comments discuss the difficulty in pinpointing the error, note that the failure is due to branching on global state causing cache mismatches during SAC, and explore a proposed solution involving clearing the cache on recompute which ultimately fails. A user shares a workaround involving a recompute tape mechanism to record and replay cache hits during forward and recompute passes, ensuring consistent behavior and preventing the error.
    • Number of comments this week: 7
  2. [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] MobileBertForMaskedLM is 90% slower with unbacked vs backed !: This issue reports that the MobileBertForMaskedLM model runs approximately 90% slower when using unbacked batch processing compared to backed batch processing, despite recent optimizations related to size hinting. The author is seeking assistance from the Inductor team to further improve performance for unbacked batches, as this work is preparatory for integrating with vLLM and aims to match Huggingface inference speeds for faster iteration.

    • The comments detail a series of incremental code optimizations that progressively improve performance from 1.19x to over 2x speedup, including changes to size hinting heuristics, handling of unbacked symbols, and padding optimizations, with ongoing efforts to finalize fixes for unbacked batch processing.
    • Number of comments this week: 5
  3. [TRIAGED] [FUNCTION REQUEST] [ONCALL: PT2] [MODULE: DYNAMO] [MODULE: COMPILE UX] [DYNAMO-TRIAGE-DEC2025] [BOT-TRIAGED] torch.compile(..., name="flex_attention"): This issue proposes adding a name keyword argument to the torch.compile function to assign names to compile regions, which can then be used for better identification and handling in various contexts such as activation checkpointing and stack traces. The motivation is to improve the ability to distinguish and manage compiled regions, especially when using features like SAC and inductor compiled code, thereby enhancing debugging, tracing, and model transparency.

    • The comments generally support the proposal, discussing the trade-offs between using string names versus object-based namespacing for uniqueness, and highlighting the usefulness of named compile regions for debugging, tracing, and improving model interpretability.
    • Number of comments this week: 4
  4. [TRIAGED] [RELEASE TRACKER] [v.2.11.0] Release Tracker: This issue is about tracking and managing cherry-picks to the release branch for the PyTorch 2.11.0 release, outlining the criteria and process for including changes during different phases of the release cycle. It provides detailed instructions on what types of fixes are allowed, how to submit cherry-pick requests, and the approval workflow to ensure stability and quality before the final release.

    • The comments show multiple cherry-pick requests submitted with links to both trunk and release branch PRs, each categorized by the type of change, and all were approved and merged by a release team member.
    • Number of comments this week: 3
  5. [TRIAGE REVIEW] [MODULE: BUILD] [MODULE: RISC-V] [BOT-TRIAGED] ZLib Reference outdated in riscv ci dockerfile: This issue addresses a failure in Docker builds caused by an outdated URL for downloading zlib version 1.3.1 in the riscv CI Dockerfile, resulting in 404 errors. The problem arose because zlib was updated to version 1.3.2, and the current source for 1.3.1 is no longer hosted at the original location, prompting a need to either upgrade the zlib version or source the older version from GitHub releases.

    • The comments confirm the build failures started recently following the zlib 1.3.2 release, propose a simple fix by bumping the zlib version in the Dockerfile, and discuss adding a triage review to consider separating mainstream and experimental docker builds for better maintenance.
    • Number of comments this week: 3
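The RuntimeError in the first issue above can be reproduced without flex attention at all: autograd refuses to set requires_grad on integer tensors, which is exactly what happens when a cached BlockMask's integer index tensors are rewrapped with requires_grad=True during SAC recomputation. A minimal sketch of just that underlying invariant:

```python
import torch

# Floating-point tensors may require gradients...
ok = torch.zeros(3, dtype=torch.float32).requires_grad_(True)
assert ok.requires_grad

# ...but integer tensors may not. A BlockMask is built from integer
# index tensors, so reconstructing one with requires_grad=True fails
# with the message from the issue title: "Only Tensors of floating
# point and complex dtype can require gradients".
raised = False
try:
    torch.zeros(3, dtype=torch.int64).requires_grad_(True)
except RuntimeError:
    raised = True
assert raised
```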

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 60

Summarized Issues:

  • Cherry-pick and release management: This issue tracks the process and criteria for cherry-picking low-risk and critical fixes to the PyTorch 2.11.0 release branch, ensuring proper management during different phases of the release cycle. It focuses on defining clear guidelines to maintain release stability while incorporating necessary fixes.
    • issues/175093
  • NCCL deadlock and distributed communication: A persistent deadlock occurs when launching NCCL point-to-point and collective operations concurrently from separate Python threads, caused by circular dependencies in NCCL's internal progress engine despite synchronization attempts. Additionally, serialization of homogeneous point-to-point communication operations on a single CUDA stream causes head-of-line blocking in pipeline-parallel workloads, proposing direct per-op issuance to reduce bottlenecks.
    • issues/175145, issues/175225
  • Interpolation and random number generation inconsistencies under torch.compile: Using torch.nn.functional.interpolate with mode 'nearest' under torch.compile with certain backends produces incorrect results due to suspected rounding errors, while eager mode is correct. Similarly, random number generation results differ between eager and compiled modes with the inductor backend, where the first randint output is inconsistent despite resetting the seed.
    • issues/175154, issues/175156
  • Build and packaging errors due to missing files and outdated URLs: The hipification process in build_amd.py fails if files or directories like third_party/fbgemm are missing, notably affecting Fedora's python-torch package. Docker builds also fail due to an outdated URL for zlib 1.3.1, requiring an upgrade or alternative sourcing to fix 404 errors.
    • issues/175160, issues/175193
  • Autocast backward pass documentation and behavior mismatch: Documentation and default behavior conflict regarding performing backward passes under autocast with torch.compile, where the recommended practice advises against it but the default assumes backward runs under the same autocast context as forward, potentially impacting numerical correctness.
    • issues/175166
  • Performance regressions and optimization needs in Inductor backend: The MobileBertForMaskedLM model runs about 90% slower with unbacked batch processing compared to backed, highlighting the need for expert help to optimize unbacked scenarios. A minor regression in torch.compile related to torch._scaled_mm causes assertion errors on some GPU models and PyTorch versions, indicating backend inconsistencies.
    • issues/175167, issues/175206
  • MPS backend bugs with channels_last memory format: The backward pass of BatchNorm2d produces incorrect weight gradients on MPS with channels_last inputs, causing training divergence despite correct forward outputs. Additionally, backward passes of AvgPool2d and AdaptiveAvgPool2d crash with SIGABRT due to buffer size assertion failures on channels_last inputs, avoidable by making inputs contiguous.
    • issues/175189, issues/175190
  • Hardware capability and feature detection improvements: Proposes replacing get_device_capability() in CUDA, ROCm, and accelerator tests with more generic, self-documenting feature queries to better verify hardware support across platforms.
    • issues/175211
  • Flex attention and SAC (Selective Activation Checkpointing) challenges: Matching flex attention regions idiomatically in SAC policies is difficult due to generic operation names after torch.compile, complicating tracking. SAC also needs unique identifiers for inductor_compiled_code and similar operators to improve error detection during recomputation.
    • issues/175229, issues/175306
  • OutOfMemory and performance issues in fused kernels and tensor metadata reading: torch.compile fusing multiple operations into one large kernel causes an OutOfMemoryError in Triton during the FlexAttention backward pass on torch 2.10+ that is not seen on 2.9. Excessive unwanted data access during tensor metadata reading with large read_ahead_kb settings leads to significant performance degradation.
    • issues/175250, issues/175252
  • Data pipeline serialization and deployment limitations: Proposes a fully serializable native C++ data pipeline engine for PyTorch to enable Python-independent deployment by capturing data ingestion logic as part of the model graph, addressing current limitations in high-performance and edge environments.
    • issues/175255
  • Caching and autograd errors with DTensors and flex attention: Caching a BlockMask containing integer tensors in compiled functions with SAC and DTensors causes runtime errors because cached int tensors incorrectly require gradients during recomputation, breaking autograd.
    • issues/175258
  • Dynamo module correctness and refactoring: Multiple issues in Dynamo include incorrect HAS_ATTR guard insertion, an investigation of creating VariableTracker only for sourceful objects, a malfunction of the Python is operator, a refactor of variables/builtin.py for better maintainability, and missing metaclass support, all reflecting ongoing efforts to improve Dynamo's correctness and code quality.
    • issues/175263, issues/175264, issues/175267, issues/175269, issues/175292
  • Export and parsing failures: Running tl-parse on export produces empty reports, and fx.GraphModule.to_folder() fails to save models containing TorchScript submodules due to serialization errors, indicating issues in export and saving workflows.
    • issues/175293, issues/175493
  • Inductor backend architecture and test failures: The Inductor CUTLASS backend incorrectly uses CollectiveEpilogue for SM90 on SM100 architectures, causing ~60 test failures. The test_comprehensive_nn_functional_linear_cuda_float32 is disabled due to consistent Linux failures.
    • issues/175304, issues/175354
  • Segmentation faults in embedding_bag and property setter bugs: Segfaults occur in torch.nn.functional.embedding_bag due to insufficient validation of offsets tensor values and missing bounds checks for float64 weights with empty offsets. Python property setters are ignored when assigning nn.Parameter() to nn.Module properties because setattr takes precedence, causing unexpected behavior.
    • issues/175368, issues/175370, issues/175372
  • Naming and debugging improvements for torch.compile: Proposes adding a name keyword argument to torch.compile to assign names to compile regions, improving identification for activation checkpointing, debugging, and tracing.
    • issues/175390
  • TensorSubclass and rounding mode bugs: Using a TensorSubclass with a custom_op in torch.compile causes runtime errors during Dynamo tracing due to invocation of the custom_op implementation instead of the fake one. Also, adding support for different rounding modes when casting tensors is proposed to enable precise control for specialized formats like MXFP8 on Blackwell GPUs.
    • issues/175408, issues/175409
  • vLLM project CI failures and accuracy regressions: An umbrella issue tracks CI failures for the vLLM project for the 2.11 release, including a failing distributed test where accuracy on GSM8K evaluation with 2 GPUs and deepep_low_latency backend is below expected thresholds.
    • issues/175426, issues/175429
  • GPU out-of-memory despite free memory: An out-of-memory error occurs on GPUs with ample free memory, suggesting issues like memory fragmentation or allocation inefficiencies during tensor operations.
    • issues/175431
  • Quantization test failures due to config incompatibilities: Tests for quantization of pre-quantized models using torchao fail due to incompatibilities with model configuration versions, specifically errors from unexpected or unsupported arguments in Int4WeightOnlyConfig during engine core initialization.
    • issues/175435
  • Illegal instruction crashes on older CPUs: Intermittent illegal instruction crashes occur running torch.sin on large CPU tensors with PyTorch 2.10+ on older Intel Xeon E5-2670 processors, traced to mkl_vml_kernel_dSin_* functions, mitigated by limiting PyTorch to single-threaded execution.
    • issues/175436
  • Beam search concurrency limit test failures: The test_beam_search_with_concurrency_limit function fails due to output mismatches when concurrency limits are applied during beam search sampling for TinyLlama-1.1B-Chat-v1.0.
    • issues/175437
  • DTensor export and decomposition errors: Exporting tensor-parallel models using DTensor fails because DTensorSpec is not registered as a pytree constant, causing RuntimeErrors unless explicitly registered. Running run_decompositions() on ExportedPrograms using DTensor fails with an AssertionError during decomposition.
    • issues/175467, issues/175469
  • Test disables and gradient correctness issues: The test_custom_op_with_layout_arg_xpu is disabled due to failures on the xpu platform. Using make_fx with symbolic tracing produces incorrect second-order gradients involving torch.sqrt, diverging from eager execution.
    • issues/175475, issues/175477
  • Distribution and LSTM documentation and test issues: One request asks that torch.distributions.Gamma().sample() accept a torch.Generator for reproducibility, a documentation error is reported in the LSTM h_n output description, and test_index_put_error_cuda is disabled due to ROCm 7.2 failures.
    • issues/175478, issues/175479, issues/175482
  • TensorFloat32 warning suppression and interpolation NotImplementedError: Users cannot disable repeated TensorFloat32 precision warnings during fp32 matrix multiplications, suggesting a flag to distinguish default vs user-set precision. On CPU, torch.nn.functional.interpolate with antialias=True raises NotImplementedError for bfloat16 and float16 inputs in bilinear or bicubic modes, breaking preprocessing workflows.
    • issues/175484, issues/175489
  • In-place division and softmax compilation bugs: Compiling functions performing in-place division, softmax, and returning detached tensors results in inconsistent outputs compared to eager execution, specifically affecting the first returned element when detached tensors are included.
    • issues/175496
  • Unsupported operations and batching rule gaps: torch._dynamo.export fails on models containing nn.GRU, raising questions about known limitations. torch.vmap raises ValueError when returning pytrees with non-tensor leaves, proposing fixes to allow non-tensor passthrough. Lack of vmap batching rule for torch.while_loop causes KeyErrors, proposing a TransformType.Vmap rule for proper batched execution.
    • issues/175520, issues/175521, issues/175522
  • Graph breaks due to custom getattribute in torch.compile: torch.compile causes graph breaks on every attribute access when nn.Module subclasses define custom getattribute, proposing fixes to trace through getattribute like plain user-defined objects to avoid unnecessary breaks.
    • issues/175523
  • Race conditions in CUDA RPC tests: A race condition in CUDA RPC test_tensor_view_as_return_value intermittently causes SIGABRT crashes due to multiple RPC worker threads executing CUDA ops on the default stream without proper synchronization, affecting distributed RPC test reliability.
    • issues/175528
  • torch.compile backward pass failures on CPU and CUDA: torch.compile fails during backward pass on minimal models with torchvision convnext_tiny on CPU, related to LayerNorm2d patterns involving tensor permutation and normalization, also affecting more complex variants on CUDA, while eager mode works correctly.
    • issues/175530
  • FSDP test device context and NCCL failures: FSDPTestMultiThread fails because device context is not set before creating DeviceMesh, causing warnings, incorrect device inference, and NCCL failures in distributed FSDP tests.
    • issues/175531
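The property-setter surprise above (issues/175372) comes down to lookup order: nn.Module overrides __setattr__, and that override registers nn.Parameter values itself before Python ever consults the class's property descriptor. A stdlib-only mimic of the mechanism (FakeParameter and FakeModule are illustrative stand-ins, not torch code):

```python
class FakeParameter:
    """Stand-in for nn.Parameter."""
    def __init__(self, value):
        self.value = value

class FakeModule:
    """Stand-in for nn.Module: intercepts parameter-like assignments."""
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "setter_called", False)

    def __setattr__(self, name, value):
        # Like nn.Module.__setattr__: parameter-like values are routed
        # into the registry, so descriptors on the class never run.
        if isinstance(value, FakeParameter):
            self._parameters[name] = value
        else:
            object.__setattr__(self, name, value)

class M(FakeModule):
    @property
    def scale(self):
        return self._parameters.get("scale")

    @scale.setter
    def scale(self, value):
        object.__setattr__(self, "setter_called", True)

m = M()
m.scale = FakeParameter(1.0)

print(m.setter_called)               # False: __setattr__ won, setter skipped
print(m._parameters["scale"].value)  # 1.0: registered in the parameter dict
```

Any class that overrides __setattr__ without delegating to object.__setattr__ bypasses data descriptors the same way, because Python dispatches assignment to type(obj).__setattr__ before the descriptor machinery runs.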

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 50

Summarized Issues:

  • Inductor Backend Boolean Tensor Issues: Multiple issues report incorrect behavior when using the Inductor backend with torch.compile on boolean tensors. These include wrong indices returned by argmax or max(dim)[1] operations and incorrect results from boolean .data assignments, both differing from eager execution results.
  • [issues/174069, issues/174187]
  • Distributed and Async Operation Errors in Compilation: There is a problem where torch.compile misinterprets the output of dist.all_reduce with async_op=False, causing an AttributeError due to expecting a torch.Tensor instead of None. This error occurs only in compiled mode, while eager execution works correctly.
  • [issues/174280]
  • MPS Backend Numerical and Memory Issues: Several issues highlight incorrect results and memory errors on the MPS backend, including failures in BatchNorm2d and avg_pool2d with channels_last=True and storage_offset > 0, out-of-bounds memory access in scaled dot product attention for long sequences, and incorrect gradients or outputs in various tensor operations on non-contiguous or permuted tensors.
  • [issues/174345, issues/174861, issues/174943, issues/175187, issues/175188, issues/175191, issues/175192]
  • Export and Environment Variable Interaction: The torch.export.export(..., strict=True) function fails when the PYTHONOPTIMIZE environment variable is set above zero, likely due to stripped assert statements, causing errors not seen with strict=False or PYTHONOPTIMIZE=0.
  • [issues/174784]
  • CUDA Memory Access and Kernel Errors: A CUDA illegal memory access error occurs when using torch.nn.attention.flex_attention with create_block_mask under certain large key-value and small query length conditions, especially when compiled with torch.compile, indicating possible out-of-bounds or indexing issues in the kernel.
  • [issues/174923]
  • Test Failures Due to Missing Dependencies and Disabled Tests: Some tests fail or are disabled due to missing Python modules like 'dominate' or segmentation faults caused by attribute errors during teardown, affecting multiple platforms and CUDA versions.
  • [issues/174919, issues/175019, issues/175065]
  • Loss Function and Gradient Computation Bugs: The NLLLoss function fails to backpropagate gradients for non-contiguous 4D inputs, and CrossEntropyLoss on MPS incorrectly accepts invalid 1D float labels without error, leading to inconsistent validation and runtime errors.
  • [issues/174943, issues/175084]
  • CPU Test Tolerance and Documentation Issues: Some CPU tests fail due to overly strict tolerance thresholds that do not accommodate expected numerical differences in low-precision operations, and minor documentation problems exist with the varlen_attn feature regarding import errors and argument formatting.
  • [issues/174952, issues/174961]
  • Inductor and Periodic-Dynamo Benchmark Instabilities: Numerous issues report instability and flakiness in inductor-periodic and periodic-dynamo benchmark tests across CPU and CUDA platforms, affecting various TorchBench test suites and configurations.
  • [issues/175121, issues/175122, issues/175123, issues/175124, issues/175125, issues/175126, issues/175127, issues/175128, issues/175129, issues/175130, issues/175131, issues/175132, issues/175133, issues/175134, issues/175135, issues/175136, issues/175137, issues/175138, issues/175139, issues/175140, issues/175141, issues/175142, issues/175143]
  • Name Mangling and Build Linking Errors: There is a problem with inconsistent name mangling between C++ and CUDA for templated functions, causing linking errors during the build process due to mismatched mangled symbols.
  • [issues/174898]
  • Feature Request for Selective Gradient Blocking: A new autograd feature is requested to enable selective upstream gradient blocking per loss without extra memory or compute overhead, allowing efficient multi-loss model training without multiple backward passes or graph retention.
  • [issues/175165]
  • Float8 Layout Test Bug on Specific GPUs: The test_float8_basics_cuda test incorrectly expects a RuntimeError for Spark and Thor GPUs regarding float8 layouts, but these GPUs actually support some layouts, causing test failures.
  • [issues/175182]
  • Negative Step Slicing Support Proposal: A feature request proposes adding support for negative step sizes in slicing syntax by internally rewriting such slices to a combination of flip and positive-step slicing, improving ergonomics and aligning with NumPy behavior.
  • [issues/175240]
  • Numerical Accuracy and Undefined Behavior in Linear Algebra: A CUDA test fails numerical accuracy checks due to discrepancies in linear algebra computations with high condition number matrices, and undefined behavior is reported in GemmHelper caused by improper use of vector::reserve() leading to out-of-bounds access.
  • [issues/175282, issues/175302]
  • Warning Suppression Request for Readonly NumPy Arrays: There is a request to suppress or optionally disable warnings generated when creating a Tensor from a readonly NumPy array, as these warnings clutter logs without affecting training or inference.
  • [issues/175395]
  • Python 3.15 Support Initiation: Discussions and requests are made to begin support for Python 3.15 in PyTorch to enable timely compatibility shortly after the official Python 3.15 release.
  • [issues/175402, issues/175407]
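The negative-step proposal above (issues/175240) rests on a simple identity: a negative-step slice equals a flip followed by a positive-step slice, with start and stop remapped into the flipped coordinate system. A stdlib sketch of that identity on plain Python lists (PyTorch tensors currently reject negative steps, which is what the proposal would change):

```python
a = list(range(10))

# Simple case: x[::-k] == flip(x)[::k]
assert a[::-2] == a[::-1][::2]
assert a[::-3] == a[::-1][::3]

# With explicit bounds, index i in the original maps to len(x) - 1 - i
# in the flipped list, and the step sign is dropped.
n = len(a)
start, stop, step = 8, 2, -2
direct = a[start:stop:step]                             # [8, 6, 4]
rewritten = a[::-1][n - 1 - start : n - 1 - stop : -step]
assert rewritten == direct
```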

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 205

Key Open Pull Requests

1. [test] Add basic pyrefly infer command: This pull request introduces a basic pyrefly infer command along with initial setup, tests for autotyping, lint fixes, directory renaming to avoid clashes, reference typing completion, integration of a custom directory switcher, and test dependencies to enhance the pyrefly functionality within the PyTorch project.

  • URL: pull/175153
  • Associated Commits: be3d0, 11f5d, 0d4a8, b88e6, d26be, 286e7, cf386, c5eef, c8a85, e4b96, 5270f, 7cfce, a34c5, cdfab

2. [inductor] Add inline PTX pow for bitwise CUDA parity: This pull request adds an inline PTX implementation of the powf_cuda helper function using non-flush-to-zero instructions to precisely match CUDA's powf behavior, improving numerical accuracy when eager_numerics.pow_precision is enabled.

  • URL: pull/175227
  • Associated Commits: 7bc03, d96ba, 9487b, 9b6a3, 62c88, 36e56, 988a9, 840b1, b773a, de3d9, aa3c3

3. [ROCM] Refactor BFloat16 implementation for native usage of BF16 to float conversion.: This pull request refactors the BFloat16 implementation in the PyTorch ROCm environment to utilize HIP's native __bf16 hardware-accelerated float-conversion capabilities, replacing software-based conversions with more efficient constructors, conversion operators, and operator overloads that improve performance, compatibility, and maintainability of BFloat16 operations.

  • URL: pull/175303
  • Associated Commits: 8d386, 05c88, 92b0a, 3c59c, faf0e, fc78a, ca8b2, d6401, c9549, cd0da, 463da

Other Open Pull Requests

  • Precision and FMA Lowering in Inductor Backend: Multiple pull requests improve precision in the PyTorch inductor backend by modifying powf_cuda inline PTX helpers to fallback to libdevice.pow for fp64 inputs and by skipping decomposition of addcdiv and addcmul operations. These changes enable fused multiply-add (FMA) lowering, achieving precision parity with eager CUDA and benefiting optimizers like Adam and AdamW.
  • [pull/175268, pull/175310, pull/175309]
  • Inductor Backend Tiling and Memory Planning Enhancements: Updates to the Inductor backend include extending ND tiling heuristics to support output (pointwise) dimensions in reduction kernels and improving memory planning for symm_mem collective operations by handling inputs not controlled by Inductor. These improvements involve automatic identity copies, communication layout propagation, and delegation of code generation to fix uninitialized buffer bugs and support complex memory scenarios.
  • [pull/175308, pull/175449]
  • Autotuning Integration and Benchmarking: The custom operator autotuning infrastructure is integrated into the aten.mm matrix multiplication function with added autotuning options and tests. Additionally, a new Inductor benchmarker is introduced for ROCm platforms using the Torch profiler to improve kernel timing accuracy during autotuning, with potential benefits for NVIDIA platforms.
  • [pull/175278, pull/175097]
  • Debugging and Dynamo Improvements: Several pull requests enhance debugging and tracing in Dynamo by fixing exception handling in InteractiveDebugSession, modifying the 'q' command to exit immediately, and removing unnecessary unimplemented prefix variables related to comprehension graph breaks. These changes address test failures and improve developer experience during debugging.
  • [pull/175173, pull/175200, pull/175420]
  • Continuous Integration and Workflow Updates: Updates include skipping sccache PATH wrappers in ROCm CI to fix AOTriton build breakage, increasing ROCm nightly binary build timeouts to 360 minutes, updating vLLM tests and benchmarks to CUDA 13.0, and bumping the transformers library version to 5.2.0 with fixes for compatibility and test failures.
  • [pull/175443, pull/175152, pull/175274, pull/175393]
  • Higher-Order Ops and API Additions: A print_backward flag is added to the torch._higher_order_ops.print operation to print gradients during backward passes without breaking graphs under torch.compile. A per-Tensor API for selective activation checkpointing is introduced, allowing explicit pinning of intermediate tensors to avoid recomputation, with refactoring of SAC storage to support retroactive insertion.
  • [pull/175224, pull/175348]
  • ONNX Export and Graph Capture Fixes: A validation failure in torch.onnx.export when using renamed input names and dynamic shapes is fixed by remapping dynamic shape keys to original parameter names. The error message for CPU-to-CUDA tensor copies during graph capture is improved to clarify requirements for pinning and capture error mode settings.
  • [pull/175279, pull/175281]
  • Expression Cache and Configuration Defaults: A per-SymNode expression cache keyed on the _replacements_version_counter is introduced to optimize expression handling. The default value of the wrap_inductor_compiled_regions configuration option is set to True to update PyTorch's default behavior.
  • [pull/175169, pull/175353]
  • Tensor Resource Management Enhancements: The torch::stable::from_blob function is extended to accept complex callable deleters such as lambdas with captures, enabling more flexible resource management for use cases like TorchCodec without global maps or thread contention workarounds.
  • [pull/175089]

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 243

Key Closed Pull Requests

1. Enable hipSOLVER for supported linalg operations on ROCm: This pull request aims to enable hipSOLVER support for specific linear algebra operations on the ROCm platform by adding appropriate guards and fallbacks to utilize hipSOLVER's 64-bit APIs and batched operations where supported, while falling back to MAGMA for unsupported functionalities, thereby improving ROCm's linalg backend capabilities.

  • URL: pull/175367
  • Associated Commits: 8b80a, 34f78, 88129, dd44b, 1c96f, 3d409, 7672d, a7ad4, 36b75, 32803, f5519, 94de7, 7fb50, 61a38, 4b515, d9556, 48007, b9f71, 94c6e, 25087, ac013, b6ee5, 08906, 77acf, 03aab, b2db1

2. More size-hinting cleanups: This pull request focuses on cleaning up size-hinting in the codebase by replacing all size_hint calls with fallback to use optimization_hint, removing fallback parameters from size_hint calls in preparation for its eventual deletion, and updating calls from symbolic_hint() to replace_backed_with_hints().

  • URL: pull/174580
  • Associated Commits: 9c8d4, 8dc39, d4b3f, 0a7e5, aa75b, d15eb, 8d4bc, b3c4d, a1277, 3fce3, 6dc29, 63d42, a8cfd, 258b9, 9fc6c

3. [DTensor] Strategy Validation (3/3): strategy querying, orchestrator, and CLI: This pull request adds a comprehensive DTensor sharding rule validation framework. It includes an orchestrator that queries DTensor's claimed sharding strategies via multiple paths, computes ground-truth validity for each placement combination, and detects and reports discrepancies such as incorrect or missing rules with false-positive mitigations; it also provides a CLI for running validations on individual or all registered operators, plus end-to-end tests to ensure reliable detection of DTensor bugs.

  • URL: pull/174800
  • Associated Commits: 9c568, e4070, b8c6b, 0b24b, 97bd9, cc1a2, 6af72, cb134, 12a1b, 18d26, 1516d, 90b67, 889a4, 59049, 34546

Other Closed Pull Requests

  • DTensor validation and decomposition improvements: Multiple pull requests enhance DTensor functionality by adding a validation engine for sharding rules that simulates distributed execution and by making redistribution semantics more lenient to allow CIA operations through the decomposition flow again. These changes improve rule validation accuracy and decomposition flexibility within DTensor.
    • pull/174799, pull/175194
  • DTensor printing and backward function updates: Support for DTensor arguments in the torch._higher_order_ops.print operator was added to enable rank-wise printing without collectives, and backward functions for from_local and to_local were updated to handle None gradients explicitly. These updates improve debugging capabilities and prevent autograd type mismatch errors.
    • pull/175222, pull/173865
  • Pallas TPU backend enhancements: Several pull requests introduce element-wise operation support, initial inductor IR lowering, broadcasting, and tiling improvements for the Pallas TPU backend, while fixing CPU and TPU-related issues. These contributions advance TPU support and improve code generation and functionality.
    • pull/174743, pull/175027
  • vLLM submodule and size_hint migration: Updates include pinning the vLLM submodule commit, reorganizing test paths, fixing CUDA 12.8 build issues, and rewriting size_hint usages to support unbacked tensors with more precise APIs. These changes enhance compatibility and correctness for vLLM and related tensor operations.
    • pull/175238, pull/174937, pull/175216
  • nonstrict_trace feature improvements: Support for nn modules as inputs was added, along with improved documentation and additional tests for the nonstrict_trace feature. These enhancements increase the feature's usability and reliability.
    • pull/172372, pull/172395
  • ProcessGroup and tracing fixes: The ProcessGroup class was modified to use an ABC metaclass allowing FakeScriptObject to register as a virtual subclass, enabling correct isinstance behavior during tracing. This fix improves tracing correctness when handling FakeScriptObjectStack instances.
    • pull/172566
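The virtual-subclass mechanism that fix relies on can be illustrated with the standard library's abc module; the class names below are illustrative stand-ins, not PyTorch's actual ProcessGroup or FakeScriptObject implementations:

```python
from abc import ABCMeta

# A base class built on ABCMeta can accept "virtual" subclasses that do not
# inherit from it, mirroring how ProcessGroup can recognize a fake object
# during tracing. These names are hypothetical, for illustration only.
class ProcessGroupLike(metaclass=ABCMeta):
    pass

class FakeScriptObjectLike:
    # No inheritance relationship to ProcessGroupLike at all.
    pass

# Registering FakeScriptObjectLike as a virtual subclass makes isinstance
# and issubclass checks succeed without touching either class hierarchy.
ProcessGroupLike.register(FakeScriptObjectLike)

print(isinstance(FakeScriptObjectLike(), ProcessGroupLike))  # True
print(issubclass(FakeScriptObjectLike, ProcessGroupLike))    # True
```

This is why code that guards behavior with isinstance(obj, ProcessGroup) can transparently accept the fake object during tracing once registration happens.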
  • Inductor backend autotuning and configuration patches: Scoped configuration patches propagation was enabled for specific operations during autotuning by adding a new Operation._config_patches field and tagging SubgraphBuffer operations, improving performance without globally enabling coordinate descent tuning.
    • pull/175277
  • Custom operator autotuning with CUDA graph benchmarking: CUDA graph benchmarking capabilities were added to ExternKernelCaller and SubgraphChoiceCaller with parameters to ensure significant speedup selection and fair performance comparisons using CUDA graph capture and replay. This improves autotuning accuracy and efficiency.
    • pull/175275
  • Tracing performance optimization: A _LazyStackTrace class was introduced to defer expensive stack trace symbolization and formatting until accessed, eliminating about 10% tracing time overhead caused by eager summarization during node creation. This optimization reduces tracing latency.
    • pull/175334
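The deferral pattern behind that optimization can be sketched with the standard traceback module; the class body here is an assumption for illustration, not the PR's actual _LazyStackTrace code:

```python
import sys
import traceback

class LazyStackTrace:
    """Capture a cheap frame reference now; format the trace only on demand."""

    def __init__(self):
        # Grabbing a frame reference is cheap compared to symbolizing and
        # formatting the whole stack. (Caveat: holding a frame keeps its
        # locals alive until the trace object is dropped.)
        self._frame = sys._getframe(1)
        self._formatted = None

    def __str__(self):
        # The expensive extract/format work runs only on first access,
        # then the result is cached.
        if self._formatted is None:
            summary = traceback.extract_stack(self._frame)
            self._formatted = "".join(traceback.format_list(summary))
        return self._formatted

def make_node():
    # Hot path stays fast: creating a node does no formatting at all.
    return LazyStackTrace()

trace = make_node()
print("make_node" in str(trace))  # formatting happens only here
```

Because node creation dominates tracing, moving the formatting cost off that hot path is what recovers the reported ~10% overhead.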
  • DTensor stack operation and ShardingPropagator fixes: Fixes were applied to dimension normalization in the DTensor stack operation and to prevent hangs in ShardingPropagator during multi-threading tests by enabling a lock only in testing. These changes improve stability and correctness in DTensor components.
    • pull/174640, pull/174820
  • InputObserver custom empty tensor support: Support was added for specifying a custom empty tensor in InputObserver to handle missing inputs like pixel_values during sequential forward calls, ensuring consistent input observation in multi-modal models.
    • pull/174964
  • Error message improvements: The error message for MultiMarginLoss was enhanced to provide clearer, more detailed explanations about target tensor size inconsistencies, aiding user debugging.
    • pull/174072
  • Pyrefly GitHub action error handling fix: A temporary fix was implemented to decode JSON error output from Pyrefly when running in GitHub actions, ensuring compatibility in CI and local environments while a comprehensive fix is developed.
    • pull/175289
  • Dynamo einops version check revert: A partial revert was made to disable the einops 0.8.2 version check in Dynamo by falling back to prior behavior using allow_in_graph, preventing excessive warning logspam caused by tracing einops operations with @lru_cache.
    • pull/175351

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
laithsakka       184      19             2       9
albanD           178      2              1       0
malfet           113      20             0       31
anijain2305      147      9              5       3
wconstab         119      17             0       23
pianpwk          125      19             0       5
ydwu4            121      11             0       0
eellison         99       12             0       16
weifengpy        99       2              0       1
guilhermeleobas  86       11             0       3

Don't miss what's next. Subscribe to Weekly Project News:
Powered by Buttondown, the easiest way to start and grow your newsletter.