Weekly Project News

Archives

Weekly GitHub Report for Pytorch: February 16, 2026 - February 23, 2026 (17:35:31)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements, including torch.compile support for Python 3.13, the new performance tuning API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement that flips the default of torch.load to weights_only=True. The release also ships numerous bug fixes, performance optimizations, and deprecations, such as the discontinuation of official Conda packages.
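The weights_only flip is the change most likely to break existing code: checkpoints containing arbitrary pickled Python objects no longer load by default. A minimal sketch of the new default behavior (the file name and tensor names here are illustrative):

```python
import os
import tempfile

import torch

# A plain tensor state dict round-trips fine under the new default.
state = {"weight": torch.randn(2, 3), "bias": torch.zeros(3)}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(state, path)

# Since 2.6, torch.load defaults to weights_only=True, which restricts
# unpickling to tensors and a small allowlist of safe types.
loaded = torch.load(path, weights_only=True)
assert torch.equal(loaded["weight"], state["weight"])

# Checkpoints holding arbitrary objects need an explicit opt-out, which
# should only be used for trusted files:
#   torch.load(path, weights_only=False)
```

The weights_only keyword itself predates 2.6; the release only flips its default value.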

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [HIGH PRIORITY] [TRIAGE REVIEW] [MODULE: CRASH] [MODULE: ACTIVATION CHECKPOINTING] [ONCALL: PT2] [MODULE: DTENSOR] [MODULE: PT2-DISPATCHER] [MODULE: FLEX ATTENTION] [BOT-TRIAGED] Block mask caching for flex attention and SAC don't play nicely together (RuntimeError: Only Tensors of floating point and complex dtype can require gradients): This issue describes a runtime error occurring when block mask caching in flex attention interacts poorly with selective activation checkpointing (SAC), specifically causing a RuntimeError related to tensors of integer dtype incorrectly requiring gradients during SAC recomputation. The problem arises because a compiled create_block_mask produces a BlockMask with integer tensors that are cached and reused, leading aot_autograd's wrap_tensor_subclasses to attempt reconstructing a DTensor with requires_grad=True on an integer tensor, which is invalid.

    • The comments discuss the difficulty in pinpointing the error, note that the failure is due to branching on global state causing cache mismatches during SAC, and explore a proposed solution involving clearing the cache on recompute which ultimately fails. A user shares a workaround involving a recompute tape mechanism to record and replay cache hits during forward and recompute passes, ensuring consistent behavior and preventing the error.
    • Number of comments this week: 7
  2. [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] MobileBertForMaskedLM is 90% slower with unbacked vs backed !: This issue reports that the MobileBertForMaskedLM model runs approximately 90% slower when using unbacked batch processing compared to backed batch processing, despite recent optimizations related to size hinting. The author is seeking assistance from the Inductor team to further improve performance for unbacked batches, as this work is preparatory for integrating with vLLM and aims to match Huggingface inference speeds for faster iteration.

    • The comments detail a series of incremental code optimizations that progressively improve performance from 1.19x to over 2x speedup, including changes to size hinting heuristics, handling of unbacked symbols, and padding optimizations, with ongoing efforts to finalize fixes for unbacked batch processing.
    • Number of comments this week: 5
  3. [TRIAGED] [FUNCTION REQUEST] [ONCALL: PT2] [MODULE: DYNAMO] [MODULE: COMPILE UX] [DYNAMO-TRIAGE-DEC2025] [BOT-TRIAGED] torch.compile(..., name="flex_attention"): This issue proposes adding a name keyword argument to the torch.compile function to assign names to compile regions, which can then be used for better identification and handling in various contexts such as activation checkpointing and stack traces. The motivation is to improve the ability to distinguish and manage compiled regions, especially when using features like SAC and inductor compiled code, thereby enhancing debugging, tracing, and model transparency.

    • The comments generally support the proposal, discussing the trade-offs between using string names versus object-based namespacing for uniqueness, and highlighting the usefulness of named compile regions for debugging, tracing, and improving model interpretability.
    • Number of comments this week: 4
  4. [TRIAGED] [RELEASE TRACKER] [v.2.11.0] Release Tracker: This issue is about tracking and managing cherry-picks to the release branch for the PyTorch 2.11.0 release, outlining the criteria and process for including changes during different phases of the release cycle. It provides detailed instructions on what types of fixes are allowed, how to submit cherry-pick requests, and the approval workflow to ensure stability and quality before the final release.

    • The comments show multiple cherry-pick requests submitted with links to both trunk and release branch PRs, each categorized by the type of change, and all were approved and merged by a release team member.
    • Number of comments this week: 3
  5. [TRIAGE REVIEW] [MODULE: BUILD] [MODULE: RISC-V] [BOT-TRIAGED] ZLib Reference outdated in riscv ci dockerfile: This issue addresses a failure in Docker builds caused by an outdated URL for downloading zlib version 1.3.1 in the riscv CI Dockerfile, resulting in 404 errors. The problem arose because zlib was updated to version 1.3.2, and the current source for 1.3.1 is no longer hosted at the original location, prompting a need to either upgrade the zlib version or source the older version from GitHub releases.

    • The comments confirm the build failures started recently following the zlib 1.3.2 release, propose a simple fix by bumping the zlib version in the Dockerfile, and discuss adding a triage review to consider separating mainstream and experimental docker builds for better maintenance.
    • Number of comments this week: 3
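The RuntimeError in the first issue above can be reproduced without flex attention at all: autograd refuses to set requires_grad on integer tensors, which is exactly what happens when a cached BlockMask's integer index tensors are rewrapped with requires_grad=True during SAC recomputation. A minimal sketch of just that underlying invariant:

```python
import torch

# Floating-point tensors may require gradients...
ok = torch.zeros(3, dtype=torch.float32).requires_grad_(True)
assert ok.requires_grad

# ...but integer tensors may not. A BlockMask is built from integer
# index tensors, so reconstructing one with requires_grad=True fails
# with the message from the issue title: "Only Tensors of floating
# point and complex dtype can require gradients".
raised = False
try:
    torch.zeros(3, dtype=torch.int64).requires_grad_(True)
except RuntimeError:
    raised = True
assert raised
```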

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 60

Summarized Issues:

  • Cherry-pick and release management: This issue tracks the process and criteria for cherry-picking low-risk and critical fixes to the PyTorch 2.11.0 release branch, ensuring proper management during different phases of the release cycle. It focuses on defining clear guidelines to maintain release stability while incorporating necessary fixes.
    • issues/175093
  • NCCL deadlock and distributed communication: A persistent deadlock occurs when launching NCCL point-to-point and collective operations concurrently from separate Python threads, caused by circular dependencies in NCCL's internal progress engine despite synchronization attempts. Additionally, serialization of homogeneous point-to-point communication operations on a single CUDA stream causes head-of-line blocking in pipeline-parallel workloads, proposing direct per-op issuance to reduce bottlenecks.
    • issues/175145, issues/175225
  • Interpolation and random number generation inconsistencies under torch.compile: Using torch.nn.functional.interpolate with mode 'nearest' under torch.compile with certain backends produces incorrect results due to suspected rounding errors, while eager mode is correct. Similarly, random number generation results differ between eager and compiled modes with the inductor backend, where the first randint output is inconsistent despite resetting the seed.
    • issues/175154, issues/175156
  • Build and packaging errors due to missing files and outdated URLs: The hipification process in build_amd.py fails if files or directories like third_party/fbgemm are missing, notably affecting Fedora's python-torch package. Docker builds also fail due to an outdated URL for zlib 1.3.1, requiring an upgrade or alternative sourcing to fix 404 errors.
    • issues/175160, issues/175193
  • Autocast backward pass documentation and behavior mismatch: Documentation and default behavior conflict regarding performing backward passes under autocast with torch.compile, where the recommended practice advises against it but the default assumes backward runs under the same autocast context as forward, potentially impacting numerical correctness.
    • issues/175166
  • Performance regressions and optimization needs in Inductor backend: The MobileBertForMaskedLM model runs about 90% slower with unbacked batch processing compared to backed, highlighting the need for expert help to optimize unbacked scenarios. A minor regression in torch.compile related to torch._scaled_mm causes assertion errors on some GPU models and PyTorch versions, indicating backend inconsistencies.
    • issues/175167, issues/175206
  • MPS backend bugs with channels_last memory format: The backward pass of BatchNorm2d produces incorrect weight gradients on MPS with channels_last inputs, causing training divergence despite correct forward outputs. Additionally, backward passes of AvgPool2d and AdaptiveAvgPool2d crash with SIGABRT due to buffer size assertion failures on channels_last inputs, avoidable by making inputs contiguous.
    • issues/175189, issues/175190
  • Hardware capability and feature detection improvements: Proposes replacing get_device_capability() in CUDA, ROCm, and accelerator tests with more generic, self-documenting feature queries to better verify hardware support across platforms.
    • issues/175211
  • Flex attention and SAC (Selective Activation Checkpointing) challenges: Matching flex attention regions idiomatically in SAC policies is difficult due to generic operation names after torch.compile, complicating tracking. SAC also needs unique identifiers for inductor_compiled_code and similar operators to improve error detection during recomputation.
    • issues/175229, issues/175306
  • OutOfMemory and performance issues in fused kernels and tensor metadata reading: torch.compile fusing multiple operations into one large kernel causes an OutOfMemoryError in Triton during the FlexAttention backward pass on torch 2.10+ that is not seen on 2.9. Excessive unwanted data access during tensor metadata reading with large read_ahead_kb settings leads to significant performance degradation.
    • issues/175250, issues/175252
  • Data pipeline serialization and deployment limitations: Proposes a fully serializable native C++ data pipeline engine for PyTorch to enable Python-independent deployment by capturing data ingestion logic as part of the model graph, addressing current limitations in high-performance and edge environments.
    • issues/175255
  • Caching and autograd errors with DTensors and flex attention: Caching a BlockMask containing integer tensors in compiled functions with SAC and DTensors causes runtime errors because cached int tensors incorrectly require gradients during recomputation, breaking autograd.
    • issues/175258
  • Dynamo module correctness and refactoring: Multiple issues in Dynamo include incorrect HAS_ATTR guard insertion, an investigation of creating VariableTracker only for sourceful objects, a malfunction of the Python is operator, a refactor of variables/builtin.py for better maintainability, and missing metaclass support, all reflecting ongoing efforts to improve Dynamo's correctness and code quality.
    • issues/175263, issues/175264, issues/175267, issues/175269, issues/175292
  • Export and parsing failures: Running tl-parse on export produces empty reports, and fx.GraphModule.to_folder() fails to save models containing TorchScript submodules due to serialization errors, indicating issues in export and saving workflows.
    • issues/175293, issues/175493
  • Inductor backend architecture and test failures: The Inductor CUTLASS backend incorrectly uses CollectiveEpilogue for SM90 on SM100 architectures, causing ~60 test failures. The test_comprehensive_nn_functional_linear_cuda_float32 is disabled due to consistent Linux failures.
    • issues/175304, issues/175354
  • Segmentation faults in embedding_bag and property setter bugs: Segfaults occur in torch.nn.functional.embedding_bag due to insufficient validation of offsets tensor values and missing bounds checks for float64 weights with empty offsets. Python property setters are ignored when assigning nn.Parameter() to nn.Module properties because setattr takes precedence, causing unexpected behavior.
    • issues/175368, issues/175370, issues/175372
  • Naming and debugging improvements for torch.compile: Proposes adding a name keyword argument to torch.compile to assign names to compile regions, improving identification for activation checkpointing, debugging, and tracing.
    • issues/175390
  • TensorSubclass and rounding mode bugs: Using a TensorSubclass with a custom_op in torch.compile causes runtime errors during Dynamo tracing due to invocation of the custom_op implementation instead of the fake one. Also, adding support for different rounding modes when casting tensors is proposed to enable precise control for specialized formats like MXFP8 on Blackwell GPUs.
    • issues/175408, issues/175409
  • vLLM project CI failures and accuracy regressions: An umbrella issue tracks CI failures for the vLLM project for the 2.11 release, including a failing distributed test where accuracy on GSM8K evaluation with 2 GPUs and deepep_low_latency backend is below expected thresholds.
    • issues/175426, issues/175429
  • GPU out-of-memory despite free memory: An out-of-memory error occurs on GPUs with ample free memory, suggesting issues like memory fragmentation or allocation inefficiencies during tensor operations.
    • issues/175431
  • Quantization test failures due to config incompatibilities: Tests for quantization of pre-quantized models using torchao fail due to incompatibilities with model configuration versions, specifically errors from unexpected or unsupported arguments in Int4WeightOnlyConfig during engine core initialization.
    • issues/175435
  • Illegal instruction crashes on older CPUs: Intermittent illegal instruction crashes occur running torch.sin on large CPU tensors with PyTorch 2.10+ on older Intel Xeon E5-2670 processors, traced to mkl_vml_kernel_dSin_* functions, mitigated by limiting PyTorch to single-threaded execution.
    • issues/175436
  • Beam search concurrency limit test failures: The test_beam_search_with_concurrency_limit function fails due to output mismatches when concurrency limits are applied during beam search sampling for TinyLlama-1.1B-Chat-v1.0.
    • issues/175437
  • DTensor export and decomposition errors: Exporting tensor-parallel models using DTensor fails because DTensorSpec is not registered as a pytree constant, causing RuntimeErrors unless explicitly registered. Running run_decompositions() on ExportedPrograms using DTensor fails with an AssertionError during decomposition.
    • issues/175467, issues/175469
  • Test disables and gradient correctness issues: The test_custom_op_with_layout_arg_xpu is disabled due to failures on the xpu platform. Using make_fx with symbolic tracing produces incorrect second-order gradients involving torch.sqrt, diverging from eager execution.
    • issues/175475, issues/175477
  • Distribution and LSTM documentation and test issues: One request asks that torch.distributions.Gamma().sample() accept a torch.Generator for reproducibility, a documentation error is reported in the LSTM h_n output description, and test_index_put_error_cuda is disabled due to ROCm 7.2 failures.
    • issues/175478, issues/175479, issues/175482
  • TensorFloat32 warning suppression and interpolation NotImplementedError: Users cannot disable repeated TensorFloat32 precision warnings during fp32 matrix multiplications, suggesting a flag to distinguish default vs user-set precision. On CPU, torch.nn.functional.interpolate with antialias=True raises NotImplementedError for bfloat16 and float16 inputs in bilinear or bicubic modes, breaking preprocessing workflows.
    • issues/175484, issues/175489
  • In-place division and softmax compilation bugs: Compiling functions performing in-place division, softmax, and returning detached tensors results in inconsistent outputs compared to eager execution, specifically affecting the first returned element when detached tensors are included.
    • issues/175496
  • Unsupported operations and batching rule gaps: torch._dynamo.export fails on models containing nn.GRU, raising questions about known limitations. torch.vmap raises ValueError when returning pytrees with non-tensor leaves, proposing fixes to allow non-tensor passthrough. Lack of vmap batching rule for torch.while_loop causes KeyErrors, proposing a TransformType.Vmap rule for proper batched execution.
    • issues/175520, issues/175521, issues/175522
  • Graph breaks due to custom getattribute in torch.compile: torch.compile causes graph breaks on every attribute access when nn.Module subclasses define custom getattribute, proposing fixes to trace through getattribute like plain user-defined objects to avoid unnecessary breaks.
    • issues/175523
  • Race conditions in CUDA RPC tests: A race condition in CUDA RPC test_tensor_view_as_return_value intermittently causes SIGABRT crashes due to multiple RPC worker threads executing CUDA ops on the default stream without proper synchronization, affecting distributed RPC test reliability.
    • issues/175528
  • torch.compile backward pass failures on CPU and CUDA: torch.compile fails during backward pass on minimal models with torchvision convnext_tiny on CPU, related to LayerNorm2d patterns involving tensor permutation and normalization, also affecting more complex variants on CUDA, while eager mode works correctly.
    • issues/175530
  • FSDP test device context and NCCL failures: FSDPTestMultiThread fails because device context is not set before creating DeviceMesh, causing warnings, incorrect device inference, and NCCL failures in distributed FSDP tests.
    • issues/175531
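The property-setter surprise above (issues/175372) comes down to lookup order: nn.Module overrides __setattr__, and that override registers nn.Parameter values itself before Python ever consults the class's property descriptor. A stdlib-only mimic of the mechanism (FakeParameter and FakeModule are illustrative stand-ins, not torch code):

```python
class FakeParameter:
    """Stand-in for nn.Parameter."""
    def __init__(self, value):
        self.value = value

class FakeModule:
    """Stand-in for nn.Module: intercepts parameter-like assignments."""
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "setter_called", False)

    def __setattr__(self, name, value):
        # Like nn.Module.__setattr__: parameter-like values are routed
        # into the registry, so descriptors on the class never run.
        if isinstance(value, FakeParameter):
            self._parameters[name] = value
        else:
            object.__setattr__(self, name, value)

class M(FakeModule):
    @property
    def scale(self):
        return self._parameters.get("scale")

    @scale.setter
    def scale(self, value):
        object.__setattr__(self, "setter_called", True)

m = M()
m.scale = FakeParameter(1.0)

print(m.setter_called)               # False: __setattr__ won, setter skipped
print(m._parameters["scale"].value)  # 1.0: registered in the parameter dict
```

Any class that overrides __setattr__ without delegating to object.__setattr__ bypasses data descriptors the same way, because Python dispatches assignment to type(obj).__setattr__ before the descriptor machinery runs.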

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 50

Summarized Issues:

  • Inductor Backend Boolean Tensor Issues: Multiple issues report incorrect behavior when using the Inductor backend with torch.compile on boolean tensors. These include wrong indices returned by argmax or max(dim)[1] operations and incorrect results from boolean .data assignments, both differing from eager execution results.
  • [issues/174069, issues/174187]
  • Distributed and Async Operation Errors in Compilation: There is a problem where torch.compile misinterprets the output of dist.all_reduce with async_op=False, causing an AttributeError due to expecting a torch.Tensor instead of None. This error occurs only in compiled mode, while eager execution works correctly.
  • [issues/174280]
  • MPS Backend Numerical and Memory Issues: Several issues highlight incorrect results and memory errors on the MPS backend, including failures in BatchNorm2d and avg_pool2d with channels_last=True and storage_offset > 0, out-of-bounds memory access in scaled dot product attention for long sequences, and incorrect gradients or outputs in various tensor operations on non-contiguous or permuted tensors.
  • [issues/174345, issues/174861, issues/174943, issues/175187, issues/175188, issues/175191, issues/175192]
  • Export and Environment Variable Interaction: The torch.export.export(..., strict=True) function fails when the PYTHONOPTIMIZE environment variable is set above zero, likely due to stripped assert statements, causing errors not seen with strict=False or PYTHONOPTIMIZE=0.
  • [issues/174784]
  • CUDA Memory Access and Kernel Errors: A CUDA illegal memory access error occurs when using torch.nn.attention.flex_attention with create_block_mask under certain large key-value and small query length conditions, especially when compiled with torch.compile, indicating possible out-of-bounds or indexing issues in the kernel.
  • [issues/174923]
  • Test Failures Due to Missing Dependencies and Disabled Tests: Some tests fail or are disabled due to missing Python modules like 'dominate' or segmentation faults caused by attribute errors during teardown, affecting multiple platforms and CUDA versions.
  • [issues/174919, issues/175019, issues/175065]
  • Loss Function and Gradient Computation Bugs: The NLLLoss function fails to backpropagate gradients for non-contiguous 4D inputs, and CrossEntropyLoss on MPS incorrectly accepts invalid 1D float labels without error, leading to inconsistent validation and runtime errors.
  • [issues/174943, issues/175084]
  • CPU Test Tolerance and Documentation Issues: Some CPU tests fail due to overly strict tolerance thresholds that do not accommodate expected numerical differences in low-precision operations, and minor documentation problems exist with the varlen_attn feature regarding import errors and argument formatting.
  • [issues/174952, issues/174961]
  • Inductor and Periodic-Dynamo Benchmark Instabilities: Numerous issues report instability and flakiness in inductor-periodic and periodic-dynamo benchmark tests across CPU and CUDA platforms, affecting various TorchBench test suites and configurations.
  • [issues/175121, issues/175122, issues/175123, issues/175124, issues/175125, issues/175126, issues/175127, issues/175128, issues/175129, issues/175130, issues/175131, issues/175132, issues/175133, issues/175134, issues/175135, issues/175136, issues/175137, issues/175138, issues/175139, issues/175140, issues/175141, issues/175142, issues/175143]
  • Name Mangling and Build Linking Errors: There is a problem with inconsistent name mangling between C++ and CUDA for templated functions, causing linking errors during the build process due to mismatched mangled symbols.
  • [issues/174898]
  • Feature Request for Selective Gradient Blocking: A new autograd feature is requested to enable selective upstream gradient blocking per loss without extra memory or compute overhead, allowing efficient multi-loss model training without multiple backward passes or graph retention.
  • [issues/175165]
  • Float8 Layout Test Bug on Specific GPUs: The test_float8_basics_cuda test incorrectly expects a RuntimeError for Spark and Thor GPUs regarding float8 layouts, but these GPUs actually support some layouts, causing test failures.
  • [issues/175182]
  • Negative Step Slicing Support Proposal: A feature request proposes adding support for negative step sizes in slicing syntax by internally rewriting such slices to a combination of flip and positive-step slicing, improving ergonomics and aligning with NumPy behavior.
  • [issues/175240]
  • Numerical Accuracy and Undefined Behavior in Linear Algebra: A CUDA test fails numerical accuracy checks due to discrepancies in linear algebra computations with high condition number matrices, and undefined behavior is reported in GemmHelper caused by improper use of vector::reserve() leading to out-of-bounds access.
  • [issues/175282, issues/175302]
  • Warning Suppression Request for Readonly NumPy Arrays: There is a request to suppress or optionally disable warnings generated when creating a Tensor from a readonly NumPy array, as these warnings clutter logs without affecting training or inference.
  • [issues/175395]
  • Python 3.15 Support Initiation: Discussions and requests are made to begin support for Python 3.15 in PyTorch to enable timely compatibility shortly after the official Python 3.15 release.
  • [issues/175402, issues/175407]
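The negative-step proposal above (issues/175240) rests on a simple identity: a negative-step slice equals a flip followed by a positive-step slice, with start and stop remapped into the flipped coordinate system. A stdlib sketch of that identity on plain Python lists (PyTorch tensors currently reject negative steps, which is what the proposal would change):

```python
a = list(range(10))

# Simple case: x[::-k] == flip(x)[::k]
assert a[::-2] == a[::-1][::2]
assert a[::-3] == a[::-1][::3]

# With explicit bounds, index i in the original maps to len(x) - 1 - i
# in the flipped list, and the step sign is dropped.
n = len(a)
start, stop, step = 8, 2, -2
direct = a[start:stop:step]                             # [8, 6, 4]
rewritten = a[::-1][n - 1 - start : n - 1 - stop : -step]
assert rewritten == direct
```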

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 205

Key Open Pull Requests

1. [test] Add basic pyrefly infer command: This pull request introduces a basic pyrefly infer command along with initial setup, tests for autotyping, lint fixes, directory renaming to avoid clashes, reference typing completion, integration of a custom directory switcher, and test dependencies to enhance the pyrefly functionality within the PyTorch project.

  • URL: pull/175153
  • Associated Commits: be3d0, 11f5d, 0d4a8, b88e6, d26be, 286e7, cf386, c5eef, c8a85, e4b96, 5270f, 7cfce, a34c5, cdfab

2. [inductor] Add inline PTX pow for bitwise CUDA parity: This pull request adds an inline PTX implementation of the powf_cuda helper function using non-flush-to-zero instructions to precisely match CUDA's powf behavior, improving numerical accuracy when eager_numerics.pow_precision is enabled.

  • URL: pull/175227
  • Associated Commits: 7bc03, d96ba, 9487b, 9b6a3, 62c88, 36e56, 988a9, 840b1, b773a, de3d9, aa3c3

3. [ROCM] Refactor BFloat16 implementation for native usage of BF16 to float conversion.: This pull request refactors the BFloat16 implementation in the PyTorch ROCm environment to utilize HIP's native __bf16 hardware-accelerated float-conversion capabilities, replacing software-based conversions with more efficient constructors, conversion operators, and operator overloads that improve performance, compatibility, and maintainability of BFloat16 operations.

  • URL: pull/175303
  • Associated Commits: 8d386, 05c88, 92b0a, 3c59c, faf0e, fc78a, ca8b2, d6401, c9549, cd0da, 463da

Other Open Pull Requests

  • Precision and FMA Lowering in Inductor Backend: Multiple pull requests improve precision in the PyTorch inductor backend by modifying powf_cuda inline PTX helpers to fallback to libdevice.pow for fp64 inputs and by skipping decomposition of addcdiv and addcmul operations. These changes enable fused multiply-add (FMA) lowering, achieving precision parity with eager CUDA and benefiting optimizers like Adam and AdamW.
  • [pull/175268, pull/175310, pull/175309]
  • Inductor Backend Tiling and Memory Planning Enhancements: Updates to the Inductor backend include extending ND tiling heuristics to support output (pointwise) dimensions in reduction kernels and improving memory planning for symm_mem collective operations by handling inputs not controlled by Inductor. These improvements involve automatic identity copies, communication layout propagation, and delegation of code generation to fix uninitialized buffer bugs and support complex memory scenarios.
  • [pull/175308, pull/175449]
  • Autotuning Integration and Benchmarking: The custom operator autotuning infrastructure is integrated into the aten.mm matrix multiplication function with added autotuning options and tests. Additionally, a new Inductor benchmarker is introduced for ROCm platforms using the Torch profiler to improve kernel timing accuracy during autotuning, with potential benefits for NVIDIA platforms.
  • [pull/175278, pull/175097]
  • Debugging and Dynamo Improvements: Several pull requests enhance debugging and tracing in Dynamo by fixing exception handling in InteractiveDebugSession, modifying the 'q' command to exit immediately, and removing unnecessary unimplemented prefix variables related to comprehension graph breaks. These changes address test failures and improve developer experience during debugging.
  • [pull/175173, pull/175200, pull/175420]
  • Continuous Integration and Workflow Updates: Updates include skipping sccache PATH wrappers in ROCm CI to fix AOTriton build breakage, increasing ROCm nightly binary build timeouts to 360 minutes, updating vLLM tests and benchmarks to CUDA 13.0, and bumping the transformers library version to 5.2.0 with fixes for compatibility and test failures.
  • [pull/175443, pull/175152, pull/175274, pull/175393]
  • Higher-Order Ops and API Additions: A print_backward flag is added to the torch._higher_order_ops.print operation to print gradients during backward passes without breaking graphs under torch.compile. A per-Tensor API for selective activation checkpointing is introduced, allowing explicit pinning of intermediate tensors to avoid recomputation, with refactoring of SAC storage to support retroactive insertion.
  • [pull/175224, pull/175348]
  • ONNX Export and Graph Capture Fixes: A validation failure in torch.onnx.export when using renamed input names and dynamic shapes is fixed by remapping dynamic shape keys to original parameter names. The error message for CPU-to-CUDA tensor copies during graph capture is improved to clarify requirements for pinning and capture error mode settings.
  • [pull/175279, pull/175281]
  • Expression Cache and Configuration Defaults: A per-SymNode expression cache keyed on the _replacements_version_counter is introduced to optimize expression handling. The default value of the wrap_inductor_compiled_regions configuration option is set to True to update PyTorch's default behavior.
  • [pull/175169, pull/175353]
  • Tensor Resource Management Enhancements: The torch::stable::from_blob function is extended to accept complex callable deleters such as lambdas with captures, enabling more flexible resource management for use cases like TorchCodec without global maps or thread contention workarounds.
  • [pull/175089]

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 243

Key Closed Pull Requests

1. Enable hipSOLVER for supported linalg operations on ROCm: This pull request aims to enable hipSOLVER support for specific linear algebra operations on the ROCm platform by adding appropriate guards and fallbacks to utilize hipSOLVER's 64-bit APIs and batched operations where supported, while falling back to MAGMA for unsupported functionalities, thereby improving ROCm's linalg backend capabilities.

  • URL: pull/175367
  • Associated Commits: 8b80a, 34f78, 88129, dd44b, 1c96f, 3d409, 7672d, a7ad4, 36b75, 32803, f5519, 94de7, 7fb50, 61a38, 4b515, d9556, 48007, b9f71, 94c6e, 25087, ac013, b6ee5, 08906, 77acf, 03aab, b2db1

2. More size-hinting cleanups: This pull request focuses on cleaning up size-hinting in the codebase by replacing all size_hint calls with fallback to use optimization_hint, removing fallback parameters from size_hint calls in preparation for its eventual deletion, and updating calls from symbolic_hint() to replace_backed_with_hints().

  • URL: pull/174580
  • Associated Commits: 9c8d4, 8dc39, d4b3f, 0a7e5, aa75b, d15eb, 8d4bc, b3c4d, a1277, 3fce3, 6dc29, 63d42, a8cfd, 258b9, 9fc6c

3. [DTensor] Strategy Validation (3/3): strategy querying, orchestrator, and CLI: This pull request adds a comprehensive DTensor sharding rule validation framework. It includes an orchestrator that queries DTensor's claimed sharding strategies via multiple paths, computes ground-truth validity for each placement combination, and detects and reports discrepancies such as incorrect or missing rules with false-positive mitigations; it also provides a CLI for running validations on individual or all registered operators, plus end-to-end tests to ensure reliable detection of DTensor bugs.

  • URL: pull/174800
  • Associated Commits: 9c568, e4070, b8c6b, 0b24b, 97bd9, cc1a2, 6af72, cb134, 12a1b, 18d26, 1516d, 90b67, 889a4, 59049, 34546

Other Closed Pull Requests

  • DTensor validation and decomposition improvements: Multiple pull requests enhance DTensor functionality by adding a validation engine for sharding rules that simulates distributed execution and by making redistribution semantics more lenient to allow CIA operations through the decomposition flow again. These changes improve rule validation accuracy and decomposition flexibility within DTensor.
    • pull/174799, pull/175194
  • DTensor printing and backward function updates: Support for DTensor arguments in the torch._higher_order_ops.print operator was added to enable rank-wise printing without collectives, and backward functions for from_local and to_local were updated to handle None gradients explicitly. These updates improve debugging capabilities and prevent autograd type mismatch errors.
    • pull/175222, pull/173865
  • Pallas TPU backend enhancements: Several pull requests introduce element-wise operation support, initial inductor IR lowering, broadcasting, and tiling improvements for the Pallas TPU backend, while fixing CPU and TPU-related issues. These contributions advance TPU support and improve code generation and functionality.
    • pull/174743, pull/175027
  • vLLM submodule and size_hint migration: Updates include pinning the vLLM submodule commit, reorganizing test paths, fixing CUDA 12.8 build issues, and rewriting size_hint usages to support unbacked tensors with more precise APIs. These changes enhance compatibility and correctness for vLLM and related tensor operations.
    • pull/175238, pull/174937, pull/175216
  • nonstrict_trace feature improvements: Support for nn modules as inputs was added, along with improved documentation and additional tests for the nonstrict_trace feature. These enhancements increase the feature's usability and reliability.
    • pull/172372, pull/172395
  • ProcessGroup and tracing fixes: The ProcessGroup class was modified to use an ABC metaclass allowing FakeScriptObject to register as a virtual subclass, enabling correct isinstance behavior during tracing. This fix improves tracing correctness when handling FakeScriptObjectStack instances.
    • pull/172566
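The virtual-subclass mechanism that fix relies on can be illustrated with the standard library's abc module; the class names below are illustrative stand-ins, not PyTorch's actual ProcessGroup or FakeScriptObject implementations:

```python
from abc import ABCMeta

# A base class built on ABCMeta can accept "virtual" subclasses that do not
# inherit from it, mirroring how ProcessGroup can recognize a fake object
# during tracing. These names are hypothetical, for illustration only.
class ProcessGroupLike(metaclass=ABCMeta):
    pass

class FakeScriptObjectLike:
    # No inheritance relationship to ProcessGroupLike at all.
    pass

# Registering FakeScriptObjectLike as a virtual subclass makes isinstance
# and issubclass checks succeed without touching either class hierarchy.
ProcessGroupLike.register(FakeScriptObjectLike)

print(isinstance(FakeScriptObjectLike(), ProcessGroupLike))  # True
print(issubclass(FakeScriptObjectLike, ProcessGroupLike))    # True
```

This is why code that guards behavior with isinstance(obj, ProcessGroup) can transparently accept the fake object during tracing once registration happens.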
  • Inductor backend autotuning and configuration patches: Scoped configuration patches propagation was enabled for specific operations during autotuning by adding a new Operation._config_patches field and tagging SubgraphBuffer operations, improving performance without globally enabling coordinate descent tuning.
    • pull/175277
  • Custom operator autotuning with CUDA graph benchmarking: CUDA graph benchmarking capabilities were added to ExternKernelCaller and SubgraphChoiceCaller with parameters to ensure significant speedup selection and fair performance comparisons using CUDA graph capture and replay. This improves autotuning accuracy and efficiency.
    • pull/175275
  • Tracing performance optimization: A _LazyStackTrace class was introduced to defer expensive stack trace symbolization and formatting until accessed, eliminating about 10% tracing time overhead caused by eager summarization during node creation. This optimization reduces tracing latency.
    • pull/175334
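The deferral pattern behind that optimization can be sketched with the standard traceback module; the class body here is an assumption for illustration, not the PR's actual _LazyStackTrace code:

```python
import sys
import traceback

class LazyStackTrace:
    """Capture a cheap frame reference now; format the trace only on demand."""

    def __init__(self):
        # Grabbing a frame reference is cheap compared to symbolizing and
        # formatting the whole stack. (Caveat: holding a frame keeps its
        # locals alive until the trace object is dropped.)
        self._frame = sys._getframe(1)
        self._formatted = None

    def __str__(self):
        # The expensive extract/format work runs only on first access,
        # then the result is cached.
        if self._formatted is None:
            summary = traceback.extract_stack(self._frame)
            self._formatted = "".join(traceback.format_list(summary))
        return self._formatted

def make_node():
    # Hot path stays fast: creating a node does no formatting at all.
    return LazyStackTrace()

trace = make_node()
print("make_node" in str(trace))  # formatting happens only here
```

Because node creation dominates tracing, moving the formatting cost off that hot path is what recovers the reported ~10% overhead.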
  • DTensor stack operation and ShardingPropagator fixes: Fixes were applied to dimension normalization in the DTensor stack operation and to prevent hangs in ShardingPropagator during multi-threading tests by enabling a lock only in testing. These changes improve stability and correctness in DTensor components.
    • pull/174640, pull/174820
  • InputObserver custom empty tensor support: Support was added for specifying a custom empty tensor in InputObserver to handle missing inputs like pixel_values during sequential forward calls, ensuring consistent input observation in multi-modal models.
    • pull/174964
  • Error message improvements: The error message for MultiMarginLoss was enhanced to provide clearer, more detailed explanations about target tensor size inconsistencies, aiding user debugging.
    • pull/174072
  • Pyrefly GitHub action error handling fix: A temporary fix was implemented to decode JSON error output from Pyrefly when running in GitHub actions, ensuring compatibility in CI and local environments while a comprehensive fix is developed.
    • pull/175289
  • Dynamo einops version check revert: A partial revert was made to disable the einops 0.8.2 version check in Dynamo by falling back to prior behavior using allow_in_graph, preventing excessive warning logspam caused by tracing einops operations with @lru_cache.
    • pull/175351

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
laithsakka       184      19             2       9
albanD           178      2              1       0
malfet           113      20             0       31
anijain2305      147      9              5       3
wconstab         119      17             0       23
pianpwk          125      19             0       5
ydwu4            121      11             0       0
eellison         99       12             0       16
weifengpy        99       2              0       1
guilhermeleobas  86       11             0       3

Don't miss what's next. Subscribe to Weekly Project News:
Powered by Buttondown, the easiest way to start and grow your newsletter.