Weekly GitHub Report for PyTorch: February 08, 2026 - February 15, 2026 (15:14:17)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation and Windows binaries, and a backward-incompatible security improvement flipping the default weights_only parameter in torch.load; additionally, PyTorch has deprecated its official Anaconda channel and updated Linux binaries to use Manylinux 2.28 with CXX11_ABI=1.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [MODULE: DOCS] [MODULE: CUDA] [TRIAGED] [ENHANCEMENT] [BOT-TRIAGED] document torch.cuda.memory.mem_get_info: This issue requests the addition of documentation for the function torch.cuda.memory.mem_get_info within the existing CUDA memory management notes, highlighting that the function may incur a significant delay due to CUDA initialization or synchronization when called. The discussion also emphasizes the need to clarify the use cases of this API, its behavior in multi-process GPU sharing scenarios, and the inclusion of related memory APIs for better discoverability in the documentation.
- The comments include an offer to submit a pull request with a brief note about the function’s latency, clarifications on when delays occur, and suggestions to add information about multi-process GPU memory usage. Participants discuss the current state of the documentation, note missing related APIs, and agree on adding a dedicated section for these functions to improve user understanding, culminating in a submitted PR to address the issue.
- Number of comments this week: 16
- [TRIAGE REVIEW] [MODULE: MEMORY USAGE] [MODULE: REGRESSION] [MODULE: FSDP] [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] [MODULE: COMPILE-TIME] [BOT-TRIAGED] Recompile time with torch.compile 2.10: This issue reports a significant increase in recompilation time and GPU memory overhead after upgrading to torch 2.10, particularly when using Fully Sharded Data Parallel (FSDP) wrapping and dynamic shapes in the model. The user observes that warmup time for multiple shapes has increased from 30 minutes to several hours and suspects that custom operators like a manual RMSNorm implementation may be causing inefficiencies with torch.compile in this version.
- The comments include detailed logs showing numerous guard failures causing recompilations, requests for more environment and reproducer details, discussion about NaN losses with the repro, and observations that switching from a custom RMSNorm to the built-in torch.RMSNorm alleviates the issue, suggesting problems with custom operators and dynamic shapes in torch.compile 2.10.
- Number of comments this week: 7
- [TRIAGE REVIEW] [MODULE: CUDA] [MODULE: CUBLAS] [ONCALL: PT2] [MODULE: INDUCTOR] [MODULE: VLLM] [BOT-TRIAGED] [vllm] CUBLAS_STATUS_INVALID_VALUE in cublasGemmEx after upgrading to PyTorch 2.10: This issue describes a runtime error caused by a CUDA version incompatibility when using vLLM with PyTorch 2.10, specifically a CUBLAS_STATUS_INVALID_VALUE error in cublasGemmEx that occurs with CUDA 12.9 but not 12.8. The problem arises because PyTorch 2.10 pins the nvidia-cublas-cu12 version to 12.8.4.1, preventing users from upgrading to the required 12.9.1.4 version to fix the error, complicating installation and compatibility with vLLM.
- The comments clarify that the error is due to a mismatch between the CUDA version used by PyTorch and the one required by vLLM, recommending installing PyTorch with the CUDA 12.9 wheel to resolve the issue, but noting that this conflicts with PyPI's single CUDA wheel version policy and the pinned dependencies in PyTorch 2.10, making it a packaging and compatibility challenge rather than a bug in PyTorch itself.
- Number of comments this week: 6
- [ONCALL: DISTRIBUTED] [BOT-TRIAGED] Feature Request: Activation offloading with async prefetch in FSDP: This issue proposes adding activation offloading as a first-class feature within FSDP's CPUOffloadPolicy to asynchronously offload activation tensors to CPU during forward passes and prefetch them back before backward passes, reusing existing prefetch infrastructure for unified PCIe scheduling and consistent API design. The motivation is to optimize memory usage and PCIe bandwidth by treating activation offloading similarly to parameter offloading, enabling better overlap of data transfers and computation, especially beneficial for long-sequence training scenarios.
- The comments express strong support for the proposal, discuss the complementary nature of activation offloading and activation checkpointing, clarify implementation details such as offloading intermediate tensors via autograd hooks, and highlight practical considerations including platform-specific PCIe/NVLink bandwidth and memory usage trade-offs; there is consensus on the high priority of native support in FSDP and interest in both short-term and long-term implementation strategies.
- Number of comments this week: 6
- [MODULE: BUILD] [MODULE: LINT] [TRIAGED] [BOT-TRIAGED] [MODULE: SPIN] spin lint doesn't regenerate cuda headers: This issue reports that the spin linting process does not regenerate two CUDA header files, "cuda_cmake_macros.h" and "CUDAConfig.h," which leads to lint errors because these files are missing. The problem arises because these headers are only generated when CUDA is enabled during the build, and the current lint regeneration command does not fully trigger the necessary build steps to produce both files, especially when USE_CUDA=0.
- The comments discuss possible solutions including adding a "module: spin" label for tracking, clarifying the generation process of the headers, and debating whether to enable CUDA during lint regeneration or create shim headers for missing files when CUDA is disabled; the consensus leans towards either enabling CUDA during linting or providing shims to avoid lint failures.
- Number of comments this week: 5
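The most-discussed issue above concerns documenting torch.cuda.memory.mem_get_info. For context, a hedged usage sketch (guarded so it degrades gracefully when no GPU, or no torch, is present; the formatting is illustrative, not taken from the issue):

```python
# mem_get_info reports driver-level (free, total) bytes for a device, which
# can differ from the caching allocator's own statistics. The first call may
# be slow if it has to initialize the CUDA context, which is the latency the
# issue asks the docs to mention.
try:
    import torch
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # defaults to current device
        print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
    else:
        print("no CUDA device available")
except ImportError:
    print("torch not installed")
```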
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 59
Summarized Issues:
- Profiling and Tracing Bugs: Using torch.profiler.profile with with_stack=True causes incorrect rendering and ignored spans for record_function and ProfilerStep in Chrome tracing and Perfetto due to incoherent start and end times for short-lived spans. This results in inaccurate profiling outputs that hinder performance analysis.
- Performance Regressions with CUDA Graphs: PyTorch 2.10 shows a significant throughput regression compared to 2.9 when using torch.compile(mode="reduce-overhead") on H100 GPUs, mainly due to a 7x slowdown in cudaGraphLaunch calls and new Python overhead in cudagraph_trees.py. This leads to degraded performance and round-over-round throughput decline during CUDA graph replay in GPT inference workloads.
- Non-Determinism and Incorrect Results in Compilation: The function torch.nn.functional.ctc_loss compiled with torch.compile sometimes produces inconsistent and incorrect results across multiple executions even with fixed seeds, especially when cloning input tensors. This non-determinism affects several backends including eager, aot_eager, and inductor, undermining reliability.
- Environment Variable Handling in Compilation: The aot_compile() function does not properly respect the TORCH_COMPILE_DISABLE environment variable, causing unexpected behavior when Dynamo is disabled. This leads to confusion for users trying to disable torch.compile.
- CUDA Memory and Synchronization Issues: The Inductor backend has a bug where non-blocking device-to-host copies lack proper host-side synchronization before CPU kernel consumption, causing potential race conditions as CPU kernels may read buffers before DMA transfers complete. Additionally, a CUDA illegal memory access occurs in AOTInductor during compilation of a SAM3 model with dynamic batch size and text prompt dimensions when math SDPA is enabled, linked to split reduction codegen and softmax fallback.
- Documentation and API Discoverability Requests: There is a request to add documentation for torch.cuda.memory.mem_get_info to the CUDA memory management notes, including discussion of potential delays due to CUDA initialization or synchronization. Consolidating memory-related API references is suggested to improve discoverability.
- Numerical Accuracy and Data Type Issues: The torch.histc function produces inconsistent results with float16 due to accumulation and rounding errors violating numerical invariants, suggesting the need for float32 internal accumulation. Also, torch.nn.CrossEntropyLoss on MPS incorrectly accepts 1D float labels including invalid values without error, unlike CPU.
- Python Version Compatibility and Serialization Bugs: Python 3.13's pickle module no longer supports pickling code objects like tracebacks, causing failures during distributed checkpoint saving and object gathering. Separately, the DefaultsSource class fails serialization/deserialization with the dynamo precompile cache due to properties with init=False, causing a TypeError.
- Device Initialization and Driver Overhead: Inductor's initialization registers an XPU-specific fallback by querying torch.xpu.is_available(), which triggers device driver initialization unnecessarily even when only CPU is used, causing overhead.
- Linear Algebra Performance and Test Skips: torch.linalg.eigh is significantly slower than CuPy's equivalent for batched inputs, indicating the heuristics for cuSOLVER driver selection need reconsideration. Additionally, the test_tensorinv test is skipped on ROCm platforms due to related issues.
- Build and Linting Failures: The spin linting process fails to regenerate CUDA header files, causing file-not-found errors when USE_CUDA=0, with potential fixes including header shims or configuration changes. Also, a missing Python module 'dominate' causes inductor test failures due to dependency installation issues.
- Compilation and Runtime Crashes: Multiple crashes occur, including a segmentation fault in torch.nn.functional.scaled_dot_product_attention with zero-sized tensors, a crash in the Inductor backend during the backward pass of BertForMaskedLM with bfloat16 precision, and a crash in the Pallas backend due to an incorrect reshape in argmax/max with keepdim=True.
- Kernel Launch and Compatibility Issues: Pre-compiled CUDA kernels for the NVIDIA GeForce RTX 5060 (sm_120) fail to launch on Windows despite correct cubin files, while runtime-compiled kernels work. Also, index_reduce is not implemented on the MPS device, causing a NotImplementedError, with a CPU fallback suggested.
- Distributed and Sharded Model Compilation Errors: Compiling a sharded embedding layer with dynamic sequence length using torch.compile and DTensor's RowwiseParallel backend causes an IndexError in distributed settings.
- Mixed Precision and Autocast Incompatibility: torch._grouped_mm is incompatible with automatic mixed precision autocasting, causing runtime dtype mismatch errors; suggested fixes include registering it with autocast or documenting the incompatibility with workarounds.
- Export and Optimization Failures: torch.export.export(..., strict=True) fails with PYTHONOPTIMIZE>0, likely due to stripped assert statements. Also, a bug causes torch.export with torch.while_loop to fail with a GuardOnDataDependentSymNode error if input tensors have data-dependent shape dimensions.
- Inductor Backend Incorrect Results: Using torch.compile with Inductor produces incorrect results for broadcast addition on complex tensors (imaginary part set incorrectly), for torch.as_strided on channels_last tensors, and for in-place addition add_ on transposed tensors followed by reshape, causing output mismatches with eager mode.
- Name Mangling and Linking Issues: Inconsistent name mangling between C++ and CUDA compilers causes undefined symbols due to differences in mangled function names for templated functions across libtorch_cpu.so and libtorch_cuda.so.
- Build Dependency and Test Failures: The Inductor-pallas TPU test job fails due to a missing dependency declaration for the complex.h header in the torch_tpu codebase, with AWS S3 permission errors as secondary cosmetic failures. Inductor CPU and CUDA13 tests show instability referencing related issues.
- Attention Kernel Memory Corruption: Scaled dot product attention (SDPA) on MPS backend has out-of-bounds memory access causing inconsistent and incorrect outputs due to suspected memory corruption in the two-pass attention kernel.
- Parameter Validation and API Consistency: torch.nn.functional.scaled_mm improperly handles the swizzle_{a,b} parameters by allowing lists without length validation, causing potential errors and inconsistent swizzle application across GEMM implementations.
- CUDA Graph and RNG Handling Improvements: Multiple issues with torch.cond() include compatibility with graph capture stream reuse, RNG handling within CUDA graphs, adding inductor support, optimizing conditional nodes for CUDA >= 12.8, and enabling combined forward/backward/NaN checks/optimizer execution in a single CUDA graph.
- DTensor Enhancements: Proposal to improve gen_single_dim_einsum_strategies in DTensor by adding linearity rules for batch dimensions to enhance implementation and testing.
- CUDA Memory Allocator Bug: The CUDACachingAllocator user memory pool incorrectly inherits CUDA graph capture behavior, deferring block freeing improperly and disabling out-of-memory recovery, leading to unreclaimable memory blocks and failed allocations despite cached memory availability.
- Test Disabling on XPU: The test_comprehensive_linalg_lu_factor_ex_xpu_float32 test is disabled on the xpu platform due to consistent failures on the main branch.
- Cross-Repository CI Coordination: Proposal for a GitHub App and Relay Server to enable PyTorch main repo to trigger and relay CI results from downstream out-of-tree backend repos, ensuring transparent upstream-downstream linkage and compatibility.
- DistributedDataParallel Buffer Versioning Bug: Inplace operations on BatchNorm buffers during multiple forward passes in DDP-wrapped modules cause RuntimeError due to version counter increments on BatchNorm running stats, leading to autograd version mismatches during backward.
- Mega-Cache Guard State Mismatch: Mega-Cache generates mismatched guard states when loading compilation output because guard state serialization occurs inside amp context but reload constructs GLOBAL_STATE guard before amp context, causing dispatch key set mismatch runtime errors.
- Code Cleanup: Removal of a condition that never evaluates to true is proposed to reduce technical debt.
- Docker Image Consolidation: Plan to consolidate or retire the pytorch/almalinux-builder Docker images used for MAGMA compilation and CI runners following MAGMA's deprecation in favor of cuSOLVER/cuBLAS, requiring updates to CI workflows and release processes.
- CUDA Runtime and Library Version Mismatch: The CUDA error CUBLAS_STATUS_INVALID_VALUE occurs in cublasGemmEx when using vLLM with PyTorch 2.10 and CUDA 12.9 due to a mismatch between the CUDA runtime and the nvidia-cublas-cu12 library pinned by PyTorch, preventing dependency resolution without manual incompatible installs.
- Test Failures Due to Strict Tolerances: Three CPU tests fail because tolerance thresholds for low-precision operations (bfloat16/int4 quantized) against higher-precision references are set too strictly, causing expected numerical differences to trigger failures.
- Activation Offloading Feature Request: Proposal to add activation offloading as a first-class feature in FSDP's CPUOffloadPolicy to asynchronously offload activations to CPU during forward and prefetch them before backward, leveraging existing prefetch infrastructure for unified PCIe scheduling, memory reduction, and a consistent API for large-scale training.
- Varlen Attention Minor Issues: Two minor problems with varlen_attn: the function is not imported by default causing import errors, and the documentation format for its arguments is incorrect.
- CUDA Context Error Noise Reduction: Excessive and spurious "No CUDA context is current to the calling thread" error messages occur when using CUDA_LOG_FILE=stdout with CUDA 12.9, and efforts aim to reduce noise in debug output on the PyTorch side for a better user experience.
- FFT and Signal Processing Errors: torch.fft.rfft raises MKL FFT errors when called with empty tensors due to inconsistent configuration parameters, suggesting clearer API-level errors. Also, torch.istft raises unhelpful internal runtime errors with certain valid parameters, indicating a need for improved input validation or special-case handling.
- Broadcasting Bug in isclose: torch.isclose raises runtime errors due to a shape mismatch when broadcasting a scalar against a tensor with equal_nan=True, unlike other APIs that handle broadcasting correctly, indicating an internal bug.
- Autotune Testing and Determinism: Feature request to systematically run and test all autotune configurations generated by torch.compile to identify and debug nondeterministic bugs exposed by autotuning. Also, a proposal to add a deterministic alternative to Inductor's compile-time autotuning to reduce the performance cost of using torch.use_deterministic_algorithms().
- Dynamo and Introspection Issues: Dynamo tracing fails on flex attention with callable classes for block masks due to an inability to handle inspect module introspection on callable classes, especially in Python 3.12.11.
- Property Shadowing Regression: Regression in PyTorch 2.10 where using a property that shadows a variable name in a compiled class causes torch.compile to crash, a behavior not present in 2.9.
- Cache API Hooks Proposal: Proposal to add API hooks to TorchInductor caching systems to allow downstream projects to define custom recompilation triggers by influencing cache hit/miss decisions across DynamoCache, AOTCache, and FXGraphCache.
- Debug Build Checks and Custom Accelerators: Mandatory debug checks on view-like operations enforcing shared storage cause errors in custom accelerator backends like NPU, with proposals to add build options or change errors to warnings to ease debugging.
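Several items above, notably the torch.histc float16 inconsistencies, trace back to accumulation in narrow floating-point formats. A stdlib-only sketch (the helper below is illustrative, not PyTorch code) shows a float16-style running sum silently losing counts:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Emulate a histogram bin counter kept in float16: add 1.0 three thousand
# times, rounding the running sum to fp16 after every addition.
count = 0.0
for _ in range(3000):
    count = to_fp16(count + 1.0)

print(count)  # 2048.0 — fp16's spacing at 2048 is 2, so each +1 rounds away
```

Keeping the accumulator in float32 and casting once at the end, as the issue suggests, avoids the stall.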
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 42
Summarized Issues:
- Numerical and Precision Bugs in CUDA and Mixed Precision: Several issues report numerical inconsistencies and precision-related bugs in CUDA implementations, including NaN outputs in nn.LSTM when processing batches versus single samples, severe numerical divergence in LSTM outputs between CUDA and CPU, and mixed precision training causing NaNs in validation loss after upgrading PyTorch. These problems highlight challenges in maintaining numerical stability and correctness across different backends and precision modes.
- issues/173334, issues/173927, issues/174441
- Data Type Support and Incorrect Outputs on CUDA and Inductor Backends: Multiple issues describe incorrect outputs or failures when using certain data types on CUDA or the inductor backend, such as torch.special APIs producing infinite values with uint16 on CUDA, aminmax producing wrong minimum values on uint8 tensors with inductor, and runtime errors in torch.corrcoef and torch.cov with complex types on CUDA. These indicate incomplete or faulty datatype support in various PyTorch components.
- issues/173636, issues/174378, issues/174382
- Test Failures and Disabled Tests on XPU Platform: Several tests have been disabled due to consistent failures on the xpu platform, including test_weight_norm_conv2d_xpu, test_comprehensive_linalg_multi_dot_xpu_float32, and test_comprehensive_lu_xpu_float32. These disabled tests point to ongoing stability or compatibility issues with the xpu backend in the main branch.
- issues/173994, issues/174175, issues/174770
- Triton and Autotuning Runtime Errors on CUDA: Issues report runtime errors during autotuning and kernel compilation in the Triton backend, including failures in test_unspec_inputs_uint8_cuda and test_unspec_inputs_uint8_cuda_dynamic_shapes_gpu_wrapper due to integer requirements not being met, as well as an nvcc compiler error during fatbin generation. These errors disrupt GPU kernel compilation and autotuning workflows.
- issues/173871, issues/174304, issues/174420
- MPS Backend Functional and Numerical Issues: The MPS backend exhibits correctness problems such as torch.abs overflowing or underflowing for large complex inputs and torch.nn.functional.grid_sample producing incorrect results due to kernel and caching bugs. These issues reveal backend-specific implementation flaws affecting output correctness.
- issues/174246, issues/174339
- Compilation and Export Inconsistencies: Problems arise when compiling or exporting models, including inconsistent outputs from compiled ResNet50 models from the timm library and torch.nn.FractionalMaxPool2d producing different results in eager versus compiled modes due to randomness. Additionally, exporting models with explicit dtype casts generates device-dependent guards, limiting portability. These issues affect model reproducibility and deployment flexibility.
- issues/174467, issues/174549, issues/174666
- PyTorch Continuous Integration and Infrastructure Failures: The CI system faces multiple disruptions including a broken torchcomms CI due to missing shared libraries, an ongoing outage caused by a GitHub service incident, persistent lint failures from missing Python modules, and test failures caused by multiple cpuinfo library instances. These infrastructure problems impact development and testing workflows.
- issues/174486, issues/174600, issues/174624, issues/174637, issues/174587
- Runtime and Permission Errors in PyTorch Binaries and DataLoader: A macOS wheel for PyTorch lacked executable permissions on the torch_shm_manager binary, causing multiprocessing failures with DataLoaders using the spawn method. Additionally, the DataLoader's pin-memory utility passes a deprecated device argument, triggering deprecation warnings due to internal API changes. These issues affect usability and stability of data loading and multiprocessing.
- issues/174680, issues/174546
- Memory and Autograd Graph Partitioning Errors: Errors occur in the autograd partitioner when a forward output is both saved and mutated by a Triton kernel during backward, causing assertion failures, and in the Inductor module where memory format suggestions become inconsistent due to zero strides introduced during lowering of padding operations. These bugs complicate graph partitioning and memory format handling.
- issues/174124, issues/174869
- Flex Attention and KV Caching Bugs: The Flex Attention implementation fails during mixed precision execution due to dtype mismatches and also encounters assertion errors when supporting KV caching with reordered KV entries, indicating underlying implementation issues in attention mechanisms.
- issues/174018, issues/174878
- PyTorch Compile and Dynamo Module Regressions: The torch.compile system produces inconsistent results when Python magic methods like __class__ are overridden, and the Dynamo module regresses by raising AsPythonConstantNotImplementedError when treating certain user-defined objects as constants during JIT and AOT workflows. These regressions affect compilation reliability and user-defined object handling.
- issues/174050, issues/174128
- Distributed Communication Error Detection Challenges: PyTorch's distributed collective operations can silently fail or hang when tensor shapes mismatch across ranks, without raising exceptions, leading to inconsistent communication states. There is a request for diagnostic functions to detect and verify communication health after such errors to improve debugging and robustness.
- issues/174844
- Performance Bottlenecks and Algorithmic Improvements: Performance issues include CUDA synchronization bottlenecks during spectral decompositions and proposals for an online softmax algorithm that significantly speeds up computation and reduces memory usage. Additionally, vectorized CPU kernels for sigmoid-like functions are being explored to improve efficiency.
- issues/174601, issues/174767, issues/174900
- Version Compatibility and Protocol Checks in Distributed Systems: There is a proposal to add explicit protocol version compatibility checks during torch.distributed process group initialization to detect incompatible protocol revisions across mixed PyTorch versions, aiming to fail fast and improve diagnostics in heterogeneous or rolling upgrade environments.
- issues/174917
- Unified Device Abstraction and Dynamic Precision Switching: A feature request proposes a Unified Device Abstraction (UDA) layer to harmonize CUDA and XLA/TPU workflows by introducing Predictive Entropy Quantization (PEQ), which dynamically adjusts tensor bit-precision based on entropy measurements to optimize performance and developer experience across hardware backends.
- issues/174965
- Miscellaneous Bugs and Feature Requests: Other issues include core crashes and numpy compatibility errors on Orange Pi 3 devices, requests for nth smallest element retrieval in tensors, and challenges in computing multiple Jacobian-vector products without redundant forward computations. These highlight diverse user needs and platform-specific problems.
- issues/174785, issues/174551, issues/174659
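A recurring theme in the precision-related issues above is comparing low-precision results against higher-precision references without tripping overly strict tolerances. A stdlib sketch with illustrative numbers (0.333984375 is 1/3 rounded to bfloat16's roughly 8 significant bits; the tolerance choice is an assumption for illustration, not a project policy):

```python
import math

reference = 1.0 / 3.0       # double-precision "ground truth"
bf16_value = 0.333984375    # 1/3 after rounding to bfloat16 precision

# A float32-grade tolerance flags a perfectly correct bfloat16 result...
assert not math.isclose(bf16_value, reference, rel_tol=1e-6)
# ...while a tolerance matched to bfloat16's mantissa width accepts it.
assert math.isclose(bf16_value, reference, rel_tol=2**-8)
```

The relative error here is about 2**-9, comfortably inside a precision-aware bound but three orders of magnitude outside a float32-style one, which is the failure mode behind the three CPU test failures.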
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 247
Key Open Pull Requests
1. Support __dict__ in NestedUserFunctionVariable: This pull request introduces a new variable tracker called DunderDictVariable to provide consistent and reliable support for the __dict__ attribute in Dynamo, addressing the shortcomings of the three existing implementations and enabling proper mutation tracking through a side-effects table.
- URL: pull/174570
- Associated Commits: 9a1f3, cf56e, 16f3e, 56062, 9a69c, 32cbf, 8f375, 0a889, 8e4f0, b8f15, 904e5, e15ee, 1c6ea, cc711, 67b32, 1422a
2. More size-hinting cleanups: This pull request refactors the code by replacing all size_hint calls with fallback to use optimization_hint instead, removes the fallback parameter from size_hint calls in preparation for its eventual deletion, and updates calls from symbolic_hint() to replace_backed_with_hints() to improve size-hinting logic.
- URL: pull/174580
- Associated Commits: 9c8d4, 8dc39, d4b3f, 0a7e5, aa75b, d15eb, 8d4bc, b3c4d, a1277, 3fce3, 6dc29, 63d42, a8cfd, 258b9, 9fc6c
3. [DTensor] Strategy Validation (3/3): strategy querying, orchestrator, and CLI: This pull request adds a comprehensive DTensor sharding rule validator including a strategy querying orchestrator and a CLI tool that compares DTensor's claimed sharding rules against ground truth validity across multiple strategy paths, reports discrepancies such as incorrect or missing rules with false positive mitigations, and provides end-to-end tests and a user-friendly command-line interface for validating individual or all registered operators.
- URL: pull/174800
- Associated Commits: 9c568, e4070, b8c6b, 0b24b, 97bd9, cc1a2, 6af72, cb134, 12a1b, 18d26, 1516d, 90b67, 889a4, 59049, 34546
Other Open Pull Requests
- DTensor validation and strategy improvements: Multiple pull requests enhance DTensor functionality by adding a validation engine for sharding rules that simulates distributed execution on a single machine and improving view operation strategies with a view_groups algorithm and guard_or_false mechanisms to fix numerous DDE-related issues. These changes collectively improve correctness and robustness in DTensor operations, including fixes for dimension normalization and stack operation handling.
- Pallas TPU backend enhancements: Several pull requests add and fix support for the Pallas TPU backend by implementing element-wise operations, initial inductor IR lowering, tiling support, and function legalization, while also addressing CPU compatibility and GPU tiling restrictions. These updates ensure better TPU code generation, broadcasting support, and overall backend stability.
- Memory and tracing improvements in FSDP2: Pull requests address memory leakage issues in FSDP2 by supporting dataclass arguments with a hybrid approach and remove dynamo tracing code from FSDP2 hooks, opting instead to unconditionally disable dynamo tracing in these hooks. These changes improve memory efficiency and simplify tracing behavior in fully shard modules.
- Inductor backend fixes and deterministic behavior: Updates fix allocation issues related to deterministic guards in the inductor backend and add a deterministic backward implementation for the flex flash operation, improving reliability and correctness in compiled code execution.
- Function annotation and code cleanup: A pull request moves function annotations to align with Python standards, and another removes unexpected code in the _range_constraints module while fixing related unit tests to enable dynamic shapes with keyword arguments. These changes improve code clarity and dynamic shape support.
- Size hint and shape mapping optimizations: Pull requests rewrite size_hint usage to support unbacked tensors and introduce guard_or_false around shape comparisons to improve pre- and post-broadcasting shape mappings, fixing many DDE-related issues and enhancing performance and correctness.
- CUDA and cuBLASLt backend improvements: Updates set cuBLASLt as the default BLAS backend for CUDA operations when available and enhance CUDA graph reliability by verifying external input tensor validity, handling pinned host memory, and improving test coverage and code organization.
- Triton and ROCm updates: Pull requests test disabling asynchronous copy in Triton 37 for ROCm, update to the latest Triton LLVM version, and include general Triton updates as part of the PyTorch 2.11 release.
- Tensor subclass and flattening fixes: A pull request improves handling of self references during tracing, implements fallback shallow copies for tensor subclass metadata, catches attribute errors during flatten operations, and refines meta converters to fix subclass initialization and flattening issues.
- Inductor reciprocal operation fix: The reciprocal operation in inductor is fixed by using a float32 constant instead of an integer to ensure proper floating-point division, addressing eager division rounding emulation.
- Compile-time print utilities and bug fixes: New utilities compile_print and make_compile_print enable printing tensor values inside compiled and traced code by wrapping print calls and using hooks, while also fixing bugs related to dispatch key stripping and dead code elimination that previously blocked this functionality.
- InputObserver custom empty tensor support: Support is added for specifying a custom empty tensor in InputObserver to handle missing inputs like pixel_values during subsequent forward calls, ensuring consistent input observation in models processing both images and text.
- CUDA kernel restrict keyword migration: CUDA kernels are migrated to use the RestrictPtrTraits struct to properly apply the restrict keyword to kernel parameters, fixing NVCC implementation issues and improving performance as demonstrated by initial benchmarks.
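The reciprocal fix above comes down to the type of a generated constant: with an integer constant, C-style codegen performs truncating integer division instead of true division. A pure-Python analogy (illustrative only, not Inductor's actual generated code; floor division stands in for C integer truncation):

```python
def reciprocal_int_constant(x: int):
    # Integer constant: in C-like generated code, int / int truncates
    # toward zero (emulated here with Python's floor division).
    return 1 // x

def reciprocal_float_constant(x: int):
    # Float constant: forces true floating-point division, matching
    # eager-mode reciprocal semantics.
    return 1.0 / x

print(reciprocal_int_constant(3))    # truncates to 0
print(reciprocal_float_constant(3))  # ~0.3333
```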
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 285
Key Closed Pull Requests
1. Add mem_get_info usage notes to CUDA memory management docs: This pull request adds clarification notes to the CUDA memory management documentation in PyTorch to help users correctly interpret allocator statistics in relation to device-wide GPU memory usage by explaining the usage of mem_get_info.
- URL: pull/174822
- Associated Commits: 7cd90, 863ed, f4e98, a61b4, 3f18b, 3ecf7, 054b1, 3e976, f568c, 50d4b, 2f470, c5799, ca246, 41835, ffe4a, 40fb4, 2cb07, bcc08, dd832, 06986, 8e224, 62d42, 0084d, aa828, 4e881, 6dd20, 0138d, c27ec, b2633, a5a42, a8295, fba39, 41b6a, e2457, c647e, 188a1, dbc1e, fa6fa, c46a9, adad7, dadfc, 0aaaa, a9c2e, 1375a, 1386a, 66de6, a769a, ab159, e8421, 3a5fb, f3b5d, a266b, 0e245, 2c9af, e43b5, 1f74c, 99cb4, f6e6c, 449b1, 70a7c
2. M11: Distributed Protocol Version Guardrail: This pull request proposes adding an internal distributed protocol version guardrail enforced during the initialization of process groups via a store-based version check, including spawn-safe tests and an environment variable override for simulating version mismatches, while explicitly excluding MPI due to its lack of a rendezvous store path.
- URL: pull/174577
- Associated Commits: 0912e, 6c2ba, 41daf, d1e3e, d72fd, 001b4, e1e4b, b352a, 0d264, 61d4b, 32cf1, 73677, 5fbb0, df86b, 760d4, 34761, c533b, c5f29, 7c7bc, 0ff9d, f75f5, 1cadb, 7ab7c, 2ea65, e8069, 37e69, 63f09, 50435, 738fd, d3932, 0b1e7, 1ed83, 8aadb, 17f7c, 995b2, 11779, 30d9c, e4c52, 59332, d76b3, fdc54, ea8c9, 0655d, f4e3c, 5e238, a3693, 3c933, a8d99, 1901f, 16737
3. Fix FP8 test failures on AMD RDNA4 GPUs: This pull request addresses and fixes FP8 inductor test failures on AMD RDNA4 (gfx120x) GPUs that occur during scaled matrix multiplication tests with small M dimensions (M < 16) by implementing GPU-specific compile mode selection to correct incorrect tensor indexing in autotuned Triton kernels.
- URL: pull/174871
- Associated Commits: b97cf, db39c, d28ec, 6b3a1, c440b, 25a18, 70a5f, d14e5, b296d, a0166, 45156, c2d4e, 30a23, 11ca2, 629e8, ab471, 25366, 777e7, 56002, 223b9, b4c1e, 3d742, da5ac, a3c49, 5ca07, ecdea, f742d, 979dc, 7e17f, 4d673, fe101, c53b5
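The distinction the mem_get_info documentation PR clarifies can be seen in a small sketch: torch.cuda.mem_get_info reports device-wide memory (including other processes), while the caching-allocator counters are per-process, so the two will generally not agree. A minimal helper, assuming a CUDA device may or may not be present:

```python
import torch

def device_memory_overview(device: int = 0):
    """Return (free, total, allocated, reserved) in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    # Device-wide view (covers all processes on the GPU); note the first
    # call may be slow because it can trigger CUDA context initialization
    # and synchronization.
    free, total = torch.cuda.mem_get_info(device)
    # Process-local caching-allocator view; these numbers will generally
    # not match the device-wide figures above.
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    return free, total, allocated, reserved
```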
Other Closed Pull Requests
- InputObserver refactor and test expansion: This pull request refactors the InputObserver class to better handle empty caches during argument inference at the prefill step and significantly expands test coverage to include optional and mixed argument scenarios with dynamic shapes. It also integrates pandas for enhanced discrepancy analysis between model outputs and ONNX exports and updates related tests accordingly.
- DTensor communication optimization and sharding support: Multiple pull requests improve DTensor by optimizing redistribution communications using flattened device meshes to reduce costly sequential operations and adding support for comms beyond all_reduce, banning mixed partial placements, and addressing edge cases. Additional work includes enabling debug mode to print optimized transform info, adding foundational utilities for sharding rule validation, and supporting Partial input placements for matmul with missing sharding propagation rules.
- FSDP2 feature enhancements and testing: These pull requests update unit tests to enable more coverage for Fully Sharded Data Parallel version 2 (FSDP2) on CPU and introduce support for assigning a per-parameter mesh configuration in FSDP2.
- Test re-enablement and assertion removals: One pull request re-enables Inductor X86 backend tests by updating them to no longer use the PT2E API, while several others focus on removing additional test assertions across multiple test files as part of a series of stacked changes.
- Platform-specific debugging and test adjustments: This pull request addresses graph break issues on the ROCm MI350 platform by making code adjustments, updating test expectations, and selectively skipping tests with known accuracy or compatibility problems on gfx950 hardware.
- Device-specific Event class improvements: This pull request enhances PyTorch by enabling device-specific Event classes like torch.xpu.Event and torch.cuda.Event to accept both generic and device-specific Stream inputs, improving API flexibility and fixing issue #173792.
- Pallas TPU CI configuration: This pull request configures continuous integration testing for the Pallas TPU by adding and refining scripts to enable successful pulling and installation of the private torch_tpu repository before its full open-source release.
- Backend migration and performance improvements: One pull request migrates the grid_sampler_2d function to use Metal for backend execution to improve performance, while another proposes skipping the LazyVT optimization for virtual tensors that are realized anyway to reduce overhead and slightly improve compile time.
- PyTorch Dynamo comprehension graph break enhancements: This pull request extends PyTorch Dynamo's handling of comprehension graph breaks to support nested and inlined function contexts by adding new speculation checks, bytecode handling for closure variables, decrement logic for nested comprehensions, and fixes for zero-dimensional FakeTensors, accompanied by extensive new tests.
- Function attribute fix: This pull request fixes an issue related to assigning the __annotations__ attribute of a function within the SET_FUNCTION_ATTRIBUTE operation.
- TypingVariable equality method proposal: This pull request proposes adding an equality method (__eq__) to the TypingVariable class to enhance comparison capabilities, but it was ultimately not merged.
- XPUGraph frontend Python APIs: This pull request introduces frontend Python APIs for XPUGraph, including functions like pool() and debug_dump(), to enhance graph capture and replay capabilities on XPU devices.
- JIT CUDA kernel unsigned integer support: This pull request adds support for uint16, uint32, and uint64 scalar types in JIT-compiled CUDA kernels to fix crashes and incorrect results in torch.special functions like zeta when using unsigned integer inputs on CUDA. It extends relevant macros, improves error handling, and adds corresponding tests.
- CI improvements for TIMM pretrained model caching: This pull request improves the CI process by enabling caching of TIMM pretrained models on a shared Hugging Face cache to prevent benchmark failures caused by offline mode blocking model downloads. It adds a download-only flag, uses a version-pinned cache directory, and implements a stamp file to ensure proper cache preparation.
- Symbolic reasoning optimization: This pull request proposes a simple optimization in symbolic reasoning that returns true early when evaluating sums of symbols/constants all guaranteed to be positive, significantly reducing export time from over 5 minutes to just over 3 minutes.
- Documentation theme update: This pull request updates the pytorch_sphinx_theme2 to version 0.4.3, removing the PyTorch navbar in favor of a main site link, adding dropdowns to the top navigation bar, removing the runllm.js script, adding sphinx-tippy tooltip support, autogenerating an llms.txt file, and incorporating detailed LLM meta tags for enhanced documentation navigation and metadata.
- Profiler and symbolic integer improvements: This pull request enhances profiler visibility by replacing internal profiling calls with torch.autograd.profiler.record_function and updates the torch.combinations implementation to use symbolic integer operations to support dynamic shapes during tracing.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| pianpwk | 194 | 21 | 1 | 13 |
| laithsakka | 182 | 26 | 0 | 9 |
| wconstab | 145 | 15 | 0 | 55 |
| albanD | 174 | 20 | 2 | 4 |
| anijain2305 | 158 | 16 | 0 | 5 |
| ydwu4 | 151 | 11 | 0 | 8 |
| malfet | 102 | 18 | 2 | 21 |
| weifengpy | 101 | 14 | 1 | 13 |
| BenjaminDEMAILLE | 128 | 0 | 0 | 0 |
| guilhermeleobas | 94 | 15 | 0 | 3 |
Access Last Week's Newsletter: