Weekly GitHub Report for PyTorch: February 08, 2026 - February 15, 2026 (15:14:17)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation and Windows binaries, and a backward-incompatible security improvement flipping the default weights_only parameter in torch.load; additionally, PyTorch has deprecated its official Anaconda channel and updated Linux binaries to use Manylinux 2.28 with CXX11_ABI=1.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [MODULE: DOCS] [MODULE: CUDA] [TRIAGED] [ENHANCEMENT] [BOT-TRIAGED] document torch.cuda.memory.mem_get_info: This issue requests the addition of documentation for the function torch.cuda.memory.mem_get_info within the existing CUDA memory management notes, highlighting that the function may incur a significant delay due to CUDA initialization or synchronization when called. The discussion also emphasizes the need to clarify the use cases of this API, its behavior in multi-process GPU sharing scenarios, and the inclusion of related memory APIs for better discoverability in the documentation.
- The comments include an offer to submit a pull request with a brief note about the function’s latency, clarifications on when delays occur, and suggestions to add information about multi-process GPU memory usage. Participants discuss the current state of the documentation, note missing related APIs, and agree on adding a dedicated section for these functions to improve user understanding, culminating in a submitted PR to address the issue.
- Number of comments this week: 16
- [TRIAGE REVIEW] [MODULE: MEMORY USAGE] [MODULE: REGRESSION] [MODULE: FSDP] [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] [MODULE: COMPILE-TIME] [BOT-TRIAGED] Recompile time with torch.compile 2.10: This issue reports a significant increase in recompilation time and GPU memory overhead after upgrading to torch 2.10, particularly when using Fully Sharded Data Parallel (FSDP) wrapping and dynamic shapes in the model. The user observes that warmup time for multiple shapes has increased from 30 minutes to several hours and suspects that custom operators like a manual RMSNorm implementation may be causing inefficiencies with torch.compile in this version.
- The comments include detailed logs showing numerous guard failures causing recompilations, requests for more environment and reproducer details, discussion about NaN losses with the repro, and observations that switching from a custom RMSNorm to the built-in torch.RMSNorm alleviates the issue, suggesting problems with custom operators and dynamic shapes in torch.compile 2.10.
- Number of comments this week: 7
- [TRIAGE REVIEW] [MODULE: CUDA] [MODULE: CUBLAS] [ONCALL: PT2] [MODULE: INDUCTOR] [MODULE: VLLM] [BOT-TRIAGED] [vllm] CUBLAS_STATUS_INVALID_VALUE in cublasGemmEx after upgrading to PyTorch 2.10: This issue describes a runtime error caused by a CUDA version incompatibility when using vLLM with PyTorch 2.10, specifically a CUBLAS_STATUS_INVALID_VALUE error in cublasGemmEx that occurs with CUDA 12.9 but not 12.8. The problem arises because PyTorch 2.10 pins the nvidia-cublas-cu12 version to 12.8.4.1, preventing users from upgrading to the required 12.9.1.4 version to fix the error, complicating installation and compatibility with vLLM.
- The comments clarify that the error is due to a mismatch between the CUDA version used by PyTorch and the one required by vLLM, recommending installing PyTorch with the CUDA 12.9 wheel to resolve the issue, but noting that this conflicts with PyPI's single CUDA wheel version policy and the pinned dependencies in PyTorch 2.10, making it a packaging and compatibility challenge rather than a bug in PyTorch itself.
- Number of comments this week: 6
- [ONCALL: DISTRIBUTED] [BOT-TRIAGED] Feature Request: Activation offloading with async prefetch in FSDP: This issue proposes adding activation offloading as a first-class feature within FSDP's CPUOffloadPolicy to asynchronously offload activation tensors to CPU during forward passes and prefetch them back before backward passes, reusing existing prefetch infrastructure for unified PCIe scheduling and consistent API design. The motivation is to optimize memory usage and PCIe bandwidth by treating activation offloading similarly to parameter offloading, enabling better overlap of data transfers and computation, especially beneficial for long-sequence training scenarios.
- The comments express strong support for the proposal, discuss the complementary nature of activation offloading and activation checkpointing, clarify implementation details such as offloading intermediate tensors via autograd hooks, and highlight practical considerations including platform-specific PCIe/NVLink bandwidth and memory usage trade-offs; there is consensus on the high priority of native support in FSDP and interest in both short-term and long-term implementation strategies.
- Number of comments this week: 6
- [MODULE: BUILD] [MODULE: LINT] [TRIAGED] [BOT-TRIAGED] [MODULE: SPIN] spin lint doesn't regenerate cuda headers: This issue reports that the spin linting process does not regenerate two CUDA header files, "cuda_cmake_macros.h" and "CUDAConfig.h," which leads to lint errors because these files are missing. The problem arises because these headers are only generated when CUDA is enabled during the build, and the current lint regeneration command does not fully trigger the necessary build steps to produce both files, especially when USE_CUDA=0.
- The comments discuss possible solutions including adding a "module: spin" label for tracking, clarifying the generation process of the headers, and debating whether to enable CUDA during lint regeneration or create shim headers for missing files when CUDA is disabled; the consensus leans towards either enabling CUDA during linting or providing shims to avoid lint failures.
- Number of comments this week: 5
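The most-discussed issue above concerns documenting torch.cuda.memory.mem_get_info. For context, a hedged usage sketch (guarded so it degrades gracefully when no GPU, or no torch, is present; the formatting is illustrative, not taken from the issue):

```python
# mem_get_info reports driver-level (free, total) bytes for a device, which
# can differ from the caching allocator's own statistics. The first call may
# be slow if it has to initialize the CUDA context, which is the latency the
# issue asks the docs to mention.
try:
    import torch
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # defaults to current device
        print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
    else:
        print("no CUDA device available")
except ImportError:
    print("torch not installed")
```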
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 59
Summarized Issues:
- Profiling and Tracing Bugs: Using torch.profiler.profile with with_stack=True causes incorrect rendering and ignored spans for record_function and ProfilerStep in Chrome tracing and Perfetto due to incoherent start and end times for short-lived spans. This results in inaccurate profiling outputs that hinder performance analysis.
- Performance Regressions with CUDA Graphs: PyTorch 2.10 shows a significant throughput regression compared to 2.9 when using torch.compile(mode="reduce-overhead") on H100 GPUs, mainly due to a 7x slowdown in cudaGraphLaunch calls and new Python overhead in cudagraph_trees.py. This leads to degraded performance and round-over-round throughput decline during CUDA graph replay in GPT inference workloads.
- Non-Determinism and Incorrect Results in Compilation: The function torch.nn.functional.ctc_loss compiled with torch.compile sometimes produces inconsistent and incorrect results across multiple executions even with fixed seeds, especially when cloning input tensors. This non-determinism affects several backends including eager, aot_eager, and inductor, undermining reliability.
- Environment Variable Handling in Compilation: The aot_compile() function does not properly respect the TORCH_COMPILE_DISABLE environment variable, causing unexpected behavior when Dynamo is disabled. This leads to confusion for users trying to disable torch.compile.
- CUDA Memory and Synchronization Issues: The Inductor backend has a bug where non-blocking device-to-host copies lack proper host-side synchronization before CPU kernel consumption, causing potential race conditions as CPU kernels may read buffers before DMA transfers complete. Additionally, a CUDA illegal memory access occurs in AOTInductor during compilation of a SAM3 model with dynamic batch size and text prompt dimensions when math SDPA is enabled, linked to split reduction codegen and softmax fallback.
- Documentation and API Discoverability Requests: There is a request to add documentation for torch.cuda.memory.mem_get_info to the CUDA memory management notes, including discussion of potential delays due to CUDA initialization or synchronization. Consolidating memory-related API references is suggested to improve discoverability.
- Numerical Accuracy and Data Type Issues: The torch.histc function produces inconsistent results with float16 due to accumulation and rounding errors violating numerical invariants, suggesting the need for float32 internal accumulation. Also, torch.nn.CrossEntropyLoss on MPS incorrectly accepts 1D float labels including invalid values without error, unlike CPU.
- Python Version Compatibility and Serialization Bugs: Python 3.13's pickle module no longer supports pickling code objects like tracebacks, causing failures during distributed checkpoint saving and object gathering. Separately, the DefaultsSource class fails serialization/deserialization with the dynamo precompile cache due to properties with init=False, causing a TypeError.
- Device Initialization and Driver Overhead: Inductor's initialization registers an XPU-specific fallback by querying torch.xpu.is_available(), which triggers device driver initialization unnecessarily even when only CPU is used, causing overhead.
- Linear Algebra Performance and Test Skips: torch.linalg.eigh is significantly slower than CuPy's equivalent for batched inputs, indicating the heuristics for cuSOLVER driver selection need reconsideration. Additionally, the test_tensorinv test is skipped on ROCm platforms due to related issues.
- Build and Linting Failures: The spin linting process fails to regenerate CUDA header files, causing file-not-found errors when USE_CUDA=0, with potential fixes including header shims or configuration changes. Also, a missing Python module 'dominate' causes inductor test failures due to dependency installation issues.
- Compilation and Runtime Crashes: Multiple crashes occur, including a segmentation fault in torch.nn.functional.scaled_dot_product_attention with zero-sized tensors, a crash in the Inductor backend during the backward pass of BertForMaskedLM with bfloat16 precision, and a crash in the Pallas backend due to an incorrect reshape in argmax/max with keepdim=True.
- Kernel Launch and Compatibility Issues: Pre-compiled CUDA kernels for the NVIDIA GeForce RTX 5060 (sm_120) fail to launch on Windows despite correct cubin files, while runtime-compiled kernels work. Also, index_reduce is not implemented on the MPS device, causing a NotImplementedError, with a CPU fallback suggested.
- Distributed and Sharded Model Compilation Errors: Compiling a sharded embedding layer with dynamic sequence length using torch.compile and DTensor's RowwiseParallel backend causes an IndexError in distributed settings.
- Mixed Precision and Autocast Incompatibility: torch._grouped_mm is incompatible with automatic mixed precision autocasting, causing runtime dtype mismatch errors; suggested fixes include registering it with autocast or documenting the incompatibility with workarounds.
- Export and Optimization Failures: torch.export.export(..., strict=True) fails with PYTHONOPTIMIZE>0, likely due to stripped assert statements. Also, a bug causes torch.export with torch.while_loop to fail with a GuardOnDataDependentSymNode error if input tensors have data-dependent shape dimensions.
- Inductor Backend Incorrect Results: Using torch.compile with Inductor produces incorrect results for broadcast addition on complex tensors (imaginary part set incorrectly), for torch.as_strided on channels_last tensors, and for in-place addition add_ on transposed tensors followed by reshape, causing output mismatches with eager mode.
- Name Mangling and Linking Issues: Inconsistent name mangling between C++ and CUDA compilers causes undefined symbols due to differences in mangled function names for templated functions across libtorch_cpu.so and libtorch_cuda.so.
- Build Dependency and Test Failures: The Inductor-pallas TPU test job fails due to a missing dependency declaration for the complex.h header in the torch_tpu codebase, with AWS S3 permission errors as secondary cosmetic failures. Inductor CPU and CUDA13 tests show instability referencing related issues.
- Attention Kernel Memory Corruption: Scaled dot product attention (SDPA) on MPS backend has out-of-bounds memory access causing inconsistent and incorrect outputs due to suspected memory corruption in the two-pass attention kernel.
- Parameter Validation and API Consistency: torch.nn.functional.scaled_mm improperly handles the swizzle_{a,b} parameters by allowing lists without length validation, causing potential errors and inconsistent swizzle application across GEMM implementations.
- CUDA Graph and RNG Handling Improvements: Multiple issues with torch.cond() include compatibility with graph capture stream reuse, RNG handling within CUDA graphs, adding inductor support, optimizing conditional nodes for CUDA >= 12.8, and enabling combined forward/backward/NaN checks/optimizer execution in a single CUDA graph.
- DTensor Enhancements: Proposal to improve gen_single_dim_einsum_strategies in DTensor by adding linearity rules for batch dimensions to enhance implementation and testing.
- CUDA Memory Allocator Bug: The CUDACachingAllocator user memory pool incorrectly inherits CUDA graph capture behavior, deferring block freeing improperly and disabling out-of-memory recovery, leading to unreclaimable memory blocks and failed allocations despite cached memory availability.
- Test Disabling on XPU: The test_comprehensive_linalg_lu_factor_ex_xpu_float32 test is disabled on the xpu platform due to consistent failures on the main branch.
- Cross-Repository CI Coordination: Proposal for a GitHub App and Relay Server to enable PyTorch main repo to trigger and relay CI results from downstream out-of-tree backend repos, ensuring transparent upstream-downstream linkage and compatibility.
- DistributedDataParallel Buffer Versioning Bug: Inplace operations on BatchNorm buffers during multiple forward passes in DDP-wrapped modules cause RuntimeError due to version counter increments on BatchNorm running stats, leading to autograd version mismatches during backward.
- Mega-Cache Guard State Mismatch: Mega-Cache generates mismatched guard states when loading compilation output because guard state serialization occurs inside amp context but reload constructs GLOBAL_STATE guard before amp context, causing dispatch key set mismatch runtime errors.
- Code Cleanup: Removal of a condition that never evaluates to true is proposed to reduce technical debt.
- Docker Image Consolidation: Plan to consolidate or retire the pytorch/almalinux-builder Docker images used for MAGMA compilation and CI runners following MAGMA's deprecation in favor of cuSOLVER/cuBLAS, requiring updates to CI workflows and release processes.
- CUDA Runtime and Library Version Mismatch: The CUDA error CUBLAS_STATUS_INVALID_VALUE occurs in cublasGemmEx when using vLLM with PyTorch 2.10 and CUDA 12.9 due to a mismatch between the CUDA runtime and the nvidia-cublas-cu12 library pinned by PyTorch, preventing dependency resolution without manual incompatible installs.
- Test Failures Due to Strict Tolerances: Three CPU tests fail because tolerance thresholds for low-precision operations (bfloat16/int4 quantized) against higher-precision references are set too strictly, causing expected numerical differences to trigger failures.
- Activation Offloading Feature Request: Proposal to add activation offloading as a first-class feature in FSDP's CPUOffloadPolicy to asynchronously offload activations to CPU during forward and prefetch them before backward, leveraging existing prefetch infrastructure for unified PCIe scheduling, memory reduction, and a consistent API for large-scale training.
- Varlen Attention Minor Issues: Two minor problems with varlen_attn: the function is not imported by default causing import errors, and the documentation format for its arguments is incorrect.
- CUDA Context Error Noise Reduction: Excessive and spurious "No CUDA context is current to the calling thread" error messages occur when using CUDA_LOG_FILE=stdout with CUDA 12.9, and efforts aim to reduce noise in debug output on the PyTorch side for a better user experience.
- FFT and Signal Processing Errors: torch.fft.rfft raises MKL FFT errors when called with empty tensors due to inconsistent configuration parameters, suggesting clearer API-level errors. Also, torch.istft raises unhelpful internal runtime errors with certain valid parameters, indicating a need for improved input validation or special-case handling.
- Broadcasting Bug in isclose: torch.isclose raises runtime errors due to a shape mismatch when broadcasting a scalar against a tensor with equal_nan=True, unlike other APIs that handle broadcasting correctly, indicating an internal bug.
- Autotune Testing and Determinism: Feature request to systematically run and test all autotune configurations generated by torch.compile to identify and debug nondeterministic bugs exposed by autotuning. Also, a proposal to add a deterministic alternative to Inductor's compile-time autotuning to reduce the performance cost of using torch.use_deterministic_algorithms().
- Dynamo and Introspection Issues: Dynamo tracing fails on flex attention with callable classes for block masks due to an inability to handle inspect module introspection on callable classes, especially in Python 3.12.11.
- Property Shadowing Regression: Regression in PyTorch 2.10 where using a property that shadows a variable name in a compiled class causes torch.compile to crash, a behavior not present in 2.9.
- Cache API Hooks Proposal: Proposal to add API hooks to TorchInductor caching systems to allow downstream projects to define custom recompilation triggers by influencing cache hit/miss decisions across DynamoCache, AOTCache, and FXGraphCache.
- Debug Build Checks and Custom Accelerators: Mandatory debug checks on view-like operations enforcing shared storage cause errors in custom accelerator backends like NPU, with proposals to add build options or change errors to warnings to ease debugging.
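Several items above, notably the torch.histc float16 inconsistencies, trace back to accumulation in narrow floating-point formats. A stdlib-only sketch (the helper below is illustrative, not PyTorch code) shows a float16-style running sum silently losing counts:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Emulate a histogram bin counter kept in float16: add 1.0 three thousand
# times, rounding the running sum to fp16 after every addition.
count = 0.0
for _ in range(3000):
    count = to_fp16(count + 1.0)

print(count)  # 2048.0 — fp16's spacing at 2048 is 2, so each +1 rounds away
```

Keeping the accumulator in float32 and casting once at the end, as the issue suggests, avoids the stall.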
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 42
Summarized Issues:
- Numerical and Precision Bugs in CUDA and Mixed Precision: Several issues report numerical inconsistencies and precision-related bugs in CUDA implementations, including NaN outputs in nn.LSTM when processing batches versus single samples, severe numerical divergence in LSTM outputs between CUDA and CPU, and mixed precision training causing NaNs in validation loss after upgrading PyTorch. These problems highlight challenges in maintaining numerical stability and correctness across different backends and precision modes.
- issues/173334, issues/173927, issues/174441
- Data Type Support and Incorrect Outputs on CUDA and Inductor Backends: Multiple issues describe incorrect outputs or failures when using certain data types on CUDA or the inductor backend, such as torch.special APIs producing infinite values with uint16 on CUDA, aminmax producing wrong minimum values on uint8 tensors with inductor, and runtime errors in torch.corrcoef and torch.cov with complex types on CUDA. These indicate incomplete or faulty datatype support in various PyTorch components.
- issues/173636, issues/174378, issues/174382
- Test Failures and Disabled Tests on XPU Platform: Several tests have been disabled due to consistent failures on the xpu platform, including test_weight_norm_conv2d_xpu, test_comprehensive_linalg_multi_dot_xpu_float32, and test_comprehensive_lu_xpu_float32. These disabled tests point to ongoing stability or compatibility issues with the xpu backend in the main branch.
- issues/173994, issues/174175, issues/174770
- Triton and Autotuning Runtime Errors on CUDA: Issues report runtime errors during autotuning and kernel compilation in the Triton backend, including failures in test_unspec_inputs_uint8_cuda and test_unspec_inputs_uint8_cuda_dynamic_shapes_gpu_wrapper due to integer requirements not being met, as well as an nvcc compiler error during fatbin generation. These errors disrupt GPU kernel compilation and autotuning workflows.
- issues/173871, issues/174304, issues/174420
- MPS Backend Functional and Numerical Issues: The MPS backend exhibits correctness problems such as torch.abs overflowing or underflowing for large complex inputs and torch.nn.functional.grid_sample producing incorrect results due to kernel and caching bugs. These issues reveal backend-specific implementation flaws affecting output correctness.
- issues/174246, issues/174339
- Compilation and Export Inconsistencies: Problems arise when compiling or exporting models, including inconsistent outputs from compiled ResNet50 models from the timm library and torch.nn.FractionalMaxPool2d producing different results in eager versus compiled modes due to randomness. Additionally, exporting models with explicit dtype casts generates device-dependent guards, limiting portability. These issues affect model reproducibility and deployment flexibility.
- issues/174467, issues/174549, issues/174666
- PyTorch Continuous Integration and Infrastructure Failures: The CI system faces multiple disruptions including a broken torchcomms CI due to missing shared libraries, an ongoing outage caused by a GitHub service incident, persistent lint failures from missing Python modules, and test failures caused by multiple cpuinfo library instances. These infrastructure problems impact development and testing workflows.
- issues/174486, issues/174600, issues/174624, issues/174637, issues/174587
- Runtime and Permission Errors in PyTorch Binaries and DataLoader: A macOS wheel for PyTorch lacked executable permissions on the torch_shm_manager binary, causing multiprocessing failures with DataLoaders using the spawn method. Additionally, the DataLoader's pin-memory utility passes a deprecated device argument, triggering deprecation warnings due to internal API changes. These issues affect usability and stability of data loading and multiprocessing.
- issues/174680, issues/174546
- Memory and Autograd Graph Partitioning Errors: Errors occur in the autograd partitioner when a forward output is both saved and mutated by a Triton kernel during backward, causing assertion failures, and in the Inductor module where memory format suggestions become inconsistent due to zero strides introduced during lowering of padding operations. These bugs complicate graph partitioning and memory format handling.
- issues/174124, issues/174869
- Flex Attention and KV Caching Bugs: The Flex Attention implementation fails during mixed precision execution due to dtype mismatches and also encounters assertion errors when supporting KV caching with reordered KV entries, indicating underlying implementation issues in attention mechanisms.
- issues/174018, issues/174878
- PyTorch Compile and Dynamo Module Regressions: The torch.compile system produces inconsistent results when Python magic methods like __class__ are overridden, and the Dynamo module regresses by raising AsPythonConstantNotImplementedError when treating certain user-defined objects as constants during JIT and AOT workflows. These regressions affect compilation reliability and user-defined object handling.
- issues/174050, issues/174128
- Distributed Communication Error Detection Challenges: PyTorch's distributed collective operations can silently fail or hang when tensor shapes mismatch across ranks, without raising exceptions, leading to inconsistent communication states. There is a request for diagnostic functions to detect and verify communication health after such errors to improve debugging and robustness.
- issues/174844
- Performance Bottlenecks and Algorithmic Improvements: Performance issues include CUDA synchronization bottlenecks during spectral decompositions and proposals for an online softmax algorithm that significantly speeds up computation and reduces memory usage. Additionally, vectorized CPU kernels for sigmoid-like functions are being explored to improve efficiency.
- issues/174601, issues/174767, issues/174900
- Version Compatibility and Protocol Checks in Distributed Systems: There is a proposal to add explicit protocol version compatibility checks during torch.distributed process group initialization to detect incompatible protocol revisions across mixed PyTorch versions, aiming to fail fast and improve diagnostics in heterogeneous or rolling upgrade environments.
- issues/174917
- Unified Device Abstraction and Dynamic Precision Switching: A feature request proposes a Unified Device Abstraction (UDA) layer to harmonize CUDA and XLA/TPU workflows by introducing Predictive Entropy Quantization (PEQ), which dynamically adjusts tensor bit-precision based on entropy measurements to optimize performance and developer experience across hardware backends.
- issues/174965
- Miscellaneous Bugs and Feature Requests: Other issues include core crashes and numpy compatibility errors on Orange Pi 3 devices, requests for nth smallest element retrieval in tensors, and challenges in computing multiple Jacobian-vector products without redundant forward computations. These highlight diverse user needs and platform-specific problems.
- issues/174785, issues/174551, issues/174659
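A recurring theme in the precision-related issues above is comparing low-precision results against higher-precision references without tripping overly strict tolerances. A stdlib sketch with illustrative numbers (0.333984375 is 1/3 rounded to bfloat16's roughly 8 significant bits; the tolerance choice is an assumption for illustration, not a project policy):

```python
import math

reference = 1.0 / 3.0       # double-precision "ground truth"
bf16_value = 0.333984375    # 1/3 after rounding to bfloat16 precision

# A float32-grade tolerance flags a perfectly correct bfloat16 result...
assert not math.isclose(bf16_value, reference, rel_tol=1e-6)
# ...while a tolerance matched to bfloat16's mantissa width accepts it.
assert math.isclose(bf16_value, reference, rel_tol=2**-8)
```

The relative error here is about 2**-9, comfortably inside a precision-aware bound but three orders of magnitude outside a float32-style one, which is the failure mode behind the three CPU test failures.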
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 247
Key Open Pull Requests
1. Support __dict__ in NestedUserFunctionVariable: This pull request introduces a new variable tracker called DunderDictVariable to provide consistent and reliable support for the __dict__ attribute in Dynamo, addressing the shortcomings of the three existing implementations and enabling proper mutation tracking through a side-effects table.
- URL: pull/174570
- Associated Commits: 9a1f3, cf56e, 16f3e, 56062, 9a69c, 32cbf, 8f375, 0a889, 8e4f0, b8f15, 904e5, e15ee, 1c6ea, cc711, 67b32, 1422a
2. More size-hinting cleanups: This pull request refactors the code by replacing all size_hint calls with fallback to use optimization_hint instead, removes the fallback parameter from size_hint calls in preparation for its eventual deletion, and updates calls from symbolic_hint() to replace_backed_with_hints() to improve size-hinting logic.
- URL: pull/174580
- Associated Commits: 9c8d4, 8dc39, d4b3f, 0a7e5, aa75b, d15eb, 8d4bc, b3c4d, a1277, 3fce3, 6dc29, 63d42, a8cfd, 258b9, 9fc6c
3. [DTensor] Strategy Validation (3/3): strategy querying, orchestrator, and CLI: This pull request adds a comprehensive DTensor sharding rule validator including a strategy querying orchestrator and a CLI tool that compares DTensor's claimed sharding rules against ground truth validity across multiple strategy paths, reports discrepancies such as incorrect or missing rules with false positive mitigations, and provides end-to-end tests and a user-friendly command-line interface for validating individual or all registered operators.
- URL: pull/174800
- Associated Commits: 9c568, e4070, b8c6b, 0b24b, 97bd9, cc1a2, 6af72, cb134, 12a1b, 18d26, 1516d, 90b67, 889a4, 59049, 34546
Other Open Pull Requests
- DTensor validation and strategy improvements: Multiple pull requests enhance DTensor functionality by adding a validation engine for sharding rules that simulates distributed execution on a single machine and improving view operation strategies with a view_groups algorithm and guard_or_false mechanisms to fix numerous DDE-related issues. These changes collectively improve correctness and robustness in DTensor operations, including fixes for dimension normalization and stack operation handling.
- Pallas TPU backend enhancements: Several pull requests add and fix support for the Pallas TPU backend by implementing element-wise operations, initial inductor IR lowering, tiling support, and function legalization, while also addressing CPU compatibility and GPU tiling restrictions. These updates ensure better TPU code generation, broadcasting support, and overall backend stability.
- Memory and tracing improvements in FSDP2: Pull requests address memory leakage issues in FSDP2 by supporting dataclass arguments with a hybrid approach and remove dynamo tracing code from FSDP2 hooks, opting instead to unconditionally disable dynamo tracing in these hooks. These changes improve memory efficiency and simplify tracing behavior in fully shard modules.
- Inductor backend fixes and deterministic behavior: Updates fix allocation issues related to deterministic guards in the inductor backend and add a deterministic backward implementation for the flex flash operation, improving reliability and correctness in compiled code execution.
- Function annotation and code cleanup: A pull request moves function annotations to align with Python standards, and another removes unexpected code in the _range_constraints module while fixing related unit tests to enable dynamic shapes with keyword arguments. These changes improve code clarity and dynamic shape support.
- Size hint and shape mapping optimizations: Pull requests rewrite size_hint usage to support unbacked tensors and introduce guard_or_false around shape comparisons to improve pre- and post-broadcasting shape mappings, fixing many DDE-related issues and enhancing performance and correctness.
- CUDA and cuBLASLt backend improvements: Updates set cuBLASLt as the default BLAS backend for CUDA operations when available and enhance CUDA graph reliability by verifying external input tensor validity, handling pinned host memory, and improving test coverage and code organization.
- Triton and ROCm updates: Pull requests test disabling asynchronous copy in Triton 37 for ROCm, update to the latest Triton LLVM version, and include general Triton updates as part of the PyTorch 2.11 release.
- Tensor subclass and flattening fixes: A pull request improves handling of self references during tracing, implements fallback shallow copies for tensor subclass metadata, catches attribute errors during flatten operations, and refines meta converters to fix subclass initialization and flattening issues.
- Inductor reciprocal operation fix: The reciprocal operation in inductor is fixed by using a float32 constant instead of an integer to ensure proper floating-point division, addressing eager division rounding emulation.
- Compile-time print utilities and bug fixes: New utilities compile_print and make_compile_print enable printing tensor values inside compiled and traced code by wrapping print calls and using hooks, while also fixing bugs related to dispatch key stripping and dead code elimination that previously blocked this functionality.
- InputObserver custom empty tensor support: Support is added for specifying a custom empty tensor in InputObserver to handle missing inputs like pixel_values during subsequent forward calls, ensuring consistent input observation in models processing both images and text.
- CUDA kernel restrict keyword migration: CUDA kernels are migrated to use the RestrictPtrTraits struct to properly apply the restrict keyword to kernel parameters, fixing NVCC implementation issues and improving performance as demonstrated by initial benchmarks.
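The reciprocal fix above comes down to the type of a generated constant: with an integer constant, C-style codegen performs truncating integer division instead of true division. A pure-Python analogy (illustrative only, not Inductor's actual generated code; floor division stands in for C integer truncation):

```python
def reciprocal_int_constant(x: int):
    # Integer constant: in C-like generated code, int / int truncates
    # toward zero (emulated here with Python's floor division).
    return 1 // x

def reciprocal_float_constant(x: int):
    # Float constant: forces true floating-point division, matching
    # eager-mode reciprocal semantics.
    return 1.0 / x

print(reciprocal_int_constant(3))    # truncates to 0
print(reciprocal_float_constant(3))  # ~0.3333
```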
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 285
Key Closed Pull Requests
1. Add mem_get_info usage notes to CUDA memory management docs: This pull request adds clarification notes to the CUDA memory management documentation in PyTorch to help users correctly interpret allocator statistics in relation to device-wide GPU memory usage by explaining the usage of mem_get_info.
- URL: pull/174822
- Associated Commits: 7cd90, 863ed, f4e98, a61b4, 3f18b, 3ecf7, 054b1, 3e976, f568c, 50d4b, 2f470, c5799, ca246, 41835, ffe4a, 40fb4, 2cb07, bcc08, dd832, 06986, 8e224, 62d42, 0084d, aa828, 4e881, 6dd20, 0138d, c27ec, b2633, a5a42, a8295, fba39, 41b6a, e2457, c647e, 188a1, dbc1e, fa6fa, c46a9, adad7, dadfc, 0aaaa, a9c2e, 1375a, 1386a, 66de6, a769a, ab159, e8421, 3a5fb, f3b5d, a266b, 0e245, 2c9af, e43b5, 1f74c, 99cb4, f6e6c, 449b1, 70a7c
2. M11: Distributed Protocol Version Guardrail: This pull request proposes adding an internal distributed protocol version guardrail enforced during the initialization of process groups via a store-based version check, including spawn-safe tests and an environment variable override for simulating version mismatches, while explicitly excluding MPI due to its lack of a rendezvous store path.
- URL: pull/174577
- Associated Commits: 0912e, 6c2ba, 41daf, d1e3e, d72fd, 001b4, e1e4b, b352a, 0d264, 61d4b, 32cf1, 73677, 5fbb0, df86b, 760d4, 34761, c533b, c5f29, 7c7bc, 0ff9d, f75f5, 1cadb, 7ab7c, 2ea65, e8069, 37e69, 63f09, 50435, 738fd, d3932, 0b1e7, 1ed83, 8aadb, 17f7c, 995b2, 11779, 30d9c, e4c52, 59332, d76b3, fdc54, ea8c9, 0655d, f4e3c, 5e238, a3693, 3c933, a8d99, 1901f, 16737
3. Fix FP8 test failures on AMD RDNA4 GPUs: This pull request addresses and fixes FP8 inductor test failures on AMD RDNA4 (gfx120x) GPUs that occur during scaled matrix multiplication tests with small M dimensions (M < 16) by implementing GPU-specific compile mode selection to correct incorrect tensor indexing in autotuned Triton kernels.
- URL: pull/174871
- Associated Commits: b97cf, db39c, d28ec, 6b3a1, c440b, 25a18, 70a5f, d14e5, b296d, a0166, 45156, c2d4e, 30a23, 11ca2, 629e8, ab471, 25366, 777e7, 56002, 223b9, b4c1e, 3d742, da5ac, a3c49, 5ca07, ecdea, f742d, 979dc, 7e17f, 4d673, fe101, c53b5
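The distinction the mem_get_info documentation PR clarifies can be seen in a small sketch: torch.cuda.mem_get_info reports device-wide memory (including other processes), while the caching-allocator counters are per-process, so the two will generally not agree. A minimal helper, assuming a CUDA device may or may not be present:

```python
import torch

def device_memory_overview(device: int = 0):
    """Return (free, total, allocated, reserved) in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    # Device-wide view (covers all processes on the GPU); note the first
    # call may be slow because it can trigger CUDA context initialization
    # and synchronization.
    free, total = torch.cuda.mem_get_info(device)
    # Process-local caching-allocator view; these numbers will generally
    # not match the device-wide figures above.
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    return free, total, allocated, reserved
```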
Other Closed Pull Requests
- InputObserver refactor and test expansion: This pull request refactors the InputObserver class to better handle empty caches during argument inference at the prefill step and significantly expands test coverage to include optional and mixed argument scenarios with dynamic shapes. It also integrates pandas for enhanced discrepancy analysis between model outputs and ONNX exports and updates related tests accordingly.
- DTensor communication optimization and sharding support: Multiple pull requests improve DTensor by optimizing redistribution communications using flattened device meshes to reduce costly sequential operations and adding support for comms beyond all_reduce, banning mixed partial placements, and addressing edge cases. Additional work includes enabling debug mode to print optimized transform info, adding foundational utilities for sharding rule validation, and supporting Partial input placements for matmul with missing sharding propagation rules.
- FSDP2 feature enhancements and testing: These pull requests update unit tests to enable more coverage for Fully Sharded Data Parallel version 2 (FSDP2) on CPU and introduce support for assigning a per-parameter mesh configuration in FSDP2.
- Test re-enablement and assertion removals: One pull request re-enables Inductor X86 backend tests by updating them to no longer use the PT2E API, while several others focus on removing additional test assertions across multiple test files as part of a series of stacked changes.
- Platform-specific debugging and test adjustments: This pull request addresses graph break issues on the ROCm MI350 platform by making code adjustments, updating test expectations, and selectively skipping tests with known accuracy or compatibility problems on gfx950 hardware.
- Device-specific Event class improvements: This pull request enhances PyTorch by enabling device-specific Event classes like torch.xpu.Event and torch.cuda.Event to accept both generic and device-specific Stream inputs, improving API flexibility and fixing issue #173792.
- Pallas TPU CI configuration: This pull request configures continuous integration testing for the Pallas TPU by adding and refining scripts to enable successful pulling and installation of the private torch_tpu repository before its full open-source release.
- Backend migration and performance improvements: One pull request migrates the grid_sampler_2d function to use Metal for backend execution to improve performance, while another proposes skipping the LazyVT optimization for virtual tensors that are realized anyway to reduce overhead and slightly improve compile time.
- PyTorch Dynamo comprehension graph break enhancements: This pull request extends PyTorch Dynamo's handling of comprehension graph breaks to support nested and inlined function contexts by adding new speculation checks, bytecode handling for closure variables, decrement logic for nested comprehensions, and fixes for zero-dimensional FakeTensors, accompanied by extensive new tests.
- Function attribute fix: This pull request fixes an issue related to assigning the __annotations__ attribute of a function within the SET_FUNCTION_ATTRIBUTE operation.
- TypingVariable equality method proposal: This pull request proposes adding an equality method (__eq__) to the TypingVariable class to enhance comparison capabilities, but it was ultimately not merged.
- XPUGraph frontend Python APIs: This pull request introduces frontend Python APIs for XPUGraph, including functions like pool() and debug_dump(), to enhance graph capture and replay capabilities on XPU devices.
- JIT CUDA kernel unsigned integer support: This pull request adds support for uint16, uint32, and uint64 scalar types in JIT-compiled CUDA kernels to fix crashes and incorrect results in torch.special functions like zeta when using unsigned integer inputs on CUDA. It extends relevant macros, improves error handling, and adds corresponding tests.
- CI improvements for TIMM pretrained model caching: This pull request improves the CI process by enabling caching of TIMM pretrained models on a shared Hugging Face cache to prevent benchmark failures caused by offline mode blocking model downloads. It adds a download-only flag, uses a version-pinned cache directory, and implements a stamp file to ensure proper cache preparation.
- Symbolic reasoning optimization: This pull request proposes a simple optimization in symbolic reasoning that returns true early when evaluating sums of symbols/constants all guaranteed to be positive, significantly reducing export time from over 5 minutes to just over 3 minutes.
- Documentation theme update: This pull request updates the pytorch_sphinx_theme2 to version 0.4.3, removing the PyTorch navbar in favor of a main site link, adding dropdowns to the top navigation bar, removing the runllm.js script, adding sphinx-tippy tooltip support, autogenerating an llms.txt file, and incorporating detailed LLM meta tags for enhanced documentation navigation and metadata.
- Profiler and symbolic integer improvements: This pull request enhances profiler visibility by replacing internal profiling calls with torch.autograd.profiler.record_function and updates the torch.combinations implementation to use symbolic integer operations to support dynamic shapes during tracing.
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| pianpwk | 194 | 21 | 1 | 13 |
| laithsakka | 182 | 26 | 0 | 9 |
| wconstab | 145 | 15 | 0 | 55 |
| albanD | 174 | 20 | 2 | 4 |
| anijain2305 | 158 | 16 | 0 | 5 |
| ydwu4 | 151 | 11 | 0 | 8 |
| malfet | 102 | 18 | 2 | 21 |
| weifengpy | 101 | 14 | 1 | 13 |
| BenjaminDEMAILLE | 128 | 0 | 0 | 0 |
| guilhermeleobas | 94 | 15 | 0 | 3 |
Access Last Week's Newsletter: