Weekly GitHub Report for PyTorch: October 13, 2025 - October 20, 2025
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on x86 CPUs, expanded Intel GPU support, a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.
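The `weights_only` flip is the change most likely to affect existing loading code, and `torch.compiler.set_stance` is new in this release. A minimal sketch of both, assuming a locally saved checkpoint file, might look like this:

```python
import torch

# Saving is unchanged.
torch.save({"weights": torch.randn(4, 4)}, "checkpoint.pt")

# In PyTorch 2.6+, torch.load defaults to weights_only=True, restricting
# unpickling to tensors and other allow-listed types for safety.
state = torch.load("checkpoint.pt")

# Loading arbitrary pickled objects now needs an explicit opt-out; only do
# this for checkpoints from trusted sources.
state = torch.load("checkpoint.pt", weights_only=False)

# The new performance-control API can, for example, force eager execution
# of torch.compile-d functions while debugging.
torch.compiler.set_stance("force_eager")
```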
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- dynamic_axes in torch.onnx.export seems broken in 2.9.0: This issue reports that the `dynamic_axes` argument in `torch.onnx.export` is not functioning correctly in PyTorch version 2.9.0, where dynamic axis names are replaced by serial numbers instead of the user-defined names. The user provides example code and output comparisons between versions 2.8.0 and 2.9.0 to demonstrate the regression, highlighting that the older version preserves custom axis names while the newer one does not.
- The discussion suggests using the newer `dynamic_shapes` argument instead of `dynamic_axes` and mentions that the change is related to the internal handling of axis renaming and exporter behavior controlled by a `dynamo` flag. It is recommended to add `dynamo=False` to maintain the old behavior temporarily (sketched after this list), with plans to eventually deprecate the old conversion method, and the conversation ends with appreciation for the clear and timely explanations.
- Number of comments this week: 6
- PyTorch CPU / DataLoader: "Too many open files" (EMFILE): This issue describes a problem where PyTorch's DataLoader on CPU encounters a "Too many open files" (EMFILE) error because the combination of a large `prefetch_factor`, `batch_size`, and per-sample file usage exceeds the system file descriptor limit, even with a single worker. The user provides a reproducible example demonstrating how file descriptors accumulate and crash the worker process, and several commenters discuss potential workarounds, diagnostics, and related experiences with file descriptor exhaustion in multiprocessing and shared memory contexts.
- Commenters suggest increasing the system file descriptor limit as a practical workaround (a small sketch of these mitigations follows this list), but some report that even very high limits do not prevent the error, indicating possible file descriptor leaks in PyTorch or related libraries. Diagnostic tips include using `lsof` to track open files and monitoring file handle usage, while others recommend disabling multiprocessing (`num_workers=0`) to avoid leaks, upgrading packages, and providing minimal reproducible code to help identify the root cause.
- Number of comments this week: 5
- [Meta] Type Hints in Pytorch: This issue initiates a discussion about improving type hints in PyTorch, emphasizing that top-level type aliases used in public interfaces should themselves be public to avoid forcing users to duplicate or import private types. It also highlights concerns about backward compatibility when refining type hints, particularly when changing from general types like `str` to more restrictive `Literal` types, which can unintentionally introduce type errors for users relying on dynamic string values.
- The comments express strong agreement with the need for public type aliases and backward compatibility, debating whether to centralize type definitions in a dedicated module or keep some local to their context. Participants discuss the trade-offs between using `Literal` types versus `StrEnum` for string-based parameters, noting current limitations in type checkers and the complexity of supporting both static and dynamic string values. Suggestions include creating a linter to enforce public exposure of types used in public APIs and carefully managing deprecations to avoid breaking user code, while acknowledging that some type-breaking changes may be acceptable in minor releases given their limited impact compared to runtime breaks.
- Number of comments this week: 5
- PyTorch on aarch64 crashes with ConvTranspose1d: This issue reports a crash occurring when running a large ConvTranspose1d operation in PyTorch on an aarch64 architecture, caused by an invalid memory write in the ARM Compute Library (ACL). The root cause is identified as an integer overflow in the offset calculation within ACL due to the large tensor dimensions, and a proposed fix involves changing the offset calculation to use 64-bit integers to prevent overflow.
- The discussion confirms the crash is reproducible with specific versions of PyTorch, oneDNN, and ACL on Neoverse-V2 hardware. The problem is traced to a write to an invalid address caused by a 32-bit integer overflow in ACL’s offset calculation for large tensor sizes, and it is suggested that switching to 64-bit offsets will resolve the issue.
- Number of comments this week: 5
- Memory leak when converting from numpy array: This issue reports a memory leak occurring when converting numpy arrays to PyTorch tensors on the CPU, specifically observed with PyTorch version 2.8.0 and Python 3.10/3.13. The user provides minimal reproducible examples showing that certain tensor operations and conversions cause unexpected memory growth, likely due to internal caching or allocator behavior rather than a traditional leak, and notes that the problem appears starting from PyTorch 2.6.
- The comments further simplify the example to isolate the leak and suggest the issue may stem from PyTorch’s CPU memory allocator caching small tensors, causing memory fragmentation or overuse. It is clarified that the memory is not truly leaked since dropping references frees it, but the allocator’s caching strategy leads to unexpectedly large memory consumption for small tensors stored in lists, and manual calls to system-level memory trimming can partially mitigate the problem.
- Number of comments this week: 4
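Following up on the `dynamic_axes` regression above, a minimal sketch of the two suggested paths, assuming a toy single-input model and illustrative file names, could look like this (the exact `dynamic_shapes` format follows `torch.export` conventions and may vary by version):

```python
import torch

model = torch.nn.Linear(8, 4)
example = torch.randn(2, 8)

# Option 1: keep the legacy exporter (and its dynamic_axes naming behavior)
# by passing dynamo=False until the old conversion path is deprecated.
torch.onnx.export(
    model,
    (example,),
    "model_legacy.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
    dynamo=False,
)

# Option 2: migrate to the dynamo-based exporter and its dynamic_shapes
# argument, expressed here per positional input.
batch = torch.export.Dim("batch")
torch.onnx.export(
    model,
    (example,),
    "model_dynamo.onnx",
    dynamic_shapes=({0: batch},),
    dynamo=True,
)
```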
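For the DataLoader EMFILE report, the commonly suggested mitigations are raising the process file-descriptor limit and temporarily dropping to `num_workers=0`. A small sketch, assuming a Unix-like system and a hypothetical dataset that opens one file per sample:

```python
import resource  # Unix-only
import torch
from torch.utils.data import DataLoader, Dataset

class FilePerSampleDataset(Dataset):  # hypothetical dataset for illustration
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Opening files inside __getitem__ keeps each handle short-lived.
        return torch.load(self.paths[idx], weights_only=True)

# Raise the soft RLIMIT_NOFILE to the hard limit for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# num_workers=0 keeps prefetching (and descriptor usage) minimal while
# diagnosing a suspected leak; prefetch_factor only applies with workers.
loader = DataLoader(FilePerSampleDataset([]), batch_size=32, num_workers=0)
```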
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Ubuntu 22.04, and shares a code snippet where the error occurs while compiling parts of a pipeline with torch.compile using the "reduce-overhead" mode.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a MaxPool2D operation with a larger kernel size as multiple MaxPool2D operations with a smaller kernel size, which reduces the computational cost per cell. The suggested modification targets the MaxPool2D layer directly to avoid additional backpropagation overhead and is expected to yield performance improvements specifically on CPU, as demonstrated by testing that showed a speedup of approximately 1.3 times (a quick numerical check of this decomposition is sketched after this list).
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted `/tmp` directory set to permission mode `1777`. Although the model compiles successfully, execution fails with an error indicating that the shared object `cuda_utils.so` cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, and managing known formatting challenges, while also providing a detailed worklist organized by directory to coordinate and track progress on this large-scale formatting effort.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function that allows users to exclude debug files, such as `.debug_pkl`, from the JIT archive to reduce file size. The motivation stems from observations that these debug files, which exist primarily for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on mobile devices where storage is limited.
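The stride-1 MaxPool2D proposal above rests on a decomposition that is easy to verify numerically: with stride 1 and no padding, one pool with kernel k equals two chained pools whose kernel sizes sum to k + 1. A minimal check with illustrative sizes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# A single 5x5 max pool with stride 1 ...
direct = F.max_pool2d(x, kernel_size=5, stride=1)

# ... equals two chained 3x3 max pools with stride 1 (3 + 3 - 1 = 5).
composed = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)

print(torch.equal(direct, composed))  # True
```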
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 95
Summarized Issues:
- Type and Annotation Issues in PyTorch APIs: Several issues highlight problems with type annotations and parameter typings, such as the incorrect typing of `autowrap_modules` in symbolic tracing, mismatched type annotations in distributed module functions, and the need for better type hinting practices to avoid breaking user code. These problems can cause confusion, runtime errors, or hinder interoperability across PyTorch components.
- Memory Leaks and Memory Safety Bugs: Multiple reports describe memory leaks and memory safety issues, including leaks when converting numpy arrays to tensors, leaks in torch.compile with flash_attn_varlen_func, and a severe heap corruption bug during backward passes of loss functions. These issues lead to increased memory usage, crashes, or undefined behavior during training or inference.
- Torch.compile and Dynamo Compilation Failures: Several issues involve failures or limitations in torch.compile and Dynamo, such as unsupported dynamic shape operators, errors with tensor metadata comparisons, in-place mutations on overlapped tensors, and problems with custom operator constraints or invalid attribute assignments. These problems prevent successful compilation or cause runtime errors in compiled models.
- Inductor Graph Partitioning and Custom Pass Mechanism Issues: There are multiple reports about the Inductor graph partitioning mechanism needing redesign, including problems with relying on FakeTensors, missing custom operations in FX graph cache, and embedding non-serializable stateful objects in configs. These issues complicate partitioning, reduce robustness, and hinder migration efforts.
- [issues/165341, issues/165501, issues/165595]
- ONNX Export and FX Graph Metadata Problems: Issues include failures during ONNX export due to index errors in decomposition, loss of stack trace metadata when symbolic tracing FX graphs, and strict export mode failing to handle global constants properly. These problems affect model export workflows and downstream tooling like quantization.
- Distributed and Parallelism Bugs: Problems reported include hangs in DistributedDataParallel training with tied weights on NVIDIA Blackwell GPUs, incorrect outputs from dist.reduce_scatter_tensor with world size one, and ineffective timeout updates in the gloo backend. These issues impact distributed training stability and correctness.
- Operator and Backend Implementation Bugs: Several issues report incorrect or inconsistent operator behavior, such as Conv1d on Intel GPUs producing wrong results, aten.masked_select and torch.unique being rejected in fullgraph compilation, and abs operations returning empty tensors on custom device backends. These bugs cause incorrect computations or compilation failures.
- Numerical and Precision Errors: Reports include NaN gradients in the backward pass of atan2 with zero inputs (a minimal reproduction is sketched after this list), dtype mismatch errors in mixed precision training with torch.func.jvp, and inconsistent outputs in flex_attention with AMP or block masks. These issues degrade numerical stability and precision correctness.
- Crash and Segmentation Faults in Specific Operations: Crashes occur in matrix multiplication with CUDA 13.0 on Ubuntu, fmod with int64 causing segfaults, and ConvTranspose1d on aarch64 due to integer overflow. These critical failures cause abrupt termination of programs.
- Documentation and Usability Concerns: Issues include broken image links in release notes, inaccurate backend integration documentation, unclear ONNX output_names behavior, and confusing error messages in FakeTensorMode. These problems degrade the user experience and increase confusion.
- Build, CI, and Environment Issues: Problems such as Python version mismatch in Docker images, ROCm GPU maintenance causing queue delays, and missing libcudnn.so.9 on AlmaLinux affect development workflows and environment stability.
- Performance Regressions and Profiling Gaps: Reports include drastic slowdowns on Intel XPU after kernel changes, missing CPU spans in profiling with expandable_segments enabled, and failures in autotuning and caching in CuTeDSL Inductor path. These affect runtime efficiency and profiling accuracy.
- Test Failures and Disabled Tests on Specific Platforms: Several tests are disabled or failing on xpu or ROCm platforms, including TestTritonDotReduction and TestDistributions, indicating platform-specific instability or incompatibility.
- Requests for New Features and Improvements: Suggestions include adding FFT support to FLOP counters, element-wise operation flop formulas, NumPy dtype acceptance in APIs, sparse tensor support for view_as_complex, and Python-only backend registration APIs. These aim to enhance PyTorch's functionality and interoperability.
- DataLoader and Multiprocessing Stability Issues: A file descriptor exhaustion problem in DataLoader on CPU causes worker crashes due to large prefetch and batch sizes combined with multiple file openings, highlighting resource management challenges.
- Symbolic and FX Graph Export Limitations: Issues with symbolic integer support in functions like tril and problems with make_fx producing incorrect graphs when BatchedFallback is triggered indicate limitations in symbolic tracing and FX graph generation.
- Error Handling and Messaging Improvements Needed: Several issues call for clearer error messages, such as for deepcopying CUDA tensors in FakeTensorMode, invalid setattr in Dynamo, and scalar usage with torch.where and out= parameter, to improve developer experience.
- Miscellaneous Bugs in Core Operations: Bugs include floating point exceptions in dynamic quantization with zero output dimension, inconsistent remainder behavior between CPU and CUDA, and bugs in fusion scoring logic due to copy-paste errors. These affect correctness and stability in core functionalities.
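As a concrete illustration of the numerical item above, the NaN gradient from atan2 at the origin can be reproduced in a few lines (a minimal sketch; the original report may use a different setup):

```python
import torch

x = torch.zeros(1, requires_grad=True)
y = torch.zeros(1, requires_grad=True)

# atan2(0, 0) itself returns 0, but the analytic gradient divides by
# x**2 + y**2, which is 0 here, so the backward pass produces NaN.
out = torch.atan2(y, x)
out.backward()

print(out)     # tensor([0.], grad_fn=...)
print(x.grad)  # tensor([nan])
print(y.grad)  # tensor([nan])
```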
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 32
Summarized Issues:
- Installation and Compatibility Issues: Several users report problems installing PyTorch or specific versions due to compatibility issues with Python versions or operating systems. These include errors finding matching distributions for torch==2.7.0 and lack of prebuilt wheels for Python 3.14.0, with plans to support newer Python versions in future releases.
- [issues/165276, issues/165396]
- XPU Platform Test Failures and Disabling: Multiple tests in the DynamicShapesGPUTests and TestInductorOpInfoXPU suites are consistently failing on the XPU platform, leading to their disabling on the main branch. These include tests for dynamic shapes, unspec inputs, var correction, and comprehensive atanh operations.
- [issues/165403, issues/165411, issues/165412, issues/165413, issues/165414, issues/165415, issues/165416, issues/165742]
- Memory Format and Data Type Bugs on CUDA: Using torch.nn.MaxPool2d with channels_last memory format and bfloat16 dtype on CUDA causes NaN outputs or illegal memory access errors. This issue can be avoided by converting tensors to a contiguous memory format before applying MaxPool2d (a short workaround sketch follows this list).
- [issues/165297]
- Distributed and Backend Integration Challenges: There are difficulties mixing NCCL and MPI backends within torch.distributed, specifically using NCCL for global process groups and MPI for subgroups, which leads to segmentation faults and hangs during subgroup creation.
- [issues/165428]
- Custom Operator and Inductor Graph Partitioning Issues: Enhancements and bug fixes are needed in custom operator registration to prevent outputs being views of inputs, and proposals exist to improve inductor graph partitioning by specifying custom operators via string names to avoid serialization problems.
- [issues/165360, issues/165486]
- Flash Attention and Functorch Compatibility Problems: The torch.func.jvp function does not support models using flash attention due to missing autograd methods, causing runtime errors and NotImplementedErrors when computing gradients or forward automatic differentiation.
- [issues/165517, issues/165530]
- Tensor View and Reshape Errors: Using the .view() method on non-contiguous tensor slices results in size and stride incompatibility errors, with recommendations to use .reshape() instead to avoid runtime failures (a short sketch follows this list).
- [issues/165525]
- Documentation and Contribution Process Clarifications: Users inquire about submitting documentation pull requests, with guidance that minor fixes can be submitted directly while larger changes should be discussed first. Additionally, documentation inaccuracies exist regarding tensor.index_put_ behavior with duplicate indices.
- [issues/165503, issues/165536]
- PyTorch Compiler and Serialization Bugs: AOT precompile serialization fails with AttributeErrors when running compiled functions multiple times due to caching conflicts, and the Inductor compiler encounters AttributeErrors related to loop reordering during model compilation with specific settings.
- [issues/165447, issues/165579]
- Resource Management and Reference Counting Bugs: The c10::intrusive_ptr::reset_() method mishandles weak reference counts, potentially causing resource leaks, and PyTorch Dynamo's RelationalGuard classes improperly store raw PyObject* pointers without incrementing reference counts, risking dangling pointers.
- [issues/165262, issues/165722]
- Test Suite Failures and Disabling on XPU: Several tests related to dynamic shapes, unspec inputs, and specific operations are disabled due to consistent failures on the XPU platform, affecting test reliability and coverage.
- [issues/165403, issues/165411, issues/165412, issues/165413, issues/165414, issues/165415, issues/165416, issues/165742]
- Compilation and Build Errors: Building extensions like mmcv on Windows with PyTorch 2.9.0 fails due to C++ ambiguity errors in Torch headers, and FP8MatmulCUDA tests fail due to incorrect parameter swizzling in the _scaled_mm_v2 operation.
- [issues/165721, issues/165743]
- Module Import Errors: Attempting to import torch.ops.symm_mem as a standalone module results in ModuleNotFoundError because it is not an importable module; its operations should be accessed as attributes instead.
- [issues/165761]
- Feature Requests and Documentation Improvements: Requests include adding examples to torch.nn.ConvTranspose1d documentation and mechanisms to track dead ReLU activations to aid debugging and model diagnosis.
- [issues/165552, issues/165615]
- Script and Accuracy Checking Bugs: A script incorrectly returns a fail_accuracy status when exceptions occur during accuracy checking instead of properly handling the exceptions.
- [issues/165753]
- API Changes and Implementation Updates: The MPSGraph implementation of torch.cat is proposed to be removed in favor of a Metal kernel, reflecting ongoing backend improvements.
- [issues/165350]
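For the memory-format item above, the reported workaround amounts to forcing a contiguous layout before pooling. A minimal sketch, assuming a CUDA device is available:

```python
import torch

pool = torch.nn.MaxPool2d(kernel_size=2)

x = torch.randn(8, 16, 32, 32, device="cuda", dtype=torch.bfloat16)
x = x.to(memory_format=torch.channels_last)

# Workaround from the issue: switch back to a contiguous (NCHW) layout
# before MaxPool2d to avoid NaN outputs / illegal memory accesses.
y = pool(x.contiguous())
```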
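Similarly, the view/reshape item comes down to `.view()` requiring stride-compatible memory while `.reshape()` silently copies when it must. A small sketch:

```python
import torch

t = torch.arange(12).reshape(3, 4)
s = t[:, ::2]        # non-contiguous slice: shape (3, 2), strides (4, 2)

# s.view(6)          # RuntimeError: view size is not compatible with input tensor's size and stride
flat = s.reshape(6)  # works: reshape falls back to a copy when no zero-copy view exists
print(flat)          # tensor([ 0,  2,  4,  6,  8, 10])
```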
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 234
Key Open Pull Requests
1. [WIP][inductor] generate fused rms/layer norm bwd: This pull request is focused on developing and iteratively updating a work-in-progress feature within the PyTorch inductor backend to generate a fused backward pass for RMS normalization and layer normalization operations.
- URL: pull/165370
- Merged: No
- Associated Commits: 6d566, a7816, 8d97d, 7a687, 23c01, 9c81a, d2d9d, 6fbfd, 3d419, f5c65, cd0f4, 8745c, 70e3f, a4cbc, 12615, df0b8, 21d22, 4f40d, ac51f, 2adf0, 4b590, 0c110
2. Fix nn.Dropout accuracy discrepancies between triton and torch implementations: This pull request proposes a fix for accuracy discrepancies in PyTorch's nn.Dropout between Triton-compiled and eager execution modes by introducing a compiler switch that aligns random number generation to preserve exact dropout mask results without sacrificing performance through fusion with adjacent kernels.
- URL: pull/165545
- Merged: No
- Associated Commits: bcb9f, 5a711, 0de41, 0e443, 6dd7c, a897a, 2a4ec, f274c, d8ebb, 7549a, 14aea, 0c68a, a6732, 3a485, 09c78, 32c25, e498d, a8611
3. Bugfix to forward autodiff causing different datatype 2: This pull request addresses a bug in the forward automatic differentiation process related to incorrect data type promotion when handling Python scalars and zero-dimensional tensors by introducing a new property was_wrapped_number to accurately track wrapped numbers, modifying autograd code to set this property during arithmetic operations, and updating the dtype promotion logic accordingly, along with adding new tests to validate the fix.
- URL: pull/165784
- Merged: No
- Associated Commits: 0dc5c, 746eb, 04a23, 729f1, ee590, 4d35e, d24b0, 787e8, d3981, 35015, 141ce, 63656, 112c7, e7d59, 3473a, 51da7, eb40d
Other Open Pull Requests
- DLPack tensor exchange API unification and updates: This pull request introduces a unified `DLPackExchangeAPI` struct replacing separate function pointers for DLPack tensor exchanges, aligning with the latest DLPack standard and exposing key function pointers including a new `current_work_stream` for improved device stream handling. It updates all conversions to a `_no_sync` convention requiring explicit stream synchronization, adds a non-owning DLTensor conversion to reduce reference-counting overhead, updates `dlpack.h`, and includes unit tests to ensure stability without releasing the GIL.
- XPU per-process memory fraction APIs: Two pull requests introduce new APIs for the XPU backend: one to get the allowed memory fraction per single process and another to set this memory fraction, aligning XPU memory management with other PyTorch backends. These additions enable users to retrieve and customize memory usage limits on XPU devices.
- DebugMode enhancements for detailed logging and hooks: Multiple pull requests enhance DebugMode by adding a `run_graph()` method for detailed logging of `fx.Node` calls during graph execution and introducing hooks for the `__torch_dispatch__` mechanism and GraphModule nodes. These changes enable flexible recording, annotation, and comprehensive trace outputs for debugging tensor operations and graph executions.
- Tensor descriptor transposition for load/store operations: This pull request adds an option to transpose tensor descriptors by reordering block parameters to ensure descending stride order, improving compatibility and matching of tensor descriptors beyond the previously supported 2D case. This enhancement facilitates better tensor descriptor handling during load and store operations.
- Dynamo assertion handling improvements: This pull request cleans up and improves assertion handling in the Dynamo component by refining type checks, comments, and error handling, addressing related issues #162852 and #164878. These changes enhance code robustness and maintainability in Dynamo.
- Triton kernel autotuning parameter addition: A new `max_autotune_configs` parameter is introduced to enable advanced Triton kernel autotuning for custom operations, allowing these ops to benefit from both algorithmic and kernel-level optimizations similar to the existing `tuned_mm` approach. This two-tier optimization strategy maintains backward compatibility while enhancing performance.
- CalculateSmallVectorDefaultInlinedElements migration: This pull request migrates the `CalculateSmallVectorDefaultInlinedElements` implementation from a template struct to a constexpr function using C++17 features, reducing template instantiation complexity. This change minimizes compilation time and the size of the generated binary.
- CUDAAllocatorConfig deprecation and improvements: Two pull requests focus on the CUDAAllocatorConfig component by deprecating overlapping functions to streamline the codebase and refining the CUDA BackendStaticInitializer to improve allocator selection. These changes aim to enhance clarity and functionality in CUDA allocator management.
- nan_to_num complex tensor support and fixes: This pull request updates the `nan_to_num` function to support complex-valued arguments for `nan`, `posinf`, and `neginf` on complex tensors, compatible with `torch.complex128` and `torch.complex64`. It includes CUDA kernel improvements, bug fixes, and unit tests to ensure accurate handling of complex inputs.
- Inductor graph executor runtime call recording: This pull request adds functionality to record detailed runtime calls of the inductor graph executor within DebugMode, capturing inputs, cache keys, function call arguments, and post-gradient computation graphs. This enables enhanced tracing and debugging of tensor operations and graph executions.
- MPS backend Objective-C memory leak fix: This pull request addresses memory leaks in the MPS backend by adding autorelease calls to `MPSGraphPooling2DOpDescriptor` object creation in Pooling.mm, following a previous fix pattern for Linear.mm. This ensures proper release of descriptor objects and prevents memory accumulation.
- DTensor local tensor mode test enablement and fixes: This pull request enables additional DTensor tests in local tensor mode by unconditionally collecting RNG state from all CPU and CUDA devices during operation dispatch to ensure consistent randomness across ranks. It also fixes integration issues related to per-rank computations in _MaskedPartial and Shard placements discovered during test enablement.
- vLLM dependency pinned commit update: This pull request updates the pinned commit of the vLLM dependency to a specific commit (#25845) from the vLLM repository, primarily for testing purposes and does not require review.
- Generic API for accelerator allocator settings: This pull request introduces a generic API named `torch._C._accelerator_setAllocatorSettings` to enhance allocator settings management for accelerators in PyTorch.
- AllocatorConfig parsing bug fix: This pull request fixes a bug in the AllocatorConfig parsing logic related to roundup division, addressing incorrect behavior in memory allocation calculations.
- New tracer enabled as default: This pull request proposes enabling the new tracer as the default option in PyTorch to improve tracing functionality.
- ROCm grid sampler bilinear interpolation optimization: This pull request addresses performance bottlenecks in the ROCm implementation of grid sampler bilinear interpolation by moving atomic operations for the backward pass's `grad_input` computation from global memory to faster thread-block private shared memory (LDS). It also proposes optimizations such as leveraging texture objects for `grad_grid` and improving code robustness and readability.
- Lint workflow consolidation in CI: This pull request modifies the lint workflow in the CI pipeline to run both partial and full lint checks simultaneously using GitHub Actions matrixes, reducing confusion from having two separate lint workflows and improving reliability by consolidating jobs dependent on pull request context into a single workflow.
- MSVC C++ compilation error fix in pycore_stackref.h: This pull request fixes an MSVC C++ compilation error by wrapping the `pycore_stackref.h` header in a C file and compiling it with a C compiler to support designated initializers, along with additional platform-specific guards and code cleanups.
- Allocator Config error message improvements: This pull request improves the clarity and user-friendliness of error messages related to the Allocator Config in PyTorch.
- Test-only wrapper check addition: This test-only pull request adds a wrapper check to the PyTorch codebase, as indicated by the title and multiple iterative commit updates.
- gm.print_readable function update with annotations: This pull request updates the `gm.print_readable` function to include annotations in its output, enhancing readability by displaying additional metadata such as compilation details for the `flex_attention` operation.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 199
Key Closed Pull Requests
1. Up sample bilinear2d backward for AMD: This pull request aims to optimize and enable the backward operation of the bilinear 2D upsampling function specifically for AMD GPUs, improving performance and compatibility on ROCm platforms.
- URL: pull/165802
- Merged: No
- Associated Commits: 476d3, 8e6d6, 3ff78, fe59c, 14b0f, 99627, 275c0, 214e0, 4361e, b324b, edcb1, 4142e, 5e67b, c58ce, 40610, fcc0d, 16b82, 2711b, 5b2a3, 7b8bc, 506d5, 55b24, 123b6, 426b2, 31b3b, 06ee6, c126f, 4fe15, 9bb5b, fa57f, 2fbf4, 1735e, 2ef9f, 72b4e, e90b3
2. bf16 support for per tensor backward: This pull request proposes adding bfloat16 (bf16) support for the backward pass of the torch._fake_quantize_learnable_per_tensor_affine() function by upcasting parameters to float32 during computation and downcasting gradients to bf16 before returning, while also adjusting testing procedures to handle numerical differences between Python and C++ downcasting to maintain precision and avoid breaking changes (a generic sketch of this upcast/downcast pattern follows the key pull requests).
- URL: pull/165362
- Merged: No
- Associated Commits: 2de11, e6b85, b4aa3, 49540, 6b504, d9cbe, e5d98, b911f, f22bd, d280f, 32b17, ab812, 21155, 9f068, 77f7b, cecd6, d0b5a, 5af66, 527be, 6d643
3. Overlap scheduler improvements: This pull request proposes a series of improvements to the overlap scheduler in PyTorch, including accounting for bucketing in overlap calculations to reduce latency, updating compute ordering to be based on compute index rather than depth to maintain execution order, enforcing waits on collectives within the same process group, and enhancing memory handling through pre-fetch limiting and scheduling waits when memory usage exceeds peak, all aimed at optimizing collective operation scheduling and memory efficiency.
- URL: pull/165318
- Merged: No
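Key pull request 2 above and the per-channel bf16 work summarized in the next subsection follow the same mixed-precision pattern: upcast inputs to float32 for the computation, then downcast the resulting gradients to bfloat16. A generic, hypothetical sketch of that pattern (not the actual torch._fake_quantize_learnable_per_tensor_affine kernel):

```python
import torch

def bf16_safe_backward(grad_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Illustrative helper: compute a backward step in float32, return bf16.
    grad_fp32 = grad_out.to(torch.float32)
    x_fp32 = x.to(torch.float32)
    # Placeholder gradient computation performed at full precision.
    grad_in_fp32 = grad_fp32 * torch.ones_like(x_fp32)
    # Downcast only the final result back to bfloat16.
    return grad_in_fp32.to(torch.bfloat16)

g = bf16_safe_backward(torch.randn(4, dtype=torch.bfloat16),
                       torch.randn(4, dtype=torch.bfloat16))
print(g.dtype)  # torch.bfloat16
```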
Other Closed Pull Requests
- Windows Cross-Compilation CI Workflow: This topic covers the addition of a continuous integration workflow for testing Windows cross-compilation of PyTorch using AOTI, including building on Windows with Visual Studio 2022 and CUDA 12.8, uploading `.lib` artifacts, and cross-compiling on Linux with mingw support. It also involves modifying the Linux CUDA Docker image to install mingw and preparing for future tests that load and run the compiled artifacts on Windows.
- DeviceMesh Refactoring and API Simplification: Multiple pull requests focus on refactoring the DeviceMesh component by removing the explicit `mesh` Tensor storage, introducing a private constructor with limited parameters, and simplifying the unflatten method with helper functions. These changes streamline DeviceMesh instantiation, improve readability, and optimize memory usage by using `_layout` and `_global_rank_permutation` attributes.
- FBGEMM and CUTLASS Integration Updates: This topic includes updating the FBGEMM submodule to its latest main version with CMake modifications for NVFP4 grouped GEMM kernels and fixing duplicated CUTLASS paths in CMake by restricting inclusion to the `fbgemm_genai` target. These changes prevent version mismatches and support new features in GEMM kernels.
- Lazy Module and Dynamo Logging Enhancements: Pull requests here improve LazyVariableTracker logging by adding variable source attribution and types for better traceability and integrate the variable source into the `__repr__` method of the Lazy module. These changes address previous comments, fix related tests, and enhance debugging capabilities.
- Backward Pass bf16 Support: This pull request adds bf16 support for the per-channel backward pass by upcasting parameters to fp32 and downcasting gradients to bf16, adjusting tests and tolerances to handle numerical differences between Python and C++ casting implementations. This ensures backward compatibility and precision in mixed-precision training.
- Bug Fixes and Stability Improvements: This group addresses various issues including fixing a crash during large tensor max pooling on CUDA, relaxing equality checks for objects inherited from multiple types, and notifying users with a warning when using unsupported older Intel GPUs. These fixes improve stability and user experience.
- Distributed and Checkpointing Code Improvements: These pull requests improve assert statements in distributed checkpointing code by replacing them with explicit checks and meaningful error messages, and clean up assert statements across multiple `torch/utils` subdirectories while fixing attribute usage and lint issues. These changes enhance code clarity and robustness.
- Testing and OpInfo Enhancements: This topic includes adding and fixing OpInfo tests for the default partitioner to address dynamic shape test failures and ensure all tests pass before disabling functionalization. It also covers attempts to fix failures in periodic debug tests by updating FakeProcessGroup reference counting and removing deprecated usage.
- [pull/165372, pull/165479]
- Scaled GEMM and NVFP4 Support: This pull request adds an optional `alpha` argument to the `at::cuda::blas::scaled_gemm` function to support two-level-scaled NVFP4 GEMM calls, introducing device-constant memory and a statically held tensor buffer to manage the lifetime of the `alpha` tensor during matrix multiplication.
- Interpreter and Traceback Preservation: This pull request adds an interpreter to the local_map implementation to preserve fx_traceback annotations when lowering Dynamo-traced HOP bodies to aten nodes, addressing the propagation of these annotations through joint graph traces.
- Obsolete Code Removal and Configuration Refactoring: These pull requests propose removing an unused parameter related to extension attributes due to SYCL compiler upgrades and refactoring CUDAAllocatorConfig using ConfigTokenizer for better configuration management, although the latter was not merged.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| bobrenjc93 | 317 | 39 | 48 | 28 |
| cyyever | 164 | 57 | 0 | 25 |
| malfet | 90 | 15 | 12 | 99 |
| laithsakka | 150 | 14 | 6 | 36 |
| Skylion007 | 19 | 11 | 2 | 153 |
| anijain2305 | 154 | 21 | 4 | 6 |
| pianpwk | 112 | 35 | 0 | 3 |
| eellison | 80 | 15 | 0 | 50 |
| slayton58 | 96 | 16 | 0 | 22 |
| ezyang | 39 | 12 | 4 | 77 |