Weekly GitHub Report for PyTorch: October 13, 2025 - October 20, 2025
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on x86 CPUs, expanded Intel GPU support, a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.
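The `weights_only` flip is the change most likely to affect existing loading code, and `torch.compiler.set_stance` is new in this release. A minimal sketch of both, assuming a locally saved checkpoint file, might look like this:

```python
import torch

# Saving is unchanged.
torch.save({"weights": torch.randn(4, 4)}, "checkpoint.pt")

# In PyTorch 2.6+, torch.load defaults to weights_only=True, restricting
# unpickling to tensors and other allow-listed types for safety.
state = torch.load("checkpoint.pt")

# Loading arbitrary pickled objects now needs an explicit opt-out; only do
# this for checkpoints from trusted sources.
state = torch.load("checkpoint.pt", weights_only=False)

# The new performance-control API can, for example, force eager execution
# of torch.compile-d functions while debugging.
torch.compiler.set_stance("force_eager")
```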
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- dynamic_axes in torch.onnx.export seems broken in 2.9.0: This issue reports that the `dynamic_axes` argument in `torch.onnx.export` is not functioning correctly in PyTorch version 2.9.0, where dynamic axis names are replaced by serial numbers instead of the user-defined names. The user provides example code and output comparisons between versions 2.8.0 and 2.9.0 to demonstrate the regression, highlighting that the older version preserves custom axis names while the newer one does not.
- The discussion suggests using the newer `dynamic_shapes` argument instead of `dynamic_axes` and mentions that the change is related to the internal handling of axis renaming and exporter behavior controlled by a `dynamo` flag. It is recommended to add `dynamo=False` to maintain the old behavior temporarily (sketched after this list), with plans to eventually deprecate the old conversion method, and the conversation ends with appreciation for the clear and timely explanations.
- Number of comments this week: 6
- PyTorch CPU / DataLoader: "Too many open files" (EMFILE): This issue describes a problem where PyTorch's DataLoader on CPU encounters a "Too many open files" (EMFILE) error because the combination of a large `prefetch_factor`, `batch_size`, and per-sample file usage exceeds the system file descriptor limit, even with a single worker. The user provides a reproducible example demonstrating how file descriptors accumulate and crash the worker process, and several commenters discuss potential workarounds, diagnostics, and related experiences with file descriptor exhaustion in multiprocessing and shared memory contexts.
- Commenters suggest increasing the system file descriptor limit as a practical workaround (a small sketch of these mitigations follows this list), but some report that even very high limits do not prevent the error, indicating possible file descriptor leaks in PyTorch or related libraries. Diagnostic tips include using `lsof` to track open files and monitoring file handle usage, while others recommend disabling multiprocessing (`num_workers=0`) to avoid leaks, upgrading packages, and providing minimal reproducible code to help identify the root cause.
- Number of comments this week: 5
- [Meta] Type Hints in Pytorch: This issue initiates a discussion about improving type hints in PyTorch, emphasizing that top-level type aliases used in public interfaces should themselves be public to avoid forcing users to duplicate or import private types. It also highlights concerns about backward compatibility when refining type hints, particularly when changing from general types like `str` to more restrictive `Literal` types, which can unintentionally introduce type errors for users relying on dynamic string values.
- The comments express strong agreement with the need for public type aliases and backward compatibility, debating whether to centralize type definitions in a dedicated module or keep some local to their context. Participants discuss the trade-offs between using `Literal` types versus `StrEnum` for string-based parameters, noting current limitations in type checkers and the complexity of supporting both static and dynamic string values. Suggestions include creating a linter to enforce public exposure of types used in public APIs and carefully managing deprecations to avoid breaking user code, while acknowledging that some type-breaking changes may be acceptable in minor releases given their limited impact compared to runtime breaks.
- Number of comments this week: 5
- PyTorch on aarch64 crashes with ConvTranspose1d: This issue reports a crash occurring when running a large ConvTranspose1d operation in PyTorch on an aarch64 architecture, caused by an invalid memory write in the ARM Compute Library (ACL). The root cause is identified as an integer overflow in the offset calculation within ACL due to the large tensor dimensions, and a proposed fix involves changing the offset calculation to use 64-bit integers to prevent overflow.
- The discussion confirms the crash is reproducible with specific versions of PyTorch, oneDNN, and ACL on Neoverse-V2 hardware. The problem is traced to a write to an invalid address caused by a 32-bit integer overflow in ACL’s offset calculation for large tensor sizes, and it is suggested that switching to 64-bit offsets will resolve the issue.
- Number of comments this week: 5
- Memory leak when converting from numpy array: This issue reports a memory leak occurring when converting numpy arrays to PyTorch tensors on the CPU, specifically observed with PyTorch version 2.8.0 and Python 3.10/3.13. The user provides minimal reproducible examples showing that certain tensor operations and conversions cause unexpected memory growth, likely due to internal caching or allocator behavior rather than a traditional leak, and notes that the problem appears starting from PyTorch 2.6.
- The comments further simplify the example to isolate the leak and suggest the issue may stem from PyTorch’s CPU memory allocator caching small tensors, causing memory fragmentation or overuse. It is clarified that the memory is not truly leaked since dropping references frees it, but the allocator’s caching strategy leads to unexpectedly large memory consumption for small tensors stored in lists, and manual calls to system-level memory trimming can partially mitigate the problem.
- Number of comments this week: 4
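Following up on the `dynamic_axes` regression above, a minimal sketch of the two suggested paths, assuming a toy single-input model and illustrative file names, could look like this (the exact `dynamic_shapes` format follows `torch.export` conventions and may vary by version):

```python
import torch

model = torch.nn.Linear(8, 4)
example = torch.randn(2, 8)

# Option 1: keep the legacy exporter (and its dynamic_axes naming behavior)
# by passing dynamo=False until the old conversion path is deprecated.
torch.onnx.export(
    model,
    (example,),
    "model_legacy.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
    dynamo=False,
)

# Option 2: migrate to the dynamo-based exporter and its dynamic_shapes
# argument, expressed here per positional input.
batch = torch.export.Dim("batch")
torch.onnx.export(
    model,
    (example,),
    "model_dynamo.onnx",
    dynamic_shapes=({0: batch},),
    dynamo=True,
)
```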
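For the DataLoader EMFILE report, the commonly suggested mitigations are raising the process file-descriptor limit and temporarily dropping to `num_workers=0`. A small sketch, assuming a Unix-like system and a hypothetical dataset that opens one file per sample:

```python
import resource  # Unix-only
import torch
from torch.utils.data import DataLoader, Dataset

class FilePerSampleDataset(Dataset):  # hypothetical dataset for illustration
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Opening files inside __getitem__ keeps each handle short-lived.
        return torch.load(self.paths[idx], weights_only=True)

# Raise the soft RLIMIT_NOFILE to the hard limit for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# num_workers=0 keeps prefetching (and descriptor usage) minimal while
# diagnosing a suspected leak; prefetch_factor only applies with workers.
loader = DataLoader(FilePerSampleDataset([]), batch_size=32, num_workers=0)
```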
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Ubuntu 22.04, and shares a code snippet where the error occurs while compiling parts of a pipeline with torch.compile using the "reduce-overhead" mode.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a MaxPool2D operation with a larger kernel size as multiple MaxPool2D operations with a smaller kernel size, which reduces the computational cost per cell. The suggested modification targets the MaxPool2D layer directly to avoid additional backpropagation overhead and is expected to yield performance improvements specifically on CPU, as demonstrated by testing that showed a speedup of approximately 1.3 times (a quick numerical check of this decomposition is sketched after this list).
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted `/tmp` directory set to permission mode `1777`. Although the model compiles successfully, execution fails with an error indicating that the shared object `cuda_utils.so` cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, and managing known formatting challenges, while also providing a detailed worklist organized by directory to coordinate and track progress on this large-scale formatting effort.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function that allows users to exclude debug files, such as `.debug_pkl`, from the JIT archive to reduce file size. The motivation stems from observations that these debug files, which exist primarily for debugging purposes, can significantly increase the archive size without affecting model correctness, especially impacting deployment on mobile devices where storage is limited.
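The stride-1 MaxPool2D proposal above rests on a decomposition that is easy to verify numerically: with stride 1 and no padding, one pool with kernel k equals two chained pools whose kernel sizes sum to k + 1. A minimal check with illustrative sizes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# A single 5x5 max pool with stride 1 ...
direct = F.max_pool2d(x, kernel_size=5, stride=1)

# ... equals two chained 3x3 max pools with stride 1 (3 + 3 - 1 = 5).
composed = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)

print(torch.equal(direct, composed))  # True
```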
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 95
Summarized Issues:
- Type and Annotation Issues in PyTorch APIs: Several issues highlight problems with type annotations and parameter typings, such as the incorrect typing of `autowrap_modules` in symbolic tracing, mismatched type annotations in distributed module functions, and the need for better type hinting practices to avoid breaking user code. These problems can cause confusion, runtime errors, or hinder interoperability across PyTorch components.
- Memory Leaks and Memory Safety Bugs: Multiple reports describe memory leaks and memory safety issues, including leaks when converting numpy arrays to tensors, leaks in torch.compile with flash_attn_varlen_func, and a severe heap corruption bug during backward passes of loss functions. These issues lead to increased memory usage, crashes, or undefined behavior during training or inference.
- Torch.compile and Dynamo Compilation Failures: Several issues involve failures or limitations in torch.compile and Dynamo, such as unsupported dynamic shape operators, errors with tensor metadata comparisons, in-place mutations on overlapped tensors, and problems with custom operator constraints or invalid attribute assignments. These problems prevent successful compilation or cause runtime errors in compiled models.
- Inductor Graph Partitioning and Custom Pass Mechanism Issues: There are multiple reports about the Inductor graph partitioning mechanism needing redesign, including problems with relying on FakeTensors, missing custom operations in FX graph cache, and embedding non-serializable stateful objects in configs. These issues complicate partitioning, reduce robustness, and hinder migration efforts.
- [issues/165341, issues/165501, issues/165595]
- ONNX Export and FX Graph Metadata Problems: Issues include failures during ONNX export due to index errors in decomposition, loss of stack trace metadata when symbolic tracing FX graphs, and strict export mode failing to handle global constants properly. These problems affect model export workflows and downstream tooling like quantization.
- Distributed and Parallelism Bugs: Problems reported include hangs in DistributedDataParallel training with tied weights on NVIDIA Blackwell GPUs, incorrect outputs from dist.reduce_scatter_tensor with world size one, and ineffective timeout updates in the gloo backend. These issues impact distributed training stability and correctness.
- Operator and Backend Implementation Bugs: Several issues report incorrect or inconsistent operator behavior, such as Conv1d on Intel GPUs producing wrong results, aten.masked_select and torch.unique being rejected in fullgraph compilation, and abs operations returning empty tensors on custom device backends. These bugs cause incorrect computations or compilation failures.
- Numerical and Precision Errors: Reports include NaN gradients in the backward pass of atan2 with zero inputs (a minimal reproduction is sketched after this list), dtype mismatch errors in mixed precision training with torch.func.jvp, and inconsistent outputs in flex_attention with AMP or block masks. These issues degrade numerical stability and precision correctness.
- Crash and Segmentation Faults in Specific Operations: Crashes occur in matrix multiplication with CUDA 13.0 on Ubuntu, fmod with int64 causing segfaults, and ConvTranspose1d on aarch64 due to integer overflow. These critical failures cause abrupt termination of programs.
- Documentation and Usability Concerns: Issues include broken image links in release notes, inaccurate backend integration documentation, unclear ONNX output_names behavior, and confusing error messages in FakeTensorMode. These problems degrade the user experience and increase confusion.
- Build, CI, and Environment Issues: Problems such as Python version mismatch in Docker images, ROCm GPU maintenance causing queue delays, and missing libcudnn.so.9 on AlmaLinux affect development workflows and environment stability.
- Performance Regressions and Profiling Gaps: Reports include drastic slowdowns on Intel XPU after kernel changes, missing CPU spans in profiling with expandable_segments enabled, and failures in autotuning and caching in CuTeDSL Inductor path. These affect runtime efficiency and profiling accuracy.
- Test Failures and Disabled Tests on Specific Platforms: Several tests are disabled or failing on xpu or ROCm platforms, including TestTritonDotReduction and TestDistributions, indicating platform-specific instability or incompatibility.
- Requests for New Features and Improvements: Suggestions include adding FFT support to FLOP counters, element-wise operation flop formulas, NumPy dtype acceptance in APIs, sparse tensor support for view_as_complex, and Python-only backend registration APIs. These aim to enhance PyTorch's functionality and interoperability.
- DataLoader and Multiprocessing Stability Issues: A file descriptor exhaustion problem in DataLoader on CPU causes worker crashes due to large prefetch and batch sizes combined with multiple file openings, highlighting resource management challenges.
- Symbolic and FX Graph Export Limitations: Issues with symbolic integer support in functions like tril and problems with make_fx producing incorrect graphs when BatchedFallback is triggered indicate limitations in symbolic tracing and FX graph generation.
- Error Handling and Messaging Improvements Needed: Several issues call for clearer error messages, such as for deepcopying CUDA tensors in FakeTensorMode, invalid setattr in Dynamo, and scalar usage with torch.where and out= parameter, to improve developer experience.
- Miscellaneous Bugs in Core Operations: Bugs include floating point exceptions in dynamic quantization with zero output dimension, inconsistent remainder behavior between CPU and CUDA, and bugs in fusion scoring logic due to copy-paste errors. These affect correctness and stability in core functionalities.
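As a concrete illustration of the numerical item above, the NaN gradient from atan2 at the origin can be reproduced in a few lines (a minimal sketch; the original report may use a different setup):

```python
import torch

x = torch.zeros(1, requires_grad=True)
y = torch.zeros(1, requires_grad=True)

# atan2(0, 0) itself returns 0, but the analytic gradient divides by
# x**2 + y**2, which is 0 here, so the backward pass produces NaN.
out = torch.atan2(y, x)
out.backward()

print(out)     # tensor([0.], grad_fn=...)
print(x.grad)  # tensor([nan])
print(y.grad)  # tensor([nan])
```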
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 32
Summarized Issues:
- Installation and Compatibility Issues: Several users report problems installing PyTorch or specific versions due to compatibility issues with Python versions or operating systems. These include errors finding matching distributions for torch==2.7.0 and lack of prebuilt wheels for Python 3.14.0, with plans to support newer Python versions in future releases.
- [issues/165276, issues/165396]
- XPU Platform Test Failures and Disabling: Multiple tests in the DynamicShapesGPUTests and TestInductorOpInfoXPU suites are consistently failing on the XPU platform, leading to their disabling on the main branch. These include tests for dynamic shapes, unspec inputs, var correction, and comprehensive atanh operations.
- [issues/165403, issues/165411, issues/165412, issues/165413, issues/165414, issues/165415, issues/165416, issues/165742]
- Memory Format and Data Type Bugs on CUDA: Using torch.nn.MaxPool2d with channels_last memory format and bfloat16 dtype on CUDA causes NaN outputs or illegal memory access errors. This issue can be avoided by converting tensors to a contiguous memory format before applying MaxPool2d (a short workaround sketch follows this list).
- [issues/165297]
- Distributed and Backend Integration Challenges: There are difficulties mixing NCCL and MPI backends within torch.distributed, specifically using NCCL for global process groups and MPI for subgroups, which leads to segmentation faults and hangs during subgroup creation.
- [issues/165428]
- Custom Operator and Inductor Graph Partitioning Issues: Enhancements and bug fixes are needed in custom operator registration to prevent outputs being views of inputs, and proposals exist to improve inductor graph partitioning by specifying custom operators via string names to avoid serialization problems.
- [issues/165360, issues/165486]
- Flash Attention and Functorch Compatibility Problems: The torch.func.jvp function does not support models using flash attention due to missing autograd methods, causing runtime errors and NotImplementedErrors when computing gradients or forward automatic differentiation.
- [issues/165517, issues/165530]
- Tensor View and Reshape Errors: Using the .view() method on non-contiguous tensor slices results in size and stride incompatibility errors, with recommendations to use .reshape() instead to avoid runtime failures (a short sketch follows this list).
- [issues/165525]
- Documentation and Contribution Process Clarifications: Users inquire about submitting documentation pull requests, with guidance that minor fixes can be submitted directly while larger changes should be discussed first. Additionally, documentation inaccuracies exist regarding tensor.index_put_ behavior with duplicate indices.
- [issues/165503, issues/165536]
- PyTorch Compiler and Serialization Bugs: AOT precompile serialization fails with AttributeErrors when running compiled functions multiple times due to caching conflicts, and the Inductor compiler encounters AttributeErrors related to loop reordering during model compilation with specific settings.
- [issues/165447, issues/165579]
- Resource Management and Reference Counting Bugs: The c10::intrusive_ptr::reset_() method mishandles weak reference counts, potentially causing resource leaks, and PyTorch Dynamo's RelationalGuard classes improperly store raw PyObject* pointers without incrementing reference counts, risking dangling pointers.
- [issues/165262, issues/165722]
- Test Suite Failures and Disabling on XPU: Several tests related to dynamic shapes, unspec inputs, and specific operations are disabled due to consistent failures on the XPU platform, affecting test reliability and coverage.
- [issues/165403, issues/165411, issues/165412, issues/165413, issues/165414, issues/165415, issues/165416, issues/165742]
- Compilation and Build Errors: Building extensions like mmcv on Windows with PyTorch 2.9.0 fails due to C++ ambiguity errors in Torch headers, and FP8MatmulCUDA tests fail due to incorrect parameter swizzling in the _scaled_mm_v2 operation.
- [issues/165721, issues/165743]
- Module Import Errors: Attempting to import torch.ops.symm_mem as a standalone module results in ModuleNotFoundError because it is not an importable module; its operations should be accessed as attributes instead.
- [issues/165761]
- Feature Requests and Documentation Improvements: Requests include adding examples to torch.nn.ConvTranspose1d documentation and mechanisms to track dead ReLU activations to aid debugging and model diagnosis.
- [issues/165552, issues/165615]
- Script and Accuracy Checking Bugs: A script incorrectly returns a fail_accuracy status when exceptions occur during accuracy checking instead of properly handling the exceptions.
- [issues/165753]
- API Changes and Implementation Updates: The MPSGraph implementation of torch.cat is proposed to be removed in favor of a Metal kernel, reflecting ongoing backend improvements.
- [issues/165350]
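For the memory-format item above, the reported workaround amounts to forcing a contiguous layout before pooling. A minimal sketch, assuming a CUDA device is available:

```python
import torch

pool = torch.nn.MaxPool2d(kernel_size=2)

x = torch.randn(8, 16, 32, 32, device="cuda", dtype=torch.bfloat16)
x = x.to(memory_format=torch.channels_last)

# Workaround from the issue: switch back to a contiguous (NCHW) layout
# before MaxPool2d to avoid NaN outputs / illegal memory accesses.
y = pool(x.contiguous())
```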
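Similarly, the view/reshape item comes down to `.view()` requiring stride-compatible memory while `.reshape()` silently copies when it must. A small sketch:

```python
import torch

t = torch.arange(12).reshape(3, 4)
s = t[:, ::2]        # non-contiguous slice: shape (3, 2), strides (4, 2)

# s.view(6)          # RuntimeError: view size is not compatible with input tensor's size and stride
flat = s.reshape(6)  # works: reshape falls back to a copy when no zero-copy view exists
print(flat)          # tensor([ 0,  2,  4,  6,  8, 10])
```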
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 234
Key Open Pull Requests
1. [WIP][inductor] generate fused rms/layer norm bwd: This pull request is focused on developing and iteratively updating a work-in-progress feature within the PyTorch inductor backend to generate a fused backward pass for RMS normalization and layer normalization operations.
- URL: pull/165370
- Merged: No
- Associated Commits: 6d566, a7816, 8d97d, 7a687, 23c01, 9c81a, d2d9d, 6fbfd, 3d419, f5c65, cd0f4, 8745c, 70e3f, a4cbc, 12615, df0b8, 21d22, 4f40d, ac51f, 2adf0, 4b590, 0c110
2. Fix nn.Dropout accuracy discrepancies between triton and torch implementations: This pull request proposes a fix for accuracy discrepancies in PyTorch's nn.Dropout between Triton-compiled and eager execution modes by introducing a compiler switch that aligns random number generation to preserve exact dropout mask results without sacrificing performance through fusion with adjacent kernels.
- URL: pull/165545
- Merged: No
- Associated Commits: bcb9f, 5a711, 0de41, 0e443, 6dd7c, a897a, 2a4ec, f274c, d8ebb, 7549a, 14aea, 0c68a, a6732, 3a485, 09c78, 32c25, e498d, a8611
3. Bugfix to forward autodiff causing different datatype 2: This pull request addresses a bug in the forward automatic differentiation process related to incorrect data type promotion when handling Python scalars and zero-dimensional tensors by introducing a new property was_wrapped_number to accurately track wrapped numbers, modifying autograd code to set this property during arithmetic operations, and updating the dtype promotion logic accordingly, along with adding new tests to validate the fix.
- URL: pull/165784
- Merged: No
- Associated Commits: 0dc5c, 746eb, 04a23, 729f1, ee590, 4d35e, d24b0, 787e8, d3981, 35015, 141ce, 63656, 112c7, e7d59, 3473a, 51da7, eb40d
Other Open Pull Requests
- DLPack tensor exchange API unification and updates: This pull request introduces a unified `DLPackExchangeAPI` struct replacing separate function pointers for DLPack tensor exchanges, aligning with the latest DLPack standard and exposing key function pointers including a new `current_work_stream` for improved device stream handling. It updates all conversions to a `_no_sync` convention requiring explicit stream synchronization, adds a non-owning DLTensor conversion to reduce reference-counting overhead, updates `dlpack.h`, and includes unit tests to ensure stability without releasing the GIL.
- XPU per-process memory fraction APIs: Two pull requests introduce new APIs for the XPU backend: one to get the allowed memory fraction per single process and another to set this memory fraction, aligning XPU memory management with other PyTorch backends. These additions enable users to retrieve and customize memory usage limits on XPU devices.
- DebugMode enhancements for detailed logging and hooks: Multiple pull requests enhance DebugMode by adding a `run_graph()` method for detailed logging of `fx.Node` calls during graph execution and introducing hooks for the `__torch_dispatch__` mechanism and GraphModule nodes. These changes enable flexible recording, annotation, and comprehensive trace outputs for debugging tensor operations and graph executions.
- Tensor descriptor transposition for load/store operations: This pull request adds an option to transpose tensor descriptors by reordering block parameters to ensure descending stride order, improving compatibility and matching of tensor descriptors beyond the previously supported 2D case. This enhancement facilitates better tensor descriptor handling during load and store operations.
- Dynamo assertion handling improvements: This pull request cleans up and improves assertion handling in the Dynamo component by refining type checks, comments, and error handling, addressing related issues #162852 and #164878. These changes enhance code robustness and maintainability in Dynamo.
- Triton kernel autotuning parameter addition: A new `max_autotune_configs` parameter is introduced to enable advanced Triton kernel autotuning for custom operations, allowing these ops to benefit from both algorithmic and kernel-level optimizations similar to the existing `tuned_mm` approach. This two-tier optimization strategy maintains backward compatibility while enhancing performance.
- CalculateSmallVectorDefaultInlinedElements migration: This pull request migrates the `CalculateSmallVectorDefaultInlinedElements` implementation from a template struct to a constexpr function using C++17 features, reducing template instantiation complexity. This change minimizes compilation time and the size of the generated binary.
- CUDAAllocatorConfig deprecation and improvements: Two pull requests focus on the CUDAAllocatorConfig component by deprecating overlapping functions to streamline the codebase and refining the CUDA BackendStaticInitializer to improve allocator selection. These changes aim to enhance clarity and functionality in CUDA allocator management.
- nan_to_num complex tensor support and fixes: This pull request updates the `nan_to_num` function to support complex-valued arguments for `nan`, `posinf`, and `neginf` on complex tensors, compatible with `torch.complex128` and `torch.complex64`. It includes CUDA kernel improvements, bug fixes, and unit tests to ensure accurate handling of complex inputs.
- Inductor graph executor runtime call recording: This pull request adds functionality to record detailed runtime calls of the inductor graph executor within DebugMode, capturing inputs, cache keys, function call arguments, and post-gradient computation graphs. This enables enhanced tracing and debugging of tensor operations and graph executions.
- MPS backend Objective-C memory leak fix: This pull request addresses memory leaks in the MPS backend by adding autorelease calls to `MPSGraphPooling2DOpDescriptor` object creation in Pooling.mm, following a previous fix pattern for Linear.mm. This ensures proper release of descriptor objects and prevents memory accumulation.
- DTensor local tensor mode test enablement and fixes: This pull request enables additional DTensor tests in local tensor mode by unconditionally collecting RNG state from all CPU and CUDA devices during operation dispatch to ensure consistent randomness across ranks. It also fixes integration issues related to per-rank computations in _MaskedPartial and Shard placements discovered during test enablement.
- vLLM dependency pinned commit update: This pull request updates the pinned commit of the vLLM dependency to a specific commit (#25845) from the vLLM repository, primarily for testing purposes and does not require review.
- Generic API for accelerator allocator settings: This pull request introduces a generic API named `torch._C._accelerator_setAllocatorSettings` to enhance allocator settings management for accelerators in PyTorch.
- AllocatorConfig parsing bug fix: This pull request fixes a bug in the AllocatorConfig parsing logic related to roundup division, addressing incorrect behavior in memory allocation calculations.
- New tracer enabled as default: This pull request proposes enabling the new tracer as the default option in PyTorch to improve tracing functionality.
- ROCm grid sampler bilinear interpolation optimization: This pull request addresses performance bottlenecks in the ROCm implementation of grid sampler bilinear interpolation by moving atomic operations for the backward pass's `grad_input` computation from global memory to faster thread-block private shared memory (LDS). It also proposes optimizations such as leveraging texture objects for `grad_grid` and improving code robustness and readability.
- Lint workflow consolidation in CI: This pull request modifies the lint workflow in the CI pipeline to run both partial and full lint checks simultaneously using GitHub Actions matrixes, reducing confusion from having two separate lint workflows and improving reliability by consolidating jobs dependent on pull request context into a single workflow.
- MSVC C++ compilation error fix in pycore_stackref.h: This pull request fixes an MSVC C++ compilation error by wrapping the `pycore_stackref.h` header in a C file and compiling it with a C compiler to support designated initializers, along with additional platform-specific guards and code cleanups.
- Allocator Config error message improvements: This pull request improves the clarity and user-friendliness of error messages related to the Allocator Config in PyTorch.
- Test-only wrapper check addition: This test-only pull request adds a wrapper check to the PyTorch codebase, as indicated by the title and multiple iterative commit updates.
- gm.print_readable function update with annotations: This pull request updates the `gm.print_readable` function to include annotations in its output, enhancing readability by displaying additional metadata such as compilation details for the `flex_attention` operation.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 199
Key Closed Pull Requests
1. Up sample bilinear2d backward for AMD: This pull request aims to optimize and enable the backward operation of the bilinear 2D upsampling function specifically for AMD GPUs, improving performance and compatibility on ROCm platforms.
- URL: pull/165802
- Merged: No
- Associated Commits: 476d3, 8e6d6, 3ff78, fe59c, 14b0f, 99627, 275c0, 214e0, 4361e, b324b, edcb1, 4142e, 5e67b, c58ce, 40610, fcc0d, 16b82, 2711b, 5b2a3, 7b8bc, 506d5, 55b24, 123b6, 426b2, 31b3b, 06ee6, c126f, 4fe15, 9bb5b, fa57f, 2fbf4, 1735e, 2ef9f, 72b4e, e90b3
2. bf16 support for per tensor backward: This pull request proposes adding bfloat16 (bf16) support for the backward pass of the torch._fake_quantize_learnable_per_tensor_affine() function by upcasting parameters to float32 during computation and downcasting gradients to bf16 before returning, while also adjusting testing procedures to handle numerical differences between Python and C++ downcasting to maintain precision and avoid breaking changes (a generic sketch of this upcast/downcast pattern follows the key pull requests).
- URL: pull/165362
- Merged: No
- Associated Commits: 2de11, e6b85, b4aa3, 49540, 6b504, d9cbe, e5d98, b911f, f22bd, d280f, 32b17, ab812, 21155, 9f068, 77f7b, cecd6, d0b5a, 5af66, 527be, 6d643
3. Overlap scheduler improvements: This pull request proposes a series of improvements to the overlap scheduler in PyTorch, including accounting for bucketing in overlap calculations to reduce latency, updating compute ordering to be based on compute index rather than depth to maintain execution order, enforcing waits on collectives within the same process group, and enhancing memory handling through pre-fetch limiting and scheduling waits when memory usage exceeds peak, all aimed at optimizing collective operation scheduling and memory efficiency.
- URL: pull/165318
- Merged: No
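Key pull request 2 above and the per-channel bf16 work summarized in the next subsection follow the same mixed-precision pattern: upcast inputs to float32 for the computation, then downcast the resulting gradients to bfloat16. A generic, hypothetical sketch of that pattern (not the actual torch._fake_quantize_learnable_per_tensor_affine kernel):

```python
import torch

def bf16_safe_backward(grad_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Illustrative helper: compute a backward step in float32, return bf16.
    grad_fp32 = grad_out.to(torch.float32)
    x_fp32 = x.to(torch.float32)
    # Placeholder gradient computation performed at full precision.
    grad_in_fp32 = grad_fp32 * torch.ones_like(x_fp32)
    # Downcast only the final result back to bfloat16.
    return grad_in_fp32.to(torch.bfloat16)

g = bf16_safe_backward(torch.randn(4, dtype=torch.bfloat16),
                       torch.randn(4, dtype=torch.bfloat16))
print(g.dtype)  # torch.bfloat16
```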
Other Closed Pull Requests
- Windows Cross-Compilation CI Workflow: This topic covers the addition of a continuous integration workflow for testing Windows cross-compilation of PyTorch using AOTI, including building on Windows with Visual Studio 2022 and CUDA 12.8, uploading `.lib` artifacts, and cross-compiling on Linux with mingw support. It also involves modifying the Linux CUDA Docker image to install mingw and preparing for future tests that load and run the compiled artifacts on Windows.
- DeviceMesh Refactoring and API Simplification: Multiple pull requests focus on refactoring the DeviceMesh component by removing the explicit `mesh` Tensor storage, introducing a private constructor with limited parameters, and simplifying the unflatten method with helper functions. These changes streamline DeviceMesh instantiation, improve readability, and optimize memory usage by using `_layout` and `_global_rank_permutation` attributes.
- FBGEMM and CUTLASS Integration Updates: This topic includes updating the FBGEMM submodule to its latest main version with CMake modifications for NVFP4 grouped GEMM kernels and fixing duplicated CUTLASS paths in CMake by restricting inclusion to the `fbgemm_genai` target. These changes prevent version mismatches and support new features in GEMM kernels.
- Lazy Module and Dynamo Logging Enhancements: Pull requests here improve LazyVariableTracker logging by adding variable source attribution and types for better traceability and integrate the variable source into the `__repr__` method of the Lazy module. These changes address previous comments, fix related tests, and enhance debugging capabilities.
- Backward Pass bf16 Support: This pull request adds bf16 support for the per-channel backward pass by upcasting parameters to fp32 and downcasting gradients to bf16, adjusting tests and tolerances to handle numerical differences between Python and C++ casting implementations. This ensures backward compatibility and precision in mixed-precision training.
- Bug Fixes and Stability Improvements: This group addresses various issues including fixing a crash during large tensor max pooling on CUDA, relaxing equality checks for objects inherited from multiple types, and notifying users with a warning when using unsupported older Intel GPUs. These fixes improve stability and user experience.
- Distributed and Checkpointing Code Improvements: These pull requests improve assert statements in distributed checkpointing code by replacing them with explicit checks and meaningful error messages, and clean up assert statements across multiple `torch/utils` subdirectories while fixing attribute usage and lint issues. These changes enhance code clarity and robustness.
- Testing and OpInfo Enhancements: This topic includes adding and fixing OpInfo tests for the default partitioner to address dynamic shape test failures and ensure all tests pass before disabling functionalization. It also covers attempts to fix failures in periodic debug tests by updating FakeProcessGroup reference counting and removing deprecated usage.
- [pull/165372, pull/165479]
- Scaled GEMM and NVFP4 Support: This pull request adds an optional `alpha` argument to the `at::cuda::blas::scaled_gemm` function to support two-level-scaled NVFP4 GEMM calls, introducing device-constant memory and a statically held tensor buffer to manage the lifetime of the `alpha` tensor during matrix multiplication.
- Interpreter and Traceback Preservation: This pull request adds an interpreter to the local_map implementation to preserve fx_traceback annotations when lowering Dynamo-traced HOP bodies to aten nodes, addressing the propagation of these annotations through joint graph traces.
- Obsolete Code Removal and Configuration Refactoring: These pull requests propose removing an unused parameter related to extension attributes due to SYCL compiler upgrades and refactoring CUDAAllocatorConfig using ConfigTokenizer for better configuration management, although the latter was not merged.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| bobrenjc93 | 317 | 39 | 48 | 28 |
| cyyever | 164 | 57 | 0 | 25 |
| malfet | 90 | 15 | 12 | 99 |
| laithsakka | 150 | 14 | 6 | 36 |
| Skylion007 | 19 | 11 | 2 | 153 |
| anijain2305 | 154 | 21 | 4 | 6 |
| pianpwk | 112 | 35 | 0 | 3 |
| eellison | 80 | 15 | 0 | 50 |
| slayton58 | 96 | 16 | 0 | 22 |
| ezyang | 39 | 12 | 4 | 77 |