Weekly Project News

December 1, 2025

Weekly GitHub Report for PyTorch: November 24, 2025 - December 01, 2025 (12:02:33)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda package publishing.
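
For readers who want to try the two highlights called out above, here is a minimal sketch of the new torch.compiler.set_stance control and the weights_only default change in torch.load; the toy function, the "force_eager" stance choice, and the checkpoint file name are illustrative, not taken from the release notes.

```python
import torch

# torch.compiler.set_stance (new in 2.6) switches compilation behavior without
# touching torch.compile call sites; "force_eager" skips compilation entirely,
# which is convenient when debugging compiled code.
@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

torch.compiler.set_stance("force_eager")   # run f eagerly despite the decorator
f(torch.randn(8))
torch.compiler.set_stance("default")       # restore normal compilation

# torch.load now defaults to weights_only=True, so checkpoints containing
# arbitrary pickled objects are rejected unless explicitly allowed.
torch.save({"w": torch.randn(4)}, "ckpt.pt")
state = torch.load("ckpt.pt")              # equivalent to weights_only=True in 2.6
```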

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [CI] Gradcheck accuracy constraints are too tight that cause float32 test to pass on x86 but fail on ARM: This issue addresses the problem where the gradcheck accuracy constraints in the continuous integration tests are too strict, causing a float32 test to pass on x86 architectures but fail on ARM architectures. The discrepancy arises due to minor floating-point precision differences between CPU instruction sets, leading to inconsistent test results across platforms.

    • The comments discuss that the observed accuracy difference is minor and likely due to differences in floating-point precision between x86 and ARM CPUs. Contributors agree that the gradcheck should run in double precision to avoid such issues, and suggest relaxing accuracy constraints rather than modifying the gradcheck itself, to ensure tests pass consistently across architectures without breaking other operations. A minimal double-precision gradcheck sketch appears after this list.
    • Number of comments this week: 8
  2. ONNX export fails with dynamo=True: This issue reports a failure when exporting a transducer decoder network to ONNX format using the dynamo=True flag in PyTorch, which causes the export process to break due to an error in handling the aten.unbind.int operator during graph translation. The user provides a reproducible example and detailed environment information, and commenters analyze the problem as related to Dynamo decomposing the LSTM module into unsupported constituent operations, with suggestions that the issue stems from incomplete support for certain dynamic operators in the ONNX export path when using Dynamo.

    • The discussion reveals that exporting with dynamo=False works correctly, producing the expected LSTM node in the ONNX graph, while dynamo=True triggers decomposition of LSTM into lower-level ops that cause errors. It is noted that a newer version of onnxscript was expected to fix this, but the user already has the latest version, and the root cause appears to be limited support for dynamic behavior in the aten::unbind.int operator during ONNX export with Dynamo enabled. A sketch contrasting the two export paths appears after this list.
    • Number of comments this week: 6
  3. Reduce Overhead and Max Autotune NaN: This issue reports that when using the max-autotune or reduce-overhead compilation modes in PyTorch’s Inductor backend, training a specific model results in NaN values, whereas training without compilation proceeds without error. The user found that employing a gradient scaler with mixed precision (fp16) or enabling certain precision emulation settings can avoid the NaNs, but the exact cause and correct fix for training with bfloat16 under these modes remain unclear.

    • The comments confirm the problem occurs only with compilation enabled and not with standard training. Multiple workarounds were tested, including using a gradient scaler with fp16 and toggling precision emulation flags, both of which resolved the NaNs. However, the underlying reason for the failure with bfloat16 and the precise solution are still uncertain, prompting further investigation. A sketch of the scaler workaround appears after this list.
    • Number of comments this week: 5
  4. [DTensor] aten.max.dim returns wrong indices when using DTensor: This issue reports a bug in the aten.max.dim function when used with DTensor sharding, where the indices returned are incorrect because the framework does not account for the offset in the sharded dimension. The user is seeking a strategy to correctly implement index sharding for the max operation along a distributed dimension.

    • The discussion centers on the challenge that aten.max.dim is not a linear reduction due to its index output, and the proposed solution involves creating a custom handler to manage index offsets or disabling linear reduction and replicating the sharded dimension. Contributors mention a related pull request that implements global indices and suggest leveraging or extending it to fix this issue, with plans to submit a new PR to accommodate the max/min operations with dimension support.
    • Number of comments this week: 4
  5. max_autotuned BMM produces wrong result when multiple threads are used: This issue describes a bug where the max_autotuned batched matrix multiplication (BMM) produces incorrect results when multiple threads are used, specifically triggered by the interaction between OpenMP dynamic thread adjustment and importing the OpenCV library. The problem arises because OpenMP’s dynamic thread setting, enabled implicitly by importing cv2, causes the parallel region to execute fewer threads than requested, leading to incorrect results, and the reporter suggests switching to a different parallelization backend to avoid this dependency.

    • The comments include sharing the problematic generated C++ code, clarifying the context of the issue, and a pull request proposing a fix that changes the parallelization pragma to #pragma omp parallel for to ensure correct thread distribution despite OpenMP dynamic thread adjustments caused by importing OpenCV.
    • Number of comments this week: 4
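
As a companion to the gradcheck discussion in item 1, here is a minimal double-precision gradcheck sketch; the tanh operator, shapes, and tolerances are illustrative and not taken from the failing CI test.

```python
import torch
from torch.autograd import gradcheck

# gradcheck compares analytical gradients against finite differences; running
# it on float64 inputs sidesteps the float32 tolerance gap between x86 and ARM.
x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
assert gradcheck(torch.tanh, (x,), eps=1e-6, atol=1e-4)
```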
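
For item 2, the contrast between the two exporters can be reproduced with a sketch along these lines; the Decoder wrapper and tensor shapes are placeholders, not the reporter's transducer decoder.

```python
import torch

class Decoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=80, hidden_size=320, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out

model = Decoder().eval()
x = torch.randn(1, 16, 80)

# TorchScript-based exporter: keeps the LSTM as a single ONNX LSTM node.
torch.onnx.export(model, (x,), "decoder_ts.onnx", dynamo=False)

# Dynamo-based exporter: decomposes the LSTM into lower-level ops and, per the
# issue, can fail on operators such as aten.unbind.int.
torch.onnx.export(model, (x,), "decoder_dynamo.onnx", dynamo=True)
```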
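
For item 3, the fp16 gradient-scaler workaround reported to avoid the NaNs looks roughly like the following; the model, optimizer, loss, and data are placeholders, and a CUDA device is assumed.

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

compiled = torch.compile(model, mode="reduce-overhead")
x = torch.randn(8, 16, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):
    loss = compiled(x).float().pow(2).mean()

scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(opt)                # unscale grads and skip the step if infs/NaNs appear
scaler.update()
opt.zero_grad(set_to_none=True)
```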

2.2 Top 5 Stale Issues:

We consider stale issues to be those that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0.dev, CUDA 12.1, and Triton 2.2.0, and shares code snippets demonstrating the error occurring while compiling specific pipeline components with torch.compile.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a MaxPool2D operation with a larger kernel size as multiple MaxPool2D operations with a smaller kernel size, which reduces the computational cost per cell. The suggested modification targets the MaxPool2D layer directly to avoid additional backpropagation overhead and is expected to improve performance specifically on CPU, as demonstrated by testing code showing a notable speedup. A short sketch of this decomposition appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, handling known formatting-related problems in certain files, and organizing the work by directory to facilitate incremental and reviewable pull requests.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, specifically .debug_pkl files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are primarily for debugging purposes, can occupy a significant portion of the archive size without affecting model correctness, making the feature particularly beneficial for deploying smaller models on resource-constrained devices like mobile platforms.
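
The decomposition proposed in stale issue 2 can be checked with a short sketch; the tensor sizes are arbitrary, and this illustrates the idea rather than PyTorch's implementation.

```python
import torch
import torch.nn.functional as F

# With stride 1, a max-pool over a large window equals repeated max-pools over
# smaller windows, because a max over a union of windows is a nested max.
x = torch.randn(1, 3, 32, 32)
big = F.max_pool2d(x, kernel_size=5, stride=1)
small_twice = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                           kernel_size=3, stride=1)
assert torch.equal(big, small_twice)  # identical results, lower per-output cost
```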

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 83

Summarized Issues:

  • Accuracy and Precision Issues Across Architectures and Backends: Several issues report accuracy regressions and precision inconsistencies in PyTorch operations across different hardware platforms and backends. These include strict gradcheck accuracy constraints causing float32 tests to fail on ARM but pass on x86, ROCm backend accuracy losses during compilation, and unexpected large discrepancies in computed tensor norms between CPU and GPU that cannot be explained by floating-point precision differences.
    • issues/168929, issues/168935, issues/169237
  • ROCm and Inductor Backend Bugs and Compilation Failures: Multiple issues highlight bugs and crashes related to the ROCm backend and Inductor compiler, including incorrect reduction results due to missing store masks, compilation crashes from constexpr redefinition errors, and device mismatch errors during autotuning. These problems affect kernel generation, dynamic shape handling, and multi-dimensional reductions, causing incorrect outputs or backend crashes.
    • issues/168937, issues/168945, issues/169197
  • TorchDynamo Compilation and Export Errors: Several issues describe TorchDynamo failing to compile or export models due to internal errors such as incorrect tensor indexing, unsupported symbolic operations during ONNX export, and type promotion failures. These errors cause compilation failures despite models running correctly in eager mode, and include problems with tensor shape indexing, mixed input types in torch.where, and dynamic tensor chunking.
    • issues/168954, issues/168969, issues/169002, issues/169009
  • DebugMode Usability and Feature Limitations: Multiple issues point out limitations and usability problems in DebugMode, such as impractical sample code assumptions, lack of fuzzy matching for hash mismatches, inability to export hashes for external comparison, poor tracking of input/output relationships, and difficulty correlating print outputs with FX graphs. These hinder effective debugging and traceability in real-world training scenarios.
    • issues/168973, issues/168974, issues/168975, issues/168976, issues/168977
  • Performance Regressions and Optimization Challenges: Some issues report performance regressions and optimization challenges, including a 15-20% slowdown in the MPS backend's at::mul_out() function due to a new kernel, flaky memory overflow bugs in torch.var_mean caused by Triton kernel autotuning, and quadratic time complexity bottlenecks during checkpoint resharding at large scale. These affect runtime efficiency and stability.
    • issues/168963, issues/168964, issues/169113
  • Bugs in Distributed and Parallel Training Implementations: Issues report bugs in distributed training components such as FSDP passing incorrect types causing AttributeErrors, OpenMP dynamic thread adjustment causing incorrect batched matrix multiplication results, and pipeline parallelism lacking coordinated debug log ordering. These bugs cause crashes, incorrect computations, or hinder debugging in parallel training setups.
    • issues/168947, issues/168965, issues/169152
  • ONNX Export and Model Export Failures: Several issues describe failures when exporting models to ONNX format, including unsupported decompositions of LSTM modules causing TypeErrors, and ResNet18 exported with dynamo=True producing near-zero accuracy on ImageNet, indicating broken export paths or incorrect model behavior after export.
    • issues/168969, issues/169178
  • CI and Infrastructure Interruptions and Failures: Some issues report interruptions and failures in PyTorch's continuous integration and infrastructure, such as partial CI pauses affecting Mac and ROCM runners, container initialization failures due to NVIDIA-CONTAINER-TOOLKIT errors on specific runners, and Dr CI service outages caused by firewall changes. These impact testing and development workflows.
    • issues/168993, issues/169000, issues/169033
  • Device and Backend-Specific Bugs and Limitations: Various issues highlight device-specific bugs such as CUDA stream capture breaking when creating float16 tensors from -inf, Apple MPS backend showing saturating overflow behavior differing from CPU/CUDA, MPS backend failing to check shape consistency or output_padding constraints in unpooling and conv transpose layers, and CPU Triton backend failing to compile certain combo kernels due to type mismatches.
    • issues/168991, issues/169058, issues/169235, issues/169236, issues/169195
  • Graph Partitioning and Compilation Backend Errors: Issues report errors in graph partitioning and compilation backends, including UnboundLocalError from referencing uninitialized variables during CUDA graph partitioning, signature errors from forbidden cudagraph operations causing NameErrors, and assertion failures in Inductor optimization passes due to incorrect assumptions about function arguments. These cause backend crashes and incorrect compiled outputs.
    • issues/169050, issues/169011, issues/169232
  • Sparse Tensor and Higher-Order Operator Support Limitations: Some issues describe lack of support for sparse tensors in torch.compile with cudagraphs backend causing runtime failures, and higher-order operators like cond and scan failing with torch.func.functional_call due to internal graph capture problems, preventing functional control flow in compiled workflows.
    • issues/169119, issues/169120
  • Memory and Storage Handling Bugs: Issues report memory-related bugs such as segmentation faults in batch_norm with empty running_var tensors, bad frees caused by invalid PackedSequence metadata, and segmentation faults when converting QInt32Storage to bfloat16 due to improper dtype conversion handling. These cause crashes and memory corruption.
    • issues/169208, issues/169210, issues/169213
  • Compilation and Runtime Failures with torch.compile and Backends: Multiple issues describe failures when using torch.compile with various backends and options, including runtime CUDA driver errors during random number generation, NotImplementedErrors in MXFP8 MoE training code, incorrect results with TVM backend on CUDA, and failures tracing symbolic integer inputs or dlpack functions. These prevent successful compilation or cause incorrect outputs.
    • issues/168960, issues/168995, issues/169188, issues/169111, issues/169112
  • Feature Requests for Usability and Functionality Enhancements: Several issues request new features or improvements, such as adding a meaningful str method for schedulers, fuzzy matching in DebugMode hash checks, automatic ROCm architecture detection for kernel compilation, support for custom non-integer indexing in datasets, and fused backward support in cuBLASLt for addmm with epilogue fusions. These aim to improve usability and performance.
    • issues/168967, issues/168984, issues/169073, issues/169078, issues/169055
  • Type and Metadata Handling Issues in Tracing and Export: Issues highlight problems with missing metadata for non-tensor inputs during tracing, incorrect hashing of functional collectives in debug mode, and silent failures when stacking tensors exceeding 32-bit size limits on CUDA, indicating gaps in type and metadata management that affect correctness and debugging.
    • issues/169083, issues/169150, issues/169163
  • Model Loading and Quantization Bugs: One issue reports that loading a quantized model exported with torch.export.save fails unless a specific quantization module is imported beforehand, indicating missing automatic dependency handling during model loading.
    • issues/169187
  • Miscellaneous Bugs and Questions: Other issues include unexpected test behavior such as an xfail test unexpectedly passing, questions about fusion restrictions in dynamic mode despite demos showing accuracy, and inquiries about support for certain data types in specific operations. These reflect ongoing uncertainties and edge cases in PyTorch development.
    • issues/169054, issues/169106, issues/169035

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 28

Summarized Issues:

  • Test Disabling on ROCm Platform: Multiple tests including test_3layer_split_reduction, test_transformerencoder_fastpath variants, test_matmul_fp16, test_ordered_distribute_all_combination, and test_svd_lowrank_cuda_complex128 have been disabled on the ROCm platform due to consistent failures on the main branch. These failures affect various test suites such as NoMixOrderReductionTest, TestTransformersCUDA, TestTritonDotReduction, DistributeWithDeviceOrderTest, and are linked to hardware differences and recent ROCm upgrades.
  • [issues/166836, issues/167733, issues/167793, issues/168196, issues/168197, issues/168302]
  • Test Disabling on XPU Platform: The tests test_augment_trace_against_flop_counter_maxat0_xpu_float16 and test_augment_trace_against_flop_counter_maxat3_xpu_float32 in the TestAnalysisXPU suite were disabled due to consistent failures on the main branch. These issues indicate ongoing instability or incompatibility in the XPU testing environment.
  • [issues/168327, issues/168346]
  • Precision and Performance Issues in Inductor and Triton Kernels: There are precision mismatches in bf16 activation operators between eager ATen and Triton kernels, and a severe performance regression in nn.Conv3d forward pass with bfloat16 on NVIDIA H200 GPU was reported. Additionally, a Triton 3.6 pin update caused out-of-memory errors during compilation of a flex attention test. These problems highlight challenges in kernel accuracy, performance, and resource management.
  • [issues/168148, issues/168167, issues/168214]
  • Compiler and Dynamo Debugging Enhancements: Issues include type ID guards in TorchDynamo showing only numeric IDs, poor handling of functools.partial as dictionary keys in Dynamo, and a compile-time NotImplementedError triggered by nested view operations with in-place mutations in torch.compile. These reflect ongoing efforts to improve debugging clarity and compiler robustness.
  • [issues/168160, issues/168962, issues/169010]
  • Distributed and Async Compilation Limitations: Using async_op=True with dist.all_to_all_single causes graph breaks in torch.compile, with requests for clarification on support and timelines. This indicates current limitations in PyTorch’s compiler for async distributed collectives and a need for community guidance on contributing fixes.
  • [issues/168352]
  • CUDA and Triton Compatibility Issues: Enabling emit_multi_arch_kernel in Inductor with CUDA 12.6 fails due to Triton's bundled ptxas targeting an unsupported PTX version, causing kernel compilation errors. This mismatch between Triton's internal compiler tools and installed CUDA versions creates compatibility challenges.
  • [issues/168353]
  • Installation and Version Compatibility Problems: Users face installation failures due to incompatible torchvision versions for nightly torch builds with CUDA 12.9, which is unsupported on Windows. The result is version mismatch errors, and the issue was closed because the CUDA version is unsupported.
  • [issues/168953]
  • Symbolic Integer Hashing and Python 3.14 Compatibility: torch.Size objects containing symbolic integers do not compute hashes correctly under Python 3.14’s new tuple hash caching, caching zero instead of raising TypeError. This breaks libraries like einops that rely on this error for correct functionality.
  • [issues/168254]
  • Memory Management and Resource Release Issues: Pinned CPU tensors' memory is not immediately released back to the OS after deletion and garbage collection, demonstrated by a user unable to free 1 GiB of pinned memory. This points to inefficiencies in memory management and resource reclamation in PyTorch.
  • [issues/169160]
  • Segmentation Faults and Crashes: Segfaults occur when converting QInt32Storage to bfloat16 and when using tensors with Python 3.14’s free-threading build under multithreading, though the latter was resolved by a pull request. These crashes highlight stability issues related to dtype conversions and Python runtime changes.
  • [issues/169136, issues/169209]
  • Documentation and User Guidance Improvements: Outdated documentation on accessing the ctx object in torch.autograd.Function subclasses and requests for better type hints and auto-completion support indicate a need for clearer developer guidance on new API behaviors.
  • [issues/167843]
  • Bug Reports on Backend Behavior: The nn.LPPool1d module in Inductor fails with OverflowError for norm_type=float('inf'), and the nn.EmbeddingBag module on MPS silently ignores negative out-of-bound indices, causing inconsistent behavior compared to CPU and CUDA backends. These bugs affect correctness and error handling across backends.
  • [issues/167197, issues/169201]
  • User Experience and Community Contributions: A user shared their experience creating their first AI character named Arianna, reflecting community engagement beyond technical issues.
  • [issues/169041]
  • Miscellaneous Issues: Other problems include an LLVM error with missing symbols on Windows with NVIDIA RTX 4090, questions about upgrading cuDNN versions to address regressions, and disabling of the test_fx_annotations test on Linux and Windows due to recent code reverts.
  • [issues/169175, issues/169194, issues/169221]

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 182

Key Open Pull Requests

1. [xpu][feature] [2/2] Introduce XPUPluggableAllocator in frontend part: This pull request introduces the torch.xpu.memory.XPUPluggableAllocator and an API torch.xpu.memory.change_current_allocator in the frontend part of the PyTorch XPU memory management system, providing functionality analogous to existing CUDA APIs to enable dynamic allocator changes and improve memory allocation flexibility.

  • URL: pull/169043
  • Merged: No
  • Associated Commits: 1b0a1, 5c1e7, 43df7, 0438e, 0fb01, 06f6f, 96242, 5a4ea, 74233, dd8c9, d6d59, 8e748, c1aaf, 7ccf7, da19c, 8bccf, f8941

2. [xpu][feature] [1/2] Introduce XPUPluggableAllocator in cpp part: This pull request introduces the XPUPluggableAllocator in the C++ part of the PyTorch project as a foundational step, with the goal of keeping it simple and preparing for a subsequent pull request that will add related Python frontend code.

  • URL: pull/168966
  • Merged: No
  • Associated Commits: 5d18c, 38900, 5fa3e, ad5e0, 882c9, 9f689, bf630, 93367, b845b, e90ef, c74e7, 30919, 8e9e7, 90ba2

3. [ROCm][CI] separate out docker-caching: This pull request aims to improve the ROCm continuous integration workflow by separating out the docker caching process into distinct configuration files, enhancing modularity and maintainability.

  • URL: pull/169005
  • Merged: No
  • Associated Commits: 07970, 15ced, b91ad, d4020, fa8ba, 43d32, abd64, 8f513, bc1b7, 7e912

Other Open Pull Requests

  • DTensor argmax/argmin implementations: This pull request introduces efficient implementations of argmax and argmin operations for DTensor by leveraging min/max operators to perform linear reductions on sharded dimensions while dynamically computing global indices. It ensures accurate results during tensor gathering, improving DTensor's reduction capabilities. A plain-PyTorch sketch of the index-offset idea appears after this list.
    pull/168983
  • Overlap tracking and scheduling improvements: This pull request introduces tracking of overlap on a per-process-group basis and modifies scheduling to allow collectives on other process groups during exposed waits for better overlap. It also prioritizes reduce scatter operations before prefetching to reduce memory usage, carefully avoiding increased memory due to scheduling reduce scatters.
    pull/169019
  • Inductor combo kernel race condition fixes: This pull request fixes race conditions in Inductor's combo kernels with symbolic shapes that produce scalar outputs by adding missing store masks to sub-kernels. This ensures that only the appropriate threads perform writes during tl.store operations to prevent incorrect concurrent writes.
    pull/168939
  • Static Triton kernel launcher for XPU: This pull request officially enables the static Triton kernel launcher for the Inductor XPU backend in the PyTorch project. It also includes preparatory changes unifying configuration across Triton backends (CUDA, ROCm, and XPU) to support this feature.
    pull/168952, pull/168951
  • XPU caching allocator enhancements: This pull request enhances the XPU caching allocator by adding support for custom raw_alloc and raw_delete methods from a user-provided allocator within the memory pool. Additionally, it includes work-in-progress snapshot support for the allocator to improve memory management on XPU devices.
    pull/168957, pull/169203
  • Dynamo ConstDictVariable tracker improvements: This pull request decentralizes and enhances the key hash implementation for dictionary keys in Dynamo's ConstDictVariable tracker by introducing new protocol methods to VariableTracker subclasses for hashability checks, hash computation, and equality comparison. It simplifies centralized logic, improves maintainability and extensibility, and adds comprehensive test coverage with clearer error messages.
    pull/169204
  • Sparse tensor backward sum on MPS: This pull request implements the backward sum operation for sparse tensors on the MPS device in PyTorch. It is intended to be merged after a related pull request and includes several fixes and updates to tests.
    pull/169240
  • Inductor combo kernel variable name collision fix: This pull request fixes a crash in PyTorch's Inductor combo kernels caused by variable name collisions in generated Triton code when fusing operations with multi-dimensional tiled reductions. The fix corrects block size variable generation to use reduction dimension prefixes instead of sub-kernel indices.
    pull/168946
  • Intel Triton integration update for PyTorch 2.10: This pull request updates the Intel Triton integration within the [xpu][inductor] component to align with the PyTorch 2.10 release.
    pull/168950
  • CI tests update for Python 3.14 compatibility: This pull request updates the continuous integration tests that target Python 3.13 to be compatible with Python 3.14.
    pull/169032
  • Dynamo and FSDP tests migration for Intel GPU (XPU): This pull request migrates 13 Dynamo and FSDP test cases to support Intel GPU (XPU) by introducing device-agnostic methods, replacing CUDA-specific statements with GPU-agnostic ones, adding XPU checks in test logic, and refactoring device handling. This enables seamless testing on Intel GPUs while maintaining original code styles.
    pull/169241
  • Dynamo component test fixes: These pull requests fix local test failures related to logging and test state leakage issues in the dynamo component of the PyTorch project.
    pull/168927, pull/168928
  • Initial Pallas matrix multiplication support: This pull request introduces initial support for Pallas matrix multiplication within the torch/_inductor component of the PyTorch project.
    pull/168944
  • Code refactoring for constant variable checks: This pull request proposes refactoring the code by replacing instances of the type check isinstance(x, ConstantVariable) with the method call x.is_python_constant() to improve code clarity and maintainability.
    pull/169006
  • Custom non-integer indexing for images: This pull request introduces custom non-integer indexing capabilities that allow users to index images using arbitrary coordinate systems such as latitude/longitude for satellite imagery or nanometers for medical images. This enhances flexibility in data handling.
    pull/169040
  • Testing ignore update for torch.Tensor.__annotate__: This pull request proposes adding the torch.Tensor.__annotate__ method to the testing_ignore list to skip it during tests for the __torch_override__ functionality.
    pull/169013
  • ROCm platform testing improvements: This pull request removes outdated flaky models and enables deterministic algorithms on the ROCm platform to improve testing reliability and consistency.
    pull/169024
  • Side effects support in invoke_subgraph (draft): This draft pull request proposes adding support for handling side effects within the invoke_subgraph function in the PyTorch codebase.
    pull/169045
  • Symbolic integer lifting in Dynamo and HOPS: This pull request enhances the Dynamo and HOPS components by enabling symbolic integers (symints) to be lifted as inputs for subgraphs, improving the handling of symbolic shapes within PyTorch.
    pull/169091
  • Static CUDA launcher refactor to static Triton launcher: This pull request refactors the static CUDA launcher into a static Triton launcher by renaming the class StaticallyLaunchedCudaKernel to StaticallyLaunchedTritonKernel (with the former inheriting from the latter), renaming related files, and restructuring code to enable reuse by the XPU Triton backend.
    pull/169121
  • MPS backend grid_sampler_2d_backward implementation: This pull request implements the grid_sampler_2d_backward function for the Metal Performance Shaders (MPS) backend in PyTorch, including kernel and backend code, parameter struct definitions, dispatch registration, and tests for both float32 and float16 data types.
    pull/169143
  • Runtime estimation alignment for multi-process groups: This pull request improves PyTorch Inductor by aligning runtime estimations across all process groups in multi-dimensional parallelism setups to ensure deterministic compiler decisions. It replaces the previous method that only used the default process group and includes scheduler updates and a new distributed test for multiple process groups.
    pull/168979
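
The index-offset idea behind the DTensor argmax/argmin pull request above can be illustrated in plain PyTorch; this sketch mimics sharding with torch.chunk and is not the actual DTensor implementation.

```python
import torch

full = torch.randn(4, 12)
shards = torch.chunk(full, 3, dim=1)        # pretend dim 1 is sharded three ways

local_vals, local_idx, offset = [], [], 0
for shard in shards:
    vals, idx = shard.max(dim=1)            # local max/argmax per shard
    local_vals.append(vals)
    local_idx.append(idx + offset)          # shift local indices to global ones
    offset += shard.shape[1]

vals = torch.stack(local_vals)              # (num_shards, rows)
idx = torch.stack(local_idx)
winner = vals.argmax(dim=0)                 # which shard holds the global max
global_idx = idx.gather(0, winner.unsqueeze(0)).squeeze(0)

assert torch.equal(global_idx, full.argmax(dim=1))
```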

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 75

Key Closed Pull Requests

1. [DO NOT MERGE] test nightly rock yml file: This pull request is about testing and updating the nightly ROCm build workflow by fixing hardcoded ROCm paths, removing deprecated components, adding helper scripts, and integrating a new nightly workflow configuration, although it was not merged.

  • URL: pull/169020
  • Merged: No
  • Associated Commits: 37842, 5cad9, 684d8, 0d098, 26d49, eeb23, 9c40b, ffce5, 92707, 038d3

2. Add optimizer tests in operator microbenchmarks: This pull request adds comprehensive benchmarks for PyTorch optimizers, including AdamW, Adam, SGD with momentum, RMSprop, and Adagrad, to measure the performance of optimizer.step() across various parameter counts and sizes in operator microbenchmarks.

  • URL: pull/168101
  • Merged: No
  • Associated Commits: 074df, abfc5, 9dfb3, 2705d, c0943, 1066c, c4465

3. [3.14] Fix nn.Module annotations lookup: This pull request addresses a fix for the nn.Module annotations lookup in the PyTorch project, specifically targeting issues found in non-threaded builds and originally identified in a related pull request.

  • URL: pull/168325
  • Merged: No
  • Associated Commits: fd67a, a6a9e, 6d428, 49dc4, c8b70

Other Closed Pull Requests

  • Dynamo support for user-defined and functools.partial dictionary keys: Multiple pull requests focus on improving Dynamo's handling of user-defined objects and functools.partial objects as dictionary keys. These changes address tracing issues and error message improvements, although the functools.partial support PR was not merged.
    • pull/169028, pull/169016
  • Invoke_subgraph and effectful operations management: Several pull requests enhance the handling of effectful operations within invoke_subgraphs by tracking effect tokens and fixing redundant executions. These updates improve subgraph side effect management and prevent double execution in autograd eager implementations.
    • pull/167231, pull/167245, pull/167363
  • Dynamo test stability and isolation improvements: A group of pull requests aim to fix test failures and prevent state leakage in Dynamo tests by isolating tests, fixing configuration leaks, and ensuring unique custom operation names. These efforts improve test reliability and reduce interference between tests.
    • pull/168914, pull/168915, pull/168924, pull/168925
  • Performance optimizations and caching improvements: Pull requests introduce a static onednn context cache for qlinear CPU operations and optimize S3 artifact downloads to improve performance. These changes reduce overhead and speed up processing in critical paths.
    • pull/168150, pull/168948
  • Profiling and synchronization enhancements: One pull request adds explicit device synchronization before starting a trace in the PyTorch profiler to ensure accurate profiling results without overhead. This prevents event carryover between profiling phases.
    • pull/168920
  • Error message and bug fixes: Several pull requests improve error messages for Partial objects and fix bugs such as silent ignoring of out-of-bounds indices in the MPS backend and CUDA error message truncation. These fixes enhance user feedback and backend correctness.
    • pull/168212, pull/169205, pull/168942, pull/168936
  • Codebase refactoring and cleanup: Pull requests refactor strategy and rule registration to resolve circular imports and remove unnecessary uses of thrust::tuple as preparation for CCCL transition. These changes improve code maintainability and modularity.
    • pull/167788, pull/168926
  • Benchmark cache and CI environment fixes: One pull request prevents segfaults caused by ExecutionPlan teardown by leaking BenchmarkCache and updates the compile-time cuDNN version in CI as a test fix.
    • pull/169153
  • PyTorch inductor backend fixes: A pull request fixes the selection of the wrong contiguous node during mix-order reduction fusion in the inductor backend due to dynamic shapes.
    • pull/168371
  • FP8 quantized convolution CPU fixes: One pull request addresses issues related to the fp8 quantized convolution implementation on CPU.
    • pull/167611

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits  Pull Requests  Issues  Comments
jeffdaily          26       1              493     4
malfet             94       11             5       84
cyyever            137      37             0       14
Skylion007         5        1              4       170
ezyang             65       8              25      68
williamwen42       102      14             9       32
anijain2305        100      9              7       22
mlazos             120      8              1       9
guangyey           100      16             2       17
mikaylagawarecki   104      14             0       9
