Weekly Project News


Weekly GitHub Report for PyTorch: August 04, 2025 - August 11, 2025 (22:41:01)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new performance control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, a backward-incompatible security improvement changing the default of torch.load to weights_only=True, and the deprecation of official Conda package publishing, reflecting a trend toward improved performance, security, and streamlined deployment.
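
Two of these changes are easy to see in a few lines. The sketch below (the checkpoint filename is illustrative) shows the new torch.load default and one use of torch.compiler.set_stance; the "force_eager" stance is one example of the controls the new API exposes.

    import torch

    # Since 2.6, torch.load defaults to weights_only=True, which restricts
    # unpickling to a safe allowlist of types (tensors, primitives, containers).
    torch.save({"weight": torch.randn(2, 2)}, "model.pt")
    state = torch.load("model.pt")                          # same as weights_only=True
    state_any = torch.load("model.pt", weights_only=False)  # explicit opt-out for arbitrary pickles

    # torch.compiler.set_stance gives runtime control over compilation behavior,
    # for example temporarily forcing eager execution of compiled functions.
    torch.compiler.set_stance("force_eager")
    torch.compiler.set_stance("default")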

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. torch.compile regression on side-effects between torch 2.7.1 and 2.8 (final RC): This issue reports a regression in PyTorch’s torch.compile feature between versions 2.7.1 and 2.8.0, where a previously working pattern of temporarily renaming module parameters via pre- and post-forward hooks breaks due to unsupported tracing of the __delattr__ method on nn.Module. The user provides a minimal reproducible example demonstrating that the compiled model fails with an error related to Dynamo’s inability to handle __delattr__, and the discussion centers on whether this change was intentional, potential workarounds, and plans for a fix in an upcoming patch release. A sketch of the hook pattern in question appears after this list.

    • The comments confirm the regression on multiple platforms and clarify that the behavior in 2.7.1 was likely a bug fixed in 2.8, though this breaks certain use cases relying on parameter renaming hooks. Contributors discuss alternative approaches and acknowledge the need to restore this functionality, with a fix proposed and targeted for inclusion in the 2.8.1 release, while the issue remains open pending the cherry-pick and merge of the fix.
    • Number of comments this week: 12
  2. [DTensor] Decide / Document RNG semantics: This issue discusses how DTensor should handle random number generator (RNG) semantics when a user passes their own RNG to DTensor random operations, specifically whether the RNG state advancement should be visible to the user. It highlights the current discrepancy where DTensor maintains an alternate copy of the default RNG state and proposes aligning the behavior so that DTensor updates the global RNG state and the passed-in RNG state consistently, improving usability and predictability.

    • The comments clarify that the proposal involves DTensor no longer maintaining a separate RNG copy but instead using and advancing the global RNG state, requiring users to ensure RNG consistency across distributed ranks. Participants generally agree this matches user expectations, discuss synchronization details, express concerns about potential desynchronization, and suggest a debug mode to catch issues, while also noting that DTensor’s RNG state advancement may differ from non-DTensor runs due to offset usage.
    • Number of comments this week: 11
  3. cuda 13 broken: This issue reports that compiling PyTorch with CUDA 13 is currently broken due to incompatibilities, specifically related to the new NCCL support for Thor/Spark and errors involving missing members in CUDA device properties. The user shares build logs and error messages, and the discussion includes suggestions to cherry-pick a related pull request or disable TensorPipe, but further compilation problems persist, including compiler segmentation faults during tests.

    • The comments reveal attempts to clarify the issue with missing links and logs, identification of a specific error tied to a recent TensorPipe PR, and advice to disable TensorPipe or reduce build load to avoid compiler crashes; despite these efforts, the user continues to encounter build failures, indicating that CUDA 13 support requires additional fixes.
    • Number of comments this week: 11
  4. Enable CUDA 13.0 binaries: This issue requests enabling CUDA 13.0 binaries in the project, highlighting the major upgrades and benefits such as support for new NVIDIA architectures, significant binary size reduction, improved math libraries, and unified Arm platform builds. The discussion focuses on clarifying the new hardware capabilities introduced in CUDA 13.0 compared to CUDA 12.x versions, managing build workflow complexity, and ensuring stable CUDA 12 builds remain while adopting CUDA 13.0 features.

    • The comments include requests for clearer explanations of CUDA 13.0 benefits, detailed clarifications on hardware architecture support differences between CUDA 12.8, 12.9, and 13.0, and suggestions to reduce the number of CUDA 12 builds to manage build resource demands. Contributors also discuss the unification of Arm platforms under CUDA 13.0, upcoming hardware launches, and updates on related libraries like cuDNN and NCCL.
    • Number of comments this week: 10
  5. Regression on compile with backend inductor with torch 2.8: This issue reports a regression in PyTorch 2.8 where compiling a model's forward method with the "inductor" backend on CUDA causes a runtime assertion failure related to autograd's stream handling, which did not occur in version 2.7.1. The problem stems from a mismatch between expected CPU metadata and actual CUDA inputs for zero-dimensional tensors during autograd evaluation, revealing a longstanding issue exacerbated by recent changes in autograd's stream checks and the use of FakeTensor subclasses in compilation.

    • The comments clarify that the error only occurs on CUDA and involves autograd's handling of zero-dimensional CPU tensors used on CUDA, causing a stream assertion failure. It is noted that this issue predates torch.compile and arises from an autograd recovery mechanism that moves gradients to match metadata devices, which is not properly handled with FakeTensor during tracing. The issue is acknowledged as a known bug and marked for backporting to the 2.8.1 release.
    • Number of comments this week: 8
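
For context on the first item, the sketch below shows the general shape of the pattern the issue describes, with made-up names rather than the reporter's code: a pre-forward hook swaps a registered parameter for a derived tensor under the same attribute name, and a post-forward hook deletes the temporary attribute and restores the parameter; the attribute deletion is what 2.8's Dynamo reportedly cannot trace.

    import torch
    import torch.nn as nn

    class Tiny(nn.Module):
        def __init__(self):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(4, 4))

        def forward(self, x):
            return x @ self.weight

    saved = {}

    def pre_hook(module, args):
        # Temporarily replace the registered parameter with a derived plain tensor.
        saved[module] = module._parameters.pop("weight")
        module.weight = saved[module] * 2.0

    def post_hook(module, args, output):
        # Deleting the temporary attribute goes through nn.Module.__delattr__,
        # which the issue reports Dynamo failing to trace in 2.8.
        delattr(module, "weight")
        module._parameters["weight"] = saved.pop(module)

    m = Tiny()
    m.register_forward_pre_hook(pre_hook)
    m.register_forward_hook(post_hook)
    out = torch.compile(m)(torch.randn(2, 4))  # reported to work on 2.7.1 and fail on 2.8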

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's Inductor backend during model compilation. The user provides detailed environment information, including PyTorch version 2.4.0 development build, CUDA 12.1, and Ubuntu 22.04, and shares a code snippet demonstrating the error occurring while compiling parts of a pipeline with torch.compile.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, with demonstrated speedup in testing. A small numerical check of the underlying identity appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, handling known formatting-related problems, and organizing the work by directory to facilitate incremental and reviewable changes.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, such as .debug_pkl, from the JIT archive to reduce file size. The motivation stems from observations that these debug files significantly increase the archive size without affecting model correctness, which is particularly important for deploying smaller, quantized models on mobile devices.
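
The MaxPool2D proposal in item 2 rests on a simple identity that is easy to verify numerically: with stride 1, a 5x5 max pool equals two chained 3x3 max pools, since the max over a 5x5 window is the max of overlapping 3x3-window maxima. The snippet below is an illustrative check of that identity (input size and padding chosen arbitrarily), not the proposed optimization itself.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 32, 32)

    # One 5x5 max pool with stride 1 ...
    pooled_5 = F.max_pool2d(x, kernel_size=5, stride=1, padding=2)

    # ... versus two chained 3x3 max pools with stride 1.
    pooled_3_twice = F.max_pool2d(
        F.max_pool2d(x, kernel_size=3, stride=1, padding=1),
        kernel_size=3, stride=1, padding=1,
    )

    print(torch.equal(pooled_5, pooled_3_twice))  # True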

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 97

Summarized Issues:

  • Gradient scaling and stability improvements: A new PyTorch module nn.GradBank is proposed as a drop-in wrapper to adaptively rescale gradients using rolling statistics, aiming to prevent vanishing and exploding gradients in deep neural networks. Empirical validation shows it significantly reduces gradient variance with minimal impact on stable models.
  • issues/159765
  • Performance issues with quantization and Triton kernels: The inductor code generation for float8 dynamic quantization in the backward pass of scaled_grouped_mm is significantly slower than expected, with Triton-generated quantization kernels as slow as main GEMM computations. Additionally, the Triton-based scaled_mm kernel produces incorrect results on NVIDIA H100 GPUs, causing test failures due to large mismatches.
  • issues/159769, issues/159940
  • FlexAttention enhancements and compilation issues: Proposals to enhance torch.nn.attention.flex_attention include extracting statistics from attention weights without full matrix materialization. However, the FlexAttention backward pass fails to compile with the Triton compiler on NVIDIA B200 GPUs, and there are questions about compatibility and memory benefits when combining Selective Activation Checkpointing with compiled flex_attention.
  • issues/159770, issues/159074, issues/159970
  • CUDA and hardware support updates: Integration of CUDA 13.0 binaries is underway to leverage new hardware support and improved compilation performance, but building PyTorch with CUDA 13 currently fails due to incompatibilities and missing members. There is also an RFC discussing CUDA version support for PyTorch 2.9, and a proposal to add CUDA Green Context support to FSDP2 for better throughput.
  • issues/159779, issues/160104, issues/159980, issues/160272
  • PyTorch compilation and torch.compile regressions: Several issues report regressions and bugs related to torch.compile, including loss of item subscription support on compiled ModuleDict, inconsistent outputs with sliding window cache updates, segmentation faults on Intel ARC XPU hardware, and failures compiling pre- and post-forward hooks due to unsupported attribute deletion.
  • issues/159831, issues/159855, issues/159974, issues/159958
  • Distributed and parallelism challenges: Problems include ambiguous behavior of the @require_world_size(4) macro causing test failures, divergence of parameters when combining data parallelism and tensor parallelism due to non-deterministic effects, and Intel Gaudi devices being incorrectly assigned a fake backend causing distributed regressions.
  • issues/159987, issues/160169, issues/159945
  • ONNX and export limitations: Exporting models with tensor subclasses breaks state dictionary equivalence due to decomposition and inconsistent naming, and exporting models using custom autograd functions with bfloat16 constants fails due to lack of ONNX support. Additionally, exporting a model in eval mode fails due to torch.typename calls skipped during tracing.
  • issues/159918, issues/159928, issues/160241
  • Documentation and usability gaps: There is a lack of official documentation on device mesh usage in DistributedDataParallel with torch.compile, missing Google style docstrings in torch._dynamo, and unclear gradient behavior documentation for torch.min and torch.max. Requests also include updating Python version support on the PyTorch website and improving documentation for random number generator state semantics in DTensor.
  • issues/159836, issues/159886, issues/160273, issues/160246, issues/159991
  • Backend and platform-specific failures: Failures and regressions are reported on multiple platforms including CUDA initialization errors on Ubuntu with RTX 3060, ROCm MIOpen dropout compilation errors, MPS backend runtime errors for certain operations, and Intel XPU test failures due to recent changes in accelerator detection logic.
  • issues/159954, issues/160141, issues/160208, issues/160232
  • Memory and resource management issues: Problems include increased GPU memory usage after compilation, out-of-memory errors with default symmetric memory backend in distributed setups, and CUDAGraphs overwriting tensor outputs causing runtime errors. There are also issues with the new at::HostAllocator interface limiting multiple allocator registrations.
  • issues/160247, issues/160289, issues/160281, issues/159906
  • Build and CI environment problems: The hardcoded PYTHONPATH in CI scripts breaks test.sh usage in some environments, and the torchrun executable uses a hardcoded shebang ignoring virtual environments. Additionally, building PyTorch on CentOS Stream 10 with GCC 14 fails due to out-of-bounds array access warnings treated as errors.
  • issues/160193, issues/160130, issues/159962
  • Tensor and API behavior inconsistencies: Issues include torch.nn.functional.sigmoid producing different outputs on CPU vs GPU for complex inputs, torch.cond() raising inconsistent errors with symbolic predicates, and torch.nn.functional.pad with mode="circular" not working correctly for 4D/5D inputs despite error messages.
  • issues/159870, issues/159852, issues/160053
  • Compilation and runtime errors with FX and Triton: Bugs in PyTorch Inductor cause incorrect FX graph segments for multiple Triton kernels in large reductions, and Triton device-side functions for nvshmem cause compilation errors and CUDA illegal memory access. There are also assertion failures due to FakeTensorMode mismatches when compiling with inductor backend.
  • issues/160124, issues/160137, issues/160057
  • Feature requests and improvements: Requests include adding immutable assignment to tensors similar to JAX, enabling autoformatter to convert old style type annotations to modern syntax, adding officially distributed statically linked libtorch libraries, and releasing a prebuilt PyTorch Vulkan backend for broader hardware support.
  • issues/159784, issues/159866, issues/159947, issues/160230
  • Testing and flaky test management: Several tests are disabled on the XPU platform due to consistent failures, and there are issues with unstable GraphQL queries causing instability in the "Check Labels" process and mergeability checks for ghstack pull requests.
  • issues/160243, issues/160244, issues/160245, issues/159894, issues/159899
  • Miscellaneous bugs and regressions: These include a bug where strict export fails to capture side effects, a RuntimeError when calling .view(dtype=torch.int8) on certain tensors, and a regression where LBFGS optimizer triggers warnings about converting tensors with requires_grad=True to scalars. A short illustration of the view(dtype) constraint appears after this list.
  • issues/159787, issues/159932, issues/160197
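
As context for the .view(dtype=torch.int8) item in the last group, dtype-reinterpreting views are only legal under specific layout conditions, most notably that the last dimension be contiguous when moving to a smaller element size. The sketch below uses a made-up tensor, not the reporter's case, to show one call that works and one that raises.

    import torch

    x = torch.arange(8, dtype=torch.int64).reshape(2, 4)

    # Reinterpreting int64 storage as int8 multiplies the last dimension by 8: (2, 4) -> (2, 32).
    print(x.view(torch.int8).shape)  # torch.Size([2, 32])

    # The last dimension must have stride 1; a transposed view violates that and raises.
    t = x.t()
    try:
        t.view(torch.int8)
    except RuntimeError as err:
        print("RuntimeError:", err)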

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 44

Summarized Issues:

  • CUDA and GPU Kernel Support Issues: Several issues address the need for improved CUDA kernel implementations and GPU support, including a proposal for a Triton-based CUDA kernel to replace inefficient CPU fallbacks for quantized inference, requests to add CUDA support for the NVIDIA RTX 5070 Ti GPU, and a missing implementation of the aten::grid_sampler_3d operator on Apple MPS devices causing runtime errors. These issues highlight performance bottlenecks and hardware compatibility gaps that affect efficient GPU utilization in PyTorch.
    • issues/158849, issues/159847, issues/159851, issues/160237
  • Test Failures and Disabled Tests on XPU Platforms: Multiple test cases such as test_addmm_dtype_mismatch, test_repeat_interleave_2_dynamic_shapes_xpu, test_comment_graph_fragment, test_hop_eager, and test_hop have been disabled or failed consistently on XPU platforms, indicating ongoing instability or platform-specific issues. These failures have led to temporary test disables and require further investigation to ensure test suite reliability across hardware backends.
    • issues/159631, issues/159803, issues/159925, issues/159950, issues/159951
  • Distributed Training and Profiling Issues: Problems in distributed training include warnings from torchrun about hostname retrieval failures causing delays, and discrepancies in thread ID reporting by the Kineto profiler on NVIDIA platforms, which affect profiling accuracy. These issues impact debugging and performance monitoring in distributed and GPU-accelerated environments.
    • issues/159007, issues/159771
  • Compiler and Exporter Bugs: Several bugs affect PyTorch's compiler and model export functionality, including a SpecViolationError when exporting models using torch.cond with Triton kernels, a regression causing runtime errors with torch.jit.script and @torch.compiler.disable, and ONNX exporter crashes due to unhandled None outputs in FX nodes. These issues disrupt model serialization and scripting workflows.
    • issues/159955, issues/160059, issues/160150
  • Performance and Memory Management Concerns: Performance regressions and memory management issues are reported, such as the DeviceCopy IR node layout mismatch causing extra memory copies and degraded QPS, the lack of runtime improvement in torch.autograd.grad_mode.inference_mode, and requests to extend torch.cuda.empty_cache() to support clearing specific memory pools. These highlight inefficiencies and limitations in current memory and execution optimizations.
    • issues/159612, issues/159633, issues/160069
  • Build and CI Infrastructure Problems: Issues include PyTorch CI jobs failing to create EC2 instances causing queue delays, build failures on riscv64 due to outdated Cargo versions, and environment variable mismanagement during source builds causing module import errors. These infrastructure problems hinder continuous integration and cross-platform build reliability.
    • issues/159651, issues/160170, issues/160092
  • Documentation and Naming Confusions: Confusing or incorrect documentation and naming are reported, such as the misnamed HAS_CUDA constant in inductor tests, unclear error messages for unsupported complex values on MPS backend, and duplicate version entries in the PyTorch documentation dropdown. These issues cause user confusion and reduce clarity in the codebase and docs.
    • issues/159399, issues/159637, issues/159972, issues/160034
  • Numerical and Data Type Inconsistencies: Problems include inconsistent results in addmv for bfloat16 tensors on x86, incorrect output from torch.sparse.to_sparse_semi_structured for certain dtypes, and overflow errors when creating uint64 tensors exceeding int64 range with torch.full. These issues affect numerical correctness and data type handling.
    • issues/159872, issues/159168, issues/159960
  • Release and Version Support Limitations: The PyTorch 2.2.2 version does not support ONNX opset 21, limiting export capabilities, and there is a proposal to remove MacOS-13 support due to its upcoming end-of-life and maintenance burden. These reflect challenges in maintaining compatibility and support for evolving platforms and standards.
    • issues/159630, issues/159275
  • Security and Vulnerability Fixes: A reported vulnerability related to improper resource shutdown (CVE-2025-3730) was addressed and patched in nightly builds and the 2.8.0 release, improving PyTorch's security posture.
    • issues/159963
  • Feature Requests for Architecture and Backend Support: Requests include adding RISCV CPU support via detection of the __riscv macro, integrating the cutlass-sycl submodule for XPU support, and clarifying allocator behavior for zero-size malloc requests in CUDAPluggableAllocator. These aim to expand hardware compatibility and clarify backend behaviors.
    • issues/160171, issues/160176, issues/159892
  • Reproducibility and Randomness Issues: The behavior of torch.multinomial when sampling multiple values differs from repeated single-value sampling with the same seed, and bfloat16 inference results are not reproducible across platforms, causing challenges in deterministic model evaluation. A brief multinomial comparison appears after this list.
    • issues/159927, issues/159846
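
The torch.multinomial point in the last group is easy to illustrate: with an identical seed, drawing several samples in one call and drawing them one at a time consume the generator differently, so matching results should not be expected. The snippet below uses arbitrary probabilities and is illustrative rather than the issue's reproducer.

    import torch

    probs = torch.tensor([0.1, 0.2, 0.3, 0.4])

    torch.manual_seed(0)
    batched = torch.multinomial(probs, num_samples=3, replacement=True)

    torch.manual_seed(0)
    one_at_a_time = torch.cat(
        [torch.multinomial(probs, num_samples=1, replacement=True) for _ in range(3)]
    )

    # Same seed, different RNG consumption patterns, so the draws generally differ.
    print(batched, one_at_a_time)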

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 222

Key Open Pull Requests

1. [vllm in torch ci ][step 1/3] add build logics: This pull request adds the build logic for integrating the vllm project into the PyTorch continuous integration pipeline by setting up a CLI tool, modifying Dockerfiles, and creating configuration files to enable building and testing vllm and its dependencies alongside PyTorch artifacts.

  • URL: pull/159815
  • Merged: No
  • Associated Commits: f9dbb, a8428, f4230, c0470, 218a9, c6ef9, e6abe, 0ba75, 11b7f, 9a327, 790ba, 2757b, 00271, b952b, 6ff65, 77aee, c6672, c24b7, d8cf0, b418a, 79413, 4e900, 2ef6e, c655d, ce9d1, 5cba6, 61593, 8cd9f, ab198, 75a3b, af066, b30a4, 51092, 1bebb, 7b401, 39b5b, 0223e, 8b16c, 584fe, 1184e

2. [3/3][ghstack][vllm ci build setup]vllm build workflow: This pull request proposes the third part of a stacked series to set up the vllm build workflow within the PyTorch continuous integration system, aiming to integrate and automate the build process for the vllm component.

  • URL: pull/160116
  • Merged: No
  • Associated Commits: ef598, b2583, 9aa2a, eec74, 01842, e8187, 3499d, d4bad, 571ac, 9730f, 1e965, f0950, b6767, 896b0, b26e6, 28e80, 93801, 8ba95, 998d2, 84189, 7c393, a9048, 06bc7, acf5e, d79c4, 178a0, e1096, 87cb5, 973d5, 05542, d783e, 094be, 0cce7, 1ab86

3. [2/3 step][ vllm ci build setup] Add vlllm buld logic and dockerfile : This pull request adds build logic and a temporary Dockerfile for the vllm project, introduces a BuildRunner abstract class to standardize build processes, and includes unit tests to support the vllm continuous integration build setup.

  • URL: pull/160089
  • Merged: No
  • Associated Commits: e9261, 92ea8, a30a2, 08f54, 5feae, b7977, fb1d1, 721ab, bf1a5, 5e89d, 2562a, 14bcd, ecd02, 11140, 35868, 0ebab, 6c0ab, 22fa4, f529a, 73585, 622ad, 676ec, 2276b, ef65b, 82270, dd1a8, 2263c, f52aa, da6db, bd6ea, 7ae21

Other Open Pull Requests

  • Base updates: Multiple pull requests propose a series of base updates to the PyTorch project, all marked with "[ghstack-poisoned]" and titled "Update (base update)". These updates aim to fix an unspecified issue referenced as #ISSUE_NUMBER and appear repeatedly across several PRs.
    • pull/160117, pull/160090, pull/160045
  • Dynamo improvements: Several pull requests focus on improving the Dynamo component by installing dictionary watchers for recursive dictionary tag optimization, fixing tag safeness propagation, and preventing excessive recompilations caused by nested graph breaks. These changes aim to enhance performance and stability within Dynamo's guard and compilation systems.
    • pull/159796, pull/159807, pull/159786
  • Build and CI workflows: Multiple pull requests introduce and improve build workflows and continuous integration setups, including adding a USE_SEQUENTIAL option for sequential wheel building, installing ccache, setting up CI workflows for testing, and improving build workflows for torch_cli and vllm components. These efforts aim to streamline and optimize the build and test processes.
    • pull/159821, pull/160149, pull/160146, pull/160120
  • vllm CLI integration: Two pull requests work on adding and testing a command-line interface for the vllm component, including adding a vllm placeholder in the CLI with argparse, unit tests, and configuration templates. These are initial steps toward integrating and testing vllm within the PyTorch CLI.
    • pull/160043, pull/160146
  • Flash attention and decoding: Pull requests add flash attention implementation to the flex attention module and introduce flash decoding support in the CPU backend with kernel options to optimize parallelism. These changes also re-enable previously disabled CPU backend unit tests to ensure correctness.
    • pull/160109, pull/159835
  • Triton kernel support and fixes: Several pull requests address Triton kernel issues, including making user-defined Triton kernels serializable for fx_graph_runnable, fixing mask propagation in the Triton bucketize operation, improving kernel launch performance by changing constexpr argument injection, and fixing AOTAutogradCache to detect source changes in Triton kernels hidden in custom ops.
    • pull/160002, pull/159961, pull/160000
  • CuteDSL template support: One pull request introduces support for compiling CuteDSL templates by adding a fixed CuteDSL template implementation for tensor addition, including kernel definitions, JIT wrappers, autotuning selection, and tests verifying correctness within the PyTorch inductor backend.
    • pull/160108
  • CPU backend test guards: A pull request adds guards to CPU C++ wrapper tests to ensure they only run when the CPU backend has a C++ wrapper, preventing irrelevant test failures on other CPU backends without such wrappers. An illustrative guard of this shape appears after this list.
    • pull/159850
  • Distributed CPU-only support: One pull request enables distributed PyTorch modules to be importable and runnable on CPU-only builds by generating Python stubs for the c10d C extension when unavailable, mediated through a new module.
    • pull/159889
  • CI configuration fix: A pull request removes the app field from the GH_CHECKSUITES_FRAGMENT in the CI configuration to fix access permission errors caused by GITHUB_TOKEN restrictions, resolving failing workflows and related issues.
    • pull/160056
  • Platform support and fixes: One pull request adds support for Linux aarch64 and Windows Python 3.14 nightly builds, including several fixes to ensure proper integration.
    • pull/159869
  • Peak memory estimation improvements: A pull request introduces an alternative peak memory estimation method accounting for multiple phases within SchedulerNodes, applies it to communication collective reordering, tracks buffer deallocation points, and adds an environment variable to limit reordered collectives, aiming to better preserve peak memory usage.
    • pull/160113
  • ABI compatibility in AOTInductor: One pull request addresses ABI compatibility issues by modifying RecordFunction implementation to avoid injecting Aten into codegen, instead using C10::IValue to call the full record function, with plans to extend profiling information later.
    • pull/159842
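
As a rough illustration of the test-guard idea mentioned earlier in this list (not the pull request's code; the config attribute and helper name here are assumptions), a guard of this shape skips C++-wrapper-specific CPU tests when the active CPU backend provides no C++ wrapper:

    import unittest

    import torch._inductor.config as inductor_config

    def cpu_backend_has_cpp_wrapper() -> bool:
        # Assumption for illustration: only the default "cpp" CPU backend ships a C++ wrapper.
        return getattr(inductor_config, "cpu_backend", "cpp") == "cpp"

    @unittest.skipUnless(cpu_backend_has_cpp_wrapper(), "CPU backend has no C++ wrapper")
    class CpuCppWrapperTests(unittest.TestCase):
        def test_smoke(self):
            self.assertTrue(True)

    if __name__ == "__main__":
        unittest.main()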

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 270

Key Closed Pull Requests

1. [SymmMem] Send tensors with unerased type information to NVSHMEM Triton kernels: This pull request introduces a triton.jit wrapper over core NVSHMEM extern functions to enable sending typed Triton tensors with unerased type information directly to NVSHMEM Triton kernels, abstracting away manual byte-size calculations and raw pointer handling to simplify development and support local tensor operations, while noting that tensor-aware implementations of certain signaling functions remain to be completed due to address resolution challenges.

  • URL: pull/159788
  • Merged: No
  • Associated Commits: f961e, 460f7, 9104d, dfe1a, c32bd, 98e9d, c04be, 06423, de53d, c65e8, 25787, 79557, 73dcb, 2ed02, 0b050, 79fd9, b8461, 9534b

2. [SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch: This pull request proposes a refactor of the NVSHMEM Reduction API by introducing a single, generic Triton-extern wrapper function, nvshmem.reduce, that unifies and simplifies team-based reductions across all supported operations and data types with automatic dtype-based dispatch and normalization for ergonomic use within Triton kernel launches.

  • URL: pull/159755
  • Merged: No
  • Associated Commits: 1475e, 46a38, 423ec, f747f, 5e3b4, 5ea94, 3468b, 740e7, 25a3d, 6c64d, a7473, ec58c, c82c1, 9b8cd, ba1f6

3. [SymmMem] Add helpful docstrings for all NVSHMEM APIs : This pull request adds helpful and verified docstrings for all NVSHMEM APIs in the SymmMem component by leveraging Claude Code NVSHMEM documentation to improve code clarity and maintainability.

  • URL: pull/159756
  • Merged: No
  • Associated Commits: 69348, 89c67, 85449, eebd5, f37ff, bef8d, 1466b, 7ec67, 53dd6, 72d5a, 0b6b8, 686de, 50937, 7c775, ca544

Other Closed Pull Requests

  • NVSHMEM Triton Kernel Optimizations: Multiple pull requests focus on optimizing NVSHMEM Triton kernels by initializing the NVSHMEM module only for relevant kernels and implementing sum, min, and max reduction collective operations for int64 data types. These changes improve efficiency and enable parallel reduction computations across processing element teams.
    • pull/159734, pull/158515, pull/160102
  • Inductor Backend Improvements: Several pull requests enhance the Inductor backend by integrating kernacle configuration lookup during lowering and scheduling, allocating non-blocking copy destinations in pinned memory, adding debug symbol generation support, fixing memory estimation for buffers with multiple mutations, and proposing moving CPU scalar values to pinned memory (not merged). These updates aim to improve performance, debugging, and memory management in Inductor.
    • pull/159929, pull/158758, pull/159938, pull/159569, pull/158983
  • MPS Backend Enhancements: Pull requests address issues in the MPS backend by implementing a coalesce function optimized for Metal Performance Shaders, enabling dynamic shapes tests, and changing kernel initialization from lazy to upfront to avoid errors in higher-order operation subgraphs. These changes improve sparse tensor operations and stability on Apple devices.
    • pull/159729, pull/159753, pull/159433
  • Triton Kernel and CUDA Launcher Updates: Updates include renaming the macro HAS_CUDA to HAS_CUDA_AND_TRITON, adding support for the new profile_scratch argument in Triton kernels, and fixing related launcher and calling logic issues. These changes ensure compatibility with Triton versions 3.2 through 3.4 and improve testing infrastructure.
    • pull/159883, pull/159772
  • Testing and CI Fixes: Pull requests fix XPU continuous integration test failures in Inductor unit tests and reland the setup of TorchBench within a Docker environment while maintaining macOS CI compatibility. These efforts address test stability and CI permission issues.
    • pull/159759, pull/159300
  • Memory and Index Handling Improvements: Changes include adding an overflow validation check in the pad_sequence function to prevent int64 overflow, and preventing the use of int32 indices when the upper bound exceeds int32 max value. These updates improve robustness in tensor padding and indexing.
    • pull/159589, pull/159433
  • Developer Experience and Code Modernization: Pull requests upgrade the C++ standard to C++17, update dependencies, replace Boost optional with std::optional, add strict type coverage to torch/_dynamo/utils.py, and improve naming conventions in generated C++ projects. These changes enhance code quality, compatibility, and developer tooling.
    • pull/159834, pull/159580, pull/159456, pull/158560
  • Cutlass Kernel Flexibility: One pull request enables the Cutlass kernel to accept offsets as arguments, allowing more flexible and dynamic kernel execution configurations.
    • pull/159761
  • IValue Unsigned Integer Support: A pull request adds support for unsigned integers to the IValue type by refactoring int64 and uint64 saving logic and includes a regression test to ensure correct dispatch of uint64 values.
    • pull/160102

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
yangw-dev            385             46       5        21
malfet               112             19       7       126
ezyang                94             19       3        84
anijain2305          149             16       0         7
guangyey              86             14       0        58
xuhancn              124             17       1         5
janeyx99              67              8       3        42
wconstab              54              9       2        55
codingwithsurya       94              9       1         8
albanD                33              4       1        65

Access Last Week's Newsletter:

  • Link