Weekly Project News

Weekly GitHub Report for PyTorch: May 26, 2025 - June 02, 2025 (12:00:59)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notably, this version also adds FP16 support on X86 CPUs and marks a shift away from publishing on Conda, directing users to alternative package sources.
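
As a brief illustration of the new stance control, here is a minimal sketch; the toy function and the choice of stances are placeholders rather than anything taken from the release notes:

    import torch

    @torch.compile
    def f(x):
        return torch.sin(x) + torch.cos(x)

    # set_stance adjusts how torch.compile behaves at call time, e.g. temporarily
    # forcing eager execution without discarding previously compiled artifacts.
    torch.compiler.set_stance("force_eager")
    print(f(torch.randn(4)))  # runs eagerly under this stance
    torch.compiler.set_stance("default")
    print(f(torch.randn(4)))  # uses the normal compiled path again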

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. MPS Memory Leak: This issue reports a memory leak when using the Metal Performance Shaders (MPS) backend in PyTorch, where memory usage steadily increases over time during model training, unlike the stable memory usage observed with the CPU backend. The problem is demonstrated with two minimal scripts, highlighting specific lines of code that seem to contribute to the memory leak, and the issue persists even when using the latest nightly build of PyTorch.

    • The comments discuss attempts to reproduce the issue, with some users unable to replicate the memory leak while others confirm its presence. Suggestions include checking memory statistics, trying different PyTorch versions, and examining potential causes related to MPS's memory management. A potential fix is identified involving MPSGraph caching, and a pull request is mentioned to address part of the memory growth, though some memory behavior is attributed to macOS's handling of autorelease pools. A minimal memory-monitoring sketch appears after this list.
    • Number of comments this week: 17
  2. bfloat16 Conv2d slower than float16 on 4090: This issue reports that on an NVIDIA 4090 GPU, the Conv2d operation using bfloat16 precision is consistently slower than when using float16, which the user did not expect based on available documentation. The user provides a test script and environment details to illustrate the performance discrepancy and seeks clarification on whether this behavior is expected.

    • The comments reveal that this performance difference is expected on consumer GPUs like the 4090 due to better optimization for FP16, while BF16 is more optimized on data center GPUs. Suggestions include trying different settings like torch.backends.cudnn.benchmark_limit = 0 and using channels-last memory layout, but these did not yield significant performance improvements. Further investigation is suggested to determine if the issue is due to heuristics or kernel coverage. A rough timing sketch in this spirit appears after this list.
    • Number of comments this week: 6
  3. torch.compile regression: it cause recompile when int value changed: This issue describes a regression in the torch.compile function, which causes unnecessary recompilation when an integer value changes, negatively impacting performance. The problem arises when an integer value, which changes every epoch, is passed directly to the model, leading to recompilation issues that were not present in earlier versions of the library.

    • The comments discuss the issue of recompilation caused by a specific commit, with requests for a unit test using pure PyTorch to help diagnose the problem. A simple example is provided to reproduce the issue, and a suspected commit is identified as the cause. There is a suggestion to explore methods to avoid recompilation when an integer value is involved, and a script is shared to demonstrate the recompilation problem, highlighting the changes before and after the commit.
    • Number of comments this week: 6
  4. [BUG] DataLoader low GPU utilization and extremely slow compared to manual batching: This issue highlights a significant performance discrepancy between using PyTorch's DataLoader and manual batching, where DataLoader exhibits low GPU utilization and is substantially slower, especially when using bfloat16 data types. The user provides a reproducible code sample demonstrating that DataLoader is 7-22x or even 50x slower than direct data access, despite attempts to optimize DataLoader settings.

    • The comments discuss the impact of large batch sizes and the effect of shuffling on performance, with suggestions to adjust DataLoader parameters like shuffle, num_workers, and prefetch_factor to improve speed. Despite these adjustments, the DataLoader remains significantly slower, with one commenter noting a 600% slowdown and low GPU utilization. The conversation also touches on the potential for DataLoader to perform unnecessary tasks for simple use cases, and the challenges of balancing performance with memory usage.
    • Number of comments this week: 5
  5. grouped_mm optional zero initialization of the output: This issue discusses the optional zero initialization of the output tensor in the grouped_mm kernel, which currently allocates the output without initialization, potentially leading to uninitialized sections when using operations like scatter_add(). The proposal suggests adding functionality to initialize all or only the uninitialized parts of the output with zeros to prevent unintended behavior, especially when padding is involved due to alignment requirements.

    • The comments include requests for more detailed examples and discussions about the performance impact of zero-initialization, with one user noting a significant performance degradation in a specific use case. Another comment suggests using embedding_backward for ignored indices and criticizes the inefficiency of the current implementation, highlighting unnecessary data copying and inefficient handling of indices.
    • Number of comments this week: 5
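
For the MPS memory-leak report (item 1), a minimal monitoring sketch is shown below. It assumes an Apple-silicon machine with the MPS backend available; the model, batch size, and step count are placeholders, not the reporter's script.

    import torch

    device = torch.device("mps")
    model = torch.nn.Linear(1024, 1024).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(64, 1024, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 20 == 0:
            # torch.mps exposes allocator- and driver-level counters; values that
            # grow steadily here would match the behavior described in the issue.
            print(step,
                  torch.mps.current_allocated_memory(),
                  torch.mps.driver_allocated_memory())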
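
For the bfloat16-versus-float16 Conv2d report (item 2), a rough timing sketch is shown below. It assumes a CUDA-capable GPU and uses arbitrary layer and input sizes, not the reporter's exact configuration.

    import time
    import torch

    def bench(dtype, iters=50):
        conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().to(dtype)
        x = torch.randn(32, 64, 224, 224, device="cuda", dtype=dtype)
        for _ in range(10):  # warm-up so cuDNN algorithm selection settles
            conv(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    print("float16: ", bench(torch.float16))
    print("bfloat16:", bench(torch.bfloat16))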

2.2 Top 5 Stale Issues:

We consider stale issues to be those that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend, and is likely related to compatibility or versioning issues with the Triton library.
  2. Alternate algorithm for computing MaxPool2D under specific condition: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for other kernel sizes. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time. A small numerical check of this equivalence appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a model in a Docker environment with a tmpfs permission set to 1777, where the execution of the cached cuda_utils.so file in the /tmp directory fails due to the absence of the execution bit, despite the directories having the correct permissions. The error occurs during the execution of a PyTorch model, specifically when using the torch.compile function, and is related to the inability to map a segment from the shared object, which is crucial for the model's operation.
  4. Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
  5. [JIT archive] Add a flag to not include debug files: This issue addresses the need for a feature in the PyTorch library that allows users to exclude debug files when saving models using the torch.jit.save() function, as these files significantly increase the size of the saved model without affecting its functionality. The proposal suggests adding a flag to the function to prevent the inclusion of .debug_pkl files, which are primarily used for debugging purposes, thereby reducing the file size and making it more suitable for deployment on resource-constrained devices like mobile phones.
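
For the MaxPool2D proposal (item 2), the claimed equivalence for stride 1 can be checked numerically with a short sketch; the input size is arbitrary:

    import torch

    x = torch.randn(1, 1, 32, 32)
    direct = torch.nn.MaxPool2d(kernel_size=5, stride=1)(x)
    # Two stacked 3x3 max-pools with stride 1 cover the same 5x5 window.
    composed = torch.nn.MaxPool2d(kernel_size=3, stride=1)(
        torch.nn.MaxPool2d(kernel_size=3, stride=1)(x))
    print(torch.allclose(direct, composed))  # expected: True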

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 92

Summarized Issues:

  • Bugs in PyTorch's torch.distributed and torch.fake_quantize_per_tensor_affine functions: These issues highlight bugs in PyTorch's torch.distributed.checkpoint.state_dict.get_model_state_dict and torch.fake_quantize_per_tensor_affine functions. The former fails to update _metadata keys after removing the "module." prefix, while the latter handles +inf inconsistently between CPU and CUDA devices, mapping it incorrectly on CPU.
    • issues/154327, issues/154328
  • Memory and Performance Issues in PyTorch: PyTorch faces a memory leak issue with the Metal Performance Shaders (MPS) backend on macOS, leading to increased memory usage over time. Additionally, a performance discrepancy is noted where Conv2d using bfloat16 is slower than float16 on an NVIDIA RTX 4090 GPU.
    • issues/154329, issues/154351
  • Enhancements and Proposals for PyTorch: There are proposals to implement Jacobian-vector product for flex attention and to remove explicit backend references from torch.distributed. These enhancements aim to improve performance in few-step diffusion models and simplify usage by inferring backend from device_id.
    • issues/154332, issues/154345
  • Bugs in PyTorch's torch.qr and torch.prod functions: PyTorch's torch.qr function with the out parameter causes crashes, suggesting a switch to torch.linalg.qr. Similarly, using .prod(dtype=torch.int16) with autograd results in an internal assertion failure, indicating issues with differentiable types.
    • issues/154356, issues/154357
  • Memory Inefficiency and Graph Breaks in PyTorch: PyTorch's scaled_dot_product_attention function shows significant memory inefficiency in a Generalized Query Attention setting. Additionally, a graph break occurs during torch compilation of a forward pass in an attention layer using flash attention.
    • issues/154363, issues/154365
  • Errors and Crashes in PyTorch's TorchScript and Sparse Operations: PyTorch's TorchScript raises a RuntimeError with view(dtype) due to invalid shape transformations. Additionally, torch.sparse.softmax crashes due to deprecated torch.sparse.FloatTensor with invalid indices.
    • issues/154407, issues/154419
  • Bugs in PyTorch's torch.fmod and torch.remainder functions: Using torch.fmod and torch.remainder with uint8 tensors causes a ZeroDivisionError in eager mode and a crash with a "Floating point exception" when compiled using the Inductor backend.
    • issues/154420
  • Segmentation Faults in PyTorch Functions: Segmentation faults occur in torch.fx.experimental.partitioner_utils.map_arg and torch.jit.ignore with drop=True, highlighting issues with recursive data structures and TorchScript compilation.
    • issues/154422, issues/154423
  • Segmentation Faults in PyTorch's Sparse Operations: Segmentation faults occur in torch.matmul and torch.sparse.addmm with sparse CSR tensors due to invalid crow_indices, suggesting the use of check_invariants=True to identify input errors. A minimal sketch of this suggestion appears after this list.
    • issues/154424
  • Floating Point Exception in PyTorch's ONNX Export: A floating point exception occurs during torch.onnx.export with a PixelShuffle layer, likely due to an incorrect input tensor shape with zero input channels.
    • issues/154425
  • GitHub Bug Affecting PyTorch's Pytorchbot: The pytorchbot incorrectly identifies a pull request as already merged due to a GitHub bug, suggesting a flaw in the logic or GraphQL queries.
    • issues/154427
  • Discrepancies in PyTorch's Sine Function: A significant discrepancy is noted in the output of the sine function when applied to the exponential of a tensor on CPU versus CUDA, particularly for very large arguments.
    • issues/154428
  • Bugs in PyTorch's torch._dynamo.optimize and torch.jit.script functions: torch._dynamo.optimize incorrectly converts a torch.Size object to a tuple, while torch.jit.script raises a RuntimeError with view(dtype) due to invalid shape transformations.
    • issues/154432, issues/154407
  • Proposals for PyTorch's Dependency Management: A proposal suggests removing the ideep git submodule dependency and directly integrating the oneDNN API to improve transparency and efficiency, particularly for x86_64, aarch64 CPUs, and Intel GPUs.
    • issues/154444
  • Failures in PyTorch's Inductor Jobs: Inductor jobs fail related to the opacus_cifar10 test after updating to opacus version 1.5.4, as indicated by a specific pull request and a linked HUD error report.
    • issues/154446
  • Bugs in PyTorch's Android Libraries: PyTorch Android Torch Vision and PyTorch Lite libraries do not support the required 16KB page size alignment mandated by Google for apps targeting Android 15+ from November 1st, 2025.
    • issues/154449
  • Logging and Cache Issues in PyTorch's Dynamo: The logging system for re-raising exceptions in PyTorch's Dynamo is confusing due to duplicate graph breaks and chained exceptions. Additionally, torch.compile does not utilize the cache effectively, leading to prolonged warmup times.
    • issues/154454, issues/154456
  • Regression in PyTorch's Slow-Autograd Tests: A regression in the slow-autograd tests is identified, prompting a pull request to exclude the problematic test file from running slow gradcheck.
    • issues/154459
  • Inconsistent Behavior in PyTorch's Pytorchbot: The pytorchbot experiences inconsistent behavior when handling interactively rebased commits, resulting in incorrect commit references.
    • issues/154461
  • Bugs in PyTorch's torch.compiler.save_cache_artifacts() API: The torch.compiler.save_cache_artifacts() API fails due to an ImportError caused by a circular import in the torch.compiler._cache module.
    • issues/154463
  • Kernel Hash Key Issues in PyTorch's Inductor: A kernel hash key for the ChoiceCaller in PyTorch's Inductor is needed to be independent of runtime parameters to prevent incorrect cache usage.
    • issues/154467
  • Inconsistencies in PyTorch's torch.fft.fft and torch.fft.irfft functions: The torch.fft.fft function produces inconsistent results for infinite input values between CPU and GPU devices. Similarly, torch.fft.irfft handles non-Hermitian inputs inconsistently across CPU and GPU.
    • issues/154474, issues/154496
  • Regression in PyTorch's FFT Operators: A regression in FFT operators occurs when using the 2024 version of MKL, resulting in an error due to inconsistent configuration parameters.
    • issues/154477
  • Bugs in PyTorch's torch.jit.trace_module and torch.svd_lowrank functions: The torch.jit.trace_module API fails to respect __jit_ignored_attributes__, while torch.svd_lowrank produces inconsistent U matrix outputs on CPU versus CUDA.
    • issues/154478, issues/154479
  • Feature Request for PyTorch's Compiler: A feature request suggests enhancing the PyTorch compiler's ability to infer data-dependent information from tensor constructor calls to prevent data-dependent errors.
    • issues/154489
  • Regression in PyTorch's torch.compile Function: A regression in the torch.compile function leads to unnecessary recompilation when changing an integer value during each epoch of model generation.
    • issues/154490
  • Bugs in PyTorch's torch.randn and torch.add Functions: Using torch.randn with device='mkldnn' results in an "INTERNAL ASSERT FAILED" error. Additionally, torch.add and torch.sub return incorrect results for complex128 tensors with infinite components.
    • issues/154491, issues/154501
  • Bugs in PyTorch's torch.fft.irfft and torch.add Functions: The torch.fft.irfft function handles non-Hermitian inputs inconsistently across CPU and GPU. Similarly, torch.add and torch.sub return incorrect results for complex128 tensors with infinite components.
    • issues/154496, issues/154501
  • Bugs in PyTorch's torch.jit.script and torch.compile Functions: The torch.jit.script function does not recognize axis as an alias for dim, causing runtime errors. Additionally, torch.compile does not utilize the cache effectively, leading to prolonged warmup times.
    • issues/154613, issues/154456
  • Bugs in PyTorch's torch.compile and torch.export.export Functions: A NotImplementedError occurs when running torch.compile with flex_attention and NJT inputs. Additionally, torch.export.export fails due to a data-dependent error related to unbacked symbols.
    • issues/154556, issues/154559
  • Bugs in PyTorch's torch.export.export and torch.jit.script Functions: The torch.export.export function generates invalid code for Tensor.split with the meta device. Additionally, the torch.jit.script function does not recognize axis as an alias for dim.
    • issues/154721, issues/154613
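
For the sparse CSR reports above, the check_invariants suggestion can be illustrated with a small sketch; the indices are deliberately malformed for demonstration:

    import torch

    crow_indices = torch.tensor([0, 3, 2])  # non-monotonic, hence invalid
    col_indices = torch.tensor([0, 1, 0])
    values = torch.tensor([1.0, 2.0, 3.0])

    # With check_invariants=True the constructor validates the indices up front and
    # raises a clear error, instead of a later sparse matmul crashing on bad input.
    torch.sparse_csr_tensor(crow_indices, col_indices, values,
                            size=(2, 2), check_invariants=True)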

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 19

Summarized Issues:

  • RuntimeError with MaxUnpool2d in PyTorch: This issue involves a RuntimeError encountered when using the MaxUnpool2d module in PyTorch. The error arises from a non-contiguous tensor being used in a backward pass and can be resolved by making the tensor contiguous before setting requires_grad to True; a minimal sketch of this workaround appears after this list.
    • issues/154341
  • NCCL Communication Failures on NVIDIA H100 GPUs: A bug where NCCL communications fail with an internal error when using PyTorch version 2.7.0 on a system with 8 NVIDIA H100 GPUs. The issue was resolved by manually adding arch and vendor fields to the topology XML file provided by AWS Hyperpod configuration.
    • issues/154342
  • Shared State in Wishart Distribution Instances: This issue describes a bug in the PyTorch library where creating a second instance of the Wishart distribution modifies the constraints on the first instance. This occurs because arg_constraints is defined as a class variable, resulting in shared state across instances; a generic illustration of this pitfall appears after this list.
    • issues/154355
  • Compatibility Issue with torch.get_default_device(): An error is encountered when attempting to load a model using the transformers library because the installed PyTorch build lacks torch.get_default_device(). The fix is to upgrade to a newer PyTorch release that provides this function.
    • issues/154362
  • Discrepancy in F1-score Across PyTorch Versions: A problem where the user observes different outputs and a 3% drop in F1-score when running the same code under different PyTorch versions. The discrepancy persists despite seed control and identical data, and the user seeks clarification, since PyTorch does not guarantee bitwise reproducibility across versions.
    • issues/154411
  • Release Highlight Feature Testing for PyTorch 2.8.0: This issue tracks release-highlight testing for a feature proposed for the PyTorch 2.8.0 release. Plans include documentation and tutorial submissions, marketing coverage, and testing support, specifically targeting the 2.8.0 release on Linux.
    • issues/154462
  • Inconsistent Results in torch.fft Functions: Several issues highlight discrepancies in torch.fft functions, such as hfft2, ifft2, and others, where infinite input values result in inconsistent outputs between CPU and GPU devices. These inconsistencies lead to different patterns of inf and nan values in the output tensor for identical inputs.
    • issues/154520, issues/154521
  • Disabled Tests Due to Failures on Main Branch: Multiple tests, such as test_promotes_int_to_float_ldexp_cuda_int16 and test_linear, have been disabled due to consistent failures on the main branch. These issues involve requests to specify affected platforms and contributions from various developers.
    • issues/154550, issues/154684, issues/154760
  • Inconsistent Conversion of torch.inf Values: Several issues highlight bugs in PyTorch where methods like .int(), .char(), and .type_as() produce inconsistent results for torch.inf values between CPU and GPU devices, leading to different numerical results across hardware platforms; a small repro sketch appears after this list.
    • issues/154726, issues/154727
  • Bug in flex_attention with Nested Jagged Tensor Inputs: This issue highlights a bug in the flex_attention function when used with Nested Jagged Tensor (NJT) inputs, resulting in inconsistent outputs compared to other data layouts. The problem may be related to the lack of a block_mask for NJT inputs, which should be addressed by using create_nested_block_mask.
    • issues/154554
  • Compilation Error on Windows 11 with CUDA 12.9: A compilation error occurs when attempting to build PyTorch's static library on Windows 11 using CUDA 12.9, due to an LLVM out-of-memory error and an nvcc error. The issue was closed due to lack of actionable information, with a suggestion that the problem might be related to the CUDAToolkit.
    • issues/154604
  • Memory Leakage in TorchTitan llama3 8b Training: This issue involves a memory leakage problem during the training of TorchTitan llama3 8b on H100 machines, specifically when Selective Activation Checkpointing (SAC) is used and torch.compile() is disabled. The leakage leads to out-of-memory (OOM) errors due to reference cycles that include tensors.
    • issues/154642
  • Bug in speculate_subgraph with torch.func.functionalize: This issue pertains to a bug in the PyTorch library where the speculate_subgraph function fails to detect input mutations when a function is wrapped with torch.func.functionalize. The problem is demonstrated by a Python code snippet in the issue; a simplified sketch of the ingredients appears after this list.
    • issues/154669
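
The sketches below illustrate a few of the closed issues above. They are simplified reconstructions based on the summaries, not the exact reproducers from the issue tracker.

A minimal sketch of the contiguity workaround described for the MaxUnpool2d report (issues/154341); the shapes and pooling parameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 3, 8, 8)
# A transpose makes the tensor non-contiguous; calling .contiguous() before
# requires_grad_() was reported to avoid the RuntimeError in the backward pass.
x = x.transpose(2, 3).contiguous()
x.requires_grad_()

out, indices = pool(x)
recon = unpool(out, indices)
recon.sum().backward()   # succeeds once the input is contiguous
```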
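
A generic illustration of the shared-state pitfall behind the Wishart report (issues/154355); this shows only the Python pattern described, not the actual Wishart code:

```python
# A mutable class attribute is shared by every instance, so mutating it
# through one instance is visible through all others.
class SomeDistribution:
    arg_constraints = {"df": "positive"}   # class-level, shared

    def __init__(self, df_constraint):
        # Mutates the shared dict instead of creating a per-instance copy.
        self.arg_constraints["df"] = df_constraint

a = SomeDistribution("greater_than(1)")
b = SomeDistribution("greater_than(3)")
print(a.arg_constraints["df"])   # "greater_than(3)" -- changed by creating b
```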
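
A small repro sketch for the torch.inf conversion reports (issues/154726, issues/154727); the printed values are platform-dependent because converting inf to an integer type is undefined behavior:

```python
import torch

x = torch.tensor([torch.inf, -torch.inf])
print(x.int(), x.char())                  # CPU results

if torch.cuda.is_available():
    y = x.cuda()
    print(y.int().cpu(), y.char().cpu())  # GPU results may differ from the CPU ones
```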
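
A simplified sketch of the ingredients in the speculate_subgraph report (issues/154669): a function with an input mutation wrapped in torch.func.functionalize. The issue's reproducer exercises this through Dynamo, which is not shown here:

```python
import torch

def mutating_fn(x):
    x.add_(1)          # in-place mutation of the input
    return x * 2

# functionalize rewrites in-place ops into out-of-place equivalents; the report
# is that Dynamo's speculate_subgraph then fails to flag the input mutation.
wrapped = torch.func.functionalize(mutating_fn)
print(wrapped(torch.ones(3)))   # tensor([4., 4., 4.])
```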

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 176

Key Open Pull Requests

1. [ONNX] Implements converter for higher order ops scan: This pull request implements a converter for higher order operations, specifically the "scan" operation, in the ONNX framework within the PyTorch project, addressing issue #151327.

  • URL: pull/154513
  • Merged: No
  • Associated Commits: 22dd1, 11523, e7ab7, 882d5, 249f4, 0b004, 44999, 0f250, 0caa1, ad6bb, 04f01, f169d, dd484, d90f9, 7b457, 45937, 402a3, 1da86, acc6d, ed383, f3aa4, bc23f, 4decd, 69e32, 773b7, 59f15, 082f4, 0880a, 23c60, c11eb, 5d4b5, 4798e, b814f, 03857, b0db1, 11e9b, 8442c, 59e09

2. [ONNX] Create support for rotary embeddings: This pull request introduces support for rotary embeddings in the PyTorch ONNX exporter by registering the RotaryEmbedding operator in the torch.ops.onnx namespace, allowing the exporter to recognize and export ONNX operators, and providing user-friendly, unversioned functions for native use in PyTorch models.

  • URL: pull/154745
  • Merged: No
  • Associated Commits: 7027d, c498a, fe716, 60a4f, 7cf8d, 3afc5, c5199, 13660, 99aa1, c8f84, b8ce1, db32e, 3774f, 4f2ea, 59c70

3. Type hints for distributions/constraints: This pull request introduces type hints to the distributions and constraints modules in the PyTorch project, addressing issues #144196 and #144219, by making several enhancements such as making Independent and MixtureSameFamily generic, annotating attributes, adding type aliases to __all__, and incorporating various other improvements across multiple commits.

  • URL: pull/154711
  • Merged: No
  • Associated Commits: caade, 6fef4, a89be, da8cf, 46c16, b72e7, eaa7b, a9a40, b73dd, e9d8d, 74f20, fb2d6, 7c1ab

Other Open Pull Requests

  • Dynamo Enhancements: These pull requests improve the tracing capabilities of PyTorch's Dynamo component, ensuring that explicit dunder method calls can be traced, as part of a series of updates managed through the ghstack tool.
    • pull/154366
  • CUDA Integration Updates: These pull requests address updates and fixes related to CUDA integration in the PyTorch project. They utilize newer CMake syntax and targets, remove outdated modules, fix MSVC issues, update version references, and adopt the official CUDA module.
    • pull/154595
  • PrecompileContext Implementation: This pull request introduces the PrecompileContext, a specialized CacheArtifactManager for managing precompile artifacts during Torch compilation. It focuses on testing and basic AOTAutograd logic, with future updates planned for dynamo-related features.
    • pull/154415
  • ONNX Exporter Enhancements: The ONNX exporter in PyTorch is enhanced to export the Scaled Dot-Product Attention (SDPA) to the ONNX Attention operator. This includes adding unit tests, duplicating functions for varied opsets, fixing lint issues, and updating relevant files.
    • pull/154596
  • Sparse Tensor Validation Control: A new feature allows users to control the validation of sparse tensor invariants when loading data from external sources. The default setting disables this validation to avoid computational expense.
    • pull/154610
  • Documentation Format Conversion: This pull request involves converting PyTorch's documentation from reStructuredText (.rst) to Markdown (.md). It includes several updates to refine the conversion process.
    • pull/154438
  • Large Indices Data Type Update: An issue with large indices in PyTorch is addressed by changing the data type from torch.int32 to torch.int64, preventing invalid indexing when the tensor's number of elements exceeds 2^31; a short sketch of the overflow appears after this list.
    • pull/154575
  • Intel GPU Support in AOTInductor: Support for Intel GPU's xpu mkldnn operations is added within the AOTInductor framework. This enhancement involves multiple contributors and reviewers.
    • pull/154586
  • Sparse Tensor Pinning Check: The pinning check is disabled when loading sparse tensors to address a specific issue in PyTorch. This change is intended to be merged two weeks after a related pull request.
    • pull/154638
  • Wheel File Reuse: An issue in PyTorch is addressed by reusing an old wheel file and replacing its version string, as indicated by the pull request title and the referenced issue number.
    • pull/154773
  • Cudagraph CPU Tensor Support: The issue of cudagraph not supporting CPU tensors is addressed by updating the graph to move CPU tensors to the GPU when beneficial. This involves graph partitioning to enable cudagraphification of remaining GPU operations.
    • pull/154464
  • CI Failures and CUDA 12.8: CI failures in specific tests caused by a CUDA 12.8 update are addressed. The pull request includes fixes for graph breaks and is discussed in a related pull request.
    • pull/154497
  • NVSHMEM Version Addition: NVSHMEM version 3.2.5 is proposed to be added to the PYTORCH_EXTRA_INSTALL_REQUIREMENTS. This version supports both cu11 and cu12 builds, as detailed in the associated commits.
    • pull/154568
  • Symbolic Integers in Graph Partitioning: Symbolic integers (symints) are added to the get_graph_inputs function during graph partitioning. This prevents errors in the codegen_input_symbol_assignment process when tensor shapes involve expressions.
    • pull/154679
  • Guard Overhead Measurement: The issue of inaccurately measured guard overhead during compilation is addressed by flushing the cache. This ensures profiling results are more realistic and consistent with runtime observations.
    • pull/154764
  • Profiler Traces Enhancement: A new event is introduced in profiler traces to efficiently record pre-graph bytecode. This aims to identify models where the pre-graph bytecode is particularly resource-intensive.
    • pull/154769
  • CUDA 12.8 CI Tests: New continuous integration tests for CUDA 12.8 in eager execution mode are introduced. This includes updates to Docker builds, specific tests, and fixes for rebase and linting issues.
    • pull/154469
  • Printf Function Replacement: Certain call sites in PyTorch are proposed to be replaced with the fmtlib printf function. This aims to achieve a faster and memory-safe implementation.
    • pull/154533
  • Compile-Time Performance Improvement: Building the main and kernel code in separate threads is proposed to improve compile-time performance on the CPU. The change produces no observable differences on the TorchInductor dashboard.
    • pull/154551
  • ConvTranspose3D Support on MacOS: ConvTranspose3D is enabled for FP32 and Complex64 data types on MacOS 14 and 15. Half-precision data types remain unsupported due to discrepancies between CPU and GPU implementations.
    • pull/154696
  • FP8 GEMM Bias Argument: Support for a bias argument in the fp8 GEMM operation within the Cutlass library is introduced. This is part of a series of updates tracked through the ghstack tool.
    • pull/154761
  • Draft Pull Request: A draft pull request in the PyTorch GitHub repository involves multiple updates and contributors. It is part of a stack of changes managed by ghstack but has not yet been merged.
    • pull/154388
  • mm.out Function Migration: The mm.out function is migrated from an out-of-tree implementation to an in-tree one. A new file, MTIAOps.cpp, is added to handle the dispatching of mm.out operations separately.
    • pull/154393
  • Padding Validation Update: The padding validation in the max_pool1d, max_pool2d, and max_pool3d functions is updated to correctly account for dilation, so that valid padding values are no longer incorrectly rejected; an illustrative sketch appears after this list.
    • pull/154395
  • Metal Kernel Migration: The remaining inverse trigonometric and hyperbolic unary operations are moved to Metal kernels in PyTorch. This includes kernels for atan, asin, and acos, along with formatting fixes.
    • pull/154465
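
Two short sketches related to the pull requests above follow; both are illustrative rather than taken from the pull requests themselves. First, why int32 indices overflow for very large tensors (pull/154575); the element count is an arbitrary example just past the int32 limit:

```python
import torch

numel = 2**31 + 10                     # hypothetical tensor size
print(torch.iinfo(torch.int32).max)    # 2147483647 -- cannot address numel elements
print(torch.iinfo(torch.int64).max)    # large enough for any practical tensor
```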
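
Second, an illustrative case behind the pooling padding fix (pull/154395): with dilation=2 and kernel_size=3 the effective window is dilation*(kernel_size-1)+1 = 5, so padding=2 is meaningful, yet a check against the nominal kernel size alone rejects it. Whether this call is rejected depends on the PyTorch version:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16)
try:
    y = F.max_pool1d(x, kernel_size=3, padding=2, dilation=2)
    print(y.shape)                     # accepted once dilation is accounted for
except RuntimeError as e:
    print("rejected:", e)              # older validation: pad vs. kernel_size only
```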

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 170

Key Closed Pull Requests

1. Release 2.6 test distributed spawn failed: This pull request addresses a failed distributed-spawn test in the PyTorch 2.6 release. Across multiple commits it applies a series of fixes and updates, including use of a validate-docker-images workflow, adjustments to release-specific configurations, and various code optimizations and bug fixes to ensure compatibility and performance across different platforms and environments.

  • URL: pull/154508
  • Merged: No
  • Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 93864, 88b97, bbd00, 1f32b, d33dd, f1481, ea546, ac7d6, ed487, 66dfe, 3783d, 8adc1, 5c4fa, 639ee, b445b, 8d72c, 374e5, 6a3b5, e607b, aafc7, d5947, 1b753, ba1ba, 70f30, 3398f, 8354d, 737cf, 4202f, 7c27e, 2e2c7, 3a818, 53ad2, 8eb5d, dbe8c, fcdff, 92b55, f6789, 2e1ed, 13339, 82ac2, 3608e, bfb23, 86b0a, 03714, 34caa, ac032, 5dd61, 7d528, d9a03, 7c072, 73dd0, b08d9, d70a9, 7ad5a, 2fd46, ed8c6, bf084, 20ad8, 2fb0a, 8cfa9, 50a04, 45896, 9d0a4, 1a808, 6fe84, a3632, 68180, fb24f, e53a9, 2cda1, 9d566, a7044, c7ba8, cbd7b, f4c96, 9cf15, 5c42a, 1a150, 1ded2, 2ff80, b6e5f, 50924, 4642c, 5de86, f0b4a, 96c61, 751e4, 684f6, 0afd4, 93ff7, e22ae, 2eff7, 2a634, 90ab5, 26b82

2. [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging: This pull request modifies TorchInductor's autotuning flow to include Triton's "base32" cache key in each best_config JSON file, facilitating debugging and analysis by allowing developers to easily match compiled binaries and intermediate representations with their corresponding best configurations, while minimizing impact by adding only an extra field.

  • URL: pull/154618
  • Merged: No
  • Associated Commits: f2460, 7dc8d, 9ee61, bcf84, 09b13, ef40e, 1dc8c, 7c3fb, f9fdc, 163a7, 5c6ae, 96414, 1a4b6, 1af01, 15038, 88b22, a4bc0

3. [upstream triton] support build with setup.py in ./python/ or in ./: This pull request addresses the relocation of the setup.py file in the upstream Triton project from the python/ directory to the root directory (./), enabling the build process to adapt by determining the current working directory based on the new location of setup.py.

  • URL: pull/154635
  • Merged: No
  • Associated Commits: 049bb, 775f8, c934a, 7710e, d0e2a, e37b5, 7457a, 0b9e3, d4b64, 34bf9, 54a1c, 7a842, cc29c, 9719c, 67113, 35ea7

Other Closed Pull Requests

  • Removal of 'allow-untyped-defs' option: Several pull requests aimed to remove the 'allow-untyped-defs' option from various files in the PyTorch project as part of a series of updates managed through the ghstack tool. Despite the efforts, these changes were ultimately not merged into the main codebase.
    • pull/154626, pull/154625, pull/154624, pull/154623, pull/154622
  • Addition of docblocks: Multiple pull requests focused on adding docblocks to various functions and components within the PyTorch project. These changes were part of a larger stack of related updates managed by the ghstack tool, but none of them were merged.
    • pull/154397, pull/154399, pull/154396, pull/154398, pull/154400, pull/154379, pull/154380, pull/154381, pull/154402, pull/154403
  • XPU Compatibility and Enhancements: A pull request aimed to ensure compatibility of the XPU with toolchain version 2025.2, including updates to specific files. Another pull request enhanced unit tests for the elapsed_time function in the XPUEvent class to prevent incorrect elapsed time issues.
    • pull/154359, pull/154494
  • Custom AMD Triton Kernels: A pull request addressed the issue of custom AMD Triton kernels erroring out due to special keyword arguments not appearing in the kernel signature. The proposed solution was to ignore such kwargs when absent, enhancing compatibility with PT2.
    • pull/154605
  • Inductor Lowering Dictionary: A pull request introduced the capability to pass a custom lowering dictionary to the register_lowering() function in Inductor. This change allows systems like Helion to manage their own lowering dictionaries independently of the global lowerings dictionary.
    • pull/154344
  • Common Subexpression Elimination (CSE): A pull request addressed the issue of not performing CSE on unbacked nodes in the PyTorch project. Despite multiple updates and commits, this change was ultimately not merged.
    • pull/154387
  • Multi-architecture Kernel Binaries: A pull request added support for multi-architecture kernel binaries in the "package_cpp_only" mode. This involved generating specific CMake targets to compile .ptx files into .fatbin files and embedding them into the final shared library or binary.
    • pull/154414
  • Tensor Mutation Reflection: A pull request addressed an issue where the system needed to reflect back mutations to the input tensor when a cloned, misaligned tensor is mutated. This ensures that the mutation is preserved even when the tensor's alignment changes.
    • pull/154442
  • Distributed 3D Composability Testing: A pull request involved adding an "h100_distributed" label to facilitate testing of distributed 3D composability on an 8*H100 GPU node. Despite several commits, including label addition and YAML file corrections, it was not merged.
    • pull/154562
  • Memory-efficient Attention in CUDA: A pull request addressed issues in the backward pass of memory-efficient attention for large tensors in CUDA. It fixed identified problems and added support for logsumexp computation in the forward pass.
    • pull/154663
  • Typing and ABI-compatible Dispatching: A pull request focused on improving the typing of a cpp_wrapper interface and preparing for ABI-compatible AOTI C-shim dispatching. It involved removing unnecessary asserts and control flow while ensuring functional neutrality.
    • pull/154371

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. test
    • Toxicity Score: 0.55 (Escalating tension, Defensive responses, Unresolved issues)
    • This GitHub conversation involves a series of interactions where username1 expresses frustration over a solution proposed by username2, which did not resolve the issue at hand. The tone of the conversation is initially neutral but becomes tense as username1's dissatisfaction grows. Username2 responds defensively, which further escalates the tension. The conversation is marked by a lack of resolution and increasing frustration from both parties.

IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor Commits Pull Requests Issues Comments
malfet 142 14 9 125
bobrenjc93 201 46 7 28
Skylion007 92 26 3 152
guilhermeleobas 235 16 2 1
laithsakka 80 24 10 34
anijain2305 112 7 1 8
eellison 55 16 2 36
pianpwk 67 16 4 15
henrylhtsang 63 8 8 16
ngimel 34 6 0 55

Don't miss what's next. Subscribe to Weekly Project News:
Powered by Buttondown, the easiest way to start and grow your newsletter.