Weekly Project News


Weekly GitHub Report for PyTorch: July 28, 2025 - August 04, 2025 (12:00:55)


Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on x86 CPUs, expanded Intel GPU support, FlexAttention for x86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda package publishing.
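The weights_only=True default mentioned above means torch.load now runs a restricted unpickler that reconstructs only tensors and primitive containers rather than arbitrary Python objects. As a rough, stdlib-only sketch of the idea (the blanket rejection below is illustrative; PyTorch actually consults an allow-list of safe types):

```python
import io
import os
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global, mimicking a weights-only load."""
    def find_class(self, module, name):
        # A real implementation would consult an allow-list of safe types;
        # rejecting everything is the most conservative sketch.
        raise pickle.UnpicklingError(f"blocked: {module}.{name}")

def safe_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers and numbers never trigger find_class, so they load fine.
state = safe_loads(pickle.dumps({"lr": 0.01, "steps": [1, 2, 3]}))

# A pickle that references a global (here, a function) is rejected on load.
try:
    safe_loads(pickle.dumps(os.getcwd))
except pickle.UnpicklingError as e:
    blocked = str(e)
```

The same principle is why the flip is backward-incompatible: checkpoints containing arbitrary pickled objects must now be loaded with weights_only=False explicitly.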

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [Inductor] Inconsistent results between eager execution and torch.compile: This issue reports a correctness discrepancy in the Rotary Position Embedding (RoPE) computation when using torch.compile with the Inductor backend on the Qwen3 transformer model, specifically with bfloat16 inputs. The problem appears because eager execution implicitly handles certain type conversions that Inductor does not, leading to inconsistent results between the two modes.

    • The discussion identifies the issue as isolated to the Inductor backend and related to precision casting of bfloat16 tensors, with a suggested workaround of explicitly converting inputs to float32. Attempts to fix the problem using TORCHINDUCTOR_EMULATE_PRECISION_CASTS showed mixed results depending on the device and code version, and a parameter naming mismatch in the Inductor code was found and proposed for correction. Overall, compiled execution sometimes yields improved numerical accuracy compared to eager mode, but the differences and workarounds were debated, with suggestions to document these behaviors for users encountering similar issues.
    • Number of comments this week: 8
  2. Invalid onnx model is exported for model where data is assigned using a mask and index: This issue reports a bug where exporting a PyTorch model to ONNX fails when the model assigns data using a mask and index, resulting in an invalid ONNX model that produces a runtime error during inference. The problem stems from missing ONNX decompositions for several PyTorch operators used in the model, causing the export process to fail with conversion errors related to unsupported operations.

    • The discussion involved requests to test the export with the latest nightly build using specific export flags and to provide detailed conversion reports. The user shared an extensive ONNX export report showing multiple operators lacking registered ONNX decompositions, and after updating ONNX and onnxscript versions, the issue persisted. The maintainers acknowledged the report and indicated they would investigate further based on the provided diagnostic information.
    • Number of comments this week: 6
  3. SimpleFSDP + TP embedding sharding error: This issue describes a bug encountered when using SimpleFSDP combined with tensor parallel (TP) embedding sharding, which started occurring after a specific commit made on July 16. The problem involves the embedding's self.mask_buffer.data becoming None during graph tracing with Dynamo after enabling compilation, and reverting certain files to their previous state temporarily resolves the issue.

    • The comments discuss the potential cause being related to DTensor dispatch caching changes introduced in the referenced commit, with speculation that caching now skips code that sets self.mask_buffer. Participants agree that the commit should be reverted until the issue is fully understood, suggest adding basic SimpleFSDP tests to prevent future regressions, and share ideas on how to reproduce the bug using smaller debug models.
    • Number of comments this week: 6
  4. HAS_CUDA in the inductor tests is really HAS_CUDA_AND_TRITON: This issue addresses the confusion caused by the misnamed constant HAS_CUDA in the inductor tests, which actually represents the combined condition HAS_CUDA_AND_TRITON. The reporter suggests clarifying this naming and updating all test cases to reflect the correct usage to prevent further misunderstandings.

    • The commenters confirm the intent to rename HAS_CUDA to HAS_CUDA_AND_TRITON and discuss ongoing work on this update. They also consider whether a similar renaming should be applied to HAS_XPU, agreeing that it could be handled separately in a follow-up pull request.
    • Number of comments this week: 5
  5. can kineto profile only one thread with corresponding gpu tasks among multithreads: This issue concerns the use of the Kineto profiler in a multithreaded server environment where the user wants to profile only one specific thread handling a single query, but currently Kineto records events from all threads. The user is seeking a way to make Kineto profiling thread-local, similar to the legacy profiler, to avoid capturing CUDA events from unrelated threads.

    • The discussion clarifies that Kineto currently records all CUDA events from all threads because CUPTI, the underlying tool, is not thread-aware. A contributor acknowledges the limitation and expresses willingness to explore adding an option to filter out threads without matching CPU operations, which the user appreciates and looks forward to as a potential new feature.
    • Number of comments this week: 4

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module, which causes a backend compiler failure in PyTorch's Inductor backend during model compilation. The user provides detailed environment information and code snippets showing that the error arises while compiling specific model components with torch.compile, indicating a potential compatibility or packaging problem with the Triton compiler integration in the PyTorch development version.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, supported by testing that demonstrates a measurable speedup.
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs-mounted /tmp directory set to permission mode 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase by removing approximately 1,500 files currently excluded from UFMT and applying consistent formatting to them. It outlines the process for updating the .lintrunner.toml configuration, running the formatting tool, handling known edge cases that require preparatory fixes, and organizing the work by directory to facilitate manageable and reviewable pull requests.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, specifically .debug_pkl files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the archive size—posing challenges for deploying models on resource-constrained devices like mobile phones—while their removal does not affect the model’s correctness.
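The MaxPool2D proposal in item 2 rests on a composition identity: at stride 1, applying a size-3 max pool twice covers the same window as a single size-5 pool, since each pass widens the effective window by k − 1. A minimal pure-Python sketch of the identity in 1-D (the 2-D case applies the same reasoning per dimension):

```python
import random

def maxpool1d(xs, k):
    """Stride-1, no-padding 1-D max pooling with window size k."""
    return [max(xs[i:i + k]) for i in range(len(xs) - k + 1)]

random.seed(0)
xs = [random.random() for _ in range(64)]

# Two passes with k=3 give an effective window of 3 + 3 - 1 = 5.
composed = maxpool1d(maxpool1d(xs, 3), 3)
direct = maxpool1d(xs, 5)
assert composed == direct
```

The proposed speedup comes from each output cell of a size-3 pool comparing fewer elements than a size-5 pool, while intermediate results are shared across overlapping windows.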

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 103

Summarized Issues:

  • Dynamo and Inductor Backend Caching and Compilation Issues: Several issues highlight problems with caching precompile in Dynamo and Inductor backends, including assertion errors during recompile triggered cache updates, unexpected recompilations on subsequent runs, and incompatibility of static tensor references with precompile caching. These problems cause trade-offs between caching behavior, compile time, and runtime performance, sometimes leading to crashes or incorrect behavior during training with optimizers like Adam.
    • issues/159228, issues/159229, issues/159230
  • Installation and Build Process Improvements: There are requests to simplify PyTorch installation by adding support for Python bindings installation via CMake and removing outdated setup.py references in documentation. These changes aim to reduce reliance on setuptools and patching, and to keep documentation consistent with current build practices.
    • issues/159232, issues/159234
  • Performance Regressions and Discrepancies: Multiple reports describe significant performance regressions, including slower torch.jit.trace() on newer GPUs, FP16 autocast causing slower inference than FP32, and a major slowdown in torch.matmul on CPU. Additionally, numerical discrepancies occur between CPU and GPU executions and in specific attention implementations, indicating precision and optimization issues.
    • issues/159238, issues/159246, issues/159309, issues/159346, issues/159551
  • Torch.compile and Inductor Backend Bugs: Several issues report bugs triggered by torch.compile with the Inductor backend, including C++ compilation errors due to conflicting buffer declarations, stride validation errors with certain operations, incorrect output shapes or NaNs in gradients, and failures caused by attribute shadowing or unsupported operators. These bugs affect model compilation correctness and runtime stability.
    • issues/159239, issues/159245, issues/159445, issues/159460, issues/159462, issues/159469
  • Distributed and Parallelism Memory and Caching Concerns: Issues discuss memory usage and caching strategies in distributed training, including optional caching of forward outputs in pipeline parallelism to reduce GPU memory, failures in distributed RPC tests due to missing files, and memory fragmentation metrics becoming unavailable with certain allocator options. These highlight challenges in efficient memory management in distributed contexts.
    • issues/159251, issues/159354, issues/159564
  • Profiling and Debugging Tool Limitations: Challenges with profiling tools like Kineto are reported, specifically the inability to selectively profile single threads in multithreaded environments due to how CUDA events are captured. This limits fine-grained performance analysis in complex applications.
    • issues/159256
  • Runtime Errors and Crashes in Compilation and Execution: Multiple issues describe runtime errors such as crashes when accessing weakref proxies during compilation, shape mismatches in conditional model branches, and errors caused by in-place operations on leaf variables during export. These errors often occur only in compiled or exported modes, not in eager execution.
    • issues/159258, issues/159353, issues/159623
  • ONNX Export and Model Interoperability Issues: Problems with exporting PyTorch models to ONNX are reported, including invalid models due to missing operator decompositions and incorrect resize operations in exported models, leading to runtime failures or incorrect inference outputs.
    • issues/159295, issues/159468
  • Memory Management and CUDA Allocator Bugs: Several issues highlight bugs and limitations in CUDA memory management, such as crashes caused by improper block release order, memory leaks during training with CudaGraphs, and OOM errors due to lack of retry on allocation failure in custom allocators. These affect stability and performance on GPU devices.
    • issues/159567, issues/159669, issues/159674
  • Documentation and Testing Improvements: Requests include correcting inaccurate documentation for functions like torch.lu() and torch.sub(), adding more tests for specific backends and hardware, and improving CI coverage for features like NCCL registration and Windows ROCm support. These aim to enhance usability and reliability.
    • issues/159616, issues/159637, issues/159510, issues/159520, issues/159535
  • Dynamic Shape and Graph Partitioning Limitations: Issues report lack of support for dynamic shapes in conditional subgraphs and DTensor compilation, causing failures or regressions when using autograd or graph partitioning features. These limitations restrict flexibility in model design and compilation.
    • issues/159381, issues/159590, issues/159635, issues/159709
  • Distributed Communication Backend Challenges: Problems with NCCL backend support for peer-to-peer communication, conflicts when mixing NCCL and GLOO backends, and missing test executions in CI pipelines are reported, indicating ongoing challenges in distributed communication infrastructure.
    • issues/159559, issues/159563, issues/159535
  • Compiler and Graph Rewriting Bugs: Bugs in graph rewriting functions cause invalid graph states due to improper handling of consecutive pattern matches, and incomplete symbol tracking in extern kernels leads to undefined symbol errors during compilation. These affect the correctness of compiler transformations.
    • issues/159613, issues/159685
  • Distributed Checkpointing and RPC Hanging Issues: Using asynchronous checkpoint handles in distributed settings can cause indefinite hangs, and RPC tests fail due to missing files and timeouts, indicating robustness issues in distributed checkpointing and RPC mechanisms.
    • issues/159700, issues/159354
  • Windows and Cross-Platform Compatibility Problems: Issues include Unicode decode errors during compilation on Windows, unnecessary .lib files in Windows wheels, and subprocess hangs in conda environments with specific Python versions, highlighting cross-platform build and runtime challenges.
    • issues/159537, issues/159514, issues/159645
  • Miscellaneous Bugs in Tensor Operations and APIs: Various bugs are reported such as incorrect copying of NaN values in bfloat16 on CUDA, failure of decorators when functions are called as object attributes, and issues with torch.repeat_interleave on MPS devices under compilation. These affect correctness and usability of tensor operations.
    • issues/159333, issues/159372, issues/159408

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 51

Summarized Issues:

  • Test Failures and Disabling on Multiple Platforms: Several tests have been disabled due to flakiness and consistent failures across different platforms including ROCm, NVIDIA, and XPU. These include issues with test_graph_partition_reorder_custom_op_with_no_dependency, test_einsum_to_pointwise, and multiple XPU-related tests, reflecting ongoing stability challenges in CI and test suites.
    • issues/157871, issues/158546, issues/159000, issues/159330, issues/159331, issues/159332, issues/159334, issues/159335
  • Backend and Hardware-Specific Bugs and Crashes: Multiple issues report crashes, incorrect results, or regressions on specific hardware or backends such as CUDA, ROCm, AMD MI300X, Apple MPS, and Triton. Problems include torch.nansum crashing on CUDA complex32 tensors, torch.cumsum instability on Triton backend, segmentation faults on AMD MI300X, and NaNs from F.gumbel_softmax on MPS devices.
    • issues/158003, issues/158182, issues/158635, issues/159070, issues/159103
  • Graph and Inductor Compiler Issues: Several problems relate to PyTorch's inductor compiler and graph tracing, including errors from unsupported aliasing, incorrect memory layout handling, and bugs in custom operator input contiguity. These issues cause test failures and incorrect results, requiring fixes such as cloning outputs or refactoring input handling.
    • issues/158375, issues/158892, issues/159097
  • Distributed and Parallelism Bugs: There are issues with distributed operations such as torch.distributed.gather producing incorrect results on noncontiguous tensors with both Gloo and NCCL backends, and problems with process group destruction causing unexpected GPU memory allocation. Additionally, the context_parallel API incorrectly assumes tensor sharding requirements.
    • issues/158902, issues/159548, issues/159262, issues/159634
  • Documentation and Localization Concerns: Several issues highlight documentation inconsistencies, minor typos, and discussions in Arabic without clear resolutions. There are also proposals to update or remove outdated quantization documentation and clarify parameter constraints in pooling functions.
    • issues/159141, issues/159338, issues/159339, issues/159340, issues/159375, issues/159528
  • Performance Regressions and Slowdowns: Performance issues include a 40x slowdown in C++ tensor row indexing, increased Windows CPU build times due to Visual Studio 2022 changes, and Inductor autotuning interfering with Triton benchmarking, all indicating regressions or inefficiencies in build or runtime performance.
    • issues/159222, issues/159082, issues/159525
  • CUDA and GPU Support Issues: Problems include missing support for NVIDIA RTX 5060 Ti GPUs in stable CUDA 12.8 builds causing kernel errors, and CUDA runtime not found warnings in CUDA EC2 runners despite driver installation, affecting reliability of test environments.
    • issues/157844, issues/159446
  • ONNX Export and Operator Support Limitations: Difficulties exporting RMS Norm operators to ONNX due to opset version constraints and missing operator implementations hinder model interoperability and require adding support for newer ONNX opsets.
    • issues/159249, issues/159257
  • Code Quality and Warning Fixes: Issues include a DeprecationWarning for use of co_lnotab needing replacement with co_lines(), false positives from the set_linter tool on f-strings, and inconsistent error message terminology regarding boolean tensor ambiguity.
    • issues/158833, issues/159056, issues/159710
  • Build and Packaging Concerns: A significant increase in Windows wheel size due to debug info embedding and incorrect installation of CUDA libraries when installing CPU-only PyTorch versions highlight packaging and distribution issues that affect user experience and storage.
    • issues/159515, issues/159560
  • Tensor Operation Bugs and Numerical Discrepancies: Bugs include incorrect output from torch._grouped_mm kernels under certain modes, numerical discrepancies between CPU and GPU for InstanceNorm2d and max_unpool1d, and silent incorrect results from custom operators with float8 inputs due to contiguity issues.
    • issues/159378, issues/159310, issues/159314, issues/158892
  • Miscellaneous Code and API Questions: Questions about the use of swap in NCCLUtils.cpp, caching behavior of _cuda_getDeviceCount, and unexpected behavior in is_nonzero(input) error messages indicate areas needing clarification or improvement in codebase consistency.
    • issues/159248, issues/159641, issues/159710
  • CI Infrastructure and Test Environment Stability: A major disruption in PyTorch CI, caused by Linux Foundation runner-fleet provisioning failures (advanced SSM parameter policies were not enabled in the LF AWS account), led to large PR job queueing delays and required falling back to the Meta fleet and reverting the changes.
    • issues/159290
  • Model Prediction Inconsistencies Across Versions: Users report inconsistent prediction results from the same visual model and input image when running inference under different PyTorch versions, raising concerns about reproducibility and version-related behavioral changes.
    • issues/159351

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 177

Key Open Pull Requests

1. setup [Do not review]: This pull request is focused on initial setup tasks for the project, as indicated by the repeated commit messages titled "setup" and the instruction not to review it, suggesting it is preparatory work rather than feature development or bug fixes.

  • URL: pull/159636
  • Merged: No
  • Associated Commits: 4ce19, de4e4, 7c9b5, 40db2, 44a17, 289d2, cad31, f3488, 962be, 33891, 2dff6, 4076d, 39e82, 45b61, a044b, e8250, a2259, 2c469, 1dcdc, 38a06, 2e312, b5392, 684f1, 77e41, de9a4, f5892, 9d963, 71ccf, 04da9, e51f1, 5b4b6, 8f628, 80a3c, 0d27d, 9e5ae

2. Add dynamic shapes doc: This pull request adds comprehensive new documentation for Dynamic Shapes in the PyTorch project, including a structured layout with an introduction, core concepts, and troubleshooting sections, along with multiple updates and refinements to improve clarity and usability.

  • URL: pull/159428
  • Merged: No
  • Associated Commits: 0ec6f, cc143, 033c8, abad1, 7edc8, 4ab42, 6169e, dce0c, 88224, 78ec9, 7277c, ede7d, 6e23c, afd42, 843b8, 1ddc2, 822b0, ab71b, b59ea, 0b14d, 2396a, fdc90

3. [dynamo][guards] Make class members go through obj.__class__.__dict__: This pull request aims to modify the dynamo guards in the PyTorch project so that class member accesses are routed through the object's class dictionary (obj.__class__.__dict__) to improve attribute resolution.

  • URL: pull/159534
  • Merged: No
  • Associated Commits: 905c8, d094d, 08f0a, d8ca7, a83fe, 8f510, 8a2be, 72fe8, 50343, 1f358
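The motivation for routing guard lookups through obj.__class__.__dict__ is that plain attribute access resolves through the instance first and can be shadowed, whereas the class __dict__ pins down the class-level definition a guard wants to watch. A small pure-Python sketch (the Model class is a hypothetical stand-in):

```python
class Model:
    scale = 2  # class-level member a guard might want to track

m = Model()
m.scale = 99  # instance attribute shadows the class member

# getattr resolves through the instance first...
via_getattr = m.scale
# ...while the class __dict__ sees only the class-level definition.
via_class_dict = type(m).__dict__["scale"]
```

Reading through type(m).__dict__ also bypasses __getattr__ hooks and descriptors on the instance path, which is the kind of stable lookup a guard check prefers.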

Other Open Pull Requests

  • Nested Graph Handling in Dynamo: Multiple pull requests improve PyTorch Dynamo's handling of nested graphs by modifying the resume function call to use the CALL_FUNCTION_EX opcode and adding support for simple nested graph breaks. These changes fix breakages related to nested graph execution and enable graphs with nested calls that include graph breaks to break correctly once.
    • pull/159281, pull/159329
  • Intel GPU and XPU Backend Support: Several pull requests enhance Intel GPU and XPU support by enabling SDPA backend selection and priority setting, porting distributed tests to support Intel GPU, adding XPU device support in torchrun's --nproc-per-node option, and unifying allocator configurations for CUDA and XPU. These updates improve backend detection, device selection, and configuration sharing across different hardware.
    • pull/159464, pull/159473, pull/159474, pull/159553
  • Memory and Buffer Management in Inductor Scheduler: A pull request addresses inaccurate memory estimation in the Inductor scheduler by ensuring buffers with multiple non-overlapping mutations are only deallocated after all aliases are freed. It also introduces runtime verification to track buffer allocation and deallocation lifecycles to detect mismatches, preventing double counting of memory usage.
    • pull/159569
  • Hugging Face Sharded File Consolidation Improvements: Multiple pull requests optimize Hugging Face sharded file consolidation by enabling all ranks to participate in consolidation, buffering entire tensors before writing to reduce small writes, and replacing f.read() with safe_open for improved performance. These changes significantly speed up consolidation and improve compatibility with storage systems.
    • pull/159393, pull/159394, pull/159395
  • Distributed and Parallel Computing Enhancements: Updates include skipping unnecessary CUDA synchronizations and allgather operations in FSDP when world size is 1, adding a distributed job for the B200 CUDA runner in CI, and introducing threading support in the "cute" based flash attention implementation. These changes improve efficiency, testing infrastructure, and flexible attention mechanisms.
    • pull/159417, pull/159323, pull/159521
  • Platform and Build System Updates: Pull requests add OSX and Windows support to OpenReg by abstracting platform-specific APIs and setting default symbol visibility, fix and reland TorchBench setup in Docker environments while maintaining macOS CI compatibility, and improve Windows compilation and test cases. These changes enhance cross-platform support and CI reliability.
    • pull/159441, pull/159300, pull/159379
  • User Guide and Documentation Improvements: A pull request introduces a placeholder for the User Guide by adding new markdown files and reorganizing the top navigation to include sections like Get Started, User Guide, Reference API, Community, and Tutorials. Notes are relocated under the User Guide section to improve documentation structure.
    • pull/159379
  • Data Structure Fixes in Collections: Fixes related to collections.Counter and collections.NamedTuple are addressed in separate pull requests as part of stacked changes, improving the reliability of these data structures within the codebase.
    • pull/159368, pull/159367
  • Overflow and Indexing Safety Checks: A pull request adds an overflow validation check in the pad_sequence function to prevent integer overflow with int64 tensors and throws a RuntimeError if the padding value exceeds limits. Another pull request prevents the use of int32 indices when the upper bound exceeds int32 max value, ensuring safer indexing.
    • pull/159589, pull/159433
  • DeviceMesh API Enhancements: A new _split API for DeviceMesh is introduced, allowing creation of device meshes without a backend and managing splitting with bookkeeping of sub-mesh dimensions and process group reuse. Unit tests accompany the feature, which currently has some limitations and seeks early feedback.
    • pull/159482

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 240

Key Closed Pull Requests

1. [Release/2.7] pin requirements: This pull request is about pinning the package requirements for the release/2.7 branch of the PyTorch project to ensure consistent dependency versions.

  • URL: pull/159650
  • Merged: No
  • Associated Commits: e294d, 04c7c, 414fc, ff69f, c0fde, 41d64, f31bd, ae842, 4e346, ef226, 40f0d, a80d3, 06077, cced6, de84f, 54d00, 39a79, 8d7ae, 7010d, ad7a2, d458c, 6fd40, 030d6, 7fe67, 62f12, 943cc, 39c25, 55ef4, 47c27, 5df0d, ee96e, 823b1, 1f8a9, 3f73e, f717b, dec2e, 90091, 3cddd, f001a, 6e62a, 1cb81, 9030e, 2c220, 03c7d, 1fee1, 02cee, 92d6d, e0afc, 83133, 7a876, bbd01, 38f2b, a9d0d, 77a7b, f0c1c, 4c858, 5ebff, 17364, 2337d, bf007, ba48d, d17e2, 189aa, e867a, 5631e, 83049, d62a3, 68990, 8a12d, 3fc00, c7ce5, 197c9, b5d59, 9412d, c17ce, 34f3b, 13520, 06c10, 49675, d598f, 8e450, 575e2, 7edf5, c7a1e, 66726, 2a215, 0bd40, df38c, 7a768, 509a6, a4d60, 6fba5, dce73, 4d586, e725e, 7f01c, 4c00e, 866cc, 9434e, b6228, f86d1, 3ea89, b2571, fc756, 22c98, cd0f7, fe3d3, 30508, f07b7, 6b52d, 35dae, d5542, 60111, a929f, be95f, 1cd45, f0534, f0aeb, 6c845, 44c0e, faae1, 29c62

2. [do not review] Add vllm build: This pull request proposes adding a vllm build setup workflow to enable building the vllm project against PyTorch, although it was not merged.

  • URL: pull/158797
  • Merged: No
  • Associated Commits: 4f2b5, 2c1ab, 558d9, bfa14, 222d4, 9028d, 8c683, 7f761, 5f445, 9cb70, 2cfed, 98484, 2066c, 479cc, 70eef, f2e5f, 5e58d, 33716, db5b1, e6923, e8542, 6131d, 97752, 2c8a5, 4b131, 0e16c, f8f76, 4f28a, 22660, 1159e, 1d587, 7856a, 46db8, 8c350, 5f5b0, 93e72, c8726, 6b8c0, 74581, 5d4e6, 96ab5, 5544c, 5dbbd, ed9c9, e06ce, 1886b, 5f669, 484e6, f128a, 131fc, d2b75, a63f2, 9c33d, 63bbf, 3924b, 0161b, 5c8c2, 77b0c, 56184, 0800e, 6c912, d4023, 22cf0, 8d0ee, a1759, 43d2f, 94240, df7ad, 67ec8, 31ef7, 1c23a, 28b4e, 5a394, fcf17, 77d9b, 0d06f, 5e2cb, d3594, abb7c, f5dc8, 3cd02, 6659b, 28a73

3. [DO NOT MERGE] Test New MI325X Capacity.: This pull request is a non-mergeable test aimed at evaluating additional capacity for the MI325X hardware by updating and renaming various ROCm-related CI configuration files and triggering multiple continuous integration runs.

  • URL: pull/159059
  • Merged: No
  • Associated Commits: 44b27, 266ab, c17b5, 6a7a1, fdb49, 1e274, 570dd, 61a20, e0abf, 0dacf, a1b0e, 5d8fd, bce86, 3f886, decda, 4f343, b48c0

Other Closed Pull Requests

  • Function replacement in script_object.py: This pull request proposes replacing the usage of the unimplemented function with unimplemented_v2 in the torch/_dynamo/variables/script_object.py file to address part of issue #147913. The change was intended to improve code correctness but was not merged.
    • pull/159343
  • Zero-size constant arrays in assembly: This pull request introduces the get_zero_consts_asm_code function to handle zero-size constant arrays in assembly code for Windows and Linux platforms. It addresses compiler limitations such as MSVC's error on zero-size arrays and alignment restrictions in Win32 assembly, ensuring compatibility with C++ standards and PyTorch's alignment requirements.
    • pull/159225
  • Deprecation of pin_memory_device in DataLoader: This pull request proposes deprecating the pin_memory_device parameter by moving pin memory enabling back inside the _BaseDataLoaderIter class to support StatefulDataloader usage. It also includes a test for CPU-only environments where pin_memory=True has no effect.
    • pull/158323
  • Inductor backend lowering for repeat_interleave.Tensor: This pull request adds a lowering implementation in the Inductor backend for the repeat_interleave.Tensor operation when an output size is specified. This enhances support for this operation in PyTorch's compilation stack.
    [pull/158462]
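
    The semantics being lowered can be illustrated with a minimal pure-Python sketch (the function name and signature here mirror the operation's behavior for illustration only; this is not the Inductor lowering itself). The point of a known output size is that the compiler can preallocate the result buffer instead of computing the sum of repeats at runtime:

    ```python
    # Pure-Python sketch of repeat_interleave semantics with an explicit
    # output size: element i of `values` is repeated repeats[i] times.
    def repeat_interleave(values, repeats, output_size=None):
        if output_size is None:
            output_size = sum(repeats)
        out = [None] * output_size  # known size enables static allocation
        i = 0
        for v, r in zip(values, repeats):
            for _ in range(r):
                out[i] = v
                i += 1
        assert i == output_size, "output_size must equal sum(repeats)"
        return out

    print(repeat_interleave([10, 20, 30], [1, 0, 2], output_size=3))
    # [10, 30, 30]
    ```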
  • Addmm fusion optimization performance testing: This pull request attempts to add an addmm fusion optimization for performance testing, but was not merged because workloads could not be triggered on branches outside the main repository.
    [pull/159182]
  • ROCm CK Inductor backend gfx950 support: This pull request enables support for the gfx950 architecture in the ROCm CK Inductor backend by updating autotuning configurations, fixing code generation for conv2d and CK-tile matmul, adapting fp8 data types, and cleaning up tests.
    [pull/159195]
  • Strategy hashing argument fix: This pull request fixes a mismatch in strategy hashing arguments by ensuring the hashing output depends only on input variables affecting output sharding strategy. This prevents incorrect reuse of cached variables, improves strategy cache hits, and removes the need for specifying static arguments in RuntimeSchemaInfo for some operations.
    [pull/159289]
  • RMS normalization implementation and testing: This pull request implements RMS normalization using the ONNX RMSNormalization operator with correct epsilon handling for float32 precision. It extends testing by integrating a reference runtime through ONNXProgram and assert_onnx_program to ensure accurate functionality.
    [pull/159377]
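
    For reference, the computation being exported can be sketched in plain Python (names and shapes here are illustrative, not taken from the ONNX operator implementation). The epsilon sits inside the square root, which is the handling the PR verifies for float32 precision:

    ```python
    import math

    # Reference sketch of RMS normalization:
    #   y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    def rms_norm(x, weight, eps=1e-5):
        mean_sq = sum(v * v for v in x) / len(x)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps inside the sqrt
        return [v * inv_rms * w for v, w in zip(x, weight)]

    out = rms_norm([3.0, 4.0], [1.0, 1.0])
    print(out)  # roughly [0.8485, 1.1314]
    ```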
  • Inductor set_linter fix for Python 3.12+: This pull request fixes the [inductor] set_linter functionality to properly handle f-strings for Python versions 3.12 and above, resolving issue #159056.
    [pull/159252]
  • MI355 CI regression and hiprtc kernel compilation fix: This pull request addresses MI355 CI regression and hiprtc kernel compilation failures caused by duplicate trait definitions by modifying the trait check to use the HIP version instead of the ROCm version. It also increases the MI355 CI run frequency to twice daily to better detect regressions.
    [pull/159292]
  • Reduction of composable kernel (ck) kernels: This pull request proposes changes to reduce the number of composable kernel (ck) kernels generated, including kernel generation updates, API modifications, and removal of duplicate files. These changes depend on a prior merge in the ROCm composable_kernel repository.
    [pull/157964]
  • OpenReg support for OSX and Windows: This pull request proposes adding support for OSX and Windows platforms to the OpenReg component but was not merged.
    [pull/159029]
  • Flake8 F824 fix in torch directory: This pull request fixes flake8 rule F824 by removing unnecessary global and nonlocal declarations in the torch/ directory. It clarifies that these keywords are only needed when assigning to a variable in a local scope, not when modifying the variable's state via methods.
    [pull/159119]
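
    The distinction behind rule F824 can be shown in a few lines (the names here are invented for illustration): a global declaration is required only when a function rebinds the name, not when it mutates the object the name refers to.

    ```python
    _registry = []
    counter = 0

    def record(item):
        # No `global _registry` needed: .append() mutates the list in place
        # and never rebinds the module-level name.
        _registry.append(item)

    def bump():
        # `global counter` IS needed: `counter += 1` rebinds the name.
        global counter
        counter += 1

    record("a")
    bump()
    print(_registry, counter)  # ['a'] 1
    ```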
  • Dynamic reduction kernels for MPS backend: This pull request introduces dynamic reduction kernels for the Metal Performance Shaders (MPS) backend in PyTorch's AOT Inductor model. It enables efficient summation operations with both static and dynamic kernel implementations to improve performance on Apple GPUs.
    [pull/159355]
  • Nightly PT2 benchmark enablement on B200: This pull request resumes and finalizes enabling the nightly PT2 benchmark on the B200 platform, continuing previous work and including various testing and adjustments to support this feature.
    [pull/158011]
  • Simplification of GEMM Triton parameter handling: This pull request simplifies loop iteration and centralizes retrieval of common GEMM Triton parameters by extracting the shared logic that converts BaseConfig objects into keyword arguments. This provides a single modification point for all GEMM Triton templates, avoiding inconsistencies.
    [pull/158015]
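
    The pattern described can be sketched as follows; the BaseConfig fields and key names below are invented for illustration and are not the PR's actual definitions. The idea is that every template calls one converter instead of hand-copying fields:

    ```python
    from dataclasses import dataclass

    # Hypothetical shared config; real BaseConfig fields differ.
    @dataclass
    class BaseConfig:
        block_m: int
        block_n: int
        block_k: int
        num_stages: int

    def config_to_kwargs(cfg: BaseConfig) -> dict:
        # Single modification point: all GEMM templates read the same
        # parameters through this one function.
        return {
            "BLOCK_M": cfg.block_m,
            "BLOCK_N": cfg.block_n,
            "BLOCK_K": cfg.block_k,
            "num_stages": cfg.num_stages,
        }

    print(config_to_kwargs(BaseConfig(64, 64, 32, 3)))
    # {'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 32, 'num_stages': 3}
    ```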
  • Fix for itertools accumulate function: This pull request aims to fix issues in the itertools accumulate function but was not merged.
    [pull/158774]
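
    For context, the standard-library behavior the PR concerns (Dynamo's handling of this function, not the stdlib itself) is:

    ```python
    from itertools import accumulate
    import operator

    # Running sums (the default), a custom binary function, and an
    # initial value that is prepended to the output.
    print(list(accumulate([1, 2, 3, 4])))                # [1, 3, 6, 10]
    print(list(accumulate([1, 2, 3, 4], operator.mul)))  # [1, 2, 6, 24]
    print(list(accumulate([1, 2, 3], initial=100)))      # [100, 101, 103, 106]
    ```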
  • nn.Parameter constructor semantic change in Dynamo: This pull request proposes a semantic change to the nn.Parameter constructor in PyTorch's Dynamo by defaulting to a graph break with an error message when the constructor lacks a clean source. This improves clarity and reduces complexity in graph construction and debugging, while allowing users to revert to the old behavior via a configuration flag.
    [pull/158800]
  • MPS backend avg_pool3d operation support: This pull request proposes adding avg_pool3d operation support for the Metal Performance Shaders (MPS) backend but was not merged.
    [pull/158877]
  • Dynamo fullgraph=False documentation: This pull request proposes adding documentation for the fullgraph=False option in Dynamo but was not merged.
    [pull/159050]
  • Dynamo recompilation and observability documentation: This pull request adds documentation related to Dynamo recompilation, observability, and reporting issues to improve user understanding and troubleshooting.
    [pull/159062]
  • Flake8 F824 fix in test directory: This pull request fixes flake8 rule F824 in the test directory by removing unnecessary global and nonlocal declarations. It clarifies that these keywords are only needed when assigning to a variable in a local scope, not when modifying the variable's state through methods.
    [pull/159120]
  • Inductor backend layout tag respect fix: This pull request addresses an issue in the PyTorch inductor backend where layout tags for operations with registered lowerings, such as scaled_grouped_mm requiring column-major layout, were not respected. It ensures these tags are properly considered to fix stride order problems.
    [pull/159134]
  • Fused RMSNorm feedback and warning addition: This pull request addresses feedback from the original fused RMSNorm implementation by adding a warning for input and weight data type mismatches and ensuring the default epsilon value is correctly set.
    [pull/159317]
  • Complex number implementation header-only move: This pull request proposes moving the complex number implementation to a header-only format, as indicated by a series of commits, but was not merged.
    [pull/159411]

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests from the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor Commits Pull Requests Issues Comments
malfet 88 10 6 94
ezyang 76 15 3 57
guangyey 82 11 1 46
yangw-dev 128 5 3 1
XuehaiPan 106 12 0 8
anijain2305 93 17 0 4
wconstab 48 5 1 59
janeyx99 67 20 3 14
Skylion007 12 3 0 86
xuhancn 81 7 1 10

Don't miss what's next. Subscribe to Weekly Project News:
Powered by Buttondown, the easiest way to start and grow your newsletter.