Weekly Project News


Weekly GitHub Report for Pytorch: August 18, 2025 - August 25, 2025 (12:02:55)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support, FlexAttention for X86 CPUs targeting LLMs, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside the deprecation of official Conda package publishing.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [Dynamo] Support Parameter subclass: This issue concerns enabling support for a subclass of Parameter in PyTorch's Dynamo compiler, specifically allowing a custom Parameter subclass with additional attributes to compile and run without errors. The user reports an error in PyTorch 2.7.1 when compiling a model using such a subclass, which is resolved in PyTorch 2.8.0, and the discussion explores workarounds and fixes for compatibility with earlier versions.

    • The comments clarify the intended use of the custom attribute outside the forward pass, confirm that the issue is fixed in PyTorch 2.8.0 but persists in 2.7.1, and share multiple workarounds including disabling __torch_function__ and custom implementations to bypass the error, with ongoing troubleshooting and version checks to ensure compatibility. A minimal sketch of this pattern appears after this list.
    • Number of comments this week: 13
  2. Eager CUDAGraph + stream performance: This issue discusses an unexpected latency difference observed when using eager CUDA graph capture combined with CUDA streams in PyTorch. Specifically, the user reports that performing warmup iterations in two separate blocks results in a slightly higher latency for adding two CUDA tensors compared to commenting out one of the warmup blocks, despite explicit synchronization calls, and this behavior appears to be architecture and driver dependent.

    • The comments include attempts to reproduce the issue on various hardware (H100, A100, L40S) and driver versions, with mixed results: some can reproduce the latency difference while others cannot. Contributors share detailed repro scripts, profiling insights, and driver information, concluding that the effect is subtle, possibly related to very small kernel execution times, and may not be significant on more realistic workloads.
    • Number of comments this week: 12
  3. compile with PrivateUse1 see tensors on "meta": This issue concerns a user encountering a runtime error when compiling a model with a custom backend device labeled "PrivateUse1" using torch.compile, where operations unexpectedly involve tensors on the "meta" device instead of the intended custom device. The user is questioning whether this behavior is a bug or expected, as the error arises before any backend-specific code is executed, particularly during calls to overridden aten operations like "view."

    • The comments discuss the rationale behind using the PrivateUse1 device as a new hardware backend, share a minimal reproducible example illustrating the problem with aten::view overrides, and explore potential workarounds involving conditional handling of meta tensors. Contributors acknowledge the issue as a likely bug related to how Dynamo tracing interacts with custom device implementations, noting that operations on meta tensors during tracing cause device mismatch errors, and suggest that the current behavior deviates from the intended design of fake tensors and meta device usage in torch.compile.
    • Number of comments this week: 6
  4. Running phi-2 on MacOS with returns garbage: This issue reports that running the "microsoft/phi-2" model on MacOS with the default data type results in garbage output, while switching to the bf16 data type resolves the problem on both CPU and MPS devices. The problem appears related to the handling of float16 precision, specifically during upcasting operations, and is not unique to MacOS as it also occurs on other architectures like aarch64.

    • The comments confirm the issue is reproducible beyond MacOS, specifically on aarch64 builds, and acknowledge the problem is linked to float16 precision handling during accumulation operations. Developers are investigating the root cause, noting that the issue does not occur on x86 architectures, and coordination is underway to avoid duplicated efforts.
    • Number of comments this week: 5
  5. Discrepancy between PyTorch and TensorRT when casting from FP32 to BF16 causes significant accuracy mismatch: This issue addresses a significant accuracy mismatch observed when casting from FP32 to BF16 between PyTorch and TensorRT, which affects reproducibility in models exported to ONNX and run with TensorRT in bfloat16 mode. The root cause appears to be differences in rounding behavior during the precision casting step, leading to diverging results and accumulated errors in deep models, particularly in attention-related computations.

    • The comments discuss the accumulation of errors across many layers in large models, question whether TensorRT follows a specific standard that PyTorch should adopt, and clarify that PyTorch’s rounding is mathematically correct but differs from TensorRT’s, causing discrepancies. Further clarifications seek to understand if such differences are expected and how to align results, with suggestions to investigate which BF16 kernels PyTorch uses during inference. A small worked example of the casting behavior appears after this list.
    • Number of comments this week: 5
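
Below is a minimal sketch of the Parameter-subclass pattern from the first issue above. The class and attribute names are illustrative, not taken from the issue; per the report, this kind of code errors on 2.7.1 but runs on 2.8.0.

    import torch
    import torch.nn as nn

    class TaggedParameter(nn.Parameter):
        """A Parameter subclass carrying an extra attribute used outside forward."""
        def __new__(cls, data, tag=None, requires_grad=True):
            param = super().__new__(cls, data, requires_grad)
            param.tag = tag
            return param
        # One workaround discussed in the thread is to disable __torch_function__
        # on the subclass, e.g.:
        # __torch_function__ = torch._C._disabled_torch_function_impl

    class TinyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.weight = TaggedParameter(torch.randn(4, 4), tag="pruned")

        def forward(self, x):
            return x @ self.weight

    compiled = torch.compile(TinyModel())
    out = compiled(torch.randn(2, 4))   # errors on 2.7.1, succeeds on 2.8.0 per the report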
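
And a tiny worked example of the casting behavior discussed in the last issue above: PyTorch's FP32-to-BF16 cast rounds to nearest even, so an exact tie goes to the even neighbor while a value just above it rounds up. The values below are chosen purely for illustration.

    import torch

    lo, hi = 1.0, 1.0 + 2 ** -7     # adjacent bfloat16 values around 1.0
    tie = (lo + hi) / 2             # the halfway point, exactly representable in fp32
    x = torch.tensor([tie, tie + 1e-4], dtype=torch.float32)
    print(x.to(torch.bfloat16))     # expected: [1.0000, 1.0078] under round-to-nearest-even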

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the 'triton.compiler.compiler' module while using PyTorch's torch._dynamo with the inductor backend. The error occurs during the compilation of custom pipeline components with torch.compile, indicating a possible mismatch or missing symbol in the installed Triton version or its integration with the PyTorch development build.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3. This method aims to reduce computational cost on the CPU by modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, and preliminary testing shows a speedup of approximately 1.3 times compared to the traditional approach. A quick numerical check of this equivalence appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp having permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot map a segment due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
  4. Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, and managing known formatting challenges, while also providing a detailed worklist organized by directory to coordinate and track progress on this large-scale code formatting effort.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, specifically .debug_pkl files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can significantly increase the model size—posing challenges for deployment on resource-constrained devices like mobile phones—while their removal does not affect the model's correctness.
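
The equivalence behind the second stale issue is easy to check numerically. The sketch below (a verification only, not the proposed implementation) confirms that, at stride 1, one kernel-5 MaxPool2d matches two chained kernel-3 pools:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)

    pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
    pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    # Max pooling uses implicit -inf padding, so the composition is exact.
    assert torch.equal(pool5(x), pool3(pool3(x)))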

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 106

Summarized Issues:

  • MPS Backend Compatibility and Limitations: Several issues highlight problems with the MPS backend, including failure to load models trained on MPS when running on CPU-only systems due to unsupported parameter conversions, lack of deterministic index operations causing test failures, and incorrect behavior in 1-byte copy operations on next-generation macOS due to unsupported copy methods. Additionally, manual parameter updates using .data on MPS do not update weights properly, causing training failures on Apple Silicon GPUs.
    • issues/160846, issues/161029, issues/161265, issues/161361
  • torch.compile and Inductor Backend Crashes and Errors: Multiple issues report crashes, segmentation faults, and runtime errors when using torch.compile with the Inductor backend, including faults triggered by large step sizes in torch.slice_copy, errors converting MKL-DNN tensors to dense, stride mismatches in compiled models, and failures with new FP8 scaling features. Problems also arise with custom backends on PrivateUse1 devices, and compilation errors occur in FSDP and vLLM models due to unsupported tensor types or shape mismatches.
    • issues/160868, issues/160873, issues/161010, issues/160909, issues/161127, issues/161120, issues/161244, issues/161153
  • XPU Platform Test Failures and Disabled Tests: Several tests in the TestInductorOpInfoXPU and AOTFxirTestCase suites have been disabled due to consistent failures on the XPU platform, affecting tests for various data types including float16, float32, and float64, as well as specific tests like test_aoti_fx_const, test_aoti_fx_linear, and test_data_dependent_failure.
    • issues/160946, issues/160947, issues/160948, issues/160951, issues/160969, issues/160970, issues/161038, issues/161162
  • Documentation and Installation Improvements: There are requests to update PyTorch documentation to replace pip3 with python3 -m pip or pip commands, add installation instructions for uv, clarify the units expected in torch.sin inputs, and improve softmax examples to be more comprehensive. Additionally, an installation error was reported for documentation dependencies on Ubuntu 24.04.3 with Python 3.12.3.
    • issues/160854, issues/160875, issues/160995, issues/161252, issues/160949
  • DTensor and Distributed Tensor Issues: Problems with DTensor include a severe output mismatch in sharded linear layers at low precision, failure of the broadcast_to function with KeyError, runtime errors in the view operation due to incorrect shape calculations, and silent ignoring of argument strategies for operators with multiple inputs. Additionally, a silent data inconsistency bug occurs in distributed communication when using slices of 2D tensor views.
    • issues/160911, issues/160968, issues/161091, issues/161218, issues/161324
  • torch.export Functionality and Export Errors: Multiple issues report failures in torch.export due to schema mismatches, unsupported keyword arguments, and invalid input values, including errors with add.Scalar and sub.Scalar operations, expand_copy with implicit=True, hardtanh with invalid min/max values, torch.var with negative correction, and inconsistent behavior in _cdist_forward compute modes.
    • issues/161076, issues/161080, issues/161081, issues/161083, issues/161089
  • Performance Regressions and Benchmark Instabilities: There are reports of significant CPU training slowdowns linked to dropout in PyTorch 2.7.0+, unstable inductor benchmark results with unclear regression causes, and slower vector-matrix multiplication on Nvidia Blackwell GPUs compared to Hopper GPUs. Additionally, a performance regression in Megatron T5 was traced to the record_function context manager in Inductor.
    • issues/161163, issues/161290, issues/161295, issues/161134, issues/161219
  • Graph Breaks and Dynamo Compilation Issues: Several issues describe graph breaks and compilation errors in the Dynamo compiler caused by operations such as module.to(...) inside forward, multiple backend availability checks, and CPU offloading in accelerate. Other problems include unsupported subclasses of Parameter and errors related to guard check failures during compilation.
    • issues/161207, issues/161211, issues/160886, issues/161105
  • Security and Testing Tooling Enhancements: A comprehensive security analysis identified critical vulnerabilities including JIT code injection and pickle deserialization flaws, with a roadmap for fixes. Additionally, proposals include better tooling for generating reproducible test cases for Dynamo graphs and options to disable specific failing inductor-periodic tests individually in CI.
    • issues/161327, issues/161330, issues/161281
  • CUDA and ROCm Compatibility and Build Issues: Issues include a build failure for CUDA 13.0 binaries on SM_75 architecture due to NVSHMEM linking errors, segmentation faults on AMD Radeon RX 7600 XT with ROCm 6.4.3 during float64 matrix multiplication, and an IndexError in CUDA architecture flag processing causing torchvision build failures.
    • issues/160980, issues/161256, issues/161358
  • Memory Management and Leak Issues: Bugs include a memory leak caused by early stopping exceptions in checkpointed custom autograd Functions, incorrect handling of nested torch.cuda.use_mem_pool() context managers leading to all allocations going to the first pool, and a RuntimeError in CUDA memory block management triggered by expandable segments configuration.
    • issues/161186, issues/161193, issues/161356
  • Attention and vLLM Model Compilation and Runtime Errors: Several issues report runtime errors and excessive recompilations in Falcon-7B models using vLLM, including errors in Triton kernel tracing due to unsupported numpy operations, illegal memory access in Triton IMA kernels, and stride or size mismatches in selective_scan custom operations, requiring patches that disable certain TorchDynamo compilations.
    • issues/161111, issues/161113, issues/161115, issues/161119
  • API and Feature Requests: Requests include adding support for torch._grouped_mm on new CUDA architectures (SM_120 and SM_89), adding installation commands for uv, supporting nn.SyncBatchNorm in torch.compile, adding partial shape inference for Nested Jagged Tensors' view(), and adding reshape support for sparse tensors to enable einsum with sparse linear operators.
    • issues/160875, issues/160891, issues/161302, issues/161287, issues/161357
  • Precision and Numerical Consistency Issues: Problems include accuracy mismatches between PyTorch and TensorRT in FP32 to BF16 casting, unexpected change of FlexAttention default precision from IEEE fp32 to TF32, NaN outputs from F.silu with negative infinity inputs in PyTorch 2.7.0+, and half precision versus full precision discrepancies in MPS operator tests.
    • issues/160856, issues/161022, issues/160876, issues/161283
  • Miscellaneous Bugs and Errors: Other issues include a bug where TunableOP leaves ROCM_VERSION blank causing CSV load failures, a TypeError in unpin_memory due to incorrect argument types, silent errors in tracing requires_grad metadata for inplace ops, and a runtime error caused by an assertion failure in MoE graph reuse due to unbacked symbolic integers.
    • issues/160874, issues/160983, issues/161275, issues/161276

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 45

Summarized Issues:

  • Performance regressions and compilation issues: Several issues report performance regressions and compilation problems in PyTorch, including a 9% throughput regression in Stable Diffusion FP32 mode on Intel Xeon CPUs and crashes or incorrect code generation by torch.compile with the inductor backend on CPU. These problems cause degraded performance or runtime failures, often linked to specific commits or operator fusion bugs that affect model execution and compilation correctness.
  • [issues/159121, issues/159154, issues/160882, issues/161060]
  • Memory and gradient calculation bugs: There are reports of incorrect gradient computations for very large tensors in torch.nn.functional.pairwise_distance and memory corruption due to wrong size inference in compiled C++ code. These bugs lead to incorrect mathematical results or corrupted memory states during execution, impacting model training and reliability.
  • [issues/159154, issues/159750]
  • Hook and attribute handling regressions: A regression in PyTorch 2.8 breaks support for side-effectful pre- and post-forward hooks that rename and restore module parameters by deleting and setting attributes, causing tracing errors due to unsupported __delattr__ usage. This change disrupts workflows relying on these hooks and is targeted for a fix in version 2.8.1.
  • [issues/159958]
  • Test failures and test disabling on XPU and GPU platforms: Multiple tests such as test_copy_non_blocking_is_pinned_xpu, test_comprehensive_nn_functional_interpolate_trilinear_xpu_float64/float32, and test_addmm_activation are disabled due to consistent failures on the XPU platform. Additionally, tests fail on NVIDIA H100 GPUs with CUDA 12.6, indicating platform-specific instability and correctness issues in the test suite.
  • [issues/160243, issues/160244, issues/160245, issues/160727, issues/160305]
  • Random number generator (RNG) state and DTensor semantics: There is ambiguity in DTensor’s RNG state management regarding whether user-passed RNG states should visibly advance and be synchronized across distributed ranks. Clarifying these semantics is necessary to ensure intuitive and consistent RNG behavior in distributed settings.
  • [issues/159991]
  • CUDA and NVSHMEM build and runtime errors: Several issues report build failures and runtime errors related to CUDA and NVSHMEM, including missing CUDA library files, undefined NVSHMEM linker references, and fatal errors when loading CUDA libraries with newer CUDA versions. These problems affect nightly builds and runtime stability on CUDA-enabled systems.
  • [issues/160762, issues/160877, issues/160972, issues/160657]
  • MPS backend correctness and operator support issues: The MPS backend exhibits multiple correctness bugs such as failing to handle scalar index tensors in index_select, incorrect behavior of torch.var on zero-dimensional inputs, incorrect AvgPool2d outputs, and zeroing out imaginary parts in complex tensor operations. These discrepancies cause test failures and numerical inaccuracies compared to CPU behavior.
  • [issues/160737, issues/160738, issues/160743, issues/160845, issues/160993]
  • Documentation and build guide improvements: The PyTorch website incorrectly states Python 3.12 as the highest supported version on Windows, and the build-from-source guide redundantly installs CMake and Ninja. Updating documentation and removing redundant instructions would improve clarity and accuracy for users.
  • [issues/160246, issues/160302]
  • Distributed and parallelism testing issues: The @require_world_size(4) macro is ambiguous and causes test failures when the actual world size differs, and questions remain about the fixed order of data parallelism and tensor parallelism dimensions in FSDP. These issues affect distributed testing reliability and understanding of parallelism semantics.
  • [issues/159987, issues/161006]
  • Synchronization and offload bugs in FSDP2: Enabling both HSDP and CPU offload in FSDP2 causes incorrect CPU-GPU synchronization during backward passes, resulting in non-deterministic NaN gradients due to a hardcoded stream mismatch. This bug was fixed by correcting the synchronization event stream.
  • [issues/160291]
  • Linker and dependency conflicts on Windows and ARM: The addition of a new Intel dependency caused nightly binary validation failures on Windows arm64 and Intel XPU builds due to duplicated Intel OpenMP metadata. Additionally, missing dependencies caused ModuleNotFound errors in windows-arm64 nightly builds, which were later resolved.
  • [issues/160962, issues/160898]
  • Task Manager and system monitoring anomalies: On Windows, disk I/O activity for multiple data loader workers is only shown for one worker in Task Manager despite multiple workers running, raising questions about whether this is a monitoring bug or expected behavior.
  • [issues/160963]
  • Tensor operation and compiler fusion bugs: Compiling functions using torch.bucketize with sliced buckets tensors causes incorrect results due to improper fusion of slice operations. Similarly, torch.compile crashes with assertion errors when operator inputs have shape mismatches not handled properly even with dynamic shapes enabled.
  • [issues/160964, issues/160882]
  • TensorBoard and visualization usability: TensorBoard lacks a direct download option for histogram visualizations when used with torch-pruning, forcing users to take screenshots to save visualizations, highlighting a usability gap in exporting data.
  • [issues/160954]
  • Model loading and import crashes: Loading Hugging Face transformer models on the main branch can cause aborts resolved by rebuilding torchvision, and importing internal quantization modules crashes on macOS with Apple M1 Pro CPUs due to outdated torchvision or local build issues.
  • [issues/161070, issues/161228]
  • CI/CD and workflow failures: Nightly validation workflows fail due to Docker image pull timeouts and intermittent build errors caused by changes in CI job responsibilities and unpinned dependencies, impacting continuous integration reliability.
  • [issues/161048, issues/160988]
  • Operator support gaps in sparse and complex tensor operations: The aten::_spsolve function is not implemented for SparseCsrCPU backend, causing NotImplementedErrors, and index_add_ on complex tensors zeroes imaginary parts on MPS backend, leading to numerical inconsistencies.
  • [issues/160813, issues/160845]
  • Runtime errors in specific environments: A RuntimeError "CUDA error: operation not supported" occurs when calling torch.ones on CUDA device in certain environments, possibly related to Megatron-LM integration, indicating environment-specific runtime issues.
  • [issues/161046]
  • Link and documentation fixes: Broken links in the CUDA-basics wiki were fixed by updating missing Dropbox presentation files with alternatives, improving documentation reliability.
  • [issues/160923]
  • Test failures in TorchAO and int8 fusion: An AssertionError in the TorchAO unit test for SDPA int8 fusion was reported and subsequently fixed, indicating ongoing maintenance of quantization-related test correctness.
  • [issues/161024]
  • Triton compilation and operator fusion workarounds: Users face Triton compilation errors caused by operator fusion in Inductor backend and seek robust methods to prevent fusion of specific operators to avoid invalid fused kernel code and NameErrors during compilation.
  • [issues/161060]
  • Runtime errors on MPS device: The constant_pad_nd function fails with runtime errors on MPS devices when called with empty padding lists, whereas it works on CPU, indicating incomplete MPS support for this operation.
  • [issues/161066]

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 221

Key Open Pull Requests

1. [Intel GPU] Enable backward for SDPA XPU [Don't merge, test only]: This pull request aims to enable the backward pass for the Scaled Dot-Product Attention (SDPA) on Intel GPUs within the PyTorch framework, primarily for testing purposes and not intended for merging at this stage.

  • URL: pull/161052
  • Merged: No
  • Associated Commits: 88afe, d2495, 8d749, ae0e8, f894a, a629b, 6eece, 48857, 6ef6b, 0fcff, 4f5ae, e29cf, cf33a, a8d4c

2. port distributed tensor parallel test files for Intel GPU: This pull request ports the distributed tensor parallel test files to support Intel GPU by enabling Intel GPU usage through torch.accelerator, skipping problematic cases on xpu devices, and maintaining the original code style. A brief device-agnostic sketch of the torch.accelerator pattern appears after these key pull requests.

  • URL: pull/161261
  • Merged: No
  • Associated Commits: 8a62c, f75ee, 30c95, 1f59b, 934e5, d1bc8, eac52, dc190, 688b7, 2658b, 57c0e, b8823, dbc56

3. [WIP] Introduce CachingDeviceAllocatorInterface as a base impl: This pull request proposes the introduction of a new base implementation interface called CachingDeviceAllocatorInterface to improve or standardize device memory allocation caching within the PyTorch project.

  • URL: pull/160878
  • Merged: No
  • Associated Commits: f08d4, 33781, 3147d, 8d2dc, 81238, d09d0, 1af45, 37140, 5b555, cfb2b, 7b2d1, da933
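
For context on the second key pull request, the snippet below is a small device-agnostic sketch of the torch.accelerator pattern it relies on; the tensors and operations are illustrative and not taken from the PR.

    import torch

    # Pick the available accelerator (CUDA, XPU, MPS, ...) or fall back to CPU.
    if torch.accelerator.is_available():
        device = torch.accelerator.current_accelerator()
    else:
        device = torch.device("cpu")

    x = torch.randn(8, 8, device=device)
    print(device, x.sum().item())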

Other Open Pull Requests

  • DeviceMesh Refactoring and CuTe Layout Integration: Multiple pull requests refactor the DeviceMesh internal bookkeeping by introducing the CuTe layout algebra from NVIDIA's Cutlass library, simplifying and generalizing index operations, and replacing existing mappings with layout-based groupings. These changes improve scalability and extensibility without altering existing behavior, supported by new backend initialization functions and detailed documentation with unit tests.
    • pull/161106, pull/161016
  • Flash Attention and CuTe DSL Enhancements: Pull requests add Flash Attention support to the FlexAttention module and enhance the CuTe DSL template renderer with new render functions and score modification capabilities. These improvements aim to test and exercise various attention mechanisms, providing detailed testing outputs comparing flash-enabled and disabled modes. A generic score-modification sketch appears after this list.
    • pull/161118, pull/161117
  • Caching Device Allocator Improvements: Two pull requests introduce a generic CachingDeviceAllocatorImpl for cross-backend use and improve the CUDACachingAllocator by reusing the existing CachingDeviceAllocatorInterface. These changes enhance code modularity and maintainability in PyTorch's memory management.
    • pull/160872, pull/160958
  • Inductor Template Heuristics Refactoring: Several pull requests restructure the inductor backend's template heuristics by moving them into a dedicated directory, breaking them into base and Triton components, and adding support for extra keyword arguments and workspace handling. These changes facilitate easier expansion, cleaner interception of template choices, and improved handling of kernel inputs.
    • pull/161044, pull/161097, pull/161123, pull/161124, pull/161093
  • Matrix Multiplication Support on Intel GPUs: A pull request introduces complex stubs for matrix multiplication operations on Intel GPUs, enabling complex data type support by implementing these operations in torch-xpu-ops using oneMKL due to the lack of oneDNN support. This expands PyTorch's capabilities for complex matmul on Intel hardware.
    • pull/160867
  • Dataloader Thread-based Workers: One pull request adds thread-based dataloading workers as an alternative to multiprocessing in PyTorch DataLoader, allowing users to select worker_method=thread for improved memory usage, faster startup, and better scalability with multiple datasets. This feature maintains backward compatibility and includes thread-local random number generation and unit tests.
    • pull/161026
  • XPU Quantized Kernel Support: A pull request introduces the _weight_fp8_mm operation to enable A16W8 matrix multiplication support for quantized kernels on XPU, supporting various activation and weight types with fp32 scales. This eliminates the need for dequantization before linear operations, improving efficiency over existing implementations.
    • pull/161045
  • Paged Attention Accuracy and Safety Fixes: One pull request fixes accuracy issues in paged attention when the key-value sequence length is not divisible by the block size by applying an upper mask mod and prevents invalid memory access in Triton kernels by adding boundary checks and early exit logic. These changes ensure correct attention outputs and safe kernel execution.
    • pull/160861
  • Dynamo Test Porting for Intel XPU: A pull request ports six dynamo test files to support Intel XPU by adapting device type detection, replacing CUDA-specific checks with general GPU checks, and introducing new wrapper methods for accelerator compatibility. It also disables unsupported features and fixes device type handling while preserving original code styles.
    • pull/160953
  • PyBind11 GIL Header Update: One pull request updates the C++ wrapper in PyTorch to use the new PyBind11 simple GIL header for managing the Global Interpreter Lock, modernizing the codebase's concurrency handling.
    • pull/161063
  • convert_frame.compile_frame Refactor: A pull request refactors the convert_frame.compile_frame function to be self-contained by changing its signature to accept frame information directly instead of a callback transform function. This simplifies integration of a fullgraph capture API on top of it.
    • pull/160900
  • CUDA 13.0 Periodic Test Addition: One pull request adds a basic periodic test for CUDA 13.0 to the PyTorch continuous integration system, including related build fixes, Docker adjustments, runner configuration for sm75 architecture, driver API updates, and suppression of deprecation warnings.
    • pull/161013
  • Polyfill and Iterator Enhancements: A pull request adds a polyfill for the _heapq module by redirecting to the Python implementation, improves error handling in PolyfilledFunctionVariable, and implements the __next__ method in the IteratorVariable class.
    • pull/161093
  • Typing and Abstract Base Class Improvements: One pull request proposes adding an abstract base class to the OrderedDictWrapper and clarifies typing for torch.nn.Module in the PyTorch codebase, improving code clarity and type safety.
    • pull/160888
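
To illustrate the score-modification idea referenced in the Flash Attention and CuTe DSL item above, here is a generic flex_attention sketch; it is a minimal illustration of the existing API, unrelated to the PRs' internal changes.

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    def rel_bias(score, b, h, q_idx, kv_idx):
        # add a simple relative-position bias to each attention score
        return score + (q_idx - kv_idx) * 0.01

    q, k, v = (torch.randn(1, 2, 128, 16) for _ in range(3))
    out = flex_attention(q, k, v, score_mod=rel_bias)   # usually wrapped in torch.compile for speed
    print(out.shape)   # torch.Size([1, 2, 128, 16])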

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 241

Key Closed Pull Requests

1. [Debug1]New(rebased) modifiy setupvllm: This pull request proposes a series of modifications primarily focused on the setup process of the vLLM module, including multiple commits labeled "setup," linter fixes, and the addition of tests, but it was not merged.

  • URL: pull/160627
  • Merged: No
  • Associated Commits: 9be74, 82fe9, 3cb29, 45665, 76db4, 934b4, 2f5e6, 2308c, 82e3d, 38cf2, 08be7, ebb81, 6fcac, ac1c9, 63b9d, 6fb76, 79e68, 9a0a5, 155b6, 50085, 93a0d, e83f0, ca21d, 9afcd, fcce9, d3ac9, 7b886, 150b4, 6231c, f1e5e, ade55, a27d7, be353, 14650, 44eca, 124f3, 7406d, ac5bd, 3cf15, 2c0de, 9fb92, 3bafa, 587ba, f9ae0, 62f78, 68abb, b31d5, 20022, 0cd59, ddd46, 8f19c, 9e5d8, 39966, a42cc, edf38, cab35, c845f, 76a8d

2. Debug #160583: This pull request is a debug update related to issue #160583 in the PyTorch project, containing multiple base updates and build configuration changes, but it was not merged and marked as not requiring review.

  • URL: pull/161140
  • Merged: No
  • Associated Commits: 67867, 7f537, 58c92, 4849e, 2ba50, c797f, 80ecc, 4bb8d, c734a, 70008, 8e0d3, 03b38, 2e7fa, 804a8, e488c, fccf6, cd7da, 1cff6, bb37a, 61b99, b139d, cec35, ce465, 18c32, 5cc5c, e787a, 56431, 0b368, 7d479, 9130e, eea58, 279b5, 3069b, de7a7, 43e10, 4a263, c2407, 02548, 46ae8, a5b26, bafe6, 3cc65, 0d53f, b54f1, 5bd48, e7d08, fb730, a2030, 5da98

3. [VLLM]setup test cli logics: This pull request proposes setting up the test CLI logics for VLLM by installing wheels from a previous build stage, dynamically generating and installing a VLLM test package list based on the Torch wheels present, and running tests according to a temporary predefined test plan for basic VLLM testing.

  • URL: pull/160361
  • Merged: No
  • Associated Commits: 80a66, b027b, 6b02c, 81d34, 537a7, 347f9, 5afb3, e80a5, 52f36, 0586b, 74033, 3c35f, c3176, 5d1c0, 45f23, 2f90d, 8a3f2, 67aad, effdb, eb4a2, 255b8, ba244, 3f5be, 20017, ec416, e227f, b56f5, 497de, 02618, 65e29, 44fea, 19c50, 43947, d9cba, bae14, 6d713, a704f, cd390, 84b71, b2366, fc107, ea0f0, d1f60, 73dae

Other Closed Pull Requests

  • Setup and Test Workflow Improvements: Multiple pull requests focus on setup enhancements and test workflow configurations across different components, including PyTorch and vLLM. These include unmerged setup branches, adding support for sm89, scheduling tests every 12 hours, and setting up test workflows, all aimed at improving testing infrastructure and setup processes.
    • pull/159636, pull/160583, pull/160625, pull/161192
  • CUDA 13.0 and Nightly Build Support: Several pull requests propose adding support for CUDA 13.0 on different platforms, including x86 Linux and Windows, with updates to Docker images, build scripts, and packaging for nightly builds. These changes enhance compatibility with the latest CUDA version and prepare PyTorch for nightly distribution with updated dependencies.
    • pull/160956, pull/161056, pull/161298
  • Dynamo and Compilation Fixes: Pull requests address improvements and bug fixes in PyTorch's Dynamo compiler, including adding a hint_override argument for dynamic shape hints, fixing filename display issues in stack traces, and logging exception stack traces to improve error visibility during compilation. These changes aim to enhance debugging and performance consistency in Dynamo.
    • pull/161007, pull/161073, pull/161096, pull/161056
  • Transformer and CuteDSL Kernel Development: There are pull requests focused on testing different transformer configurations and adding support for CuteDSL templates in the PyTorch compiler. These include implementing fixed CuteDSL templates for element-wise addition kernels with autotuning and testing transformer model variations.
    • pull/161079
  • Flex Attention and Decode Logic Updates: Two pull requests propose adding flash attention implementation to the flex attention module and fixing dispatch logic to correctly use flex attention instead of flex decode when group counts are not powers of two. These changes address performance and correctness in attention mechanisms.
    • pull/160108, pull/160109
  • Memory Estimation and SchedulerNode Improvements: A pull request proposes an alternative estimate_peak_memory function that accounts for multiple phases in SchedulerNodes and applies this to reorder communication passes, including tracking buffer deallocation and limiting collective reorderings. This aims to improve memory estimation despite unresolved peak memory regression issues.
    • pull/160904
  • ONNX Opset Compatibility Fix: One pull request fixes broken support for ONNX opset versions lower than 18 when using dynamo=True by modifying registry creation logic and requiring onnxscript version 0.4 or higher. This restores compatibility with older ONNX opsets.
    • pull/161056
  • Bucketing Logic Bug Fix: A pull request addresses a bug in bucketing logic that caused cycles by only considering direct arguments and ignoring transitive dependencies, fixing the cycle introduction issue.
    • pull/160967
  • Indexing Tests and MPS Backend Fixes: One pull request moves indexing tests to a dedicated module to enable running on the MPS device, marks some tests as expected failures due to unimplemented features, and fixes a hard crash and deterministic algorithm issues on MPS.
    • pull/160994
  • NCCL Config Exposure: A pull request exposes the unsafe_get_ptr pointer in dist.ProcessGroupNCCL.NCCLConfig to allow external creation and control of ncclConfig_t objects, facilitating management of multiple NCCL communicators.
    • pull/161136
  • Profiler Analysis Enhancement: A pull request introduces a new profiler analysis flag to combine multiple profile files into a single consolidated profile, especially useful for distributed program runs with different process IDs.
    • pull/161145
  • XPU CI and Inductor Test Fixes: One pull request attempts to fix broken test cases caused by community changes related to XPU CI and Inductor unit tests, addressing specific issues #160243, #160244, and #160245.
    • pull/160403
  • FSDP CPU-GPU Synchronization Fix: A pull request fixes a bug in the Fully Sharded Data Parallel implementation by replacing a hard-coded synchronization stream with post_reduce_stream.record_event() during HSDP with CPU offload, including adding a unit test to prevent failures.
    • pull/160481

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits   Pull Requests   Issues   Comments
yangw-dev          618       24              3        35
malfet             111       11              15       139
guangyey           65        12              0        71
guilhermeleobas    98        21              0        14
coconutruben       105       25              0        1
xuhancn            105       18              0        5
anijain2305        105       13              0        10
ezyang             59        14              2        51
janeyx99           55        8               3        44
wconstab           35        8               1        65
