Weekly GitHub Report for PyTorch: March 31, 2025 - April 07, 2025 (12:07:01)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates, including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward-compatibility-breaking change in the default behavior of `torch.load`.
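Since `torch.compiler.set_stance` is new in this release, here is a minimal sketch of how it can be used to temporarily change compilation behavior; the toy model and the `"force_eager"` stance name follow the 2.6 documentation, but treat the exact usage as illustrative:

```python
import torch

model = torch.nn.Linear(8, 8)   # toy model, illustrative only
compiled = torch.compile(model)
x = torch.randn(4, 8)

compiled(x)  # normal compiled execution

# Temporarily skip compilation and run eagerly (e.g. while debugging),
# without rebuilding the compiled module.
with torch.compiler.set_stance("force_eager"):
    compiled(x)
```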
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [export] run_decomposition does not preserve custom CompositeImplicitAutograd ops: This issue is about the failure of the `run_decomposition` function to preserve custom `CompositeImplicitAutograd` operations during ONNX export, which is causing problems for the user, who is trying to implement composite custom operations for ONNX/TRT export purposes. The user is seeking a solution to prevent these custom operations from being decomposed automatically, as they have specified a `custom_translation_table` that should ideally keep these operations intact for translation.
  - The comments discuss the unexpected decomposition of custom operations during ONNX export, with various contributors suggesting debugging steps and potential solutions. The conversation includes attempts to reproduce the issue, discussion of the behavior of `run_decompositions`, and suggestions to modify the decomposition table to exclude specific custom operations. There is also discussion of the need to adjust the ONNX decomposition registry to handle custom operations correctly.
  - Number of comments this week: 39
- Pytorch nightly Cuda 12.8 - 'too many resources requested for launch' with multiple layers of LayerNorm after strided Conv1d: This issue reports a bug in the PyTorch nightly build for CUDA 12.8, where a runtime error occurs due to "too many resources requested for launch" when running a backward pass through a model with multiple strided Conv1d modules followed by LayerNorm modules (a sketch of the failing pattern appears after this list). The error does not occur when the Conv1d modules have a stride of 1 or when only one strided Conv1d followed by one LayerNorm is used, and it is specific to CUDA, as it does not appear when using a CPU or when replacing LayerNorm with RMSNorm.
  - Multiple users confirm experiencing the same error with similar setups, particularly on NVIDIA RTX 5090 GPUs. A suggested workaround is to downgrade to PyTorch 2.7, which resolves the issue for several users. The problem is suspected to be related to a specific pull request, and a fix involving `__launch_bounds__` is proposed to reduce register usage.
  - Number of comments this week: 17
- Intermittent SSL certificate expiry warnings for download.pytorch.org (load balancer?): This issue reports intermittent SSL certificate expiry warnings when accessing download.pytorch.org, suggesting that the problem might be due to an expired certificate on a load-balanced node. The issue is not specific to any version and has been observed from various locations, indicating a potential problem with certificate propagation across different nodes.
  - Multiple users from different locations, including the UK, France, and Switzerland, report experiencing the same SSL certificate expiry issue. Some users suggest temporary workarounds, while others note that the problem persists intermittently, likely due to propagation delays in updating the certificate across all nodes. The issue is linked to a similar past problem, and users are advised to wait for the certificate to propagate fully.
  - Number of comments this week: 12
- CUTLASS backend updates: Instantiation level, long compilation and long autotuning time: This issue addresses the current status and challenges of the CUTLASS backend in the PyTorch project, focusing on its performance on H100 hardware. It highlights the backend's potential to outperform other solutions like ATen and Triton, while also discussing significant obstacles such as long kernel compilation and autotuning times, along with missing features that hinder seamless benchmarking and performance evaluation.
  - The comments discuss the potential use of CUTLASS's Python interface for generating C++ code, with some skepticism about its utility given the upcoming CUTLASS 4.x release. There is also mention of a talk on Python & CUTLASS 4.0, which suggests a new Pythonic DSL for writing GEMM kernels. Concerns are raised about the lack of performance support for Hopper and earlier architectures in CUTLASS 4 Python, with a clarification that only bug fixes will be provided for these older architectures.
  - Number of comments this week: 8
- [distributed] Crash when trying to use default PG after creating new PG: This issue describes a crash that occurs when attempting to use the default process group (PG) after creating a new process group in a PyTorch distributed environment. The user provides a Python script to reproduce the error, which involves initializing process groups with the NCCL backend and encountering a segmentation fault when calling `dist.barrier()` (see the sketch after this list).
  - The comments discuss attempts to reproduce the issue, with some users unable to replicate the crash on their systems. Suggestions include enabling detailed debug logs and checking for discrepancies in the environment setup. One user provides a gdb backtrace indicating a segmentation fault in the NCCL communication setup, suggesting a potential issue with process group initialization or cleanup.
  - Number of comments this week: 6
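For the LayerNorm-after-strided-Conv1d crash above, here is a minimal sketch of the reported failure pattern; the layer sizes are illustrative, and the exact shapes in the issue may differ:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Repeated strided Conv1d -> LayerNorm, the pattern the issue describes."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, stride=2)
        self.norm1 = nn.LayerNorm(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, stride=2)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):
        # LayerNorm normalizes the last dim, so transpose channels to the end.
        x = self.conv1(x)
        x = self.norm1(x.transpose(1, 2)).transpose(1, 2)
        x = self.conv2(x)
        x = self.norm2(x.transpose(1, 2)).transpose(1, 2)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
net = Net().to(device)
out = net(torch.randn(8, 64, 256, device=device))
out.sum().backward()  # the reported error occurs during this backward pass
```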
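And for the default-process-group crash, a hedged reconstruction of the reproduction described in the issue (the user's actual script may differ); it assumes a host with at least two GPUs and a `torchrun` launch:

```python
# Run with: torchrun --nproc-per-node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Create an additional process group covering the same ranks...
new_pg = dist.new_group(ranks=list(range(dist.get_world_size())))
dist.barrier(group=new_pg)

# ...then use the default group again; the issue reports a segfault here.
dist.barrier()

dist.destroy_process_group()
```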
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to resolve and close these issues as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a Python script that utilizes the OotdPipeline and involves compiling components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations (a quick equivalence check appears after this list). The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method.
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a script in a Docker environment with a `tmpfs` permission set to `1777`, where the execution of a cached `cuda_utils.so` file in the `/tmp` directory fails due to the absence of the execution bit, despite the directories having the correct permissions. The error occurs during the execution of a PyTorch model, specifically when attempting to execute the compiled CUDA utilities, resulting in an `ImportError` indicating a failure to map a segment from the shared object.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as approximately 1,500 files are currently not formatted according to the UFMT standards. The process requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes are committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature is to reduce the size of model files, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by a reduction from 6.7 MB to 5.6 MB in a test case.
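The MaxPool2D item above relies on stride-1 max pools composing: the max over a 5-wide window equals the max of maxes over overlapping 3-wide windows. A quick sanity check of the claimed equivalence (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# One 5x5 stride-1 max pool vs. two stacked 3x3 stride-1 max pools.
direct = F.max_pool2d(x, kernel_size=5, stride=1)
stacked = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                       kernel_size=3, stride=1)

print(torch.equal(direct, stacked))  # True: both reduce each 5x5 window
```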
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 99
Summarized Issues:
- Indexing Support for Sparse Tensors: The need for consistent and comprehensive indexing support for sparse tensors in PyTorch is highlighted, particularly the limitations of current methods like `index_select`, which only partially supports SparseCOO tensors with 1-dimensional indices. Additionally, there is a lack of support for direct indexing and for SparseCSR tensors, underscoring the importance of these features for efficiently handling sparse datasets.
- Output Inconsistencies in PyTorch: There are output inconsistencies between Eager mode and `torch.compile` using `mode="max-autotune-no-cudagraphs"` in PyTorch, where the outputs differ beyond strict tolerance levels. This potentially indicates numerical instability in backend optimizations, which could affect the reliability of results.
- Bugs in PyTorch Functions: Several bugs have been identified in PyTorch functions, such as `torch.utils.rename_privateuse1_backend("test")` triggering CUDA profiling warnings and a `CUPTI_ERROR_NOT_INITIALIZED` error, and `torch.fill` raising a `TypeError` due to an invalid combination of arguments. These issues highlight the need for better error handling and support for various function arguments.
- Integration and Optimization Proposals: Proposals have been made to integrate the ZenDNN library and Zentorch optimizations into PyTorch to enhance inference performance on AMD EPYC™ CPUs. This includes a phased approach for optimizing recommender systems and NLP workloads, with options for explicit or automatic enablement based on CPU architecture.
- Lint Rule and Internal Failures: A need for a lint rule to enforce the use of `std::optional` over `c10::optional` in the PyTorch project is discussed, as some internal failures might have been avoided with such a rule. Input is sought from specific contributors who have previously worked on related changes.
- Attention Mechanism and Memory Access: A common problem in PyTorch's attention mechanism is highlighted, where users encounter out-of-bounds memory access when using score modifications with block masks. This is due to the overlapping semantics of mask_mods and score_mods, and potential solutions are suggested to incorporate masking from mask_mods into score_mods to prevent invalid memory reads.
- Dynamic Shape and Compilation Issues: Dynamic shapes do not function correctly with keyword arguments (kwargs) in a PyTorch module, resulting in an error due to a mismatch between the expected top-level keys of the `dynamic_shapes` dictionary and the actual argument names (a sketch of the intended usage appears after this list). This affects models like rag and xlnet in Hugging Face, highlighting the need for better handling of dynamic shapes.
- Distributed Training and Initialization Errors: A RuntimeError is encountered when attempting to initialize a distributed training process using the "gloo" backend with nightly torch version 2.8, where the error message "makeDeviceForHostname(): unsupported gloo device" occurs. This affects both CUDA and CPU devices, with a stable workaround found using earlier nightly versions of torch.
- Memory Management and Profiling: A bug in the PyTorch function `torch.as_strided_scatter` is identified, where fuzz testing with specific edge-case inputs results in a "double free or corruption (out)" error. This indicates a potential memory management problem that leads to a core dump, although the behavior is not consistently reproducible in the latest PyTorch versions.
- Compilation and Export Errors: Compilation errors are encountered when using AOT Inductor with the PyTorch 2.6.0 RC in a Docker environment lacking a local CUDA Toolkit (CTK), where the absence of the `cuda.h` file leads to a failure. This prompts a discussion on whether to provide a clearer error message or include the missing file automatically.
- Optimizer and Data Type Issues: Bugs in PyTorch's optimizers (Adam, AdamW, and RAdam) are highlighted, where the `weight_decay` parameter is incorrectly applied to tensors with `requires_grad=False` (a minimal check appears after this list). This should not happen, as `weight_decay` is only meaningful for parameters that require gradients, suggesting a need for better parameter validation.
- Memory Usage and Fragmentation: A RAM leak is encountered during data loading with multiprocessing and Conv3d on the CPU in a custom PyTorch Dataset's `__getitem__` method. The process is unexpectedly killed due to out-of-memory errors when using random tensor shapes, but not when using fixed tensor shapes, indicating a potential issue with memory management.
- Compilation and Execution Errors: A bug in the `torch.compile` function of PyTorch is identified, where it unnecessarily creates a CUDA context and allocates GPU memory even when the code is intended for CPU execution. This leads to out-of-memory errors in multi-processing environments when multiple processes each create their own GPU context.
- Profiling and Metadata Enhancements: Enhancements to the PyTorch profiling tool are proposed, enabling Inductor to pass metadata for `aten` operations to the dispatcher. This allows the information to be included in the args field of the Kineto trace, as an alternative to the current fragile method of manually post-processing profile.json.
- Performance and Compilation Issues: A severe performance degradation is reported when continuously calling `nn.Linear` in fp32 on the NVIDIA GeForce RTX 5090D GPU, as compared to a manual matrix multiplication and addition. The problem may not occur on other 50-series cards, indicating a potential issue with the specific GPU model.
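For the dynamic-shapes-with-kwargs issue above, here is a minimal sketch of the intended `torch.export` usage; the module, argument names, and dimension name are illustrative, not taken from the issue:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, mask):
        return x + mask

batch = Dim("batch")
x, mask = torch.randn(4, 8), torch.randn(4, 8)

# The top-level keys of dynamic_shapes must match the forward() argument
# names; the reported error is a mismatch between these keys and the names
# of arguments passed as kwargs.
ep = export(M(), (x,), {"mask": mask},
            dynamic_shapes={"x": {0: batch}, "mask": {0: batch}})
print(ep)
```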
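And for the optimizer `weight_decay` issue, a minimal check of the expected behavior: a frozen tensor handed to Adam receives no gradients and so should not be decayed. This is a sketch of how one might probe for the reported bug, not the issue's exact reproduction:

```python
import torch

w = torch.nn.Parameter(torch.ones(3))                          # trainable
frozen = torch.nn.Parameter(torch.ones(3), requires_grad=False)  # frozen

opt = torch.optim.Adam([w, frozen], lr=0.1, weight_decay=0.5)
(w.sum() ** 2).backward()
opt.step()

# If weight_decay were (incorrectly) applied to the frozen tensor,
# its values would shrink; otherwise it stays all ones.
print(frozen)
```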
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 50
Summarized Issues:
- Test Failures and Instabilities: This topic covers various issues related to test failures and instabilities in the PyTorch project. One issue involves a disabled test due to flaky failures on the ROCm platform, while another describes an intermittent failure in a CUDA test with no clear cause. Additionally, there are issues with build instabilities on specific architectures and environments, such as the s390x architecture and the "linux-jammy-py3-clang12-executorch" build. These issues highlight the challenges in maintaining stable test environments across different platforms and configurations.
- ONNX Export Issues: Several issues are related to exporting PyTorch models to ONNX format. Users have encountered problems with unsupported operators, such as 'aten::fft_fft2', and issues with opset version mismatches during export. Additionally, there are challenges with large model exports exceeding size limits and unexpected behavior in opset version handling. These issues indicate the complexities involved in ensuring compatibility and correctness when converting models between frameworks.
- Compilation and Build Errors: Compilation and build errors are a recurring theme, with issues ranging from linker path errors on Windows to CMake compatibility problems. Users have reported errors due to implicit function declarations and incorrect library paths, as well as challenges with specific build environments like Triton on Windows. These issues underscore the importance of maintaining up-to-date and compatible build configurations across different systems.
- Performance and Utilization Challenges: Performance issues are highlighted in several cases, such as low GPU utilization during model training and slow execution of specific nodes in ComfyUI software. Users have also faced challenges with PyTorch's Distributed Data Parallel (DDP) environment, which affects efficient parallel processing. These issues reflect the ongoing efforts to optimize performance and resource utilization in machine learning workflows.
- Security Vulnerabilities: Security concerns are raised in issues related to CVE-2024-7804, which involves de-serialization vulnerabilities in PyTorch. The vulnerability is reportedly present in versions greater than 2.3.1, prompting discussions on its validity and potential overlap with other CVEs. These issues highlight the need for vigilance in addressing security risks in software projects.
- Compatibility and Support Issues: Compatibility issues are evident in several cases, such as the lack of support for certain CUDA capabilities and incompatible glibc versions causing import errors. Users have also reported problems with unsupported GPU models and outdated dependencies, indicating the challenges in maintaining compatibility with evolving hardware and software ecosystems.
- Documentation and Configuration Errors: Documentation and configuration errors are highlighted in issues related to unsupported section titles in docstrings and incorrect configuration settings. These issues emphasize the importance of accurate documentation and configuration management in software development.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 183
Key Open Pull Requests
1. [Inductor] Refactor wrapper codegen to use Wrapper IR.: This pull request refactors the existing wrapper code generation in the PyTorch project to utilize a new Wrapper Intermediate Representation (IR), which extends the current Memory Planning IR into a comprehensive Wrapper IR, aiming to provide structured information for generating wrapper code across different backends without modifying core files, thereby promoting modularity and encapsulation.
- URL: pull/150458
- Merged: No
- Associated Commits: d3a36, 6e176, c55fd, f299c, 403b9, fd03d, 2cd01, bd78b, df2f5, 0dfd0, 40011, 20b1d, 91a39, e7335, be417, fc5f0, 6c1cb, c94d1, 2b978, 2b6a4, 5e33e, 6e09c, b4c7a, 41cab, 2a839, 43773, 30cb1, 46e84, e6a12, 05cf7, aa7a2, 76b12, 2831c, 8678c, fc9ff, 2f17b, fb252, 9dc74, 57369, 63593, f621b, 51ea0, 0cd44
2. WIP: Remove Conda Instructions: This pull request involves the removal of Conda installation instructions from the PyTorch project documentation, leaving only the pip instructions where multiple installation methods were previously provided, and includes comments for areas where the replacement instructions were uncertain, as well as minor text fixes across several commits.
- URL: pull/150646
- Merged: No
- Associated Commits: 989af, 60a05, a348d, 5527a, 30405, dd37e, f4acf, f9fa6, 66ff9, 58edd, 63b68, e50d3, f387c, b53ae, 3f2f8, 65135, b8d8b, 35227
3. [test] binary docker builds: This pull request aims to improve the testing and building process of binary Docker images by modifying workflows to fetch images from AWS ECR instead of Docker IO, introducing reusable actions for binary Docker builds, simplifying scripts, and addressing issues related to image tagging and rebuilding, while also considering limitations for certain architectures like s390x and xpu.
- URL: pull/150558
- Merged: No
- Associated Commits: 8d01b, 1b5eb, 0f001, 5fa0a, 5d7d5, 10d1f, c697a, ad356, 7da1c, 5444b, c19f9, 4d90e, 441cf, 87ad8, 2e29f
Other Open Pull Requests
- DTensor Enhancements: This topic covers improvements in the DTensor library, including support for uneven sharding and a utility for converting strided sharding placements. These changes aim to enhance the functionality and flexibility of DTensor, particularly in handling unevenly divisible parameters and improving distributed checkpointing.
- MKLDNN Fusion for XPU: The pull requests under this topic focus on enabling MKLDNN fusion for the XPU backend in PyTorch. They include enhancements such as enabling specific operations, refining test scripts, and ensuring compatibility and performance improvements.
- ROCm Support Enhancements: These pull requests introduce support for hipSPARSELt and reintroduce SymmetricMemory in ROCm, enhancing performance and compatibility. They address various updates, including configuration flags and conditional checks, while noting certain hardware limitations.
- Dynamo and Inductor Improvements: This topic includes enhancements to the Dynamo component with a suite of CPython tests and updates to the Inductor component for handling custom operations. These changes aim to improve testing coverage and the handling of custom operators.
- CUDA and Tensor Enhancements: The pull requests here focus on CUDA enhancements, including RMSNorm kernel introduction and lazy cloning in `Tensor.to`. These changes aim to improve performance and efficiency in data transfer and computation.
- Type Hinting and Refactoring: These pull requests aim to refactor and simplify type hinting in PyTorch by adopting PEP 585 and PEP 604 (a before/after illustration appears after this list). They address type-shadowing issues and improve the `.pyi` stub files without requiring future imports.
- Miscellaneous Enhancements: This topic includes various enhancements such as enabling `vec::Vectorized` operations with scalars, tuning `_scaled_grouped_mm`, and introducing `PaddedTensor` for dynamic shapes. These changes aim to improve performance and functionality across different areas of PyTorch.
- Experimental and Work-in-Progress Updates: These pull requests include experimental updates like user buffer registration for FSDP2 and work-in-progress refactoring efforts. They are part of ongoing efforts to enhance PyTorch's capabilities and infrastructure.
- General Refactoring and Fixes: This topic covers general refactoring efforts such as `CUDAAllocatorConfig` reuse and `gen_pyi.py` refactoring. These changes aim to streamline code and improve maintainability.
- Symbolic Expression Enhancements: One pull request introduces `sym_and` and `sym_or` functions to enhance symbolic expression handling. This change simplifies expressions and improves runtime assertions and branch preservation.
- Guard Mechanism and Storage Updates: This topic includes updates to the `computeStorageNbytes` function with a `guard_or_false` mechanism. These changes ensure correct integration and testing of the functionality.
- cusparse Installation in Docker: This pull request addresses the installation of cusparse in the binary build Docker image for PyTorch. It is part of ongoing efforts to improve the build process and infrastructure.
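To illustrate the type-hinting refactor mentioned above, here is a small before/after comparison of PEP 585 builtin generics and PEP 604 union syntax (requires Python 3.10+); this is generic Python, not code from the pull requests themselves:

```python
from typing import List, Optional, Union  # old style needs these imports

def old(xs: List[int], y: Optional[str], z: Union[int, float]) -> List[int]:
    return xs

# PEP 585 builtin generics and PEP 604 unions express the same types
# without importing from typing:
def new(xs: list[int], y: str | None, z: int | float) -> list[int]:
    return xs
```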
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 226
Key Closed Pull Requests
1. Work on API Forwarding: This pull request, titled "Work on API Forwarding," was mistakenly opened against the wrong repository and was not merged, involving multiple commits that address various issues such as updating workflows, fixing bugs, and improving functionalities related to PyTorch's build processes, device compatibility, and API enhancements, as evidenced by the detailed commit messages and associated URLs.
- URL: pull/150528
- Merged: No
- Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 0fddc, 7808f, 74869, 15671, bb376, 423f3, 49be9, c5891, 125a1, 3a243
2. Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle: This pull request introduces statically launchable CachingAutotuners into the FXGraphCache's cache entry within the PyTorch project, allowing for the storage of compiled Triton kernels with a minimal memory footprint by utilizing StaticCudaLauncher and StaticTritonCompileResult, thereby eliminating the need for separate process calls on cache hits and ensuring consistent kernel usage while simplifying cache logic once feature parity is achieved.
- URL: pull/149054
- Merged: No
- Associated Commits: 7598f, f664e, 0d4a2, 9cb2f, 8a3cc, d0ab8, 1a60c, b8a94, 85dd8, 932fe, 2e827, 37795, 2311d, 13ef5, 59f06, a0a7a, 6d2fd, e9fdb, 1cd6d, b5d83, 59871, 99c9c, 4dfe7, 04e5b, d7b79, 92056, 32edc, c7b15, e0899, fb574, aa0b5, 2d0d7, 37a60, f5571, db019, 3f610, 35048, bc328, 1ece4, 4acbc, f59c7, 547fc
3. update get start xpu document for v2.7: This pull request involves updating the "get start xpu" documentation for version 2.7 of the PyTorch project, including modifications to the support table, prerequisites for binaries and building from source, and adjustments to the documentation format, although it was ultimately not merged.
- URL: pull/150397
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 8220f, e24fc, 5a461, 9aa6a, f7b24, a74e6
Other Closed Pull Requests
- PyTorch Dynamo Enhancements: This set of pull requests focuses on improving PyTorch Dynamo's handling of tensor subclasses. They ensure tracing into `__torch_function__` (a toy subclass of this kind is sketched after this list), support dynamic attributes, and address issues with `torch.Tensor._make_subclass`. These changes enhance compilation without altering legacy code and resolve specific issues in external projects like Hugging Face Diffusers and ComfyUI-GGUF.
- Unmerged Pull Requests: Several pull requests were proposed but not merged, addressing various enhancements and bug fixes. These include implementing `raise ... from ...` syntax, supporting unbacked subgraphs, rewriting the "should_swap" function, and adding support for `hermite_polynomial_h` in the MPS backend.
- Matrix Multiplication and GEMM Enhancements: These pull requests introduce and optimize matrix multiplication operations in PyTorch. They include enabling bf16 grouped GEMM, consolidating the Triton lowering of `scaled_mm`, and fusing matmul-reduce-scatters in asynchronous tensor parallelism.
- Bug Fixes and Reliability Improvements: These pull requests address various bugs and improve test reliability. They fix race conditions in unit tests and a compile error with `torch.Tensor.unsqueeze_`, and ensure deterministic settings for XPU accuracy tests on Intel GPUs.
- Dynamic Shapes and Symint Handling: Enhancements in handling dynamic shapes and symbolic integers are addressed in these pull requests. They utilize `statically_known_true` for dynamic shapes and introduce a fallback mechanism for unbacked symints in the Inductor component.
- CI and Build Process Optimization: This pull request optimizes the CI process by using system-installed NCCL, reducing build time, and unifying NCCL version pins, although it slightly increases Docker pull times.
- Miscellaneous Enhancements: Various enhancements include adding a "reason" field to `torch.compiler.disable`, improving cpp_wrapper, and ensuring GPU architecture compatibility in ROCm. These changes streamline processes and improve interface usability.
- Performance Improvements: This pull request enhances the performance of `sum` and `prod` reductions in MPSInductor by using `simd_sum` and `simd_product` operations, significantly improving `torch.compile` performance for specific models.
- Unmerged Testing and Bindings: These pull requests focus on testing binary builds and introducing C++ bindings for specific functions. They remain unmerged due to challenges in implementation and testing.
- Device and RNG State Handling: This pull request addresses treating third-party devices with `set_rng_state()` and `get_rng_state()` as CUDA-like devices, although it was not merged.
- GPU Warning Correction: This pull request ensures that a warning about the absence of a GPU is only triggered when attempting to enable the `tf32` setting in a CPU-only environment, correcting the behavior during `torch.export` initialization.
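For the Dynamo tensor-subclass work above, here is a toy example of the `__torch_function__` protocol that such subclasses rely on and that `torch.compile` must trace through; the class and its behavior are illustrative, not taken from the pull requests:

```python
import torch

class LoggingTensor(torch.Tensor):
    """Toy tensor subclass whose hook runs on every torch operation."""
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # A hook like this intercepts every torch.* call involving the
        # subclass; Dynamo has to trace into it rather than graph-break.
        print(f"called: {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
y = x + 1  # prints "called: add" and returns a LoggingTensor
```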
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Add reverse engineered code to iOS build
- Toxicity Score: 0.55 (Defensive responses, unresolved conflict, skepticism.)
- This GitHub conversation involves username1 proposing changes to the iOS build process, which username2 critiques for potential issues. Username1 responds defensively, leading to a tense exchange. Username3 attempts to mediate by suggesting a compromise, but username2 remains skeptical. The tone shifts from collaborative to confrontational, with username1 expressing frustration over the lack of consensus.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| malfet | 229 | 22 | 13 | 114 |
| guilhermeleobas | 251 | 6 | 1 | 7 |
| atalman | 126 | 13 | 12 | 25 |
| laithsakka | 93 | 15 | 7 | 57 |
| jamesjwu | 142 | 8 | 11 | 10 |
| justinchuby | 84 | 3 | 5 | 78 |
| anijain2305 | 126 | 10 | 18 | 14 |
| pianpwk | 97 | 23 | 1 | 39 |
| clee2000 | 131 | 15 | 5 | 3 |
| xmfan | 109 | 12 | 5 | 22 |