Weekly Project News


Weekly GitHub Report for PyTorch: May 19, 2025 - May 26, 2025 (12:02:45)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and enhancements to AOTInductor. Notably, this version also adds FP16 support for X86 CPUs and marks a shift away from publishing on Conda, directing users to alternative package sources.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. Passing device_id to torch.distributed.init_process_group() results in NCCL randomly hanging during communications: This issue describes a bug in PyTorch 2.7 where passing a device_id to torch.distributed.init_process_group() causes the script to randomly hang during distributed communications, a problem not present in version 2.6. The issue appears to be related to an upgrade in the NCCL library from version 2.21 to 2.26, which is used for GPU communication, and affects various types of GPUs.

    • The comments discuss similar issues reported by other users, potential causes related to the NCCL upgrade, and attempts to reproduce the bug with different NCCL versions. Suggestions include downgrading NCCL or using a workaround to disable non-blocking API mode, with some users confirming that the issue does not occur with NCCL 2.27. There is also a discussion about the implications of downgrading NCCL due to other bugs in earlier versions.
    • Number of comments this week: 17
  2. torch.compile raise JSONDecodeError("Extra data", s, end) while using Ray with Ulysses + 4 GPUs: This issue involves a JSONDecodeError that occurs when using torch.compile with Ray, Ulysses, and 4 GPUs, specifically when the program occasionally raises a compile error related to extra data in JSON decoding. The error seems to be associated with the LocalAutotuneCache, and there is a suspicion that it might not be thread-safe, especially in a multi-process situation with Ray.

    • The comments discuss whether the issue can be reproduced without Ray and suggest that the problem might be related to the LocalAutotuneCache's thread safety. A temporary workaround is proposed by disabling caches, and a user confirms that setting torch._inductor.config.autotune_local_cache to False resolves the issue. There is also a discussion about setting different cache directories for each rank to avoid race conditions, with a clarification that doing so limits sharing between ranks.
    • Number of comments this week: 13
  3. Duplicated milestones keys in torch.optim.lr_scheduler.MultiStepLR after loading a checkpoint: This issue involves a problem with the torch.optim.lr_scheduler.MultiStepLR when used in conjunction with DistributedCheckPointSaver (DCP), where the milestones parameter, which should be restored to its original state, ends up with duplicated keys due to a mismatch between integer and string types after loading a checkpoint. The proposed solution is to modify the MultiStepLR.load_state_dict method to convert the loaded keys from strings back to integers, addressing a problem that has affected multiple users and complicating debugging efforts.

    • The comments discuss the root cause of the issue, which is the restriction of DCP to only allow string keys in state_dicts, leading to conversion issues. Suggestions include addressing the problem at the DCP level, adding correctness tests, and handling the edge case in user space. A user-side workaround is mentioned, but a source-level fix is recommended to prevent future occurrences. The discussion also covers the internal workings of DCP and the challenges of maintaining backward compatibility while enforcing stricter key types.
    • Number of comments this week: 10
  4. Better padding API: This issue addresses the need for a more user-friendly and comprehensive padding API in PyTorch, as the current torch.nn.functional.pad() is considered awkward and incomplete by users. The proposal suggests creating a new torch.pad() function that could potentially align with NumPy's padding API, offering more intuitive ordering and a complete set of padding modes.

    • The comments reflect a variety of opinions on the proposed options, with some users favoring specific options while others find certain APIs confusing or ambiguous. There is a consensus on the need for a fill value and additional modes like "reflect" and "circular," though implementing "circular" is seen as potentially complex. Some users suggest that the existing circular operation could be improved with a better API.
    • Number of comments this week: 9
  5. AOTI packaged model can't be run on newly created tensor of same shape as tensor created from slice: This issue describes a bug encountered when running inference on different tensors using an AOTI compiled model, where the model works with a sliced tensor of 9 channels from an original 36-channel tensor but fails with a newly created tensor of the same shape. The user highlights the inconsistency in behavior and the lack of helpful error logs, which complicates debugging, especially when the error leads to illegal memory access issues with subsequent GPU operations.

    • The comments discuss whether the channel size was marked as dynamic during model compilation, with the user confirming it was not. A repository is shared to reproduce the issue, and it is noted that a memory leak issue was resolved in version 2.7, but the current issue persists. There is confusion about why slicing works but creating a new tensor does not, and it is suggested that input shape checking is not performed by default for efficiency. The user expects input shape failures in both cases and is surprised that the second case succeeds when run together with the first.
    • Number of comments this week: 9
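The "Extra data" failure mode in item 2 above is easy to reproduce with the standard library alone: json.loads raises exactly that error when a file contains more than one JSON document, which is what two unsynchronized writers racing on the same cache file would leave behind. (The cache and config names below are quoted from the discussion, not verified here.)

```python
import json

# Two processes racing on the same autotune cache file can leave two
# JSON documents concatenated in it; parsing the file then fails.
corrupted = '{"best_config": 1}{"best_config": 1}'

try:
    json.loads(corrupted)
except json.JSONDecodeError as e:
    message = str(e)

# The parser stops after the first document and reports the leftover bytes.
assert message.startswith("Extra data")

# Workaround reported in the thread (quoted, not verified here):
# torch._inductor.config.autotune_local_cache = False
```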
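The type mismatch in item 3 above can be sketched without PyTorch: a JSON round trip (mirroring DCP's string-only key restriction) turns the integer milestone keys into strings, and a naive merge on load then yields duplicated keys of mixed types. Names and structures below follow the issue's description, not the actual MultiStepLR source.

```python
import json
from collections import Counter

# Milestones as described in the issue: a Counter with integer keys.
milestones = Counter({30: 1, 80: 1})

# A checkpoint round trip that only permits string keys (as reported for DCP):
restored = json.loads(json.dumps(milestones))   # keys are now "30", "80"

# A naive restore duplicates every key, once as int and once as str.
naive = Counter(milestones)
naive.update(restored)
assert set(naive) == {30, 80, "30", "80"}

# The proposed source-level fix: convert keys back to int on load.
fixed = Counter({int(k): v for k, v in restored.items()})
assert set(fixed) == {30, 80}
```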
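The NumPy API that the proposal in item 4 above points to can be seen in a short sketch: pad widths are given as one (before, after) pair per dimension, in dimension order, which is the ordering the issue calls more intuitive than the flat, last-dimension-first tuple used by torch.nn.functional.pad.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# NumPy: one (before, after) pair per dimension, in dimension order.
padded = np.pad(a, pad_width=((1, 1), (2, 2)), mode="constant",
                constant_values=0)
assert padded.shape == (4, 7)   # 2+1+1 rows, 3+2+2 columns

# For comparison, torch.nn.functional.pad expresses the same padding as
# F.pad(t, (2, 2, 1, 1)): a flat tuple read from the LAST dimension backward.
```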

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a script that utilizes the OotdPipeline and involves compiling components with Torch's compile function, specifically affecting the 'inductor' backend due to the missing import.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing MaxPool2D in PyTorch when the stride is equal to 1, suggesting that a MaxPool2D operation with a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for other kernel sizes. The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time.
  3. cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached cuda_utils.so file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the problem is highlighted by an ImportError indicating a failure to map a segment from the shared object, which is crucial for the model's execution.
  4. Enable UFMT on all files in PyTorch: This issue addresses the need to apply uniform formatting (UFMT) to approximately 1,500 files in the PyTorch codebase that are currently exempt from this formatting standard. The process involves removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to ensure all files adhere to the desired formatting, with additional preparatory work required for files with known issues to facilitate easier review.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this request is to reduce the size of model files, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's file size from 6.7MB to 5.6MB by manually removing these debug files.
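The MaxPool2D decomposition proposed in item 2 above rests on a simple identity: a stride-1 max pool with kernel size 5 equals two chained stride-1 pools with kernel size 3, because a max of overlapping window maxima is the max over the union of the windows. A minimal 1-D check with NumPy:

```python
import numpy as np

def maxpool1d(x, k):
    """Stride-1 max pool over a 1-D array (valid windows only)."""
    return np.array([x[i:i + k].max() for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.random(32)

# A kernel-5 pool equals two chained kernel-3 pools (both stride 1):
# output lengths match (32-5+1 == 28) and so do all the values.
direct = maxpool1d(x, 5)
chained = maxpool1d(maxpool1d(x, 3), 3)
assert direct.shape == chained.shape == (28,)
assert np.allclose(direct, chained)
```

The same argument extends to 2-D pooling and to other kernel sizes (e.g. kernel 7 as three chained kernel-3 pools), which is the cost-reduction the issue proposes.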

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 128

Summarized Issues:

  • Bugs in PyTorch's torch.compile function: Several issues have been reported with the torch.compile function, including graph break errors, incorrect tensor outputs, and unexpected behavior when using specific backends or configurations. These problems often result in runtime errors or inconsistencies in output shapes and values, highlighting the need for further debugging and potential workarounds.
    • pytorch/pytorch/issues/153832, pytorch/pytorch/issues/153852, pytorch/pytorch/issues/154009, pytorch/pytorch/issues/154111, pytorch/pytorch/issues/154282, pytorch/pytorch/issues/154284
  • Performance issues in PyTorch: Various performance-related issues have been identified, such as degradation in execution speed on specific hardware, excessive memory usage, and discrepancies in gradient calculations. These issues often require optimization of kernel functions or adjustments to algorithm parameters to achieve expected performance levels.
    • pytorch/pytorch/issues/153809, pytorch/pytorch/issues/153825, pytorch/pytorch/issues/153957, pytorch/pytorch/issues/154094, pytorch/pytorch/issues/154301
  • Bugs in PyTorch's distributed and backend systems: Several bugs have been reported in PyTorch's distributed and backend systems, including incorrect behavior when passing arguments, random hangs during communication, and issues with backend capability assignment. These bugs often lead to runtime errors or improper functionality in distributed environments.
    • pytorch/pytorch/issues/153822, pytorch/pytorch/issues/153960, pytorch/pytorch/issues/154102, pytorch/pytorch/issues/154297
  • Bugs in PyTorch's ONNX export functionality: Issues have been identified with exporting PyTorch models to ONNX format, including lack of support for certain operators, runtime reshape failures, and device mismatch errors. These issues often require updates to the export functionality or workarounds to ensure successful model conversion.
    • pytorch/pytorch/issues/153823, pytorch/pytorch/issues/153955, pytorch/pytorch/issues/154093
  • Bugs in PyTorch's tensor operations: Several bugs have been reported in PyTorch's tensor operations, including incorrect handling of complex tensors, floating point exceptions, and unexpected behavior with specific input shapes or data types. These issues often result in runtime errors or incorrect outputs, requiring fixes or workarounds.
    • pytorch/pytorch/issues/153852, pytorch/pytorch/issues/153919, pytorch/pytorch/issues/154014, pytorch/pytorch/issues/154311, pytorch/pytorch/issues/154312
  • Bugs in PyTorch's profiling and tracing functionality: Issues have been identified with PyTorch's profiling and tracing functionality, including incorrect capture of execution steps, runtime errors during tracing, and discrepancies in reported test results. These issues often require updates to the profiling tools or adjustments to the tracing logic.
    • pytorch/pytorch/issues/153901, pytorch/pytorch/issues/153938, pytorch/pytorch/issues/154101
  • Bugs in PyTorch's library and build configuration: Several bugs have been reported related to PyTorch's library and build configuration, including missing type definitions, build failures with specific compilers, and issues with library exports. These bugs often require updates to the build scripts or configuration files to resolve.
    • pytorch/pytorch/issues/153933, pytorch/pytorch/issues/154096, pytorch/pytorch/issues/154105
  • Bugs in PyTorch's Triton integration: Several issues have been reported with PyTorch's integration with the Triton library, including test failures, API changes, and runtime errors. These issues often require updates to the Triton library or adjustments to the integration logic to ensure compatibility and functionality.
    • pytorch/pytorch/issues/154207, pytorch/pytorch/issues/154209, pytorch/pytorch/issues/154210, pytorch/pytorch/issues/154212, pytorch/pytorch/issues/154213

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 54

Summarized Issues:

  • Test Disabling in TestInductorOpInfoXPU Suite: The TestInductorOpInfoXPU suite has multiple tests disabled due to consistent failures on Linux platforms. These failures involve various data types such as float64, float32, float16, int32, int64, and bool, and require attention from specific contributors to resolve the underlying issues.
    • issues/152898, issues/152910, issues/152911, issues/152912, issues/152925, issues/152929, issues/152930, issues/152931, issues/152970, issues/152971, issues/153017, issues/153018
  • PyTorch Cacheable Functions Enhancement: Enhancing PyTorch to allow users to specify their own non-torch functions as cacheable involves creating a registration mechanism and generating static string cache keys. This ensures safety and compatibility across different code versions, focusing on user-defined function caching.
    • issues/152434
  • Error Handling and Consistency Issues: Several issues highlight inconsistencies in error handling across different PyTorch backends and functions. These include discrepancies in torch.batch_norm error behavior between CPU and GPU, and the aot_eager backend's handling of try...except blocks.
    • issues/153137, issues/153605
  • Test Failures and Disabling in Various Suites: Multiple tests across different PyTorch suites have been disabled due to failures. These include tests in AOTInductorTestABICompatibleGpu, TestApplyCUDA, and TestMaxAutotune suites, requiring input from contributors to address the issues.
    • issues/153829, issues/153830, issues/154112, issues/154218
  • Memory Management and Compatibility Issues: Users have reported issues with GPU memory management and compatibility in PyTorch. These include gradual GPU memory usage increases, incorrect compute capability reporting, and compatibility problems with specific NVIDIA GPUs.
    • issues/153363, issues/153928, issues/153944
  • Bugs in PyTorch Functions and Modules: Various bugs have been reported in PyTorch functions and modules, such as incorrect results in torch.cuda.memory._record_memory_history, torch.multinomial non-determinism, and RNNBase parameter sharing issues.
    • issues/153571, issues/154031, issues/154238
  • Documentation and Configuration Issues: Issues related to documentation errors and configuration parameters have been identified. These include incorrect documentation in RMSNorm and the need for configurable precompilation timeouts.
    • issues/154184, issues/153392
  • Performance and Regression Issues: Performance problems and regressions have been reported, such as the first run of PyTorch XPU on Windows taking longer and incorrect output shapes in torch.linalg.vector_norm.
    • issues/154180, issues/153568

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. [Upstream Triton] RuntimeError: Expected to find "uint32_t grid_0 = 1023L;" but did not find it test_triton_autotuning_cuda
    • Toxicity Score: 0.55 (Temporary blocking, repeated issues, frustration)
    • This GitHub conversation involves multiple users, with username1 expressing frustration over a recurring issue and ultimately applying a temporary block to prevent further duplicates. The tone is tense, and the trigger appears to be the repeated filing of similar issues without a satisfactory resolution.

III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 159

Key Open Pull Requests

1. [draft][do not review] H-FSDP prototype: This pull request introduces a draft prototype for Hierarchical Fully Sharded Data Parallel (H-FSDP) in PyTorch, which includes test code and modifications to enable reduce scatter operations and shard management, but is not yet ready for review or merging.

  • URL: pull/154000
  • Merged: No
  • Associated Commits: c2640, 89ddf, fcba0, 66294, fee4c, a1f68, b41e2, 453af, 18eb5, aa3d2, ee541

2. [MPS] Implement max_pool3d_with_indices: This pull request implements the max_pool3d_with_indices operation for the Metal Performance Shaders (MPS) backend in PyTorch, utilizing the MPSGraphPooling4DOpDescriptor and maxPooling4DReturnIndicesWithSourceTensor to add a fourth spatial dimension to the input, output, and indices tensors, while only supporting contiguous memory format, addressing one of the top requested operations for the MPS backend.

  • URL: pull/154145
  • Merged: No
  • Associated Commits: a7f78, 00354, b9ca9, 765dd, fe518, 1d29d, b0b19, 7b80b, 15d7f, 34cd5, 2ce56

3. Remove MemPoolContext: This pull request aims to remove the MemPoolContext from custom user memory pools in the PyTorch project to prevent synchronization issues between the MemPoolContext and the active pool in graph_pools, which could occur in multithreaded scenarios, as highlighted by previous related pull requests.

  • URL: pull/154042
  • Merged: No
  • Associated Commits: 568c9, 03364, 447f6, 2bd83, dccd0, b9b97, c7e4d, 5a31b, 5fde1, ec5e7

Other Open Pull Requests

  • Deprecation Warning for torch.ao.quantization Module: This pull request introduces a deprecation warning for the torch.ao.quantization module, advising users to transition to the new torchao APIs for eager mode and FX graph mode quantization. It also recommends using the XNNPACK quantizer in ExecuTorch as part of a broader migration plan.
    • pull/153892
  • New Lint Adaptor "pyrefly": A new lint adaptor named "pyrefly" is introduced to the PyTorch project, as part of a stack of changes managed by ghstack. This pull request is currently unmerged with multiple updates indicated by the series of commits.
    • pull/154059
  • ONNX Python Package Update: This pull request updates the ONNX Python package to version 1.18 in the PyTorch project. It addresses various dependencies and configuration issues, such as updating installation scripts and modifying requirements for continuous integration.
    • pull/153920
  • Memory-Efficient Attention Fix: This pull request addresses issue #146704 by implementing a fix for memory-efficient attention in scenarios with large batch dimensions. It includes multiple commits that involve test fixes, error string tests, and code rewrites to support compilation.
    • pull/154029
  • Binary Operators for Data Structures: This pull request implements several binary operators for the data structures dict, set, frozenset, and dict_keys within the PyTorch project. It is part of a series of changes tracked by ghstack and is currently not merged.
    • pull/154063
  • Enhancements to Dynamo Set Functionality: Enhancements to the Dynamo Set functionality include raising a TypeError when an unhashable object is encountered. This ensures better error handling and robustness in the code.
    • pull/154064, pull/154065, pull/154066
  • Model Parameter Data Type Flexibility: This pull request addresses issue #154082 by allowing different data types for model parameters that do not require gradients in the PyTorch project. It includes a series of commits that relax consistency checks, add tests, and make necessary code adjustments.
    • pull/154103
  • Pybind11 Submodule Update: The pull request updates the pybind11 submodule to version 3.0.0rc in the PyTorch project. It addresses potential issues by removing deprecated methods, fixing formatting bugs, and making additional necessary adjustments.
    • pull/154115
  • XPU Triton Commit Pin Update: This pull request updates the XPU Triton commit pin for the upcoming PyTorch release 2.8 by integrating a newer version of OpenAI Triton. It involves relocating the setup.py file and updating the CMake version requirement.
    • pull/154194
  • Linting Rule Application: This pull request applies the same linting rules to specific test files as other files in the project. It addresses the difficulty of updating tests in these previously skipped files due to the inability to lint them locally.
    • pull/154261
  • Graph Breaks Counting Enhancement: The pull request enhances the PyTorch project by utilizing the gb_type field of unimplemented_v2 to accurately count graph breaks. It includes several commits addressing exceptions, updates, fixes, and a reversion.
    • pull/153818
  • Support for num_ctas > 1 in StaticCudaLauncher: This pull request introduces support for num_ctas > 1 in the StaticCudaLauncher using cuLaunchKernelEx. It requires a device capability of at least 9 and plans to add support for launch_cooperative_grid in a future update.
    • pull/153834
  • Support for NamedTuple Subclasses: This pull request addresses issue #133762 by adding support for namedtuple subclasses within the PyTorch project. It focuses on handling tuple subclasses constructed inside compile regions and managing the "fake" global scope associated with NamedTuple-generated __new__.
    • pull/153982
  • Parent Fallback Logic Cleanup: The pull request aims to clean up the parent fallback logic by removing the redundant parent in fallback_node_due_to_unsupported_type. It ensures that the tests in test_add_complex produce the same codegen and resolves an issue encountered by a contributor.
    • pull/154006
  • Graph Partitioning Logic Fix: This pull request addresses an issue in the graph partitioning logic of a PyTorch project by ensuring correct partitioning of nodes that read from or write to CPU scalar tensors. It prevents incorrect cudagraph wrapping of nodes like triton_poi_fused_add_0.
    • pull/154013
  • TestOpenReg Split: The pull request aims to split the existing TestOpenReg into two distinct parts to separately test the third-party accelerator integration mechanism and the openreg functionality itself. It is part of a series of changes managed through the ghstack tool.
    • pull/154018, pull/154019
  • Dynamic Shapes Feature: This pull request, titled "[WIP][dynamic shapes] unbacked safe unsqueeze," aims to address an issue in the PyTorch project by implementing a feature related to dynamic shapes. It includes multiple commits such as initial setup and updates to __init__.py.
    • pull/154087
  • Dynamic Whitelist for Recompilations: The pull request suggests implementing a dynamic whitelist for recompilations in the PGO code state by proposing the use of TORCH_COMPILE_DYNAMIC_SOURCES. It aims to close issue #153442 which previously explored the dynamo guards approach.
    • pull/154189
  • Convolution Fusion for Intel GPUs: This pull request focuses on implementing convolution fusion for Intel GPUs within the XPU backend of the PyTorch project. It is part of a stack of changes managed through the ghstack tool and involves multiple contributors and reviewers.
    • pull/154202
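The unhashable-object behavior described in the Dynamo Set item above mirrors plain CPython semantics, which the standard library demonstrates on its own: adding an object without a hash (such as a list) to a set raises TypeError, and the Dynamo change makes compiled code surface the same error.

```python
# Plain-Python illustration of the behavior the Dynamo change mirrors:
# unhashable objects (e.g. lists) cannot be added to a set.
s = set()
err = None
try:
    s.add([1, 2])          # list defines no __hash__
except TypeError as e:
    err = e

assert err is not None
assert "unhashable" in str(err)
```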

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 237

Key Closed Pull Requests

1. [Monitoring] Add util for linux build: This pull request, titled "[Monitoring] Add util for linux build," was intended to address a specific issue in the PyTorch project by introducing a utility for Linux builds, but it was ultimately closed without being merged.

  • URL: pull/153456
  • Merged: No
  • Associated Commits: 4810f, 4c51f, fbbd1, 0ca34, 4ba45, d4a1f, cd8c6, a31f9, bc8e6, 50e29, 2066f, 6938d, 0de95, 263bc, 5ee40, 3a432, 85ad1

2. [Monitoring] enable local logs and add mac test monitoring: This pull request aims to enhance the monitoring capabilities of the PyTorch project by enabling local logging and adding test monitoring for macOS, allowing the upload utilization logic to run using a local pointer instead of relying on data from S3, which could also benefit ROCm.

  • URL: pull/153454
  • Merged: No
  • Associated Commits: 0683d, b3e98, 9e5e2, 46b43, 32fa0, 6c3d7, d0ab2, 917fe, 4d00e, 000c6, 9027d, c21d5, d2976

3. [CI] Move Mac testing to 3.12: This pull request aims to transition the Mac testing environment to Python 3.12 as a preparatory step towards eliminating the use of Conda during the build process, as part of a series of changes managed through the ghstack tool.

  • URL: pull/154177
  • Merged: No
  • Associated Commits: 78c01, cd1f9, 226c3, 8f3d0, d757f, a736a, cbdda, e2b5c, b47a1, 49581, e8714

Other Closed Pull Requests

  • Dynamic Shape Handling in PyTorch: This topic involves enhancing the PyTorch framework by introducing a dynamic whitelist to reduce recompilations due to dynamic shape changes and refactoring GuardDebugInfo to separate verbose code from failure reasons. Additionally, it provides detailed logging for recompilation triggers and guard failures to improve debugging and performance.
    • pull/153442
  • Serialization Documentation Updates: This topic covers multiple updates to the serialization documentation in the PyTorch project, as indicated by a series of commits. However, the changes were ultimately not merged into the main branch.
    • pull/153631
  • Subgraph Construction for Benchmarking: This pull request addresses an issue where subgraphs with inputs having a FlexibleLayout do not freeze the layouts, leading to inconsistencies. The solution involves constructing the subgraph with benchmarking arguments instead of example inputs to ensure consistency.
    • pull/153753
  • Optimization of Continuous Integration Process: This topic focuses on optimizing the continuous integration process by reusing existing wheel files for builds when only Python files unrelated to the object files in the wheel have been modified. This approach reduces build times to approximately six minutes under specific conditions.
    • pull/153838
  • Autotuning Process for Cutlass Backend: This pull request introduces a two-stage autotuning process, known as prescreening, for the Cutlass backend in the Inductor project. It aims to optimize the tuning of backend kernels by initially identifying top configurations with a specific swizzle and then performing autotuning on these configurations across different swizzles.
    • pull/153335
  • ONNX Opset Updates: This topic addresses issue #153687 by updating the onnx->symbolic_opset23.py file to include features for opsets 21, 22, and 23. Despite multiple commits adding these opsets and a refactor of the file, the changes were not merged into the main branch.
    • pull/153702
  • Graph Partitioning Edge Cases: This pull request addresses three edge cases in graph partitioning by supporting removed arguments and handling NoneLayout, which cannot be a partition input or output. It ensures internal buffers are not allocated as buf_id and uses mutation_real_name for partition signatures.
    • pull/153899
  • Deprecation Checks in XPU Codebase: This topic introduces a C10_NODEPRECATED check for the XPU codebase, preventing the use of deprecated c10::optional and related constructs. It promotes the use of standard library counterparts and depends on updates from related pull requests in the torch-xpu-ops repository.
    • pull/153935
  • Initial Metal Support in PyTorch: This pull request, titled "[aoti] Initial Metal support," aims to introduce initial support for Metal in the PyTorch project. Despite a series of commits linked to the pull request, it was not merged.
    • pull/153959
  • Handling of Scalar Tensors for Intel GPUs: This pull request addresses the handling of scalar tensors in the addmm and baddmm functions for Intel GPUs by expanding the shape of the self tensor to match the output tensor's dimensions. This change is necessitated by the upgrade to a new version of oneDNN and includes updates to unit tests.
    • pull/153051
  • Mutation Renames in PyTorch Dependencies: This pull request addresses the need to update mutation renames in the dependencies of a multi-template buffer within the PyTorch project. It acknowledges PaulZhang12 for the original discovery and seeks advice on capturing inductor logging output for testing purposes.
    • pull/153895
  • CUDA Version Update for Inductor Benchmark Jobs: This pull request aims to update the inductor benchmark jobs to use CUDA 12.8, the latest supported version, for improved performance and consistency with NVIDIA's testing on Blackwell. It also removes outdated references to CUDA 12.4 in the PyTorch CI.
    • pull/154004
  • Code Update in variables/dict.py: This pull request involves updating the code in variables/dict.py by replacing the function unimplemented with unimplemented_v2. Despite being part of a series of commits tracked by ghstack, it was ultimately not merged.
    • pull/154040
  • Metal Ops for fmod and remainder Operations: This pull request aims to move the fmod and remainder operations to Metal ops in the PyTorch project. It addresses a correctness issue with large integer types and improves performance for floating point types, although it was not merged.
    • pull/154280
  • CK-tile Based Universal GEMM Kernels: This pull request introduces code generation for CK-tile based universal GEMM kernels to the CK backend for Inductor. It adds these kernels to the autotune choices in torch.mm and involves creating a new template for code generation.
    • pull/152341
  • Vectorization of FP8 E4M3 Format: This pull request introduces the Vectorized&lt;Float8_e4m3fn&gt; class to enable vectorization of the FP8 E4M3 format. It includes methods for conversion to and from Vectorized&lt;float&gt; and common vectorized operations such as multiplication, absolute value, and equality checks.
    • pull/152417
  • FP8_E4M3 Quantization and Dequantization: This pull request aims to enable vectorized code generation for FP8_E4M3 quantization from float32 and dequantization to float32 using the Inductor CPP backend in the PyTorch project.
    • pull/152418
  • BundledAOTAutogradCacheEntry Enhancement: This pull request introduces the BundledAOTAutogradCacheEntry, an enhancement to the AOTAutogradCacheEntry by directly saving the entire CompiledFxGraph within the entry. It eliminates the dependency on FxGraphCache, simplifying the logic and potentially improving cache efficiency.
    • pull/152840
  • XPU Memory Reporting in PyTorch Profiler: This pull request adds support for XPU memory reporting in the PyTorch Profiler by updating the XPUCachingAllocator.cpp to report allocation events. It allows the profiling table to include XPU Mem columns, aligning XPU memory profiling with existing CUDA profiling capabilities.
    • pull/152842
  • Versatility of MegaCache Component: This pull request aims to make the MegaCache component of the PyTorch project more versatile by allowing the registration of external plugins. It includes making MegaCache generic, reverting formatting changes, and updating cache information artifacts.
    • pull/152977
  • HOP-ification of Out-of-Tree Functions: This pull request aims to enable the HOP-ification of out-of-tree functions during the compilation process. It is part of a series of related changes tracked through the ghstack tool, although it was ultimately not merged.
    • pull/153487
  • Rechecking Autotune Cache for Triton Kernels: This pull request addresses the need to recheck the autotune cache when loading statically launchable Triton kernels from FxGraphCache. It ensures the best configuration is utilized even if it was not precompiled and includes a new unit test to verify this functionality.
    • pull/153565
  • Test Contamination Prevention: This pull request addresses the issue of test contamination by ensuring that the preference for using cuBLAS or cuBLASLt is not inadvertently carried over across different tests. It explicitly parameterizes the backend setting for tests that need to exercise both backends.
    • pull/153655
  • Benchmark Database Management: This pull request addresses the issue of the benchmark database growing unexpectedly fast by proposing to skip uploading certain debug information from the TorchInductor benchmark. It aims to prevent database bloat by omitting data not utilized by any dashboard.
    • pull/153769
  • AOTIModelContainerRunnerMps and MPS Fallback: This pull request introduces the AOTIModelContainerRunnerMps and a shim for Metal Performance Shaders (MPS) fallback operations. It includes a specific shim with an operator to set arguments for the Metal kernel, although it was not merged into the main branch.
    • pull/153964
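For readers unfamiliar with the FP8 E4M3 layout mentioned in the vectorization and quantization items above, the sketch below decodes a single E4M3 "fn" byte to a Python float. This is an illustrative pure-Python decoder, not code from the pull requests; the function name is our own, and it assumes the standard layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and a single NaN pattern).

```python
def fp8_e4m3fn_to_float(byte: int) -> float:
    """Decode one FP8 E4M3 (fn variant) byte to a Python float.

    Assumed layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    The "fn" (finite-only) variant reserves only the all-ones pattern
    (exponent 0b1111, mantissa 0b111) for NaN and has no infinities,
    which extends the largest finite value to 448.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits
    mant = byte & 0x7         # 3 mantissa bits
    if exp == 0xF and mant == 0x7:
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign * (mant / 8) * 2.0 ** (1 - 7)
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)
```

For example, byte 0x38 (sign 0, exponent 7, mantissa 0) decodes to 1.0, and 0x7E decodes to 448, the format's largest finite value.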

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

  1. [list] Implement list.count
    • Toxicity Score: 0.55 (Frustration expressed, Defensive response, Mediation attempt, Continued dissatisfaction.)
    • This GitHub conversation involves username1 expressing frustration over username2's implementation not meeting expectations, leading to a defensive response from username2. The tone shifts as username3 attempts to mediate, but username1's continued dissatisfaction triggers further tension.

IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor      Commits  Pull Requests  Issues  Comments
malfet           203      36             11      126
guilhermeleobas  218      12             3       1
Skylion007       70       16             4       125
anijain2305      157      3              1       8
laithsakka       79       25             13      32
bobrenjc93       87       11             6       21
henrylhtsang     67       9              8       28
eellison         43       12             2       52
cyyever          69       19             0       12
ngimel           31       4              0       56
