Weekly GitHub Report for PyTorch: March 31, 2025 - April 07, 2025 (12:07:01)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates, including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of PyTorch's official Anaconda channel, the introduction of FP16 support on X86 CPUs, and a backward-compatibility-breaking change in the default behavior of `torch.load`.
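Since `torch.compiler.set_stance` is new in this release, here is a minimal sketch of how it can be used to temporarily change compilation behavior; the toy model and the `"force_eager"` stance name follow the 2.6 documentation, but treat the exact usage as illustrative:

```python
import torch

model = torch.nn.Linear(8, 8)   # toy model, illustrative only
compiled = torch.compile(model)
x = torch.randn(4, 8)

compiled(x)  # normal compiled execution

# Temporarily skip compilation and run eagerly (e.g. while debugging),
# without rebuilding the compiled module.
with torch.compiler.set_stance("force_eager"):
    compiled(x)
```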
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [export] run_decomposition does not preserve custom CompositeImplicitAutograd ops: This issue is about the failure of the `run_decomposition` function to preserve custom `CompositeImplicitAutograd` operations during ONNX export, which is causing problems for the user, who is trying to implement composite custom operations for ONNX/TRT export purposes. The user is seeking a solution to prevent these custom operations from being decomposed automatically, as they have specified a `custom_translation_table` that should ideally keep these operations intact for translation.
  - The comments discuss the unexpected decomposition of custom operations during ONNX export, with various contributors suggesting debugging steps and potential solutions. The conversation includes attempts to reproduce the issue, discussion of the behavior of `run_decompositions`, and suggestions to modify the decomposition table to exclude specific custom operations. There is also discussion of the need to adjust the ONNX decomposition registry to handle custom operations correctly.
  - Number of comments this week: 39
- Pytorch nightly Cuda 12.8 - 'too many resources requested for launch' with multiple layers of LayerNorm after strided Conv1d: This issue reports a bug in the PyTorch nightly build for CUDA 12.8, where a runtime error occurs due to "too many resources requested for launch" when running a backward pass through a model with multiple strided Conv1d modules followed by LayerNorm modules (a sketch of the failing pattern appears after this list). The error does not occur when the Conv1d modules have a stride of 1 or when only one strided Conv1d followed by one LayerNorm is used, and it is specific to CUDA, as it does not appear when using a CPU or when replacing LayerNorm with RMSNorm.
  - Multiple users confirm experiencing the same error with similar setups, particularly on NVIDIA RTX 5090 GPUs. A suggested workaround is to downgrade to PyTorch 2.7, which resolves the issue for several users. The problem is suspected to be related to a specific pull request, and a fix involving `__launch_bounds__` is proposed to reduce register usage.
  - Number of comments this week: 17
- Intermittent SSL certificate expiry warnings for download.pytorch.org (load balancer?): This issue reports intermittent SSL certificate expiry warnings when accessing download.pytorch.org, suggesting that the problem might be due to an expired certificate on a load-balanced node. The issue is not specific to any version and has been observed from various locations, indicating a potential problem with certificate propagation across different nodes.
  - Multiple users from different locations, including the UK, France, and Switzerland, report experiencing the same SSL certificate expiry issue. Some users suggest temporary workarounds, while others note that the problem persists intermittently, likely due to propagation delays in updating the certificate across all nodes. The issue is linked to a similar past problem, and users are advised to wait for the certificate to propagate fully.
  - Number of comments this week: 12
- CUTLASS backend updates: Instantiation level, long compilation and long autotuning time: This issue addresses the current status and challenges of the CUTLASS backend in the PyTorch project, focusing on its performance on H100 hardware. It highlights the backend's potential to outperform other solutions like ATen and Triton, while also discussing significant obstacles such as long kernel compilation and autotuning times, along with missing features that hinder seamless benchmarking and performance evaluation.
  - The comments discuss the potential use of CUTLASS's Python interface for generating C++ code, with some skepticism about its utility given the upcoming CUTLASS 4.x release. There is also mention of a talk on Python & CUTLASS 4.0, which suggests a new Pythonic DSL for writing GEMM kernels. Concerns are raised about the lack of performance support for Hopper and earlier architectures in CUTLASS 4 Python, with a clarification that only bug fixes will be provided for these older architectures.
  - Number of comments this week: 8
- [distributed] Crash when trying to use default PG after creating new PG: This issue describes a crash that occurs when attempting to use the default process group (PG) after creating a new process group in a PyTorch distributed environment. The user provides a Python script to reproduce the error, which involves initializing process groups with the NCCL backend and encountering a segmentation fault when calling `dist.barrier()` (see the sketch after this list).
  - The comments discuss attempts to reproduce the issue, with some users unable to replicate the crash on their systems. Suggestions include enabling detailed debug logs and checking for discrepancies in the environment setup. One user provides a gdb backtrace indicating a segmentation fault in the NCCL communication setup, suggesting a potential issue with process group initialization or cleanup.
  - Number of comments this week: 6
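For the LayerNorm-after-strided-Conv1d crash above, here is a minimal sketch of the reported failure pattern; the layer sizes are illustrative, and the exact shapes in the issue may differ:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Repeated strided Conv1d -> LayerNorm, the pattern the issue describes."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, stride=2)
        self.norm1 = nn.LayerNorm(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, stride=2)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):
        # LayerNorm normalizes the last dim, so transpose channels to the end.
        x = self.conv1(x)
        x = self.norm1(x.transpose(1, 2)).transpose(1, 2)
        x = self.conv2(x)
        x = self.norm2(x.transpose(1, 2)).transpose(1, 2)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
net = Net().to(device)
out = net(torch.randn(8, 64, 256, device=device))
out.sum().backward()  # the reported error occurs during this backward pass
```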
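And for the default-process-group crash, a hedged reconstruction of the reproduction described in the issue (the user's actual script may differ); it assumes a host with at least two GPUs and a `torchrun` launch:

```python
# Run with: torchrun --nproc-per-node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Create an additional process group covering the same ranks...
new_pg = dist.new_group(ranks=list(range(dist.get_world_size())))
dist.barrier(group=new_pg)

# ...then use the default group again; the issue reports a segfault here.
dist.barrier()

dist.destroy_process_group()
```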
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to resolve and close these issues as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a Python script that utilizes the OotdPipeline and involves compiling components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly, a kernel size of 7 can be represented by three such operations (a quick equivalence check appears after this list). The motivation behind this approach is to reduce computational costs on the CPU by modifying the MaxPool2D layer directly, as demonstrated by testing code that shows a significant speedup in execution time compared to the traditional method.
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a script in a Docker environment with a `tmpfs` permission set to `1777`, where the execution of a cached `cuda_utils.so` file in the `/tmp` directory fails due to the absence of the execution bit, despite the directories having the correct permissions. The error occurs during the execution of a PyTorch model, specifically when attempting to execute the compiled CUDA utilities, resulting in an `ImportError` indicating a failure to map a segment from the shared object.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as approximately 1,500 files are currently not formatted according to the UFMT standards. The process requires removing file names from the `exclude_patterns` in the `UFMT` section of the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes are committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature is to reduce the size of model files, particularly for deployment on mobile devices, by eliminating unnecessary debug files, as demonstrated by a reduction from 6.7 MB to 5.6 MB in a test case.
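The MaxPool2D item above relies on stride-1 max pools composing: the max over a 5-wide window equals the max of maxes over overlapping 3-wide windows. A quick sanity check of the claimed equivalence (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# One 5x5 stride-1 max pool vs. two stacked 3x3 stride-1 max pools.
direct = F.max_pool2d(x, kernel_size=5, stride=1)
stacked = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                       kernel_size=3, stride=1)

print(torch.equal(direct, stacked))  # True: both reduce each 5x5 window
```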
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 99
Summarized Issues:
- Indexing Support for Sparse Tensors: The need for consistent and comprehensive indexing support for sparse tensors in PyTorch is highlighted, particularly the limitations of current methods like `index_select`, which only partially supports SparseCOO tensors with 1-dimensional indices. Additionally, there is a lack of support for direct indexing and for SparseCSR tensors, underscoring the importance of these features for efficiently handling sparse datasets.
- Output Inconsistencies in PyTorch: There are output inconsistencies between Eager mode and `torch.compile` using `mode="max-autotune-no-cudagraphs"` in PyTorch, where the outputs differ beyond strict tolerance levels. This potentially indicates numerical instability in backend optimizations, which could affect the reliability of results.
- Bugs in PyTorch Functions: Several bugs have been identified in PyTorch functions, such as `torch.utils.rename_privateuse1_backend("test")` triggering CUDA profiling warnings and a `CUPTI_ERROR_NOT_INITIALIZED` error, and `torch.fill` raising a `TypeError` due to an invalid combination of arguments. These issues highlight the need for better error handling and support for various function arguments.
- Integration and Optimization Proposals: Proposals have been made to integrate the ZenDNN library and Zentorch optimizations into PyTorch to enhance inference performance on AMD EPYC™ CPUs. This includes a phased approach for optimizing recommender systems and NLP workloads, with options for explicit or automatic enablement based on CPU architecture.
- Lint Rule and Internal Failures: A need for a lint rule to enforce the use of `std::optional` over `c10::optional` in the PyTorch project is discussed, as some internal failures might have been avoided with such a rule. Input is sought from specific contributors who have previously worked on related changes.
- Attention Mechanism and Memory Access: A common problem in PyTorch's attention mechanism is highlighted, where users encounter out-of-bounds memory access when using score modifications with block masks. This is due to the overlapping semantics of mask_mods and score_mods, and potential solutions are suggested to incorporate masking from mask_mods into score_mods to prevent invalid memory reads.
- Dynamic Shape and Compilation Issues: Dynamic shapes do not function correctly with keyword arguments (kwargs) in a PyTorch module, resulting in an error due to a mismatch between the expected top-level keys of the `dynamic_shapes` dictionary and the actual argument names (a sketch of the intended usage appears after this list). This affects models like rag and xlnet in Hugging Face, highlighting the need for better handling of dynamic shapes.
- Distributed Training and Initialization Errors: A RuntimeError is encountered when attempting to initialize a distributed training process using the "gloo" backend with nightly torch version 2.8, where the error message "makeDeviceForHostname(): unsupported gloo device" occurs. This affects both CUDA and CPU devices, with a stable workaround found using earlier nightly versions of torch.
- Memory Management and Profiling: A bug in the PyTorch function `torch.as_strided_scatter` is identified, where fuzz testing with specific edge-case inputs results in a "double free or corruption (out)" error. This indicates a potential memory management problem that leads to a core dump, although the behavior is not consistently reproducible in the latest PyTorch versions.
- Compilation and Export Errors: Compilation errors are encountered when using AOT Inductor with the PyTorch 2.6.0 RC in a Docker environment lacking a local CUDA Toolkit (CTK), where the absence of the `cuda.h` file leads to a failure. This prompts a discussion on whether to provide a clearer error message or include the missing file automatically.
- Optimizer and Data Type Issues: Bugs in PyTorch's optimizers (Adam, AdamW, and RAdam) are highlighted, where the `weight_decay` parameter is incorrectly applied to tensors with `requires_grad=False` (a minimal check appears after this list). This should not happen, as `weight_decay` is only meaningful for parameters that require gradients, suggesting a need for better parameter validation.
- Memory Usage and Fragmentation: A RAM leak is encountered during data loading with multiprocessing and Conv3d on the CPU in a custom PyTorch Dataset's `__getitem__` method. The process is unexpectedly killed due to out-of-memory errors when using random tensor shapes, but not when using fixed tensor shapes, indicating a potential issue with memory management.
- Compilation and Execution Errors: A bug in the `torch.compile` function of PyTorch is identified, where it unnecessarily creates a CUDA context and allocates GPU memory even when the code is intended for CPU execution. This leads to out-of-memory errors in multi-processing environments when multiple processes each create their own GPU context.
- Profiling and Metadata Enhancements: Enhancements to the PyTorch profiling tool are proposed, enabling Inductor to pass metadata for `aten` operations to the dispatcher. This allows the information to be included in the args field of the Kineto trace, as an alternative to the current fragile method of manually post-processing profile.json.
- Performance and Compilation Issues: A severe performance degradation is reported when continuously calling `nn.Linear` in fp32 on the NVIDIA GeForce RTX 5090D GPU, as compared to a manual matrix multiplication and addition. The problem may not occur on other 50-series cards, indicating a potential issue with the specific GPU model.
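For the dynamic-shapes-with-kwargs issue above, here is a minimal sketch of the intended `torch.export` usage; the module, argument names, and dimension name are illustrative, not taken from the issue:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, mask):
        return x + mask

batch = Dim("batch")
x, mask = torch.randn(4, 8), torch.randn(4, 8)

# The top-level keys of dynamic_shapes must match the forward() argument
# names; the reported error is a mismatch between these keys and the names
# of arguments passed as kwargs.
ep = export(M(), (x,), {"mask": mask},
            dynamic_shapes={"x": {0: batch}, "mask": {0: batch}})
print(ep)
```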
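And for the optimizer `weight_decay` issue, a minimal check of the expected behavior: a frozen tensor handed to Adam receives no gradients and so should not be decayed. This is a sketch of how one might probe for the reported bug, not the issue's exact reproduction:

```python
import torch

w = torch.nn.Parameter(torch.ones(3))                          # trainable
frozen = torch.nn.Parameter(torch.ones(3), requires_grad=False)  # frozen

opt = torch.optim.Adam([w, frozen], lr=0.1, weight_decay=0.5)
(w.sum() ** 2).backward()
opt.step()

# If weight_decay were (incorrectly) applied to the frozen tensor,
# its values would shrink; otherwise it stays all ones.
print(frozen)
```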
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 50
Summarized Issues:
- Test Failures and Instabilities: This topic covers various issues related to test failures and instabilities in the PyTorch project. One issue involves a disabled test due to flaky failures on the ROCm platform, while another describes an intermittent failure in a CUDA test with no clear cause. Additionally, there are issues with build instabilities on specific architectures and environments, such as the s390x architecture and the "linux-jammy-py3-clang12-executorch" build. These issues highlight the challenges in maintaining stable test environments across different platforms and configurations.
- ONNX Export Issues: Several issues are related to exporting PyTorch models to ONNX format. Users have encountered problems with unsupported operators, such as 'aten::fft_fft2', and issues with opset version mismatches during export. Additionally, there are challenges with large model exports exceeding size limits and unexpected behavior in opset version handling. These issues indicate the complexities involved in ensuring compatibility and correctness when converting models between frameworks.
- Compilation and Build Errors: Compilation and build errors are a recurring theme, with issues ranging from linker path errors on Windows to CMake compatibility problems. Users have reported errors due to implicit function declarations and incorrect library paths, as well as challenges with specific build environments like Triton on Windows. These issues underscore the importance of maintaining up-to-date and compatible build configurations across different systems.
- Performance and Utilization Challenges: Performance issues are highlighted in several cases, such as low GPU utilization during model training and slow execution of specific nodes in ComfyUI software. Users have also faced challenges with PyTorch's Distributed Data Parallel (DDP) environment, which affects efficient parallel processing. These issues reflect the ongoing efforts to optimize performance and resource utilization in machine learning workflows.
- Security Vulnerabilities: Security concerns are raised in issues related to CVE-2024-7804, which involves de-serialization vulnerabilities in PyTorch. The vulnerability is reportedly present in versions greater than 2.3.1, prompting discussions on its validity and potential overlap with other CVEs. These issues highlight the need for vigilance in addressing security risks in software projects.
- Compatibility and Support Issues: Compatibility issues are evident in several cases, such as the lack of support for certain CUDA capabilities and incompatible glibc versions causing import errors. Users have also reported problems with unsupported GPU models and outdated dependencies, indicating the challenges in maintaining compatibility with evolving hardware and software ecosystems.
- Documentation and Configuration Errors: Documentation and configuration errors are highlighted in issues related to unsupported section titles in docstrings and incorrect configuration settings. These issues emphasize the importance of accurate documentation and configuration management in software development.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 183
Key Open Pull Requests
1. [Inductor] Refactor wrapper codegen to use Wrapper IR.: This pull request refactors the existing wrapper code generation in the PyTorch project to utilize a new Wrapper Intermediate Representation (IR), which extends the current Memory Planning IR into a comprehensive Wrapper IR, aiming to provide structured information for generating wrapper code across different backends without modifying core files, thereby promoting modularity and encapsulation.
- URL: pull/150458
- Merged: No
- Associated Commits: d3a36, 6e176, c55fd, f299c, 403b9, fd03d, 2cd01, bd78b, df2f5, 0dfd0, 40011, 20b1d, 91a39, e7335, be417, fc5f0, 6c1cb, c94d1, 2b978, 2b6a4, 5e33e, 6e09c, b4c7a, 41cab, 2a839, 43773, 30cb1, 46e84, e6a12, 05cf7, aa7a2, 76b12, 2831c, 8678c, fc9ff, 2f17b, fb252, 9dc74, 57369, 63593, f621b, 51ea0, 0cd44
2. WIP: Remove Conda Instructions: This pull request involves the removal of Conda installation instructions from the PyTorch project documentation, leaving only the pip instructions where multiple installation methods were previously provided, and includes comments for areas where the replacement instructions were uncertain, as well as minor text fixes across several commits.
- URL: pull/150646
- Merged: No
- Associated Commits: 989af, 60a05, a348d, 5527a, 30405, dd37e, f4acf, f9fa6, 66ff9, 58edd, 63b68, e50d3, f387c, b53ae, 3f2f8, 65135, b8d8b, 35227
3. [test] binary docker builds: This pull request aims to improve the testing and building process of binary Docker images by modifying workflows to fetch images from AWS ECR instead of Docker IO, introducing reusable actions for binary Docker builds, simplifying scripts, and addressing issues related to image tagging and rebuilding, while also considering limitations for certain architectures like s390x and xpu.
- URL: pull/150558
- Merged: No
- Associated Commits: 8d01b, 1b5eb, 0f001, 5fa0a, 5d7d5, 10d1f, c697a, ad356, 7da1c, 5444b, c19f9, 4d90e, 441cf, 87ad8, 2e29f
Other Open Pull Requests
- DTensor Enhancements: This topic covers improvements in the DTensor library, including support for uneven sharding and a utility for converting strided sharding placements. These changes aim to enhance the functionality and flexibility of DTensor, particularly in handling unevenly divisible parameters and improving distributed checkpointing.
- MKLDNN Fusion for XPU: The pull requests under this topic focus on enabling MKLDNN fusion for the XPU backend in PyTorch. They include enhancements such as enabling specific operations, refining test scripts, and ensuring compatibility and performance improvements.
- ROCm Support Enhancements: These pull requests introduce support for hipSPARSELt and reintroduce SymmetricMemory in ROCm, enhancing performance and compatibility. They address various updates, including configuration flags and conditional checks, while noting certain hardware limitations.
- Dynamo and Inductor Improvements: This topic includes enhancements to the Dynamo component with a suite of CPython tests and updates to the Inductor component for handling custom operations. These changes aim to improve testing coverage and the handling of custom operators.
- CUDA and Tensor Enhancements: The pull requests here focus on CUDA enhancements, including RMSNorm kernel introduction and lazy cloning in `Tensor.to`. These changes aim to improve performance and efficiency in data transfer and computation.
- Type Hinting and Refactoring: These pull requests aim to refactor and simplify type hinting in PyTorch by adopting PEP 585 and PEP 604 (a before/after illustration appears after this list). They address type-shadowing issues and improve the `.pyi` stub files without requiring future imports.
- Miscellaneous Enhancements: This topic includes various enhancements such as enabling `vec::Vectorized` operations with scalars, tuning `_scaled_grouped_mm`, and introducing `PaddedTensor` for dynamic shapes. These changes aim to improve performance and functionality across different areas of PyTorch.
- Experimental and Work-in-Progress Updates: These pull requests include experimental updates like user buffer registration for FSDP2 and work-in-progress refactoring efforts. They are part of ongoing efforts to enhance PyTorch's capabilities and infrastructure.
- General Refactoring and Fixes: This topic covers general refactoring efforts such as `CUDAAllocatorConfig` reuse and `gen_pyi.py` refactoring. These changes aim to streamline code and improve maintainability.
- Symbolic Expression Enhancements: One pull request introduces `sym_and` and `sym_or` functions to enhance symbolic expression handling. This change simplifies expressions and improves runtime assertions and branch preservation.
- Guard Mechanism and Storage Updates: This topic includes updates to the `computeStorageNbytes` function with a `guard_or_false` mechanism. These changes ensure correct integration and testing of the functionality.
- cusparse Installation in Docker: This pull request addresses the installation of cusparse in the binary build Docker image for PyTorch. It is part of ongoing efforts to improve the build process and infrastructure.
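To illustrate the type-hinting refactor mentioned above, here is a small before/after comparison of PEP 585 builtin generics and PEP 604 union syntax (requires Python 3.10+); this is generic Python, not code from the pull requests themselves:

```python
from typing import List, Optional, Union  # old style needs these imports

def old(xs: List[int], y: Optional[str], z: Union[int, float]) -> List[int]:
    return xs

# PEP 585 builtin generics and PEP 604 unions express the same types
# without importing from typing:
def new(xs: list[int], y: str | None, z: int | float) -> list[int]:
    return xs
```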
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 226
Key Closed Pull Requests
1. Work on API Forwarding: This pull request, titled "Work on API Forwarding," was mistakenly opened against the wrong repository and was not merged, involving multiple commits that address various issues such as updating workflows, fixing bugs, and improving functionalities related to PyTorch's build processes, device compatibility, and API enhancements, as evidenced by the detailed commit messages and associated URLs.
- URL: pull/150528
- Merged: No
- Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 0fddc, 7808f, 74869, 15671, bb376, 423f3, 49be9, c5891, 125a1, 3a243
2. Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle: This pull request introduces statically launchable CachingAutotuners into the FXGraphCache's cache entry within the PyTorch project, allowing for the storage of compiled Triton kernels with a minimal memory footprint by utilizing StaticCudaLauncher and StaticTritonCompileResult, thereby eliminating the need for separate process calls on cache hits and ensuring consistent kernel usage while simplifying cache logic once feature parity is achieved.
- URL: pull/149054
- Merged: No
- Associated Commits: 7598f, f664e, 0d4a2, 9cb2f, 8a3cc, d0ab8, 1a60c, b8a94, 85dd8, 932fe, 2e827, 37795, 2311d, 13ef5, 59f06, a0a7a, 6d2fd, e9fdb, 1cd6d, b5d83, 59871, 99c9c, 4dfe7, 04e5b, d7b79, 92056, 32edc, c7b15, e0899, fb574, aa0b5, 2d0d7, 37a60, f5571, db019, 3f610, 35048, bc328, 1ece4, 4acbc, f59c7, 547fc
3. update get start xpu document for v2.7: This pull request involves updating the "get start xpu" documentation for version 2.7 of the PyTorch project, including modifications to the support table, prerequisites for binaries and building from source, and adjustments to the documentation format, although it was ultimately not merged.
- URL: pull/150397
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 8220f, e24fc, 5a461, 9aa6a, f7b24, a74e6
Other Closed Pull Requests
- PyTorch Dynamo Enhancements: This set of pull requests focuses on improving PyTorch Dynamo's handling of tensor subclasses. They ensure tracing into `__torch_function__` (a toy subclass of this kind is sketched after this list), support dynamic attributes, and address issues with `torch.Tensor._make_subclass`. These changes enhance compilation without altering legacy code and resolve specific issues in external projects like Hugging Face Diffusers and ComfyUI-GGUF.
- Unmerged Pull Requests: Several pull requests were proposed but not merged, addressing various enhancements and bug fixes. These include implementing `raise ... from ...` syntax, supporting unbacked subgraphs, rewriting the "should_swap" function, and adding support for `hermite_polynomial_h` in the MPS backend.
- Matrix Multiplication and GEMM Enhancements: These pull requests introduce and optimize matrix multiplication operations in PyTorch. They include enabling bf16 grouped GEMM, consolidating the Triton lowering of `scaled_mm`, and fusing matmul-reduce-scatters in asynchronous tensor parallelism.
- Bug Fixes and Reliability Improvements: These pull requests address various bugs and improve test reliability. They fix race conditions in unit tests and a compile error with `torch.Tensor.unsqueeze_`, and ensure deterministic settings for XPU accuracy tests on Intel GPUs.
- Dynamic Shapes and Symint Handling: Enhancements in handling dynamic shapes and symbolic integers are addressed in these pull requests. They utilize `statically_known_true` for dynamic shapes and introduce a fallback mechanism for unbacked symints in the Inductor component.
- CI and Build Process Optimization: This pull request optimizes the CI process by using system-installed NCCL, reducing build time, and unifying NCCL version pins, although it slightly increases Docker pull times.
- Miscellaneous Enhancements: Various enhancements include adding a "reason" field to `torch.compiler.disable`, improving cpp_wrapper, and ensuring GPU architecture compatibility in ROCm. These changes streamline processes and improve interface usability.
- Performance Improvements: This pull request enhances the performance of `sum` and `prod` reductions in MPSInductor by using `simd_sum` and `simd_product` operations, significantly improving `torch.compile` performance for specific models.
- Unmerged Testing and Bindings: These pull requests focus on testing binary builds and introducing C++ bindings for specific functions. They remain unmerged due to challenges in implementation and testing.
- Device and RNG State Handling: This pull request addresses treating third-party devices with `set_rng_state()` and `get_rng_state()` as CUDA-like devices, although it was not merged.
- GPU Warning Correction: This pull request ensures that a warning about the absence of a GPU is only triggered when attempting to enable the `tf32` setting in a CPU-only environment, correcting the behavior during `torch.export` initialization.
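For the Dynamo tensor-subclass work above, here is a toy example of the `__torch_function__` protocol that such subclasses rely on and that `torch.compile` must trace through; the class and its behavior are illustrative, not taken from the pull requests:

```python
import torch

class LoggingTensor(torch.Tensor):
    """Toy tensor subclass whose hook runs on every torch operation."""
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # A hook like this intercepts every torch.* call involving the
        # subclass; Dynamo has to trace into it rather than graph-break.
        print(f"called: {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
y = x + 1  # prints "called: add" and returns a LoggingTensor
```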
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- Add reverse engineered code to iOS build
- Toxicity Score: 0.55 (Defensive responses, unresolved conflict, skepticism.)
- This GitHub conversation involves username1 proposing changes to the iOS build process, which username2 critiques for potential issues. Username1 responds defensively, leading to a tense exchange. Username3 attempts to mediate by suggesting a compromise, but username2 remains skeptical. The tone shifts from collaborative to confrontational, with username1 expressing frustration over the lack of consensus.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| malfet | 229 | 22 | 13 | 114 |
| guilhermeleobas | 251 | 6 | 1 | 7 |
| atalman | 126 | 13 | 12 | 25 |
| laithsakka | 93 | 15 | 7 | 57 |
| jamesjwu | 142 | 8 | 11 | 10 |
| justinchuby | 84 | 3 | 5 | 78 |
| anijain2305 | 126 | 10 | 18 | 14 |
| pianpwk | 97 | 23 | 1 | 39 |
| clee2000 | 131 | 15 | 5 | 3 |
| xmfan | 109 | 12 | 5 | 22 |