Weekly GitHub Report for PyTorch: February 24, 2025 - March 3, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks the deprecation of PyTorch's official Anaconda channel, with a shift towards using Manylinux 2.28 for Linux binaries, and introduces a backward compatibility-breaking change by setting weights_only=True as the default for torch.load.
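As a quick illustration of the torch.load change, the sketch below (file name and tensor contents are illustrative) shows the new default and the explicit opt-out:

```python
import torch

# PyTorch 2.6 changes the default of torch.load to weights_only=True, so only
# plain tensors/state dicts (plus an allowlist of safe types) are unpickled.
torch.save({"w": torch.randn(3, 3)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # equivalent to weights_only=True on 2.6

# Checkpoints containing arbitrary Python objects now require an explicit
# opt-out, which should only be used for checkpoints from a trusted source:
# state = torch.load("ckpt.pt", weights_only=False)
```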
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL: This issue is a feature request for a stream management API in PyTorch's Process Group NCCL to address asynchronous communication challenges, particularly the "read-before-write" issue that arises when each NCCL process group operates on its own dedicated stream. The request suggests allowing users to define or set streams in process groups to ensure proper execution order of collective operations, which is crucial for overlapping communication with compute operations in multi-data center training scenarios.
  - The comments discuss related issues and potential solutions, including using existing APIs with user-defined streams to manage synchronization and execution order. Some contributors suggest that the current behavior is not a bug but a user responsibility to ensure correct synchronization, while others propose improvements for better stream control. There is a consensus on the need for clearer documentation and possibly a new feature to allow more direct control over which stream NCCL uses.
  - Number of comments this week: 16
- Checkpoint doesn't work with torch_function if torch_function change tensor metadata: This issue highlights a problem with the PyTorch checkpoint functionality when used in conjunction with TorchFunctionMode, specifically when __torch_function__ alters tensor metadata, leading to a metadata mismatch error during the recomputation phase. The user provides a minimal reproducible example demonstrating the error and seeks advice on whether there is a way to make __torch_function__ compatible with checkpointing.
  - The comments discuss potential reasons for the issue, such as the violation of autograd conditions due to metadata changes, and suggest workarounds like manually activating TorchFunctionMode within the checkpointed function. There is also a discussion about the possibility of checkpoint detecting and re-enabling modes during recomputation, with some users expressing optimism about the feasibility of this solution.
  - Number of comments this week: 9
- FlexAttention compiled has illegal memory access or device-side assert even though all tensors are contiguous: This issue involves a bug in the PyTorch library where the FlexAttention module, when compiled, encounters illegal memory access or device-side assertions despite all tensors being contiguous. The problem arises under specific parameter configurations, and while a workaround involving padding the rel_bias tensor can mitigate the illegal memory access, it does not resolve the device-side assertion issue, and further complications occur during the backward pass.
  - The comments discuss potential causes and solutions for the issue, including a suggestion to modify the block-mask function and score-mod function to prevent out-of-bounds access. There is a debate about whether the score_mod function should be evaluated conditionally based on the mask_mod function, and it is noted that this problem has been a long-standing issue with some workarounds available.
  - Number of comments this week: 6
- [inductor][cpu]AOT inductor AMP static shape default wrapper occupied almost 3x disk than before: This issue reports a significant increase in disk usage when using the AOT inductor AMP static shape default wrapper, which is observed to occupy almost three times more disk space than before, specifically when running the ResNet50 model. The problem is suspected to be caused by a specific commit, which has been identified as potentially responsible for this behavior.
  - The comments discuss the need to identify the commit causing the issue, with some users suggesting manual searching due to the lack of an automated mechanism. A suspected commit is identified, and further investigation is requested from other contributors to understand the cause of the increased disk usage. There is also a mention of an upcoming change that might resolve the issue.
  - Number of comments this week: 6
- No gradient for residuals in the return value of torch.linalg.lstsq: This issue highlights a concern with the torch.linalg.lstsq function in PyTorch, where the residuals in its return value do not have a gradient, unlike the solution. The user is questioning whether this behavior is expected and is considering contributing a solution to address this limitation.
  - The comments discuss the inefficiency of manually computing gradients for residuals and suggest that the current API design is flawed. A user proposes a code solution and offers to submit a pull request to improve the functionality, while another commenter advises against changing the API and suggests users implement their own solutions if needed. (A short workaround sketch follows this list.)
  - Number of comments this week: 5
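For the torch.linalg.lstsq item above, a minimal workaround sketch (shapes are illustrative): since the returned residuals are not differentiable, they can be recomputed from the differentiable solution so gradients flow back to the inputs.

```python
import torch

A = torch.randn(6, 3, requires_grad=True)
B = torch.randn(6, 2)

out = torch.linalg.lstsq(A, B)
# out.residuals does not carry a gradient, as reported in the issue above.

# Recompute the squared residuals from the differentiable solution instead,
# so autograd can propagate back to A (and to B if it requires grad).
residuals = ((A @ out.solution - B) ** 2).sum(dim=0)
residuals.sum().backward()
print(A.grad.shape)  # torch.Size([6, 3])
```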
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 135
Summarized Issues:
- ONNX Export Issues: The PyTorch ONNX export functionality encounters various issues, such as unsupported operations and errors during model export. These include failures with dynamic shapes, unsupported operators like aten::bucketize, and slicing operations on complex tensors. These issues prevent successful model export and require workarounds or updates to the ONNX export process. (A minimal export sketch follows this list.)
- Segmentation Faults and Errors: Several issues report segmentation faults and errors in PyTorch, often related to specific operations or configurations. These include faults in the Triton upstream, dtype view conversions, and the use of torch.sparse.sum, leading to crashes and requiring investigation and fixes.
- Distributed and Parallel Computing Challenges: PyTorch faces challenges in distributed and parallel computing, such as the lack of sharding strategies for certain operators, issues with the ProcessGroupNCCL, and problems with the fully_shard function in FSDP2. These issues affect performance and require enhancements to improve distributed training capabilities.
- Backend and Device-Specific Bugs: Various backend and device-specific bugs are reported, including issues with the MPS backend, CUDA, and the Inductor backend. These bugs lead to incorrect results, crashes, and performance discrepancies, necessitating backend-specific fixes and optimizations.
- Compilation and Export Errors: Errors during the compilation and export processes in PyTorch are reported, including issues with torch.compile, torch.onnx.dynamo_export, and the handling of dynamic shapes. These errors hinder model deployment and require updates to the compilation and export mechanisms.
- Performance and Optimization Issues: PyTorch experiences performance and optimization issues, such as slow operations with certain data types, inefficient memory usage, and discrepancies in execution times across different backends. These issues require performance tuning and optimization strategies to enhance efficiency.
- Model and Operation Inconsistencies: Inconsistencies in model outputs and operations are reported, such as incorrect gradients, mismatched outputs between backends, and unexpected behavior in certain functions. These inconsistencies require debugging and corrections to ensure reliable model performance.
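As a minimal repro sketch for the ONNX export bullet above (module and shapes are illustrative; whether aten::bucketize is supported depends on the exporter path), this exports a module that uses the reportedly problematic operator with the dynamo-based exporter:

```python
import torch

class Bucketize(torch.nn.Module):
    def forward(self, x, boundaries):
        # aten::bucketize is one of the operators reported as unsupported above
        return torch.bucketize(x, boundaries)

x = torch.randn(4, 8)
boundaries = torch.tensor([-1.0, 0.0, 1.0])

# With the dynamo-based exporter, an operator without an ONNX lowering
# surfaces as an export error or a fallback rather than a silent success.
onnx_program = torch.onnx.export(Bucketize(), (x, boundaries), dynamo=True)
```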
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 48
Summarized Issues:
- Unit Test Failures Due to Triton Update: The PyTorch project experienced widespread unit test failures due to an AttributeError caused by a deprecated API in the latest Triton update. This issue specifically affected the cpp_wrapper for GPU and XPU, requiring fixes for compatibility with NVIDIA and AMD platforms.
- torch.export API Issues: The torch.export function in PyTorch has issues with marking inputs as constant, leading to errors during the export process. Additionally, it fails to provide useful error messages when unrecognized dataclasses are used as input. (A registration sketch follows this list.)
- Test Failures in TestViewOpsLAZY Suite: The test test_real_imag_view_lazy_complex128 in the TestViewOpsLAZY suite was disabled due to its failure on the main branch of the PyTorch project. Multiple issues document the disabling of this test with references to recent failure examples.
- Bugs in PyTorch's MPS Backend: PyTorch's MPS backend has several bugs, including crashes when using scaled_dot_product_attention with non-contiguous tensors and torch.randn producing identical random values for 5D tensors. These issues highlight problems with the MPSGraph rather than PyTorch itself.
- Memory Access and Allocation Issues: PyTorch encountered memory access violations on the ROCm platform and memory allocator lock contention slowing down operations in the Inductor-CPU project. These issues suggest solutions like modifying backend allocation and memory buffer allocation strategies.
- Compilation and Runtime Errors: Various compilation and runtime errors were reported in PyTorch, including assertion errors with torch.norm and torch.nn.Fold, and a SubgraphLoweringException during inductor compilation. These issues indicate compatibility problems with certain functions and backends.
- Backend and Platform Compatibility Issues: PyTorch faced compatibility issues with different backends and platforms, such as the OffsetBasedRNGTracker always using CUDA, and the GradScaler not functioning on Intel ARC A770 GPUs. These issues highlight the need for more flexible backend support.
- Test Failures and Disabling: Several tests in PyTorch were disabled due to failures, including test_flatten_nonview_xla and test_mkldnn.py::TestMkldnnCPU::test_mul_cpu. These failures were not detected by the test detection system, indicating a limitation in the current testing framework.
- Errors with Triton and ROCm: PyTorch encountered errors with Triton and ROCm, such as a "Cannot bitcast data-type" error and a Triton HIP error indicating no kernel image available. These issues suggest compatibility challenges with the latest Triton updates.
- Miscellaneous Issues: Other issues in PyTorch include a malicious link promotion, a ResourceWarning from the tempfile module, and a request for guidance on building a specific PyTorch version. These issues highlight a range of challenges faced by the project.
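For the torch.export dataclass item above, a minimal sketch (the class and model are illustrative; it assumes torch.export.register_dataclass, which is the documented way to make a dataclass input traceable):

```python
import dataclasses
import torch
from torch.export import export, register_dataclass

@dataclasses.dataclass
class Batch:
    x: torch.Tensor
    y: torch.Tensor

# Registering the dataclass as a pytree node lets export() flatten it;
# without this step, export is where the unhelpful error described above
# is reported.
register_dataclass(Batch)

class Model(torch.nn.Module):
    def forward(self, batch: Batch):
        return batch.x + batch.y

ep = export(Model(), (Batch(torch.randn(2), torch.randn(2)),))
print(ep)
```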
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 180
Key Open Pull Requests
1. [test] Linter docker image: This pull request is focused on testing a linter Docker image for the PyTorch project, as indicated by the title "[test] Linter docker image," and it aims to address a specific issue referenced as #ISSUE_NUMBER, although it has not yet been merged.
- URL: pull/147789
- Merged: No
- Associated Commits: 4545e, ff328, c3f77, c65fb, 28551, 5488f, d143c, eb7af, dc5d2, 3f852, 029b1, 8e0c1, d9995, 2a3d9, a1f32, cfec8, 111b7, 821f8, d698d, 7a52c, a95d2, 14bcd, e92fb, cf903, 15d46, 1ce32, d6df9, 4fcbc, 097fd, f491b, 74240
2. Add note to get start xpu: This pull request aims to add a note to the PyTorch documentation to inform users that installing PyTorch from binaries will automatically include Intel® Deep Learning Essentials runtime packages, which could lead to environment issues if oneAPI is activated in a standalone installation, and thus advises users to avoid this situation.
- URL: pull/148168
- Merged: No
- Associated Commits: 122ae, cb561, 7e0a4, 4c6fc, dc4b2, 7a001, 2d789, d106c, b3771, 37c52, 7ba05, 065d3, 3d8db, 5694b, 355e1, ef7bd, 56b93, ad7af, e777d, 30662, 10e67, 99da1, 0edcf, 055f5
3. [pytree] add another simplified pytree module torch.pytree: This pull request introduces a new simplified module, torch.pytree, to the PyTorch library, which aligns more closely with the JAX pytree API by removing the tree_ prefix from its functions and reversing the argument order of the unflatten function to enhance compatibility with functools.partial, without causing any backward compatibility issues.
- URL: pull/148180
- Merged: No
- Associated Commits: 3ae64, 51232, c5f43, c3620, 08aff, 4e773, 308f6, 00a64, f6206, c93a2, 4b8ce, b8ead, c7e5b, ef3ad, 0518f, 7c8ae, a5d05
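For context on the torch.pytree proposal above, here is a hedged sketch using today's private torch.utils._pytree helpers; the torch.pytree names in the comments are the pull request's proposal, not a released API.

```python
import torch
import torch.utils._pytree as pytree  # existing (private) utilities

tree = {"a": torch.tensor([1.0]), "b": (torch.tensor([2.0]), torch.tensor([3.0]))}
leaves, spec = pytree.tree_flatten(tree)
rebuilt = pytree.tree_unflatten(leaves, spec)

# Under the proposal, these would roughly become torch.pytree.flatten and
# torch.pytree.unflatten, with unflatten taking the spec first so that
# functools.partial(torch.pytree.unflatten, spec) works, mirroring JAX.
```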
Other Open Pull Requests
- Intel Triton Update for PyTorch 2.7: This topic involves updating the Intel Triton component within the PyTorch project to prepare for the release of version 2.7. The pull request is marked as work-in-progress and includes multiple commits refining the update, with several contributors involved for review and collaboration.
- Configurable Record and Storage Alignment in torch.save: This topic focuses on making the record and storage alignment in the torch.save function configurable. The pull request is part of a series of changes tracked through the ghstack tool, with multiple updates and contributions from several collaborators.
- Eager Then Compile Stance in PyTorch: This topic introduces a new eager_then_compile stance through the set_stance APIs. The pull request aims to reduce compile times and improve the ergonomics of using dynamic shapes by initially running the compile in eager mode and then deriving input dynamism on subsequent invocations. (A hypothetical usage sketch appears at the end of this list.)
- Support for contextlib.suppress in PyTorch: This topic involves adding support for the contextlib.suppress feature in the PyTorch project. The pull request is part of a series of related changes tracked by the ghstack tool and involves multiple commits that are currently not merged.
- Enhancements to torch.load in FakeTensorMode: This topic focuses on enhancing the PyTorch library by adding information about checkpoint offsets to untyped storages when using the torch.load function under the FakeTensorMode. The pull request is part of a series of changes tracked by the ghstack tool.
- Threading of boxed_forward_device_index in CompiledFXGraph: This topic addresses the threading of the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. The pull request ensures accurate updates of BoxedDeviceIndex from cache hits and includes testing with a specific Python benchmark.
- Performance Enhancements in PyTorch Symbolic Tracing: This topic involves enhancing the performance of the PyTorch project by moving the Node._prepend and Node._remove_from_list methods to C++. The pull request results in a significant reduction in function calls and execution time during microbenchmarking of symbolic tracing.
- RuntimeEstimator and SACEstimator Modifications: This topic involves modifications to the RuntimeEstimator and SACEstimator in the PyTorch project. The pull request addresses an unspecified issue and includes various commits such as testing fake utilities, fixing default arguments and bindings, and resolving linting issues.
- Loading FakeTensors with Correct Devices in FakeTensorMode: This topic addresses the issue of loading FakeTensors with the correct devices under FakeTensorMode in PyTorch. The pull request fixes the functions _rebuild_tensor_v2 and _rebuild_tensor_v3 as part of a series of related changes tracked through ghstack.
- torch.serialization.skip_data Compatibility with torch.load: This topic aims to enhance the PyTorch library by enabling the torch.serialization.skip_data functionality to be compatible with the torch.load method. The pull request is part of a series of related updates tracked through the ghstack tool.
- Propagation Strategy Change in Inductor Pattern Matcher: This topic proposes a change in the propagation strategy of arg_kwarg_vals within the PyTorch project. The pull request modifies the Inductor Pattern Matcher to trace replacement graphs using arg_kwarg_vals instead of node.meta['val'].
- New Operator Tag "needs_exact_strides" for Inductor: This topic introduces a new operator tag, "needs_exact_strides," for the Inductor component in the PyTorch project. The pull request enforces exact strides on custom operators, with plans to make this behavior the default in a subsequent update.
- Templatized CUDA Kernel for GammaBeta Backwards Pass: This topic introduces a new templatized CUDA kernel designed to replace three existing non-ROCM CUDA kernels for the GammaBeta backwards pass. The pull request addresses performance issues by utilizing warp shuffles, coalesced loads, and parallelism across the M dimension.
- Regression Fix in evaluate_expr Function: This topic addresses a regression issue in the evaluate_expr function caused by a previously added logging argument. The pull request refactors the code to eliminate the use of expr_sym_node_id in cache lookups and introduces a new function evaluate_sym_node.
- Renaming of node.meta["arg_kwarg_vals"] Key: This topic proposes renaming the key node.meta["arg_kwarg_vals"] to node.meta["orig_arg_kwarg_vals"] in the codebase. The pull request aims to improve clarity and is awaiting continuous integration (CI) test results for validation.
- Optimization of PyTorch Codebase: This topic aims to optimize the PyTorch codebase by removing unnecessary tensor clones and redundant variable calls. The pull request also adds argument comments for clarity as part of addressing a specific issue referenced in the project.
- Selective Code Building at Optimization Level O1: This topic is focused on selectively building code at optimization level O1 as part of an ongoing stack of changes. The pull request includes tasks such as porting improvements to cpp_wrapper mode and addressing CMake packaging issues.
- Support for Rowwise Scaling in Scaled GEMM Operations: This topic introduces support for rowwise scaling in scaled GEMM operations. The pull request includes various enhancements such as fixes for offline tuning and the addition of new unit tests for offline scaled GEMM.
- Sharding Propagation Refactor for Cross-Mesh Computations: This topic refactors the sharding propagation mechanism in the PyTorch project to handle cross-mesh computations. The pull request moves the same mesh check from the sharding propagation level to the individual operator level.
- Registration Process for at::_weight_int4pack_mm_with_scale_and_zeros: This topic aims to facilitate the registration process related to the at::_weight_int4pack_mm_with_scale_and_zeros function. The pull request is part of a stack of changes managed through the ghstack tool.
- Stride Consistency in while Loop for body_fn Function: This topic addresses the requirement for the stride in a while loop to remain consistent with the input for the body_fn function. The pull request is part of a series of updates tracked via ghstack and involves multiple contributors for review and collaboration.
- Preservation of Torch Function Mode Stack in torch.utils.checkpoint: This topic addresses the issue of preserving the torch function mode stack during recompute in torch.utils.checkpoint. The pull request ensures that the TorchFunctionModeTLS remains active even when .backward() is called.
- New Utility Function torch._library.utils.normalize_args_kwargs: This topic introduces a new utility function, torch._library.utils.normalize_args_kwargs. The pull request standardizes the (args, kwargs) to align with the PyTorch dispatcher calling convention and includes new tests to validate this functionality.
- Integration of XPU Device Support in LayerNormKernel: This topic aims to integrate support for the XPU device into the LayerNormKernel devices within the PyTorch project. The pull request is in collaboration with a related pull request from the Intel torch-xpu-ops repository.
- Performance Improvement by Moving map_aggregate to C++: This topic involves moving the map_aggregate function to C++ within the PyTorch project. The pull request results in improved performance as demonstrated by a microbenchmark showing a reduction in function calls and execution time.
- Crash Fix in gen_patterns.py Script: This topic addresses a crash issue encountered when running the gen_patterns.py script in the PyTorch project. The pull request fixes a TypeError related to the issubclass() function and includes multiple commits for fixing the error and cleaning up the code.
- Argument Passing Support in DeviceMesh.get_group Function: This topic introduces support for passing arguments to the DeviceMesh.get_group function in the PyTorch project. The pull request includes adding tests and updating relevant files like test_dtensor_compile.py and distributed.py.
- Enhancements to torchgen Tool for C Shim Files: This topic aims to enhance the torchgen tool by enabling it to automatically update C shim files with a version number and a list of new arguments for modified operations. The pull request ensures backward compatibility when adding new arguments to fallback operations in Python.
- Addition of Meta Kernels in PyTorch: This topic aims to enhance the PyTorch project by adding additional meta kernels. The pull request is associated with the ghstack tool, which helps manage stacked pull requests.
- Support for Custom Operations with Arbitrary Input Types: This topic introduces support for custom operations in PyTorch that can handle arbitrary input types. The pull request demonstrates this through a test case involving a custom operation that processes dictionary and tensor inputs.
- Fix for linspace Function Decomposition: This topic addresses an issue in the PyTorch project by fixing the decomposition for the linspace function. The pull request ensures that no non-functional operations are performed on functional operators.
- Recompile Limit Handling in Dynamo Component: This topic addresses the issue of exceeding the recompile limit in the Dynamo component of the PyTorch project. The pull request implements a solution that allows the system to run recursively only when this limit is surpassed.
- Data Type Checks in torch.matmul and Related Functions: This topic addresses an issue related to the data type checks in the torch.matmul function and its related operations. The pull request ensures correct behavior when handling different dimensional inputs.
- Replacement of unimplemented with unimplemented_v2 in torch/_dynamo/variables/base.py: This topic involves replacing the unimplemented function with unimplemented_v2 in the torch/_dynamo/variables/base.py file. The pull request is part of issue #147913 and includes several commits for updates and fixes.
- Removal of Internal Stack Traces for Graph Breaks: This topic aims to remove internal stack traces for graph breaks when the fullgraph=True option is used in the Dynamo component of the PyTorch project. The pull request is indicated by the title and commit messages.
- Renaming of Test File in PyTorch Project: This topic involves renaming a test file from "test_graph_break_messages" to "test_error_messages" in the PyTorch project. The pull request is part of a series of changes managed through the ghstack tool.
- XPU Support for Inductor MM Triton Kernel Benchmark: This topic aims to enable the XPU (Intel GPU) support for the Inductor MM Triton Kernel Benchmark. The pull request addresses a test case regression issue introduced in a previous update.
- PT2 Enablement Tests for torch.float8_e8m0fnu Data Type: This topic introduces PT2 enablement tests for the torch.float8_e8m0fnu data type. The pull request addresses specific functionalities such as displaying e8m0 in TORCH_LOGS output and testing uint8 to e8m0 conversions.
- Support for invoke_subgraph Feature in PyTorch: This topic introduces support for the invoke_subgraph feature in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and includes multiple commits with updates and discussions.
- CK Submodule Version Usage in ROCm Environment: This topic aims to ensure that PyTorch uses the CK submodule's version of the config.h file instead of potentially defaulting to the system version. The pull request enhances compatibility and consistency within the ROCm environment.
- Enhancements to nonstrict_trace Functionality: This topic aims to enhance the nonstrict_trace functionality in the PyTorch project. The pull request allows it to handle objects whose types have been registered as constants using pytree.register_constant.
- Support for Navi4 Architecture in CUDA Tests: This topic involves updating the test/test_matmul_cuda.py and torch/testing/_internal/common_cuda.py files to introduce support for the Navi4 architecture. The pull request adds an IS_NAVI4 constant and implements conditional test skipping for row-wise FP8 tests.
- Autotuning of User-Defined Triton Kernels: This topic aims to enhance the autotuning process of user-defined Triton kernels in the PyTorch project. The pull request utilizes real input data to ensure the correct execution path is followed, with test cases included to validate the changes.
- Upgrade of oneDNN Submodule to Version 3.7: This topic aims to upgrade the oneDNN submodule to version 3.7 in the PyTorch project. The pull request enhances performance for various operations on Intel Xeon processors and Intel GPUs, while also addressing several issues related to accuracy and performance.
- Main Tests for Cutlass Backend Matrix Multiplication: This topic introduces the initial step of adding main tests for matrix multiplication (mm), addition and multiplication (addmm), and batch matrix multiplication (bmm) functionalities in the Cutlass backend of the PyTorch project. The pull request is part of a series of related changes tracked through the ghstack tool.
- Spelling Corrections Across PyTorch Codebase: This topic addresses and corrects various spelling errors across the PyTorch codebase. The pull request enhances documentation quality and code readability, ensuring consistent spelling without affecting the functionality of the code.
- Introduction of RandomBatchSampler in PyTorch: This topic introduces a new RandomBatchSampler to the PyTorch project. The pull request optimizes the process of generating batch indices by replacing the traditional iteration method with slicing, resulting in significant speed improvements.
- Skipping of Intel GPU TestCommon::test_dtypes Test: This topic aims to skip the Intel GPU TestCommon::test_dtypes test for the bmm and addbmm operations. The pull request addresses the lack of complex64 support and extends the DecorateInfo to accommodate a list of device types.
- CK Backend for Memory-Efficient Attention in ROCm: This topic introduces CK as the backend for memory-efficient attention in ROCm. The pull request enables the use of attention bias while noting limitations such as the lack of support for Nested Tensors.
- Removal of Unused rand Function Call: This topic addresses issue #147171 by removing an unused call to the rand function when not falling back to eager execution. The pull request includes commits that eliminate dead code in the Graph component of the PyTorch project.
- Stack-Allocated Buffer in GEMM Template: This topic proposes using a stack-allocated buffer in the GEMM template to reduce memory allocator lock contention. The pull request potentially saves a few cycles and removes some non-determinism, although no significant performance difference was observed.
- Test Update for ghstack-Poisoned Changes: This topic, titled "test," is part of a stack of changes managed by ghstack. The pull request includes updates marked as "[ghstack-poisoned]" with two specific commits, but it has not yet been merged into the PyTorch project.
- Application of torch_compile_options to C10 Libraries: This topic aims to enhance the PyTorch project by applying torch_compile_options to the C10 libraries. The pull request addresses a specific issue and includes commits that introduce this change and fix a semicolon error.
- Enhancement of MSVC Build Process with /permissive- Flag: This topic aims to enhance the MSVC build process of the torch libraries by implementing the /permissive- flag. The pull request addresses build errors as part of the solution and involves collaboration with several contributors.
- Parallelization of bf16 to f32 Conversion in at::addmm and Linear Kernels: This topic aims to enhance the performance of at::addmm and linear kernels by parallelizing the conversion from bf16 to f32. The pull request focuses on parallelization and vectorization of the conversion process.
- Switching TestConsistency to Use MPS Device: This topic proposes switching the TestConsistency to use the MPS device. The pull request is part of a stack of changes aimed at eventually moving decorators away from test_mps to OpDB.
- Prevention of Premature Garbage Collection in THPGenerator_reduce: This topic addresses the issue of premature garbage collection of the state tensor in the THPGenerator_reduce function. The pull request increases its reference count, preventing runtime errors when using the multiprocessing spawn methods "forkserver" and "spawn" in PyTorch.
- Resetting AOT Counter in torch._dynamo.reset Function: This topic aims to reset the AOT (Ahead-Of-Time) counter when the torch._dynamo.reset function is called. The pull request is part of a series of changes tracked by ghstack and involves multiple contributors for review and collaboration.
- Introduction of export_cache Feature in PyTorch: This topic introduces a feature called export_cache, which is a modified version of @mark_compiled_region. The pull request is designed to handle function call differentiation by input metadata in non-strict export scenarios.
- Annotation of Forward Graph Dynamic Tensor Outputs: This topic involves annotating forward graph dynamic tensor outputs with mark_dynamic in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and is currently a work in progress.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update aimed at upgrading the oneDNN library to version 3.7 without making any changes to iDeep. The pull request is not intended for merging.
- Reconstruction of WeakRefVar in Dynamo Component: This topic involves the reconstruction of the WeakRefVar in the Dynamo component of the PyTorch project. The pull request is part of a stack of changes managed by ghstack, with multiple contributors being notified for review or collaboration.
- Resolution of HSDP Custom Hook Unit Test Issues: This topic addresses the issue of HSDP custom hook unit tests being multi-threaded and using a single physical GPU. The pull request removes the device rank setting to prevent referencing the same GPU with multiple ranks.
- Draft for Stable Version of Torch Library: This topic is a draft for a stable version of the Torch library. The pull request is part of a stack of changes managed by ghstack, with the continuous integration checks intentionally skipped.
- Change of force_nn_module_property_static_shapes Flag Default Setting: This topic proposes changing the default setting of the force_nn_module_property_static_shapes flag to False. The pull request supports the dynamic shapes roadmap by reducing the number of unrolled out flags.
- Correction of Parameter and Function Descriptions in Test Package: This topic addresses inaccuracies in parameter and function descriptions within the test package of the PyTorch project. The pull request aims to correct these issues for improved clarity and accuracy.
- Enablement of Kineto for XPU: This topic aims to enable Kineto for XPU by updating the intel-pti to version 0.10.1. The pull request turns on the XPU_ENABLE_KINETO flag as part of the ongoing development in the PyTorch project.
- Replacement of unimplemented with unimplemented_v2 in torch/_dynamo/variables/constant.py: This topic proposes replacing the existing 'unimplemented' functionality with 'unimplemented_v2' in the torch/_dynamo/variables/constant.py file. The pull request is part of a series of changes tracked by ghstack and is linked to issue #147913 on GitHub.
- Data Type Checks for torch.addbmm, torch.addmv, and torch.baddbmm: This topic addresses the issue of incorrect data type checks for the output of the PyTorch functions torch.addbmm, torch.addmv, and torch.baddbmm. The pull request includes updates to ensure proper handling of these data types.
- Migration of Python Formatting Tool to ruff format: This topic aims to migrate the Python formatting tool for the torch/ao/ directory from PYFMT to ruff format. The pull request is part of a stack of changes tracked via ghstack and involves multiple commits with updates.
- Use of itertools.chain.from_iterable for Code Enhancement: This topic proposes the use of itertools.chain.from_iterable in the codebase to enhance readability, efficiency, and support for infinite iterables. The pull request is currently open for review on the PyTorch GitHub repository.
- Update of XPU Triton Build to Manylinux 2.28 Environment: This topic aims to update the continuous integration process by moving the XPU Triton build to the manylinux 2.28 environment. The pull request includes a change to use GCC version 13.
- Untracked Unbacked Symbols Handling in Conditional Statements: This topic addresses the issue of untracked unbacked symbols leaking from the true and false branches of a conditional statement in the PyTorch project. The pull request ensures that these symbols are properly identified and tracked as outputs of the while_loop operator.
- Exposure of Functions in torch_python DLL for Custom Backend: This topic aims to expose functions used in a custom backend within the torch_python DLL to improve performance. The pull request addresses issue #148208 while referencing a related discussion on symbol hiding.
- Introduction of AppendingByteSerializer Utility Class: This topic introduces a new utility class called AppendingByteSerializer to the PyTorch project. The pull request is designed to facilitate the efficient appending of sequential byte data with customizable serialization and deserialization processes.
- Performance Enhancement of save_cache_artifacts Function: This topic aims to significantly enhance the performance of the save_cache_artifacts function. The pull request redesigns the serialization algorithm and opts out of using pickle, reducing the computational expense incurred when the function is called repeatedly in internal workloads.
- Modification of Cutlass Backend for Self-Multiplication Operations: This topic aims to modify the Cutlass backend by removing an assertion that previously prevented self-multiplication operations. The pull request allows such operations to proceed without restriction.
- Expansion of addmm Test in Cutlass Backend: This topic aims to enhance the cutlass backend by expanding the addmm test to cover all four broadcastable shape biases. The pull request is part of a series of related changes tracked through a stack of pull requests.
- Modification of require_contiguous Function for Exact Strides: This topic aims to modify the require_contiguous function to necessitate exact strides rather than just the stride order. The pull request is part of a series of changes tracked by ghstack and involves multiple contributors and reviewers.
- Reenablement of Subprocess Addition Matrix Multiplication Test: This topic aims to reenable a subprocess addition matrix multiplication test in the Cutlass backend of the PyTorch project. The pull request involves multiple contributors for review and collaboration.
- Fix for Vectorized Code Generation of tanh Function: This topic addresses an issue in the PyTorch project by fixing the vectorized code generation for the tanh function. The pull request resolves this by switching to the Sleef implementation to ensure consistent outputs.
- Default USE_LIBUV to 0 for dist.init_process_group on Windows: This topic addresses an issue with the dist.init_process_group function on Windows by proposing to default USE_LIBUV to 0. The pull request includes a more informative error message to improve user experience.
- Performance Enhancement of Interpolation Operations on MPS Backend: This topic aims to enhance the performance of interpolation operations in PyTorch on the Metal Performance Shaders (MPS) backend. The pull request addresses a bug in the benchmarking script and optimizes the computation of spatial coordinates.
- Enhancement of Error Messaging for Missing Ninja Build System: This topic aims to enhance the error messaging related to missing Ninja build system in the cpp_extensions module of the PyTorch project. The pull request is indicated by the commit titled "Update ninja missing error message."
- Fix for Multiple OpenMP Runtimes Linked to libtorch_cpu.so: This topic addresses an issue in PyTorch where building with OpenBLAS support and directly linking libopenblas with libgomp.so results in multiple OpenMP runtimes being linked to libtorch_cpu.so. The pull request proposes a fix by avoiding linking against libomp.so if OpenBLAS is already linked with libgomp.so.
- Tolerance Adjustments for test_torchinductor_opinfo on AArch64: This topic addresses the failure of the test_torchinductor_opinfo test for nn.functional.triplet_margin_loss on AArch64. The pull request increases the acceptable absolute and relative tolerances (ATOL and RTOL) for this test when using F16.
- Minimum Viable Product for P1 INT16 Full Quantization Target: This topic introduces a minimum viable product (MVP) for the P1 INT16 Full quantization target. The pull request involves quantizing the input to int16 as part of the PyTorch project.
- Registration of Normal Class to register_dataclass Function: This topic aims to address a specific issue discussed in a previous pull request by registering a normal class to the register_dataclass function within the PyTorch project. The pull request is indicated by the commit message and linked discussion.
- Rank Local Checkpointing Demonstration in DCP: This topic is a work in progress aimed at demonstrating rank local checkpointing in the Distributed Checkpointing Protocol (DCP) for the PyTorch project. The pull request is not yet ready for review.
- Respecting priority_order Setting in torch.compile Path: This topic addresses the issue where the torch.compile path was not respecting the priority_order setting of sdpa_kernel. The pull request ensures that the context manager handling within torch.compile now properly acknowledges this configuration.
- Handling of Real-Tensor Fallback Failures in Dynamic Shapes: This topic addresses an issue in the PyTorch project by implementing a solution to ignore failures when the real-tensor fallback mechanism does not succeed. The pull request is part of handling dynamic shapes during export.
- Deprecation of Silent Fallback Mechanism for GEMM Tuning: This topic initiates the first stage of deprecating the silent fallback mechanism for tuning GEMM in the PyTorch project. The pull request involves the removal of a conditional block related to the eager mode implementation for int_mm.
- Reversion of copy2d Implementation for Data Transfers: This topic aims to revert a previous change that implemented the use of "copy2d" for host-to-device and device-to-host data transfers. The pull request is indicated by the original commit changeset aa7d1b82ac9d.
- Enablement of AddressSanitizer in CUDA Tests: This topic aims to enable AddressSanitizer (ASAN) in CUDA tests for the PyTorch project. The pull request involves collaboration with multiple contributors mentioned in the body, although it has not yet been merged.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project, specifically for testing purposes and not intended for merging. The pull request involves multiple contributors tagged for review or awareness.
- Optimization in PyTorch Distributed Library: This topic introduces an optimization in the PyTorch Distributed (PTD) library by allowing the use of the current compute stream as the NCCL stream when operating in async=False mode. The pull request significantly reduces CPU overhead by 50% and overall CPU/GPU time by 15% during collective communication operations.
- Removal of Assertion in expand_to_full_mesh_op_strategy Function: This topic addresses issue #147732 by removing an assertion in the expand_to_full_mesh_op_strategy function. The pull request involves several contributors for review and discussion, but it has not yet been merged.
- Fix for is_compile_supported() Function in PyTorch: This topic addresses a bug in the PyTorch project by fixing the is_compile_supported() function to correctly handle cases where the device_type includes a device index. The pull request is referenced in issue #147826.
- Setting disable_clone Parameter to True in opt_gm Function: This topic proposes setting the disable_clone parameter to True when executing the opt_gm function. The pull request addresses issue #147843 in the PyTorch project.
- Transition of mkldnn_linear Components to oneDNN MatMul: This topic aims to transition the mkldnn_linear and mkldnn_linear_backward components from using oneDNN Inner Product to oneDNN MatMul. The pull request is part of a series of changes tracked by ghstack and is currently not intended for merging.
- Performance Enhancement of gemv Operator in PyTorch: This topic aims to enhance the performance of the gemv operator in PyTorch by offloading OpenBLAS gemv calls to a dedicated OpenBLAS kernel. The pull request results in a 14% performance improvement for operations on matrices of shape 1x4096x4096.
- Testing of optree Component with Latest HEAD Version: This topic is focused on testing the optree component with the latest HEAD version in the PyTorch project. The pull request is indicated by the title and commit message and has not yet been merged.
- Introduction of Dim._OBLIVIOUS Feature in Export Dynamic Shapes: This topic introduces the Dim._OBLIVIOUS feature in export dynamic shapes and the _mark_oblivious() function in dynamo decorators. The pull request allows developers to opt into size-oblivious reasoning and avoid 0/1 specialization.
- Addition of Missing Matrix Cases in CI Setup: This topic adds missing matrix cases for the pytorch-linux-focal-py{3.12,3.13}-clang10 configuration in the continuous integration setup. The pull request references specific lines in the project's GitHub workflow files to ensure comprehensive testing coverage.
- Removal of Unnecessary Tombstone Messages from TARGETS Files: This topic aims to proactively remove unnecessary tombstone messages from TARGETS files. The pull request addresses the redundancy of these messages due to the merging of files using non_fbcode_target.
- Allowing Tensor Types in allowed_getattr_types_for_subgm: This topic addresses an issue in the PyTorch project by allowing tensor types in the allowed_getattr_types_for_subgm when verifying export processes. Previously, invalid get_attr types in non-lowerable parts of a graph caused a SpecViolationError.
- Increase of Persistent Reduction Threshold for Inductor Multikernel Flag: This topic proposes increasing the persistent reduction threshold for the inductor multikernel flag from 16 to 32. The pull request is expected to yield significant performance improvements, as demonstrated by benchmark results.
- Use of TorchFunctionMode for SDPA Dispatch in CP Feature: This topic introduces the use of TorchFunctionMode to dispatch the Scaled Dot-Product Attention (SDPA) for the CP (Checkpointing) feature in the PyTorch project. The pull request is indicated by the title and the associated commit.
- Triggering of MI300-Specific CI Workflows on PRs: This topic aims to enable the triggering of MI300-specific continuous integration workflows on pull requests. The pull request uses a PR label, with a temporary workaround via the ciflow/unstable label.
- Opportunity Finder Feature for GEMM Horizontal Fusion Search: This topic introduces an "opportunity finder" feature within the inductor for General Matrix Multiply (GEMM) horizontal fusion search. The pull request includes a detailed test plan for local reproduction and performance benchmarking on a GPU.
- Handling of Partial and Scalar Values in PyTorch: This topic addresses a specific issue related to the handling of partial and scalar values in the PyTorch project. The pull request involves collaboration with multiple contributors, although it is not intended to be merged at this time.
- Enablement of cpu_offload Feature for _distribute_state_dict Function: This topic aims to enhance the PyTorch project by enabling the cpu_offload feature for the _distribute_state_dict function. The pull request is part of an ongoing effort to address a specific issue and involves collaboration with multiple contributors.
- Update of basic.TestSqueeze for 0-Dimensional Squeeze Operations: This topic aims to update the basic.TestSqueeze by replacing a TODO with a test for 0-dimensional squeeze operations. The pull request ensures that scalars remain unchanged as part of the PyTorch project.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is marked as a test and not intended for merging.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update that aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is not intended to be merged.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test and not intended for merging, aiming to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors for review and feedback.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test and not intended for merging, aiming to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic is a test update that aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is not intended for merging as indicated by the title and the lack of a specific issue number in the body.
- Test Update for oneDNN Library Upgrade to Version 3.7: This topic aims to upgrade the oneDNN library to version 3.7 in the PyTorch project. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Test-Related Update for PyTorch Project: This topic is a test-related update for the PyTorch project, as indicated by the title '[TEST]' and the test plan mentioned in the body. The pull request includes a single commit with a differential revision reference, but it has not yet been merged.
- Update of Protobuf Dependency to Version 5.29: This topic aims to update the Protobuf dependency to version 5.29 in the PyTorch project. The pull request successfully addresses CMake build compatibility while expressing uncertainty about resolving issues with Bazel builds.
- Order Maintenance in ElasticDistributedSampler: This topic addresses an issue in the ElasticDistributedSampler where the order of indices is not maintained correctly when the start_index is not zero. The pull request ensures that training resumes from the correct point if a job is restarted.
- Extension of CUDA Test to Include XPU SyclExtension Case: This topic extends the existing CUDA test to include the XPU SyclExtension case for the py_limited_api feature. The pull request cannot be merged until the commit pin for torch-xpu-ops is updated.
- Documentation Update for py_limited_api Feature in SyclExtension: This topic updates the documentation to align the description of the py_limited_api feature in the SyclExtension with the existing descriptions for CPP and CUDA. The pull request addresses a previously missed change due to concurrent work on the SyclExtension.
- CI Workflow Modification to Avoid Workspace Cleaning: This topic addresses an issue in the continuous integration process by modifying the workflow to avoid cleaning the workspace when fetching the repository. The pull request is indicated by the title and the reference to a specific issue number.
- Default Generation of AOTI Size and Stride Input Checks: This topic introduces a default generation of AOTI (Ahead-Of-Time Inductor) size and stride input checks in the PyTorch project. The pull request ensures these checks are only executed when the AOT_INDUCTOR_DEBUG_COMPILE environment variable is set.
- Reversion of Triton Call in Worker Process: This topic aims to revert a previous change that involved calling Triton in the worker process and compiling ahead of time. The pull request is indicated by the title and commit message referencing the original changeset and differential revision.
- Reversion of Triton Call in Worker Process: This topic aims to revert a previous change that involved calling Triton in the worker process and compiling ahead of time. The pull request is indicated by the original commit changeset 5e70e713d95b and the associated differential revision D70210584.
- Skipping of test_reference_numerics_large_jiterator_unary_cuda_complex64 Test: This topic proposes to skip the test_reference_numerics_large_jiterator_unary_cuda_complex64 test on CUDA due to a change in recent NumPy versions that alters the convention from nan+infj to -inf+infj, similar to a previous skip on ROCm.
- Gradient Computation Corner Case in torch.nn.functional.hardswish: This topic addresses a corner case in the gradient computation of torch.nn.functional.hardswish. The pull request modifies the condition for gradient calculation, enabling CUDA support for the test test_hardswish_grad_corner.
- Modification of torch.celu and quantized_celu Operations: This topic addresses issue #148065 by modifying the torch.celu and quantized_celu operations to return the input directly when alpha is set to infinity. The pull request ensures that the function celu(x, inf) is well-defined for all input values x.
- Replacement of unimplemented with unimplemented_v2 in codegen.py: This topic aims to address issue #147913 by replacing the unimplemented function with unimplemented_v2 in the codegen.py file. The pull request removes the unused import of unimplemented.
- Compatibility Fix for Atomic Operations on ARMv8-A Platforms: This topic addresses compatibility issues with atomic operations on ARMv8-A platforms, such as the Raspberry Pi 4. The pull request adjusts the compilation flags to use -march=armv8-a+sve, ensuring that PyTorch builds correctly without generating unsupported instructions.
- NameError Fix in PyTorch Test Suite: This topic addresses a NameError in the PyTorch project's test suite by providing a dummy DataType. The pull request ensures syntactical correctness when the TEST_TENSORBOARD flag is set to False.
- Consistency and Brevity Enhancement in Test Code: This topic aims to enhance the consistency and brevity of the test code in the PyTorch project. The pull request utilizes the existing load_torchbind_test_lib function to replace multiple slightly different implementations of repeated code.
- OpenMP Flag Parsing Fix for clang-cl on Windows: This topic addresses the issue of incorrect OpenMP flag parsing by clang-cl on Windows. The pull request ensures that MSVC-style arguments are used and clang-style arguments are properly prefixed with -Xclang.
- Experimental Change for C++ Wrapper with CUDA Graphs: This topic is an experimental change aimed at measuring the impact of integrating a C++ wrapper with CUDA graphs in the PyTorch project. The pull request is indicated by its title and the associated commit.
- Integration of myst_nb Plugin into PyTorch Documentation: This topic proposes the integration of the myst_nb plugin into the PyTorch documentation. The pull request enables the rendering of Jupyter notebooks and execution of code blocks within markdown documents.
- Output Memory Planning for ATen Convolution Operation: This topic aims to enable the ATen convolution operation to plan its output memory for potential fusion opportunities. The pull request is part of the lowerings process in the PyTorch project.
- Enhancement of OrderedPreservingDictTest.test_range_insert Functionality: This topic enhances the OrderedPreservingDictTest.test_range_insert by incorporating functionality to check key-value pair indexing and order. The pull request addresses an unused loop variable issue.
- Setting of force_parameter_static_shapes Parameter to False: This topic involves setting the parameter force_parameter_static_shapes to False in the PyTorch project. The pull request is part of a stack of changes managed by ghstack and has not yet been merged.
- Disabling of cuDNN During Export Tracing for Batch Normalization: This topic addresses the issue of ConstraintViolation errors in the batch normalization operation by disabling cuDNN during export tracing. The pull request prevents the creation of problematic guards.
- Removal of Outdated CUDA Version Checks: This topic proposes the removal of outdated CUDA version checks from the PyTorch project. The pull request is based on the framework now requiring a minimum CUDA version of 11.
- Transformation of UnpackedDualTensor into Namedtuple: This topic aims to enhance the PyTorch project by transforming the UnpackedDualTensor into a true namedtuple. The pull request is part of a series of related changes tracked through the ghstack tool.
- Incorporation of Python 3.9 Typing Features: This topic aims to update the PyTorch project by incorporating Python 3.9 typing features. The pull request is indicated by the commit message and the involvement of multiple contributors tagged for review or notification.
- Draft for Addressing Specific Issue in PyTorch Project: This topic is a draft aimed at addressing a specific issue in the PyTorch project. The pull request is indicated by the placeholder '#ISSUE_NUMBER' and involves collaboration with multiple contributors.
- Upgrade of oneDNN Submodule to Version 3.7 with PDB Build Focus: This topic aims to upgrade the submodule oneDNN to version 3.7 in the PyTorch project. The pull request focuses on building PDB with the Z7 option and is currently not intended for merging.
- Enhancement of Code with Docstrings and Type Annotations: This topic focuses on enhancing the code by adding comprehensive docstrings and implementing proper type annotations. The pull request also introduces a class method for context retrieval and improves overall code organization.
- Introduction of HPU Profiler Activity: This topic introduces a new profiler activity specifically for HPU (Habana Processing Unit) devices. The pull request addresses issue #148181 in the PyTorch project.
- Fix for FlexibleLayout Weights in Batch Matrix Multiplication: This topic addresses an issue in the PyTorch project where an error occurs with FlexibleLayout weights in Batch Matrix Multiplication (BMM). The pull request potentially alters node B's layout during a specific kernel selection process.
- Distributed Data Handling for Hugging Face Readers and Writers: This topic introduces the capability for Hugging Face (HF) readers and writers to handle data in a distributed manner. The pull request ensures that all tensors intended for the same file are directed to the same rank.
- Enhancement of Cache Size Limit Error Message: This topic aims to enhance the cache size limit error message by including the configured limit size. The pull request provides more informative feedback when the cache size limit is reached.
- Dispatch Logic Update for BF16 Linear Layers: This topic updates the dispatch logic for linear layers using BF16 in the PyTorch project. The pull request utilizes oneDNN instead of OpenBLAS, based on profiling results on AArch64.
- Refactoring of Estimate Runtime and Pick Loop Order Heuristics: This topic involves refactoring the code by moving the estimate runtime and pick loop order heuristics into the choices.py file. The pull request is part of an ongoing effort to reorganize similar elements within the scheduler.
- Fix for [No available kernel] Error with cuDNN on A100 GPUs: This topic addresses a '[No available kernel]' error encountered with cuDNN on A100 GPUs. The pull request is part of a stack of changes tracked via ghstack and involves multiple contributors for review and collaboration.
- Support for Dilation in max_pool2d Lowering Process: This topic introduces support for dilation in the max_pool2d lowering process within the PyTorch project. The pull request is part of a stack of changes aimed at enhancing the functionality of the inductor component.
- Lowerings for max_pool3d Function in PyTorch: This topic introduces lowerings for the max_pool3d function in the PyTorch project. The pull request is part of a stack of changes and is currently open with a single commit linked to it.
- Addition of kBatch_sweep Option in ROCm Configuration: This topic introduces a new feature to the ROCm configuration by adding a kBatch_sweep option. The pull request allows users and tests to specify a set of kBatches to evaluate.
- Disabling of Torch Check for Float8_e5m2 Matrix Multiplication on ROCm: This topic proposes disabling the torch check for the multiplication of two Float8_e5m2 matrices on ROCm. The pull request includes a test command for verification on ROCm hardware that supports fp8.
- Fix for Logging Mechanism to Prevent Maximum Recursion Error: This topic addresses an issue in the PyTorch project by fixing the logging mechanism to prevent a maximum recursion error. The pull request is detailed in the test plan and associated with differential revision D70416613.
- Fallback Mechanism for JK Error on Platform Without Service Network: This topic addresses an issue where a "jk error" occurs on a platform lacking a service network. The pull request implements a fallback mechanism when JK is disabled.
- Enhancement of qlinear_pointwise_binary Fusion Process: This topic aims to enhance the qlinear_pointwise_binary fusion process by enabling dimension collapse for 3D linear cases. The pull request specifically targets the qlinear+add path with sum as a post-operation.
- Modification of TensorMaker::make_tensor() Function: This topic addresses issue #146419 by modifying the TensorMaker::make_tensor() function to set the requires_grad attribute. The pull request is currently open for review on the PyTorch GitHub repository.
- Addition of Recursive Glob Support to setuptools: This topic aims to enhance the build process by adding recursive glob support to setuptools in the PyTorch project. The pull request ensures that all necessary files are included during the setup.
- Update of 'fmt' Submodule to Version 11.1.4: This topic aims to update the 'fmt' submodule to version 11.1.4 in the PyTorch project. The pull request primarily addresses bug fixes, ABI fixes, and improvements in compiler support.
- Hot Fix for Inductor Component Following Changes in Pull Request #148011: This topic addresses a hot fix for the Inductor component following changes made in pull request #148011. The pull request involves multiple contributors for review and collaboration.
- Fix for Include Directories with Spaces on Windows: This topic addresses a bug in the PyTorch project where include directories containing spaces on Windows systems cause errors during execution. The pull request implements a fix that ensures paths are correctly handled without being split.
- CMake and RowwiseScaledMM.cu File Updates for SM10.0a Architecture: This topic updates the CMake files and the RowwiseScaledMM.cu file to enable building on the SM10.0a architecture. The pull request ensures compatibility with CUDA toolkit 12.8.
- Fix for Test Errors in aot_inductor_package: This topic addresses test errors in the aot_inductor_package by ensuring that script.ld is copied to the build-time directory. The pull request fixes the fbcode test failures introduced by a previous pull request.
- New Test for Layernorm CUDA Backwards Pass Accuracy: This topic introduces a new test to ensure the accuracy of the layernorm CUDA backwards pass. The pull request serves as a foundational step towards future performance improvements.
- Upgrade of oneDNN Submodule to Version 3.7: This topic aims to upgrade the oneDNN submodule to version 3.7 in the PyTorch project. The pull request brings various performance improvements and optimizations for convolution and matrix multiplication primitives on Intel Xeon processors.
- Test for Code Base of Previous Pull Request: This topic is a test for the code base of a previous pull request (https://github.com/pytorch/pytorch/pull/147498) in the PyTorch project. The pull request aims to build the Windows binary and test the test_mkldnn.py.cc file.
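For the eager_then_compile item near the top of this list, a hypothetical usage sketch: torch.compiler.set_stance exists as of PyTorch 2.6, but the "eager_then_compile" stance is what the pull request proposes, so the exact name and behavior may differ from what lands.

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + x

# Hypothetical: the stance string below is the PR's proposal and may not be
# accepted by released builds of torch.compiler.set_stance.
torch.compiler.set_stance("eager_then_compile")

f(torch.randn(8))   # would run eagerly while input dynamism is observed
f(torch.randn(16))  # later calls compile using the derived dynamic shapes
```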
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 249