Weekly GitHub Report for PyTorch: February 17, 2025 - February 24, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward-compatibility-breaking change in the default behavior of `torch.load`.
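A minimal sketch of the two user-facing changes mentioned above, assuming PyTorch 2.6; the checkpoint path is a placeholder:

```python
import torch

@torch.compile
def step(x):
    return x * 2

# torch.compiler.set_stance (new in 2.6) can make compiled functions skip
# compilation and run eagerly within this scope.
with torch.compiler.set_stance("force_eager"):
    step(torch.randn(4))

# In 2.6, torch.load defaults to weights_only=True; loading arbitrary
# pickled objects now requires opting out explicitly.
state = torch.load("checkpoint.pt", weights_only=True)  # placeholder path
```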
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Triton pin update for PyTorch 2.7 / Triton 3.3: Upgrading PyTorch-Triton to a version that Supports Blackwell: This issue involves updating the PyTorch-Triton integration to support the Blackwell architecture by upgrading to a version of Triton that includes necessary optimizations and features. The update aims to address various compatibility and performance issues, particularly focusing on the integration of new functionalities and resolving existing bugs before the release of PyTorch 2.7.
- The comments discuss the urgency of updating the Triton pin to support Blackwell, with concerns about unresolved issues and the timing of the update relative to the PyTorch 2.7 release. Contributors highlight specific test failures and compatibility issues, propose potential solutions, and track progress on related tasks, emphasizing the need for coordination and timely resolution of these issues.
- Number of comments this week: 12
- [compile] Modularize very long compilation: This issue addresses the prolonged compilation time experienced during the model export/compile process, where a single generated C++ file exceeding 78,000 lines takes over an hour to compile using only one CPU core. The user suggests modularizing and parallelizing the compilation process to improve efficiency and reduce the time required for this stage.
- The comments discuss the potential causes of the issue, including the generation of a large Triton kernel and the need for modularization. Suggestions include splitting the C++ file into smaller parts for parallel compilation, although this may not be straightforward due to the current architecture. There is also a mention of testing with lower optimization levels and the possibility of using subgraph handling to manage repeated submodules.
- Number of comments this week: 11
- [Export AOTI] dynamic_shapes export and compile degraded output: This issue involves a bug in the export and compile process of a model using dynamic shapes for width (W) and height (H), which results in degraded output compared to using fixed resolutions. The problem seems to be related to the use of `torch.export.Dim` for dynamic shapes, which causes runtime errors during inference unless the dimensions are aligned with the inference resolution.
- The comments discuss the difficulty in debugging the issue without a reproducible example, suggest testing subparts of the model, and mention a potential problem with the exported program's graph. A minimal reproduction script is provided, and it is noted that the issue might stem from an invalid graph produced during export, with errors in AOTI and compile processes being secondary.
- Number of comments this week: 10
- [RFC] Test Cases Enabling for Accelerators: This issue addresses the challenge of enabling existing PyTorch test cases for new device backends, such as accelerators, by proposing a flexible mechanism to determine at runtime which tests to run, skip, or adapt based on a device's specific capabilities. The proposed approach involves introducing device abstractions that report capabilities, allowing for dynamic configuration of test inclusion or parameterization, thereby minimizing intrusive modifications and providing robust coverage across diverse hardware capabilities.
- The comments discuss extending OpInfo for better device capability querying, the potential benefits for both in-tree and out-of-tree backends, and the need for a registration mechanism for device interfaces. There is interest in how this proposal will scale PyTorch tests for third-party hardware, the primary use case, and the integration with existing test infrastructure. The discussion also touches on the compatibility of capabilities across different hardware and the impact on test writing for PyTorch developers.
- Number of comments this week: 9
- PyTorch VS2022 official build Windows binary illegal instruction on AVX2(max ISA level) CPU: This issue addresses a bug in the PyTorch official build for Windows using Visual Studio 2022, where an illegal instruction error occurs on CPUs with a maximum ISA level of AVX2 due to the generation of AVX512 instructions. The problem does not affect current PyTorch official binaries built with VS2019, and it is challenging to reproduce locally, suggesting it might be specific to the official build environment.
- The comments discuss potential solutions, including involving Microsoft, understanding the issue's scope across platforms, and maintaining AVX2 support due to its prevalence in client CPUs. There is a suggestion to revert to VS2019 or fix the issue by identifying differences between local and official environments. The discussion also touches on the possibility of making AVX2 the new base architecture instead of SSE4, as the problem might be related to non-deterministic linker behavior with AVX512 implementations.
- Number of comments this week: 7
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- DISABLED test_transformer_training_is_seq_parallel_False (main.DistTensorParallelExampleTest): This issue pertains to a disabled test, `test_transformer_training_is_seq_parallel_False`, within the `DistTensorParallelExampleTest` suite, which is failing on the main branch of a GitHub project. The failure is suspected to be caused by changes introduced in one of the pull requests #122995, #122996, or #122997, and it affects the ROCm platform, prompting a wide range of contributors and maintainers to be notified for further investigation and resolution.
- [NestedTensor] multiply batch and ragged dimension to get shape of values tensor: This issue discusses a feature request for the PyTorch library, specifically the ability to manipulate the dimensions of a NestedTensor by collapsing the first two dimensions into one, which would allow for more flexible tensor operations. The proposal includes a code snippet demonstrating how this could be achieved, highlighting the potential for enhanced functionality in handling nested tensor sizes.
- Error: command buffer exited with error status.: This issue describes a problem encountered while training a model using llama2.c on an iMac with an AMD Radeon Pro 5700 XT GPU, where the user experienced a command buffer error at epoch 11,580, causing significant delays in epoch processing times. The error, which appears to be related to GPU timeout issues, occurred after the user built PyTorch from source due to the lack of recent nightly builds for MacOS + x86_64, and the user is seeking insights into whether these GPU timeout errors could be related to garbage collection or other factors.
- scalar_tensor call with symbolic bool input does not work in inductor: This issue involves a bug in the PyTorch library where the `scalar_tensor` function fails when called with a symbolic boolean input while using the Inductor backend. The error occurs during the execution of a compiled function, resulting in a `TypeError` due to an `Equality` object lacking a length, which disrupts the expected behavior of the function.
- Support AOT Autograd level Caching: This issue addresses the need for caching in the `torch.compile` process when using an `aot-autograd` enabled backend, as the current compilation time for models like Llama2 7B is significantly long, impacting development speed. The problem is particularly pronounced in scenarios where PyTorch/XLA is integrated with VLLM, requiring pre-compilation of multiple input shape combinations, which currently results in a lengthy warm-up phase due to the lack of support for dynamic shapes.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 107
Summarized Issues:
- AttributeError in Distributed Training with LLaMA 3.2: An AttributeError occurs during the implementation of distributed training using pipelining for the LLaMA 3.2 model. The error message indicates that an 'InterpreterModule' object lacks the 'cache' attribute, potentially due to a missing or incorrectly referenced attribute in the model's pipeline configuration.
- Floating Point Exception in `torch.nn.functional.conv3d`: A floating point exception and subsequent crash occur in the `torch.nn.functional.conv3d` function of PyTorch when using specific input parameters, particularly a very large stride value. This issue is related to the mkldnn backend and can be temporarily bypassed by disabling mkldnn.
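A minimal sketch of the temporary workaround mentioned in the item above, disabling the mkldnn (oneDNN) backend; the shapes are illustrative and the crash-triggering stride from the report is not reproduced here:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, 8)
w = torch.randn(4, 3, 3, 3, 3)

# Globally disable the mkldnn backend so the fallback implementation
# handles the convolution instead.
torch.backends.mkldnn.enabled = False
out = F.conv3d(x, w, stride=1)
torch.backends.mkldnn.enabled = True
```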
- Build Error with oneAPI on Windows: A build error is encountered when upgrading to a new internal build of oneAPI on Windows, where the compiler does not conform to the C++ standard. This results in multiple syntax and conversion errors, which can be resolved by adding the `/permissive-` flag and making specific code changes to address const qualifier issues.
- AttributeError in Triton Upstream Inductor Project: Widespread failures in unit tests for the Triton upstream Inductor project are caused by an AttributeError due to a deprecated API. A 'dict' object is incorrectly expected to have an 'equal_to_1' attribute, affecting platforms NV and ROCm.
- Timeout in Nightly Windows Builds: The timeout of nightly Windows builds for the PyTorch project began around January 31, 2025, due to an increase in build time. This is potentially linked to recent CUDA-related pull requests, specifically the addition of numerous element-wise CUDA kernels.
- Memory Access Violation on ROCm Platform: A memory access violation error, specifically HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION, occurs on some unit tests when using the ROCm platform with the latest Triton commit. This issue is encountered during an attempt to update Triton in preparation for version 3.3.
- Bug in `_export_forward_backward` API: A bug is encountered when using the experimental `_export_forward_backward` API in PyTorch to export the joint graph of a model with tied weights. This results in an assertion error due to a parameter not receiving a gradient.
- Missing `comm.reduce_add_coalesced` Function: The absence of the `comm.reduce_add_coalesced` function in the CUDA communication collectives of the PyTorch project is highlighted. The provided documentation link and image indicate this issue, but no potential alternatives or fixes are suggested.
- Discussion on CUDA 11.8 Support Removal: The potential removal of support for CUDA 11.8 in the PyTorch project is discussed, proposing to announce its removal in release 2.7 and officially drop it by release 2.8. The support is to be phased out in nightly builds between March 2025 and June 2025.
- Monitoring Test Count Regressions: Monitoring test count regressions is necessary to identify potential bugs in sharding or unintended changes, such as modifications to bash scripts that might inadvertently stop testing on certain platforms like macOS. This ensures the integrity and completeness of the testing process.
- Bug in `Lazy*` Modules Representation: A bug in the PyTorch library is highlighted where the representation of `Lazy*` modules, such as `LazyLinear`, does not update correctly after loading parameters from another module's state dictionary. This results in an inaccurate display of input features even after a forward pass is performed.
- Need for Robust API in `torch.export`: There is a need for a robust API in `torch.export` to specify whether certain inputs, like `MyStaticInput`, should be treated as constants or not. The current implementation incorrectly treats registered constants as empty containers, leading to a `ValueError` during the export process.
- Failure in Real-Tensor Tracing: A failure in real-tensor tracing during the export of dynamic shapes in PyTorch is caused by a division by zero error in the `_split_dim_meta` function. This is due to a mismatch between fake and real tensor shapes, suggesting potential solutions such as rewriting metas for real-tensor tracing.
- Bug in PyTorch's Dynamo with Constant Tensors: A bug in PyTorch's Dynamo is highlighted where constant tensors created with `torch.tensor` do not correctly recompile with the appropriate device guards when the ambient device index changes. This leads to failures in ensuring that the output tensor is always on the current CUDA device.
- Configuration Option for Recompile Reasons: A configuration option is added to the PyTorch project that enables the printing and generation of all recompile reasons without halting at the first encountered reason. This is indicated in the GitHub issue titled "Add a config to allow print and generate all recompile reasons and not stop at first."
- Dynamic Epsilon Value in Optimizers: PyTorch needs to dynamically adjust the default epsilon (eps) value in optimizers like Adam based on the precision used, such as float16. This prevents issues like NaN values in model parameters due to ineffective default eps in low precision training.
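A minimal sketch of the kind of manual adjustment the item above argues should happen automatically; the eps value is illustrative only:

```python
import torch

model = torch.nn.Linear(16, 16).half()

# The default eps (1e-8) rounds to zero in float16, which can yield NaN
# updates; a larger eps is therefore passed explicitly for low-precision
# training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)
```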
- Bug in `torch.compile` with `dict_items`: A problem with the PyTorch library is highlighted where the `torch.compile` function using the "eager" backend and `fullgraph=True` does not support iterating over `dict_items` from a user-defined object. This is demonstrated by a provided code snippet that fails to execute correctly.
- Crash in `scaled_dot_product_attention` on MPS: A bug in PyTorch is highlighted where using the `scaled_dot_product_attention` function on Apple's Metal Performance Shaders (MPS) backend with non-contiguous 5D tensors causes a crash. This is due to inadmissible tensor sizes being passed to `MetalPerformanceShadersGraph`.
- NaN Values in Backward Pass: The occurrence of NaN values during the backward pass in a PyTorch project is reported, specifically originating from the `normalize_weight_jit` function used for weight normalization in a neural network. The problem surfaces as a runtime error indicating that the function returned NaN values in its output.
- Gradient Checkpointing Inefficiency: A problem with gradient checkpointing in PyTorch is highlighted, where setting `use_reentrant=False` does not reduce peak memory usage compared to not using gradient checkpointing at all. Setting `use_reentrant=True` significantly reduces memory usage, indicating a potential inefficiency or bug when `use_reentrant=False`.
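A minimal sketch of the two configurations being compared in the item above; the module and input are placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(64, 1024, requires_grad=True)

# Non-reentrant checkpointing: the variant reported not to reduce peak memory.
y = checkpoint(block, x, use_reentrant=False)

# Reentrant checkpointing: the variant reported to reduce memory significantly.
y_reentrant = checkpoint(block, x, use_reentrant=True)
y.sum().backward()
```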
- Precision Discrepancy in `polygamma` Function: A precision discrepancy in the PyTorch library is highlighted where the `polygamma` function, when `n == 1`, executes the `trigamma_kernel` under eager mode but switches to `cal_polygamma` under `torch.compile`. This results in less precise outputs and suggests a change to consistently use `trigamma_kernel` for improved accuracy.
- `INTERNAL ASSERT FAILED` in `torch.svd`: An `INTERNAL ASSERT FAILED` error occurs in PyTorch when using the `torch.svd` function with extremely large input matrices. This is likely due to the use of a 32-bit LAPACK API that cannot handle arguments with more than 2^32-1 elements, and there is currently no plan to support a 64-bit version.
- Floating Point Exception in `torch.nn.functional.conv1d`: A bug in the PyTorch library is reported where using the `torch.nn.functional.conv1d` function with specific input parameters results in a "Floating point exception (core dumped)" error. This is reproducible with the nightly-build version `2.7.0.dev20250208+cpu` and may be related to a known issue with oneDNN.
- Incorrect Outputs in `scaled_dot_product_attention`: A bug in PyTorch 2.6.0+rocm6.2.4 is reported where the `scaled_dot_product_attention` function using the memory-efficient backend (aotriton 0.8.0) produces incorrect outputs when a custom attention mask is applied. Upgrading to aotriton 0.8.2 resolves the problem.
- Performance Regression in "modded-nanogpt": A performance regression in the "modded-nanogpt" project is reported, where the PyTorch version 2.7.0.dev20250209 is observed to be 2 seconds slower than the previous version 2.7.0.dev20250208. Detailed logs are provided for both versions to illustrate the discrepancy.
- Test Failures in FlexDecoding Component: Test failures in the FlexDecoding component of the Triton project are reported when integrated with PyTorch, specifically related to an assertion error indicating an invalid stage for an operation. A detailed script for reproducing the problem is included along with discussions on potential fixes and workarounds.
- Bug in Export and Compile Process with Dynamic Shapes: A bug in the PyTorch export and compile process is reported where using dynamic shapes for width (W) and height (H) results in degraded output. Using fixed resolutions produces correct results, and attempts to debug this are complicated by runtime errors and discrepancies between exported and compiled outputs.
- Test Failures Related to AOTI and Triton on Blackwell Platform: Multiple test failures related to AOTI and Triton on the Blackwell platform are tracked, including internal assertion failures, kernel library production issues, and missing attributes in code generation tests.
- Deprecation of Silent Fallback to ATen Logic in Inductor: The deprecation of the silent fallback to ATen logic in Inductor when generating GEMM kernels is proposed. This aims to respect the user's specified backends in `max_autotune_gemm_backends` by raising an error if these backends fail and ATen is not included.
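A minimal sketch of how the knob named in the item above is typically set, assuming the `torch._inductor.config.max_autotune_gemm_backends` option; under the proposal, a failure of the listed backends would raise instead of silently falling back to ATen:

```python
import torch
import torch._inductor.config as inductor_config

# Restrict GEMM autotuning to Triton only (ATen deliberately excluded).
inductor_config.max_autotune_gemm_backends = "TRITON"

@torch.compile(mode="max-autotune")
def matmul(a, b):
    return a @ b
```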
- Enhancement Request for `Dim.AUTO` and `Dim.DYNAMIC`: A feature request is made to enhance the `Dim.AUTO` and `Dim.DYNAMIC` functionalities in a GitHub project by allowing users to specify optional minimum and maximum values. This could potentially optimize dynamic dimension handling and improve performance through Inductor optimizations.
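For contrast with the request above, a minimal sketch of how a named `torch.export.Dim` already accepts explicit bounds today; the module and shapes are illustrative:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

x = torch.randn(4, 64, 64)

# Named dims can carry min/max bounds; the feature request asks for the
# same option on Dim.AUTO and Dim.DYNAMIC.
h = Dim("h", min=32, max=256)
w = Dim("w", min=32, max=256)
ep = export(M(), (x,), dynamic_shapes={"x": {1: h, 2: w}})
```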
- Logging Error in `remote_cache.py`: A logging error in the `remote_cache.py` file of a PyTorch project is reported, where log messages are being emitted after pytest has exited. This is due to the use of the `atexit.register` decorator, resulting in a "ValueError: I/O operation on closed file" when attempting to write to a closed stream.
- Feature Request for Node and Weight Deactivation: A proposal is made to add native functionality to the PyTorch library that allows for the temporary deactivation ("sleep") and reactivation ("wake") of specific nodes and weights within neural network layers. This aims to enhance research capabilities and training optimizations by providing a more granular and reversible approach compared to current methods like pruning or freezing entire layers.
- Slow `torch.distributed` `all_reduce` Operation: A problem is reported where the first execution of the `torch.distributed` `all_reduce` operation takes significantly longer (over 30 seconds) when using Ray with specific `CUDA_VISIBLE_DEVICES` settings. Subsequent executions are much faster, and this behavior does not occur when using other configurations or Docker.
- Unintuitive Behavior of `F.pad` Function: The unintuitive behavior and errors encountered when using the `F.pad` function in PyTorch are highlighted. The function unexpectedly requires batch and channel dimensions for padding operations on 2D tensors, despite this requirement not being documented, leading to user confusion and the need for cumbersome workarounds.
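A minimal sketch of the workaround pattern alluded to above, assuming a non-constant padding mode (where the batch/channel requirement typically applies):

```python
import torch
import torch.nn.functional as F

img = torch.randn(64, 64)  # a plain 2D tensor

# Reflect padding expects a (batch, channel, H, W) layout, so the tensor is
# temporarily given two extra leading dimensions and then squeezed back.
padded = F.pad(img.unsqueeze(0).unsqueeze(0), (2, 2, 2, 2), mode="reflect")
padded = padded.squeeze(0).squeeze(0)
```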
- Precision Drop in ONNX Export with Sigmoid Function: A significant precision drop is observed when exporting a PyTorch model using the sigmoid function to the ONNX format. The model performs accurately in PyTorch but shows discrepancies in output accuracy after conversion and inference in ONNX, particularly with small input values.
- Inconsistent `clamp_` Operation on MPS Device: A bug in the PyTorch library is highlighted where the `clamp_` operation on tensors behaves inconsistently on the MPS device, particularly when applied to sliced tensors. This results in incorrect in-place modifications, unlike the expected consistent behavior observed on the CPU device.
- Feedback Request for RISC-V Support Enhancements: A request for feedback and review on two pull requests aimed at enhancing PyTorch's support for the RISC-V architecture and the RISC-V Vector Extension (RVV) is made. The focus is on kernel optimization, vector library support, and CI support for cross-compiling.
- Request for L-BFGS-B Algorithm Implementation: A request is made for implementing the L-BFGS-B algorithm in PyTorch to support box constraints. The current L-BFGS implementation lacks this feature, which limits the ability to efficiently perform tasks like maximum likelihood estimation for point processes and Hawkes processes using PyTorch's tensor computation and GPU acceleration.
- Bug in `nn.GaussianNLLLoss` Function: A bug in the PyTorch `nn.GaussianNLLLoss` function is reported, where the expected behavior of allowing the `var` parameter to have one dimension of size 1 is not met. The function only permits the final dimension to be of size 1, resulting in an unexpected "var is of incorrect size" error.
- Inclusion of Dataclass Instances in Computational Graph: A bug in the PyTorch project is highlighted where the Dynamo and fx tracing systems currently permit dataclass instances to be included in the computational graph. This poses a problem because the dataclass constructor can contain arbitrary user code, potentially leading to unintended behavior or security vulnerabilities.
- Index Error in FSDP Wrapped Module: A bug in the PyTorch library is described where calling a Fully Sharded Data Parallel (FSDP) wrapped module with zero arguments results in an index error. This is due to an assumption in the code that at least one argument is always provided, suggesting that the code could be modified to support arbitrary numbers of arguments and keyword arguments.
- Lack of Gradient Support in `torch.linalg.lstsq`: The lack of gradient support for the `residuals` component in the return value of the `torch.linalg.lstsq` function is highlighted when using the `gels` driver. The `solution` component does have gradient support, prompting a query on whether this behavior is expected.
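A minimal sketch of the call discussed above; shapes are illustrative:

```python
import torch

A = torch.randn(10, 3, requires_grad=True)
b = torch.randn(10, 2)

result = torch.linalg.lstsq(A, b, driver="gels")
solution, residuals = result.solution, result.residuals

# The solution participates in autograd; the issue asks why the residuals
# returned alongside it do not.
solution.sum().backward()
```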
- Numerical Stability in `linalg_eig_backward` Function: A proposal is made to enhance numerical stability in the `linalg_eig_backward` function on GPUs within PyTorch by adding a small epsilon to the denominator in the backward formula. This addresses problems with unstable gradients or NaNs during backpropagation in physics-inspired machine learning models using `torch.linalg.eigh`.
- Illegal Memory Access in FlexAttention Module: A bug in the PyTorch library is reported where the FlexAttention module, when compiled and run on a CUDA device, encounters illegal memory access or device-side assertions. This occurs despite all tensors being contiguous, with the problem persisting even after attempting a workaround involving padding adjustments to the `rel_bias` tensor.
- Documentation Error in `torch.distributed.elastic.multiprocessing.start_process()`: The documentation for `torch.distributed.elastic.multiprocessing.start_process()` in PyTorch 2.3.0 and later versions incorrectly includes the removed `tee` parameter. It is suggested that the documentation be updated to reflect the current API by replacing references to `tee` with the `logs_specs` parameter.
- Incorrect Behavior with Non-Reentrant Checkpoints: A bug in the PyTorch project is described where using non-reentrant checkpoints in combination with ambient saved tensor hooks results in incorrect behavior. This is demonstrated by a test case involving tensor operations and gradient calculations that produce unexpected results when logging the pack/unpack hooks.
- Segmentation Fault in `copy_()` Function: A bug in PyTorch version 2.6.0 is described where the `copy_()` function fails with a segmentation fault when using Hierarchical Sharded Data Parallel (HSDP) in Fully Sharded Data Parallel (FSDP) version 2 on a 2-GPU machine. The same setup works in version 2.5.1.
- Silent Failure in `view()` with In-Place Modification: A bug in the PyTorch library is described where using the `view()` function combined with in-place modification fails silently when applied to a `DTensor`. This results in no changes to the tensor's values, as demonstrated by the unchanged all-ones tensor output when the `works` flag is set to `False`.
- Errors with PyTorch and NCCL on CUDA 12.8: A user experiences errors with PyTorch and NCCL installations on CUDA 12.8, where the NCCL version is incorrectly reported as 2.25.1+cuda12.2 instead of 2.25.1+cuda12.8. A "Cuda failure 1 'invalid argument'" error occurs during code execution.
- Need for Sharding Strategy for `aten.amax.default`: The need to register a sharding strategy for the `aten.amax.default` operator in DTensor is highlighted to address errors encountered with float8 rowwise scaling in both eager mode and vanilla TP. This is identified during the debugging of a related problem in the torchtitan project.
- Discrepancy in TransformerImpl Class Parameters: A discrepancy between the C++ libtorch and Python PyTorch implementations is highlighted, specifically noting that the C++ version's TransformerImpl class lacks certain parameters such as `layer_norm_eps`, `batch_first`, and `norm_first`. Guidance is sought on how to pass these parameters in C++.
- Bug in `OffsetBasedRNGTracker` Instantiation: A bug in the PyTorch project is described where the `OffsetBasedRNGTracker` is always instantiated with a CUDA backend. This causes problems when attempting to use other backends, such as HPU, due to the lack of support for non-CUDA devices.
- Illegal Memory Access in `ScaledDotProductEfficientAttentionBackward0`: A bug in the PyTorch library is described where an error occurs in the `ScaledDotProductEfficientAttentionBackward0` function when the input sequence length exceeds 46344 and an attention mask is applied. This results in a CUDA illegal memory access error.
- RuntimeError in Faster R-CNN with Deterministic Algorithms: A `RuntimeError` is encountered when running Faster R-CNN with PyTorch's deterministic algorithms enabled. This is due to the lack of a deterministic implementation for the `roi_align_backward_kernel`, despite setting all known deterministic flags and environment variables.
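A minimal sketch of the deterministic flags referred to above; `warn_only=True` is one way to surface operators (such as the roi_align backward) that lack a deterministic implementation without aborting:

```python
import torch

torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```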
- Performance Regression with Tensor Parallelism: A performance regression is highlighted where the model "meta-llama/Llama-3.1-8B-Instruct" exhibits worse latency when using tensor parallelism (TP) on a CPU setup with Intel's 4th Gen Xeon processors. This is compared to running without TP, and specific pull requests for the CCL and transformers libraries might be needed to address the problem.
- Bug in `WeakRefVariable` with `call_function`: A bug in the PyTorch project is highlighted where the `WeakRefVariable` does not utilize the most updated Python referent when `call_function` is executed. This leads to discrepancies between compiled and eager execution outputs, and it is suggested that the original Python referent should be checked each time `WeakRefVariable.call_function` is called to ensure correct behavior.
- Error in ONNX Export with `aten::_make_per_tensor_quantized_tensor`: An error is encountered when attempting to export a PyTorch model to ONNX opset version 11 using `torch.onnx.export`. This is due to the unsupported operator `aten::_make_per_tensor_quantized_tensor`, and despite attempts to resolve it by using different opset versions and custom operations, the user continues to face a "RuntimeError: ArrayRef: invalid index Index = 11; Length = 11" error.
- RuntimeError in PyTorch Profiler: A `RuntimeError` is encountered when using the PyTorch profiler in a loop to export chrome traces. This intermittently fails with an internal assertion error related to an empty Python replay stack, suggesting a potential bug in PyTorch's profiler implementation.
- Docstring Mistake in `replace_pattern` Function: A minor mistake in the docstring of the `replace_pattern` function in `torch/fx/subgraph_rewriter.py` is highlighted, where an unnecessary `sum()` operation is included in the pattern definition. This does not align with the intended functionality as demonstrated by the generated code.
- RuntimeError in `torch.nn.AvgPool2d` on CUDA: A bug in the PyTorch library is described where the `torch.nn.AvgPool2d` function fails with a "RuntimeError: integer out of range" when executed on a CUDA device with a stride of 2^31 or larger. It works correctly on a CPU.
- Discrepancy in GAT Model Output in ONNX: A significant discrepancy in the output of a Graph Attention Network (GAT) model is reported when converted from PyTorch to ONNX format. The differences in results are unexpectedly large depending on the input data, despite the expectation of only minor variations.
- Bug in `torch.export.export` with Batch Normalization: A bug in the `torch.export.export` function is reported, where attempting to export a convolutional neural network with a batch normalization layer on a GPU results in guard conditions that prevent successful exporting. This is due to constraints on the batch size that are not satisfied.
- Absence of C-Shim for `aten.grid_sampler_3d.default`: The absence of a c-shim implementation for `aten.grid_sampler_3d.default` in the PyTorch project is highlighted, resulting in the use of a proxy executor as a fallback. This may introduce some overhead, and the issue suggests adding the necessary c-shim by following a specific prior pull request.
- Runtime Error with ROCm/HIP Backend on AMD Radeon RX 7600 XT: A runtime error is encountered when attempting to perform GPU compute tasks using PyTorch with the ROCm/HIP backend on an AMD Radeon RX 7600 XT. The error "HIP error: invalid device function" occurs during the first attempt to allocate a tensor on the GPU, despite the GPU being detected.
- FIPS Compliance in Python 3.9+: Enforcing full FIPS compliance in Python 3.9+ is proposed by using ruff rule S324 to ensure that `hashlib` functions are not used for cryptographic applications. This requires adding `usedforsecurity=False` to all `hashlib` calls in the codebase and updating the documentation accordingly.
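A minimal sketch of the `usedforsecurity` flag mentioned above (available since Python 3.9); the hashed payload is a placeholder:

```python
import hashlib

# Marking the digest as non-cryptographic keeps it usable under FIPS mode
# and satisfies ruff rule S324.
digest = hashlib.md5(b"cache-key-contents", usedforsecurity=False).hexdigest()
```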
- Optimization of `torch.sort` Function: Optimizing the `torch.sort` function in PyTorch is proposed to significantly reduce GPU memory usage by allowing the indices to have a dynamic data type instead of the fixed 64-bit `torch.long`. This can be particularly beneficial for large datasets, as demonstrated by a reduction in peak and final GPU memory usage when using a boolean matrix example.
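A minimal sketch of the current behavior the proposal targets: the returned indices are always 64-bit, regardless of how small an index type would suffice; the input here is illustrative:

```python
import torch

x = torch.randint(0, 2, (1024, 1024), dtype=torch.bool)
values, indices = torch.sort(x.int(), dim=1)

print(indices.dtype)           # torch.int64 today
print(indices.element_size())  # 8 bytes per index
```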
- Documentation Discrepancy in `torch.distributed.init_process_group`: A discrepancy in the PyTorch documentation regarding the default behavior of the `torch.distributed.init_process_group` function is highlighted. The documentation inaccurately states that both `gloo` and `nccl` backends are created when no backend is specified, whereas in versions 2.6 and 2.7/main, only the `nccl` backend is created.
- Enhancements to `torch.compile` Programming Model Documentation: Enhancements to the `torch.compile` programming model documentation are requested, specifically including a debug_trace API for `torch._dynamo`, a more readable string output for `gm` in Jupyter notebooks, and a descriptive `make_fx` function with a `tracing_mode` set to Fake.
- Marking Build Job as Unstable: A specific build job is marked as unstable due to potential flakiness, as part of an experiment related to a pull request in the PyTorch project. Further context is provided in the linked pull request.
- Invalid Representation String of Meta Tensor: The representation string of a meta tensor in PyTorch is not a valid `tensor` call due to the use of an unexpected keyword argument 'size'. It is suggested that the representation should be modified to be an executable code snippet, similar to how concrete tensors are represented.
- Shape Function for Einsum Operation in PyTorch XLA: Adding a shape function for the einsum operation to the PyTorch XLA project is proposed to facilitate full code generation. This is currently hindered by the absence of this function in the shape inference header file.
- Enhancing GPUDirect Storage User Experience: Enhancing the user experience of GPUDirect Storage is proposed by integrating support for commonly used APIs like `torch.save`, `torch.load`, `dcp.save`, and `dcp.load`. This enables faster model checkpoint saving/loading and efficient use of GDS-compatible storage solutions, thereby avoiding CPU bottlenecks.
- Unexpected Error in `torch.utils.collect_env` on elementaryOS: A user experiences an unexpected error when running the command `python3 -m torch.utils.collect_env` to verify their PyTorch installation on elementaryOS 7.1/Ubuntu 22.04.5. This results in an AttributeError due to a `NoneType` object, despite following the installation guide and using `sudo` for package installation.
- Failure in Building `torch_cuda.dll` on Windows: A failure in building `torch_cuda.dll` is reported due to an unresolved external symbol error when linking `_cudnn_attention_forward`. This specifically affects Windows builds for the `wheel` and `libtorch` cases and requires attention from NVDA developers as it is potentially related to a previous pull request.
- UserError in Model Export and Compile: A user encounters a `torch._dynamo.exc.UserError` while attempting to export and compile a model from a GitHub repository. This is due to a data-dependent expression `Eq(256*u0, 256)` that could potentially be resolved by using `guard_size_oblivious`, as suggested by the error message.
- Adding `_capture_strategy` Field to ONNX Program: Adding a `_capture_strategy` field to the ONNX program is proposed to document the strategy used during its creation. This will help in identifying regressions when fallback strategies are activated.
- CUDA Out-of-Memory Error in `distributed_c10d.broadcast`: A CUDA out-of-memory error occurs when broadcasting a `torch.tensor(True)` using the `distributed_c10d.broadcast` function with two GPUs. This is potentially linked to the recent addition of CUDA 12.8 support in PyTorch's nightly build.
- IndexError in `_extract_arch_version` Function: A bug in the PyTorch library is described where the `_extract_arch_version` function in `torch/cuda/__init__.py` fails to correctly parse architecture strings for AMD GPUs, such as the Radeon RX 7700S. This is because these strings do not contain an underscore ('_'), leading to an `IndexError`.
- Potential Bug in `fx_passes/binary_folding.py`: A potential bug in the `fx_passes/binary_folding.py` file of the PyTorch project is highlighted, where the indexing for checking a convolution's bias appears to be incorrect. It uses `conv_node.args[1]` instead of the correct `conv_node.args[2]`, as indicated by a comparison with similar code in `efficient_conv_bn_eval.py`.
- Request for PyTorch Version Compatible with Blackwell RTX 5080: A request is made for the development and release of a new version of PyTorch that is compatible with the Blackwell RTX 5080 graphics card and CUDA 12.8.
- High Resource Usage by CPUExec in PyTorch Profiling: A user questions why the CPUExec component accounts for a high percentage of resource usage in their PyTorch profiling, despite having few `.device("cuda")` operations in their code. Insights are sought from specific contributors.
- Documentation Error in `register_forward_hook` Method: A documentation error in PyTorch's `register_forward_hook` method is highlighted, where the text incorrectly references a non-existent `torch.nn.modules.Module` instead of the correct `torch.nn.Module`.
- Inference Failure in Custom DETR Model: A problem with a custom implementation of the DETR model using a ResNet50 backbone is described, where the model fails to produce any detections during inference when the batch size is set to 1. This occurs despite working correctly with larger batch sizes, potentially due to issues related to batch normalization or the small size of the dataset used for fine-tuning.
- Runtime Error in `flex_attention` Function: A bug in the `flex_attention` function within PyTorch is described, where the compiled code incorrectly assumes an output tensor shape. This leads to a runtime error due to a mismatch between the expected and actual tensor sizes, particularly when the dimensions of the query/key and value tensors are confused.
- Error in ONNX Conversion with `nn.AdaptiveAvgPool2d`: An error is encountered during the conversion of a trained model using `nn.AdaptiveAvgPool2d` to ONNX format, specifically when the input size to `nn.AdaptiveAvgPool2d` is variable. Guidance is sought on resolving this problem.
- Segmentation Fault in `torch.ops.profiler._call_end_callbacks_on_jit_fut`: A segmentation fault occurs in the PyTorch function `torch.ops.profiler._call_end_callbacks_on_jit_fut` when a tuple containing a `None` value is passed as an argument. This specifically highlights a bug in version 2.6.0+cu124.
- Lack of Sharding Strategy for `aten.select.int`: A problem in the PyTorch project is highlighted where the operator `aten.select.int` lacks a registered sharding strategy. This causes a `NotImplementedError` during distributed tensor operations, and it is suggested that the DTensor module needs to address this by adding the necessary operation support incrementally.
- Feature Request for Stream Management API in NCCL: A feature request is made for a stream management API in PyTorch's NCCL process groups to address asynchronous communication challenges. This specifically addresses the "read-before-write" issue that arises when collective operations are executed out of order due to each NCCL process group operating on its own dedicated stream.
- GradScaler Issue on Intel Arc GPUs: A problem is reported where the PyTorch GradScaler does not function correctly on Intel Arc GPUs when attempting to train with mixed precision. It either produces a warning about CUDA not being available or throws a runtime error related to unsupported fp64 aspect when the "xpu" device type is specified.
- AssertionError in `register_sharding` with Keyword Arguments: An `AssertionError` is encountered when using `register_sharding` for a custom operation with keyword arguments in PyTorch. This is due to a mismatch between the number of input specifications and input argument strategies, as the function `unwrap_to_op_info` handles arguments and keyword arguments separately.
- Segmentation Fault in Triton Upstream on ROCm: A segmentation fault occurs in the cpp_wrapper component of the Triton upstream within the Inductor project on ROCm. This specifically happens when running a unit test related to dtype view conversion from float32 to bfloat16 on CUDA.
- Accuracy Problems with Cooperative Reduction Functions on MI200: Accuracy problems with cooperative reduction functions on the MI200 platform are reported when using ROCm. This is evidenced by multiple test failures in the PyTorch project, where tensor-like objects are not sufficiently close in value, exceeding the allowed differences in both absolute and relative terms.
- Accuracy Problems in `quantile` Operation on ROCm: Accuracy problems in the unit tests for the `quantile` operation on ROCm are reported when attempting to update Triton in preparation for version 3.3. This is evidenced by multiple test failures in the `TestInductorOpInfoCUDA` suite, where tensor-like objects are not sufficiently close in their values.
- Unit Test Failures in Triton Update for Version 3.3: Unit test failures are encountered in the PyTorch project when attempting to update Triton for version 3.3. This is specifically related to a "Cannot bitcast data-type of size" error occurring during the execution of a CUDA boolean sort test.
- AttributeError in `retinanet_resnet50_fpn()` Model Export: An error is encountered when using the `torchvision.models.detection.retinanet_resnet50_fpn()` model, where the user experiences an `AttributeError` due to a `Tensor` object not having an `items` attribute during the model export process with `torch.jit.trace`.
- Identical Values in `torch.randn_like()` on MPS: A bug in the PyTorch library is described where the `torch.randn_like()` function, when used with the MPS (Metal Performance Shaders) device, produces tensors with identical values along a given dimension once the tensor's dimensionality exceeds a certain size. This behavior is not observed on the CPU.
- ResourceWarning in `torch.distributed.nn.jit.instantiator`: A warning is generated by the `tempfile` module due to an uncleaned temporary directory created in the `torch.distributed.nn.jit.instantiator` module. This occurs when `torch_tensorrt` is imported, leading to a `ResourceWarning` about implicitly cleaning up a temporary directory upon program exit.
- Transition to Public ECR Images for Docker Builds: Transitioning the project's Docker builds to utilize public Amazon Elastic Container Registry (ECR) images instead of Docker Hub is proposed. This is motivated by Docker Hub's impending rate limit changes and the potential for more reliable and faster image pulls within AWS.
- Exposure of NCCL API `ncclGroupSimulateEnd`: Exposing the NCCL API `ncclGroupSimulateEnd` at the Python level in PyTorch is proposed to enable users to perform runtime estimation of communication operations.
- LoweringException in `flex_attention` with `torch.compile`: A failure in the PyTorch library is reported where attempting to compile the `flex_attention` function with dynamic settings using `torch.compile` results in a `LoweringException`. This is due to a `TypeError` that prevents determining the truth value of a relational expression.
- Compilation Error with `Dropout` in `SequenceParallel`: A compilation error occurs when attempting to compile a PyTorch model using `Dropout` parallelized with `SequenceParallel`. This results in a runtime error related to tensor conversion, despite documentation suggesting support for `Dropout` in `SequenceParallel`.
- Memory Allocator Lock Contention in Inductor-CPU: The problem of memory allocator lock contention in templated GEMMs within the Inductor-CPU project is addressed. Threads compete for memory allocator locks during the creation of per-thread local accumulation buffers in an OpenMP parallel region, leading to significant performance impacts.
- Disabled Test on ROCm Platform: A disabled test, "test_custom_hook_custom_stream" from the PyTorch project, is failing on the main branch specifically on ROCm platforms due to a "HIP error: invalid device ordinal." It requires attention from several developers and contributors to address the device ordinal issue.
- Disabled Test in `TestHSDPWithCustomHook` on ROCm: A disabled test named 'test_custom_hsdp_all_reduce_hook' within the `TestHSDPWithCustomHook` suite on the ROCm platform is failing on the main branch of the PyTorch project. This involves several contributors and stakeholders for resolution.
- BackendCompilerFailed Error in `torch._check` Function: A bug in PyTorch is described where the `torch._check` function fails when used with `.item()` followed by a `select` operation. This results in a `BackendCompilerFailed` error due to a data-dependent expression that cannot be guarded.
- Bug in `SETUP_WITH` Implementation in Dynamo: A bug in the `SETUP_WITH` implementation within the Dynamo component of the PyTorch project is highlighted. The current order of operations deviates from the CPython documentation by pushing `__exit__()` onto the stack after creating the block stack, leading to a crash when a graph break occurs.
- Bug in `torch.compiler.allow_in_graph` Decorator: A bug in the PyTorch project is highlighted where decorators like `torch.compiler.allow_in_graph` do not properly handle the reuse of function identifiers. This leads to unexpected behavior when a function is deleted and another function is defined with the same name.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 66
Summarized Issues:
- Test Failures on ROCm Platform: The PyTorch project has encountered multiple test failures on the ROCm platform, leading to the disabling of tests such as `test_attention_vs_linear` and `test_tracker_multi_group_eager`. These failures were linked to changes in the main branch and specific pull requests, prompting discussions on restoring stability and considering reversion if necessary.
- Performance and Compilation Issues: Several issues in PyTorch relate to performance discrepancies and compilation problems, such as slower execution with `torch.compile()` compared to decorators, and excessive compilation times with `reduce-overhead` mode. These issues highlight the need for optimization and better handling of specific data types and operations.
- Bugs in PyTorch Functions: Various bugs have been reported in PyTorch functions, including issues with `torch.func.vmap`, `torch.onnx.dynamo_export()`, and `torch.cholesky_solve`. These bugs often result in errors or unexpected behavior, necessitating fixes and updates to ensure correct functionality.
- Serialization and Export Challenges: The PyTorch project faces challenges with serialization and export processes, such as errors with nested classes and difficulties exporting models with dynamic shapes. These issues require workarounds and improvements to support various use cases and configurations.
- Test Disabling Due to Failures: Multiple tests in the PyTorch project have been disabled due to consistent failures on the main branch, such as `test_real_imag_view_lazy_complex128` and `test_flatten_nonview_xla`. These failures are documented with examples and linked resources for further investigation.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 181
Key Open Pull Requests
1. [test] 2: This pull request, titled "[test] 2," aims to address and fix an unspecified issue in the PyTorch project, as indicated by the placeholder "#ISSUE_NUMBER." It includes a series of 16 commits, each labeled with the commit message "tc," which suggests a focus on testing or test-related changes, and it has not yet been merged.
- URL: pull/147470
- Merged: No
- Associated Commits: 03e29ef5bda553434e588ee1f041ba8a71031e5e, 176793c7127a2d6502c9834abf2bb21ffe3f638e, 7b5b15cf410a1a364379bd4b8886d22ecec39a77, bd6cdf29c6c6b19e3bdaaf8cf9c5e87832db83be, 1e50cb13d90a2242db56663a8cebaaaeb2859c40, a324ca8d10153dec91e8419a4ecca844bd2e7c39, 2de748a5fcecfeda346d9e7c6269f732e92218d3, 0a2e06b882a35ac6aa3c23f59ce6a7edf560a93f, ed111e570352372ac450c4e4009ca89e08ce981a, bcad8d972d94be48e570da0202be1c2ef654cb37, 29c02d0016c792f6699a88e6e15cf1df4a0dda2f, 32dc214adb0ad68052f7447402211a3f0d6f350a, 4e88f9938d14a2262f3084c70ded2153e50f4033, 15c409c5dd9a73303bc2ed6e6fccd4cbf2814027, 573b2b535cc3f230e0a2fb7aa09cb950df28f598, b8b9756fa1e4e58af566e2ad69bf69650e660616
2. cpp_wrapper: reduce memory usage by removing unneeded temporaries: This pull request aims to reduce memory usage in the `cpp_wrapper` of a GitHub project by refactoring `reinterpret_view` calls to return temporary RAII tensor objects, thereby making the function's callers responsible for saving the handle when necessary, and eliminating unnecessary temporary tensor handles to align memory usage with the default inductor mode.
- URL: pull/147403
- Merged: No
- Associated Commits: 01424669b96688c50c271cdca6e8f52a8185bf1d, 67582f8e3028a688d489ab644155862868468250, eb4f8d8ffe626fde86a511f865f60267ee6d33d8, 20c1ad66b21f5ccac2d16853d9e0b953af060d3b, a6f57b847cc9624cac7ffa3e81322aafb31e7eaa, 1c1b4c9270563664983046631b1165d8be8c5b7c, aae0fd8856f2c67dd3ebad7824189c74aae7f15d, 3806a189aa7f4e2124616a7126fa3fa7cf973388, ebf67aeef582fb9e259a9f022f399e9764b71c80, 91ceb7cf1363b3c4590117f5aacb7467ed1b504c, 4d5edaf67e80ca9ca36d301af1ded13967a04790, 35f9d714fc57cfda513c69b1c79c7ee47c5fda4e, b6bf52e56404744de5abd1a2dcfc95271e655835, 7deb4da44f7a2f8b16de60bb680bb65d7062806e, 4ebac8719de768e93930aae722ee10d0ab99bb9e
3. Make Tensor.set_ validate storage_offset when sizes/strides are unchanged: This pull request aims to enhance the `Tensor.set_` function in the PyTorch library by adding validation for the `storage_offset` parameter when the sizes and strides of the tensor remain unchanged, as part of a series of related updates tracked through the ghstack tool.
- URL: pull/147354
- Merged: No
- Associated Commits: d55833c1f1f9078c7033b7bf966a7a8e04c1a4c0, 9677c958516940094f547641847194926baaeb64, 56c4c38b90973f59f428e67e42b5442ca602cf53, 2db7ebf942472df2e40757fd08368aa4db594be4, 649ddb82ef37da61f9ed0ac8323606f03e549915, 439a693f1a5e89b2a37bcf015d49602f1d24f430, 5eb7fb4b69c1453028421355083fcfc6fa6f0c96, 4d08c805b890c3c3125ab2b07b7eb75cbf881328, 6be5a4b237feba23c6efd8be06d43028b7401beb
Other Open Pull Requests
- C++ Standards Compliance on Windows: This topic involves enforcing C++ standards compliance in the build process on Windows by adding the `/permissive-` flag. This change resolves issues like assigning string literals to non-const pointers and aligns the project with Visual Studio's default settings, improving code quality and fixing related compiler errors.
- Input Validation in PyTorch: This topic focuses on enhancing the robustness of the PyTorch codebase by validating inputs to prevent potential buffer overflow issues. The pull requests address validation in functions like `_nested_view_from_buffer`, ensuring safer code execution.
- Error Message Improvements: This topic covers efforts to improve the readability of error messages in PyTorch by auditing and updating "unimplemented" sites. The changes ensure that messages are clear and understandable, enhancing user experience.
- ONNX Module Enhancements: This topic introduces new strategies and features to the ONNX module in PyTorch, such as the "draft_export" strategy and direct utilization of ONNX operations. These changes aim to improve tensor specialization and integration with the existing ecosystem.
- CUDA and ROCm Support Enhancements: This topic includes introducing blockwise MXFP8 support for CUDA devices and enhancements for ROCm MX-FP8 matrix multiplications. These changes improve matrix multiplication efficiency and validation for specific data types.
- Runtime and SACEstimator Modifications: This topic involves modifications to the RuntimeEstimator and SACEstimator in PyTorch, addressing issues and including tests for fake utilities. The changes also fix default arguments, bindings, and resolve linting issues.
- Masked Fill Implementation: This topic covers the implementation of the `masked_fill_scalar` function as a shader, moving existing functions into a new header, and introducing `StridedTensor` and `ConstStridedTensor`. These changes facilitate the implementation of `masked_fill`, addressing a specific issue.
- Dynamo Component Enhancements: This topic introduces generic graph break hints and error message improvements to the Dynamo component. The changes include multiple updates and contributions from various collaborators.
- ROCm and XPU Enhancements: This topic includes updates to the `ck_conv_template` code generation for ROCm CK kernels and improvements to the XPU oneDNN context manager API. These changes enhance flexibility, maintainability, and usability.
- Inductor Component Enhancements: This topic covers improvements to the Inductor component, including handling mismatched outputs and optimizing heuristics for outer loop fusion. These changes enhance performance and compatibility with various operations.
- Overflow and Buffer Issues: This topic addresses overflow issues in various functions, such as `checkInBoundsForStorage` and tensor slice calculations. The changes include implementing fixes to prevent crashes and incorrect tensor returns.
- Experimental Features and Tests: This topic introduces experimental features like delayed compilation and new tests for components like CacheBench. These changes involve multiple updates and collaboration among contributors.
- Tensor and Data Type Handling: This topic addresses issues with tensor and data type handling, such as converting non-standard boolean values and handling mismatched outputs. The changes ensure correct operations and improve compatibility.
- Export and Serialization Enhancements: This topic focuses on improving the export process by eliminating unbacked renamings and introducing new passes for recomputing bindings. These changes enhance compatibility with de/serialization.
- CUDA Graph Partitioning: This topic involves implementing a CUDA graph partition feature, building upon previous work related to inductor graph partitioning. The changes include recording mappings and handling metadata and input index mutations.
- MKLDNN and oneDNN Enhancements: This topic includes migrating from oneDNN Inner Product to MatMul and introducing an `is_available` API for `torch.backends.mkldnn`. These changes improve functionality and allow users to check backend availability.
- Sparse Tensor Validation: This topic addresses the validation of sparse tensors constructed via a legacy constructor, highlighting issues like size inconsistency and storage size calculation overflow. The changes refine the solution for these issues.
- FSDP and FlexAttention Enhancements: This topic includes enabling FSDP tests on XPU devices and addressing error messaging in the FlexAttention module. The changes improve testing and guide users experimenting with small tensors.
- NCCL and TCPStore Enhancements: This topic aims to enhance the NCCL communication library to support uint64 tensor types and improve error handling in TCPStore components. The changes address gaps and improve error message specificity.
- Build Process and Compiler Updates: This topic involves updating the build process for XPU and enabling AddressSanitizer support for CUDA. The changes improve compatibility and collaboration among contributors.
- Documentation and Code Refactoring: This topic covers documentation updates and code refactoring efforts, such as correcting docstrings and renaming options for clarity. The changes enhance readability and maintainability.
- Testing and Continuous Integration: This topic focuses on testing the continuous integration process and addressing issues with test scripts. The changes ensure compatibility with new versions and improve the debugging process.
- Attention Mechanism and Quantization: This topic addresses issues with the attention mechanism for tensors with more than four dimensions and introduces a total quantization target for the P1 INT16 model. The changes ensure proper functionality and include a test plan.
- Error Handling and Logging Enhancements: This topic involves enhancing error handling in various components and introducing context managers for logging. The changes improve error message clarity and logging capabilities.
- Memory and Performance Optimizations: This topic includes optimizations for memory-efficient attention in ROCm and performance improvements for integer matrix multiplication on macOS. The changes enhance performance and provide a performance comparison.
- Type Annotations and Code Introspection: This topic focuses on enhancing type annotations for dynamo methods and refactoring function signatures to improve type safety and code introspection. The changes involve collaboration among contributors.
- Cache and Compilation Enhancements: This topic introduces a caching mechanism for save plans and optimizes the integer matrix multiplication kernel for Metal Performance Shaders. The changes reduce computational costs and improve performance.
- Bug Fixes and Issue Resolutions: This topic addresses various bug fixes and issue resolutions, such as fixing the torch.polygamma() function and correcting the RNN example code. The changes ensure consistency and correct functionality; a short polygamma illustration appears after this list.
- Experimental and Test Submissions: This topic includes test submissions and experimental attempts, such as implementing end-to-end control plane flex_attention. The changes involve collaboration and are open for review.
- Backend and Device Support Enhancements: This topic covers enhancements to backend and device support, such as enabling SDPA on the XPU backend and updating merge rules for oneDNN. The changes improve compatibility and functionality.
- Kernel and Operation Enhancements: This topic involves implementing a Metal kernel for MPS binary operations and enhancing the torch.compile function. The changes improve performance and ensure correct operation handling; a minimal torch.compile sketch appears after this list.
- Error Message and Logging Improvements: This topic focuses on improving error messages and logging capabilities, such as updating error messages related to missing build systems and enhancing logging for AOTI. The changes improve user guidance and debugging.
- Optimization and Performance Improvements: This topic includes optimizations for block radix sort and the matmul_small_brute_force_tunableop unit test. The changes enhance performance and reduce execution time.
- Type and Data Handling Enhancements: This topic addresses enhancements in type and data handling, such as introducing input vectorization in elementwise kernels and supporting unique user kernel names. The changes improve control over naming and data processing.
- Testing and Validation Enhancements: This topic focuses on enhancing testing and validation, such as introducing a new test for the CacheBench component and ensuring the accuracy of the layernorm CUDA backwards pass. The changes improve test coverage and accuracy.
- Build and Compilation Process Enhancements: This topic involves enhancing the build and compilation process, such as enabling the Triton XPU build process on Windows and updating the pybind11 submodule. The changes improve compatibility and collaboration.
- Error Handling and Bug Fixes: This topic addresses error handling and bug fixes, such as fixing the inductor/test_kernel_benchmark.py script and correcting the detection logic for clang++. The changes ensure correct functionality and prevent unexpected failures.
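As referenced in the MKLDNN and oneDNN item above, the summary mentions an is_available API for torch.backends.mkldnn. The following is a minimal, hedged sketch of how such an availability check could be used, assuming the helper is exposed as torch.backends.mkldnn.is_available() as in recent PyTorch builds.

```python
import torch

# Hedged sketch: assumes torch.backends.mkldnn.is_available() is exposed,
# as described in the pull request summary above.
if torch.backends.mkldnn.is_available():
    print("oneDNN (MKLDNN) backend is available on this build")
else:
    print("oneDNN (MKLDNN) backend is not available")
```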
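Below is a small, hedged illustration of the torch.polygamma() function named in the bug-fix item above; it only shows the public API (the n-th derivative of the digamma function, evaluated elementwise), not the specific fix.

```python
import torch

# torch.polygamma(n, input) evaluates the n-th derivative of the digamma
# function elementwise; n=1 gives the trigamma function.
x = torch.tensor([1.0, 2.0, 4.0])
print(torch.polygamma(1, x))  # trigamma(1) is pi**2 / 6, about 1.6449
```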
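The kernel and operation item above also mentions enhancements to torch.compile; the following minimal sketch illustrates only the entry point itself, not the specific changes in those pull requests.

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

# torch.compile wraps a function (or nn.Module) and JIT-compiles it on
# first call; subsequent calls reuse the compiled artifact.
compiled_f = torch.compile(f)
print(compiled_f(torch.randn(8)))
```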
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 264
Key Closed Pull Requests
1. Fix SEGFAULT when None arg was passed in GraphContext.op(..): This pull request addresses a segmentation fault (SEGFAULT) in the PyTorch project by fixing a bug in the GraphContext.op(..) function that occurred when a None argument was passed, as tracked in issue #145261; a hypothetical sketch of the triggering call pattern appears after the key pull requests below.
- URL: pull/145265
- Merged: No
- Associated Commits: 74c591dd9ed707bd50bca86a5883810102123c51, 1ebec42a552bb117f105facdfff345bb7e6948ba, 529cff0ec7d089468625f232541cbbf6b6113670, 28b7c5fef09cc201f56009c91b856728be4f3ebf, 5548df9541fe76be615c380a18329bc5c2bc3f56, 5c678ea5f303db95c9c59a4a36c8d5cf553d57cd, d8fd5404f24bcbaaaa25fc9eda9bc2eb4966a079, c0e93e46529b4b512b4a9b57ec7266943922da28, d3bc9a0c70ba3c3e263650b00767d4bb3a7e1082, ff743d42efc7bdb01505cea3e6c173e2f4766def, f695f624b2d152c8ed2eb7b8245b5b1d4f329caa, 72d566f36498d70da1e26bae95a56afde888def3, 3056caacf040478428cc123bc3ec289be5dbb9b4, bb8dd29aa831edfb12789e894b9e6f9a9bdc24bb, 36ae184c4464bb21da3dce6e8f87eca18f83ae4b, cd8a42d4cf5ae076a082984b60e4f4c96e2a3fe6, 999d7e3d0cd803c331ac88b05806e67d701d53ed, 891451c9fbb25d437dff680ceacf7018045300ac, fa7cc162055827005a9a5c64c2dff9abdabb3113, d2b0ed86820f0353b34f1f310c8f1a49236e26b1, f843ff7b97a2fabc1f31cb5a2888b18d0577cb62, 260e6700d3d0d80ac77ff4eba92e49e47b1320c3, 8330e02a9b60089bb58c79bfaf6b72fc01c38009, 7b85b7cbf3d95e3328f2a4961add686b0d6139d8, cb52b7731104a9981194bb6aae710f371e5283ac, 55531118fafcc55bc17a6ece7a555e7a7341a4dd, 959bdcfb09612182b52491ca6308d0654ea45668, fdd0570f38185bd6d939b168cbe6d555032c5065, 16bec733680f1e08607b3df814e6f6816e02dddc, 52681ab258a891d30a7a8d4828bf9e6c96a7264c, 1944ac6fbb9e89ae3b6040fe456b2e8125b48e6f, ff5039cb474df01308164c4e521494d5a7b6c43b, 38b54a0d626175203bbd5fd55a647a93c800f631, b88c6014da2a3d4ccbd8cdb6781f98ef5b209464, d8d27556f28eb30b44fe75b9661e417c56539328, cd367141d76f990f7d33ab31d1ffc8cfa21b11c0, 16a76b1780e1e4b59deda850115c3b3432d6d3fa, a4bf85e2961e98f72fb611881c9dbe225d7f5059, a2cc4d869a335c1dcb0dff3750a7945360635c6e, 276b234a888d246285b3333bb498e8beb38b59eb, 107be69f0c19c6388ff30f5ea59b14a53cbfb691, 05d80e717f1b5916a4402820801330bd3c3af3bd, 1824799048358a6a90ee92c30dbcdf36ae5ef8a2, fab8ff9f6f77b7d019077ac370304e21925bfc56, e1111f25e746eb1a3d180a5de84bfcd38ceb9c2c, 2b999e971b6576920eb97d0c9f8c13e08d1a4371, 3e0f847df1bdbaf9aa7b640e9b2cf2834b7a6827, 2b12011385e6ebfe8a7b64abb2b0ee14171f6c9f, c30f665aca710019c958c66bb5f3e6f371b954d0, 2898b5c46586d580f19fc9d8a95a973f09e28d84, a48fe7077378fe63248d1ea0edca91da0bf49a32, 8cb85af956a639df169f5012f14e4b908cbe1caf, 7ad8747f801372f1d60d29604466508217fd5ab1, 406809642a302bf5bd31bc0c9e74ffe17d79399a, 6d4df2d0f724bf999e1d5f712f9b8a7b9f8c3c3d, e6b00b0a180c78674b0b0b74ba0652978663d423, 3377fa1a9bab2db1da2fb678d09cbf0a834f51ea, 3d2cac4d58dc4cbb2a80ccbccbccef2ec651cab9, 025c4feb83bce11e47fe58862ae1eb665fb05276, 4f554b2b7b57070d89b0955b9cde0ffd7a023563, 8256619ef0fde425174d236ec3e6fcba40834898, f4e693aacf8d8c0ad10dc5ae2ba8671f586c981f, 6e7d1887f289506f9e5ec6d92a2096da817843eb, 9cf68285794325e6bd15ffc2001132ea0c59794e, 41fdc2bc77c8b9a0dad60e9a7705b8946f540c86, 847271bc0b0e46e3b858fd022ba7209228d9b33c, 9c315270b149953864b9c09d9e0eba33312a3e80, b6188e3a0e2b32a2c1719ded0c412167e036289a, 936278cd04e6cedef3ede278a06acc8ffb87718b, 38bf3f0af679ced7de5fb8e1bca636a7c76e92cb, 8344307caeb9e25394e9c8736a3b986abba2dbd4, 121b1b00edb3a29267dbd5e8dcaeafd424619a9b, b798a2ea459de2b980f11c1d9ab5768d0808850a, 90e4907bdbb8846c93d9039884d03e97d5574a43, fbb0279af8f8a505ec4e4cf63870c3c342f56061, faf69ee2e2ebadaa40783da4756dd4ed6ff657fd, c454463e2040b513de188a1d66116c7e19e414c2, 81badd5e5ba00405b43e212e12fa0e6820b5b33d, 2fb90ebff7914ecb652f895294457745943e09a2, a853f8b28cb9bb0716305711e1c3c940dde957ee, 1c063cc16c83c62176991f35bac332b77c8979d5, d6f20e458bb9f2c94f7935a97f0c1b93c1ee4c00, da50264a0d6515346c6224fed966015f38fc93a1, a8ce35c5e0e10899df34d41381c9061d2b4c61a3, 
acda04aedc89e43569ebc62f84a62f4dca380014, ee3b47d21a67307dccdf0cb03a266653d481b89f, f8a467c7d513dbb510db316f3cf47d3b884e9c3b, 9a94cfdf35442e6220699b24702b2326b2e4a3b0, 33161d25ab9fd449e3706a4e62aa319d4fa2d44b, 7a30a3b6e5d24db3db0b1a4bb89f273c808114c3, 5350895d3f50550da81d60fc3fa89da265f297e9, 91ddd1bb5ea9be00f59e0d61e50312c946a7d6dd, ac0a78b519da01b75d0a3361c7bc27da0bf92772, 295a452aeee601034df73712b952e4ad65c6ae59, f11ef9186a308624bfde672dd6a6163cbd23fdd6, ed521c354917de722557bad9611a16b518363701, 6c1d9e1a3637366d158851df7c455cab4767e9d3, bbf5f91c7e46b61a8f936ec0d4debdbebf4c1ddb, faa7cdb27ca884d93226a01919907566eb44566d, 58f635b24c8796ce686a8543e4a31039d0feb471, 891b0c2b1aced3792626e087e500182e34fa57d0, 73d3757ca1b5390999f9955f41979fac41240986, 65098d4829234e4251201d545fdb9da20e7c6c3b, c232fb84d86a9a70b81aba18d6cd5c2aaaa72348, 635618997bbdf6811b51afc206dfb55045cbba70, a8618dbf703347891dbde688450500233b622a68, bfa9fbde64d6694b39a29b934ed75ac2a4bdf7ec, 4fd187649b1d9650cd8015573ad3d85b773f4d70, 94b71380f3bb1f571dd977374cfe46f701c09ae4, 500ecd6f077ace9d61545cae4fa6957b9e422ea1, 9badff27465045d2c09b855f9a5b0ede7066b814, 6c31a9fa4fc7c2293310d7dc39783599ec016981, d134e13b23a9af242ade33ecfb9d742acd301006, 0fd7bb67eda25404f80a60840f1d8bddba805433, e2b74da8ef531613d877e0b830627d5e15becbad, 00117e01cd5424fe61330b8d34efea0b2c800d2b, 2b20757653a6a28005152fca012a6fab45eed3cf, 5588711859a5443c30eaef97d3c60b96571f67be, 1f9e47569aa1483dc394dc3a64d2c3060453813a, b55171ce15035af52615f7b7b30748fb96754abb, 54c66961de039870b7fc9eee5518f693c82f1b02, cb4ab95d95f5f8bdc048e376e1a9023b74341752, 9debdd1e298bfb8c57c9c194797719aa5dee6b62, ff69f44795cdff8e3065438f24d4500e51e2d088, 3c5ddf81a3b02adea5ffc80e9394880ff0c4c3d1, 69305841adce6820d47803d9e8d5d4b381f659f6, af03207f65b2e9582c612e52e084ca4155162847, 509d44c70298533c260be0288f5846d780ad21b8, 3da275ef927ae385dfed1910738aba9b95be506b, ecd5586394e56ce16227428d4675f4b2b1e39b8c, d63f2f4cf79e68df9b08f1e7123819746375660d, 6c91bb5bbc364a88b48f02b7923866bd4781db58, 6c856fdc48394314cbf7d2daf19c9a9916909a54, c06f4e9d151baa12ce221cb2230d1408132c58c6, ff3837c5783bc0bcd1af77b7a0227fff99fba48a, d822c097ab39e28ff3dc7b6af32f32e0918c387e, cad53ab196b3a7f9d70128f8aef940ec91b366b0, bc4aadab625e20910b794526ba18ff7081eca2ad, 916c76ca354d7c3617b43ae6956420f218637d29, e9aaf8259dc862a6f6ed37f5320af45cef3393c2, 707026ce570de8a0f8971fc32935eabc30147500, b6995e02134325f1fe139032ec2d783ff6e7f97d, c125784e9a482e1d3973c1bb170277599afc8d4a, 81eed11ba35f2c90cb854033edfe0ba936c96e26, 6749c93756e753bb3923a330bdd53ea182d791c3, a968551a7a96a843ec5dbc1831bfa9bd584976de, 478a3d632655233ca4cccfe588968abd495f4a35, ce891b301a00112c05fe2b5d0f112ca23138550c, 8c0dd71159d19d830502b16c7daeab371804c7e6, 9a84bf0b433296378b56ef5ada93aadec84e27e4, ecb3bc67f64c6be8fdf0abaad92abb658ca96b4c, d79d13ba419e6d5a0e4521b7ace3eb19625975be, f96ff8749fc889b2465bf5121ba3dc0c922de621, 093edc07214b160ad9ff1a31762eda5b449ac5cb, 9b8cb195f756c6a9425dde79d25a23392fcd64a3, ef7b08e77be8201e4c10b42c3126522551a50227, 273425cf3fd0b55b55a40a866b60a2b1c353a9fa, a80b6710422e98309d07ed2fd5aada9784a01ca4, d684663b18c9300abcccab534c6660d0d11a52ae, b643f345c9ce768f8e8f76ea251e375e3bf200cd, b4ee32ebac8ea36614b453bcf222e4072840aa37, 5cf3af652be0f0f8706f39f5c36b011c6d879906, dd503f226124acae865646d3b2a94ede76f78d23, ffd745b87358b64695c8c9367c217114ea7fd7a8, 85ffed4d3eae87ff15051c03f8a69993f53d5333, dee02e2ae36c8f3f7f8b46fcd8dc6ef381b8b17c, 924bb678690a76101f0faa683e030b03191ac459, c18d6bdcb103acd6e4677bdec4927a85bb73b811, 
4aed3114e5d7af57d4aaccb36a267e9b93ce5d5c, 31439c5ab444c8e9d42b43dce79b6e9ad56c4a39, e360fc5b199a73af89d068419fa61a5184702991, dd4e45394e36e8ebdff0ca7ca6c939e8b008ade0, fb5b026696bef0effb401b99eda7b82e9da7cd16, 9ec0f625ddda784e72857ea084f83841387e58a2, b253d088c8f343682df43fa51ee6ca4f888fbf9b, 4d7e097baa50ff2a52a1e61f2b84b5b2b5220765, 7cb389ec7c831f3ab55a7e97d0bb1fb172facd7e, 9340f55c4b86da5cc2ca1c3f22bb5126af59c784, b0fd1368efba0ab58896b53bf196dcd5df9d4d21, 9493d8cec58e2143664f9210ccf744eba25c14e8, 38c6e3f56d6246ace11141a7fab565f0b9842e41, aba148f7e17ee5e1cd40767440b630fed5b9f7b3, ff1e3327a9b6d4e1052d8de0a86ca7a88ba0badc, c7f0e9947199c8d5e6c41064db6952b24a20e742, 4386eabee9f9afb73e482738b00a819aa37a5bd2, 346fa1c75bfb207d1cee5559d85763b67a64d0da, 8094a74f318bfbcb4902e566a90400150c3e45a7, 6aa28553894b938ef2aa4ba2380189f730a929bb, 18028e650dd8b6c953db55d4f0f5e572dc7c899c, e3f205d9a9f46de243146b8395d60b368928006a, 404b79709a079a0c3b6367e86c5b6cfa6fc02d63, 197def12448b884ce136bd83a95fef835221940d, a5af71e74ffa38283ccc10c3e80fc5c8d4da7a21, c2239e92a421edb8ef9b7ab93419226709255fd1, 8d4ec69839fcd1d93e11eb857aeef42ad2f3d69d, b8f87b16c385764b34bcd5786f0cde4e8295eb61, 0f48eea37043935bc7621ca11126adb96d50bb0f, 921c20c7c8b67157c58ed7d0c2307282e804b706, 0b1dc1213054454937319ca98ffabb6550c6d536, 7bdb52a1760faaee243e9eaa255ba5187145c325, 71c54752da66c9e3235497b5a9dec493662a8d02, 8dd8f2a16fd7c9b01e89e5839ffafd11e2d9ec1e, d9c1eae1487f5042693b26ec255a120e10be8455, 896095fbf26a63dd375b31147fa676abdf61c644, d30599fb7a56367586346d743fcf4f5d1954c904, dd1bac7a0e7a26ec823842f2fc38a4b18cb364a1, 3838b8b89ba963d0425ca2128dfb2d86286d84f8, a3f90674e3665f5d888e962fe3ad6758fc1df1dd, d47523621fe3f882a6b9f6d5af656e25abe5ce5f, 5a8ab2b7cc092efd49518997f3ebd5a0b6e86a68, 6ae9bd0fd9acec0fec9c75bd6c1d3635561395cf, 1c22e4a705747604425a23c9337f8558e7fe22fb, 65852eb767952fa3c04ae30f8023de8cf6c02a44, c89d54a5ec05531055822778d05cc1acd061bec9, 772a228c88856c7535be0f224364ff7c91bcf3ff, 5962de4bfff45a4ac17f724c593aef9e33736521, c2fee9c4662e20d176ac72c943463b0d5a552fd9, 5218ddb265969b97a7a2f5c1e617073fe3c7461b, c0e14736fc9c62fa479769e5f5a3fc43992d784d, 78ec889eadf2639515c39c94187a58274cd2e006, cfb0d89b92dc3b73e35f140d4fa62b6825220a0a, a55c0f1a1e0dd4d6aa9df15590ffa0431a7c8222, aae36699aed0de2f04f65115c5303f890848ed04, 92faef39fef41d6422a598530a99ee5b06430c57, 12f270c86ec921983cf3bcc4d656e34669e32014, e84fa3e96d431136b7b178ee5d0479d5ad47a066, 5c3c64e82e3188c7458867c490c3adc3a59b326c, a8d324c0603a9f7e0e58f0cbdc5fdfa5df94a497, 7601d10ec1c6addca9c7bdedad12d4d2e3e937ee, 71b7f13215fba0cae674598935f89636f817f269, 292ff5279ac363dc9d8e70a1180cfc337cc6cc2f, 1e3b901d5ca5b14af469824153d3717fd601bb96, a7266b4d007d96759a59947354371c6e061c2148, aa47fdb069d1ff1fee6a62078554df7bfce77dca, 06742c79f11ef838db1e141028a524aac0b91ab3, 06203ad46ec5f4d7fbd20a521d89096851ce4e37, 24fc38ad1902daa427f0690fe170b520be8183e1, 4938310734aff031c199dce4aa89e92aebd4b3de, 1f7360b537aee313b3e8381241cae55de0734321, f6443982eb93d2f0d94ff7b67fdc68babced9ce3, abe8c36cafc13e9a822cfe178fd2cee3544ddb6a, 452116fab192e1002f57c6c3640528f3e23bd37d, 3959b37eb0b8834b23b3c81962ef7ccbb6f5f925, fb130e4c8389964f29199c9b1a3a57d223aec8e5, b4519bb2466bbb8a93c49abb3f195691e3be930d, 340aa59b085d9167e86c3e2a75a775649861487e, 217e8666ae55f854cb9d6f2932d111ec9d3ab7ea, 513b1cf8ddf74a92fb6f310587dd897a38444d38, ce51a7411716dd1cb2749192ec91debe72ff6f41, f2e7ad6ef2511e02ef5979cf15fa905eff3dd712
2. [ONNX] Bump torchlib opset to 22: This pull request aims to update the ONNX (Open Neural Network Exchange) torchlib opset to version 22, involving multiple commits that address various aspects such as migrating torchlib into PyTorch, updating tests, fixing issues, and making several code adjustments, although it was ultimately not merged.
- URL: pull/146510
- Merged: No
- Associated Commits: d3904c97c91a8a951e10ebc55b2d2c05716b3e54, 837c03511e05c97a5e68a8f7ff90a3cb414fa3b2, caea1897fe02a510eaf0cd56a16b73d9ee32cc5e, 31fc59a9a857adbd07725f3d98ccd8fbde47cb0a, 8c2c5f7235ed77a8d83a338ac0f7d290fdbe2648, 77ed810ccfb4f0f72b48b7ee63c242c1a8858cb8, 00179f73ea9d4530e10ffa0d8bb125fa0fa5cebb, 3cf2eccd62252c890d81e545fed493be3885a422, 4504992f3911729200ef5a5296408e48986bd867, f61eb8bf9dde8b56b9b221b8834945d4ded78b25, 07c7cd3e3ffa7dee2be8f8c86933754d40057558, c263b4d243b245f46ecbbeb31db181318cf25de6, 4927f2fb50f7ff5e1ae964bda0ece4e2c9598c8f, c673722865468eb312a50dd600e8551ffd21a616, 40153a1c991292bf33ca51f3276d0623045d157c, 98e32d21399d9fdaf5f02076b9abdae73e492d39, 702cc78771730a319133503a84d96d7de9b9d3a1, f8f12752493e7b87a7dd03386c23e10bc7cc9a4b, fcb0fce6ac09a10b352f22038f16a84f6fa12d56, 1be91d0f6c696699de76fa31290b613885d363cf, 6009477a78d006e1b41cc6154f68031c52e1f5a5, b8a630a952b58eed83e942cd638055f950bb415b, f590b1a120c6df7a863bd5f2c0862404b5869988, 24b7d88cd7774031f6a1c38c857407d94a35421f, ad46eabe5e7e346da5f42af9dc9c10a44960f133, 4289ec9e1f359535898b1c7febbeca66b7a105ad, 1d488d660c09f6622ed9658d913ed99317189d8f, 985795c20442be5c4b5193c54d91b0290b7066c5, c9ce73eb47a1551eb15983e6830c36f6ed53cbc7, 8cfef16a50f7929c1b1607ba58d0788561669388, e592ade95b744096035856f7b94d330467bc2163, ed814b548760729664aaf781fa8960445a1a2b1b, 95b337a9a969e0f0cdfc02be7f83558935005cb0, 0f7ac12c5be926045dad0f9786ce7c1777f59d21, 8e5743e58e90b13738c1cd07b5bbf65277f76277, 1575d2a986d6d0bd9886453324e2ad1fe45fb17f, 50f0cd2421ebdb3376c8ecb0b28eba0355be4cc2, cdeac69916b21b593e45d8862077165b98474f1e, 90973912b1a91a464254e2085dc2e49f62de2032, 13bcf716d28284f6cf12bb8f528d9e42c9e3964b, b3aab2a83fa6882cf66fabdb864247984f0b34fe, fc259e12823eb3fbbfef9e6c0c2eb28fb4dfb6d7, 805066c9db9a0c67e39515baf1242fdf53f0fc8a, 125c9fb5553d190a34fb25a4363855a4163233fe, da3d02b3603c9a7b4ec1462c35964fc3ce49867f, bba250468a5b6029516d53423f59739a776b6e5b, 2b5bb0ffc66271dfa2e07598288b0a6d4302acf3, 61c9814fb743bc1c0be42cc1e1ae50e561c8bce4, de4210e0e03363dbb855f41a73211d5b42bc659b, d44c0a09afbd90c3386ea4045db436d896cc925b, 713e5f31e730c8e070d8d85e140cc3366c2abbe4, 67a32e35d5f55260d54c273bc38da80038912d4a, 0637eaec23d4de217b6cd0cfcaf52f098858a95b, 4c0ad07020534ef82459306838c1419ab34f98c6, 798383bcf810dd5861eb2a3ce77bc2dd853cca53
3. [Intel GPU] qconv_pointwise.binary XPU support: This pull request aims to enable support for quantized fusion operations, specifically qconv+add and qconv+add+relu, on the Intel GPU backend by registering the operation via the schema TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary") and allowing the signed int8 data type during operation lowering, while reusing existing pattern-matching code from the x86InductorQuantizer.
- URL: pull/135189
- Merged: No
- Associated Commits: 5f711c706717606019478c361bf5543bf23c32fa, 08cde3a39e2ced53a3a4930dcdb629f1edf6dc29, e80985ec2d3e6f55e82e0082ba44c2e909671109, eb1e56c51fafd24452cab1303eb2f72b52d44921, 2fc0e194c2e78f719f9c06acf3354d152749c2fd, e9e3d286d0fcb02a567cae218037c26ce1e0e813, 9e9063118cf74e4ca6b60ac9c695a1bcea7a0987, d7500eb4910fc969f24cdd7fa3b3c325e6303498, 1dfc42d3abd60a221a08e13f57d38d1778e68bb8, 0a39e2023e870627651447a0e6b358334bfacc8f, 2631240afd19e8c2a08dd83d3b02b6832c962d54, 2e6d70d4a339f3d869835b7fa2809c2fe8a2ecb7, e8221e2ff90530071a3312ddc4741386e36e4585, c90b23d03b614e4ad9b1c51d48fe52e32521f028, 97b781e02045b8f5a7f8dba3cc92b8e89006dafd, fef7f32f30073c7b570fd9b891cd52262925c94e, 0b2271be05a5ba015f10ab5dbae427e91449b84e, eefe995d96b7255a9883bbc00349596f3bf58978, 281d9f16c16cf612e529b0cc57ac89b49b71ee2f, a33910bc17ae923ad0d9ec7ba2c09a86bac9de49, d468eccf1d873da541b7b1232e6477edcb7e7a66, b128dd372c3e291d349b92bc84d36d3c8fd28dad, ec34e31a03721f7e535d8605933a9bf90f39c5be, cb30e5a98ac28747f82bdaa01bcc28ad12facbc1, 7c4de3a718d4de03b3975313435fbf55678976a3, 83fb67b26baddd80be914ac4aa80bd67a962e2a9, ac9fd32b3d4d59c80bbcb556d2daad5ed0cbcaef, e339f5335f3e8d0b6429fc632bcec75079b04bc6, 89086c91ea4444cde0512df9dd9cd01d45b86ee1, 2574bc9cb21ed4ed002eaa3243d43490206ad3af, 3a1f769277a75100420d4529d9a20fc68b3a1953, 4f05b786112b76f51ce899ca9f2597b16fb6a941, e6115120ce67dfd3cd98c5d1fceb6b370483cd9f, 5a84844df59934f640124333579a862fb875de75, 768520aa155d189c890acee07ae5a98b67add902, 346d646515bb81918899af645ffbbd2313ecb0ca, 439788af5a44e8e9a7cf5fe5e42d8f47a16f7270, b27a8cc2371514c0e87d68115d8e11f60d384630, ea400594d6e3aed79952178d5f29d6a0970dcca3, 65c14f00ec4f1bd9a0dcbd9e29b3747d0cf4146e, e6322e81a63e163fd225baf4100519ca812b46ec, 122340faa57dd47af51d6618f74a64fb89aea83a, 4a36fbcfab5f6986b28ff8212dd0aed9ccdca328, c5edf493b2c7819b4ed1a44a57dd037f9455cb42, b44bf1756c469408e67e80d2da2fb42c106ed0db, 3b1ee856bc29ef51d333ac12dd99f0e9105bd646, ed8ad3f73866e6b8199644c31521304f57d1144b, 2fa8347b882aa0202a24a6ab666ecd9a5420be40, 5011d90c92af9074e8f32df16d56fd7692bada65, c34e6f18b6e2d716f1a76ed52a74256e56857853, 161c192bc6a3a98750363076011e1aa99783da4b, 34d44776782573e4bf59960adea006c12f5be7f5, a1a326c54a9f016c4157a0628afe567ac9099947, 6a0e37a5e88aa532b0f45e1005b9234fe6565331
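For the first key pull request above (the GraphContext.op SEGFAULT), the sketch below is a hypothetical illustration of the call pattern that could trigger the crash: a custom ONNX symbolic function forwarding a None argument to GraphContext.op. The function and operator names are invented for illustration and are not taken from the pull request.

```python
# Hypothetical sketch only: the names my_symbolic and custom::my_op are
# invented. A custom ONNX symbolic function receives a GraphContext `g`
# and builds graph nodes via g.op(...); forwarding a None positional
# argument to g.op(...) is the kind of call reported to crash before the
# fix. Such a function would typically be registered via
# torch.onnx.register_custom_op_symbolic.
def my_symbolic(g, x, optional_bias=None):
    return g.op("custom::my_op", x, optional_bias)
```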
Other Closed Pull Requests
- Intel GPU Backend Enhancements: This topic covers the enablement of various operations and optimizations for Intel GPUs in the PyTorch project. The pull requests focus on enabling onednn.qlinear operations, quantized fusion of qlinear+add, and implementing the SDPA operator using oneDNN, with improvements in data type support and performance optimizations for Intel GPUs.
- NCCL and CUDA Updates: This topic involves updates to the NVIDIA Collective Communications Library (NCCL) and CUDA support in the PyTorch project. The pull requests include updating NCCL to version 2.25.1 for newer CUDA versions and addressing related issues, as well as adding support for CUDA 12.8 in the libtorch nightly build.
- Torch Export and Dynamo Enhancements: This topic covers improvements to the Torch export functionality and the PyTorch Dynamo component. The pull requests address issues with the export backend, enhance graph break messages, and improve the handling of dataclass instances in Dynamo and FX.
- MX-FP8 Data Type Support: This topic focuses on the introduction and support of MX-FP8 data types in the PyTorch project. The pull requests aim to add support for the Float8_e8m0fnu and Float4_e2m1fn_x2 data types across various components, including CUDA and CPU kernels, and enhance device property handling; a hedged sketch appears after this list.
- Cutlass Backend Improvements: This topic involves enhancements and fixes to the Cutlass backend in the PyTorch project. The pull requests address issues with GEMM template data types, forward fixes for mixed matrix multiplication, and the introduction of subprocess tests for autotuning.
- Inductor and XPU Backend Support: This topic covers the enablement and optimization of the Inductor component for the XPU backend on Windows. The pull requests focus on resolving unit test failures, enabling the XPU backend, and addressing issues with the fft_c2c test case.
- Documentation and Typing Improvements: This topic involves enhancements to the documentation and type annotations in the PyTorch project. The pull requests aim to improve the clarity of method parameter descriptions, add type hints, and address issues with type stubs and annotations.
- ONNX and Export Enhancements: This topic covers improvements to the ONNX module and export functionality in the PyTorch project. The pull requests introduce a framework for ONNX operator test data, enhance the export API, and address issues with dynamic shapes and serialization.
- ROCm and AMD GPU Support: This topic focuses on updates and optimizations for ROCm and AMD GPUs in the PyTorch project. The pull requests address issues with efficient attention mechanisms, optimize the TopK operation, and update CK kernel codegen templates for ROCm.
- Cache and Performance Improvements: This topic involves enhancements to caching mechanisms and performance optimizations in the PyTorch project. The pull requests introduce a new benchmark for PT2 caching, address cache-related issues, and optimize the handling of non-constant weights in BMM operations.
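Relating to the MX-FP8 item above, the following hedged sketch probes for the Float8_e8m0fnu dtype; it assumes the type is (or will be) exposed as torch.float8_e8m0fnu, which may not hold on every build, so the check is guarded.

```python
import torch

# Hedged sketch: float8_e8m0fnu may not exist in a given build, so probe
# for it before using it.
if hasattr(torch, "float8_e8m0fnu"):
    scales = torch.ones(8, dtype=torch.float32).to(torch.float8_e8m0fnu)
    print(scales.dtype)
else:
    print("torch.float8_e8m0fnu is not available in this build")
```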
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
-
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating tension.)
- This GitHub conversation involves several users discussing the implementation of a feature, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
- [dynamo] Save/restore system random state more carefully [attempt 3]
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Increasing tension.)
- This GitHub conversation involves multiple users discussing a technical issue, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as more users join, with some attempting to mediate while others exacerbate the situation.
-
- Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange.)
- This GitHub conversation involves multiple users discussing a series of commits related to an "export method." User1 initially provides a solution, which User2 critiques, expressing dissatisfaction with its effectiveness. User3 attempts to mediate by suggesting improvements, but User1 responds defensively, leading to a tense exchange. The tone shifts from collaborative to confrontational, with User2 and User1 exchanging terse comments.
- demo myst_nb with compile tutorial
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
- [Docs] Add OpDTypes.any_common_cpu_cuda_one
- Toxicity Score: 0.55 (Defensive responses, critical feedback, escalating tension.)
- This GitHub conversation involves a discussion between several users, where username1 initially proposes a change, and username2 provides feedback that is perceived as critical. Username1 responds defensively, leading to a back-and-forth exchange that escalates in tension. Other users, such as username3 and username4, attempt to mediate and offer constructive suggestions, but the tone remains strained. The conversation is marked by frustration and a lack of consensus, with users expressing dissatisfaction with the progress and communication style.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 199 | 62 | 2 | 220 |
anijain2305 | 273 | 59 | 3 | 77 |
guilhermeleobas | 337 | 16 | 2 | 34 |
jansel | 223 | 32 | 2 | 119 |
zou3519 | 60 | 19 | 20 | 246 |
justinchuby | 141 | 23 | 8 | 142 |
benjaminglass1 | 241 | 14 | 0 | 41 |
Skylion007 | 46 | 20 | 3 | 205 |
eellison | 96 | 9 | 7 | 160 |
cyyever | 138 | 49 | 0 | 49 |