Weekly GitHub Report for PyTorch: February 17, 2025 - February 24, 2025
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
The PyTorch 2.6 release, created on January 29, 2025, introduces significant updates including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and enhancements to AOTInductor. Notable changes include the deprecation of publishing on Conda, the introduction of FP16 support on X86 CPUs, and a backward-compatibility-breaking change in the default behavior of `torch.load`.
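A minimal sketch of the two user-facing changes mentioned above, assuming PyTorch 2.6; the checkpoint path is a placeholder:

```python
import torch

@torch.compile
def step(x):
    return x * 2

# torch.compiler.set_stance (new in 2.6) can make compiled functions skip
# compilation and run eagerly within this scope.
with torch.compiler.set_stance("force_eager"):
    step(torch.randn(4))

# In 2.6, torch.load defaults to weights_only=True; loading arbitrary
# pickled objects now requires opting out explicitly.
state = torch.load("checkpoint.pt", weights_only=True)  # placeholder path
```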
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Triton pin update for PyTorch 2.7 / Triton 3.3: Upgrading PyTorch-Triton to a version that Supports Blackwell: This issue involves updating the PyTorch-Triton integration to support the Blackwell architecture by upgrading to a version of Triton that includes necessary optimizations and features. The update aims to address various compatibility and performance issues, particularly focusing on the integration of new functionalities and resolving existing bugs before the release of PyTorch 2.7.
- The comments discuss the urgency of updating the Triton pin to support Blackwell, with concerns about unresolved issues and the timing of the update relative to the PyTorch 2.7 release. Contributors highlight specific test failures and compatibility issues, propose potential solutions, and track progress on related tasks, emphasizing the need for coordination and timely resolution of these issues.
- Number of comments this week: 12
- [compile] Modularize very long compilation: This issue addresses the prolonged compilation time experienced during the model export/compile process, where a single generated C++ file exceeding 78,000 lines takes over an hour to compile using only one CPU core. The user suggests modularizing and parallelizing the compilation process to improve efficiency and reduce the time required for this stage.
- The comments discuss the potential causes of the issue, including the generation of a large Triton kernel and the need for modularization. Suggestions include splitting the C++ file into smaller parts for parallel compilation, although this may not be straightforward due to the current architecture. There is also a mention of testing with lower optimization levels and the possibility of using subgraph handling to manage repeated submodules.
- Number of comments this week: 11
- [Export AOTI] dynamic_shapes export and compile degraded output: This issue involves a bug in the export and compile process of a model using dynamic shapes for width (W) and height (H), which results in degraded output compared to using fixed resolutions. The problem seems to be related to the use of `torch.export.Dim` for dynamic shapes, which causes runtime errors during inference unless the dimensions are aligned with the inference resolution.
- The comments discuss the difficulty in debugging the issue without a reproducible example, suggest testing subparts of the model, and mention a potential problem with the exported program's graph. A minimal reproduction script is provided, and it is noted that the issue might stem from an invalid graph produced during export, with errors in AOTI and compile processes being secondary.
- Number of comments this week: 10
- [RFC] Test Cases Enabling for Accelerators: This issue addresses the challenge of enabling existing PyTorch test cases for new device backends, such as accelerators, by proposing a flexible mechanism to determine at runtime which tests to run, skip, or adapt based on a device's specific capabilities. The proposed approach involves introducing device abstractions that report capabilities, allowing for dynamic configuration of test inclusion or parameterization, thereby minimizing intrusive modifications and providing robust coverage across diverse hardware capabilities.
- The comments discuss extending OpInfo for better device capability querying, the potential benefits for both in-tree and out-of-tree backends, and the need for a registration mechanism for device interfaces. There is interest in how this proposal will scale PyTorch tests for third-party hardware, the primary use case, and the integration with existing test infrastructure. The discussion also touches on the compatibility of capabilities across different hardware and the impact on test writing for PyTorch developers.
- Number of comments this week: 9
- PyTorch VS2022 official build Windows binary illegal instruction on AVX2(max ISA level) CPU: This issue addresses a bug in the PyTorch official build for Windows using Visual Studio 2022, where an illegal instruction error occurs on CPUs with a maximum ISA level of AVX2 due to the generation of AVX512 instructions. The problem does not affect current PyTorch official binaries built with VS2019, and it is challenging to reproduce locally, suggesting it might be specific to the official build environment.
- The comments discuss potential solutions, including involving Microsoft, understanding the issue's scope across platforms, and maintaining AVX2 support due to its prevalence in client CPUs. There is a suggestion to revert to VS2019 or fix the issue by identifying differences between local and official environments. The discussion also touches on the possibility of making AVX2 the new base architecture instead of SSE4, as the problem might be related to non-deterministic linker behavior with AVX512 implementations.
- Number of comments this week: 7
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- DISABLED test_transformer_training_is_seq_parallel_False (main.DistTensorParallelExampleTest): This issue pertains to a disabled test, `test_transformer_training_is_seq_parallel_False`, within the `DistTensorParallelExampleTest` suite, which is failing on the main branch of a GitHub project. The failure is suspected to be caused by changes introduced in one of the pull requests #122995, #122996, or #122997, and it affects the ROCm platform, prompting a wide range of contributors and maintainers to be notified for further investigation and resolution.
- [NestedTensor] multiply batch and ragged dimension to get shape of values tensor: This issue discusses a feature request for the PyTorch library, specifically the ability to manipulate the dimensions of a NestedTensor by collapsing the first two dimensions into one, which would allow for more flexible tensor operations. The proposal includes a code snippet demonstrating how this could be achieved, highlighting the potential for enhanced functionality in handling nested tensor sizes.
- Error: command buffer exited with error status.: This issue describes a problem encountered while training a model using llama2.c on an iMac with an AMD Radeon Pro 5700 XT GPU, where the user experienced a command buffer error at epoch 11,580, causing significant delays in epoch processing times. The error, which appears to be related to GPU timeout issues, occurred after the user built PyTorch from source due to the lack of recent nightly builds for MacOS + x86_64, and the user is seeking insights into whether these GPU timeout errors could be related to garbage collection or other factors.
- scalar_tensor call with symbolic bool input does not work in inductor: This issue involves a bug in the PyTorch library where the `scalar_tensor` function fails when called with a symbolic boolean input while using the Inductor backend. The error occurs during the execution of a compiled function, resulting in a `TypeError` due to an `Equality` object lacking a length, which disrupts the expected behavior of the function.
- Support AOT Autograd level Caching: This issue addresses the need for caching in the `torch.compile` process when using an `aot-autograd` enabled backend, as the current compilation time for models like Llama2 7B is significantly long, impacting development speed. The problem is particularly pronounced in scenarios where PyTorch/XLA is integrated with VLLM, requiring pre-compilation of multiple input shape combinations, which currently results in a lengthy warm-up phase due to the lack of support for dynamic shapes.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 107
Summarized Issues:
- AttributeError in Distributed Training with LLaMA 3.2: An AttributeError occurs during the implementation of distributed training using pipelining for the LLaMA 3.2 model. The error message indicates that an 'InterpreterModule' object lacks the 'cache' attribute, potentially due to a missing or incorrectly referenced attribute in the model's pipeline configuration.
- Floating Point Exception in `torch.nn.functional.conv3d`: A floating point exception and subsequent crash occur in the `torch.nn.functional.conv3d` function of PyTorch when using specific input parameters, particularly a very large stride value. This issue is related to the mkldnn backend and can be temporarily bypassed by disabling mkldnn.
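A minimal sketch of the temporary workaround mentioned in the item above, disabling the mkldnn (oneDNN) backend; the shapes are illustrative and the crash-triggering stride from the report is not reproduced here:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, 8)
w = torch.randn(4, 3, 3, 3, 3)

# Globally disable the mkldnn backend so the fallback implementation
# handles the convolution instead.
torch.backends.mkldnn.enabled = False
out = F.conv3d(x, w, stride=1)
torch.backends.mkldnn.enabled = True
```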
- Build Error with oneAPI on Windows: A build error is encountered when upgrading to a new internal build of oneAPI on Windows, where the compiler does not conform to the C++ standard. This results in multiple syntax and conversion errors, which can be resolved by adding the `/permissive-` flag and making specific code changes to address const qualifier issues.
- AttributeError in Triton Upstream Inductor Project: Widespread failures in unit tests for the Triton upstream Inductor project are caused by an AttributeError due to a deprecated API. A 'dict' object is incorrectly expected to have an 'equal_to_1' attribute, affecting platforms NV and ROCm.
- Timeout in Nightly Windows Builds: The timeout of nightly Windows builds for the PyTorch project began around January 31, 2025, due to an increase in build time. This is potentially linked to recent CUDA-related pull requests, specifically the addition of numerous element-wise CUDA kernels.
- Memory Access Violation on ROCm Platform: A memory access violation error, specifically HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION, occurs on some unit tests when using the ROCm platform with the latest Triton commit. This issue is encountered during an attempt to update Triton in preparation for version 3.3.
- Bug in `_export_forward_backward` API: A bug is encountered when using the experimental `_export_forward_backward` API in PyTorch to export the joint graph of a model with tied weights. This results in an assertion error due to a parameter not receiving a gradient.
- Missing `comm.reduce_add_coalesced` Function: The absence of the `comm.reduce_add_coalesced` function in the CUDA communication collectives of the PyTorch project is highlighted. The provided documentation link and image indicate this issue, but no potential alternatives or fixes are suggested.
- Discussion on CUDA 11.8 Support Removal: The potential removal of support for CUDA 11.8 in the PyTorch project is discussed, proposing to announce its removal in release 2.7 and officially drop it by release 2.8. The support is to be phased out in nightly builds between March 2025 and June 2025.
- Monitoring Test Count Regressions: Monitoring test count regressions is necessary to identify potential bugs in sharding or unintended changes, such as modifications to bash scripts that might inadvertently stop testing on certain platforms like macOS. This ensures the integrity and completeness of the testing process.
- Bug in `Lazy*` Modules Representation: A bug in the PyTorch library is highlighted where the representation of `Lazy*` modules, such as `LazyLinear`, does not update correctly after loading parameters from another module's state dictionary. This results in an inaccurate display of input features even after a forward pass is performed.
- Need for Robust API in `torch.export`: There is a need for a robust API in `torch.export` to specify whether certain inputs, like `MyStaticInput`, should be treated as constants or not. The current implementation incorrectly treats registered constants as empty containers, leading to a `ValueError` during the export process.
- Failure in Real-Tensor Tracing: A failure in real-tensor tracing during the export of dynamic shapes in PyTorch is caused by a division by zero error in the `_split_dim_meta` function. This is due to a mismatch between fake and real tensor shapes, suggesting potential solutions such as rewriting metas for real-tensor tracing.
- Bug in PyTorch's Dynamo with Constant Tensors: A bug in PyTorch's Dynamo is highlighted where constant tensors created with `torch.tensor` do not correctly recompile with the appropriate device guards when the ambient device index changes. This leads to failures in ensuring that the output tensor is always on the current CUDA device.
- Configuration Option for Recompile Reasons: A configuration option is added to the PyTorch project that enables the printing and generation of all recompile reasons without halting at the first encountered reason. This is indicated in the GitHub issue titled "Add a config to allow print and generate all recompile reasons and not stop at first."
- Dynamic Epsilon Value in Optimizers: PyTorch needs to dynamically adjust the default epsilon (eps) value in optimizers like Adam based on the precision used, such as float16. This prevents issues like NaN values in model parameters due to ineffective default eps in low precision training.
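A minimal sketch of the kind of manual adjustment the item above argues should happen automatically; the eps value is illustrative only:

```python
import torch

model = torch.nn.Linear(16, 16).half()

# The default eps (1e-8) rounds to zero in float16, which can yield NaN
# updates; a larger eps is therefore passed explicitly for low-precision
# training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)
```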
- Bug in `torch.compile` with `dict_items`: A problem with the PyTorch library is highlighted where the `torch.compile` function using the "eager" backend and `fullgraph=True` does not support iterating over `dict_items` from a user-defined object. This is demonstrated by a provided code snippet that fails to execute correctly.
- Crash in `scaled_dot_product_attention` on MPS: A bug in PyTorch is highlighted where using the `scaled_dot_product_attention` function on Apple's Metal Performance Shaders (MPS) backend with non-contiguous 5D tensors causes a crash. This is due to inadmissible tensor sizes being passed to `MetalPerformanceShadersGraph`.
- NaN Values in Backward Pass: The occurrence of NaN values during the backward pass in a PyTorch project is reported, specifically originating from the `normalize_weight_jit` function used for weight normalization in a neural network. The problem surfaces as a runtime error indicating that the function returned NaN values in its output.
- Gradient Checkpointing Inefficiency: A problem with gradient checkpointing in PyTorch is highlighted, where setting `use_reentrant=False` does not reduce peak memory usage compared to not using gradient checkpointing at all. Setting `use_reentrant=True` significantly reduces memory usage, indicating a potential inefficiency or bug when `use_reentrant=False`.
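A minimal sketch of the two configurations being compared in the item above; the module and input are placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(64, 1024, requires_grad=True)

# Non-reentrant checkpointing: the variant reported not to reduce peak memory.
y = checkpoint(block, x, use_reentrant=False)

# Reentrant checkpointing: the variant reported to reduce memory significantly.
y_reentrant = checkpoint(block, x, use_reentrant=True)
y.sum().backward()
```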
- Precision Discrepancy in `polygamma` Function: A precision discrepancy in the PyTorch library is highlighted where the `polygamma` function, when `n == 1`, executes the `trigamma_kernel` under eager mode but switches to `cal_polygamma` under `torch.compile`. This results in less precise outputs and suggests a change to consistently use `trigamma_kernel` for improved accuracy.
- `INTERNAL ASSERT FAILED` in `torch.svd`: An `INTERNAL ASSERT FAILED` error occurs in PyTorch when using the `torch.svd` function with extremely large input matrices. This is likely due to the use of a 32-bit LAPACK API that cannot handle arguments with more than 2^32-1 elements, and there is currently no plan to support a 64-bit version.
- Floating Point Exception in `torch.nn.functional.conv1d`: A bug in the PyTorch library is reported where using the `torch.nn.functional.conv1d` function with specific input parameters results in a "Floating point exception (core dumped)" error. This is reproducible with the nightly-build version `2.7.0.dev20250208+cpu` and may be related to a known issue with oneDNN.
- Incorrect Outputs in `scaled_dot_product_attention`: A bug in PyTorch 2.6.0+rocm6.2.4 is reported where the `scaled_dot_product_attention` function using the memory-efficient backend (aotriton 0.8.0) produces incorrect outputs when a custom attention mask is applied. Upgrading to aotriton 0.8.2 resolves the problem.
- Performance Regression in "modded-nanogpt": A performance regression in the "modded-nanogpt" project is reported, where the PyTorch version 2.7.0.dev20250209 is observed to be 2 seconds slower than the previous version 2.7.0.dev20250208. Detailed logs are provided for both versions to illustrate the discrepancy.
- Test Failures in FlexDecoding Component: Test failures in the FlexDecoding component of the Triton project are reported when integrated with PyTorch, specifically related to an assertion error indicating an invalid stage for an operation. A detailed script for reproducing the problem is included along with discussions on potential fixes and workarounds.
- Bug in Export and Compile Process with Dynamic Shapes: A bug in the PyTorch export and compile process is reported where using dynamic shapes for width (W) and height (H) results in degraded output. Using fixed resolutions produces correct results, and attempts to debug this are complicated by runtime errors and discrepancies between exported and compiled outputs.
- Test Failures Related to AOTI and Triton on Blackwell Platform: Multiple test failures related to AOTI and Triton on the Blackwell platform are tracked, including internal assertion failures, kernel library production issues, and missing attributes in code generation tests.
- Deprecation of Silent Fallback to ATen Logic in Inductor: The deprecation of the silent fallback to ATen logic in Inductor when generating GEMM kernels is proposed. This aims to respect the user's specified backends in `max_autotune_gemm_backends` by raising an error if these backends fail and ATen is not included.
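A minimal sketch of how the knob named in the item above is typically set, assuming the `torch._inductor.config.max_autotune_gemm_backends` option; under the proposal, a failure of the listed backends would raise instead of silently falling back to ATen:

```python
import torch
import torch._inductor.config as inductor_config

# Restrict GEMM autotuning to Triton only (ATen deliberately excluded).
inductor_config.max_autotune_gemm_backends = "TRITON"

@torch.compile(mode="max-autotune")
def matmul(a, b):
    return a @ b
```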
- Enhancement Request for `Dim.AUTO` and `Dim.DYNAMIC`: A feature request is made to enhance the `Dim.AUTO` and `Dim.DYNAMIC` functionalities in a GitHub project by allowing users to specify optional minimum and maximum values. This could potentially optimize dynamic dimension handling and improve performance through Inductor optimizations.
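For contrast with the request above, a minimal sketch of how a named `torch.export.Dim` already accepts explicit bounds today; the module and shapes are illustrative:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

x = torch.randn(4, 64, 64)

# Named dims can carry min/max bounds; the feature request asks for the
# same option on Dim.AUTO and Dim.DYNAMIC.
h = Dim("h", min=32, max=256)
w = Dim("w", min=32, max=256)
ep = export(M(), (x,), dynamic_shapes={"x": {1: h, 2: w}})
```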
- Logging Error in `remote_cache.py`: A logging error in the `remote_cache.py` file of a PyTorch project is reported, where log messages are being emitted after pytest has exited. This is due to the use of the `atexit.register` decorator, resulting in a "ValueError: I/O operation on closed file" when attempting to write to a closed stream.
- Feature Request for Node and Weight Deactivation: A proposal is made to add native functionality to the PyTorch library that allows for the temporary deactivation ("sleep") and reactivation ("wake") of specific nodes and weights within neural network layers. This aims to enhance research capabilities and training optimizations by providing a more granular and reversible approach compared to current methods like pruning or freezing entire layers.
- Slow `torch.distributed` `all_reduce` Operation: A problem is reported where the first execution of the `torch.distributed` `all_reduce` operation takes significantly longer (over 30 seconds) when using Ray with specific `CUDA_VISIBLE_DEVICES` settings. Subsequent executions are much faster, and this behavior does not occur when using other configurations or Docker.
- Unintuitive Behavior of `F.pad` Function: The unintuitive behavior and errors encountered when using the `F.pad` function in PyTorch are highlighted. The function unexpectedly requires batch and channel dimensions for padding operations on 2D tensors, despite this requirement not being documented, leading to user confusion and the need for cumbersome workarounds.
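A minimal sketch of the workaround pattern alluded to above, assuming a non-constant padding mode (where the batch/channel requirement typically applies):

```python
import torch
import torch.nn.functional as F

img = torch.randn(64, 64)  # a plain 2D tensor

# Reflect padding expects a (batch, channel, H, W) layout, so the tensor is
# temporarily given two extra leading dimensions and then squeezed back.
padded = F.pad(img.unsqueeze(0).unsqueeze(0), (2, 2, 2, 2), mode="reflect")
padded = padded.squeeze(0).squeeze(0)
```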
- Precision Drop in ONNX Export with Sigmoid Function: A significant precision drop is observed when exporting a PyTorch model using the sigmoid function to the ONNX format. The model performs accurately in PyTorch but shows discrepancies in output accuracy after conversion and inference in ONNX, particularly with small input values.
- Inconsistent `clamp_` Operation on MPS Device: A bug in the PyTorch library is highlighted where the `clamp_` operation on tensors behaves inconsistently on the MPS device, particularly when applied to sliced tensors. This results in incorrect in-place modifications, unlike the expected consistent behavior observed on the CPU device.
- Feedback Request for RISC-V Support Enhancements: A request for feedback and review on two pull requests aimed at enhancing PyTorch's support for the RISC-V architecture and the RISC-V Vector Extension (RVV) is made. The focus is on kernel optimization, vector library support, and CI support for cross-compiling.
- Request for L-BFGS-B Algorithm Implementation: A request is made for implementing the L-BFGS-B algorithm in PyTorch to support box constraints. The current L-BFGS implementation lacks this feature, which limits the ability to efficiently perform tasks like maximum likelihood estimation for point processes and Hawkes processes using PyTorch's tensor computation and GPU acceleration.
- Bug in `nn.GaussianNLLLoss` Function: A bug in the PyTorch `nn.GaussianNLLLoss` function is reported, where the expected behavior of allowing the `var` parameter to have one dimension of size 1 is not met. The function only permits the final dimension to be of size 1, resulting in an unexpected "var is of incorrect size" error.
- Inclusion of Dataclass Instances in Computational Graph: A bug in the PyTorch project is highlighted where the Dynamo and fx tracing systems currently permit dataclass instances to be included in the computational graph. This poses a problem because the dataclass constructor can contain arbitrary user code, potentially leading to unintended behavior or security vulnerabilities.
- Index Error in FSDP Wrapped Module: A bug in the PyTorch library is described where calling a Fully Sharded Data Parallel (FSDP) wrapped module with zero arguments results in an index error. This is due to an assumption in the code that at least one argument is always provided, suggesting that the code could be modified to support arbitrary numbers of arguments and keyword arguments.
- Lack of Gradient Support in `torch.linalg.lstsq`: The lack of gradient support for the `residuals` component in the return value of the `torch.linalg.lstsq` function is highlighted when using the `gels` driver. The `solution` component does have gradient support, prompting a query on whether this behavior is expected.
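A minimal sketch of the call discussed above; shapes are illustrative:

```python
import torch

A = torch.randn(10, 3, requires_grad=True)
b = torch.randn(10, 2)

result = torch.linalg.lstsq(A, b, driver="gels")
solution, residuals = result.solution, result.residuals

# The solution participates in autograd; the issue asks why the residuals
# returned alongside it do not.
solution.sum().backward()
```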
- Numerical Stability in `linalg_eig_backward` Function: A proposal is made to enhance numerical stability in the `linalg_eig_backward` function on GPUs within PyTorch by adding a small epsilon to the denominator in the backward formula. This addresses problems with unstable gradients or NaNs during backpropagation in physics-inspired machine learning models using `torch.linalg.eigh`.
- Illegal Memory Access in FlexAttention Module: A bug in the PyTorch library is reported where the FlexAttention module, when compiled and run on a CUDA device, encounters illegal memory access or device-side assertions. This occurs despite all tensors being contiguous, with the problem persisting even after attempting a workaround involving padding adjustments to the `rel_bias` tensor.
- Documentation Error in `torch.distributed.elastic.multiprocessing.start_process()`: The documentation for `torch.distributed.elastic.multiprocessing.start_process()` in PyTorch 2.3.0 and later versions incorrectly includes the removed `tee` parameter. It is suggested that the documentation be updated to reflect the current API by replacing references to `tee` with the `logs_specs` parameter.
- Incorrect Behavior with Non-Reentrant Checkpoints: A bug in the PyTorch project is described where using non-reentrant checkpoints in combination with ambient saved tensor hooks results in incorrect behavior. This is demonstrated by a test case involving tensor operations and gradient calculations that produce unexpected results when logging the pack/unpack hooks.
- Segmentation Fault in `copy_()` Function: A bug in PyTorch version 2.6.0 is described where the `copy_()` function fails with a segmentation fault when using Hierarchical Sharded Data Parallel (HSDP) in Fully Sharded Data Parallel (FSDP) version 2 on a 2-GPU machine. The same setup works in version 2.5.1.
- Silent Failure in `view()` with In-Place Modification: A bug in the PyTorch library is described where using the `view()` function combined with in-place modification fails silently when applied to a `DTensor`. This results in no changes to the tensor's values, as demonstrated by the unchanged all-ones tensor output when the `works` flag is set to `False`.
- Errors with PyTorch and NCCL on CUDA 12.8: A user experiences errors with PyTorch and NCCL installations on CUDA 12.8, where the NCCL version is incorrectly reported as 2.25.1+cuda12.2 instead of 2.25.1+cuda12.8. A "Cuda failure 1 'invalid argument'" error occurs during code execution.
- Need for Sharding Strategy for `aten.amax.default`: The need to register a sharding strategy for the `aten.amax.default` operator in DTensor is highlighted to address errors encountered with float8 rowwise scaling in both eager mode and vanilla TP. This is identified during the debugging of a related problem in the torchtitan project.
- Discrepancy in TransformerImpl Class Parameters: A discrepancy between the C++ libtorch and Python PyTorch implementations is highlighted, specifically noting that the C++ version's TransformerImpl class lacks certain parameters such as `layer_norm_eps`, `batch_first`, and `norm_first`. Guidance is sought on how to pass these parameters in C++.
- Bug in `OffsetBasedRNGTracker` Instantiation: A bug in the PyTorch project is described where the `OffsetBasedRNGTracker` is always instantiated with a CUDA backend. This causes problems when attempting to use other backends, such as HPU, due to the lack of support for non-CUDA devices.
- Illegal Memory Access in `ScaledDotProductEfficientAttentionBackward0`: A bug in the PyTorch library is described where an error occurs in the `ScaledDotProductEfficientAttentionBackward0` function when the input sequence length exceeds 46344 and an attention mask is applied. This results in a CUDA illegal memory access error.
- RuntimeError in Faster R-CNN with Deterministic Algorithms: A `RuntimeError` is encountered when running Faster R-CNN with PyTorch's deterministic algorithms enabled. This is due to the lack of a deterministic implementation for the `roi_align_backward_kernel`, despite setting all known deterministic flags and environment variables.
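A minimal sketch of the deterministic flags referred to above; `warn_only=True` is one way to surface operators (such as the roi_align backward) that lack a deterministic implementation without aborting:

```python
import torch

torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```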
- Performance Regression with Tensor Parallelism: A performance regression is highlighted where the model "meta-llama/Llama-3.1-8B-Instruct" exhibits worse latency when using tensor parallelism (TP) on a CPU setup with Intel's 4th Gen Xeon processors. This is compared to running without TP, and specific pull requests for the CCL and transformers libraries might be needed to address the problem.
- Bug in `WeakRefVariable` with `call_function`: A bug in the PyTorch project is highlighted where the `WeakRefVariable` does not utilize the most updated Python referent when `call_function` is executed. This leads to discrepancies between compiled and eager execution outputs, and it is suggested that the original Python referent should be checked each time `WeakRefVariable.call_function` is called to ensure correct behavior.
- Error in ONNX Export with `aten::_make_per_tensor_quantized_tensor`: An error is encountered when attempting to export a PyTorch model to ONNX opset version 11 using `torch.onnx.export`. This is due to the unsupported operator `aten::_make_per_tensor_quantized_tensor`, and despite attempts to resolve it by using different opset versions and custom operations, the user continues to face a "RuntimeError: ArrayRef: invalid index Index = 11; Length = 11" error.
- RuntimeError in PyTorch Profiler: A `RuntimeError` is encountered when using the PyTorch profiler in a loop to export chrome traces. This intermittently fails with an internal assertion error related to an empty Python replay stack, suggesting a potential bug in PyTorch's profiler implementation.
- Docstring Mistake in `replace_pattern` Function: A minor mistake in the docstring of the `replace_pattern` function in `torch/fx/subgraph_rewriter.py` is highlighted, where an unnecessary `sum()` operation is included in the pattern definition. This does not align with the intended functionality as demonstrated by the generated code.
- RuntimeError in `torch.nn.AvgPool2d` on CUDA: A bug in the PyTorch library is described where the `torch.nn.AvgPool2d` function fails with a "RuntimeError: integer out of range" when executed on a CUDA device with a stride of 2^31 or larger. It works correctly on a CPU.
- Discrepancy in GAT Model Output in ONNX: A significant discrepancy in the output of a Graph Attention Network (GAT) model is reported when converted from PyTorch to ONNX format. The differences in results are unexpectedly large depending on the input data, despite the expectation of only minor variations.
- Bug in `torch.export.export` with Batch Normalization: A bug in the `torch.export.export` function is reported, where attempting to export a convolutional neural network with a batch normalization layer on a GPU results in guard conditions that prevent successful exporting. This is due to constraints on the batch size that are not satisfied.
- Absence of C-Shim for `aten.grid_sampler_3d.default`: The absence of a c-shim implementation for `aten.grid_sampler_3d.default` in the PyTorch project is highlighted, resulting in the use of a proxy executor as a fallback. This may introduce some overhead, and the issue suggests adding the necessary c-shim by following a specific prior pull request.
- Runtime Error with ROCm/HIP Backend on AMD Radeon RX 7600 XT: A runtime error is encountered when attempting to perform GPU compute tasks using PyTorch with the ROCm/HIP backend on an AMD Radeon RX 7600 XT. The error "HIP error: invalid device function" occurs during the first attempt to allocate a tensor on the GPU, despite the GPU being detected.
- FIPS Compliance in Python 3.9+: Enforcing full FIPS compliance in Python 3.9+ is proposed by using ruff rule S324 to ensure that `hashlib` functions are not used for cryptographic applications. This requires adding `usedforsecurity=False` to all `hashlib` calls in the codebase and updating the documentation accordingly.
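A minimal sketch of the `usedforsecurity` flag mentioned above (available since Python 3.9); the hashed payload is a placeholder:

```python
import hashlib

# Marking the digest as non-cryptographic keeps it usable under FIPS mode
# and satisfies ruff rule S324.
digest = hashlib.md5(b"cache-key-contents", usedforsecurity=False).hexdigest()
```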
- Optimization of `torch.sort` Function: Optimizing the `torch.sort` function in PyTorch is proposed to significantly reduce GPU memory usage by allowing the indices to have a dynamic data type instead of the fixed 64-bit `torch.long`. This can be particularly beneficial for large datasets, as demonstrated by a reduction in peak and final GPU memory usage when using a boolean matrix example.
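A minimal sketch of the current behavior the proposal targets: the returned indices are always 64-bit, regardless of how small an index type would suffice; the input here is illustrative:

```python
import torch

x = torch.randint(0, 2, (1024, 1024), dtype=torch.bool)
values, indices = torch.sort(x.int(), dim=1)

print(indices.dtype)           # torch.int64 today
print(indices.element_size())  # 8 bytes per index
```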
- Documentation Discrepancy in `torch.distributed.init_process_group`: A discrepancy in the PyTorch documentation regarding the default behavior of the `torch.distributed.init_process_group` function is highlighted. The documentation inaccurately states that both `gloo` and `nccl` backends are created when no backend is specified, whereas in versions 2.6 and 2.7/main, only the `nccl` backend is created.
- Enhancements to `torch.compile` Programming Model Documentation: Enhancements to the `torch.compile` programming model documentation are requested, specifically including a debug_trace API for `torch._dynamo`, a more readable string output for `gm` in Jupyter notebooks, and a descriptive `make_fx` function with a `tracing_mode` set to Fake.
- Marking Build Job as Unstable: A specific build job is marked as unstable due to potential flakiness, as part of an experiment related to a pull request in the PyTorch project. Further context is provided in the linked pull request.
- Invalid Representation String of Meta Tensor: The representation string of a meta tensor in PyTorch is not a valid `tensor` call due to the use of an unexpected keyword argument 'size'. It is suggested that the representation should be modified to be an executable code snippet, similar to how concrete tensors are represented.
- Shape Function for Einsum Operation in PyTorch XLA: Adding a shape function for the einsum operation to the PyTorch XLA project is proposed to facilitate full code generation. This is currently hindered by the absence of this function in the shape inference header file.
- Enhancing GPUDirect Storage User Experience: Enhancing the user experience of GPUDirect Storage is proposed by integrating support for commonly used APIs like `torch.save`, `torch.load`, `dcp.save`, and `dcp.load`. This enables faster model checkpoint saving/loading and efficient use of GDS-compatible storage solutions, thereby avoiding CPU bottlenecks.
- Unexpected Error in `torch.utils.collect_env` on elementaryOS: A user experiences an unexpected error when running the command `python3 -m torch.utils.collect_env` to verify their PyTorch installation on elementaryOS 7.1/Ubuntu 22.04.5. This results in an AttributeError due to a `NoneType` object, despite following the installation guide and using `sudo` for package installation.
- Failure in Building `torch_cuda.dll` on Windows: A failure in building `torch_cuda.dll` is reported due to an unresolved external symbol error when linking `_cudnn_attention_forward`. This specifically affects Windows builds for the `wheel` and `libtorch` cases and requires attention from NVDA developers as it is potentially related to a previous pull request.
- UserError in Model Export and Compile: A user encounters a `torch._dynamo.exc.UserError` while attempting to export and compile a model from a GitHub repository. This is due to a data-dependent expression `Eq(256*u0, 256)` that could potentially be resolved by using `guard_size_oblivious`, as suggested by the error message.
- Adding `_capture_strategy` Field to ONNX Program: Adding a `_capture_strategy` field to the ONNX program is proposed to document the strategy used during its creation. This will help in identifying regressions when fallback strategies are activated.
- CUDA Out-of-Memory Error in `distributed_c10d.broadcast`: A CUDA out-of-memory error occurs when broadcasting a `torch.tensor(True)` using the `distributed_c10d.broadcast` function with two GPUs. This is potentially linked to the recent addition of CUDA 12.8 support in PyTorch's nightly build.
- IndexError in `_extract_arch_version` Function: A bug in the PyTorch library is described where the `_extract_arch_version` function in `torch/cuda/__init__.py` fails to correctly parse architecture strings for AMD GPUs, such as the Radeon RX 7700S. This is because these strings do not contain an underscore ('_'), leading to an `IndexError`.
- Potential Bug in `fx_passes/binary_folding.py`: A potential bug in the `fx_passes/binary_folding.py` file of the PyTorch project is highlighted, where the indexing for checking a convolution's bias appears to be incorrect. It uses `conv_node.args[1]` instead of the correct `conv_node.args[2]`, as indicated by a comparison with similar code in `efficient_conv_bn_eval.py`.
- Request for PyTorch Version Compatible with Blackwell RTX 5080: A request is made for the development and release of a new version of PyTorch that is compatible with the Blackwell RTX 5080 graphics card and CUDA 12.8.
- High Resource Usage by CPUExec in PyTorch Profiling: A user questions why the CPUExec component accounts for a high percentage of resource usage in their PyTorch profiling, despite having few `.device("cuda")` operations in their code. Insights are sought from specific contributors.
- Documentation Error in `register_forward_hook` Method: A documentation error in PyTorch's `register_forward_hook` method is highlighted, where the text incorrectly references a non-existent `torch.nn.modules.Module` instead of the correct `torch.nn.Module`.
- Inference Failure in Custom DETR Model: A problem with a custom implementation of the DETR model using a ResNet50 backbone is described, where the model fails to produce any detections during inference when the batch size is set to 1. This occurs despite working correctly with larger batch sizes, potentially due to issues related to batch normalization or the small size of the dataset used for fine-tuning.
- Runtime Error in `flex_attention` Function: A bug in the `flex_attention` function within PyTorch is described, where the compiled code incorrectly assumes an output tensor shape. This leads to a runtime error due to a mismatch between the expected and actual tensor sizes, particularly when the dimensions of the query/key and value tensors are confused.
- Error in ONNX Conversion with `nn.AdaptiveAvgPool2d`: An error is encountered during the conversion of a trained model using `nn.AdaptiveAvgPool2d` to ONNX format, specifically when the input size to `nn.AdaptiveAvgPool2d` is variable. Guidance is sought on resolving this problem.
- Segmentation Fault in `torch.ops.profiler._call_end_callbacks_on_jit_fut`: A segmentation fault occurs in the PyTorch function `torch.ops.profiler._call_end_callbacks_on_jit_fut` when a tuple containing a `None` value is passed as an argument. This specifically highlights a bug in version 2.6.0+cu124.
- Lack of Sharding Strategy for `aten.select.int`: A problem in the PyTorch project is highlighted where the operator `aten.select.int` lacks a registered sharding strategy. This causes a `NotImplementedError` during distributed tensor operations, and it is suggested that the DTensor module needs to address this by adding the necessary operation support incrementally.
- Feature Request for Stream Management API in NCCL: A feature request is made for a stream management API in PyTorch's NCCL process groups to address asynchronous communication challenges. This specifically addresses the "read-before-write" issue that arises when collective operations are executed out of order due to each NCCL process group operating on its own dedicated stream.
- GradScaler Issue on Intel Arc GPUs: A problem is reported where the PyTorch GradScaler does not function correctly on Intel Arc GPUs when attempting to train with mixed precision. It either produces a warning about CUDA not being available or throws a runtime error related to unsupported fp64 aspect when the "xpu" device type is specified.
- AssertionError in `register_sharding` with Keyword Arguments: An `AssertionError` is encountered when using `register_sharding` for a custom operation with keyword arguments in PyTorch. This is due to a mismatch between the number of input specifications and input argument strategies, as the function `unwrap_to_op_info` handles arguments and keyword arguments separately.
- Segmentation Fault in Triton Upstream on ROCm: A segmentation fault occurs in the cpp_wrapper component of the Triton upstream within the Inductor project on ROCm. This specifically happens when running a unit test related to dtype view conversion from float32 to bfloat16 on CUDA.
- Accuracy Problems with Cooperative Reduction Functions on MI200: Accuracy problems with cooperative reduction functions on the MI200 platform are reported when using ROCm. This is evidenced by multiple test failures in the PyTorch project, where tensor-like objects are not sufficiently close in value, exceeding the allowed differences in both absolute and relative terms.
- Accuracy Problems in `quantile` Operation on ROCm: Accuracy problems in the unit tests for the `quantile` operation on ROCm are reported when attempting to update Triton in preparation for version 3.3. This is evidenced by multiple test failures in the `TestInductorOpInfoCUDA` suite, where tensor-like objects are not sufficiently close in their values.
- Unit Test Failures in Triton Update for Version 3.3: Unit test failures are encountered in the PyTorch project when attempting to update Triton for version 3.3. This is specifically related to a "Cannot bitcast data-type of size" error occurring during the execution of a CUDA boolean sort test.
- AttributeError in `retinanet_resnet50_fpn()` Model Export: An error is encountered when using the `torchvision.models.detection.retinanet_resnet50_fpn()` model, where the user experiences an `AttributeError` due to a `Tensor` object not having an `items` attribute during the model export process with `torch.jit.trace`.
- Identical Values in `torch.randn_like()` on MPS: A bug in the PyTorch library is described where the `torch.randn_like()` function, when used with the MPS (Metal Performance Shaders) device, produces tensors with identical values along a given dimension once the tensor's dimensionality exceeds a certain size. This behavior is not observed on the CPU.
- ResourceWarning in `torch.distributed.nn.jit.instantiator`: A warning is generated by the `tempfile` module due to an uncleaned temporary directory created in the `torch.distributed.nn.jit.instantiator` module. This occurs when `torch_tensorrt` is imported, leading to a `ResourceWarning` about implicitly cleaning up a temporary directory upon program exit.
- Transition to Public ECR Images for Docker Builds: Transitioning the project's Docker builds to utilize public Amazon Elastic Container Registry (ECR) images instead of Docker Hub is proposed. This is motivated by Docker Hub's impending rate limit changes and the potential for more reliable and faster image pulls within AWS.
- Exposure of NCCL API `ncclGroupSimulateEnd`: Exposing the NCCL API `ncclGroupSimulateEnd` at the Python level in PyTorch is proposed to enable users to perform runtime estimation of communication operations.
- LoweringException in `flex_attention` with `torch.compile`: A failure in the PyTorch library is reported where attempting to compile the `flex_attention` function with dynamic settings using `torch.compile` results in a `LoweringException`. This is due to a `TypeError` that prevents determining the truth value of a relational expression.
- Compilation Error with `Dropout` in `SequenceParallel`: A compilation error occurs when attempting to compile a PyTorch model using `Dropout` parallelized with `SequenceParallel`. This results in a runtime error related to tensor conversion, despite documentation suggesting support for `Dropout` in `SequenceParallel`.
- Memory Allocator Lock Contention in Inductor-CPU: The problem of memory allocator lock contention in templated GEMMs within the Inductor-CPU project is addressed. Threads compete for memory allocator locks during the creation of per-thread local accumulation buffers in an OpenMP parallel region, leading to significant performance impacts.
- Disabled Test on ROCm Platform: A disabled test, "test_custom_hook_custom_stream" from the PyTorch project, is failing on the main branch specifically on ROCm platforms due to a "HIP error: invalid device ordinal." It requires attention from several developers and contributors to address the device ordinal issue.
- Disabled Test in `TestHSDPWithCustomHook` on ROCm: A disabled test named 'test_custom_hsdp_all_reduce_hook' within the `TestHSDPWithCustomHook` suite on the ROCm platform is failing on the main branch of the PyTorch project. This involves several contributors and stakeholders for resolution.
- BackendCompilerFailed Error in `torch._check` Function: A bug in PyTorch is described where the `torch._check` function fails when used with `.item()` followed by a `select` operation. This results in a `BackendCompilerFailed` error due to a data-dependent expression that cannot be guarded.
- Bug in `SETUP_WITH` Implementation in Dynamo: A bug in the `SETUP_WITH` implementation within the Dynamo component of the PyTorch project is highlighted. The current order of operations deviates from the CPython documentation by pushing `__exit__()` onto the stack after creating the block stack, leading to a crash when a graph break occurs.
- Bug in `torch.compiler.allow_in_graph` Decorator: A bug in the PyTorch project is highlighted where decorators like `torch.compiler.allow_in_graph` do not properly handle the reuse of function identifiers. This leads to unexpected behavior when a function is deleted and another function is defined with the same name.
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 66
Summarized Issues:
- Test Failures on ROCm Platform: The PyTorch project has encountered multiple test failures on the ROCm platform, leading to the disabling of tests such as `test_attention_vs_linear` and `test_tracker_multi_group_eager`. These failures were linked to changes in the main branch and specific pull requests, prompting discussions on restoring stability and considering reversion if necessary.
- Performance and Compilation Issues: Several issues in PyTorch relate to performance discrepancies and compilation problems, such as slower execution with `torch.compile()` compared to decorators, and excessive compilation times with `reduce-overhead` mode. These issues highlight the need for optimization and better handling of specific data types and operations.
- Bugs in PyTorch Functions: Various bugs have been reported in PyTorch functions, including issues with `torch.func.vmap`, `torch.onnx.dynamo_export()`, and `torch.cholesky_solve`. These bugs often result in errors or unexpected behavior, necessitating fixes and updates to ensure correct functionality.
- Serialization and Export Challenges: The PyTorch project faces challenges with serialization and export processes, such as errors with nested classes and difficulties exporting models with dynamic shapes. These issues require workarounds and improvements to support various use cases and configurations.
- Test Disabling Due to Failures: Multiple tests in the PyTorch project have been disabled due to consistent failures on the main branch, such as `test_real_imag_view_lazy_complex128` and `test_flatten_nonview_xla`. These failures are documented with examples and linked resources for further investigation.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Opened This Week: 181
Key Open Pull Requests
1. [test] 2: This pull request, titled "[test] 2," aims to address and fix an unspecified issue in the PyTorch project, as indicated by the placeholder "#ISSUE_NUMBER." It includes a series of 16 commits, each labeled with the commit message "tc," which suggests a focus on testing or test-related changes, and it has not yet been merged.
- URL: pull/147470
- Merged: No
- Associated Commits: 03e29ef5bda553434e588ee1f041ba8a71031e5e, 176793c7127a2d6502c9834abf2bb21ffe3f638e, 7b5b15cf410a1a364379bd4b8886d22ecec39a77, bd6cdf29c6c6b19e3bdaaf8cf9c5e87832db83be, 1e50cb13d90a2242db56663a8cebaaaeb2859c40, a324ca8d10153dec91e8419a4ecca844bd2e7c39, 2de748a5fcecfeda346d9e7c6269f732e92218d3, 0a2e06b882a35ac6aa3c23f59ce6a7edf560a93f, ed111e570352372ac450c4e4009ca89e08ce981a, bcad8d972d94be48e570da0202be1c2ef654cb37, 29c02d0016c792f6699a88e6e15cf1df4a0dda2f, 32dc214adb0ad68052f7447402211a3f0d6f350a, 4e88f9938d14a2262f3084c70ded2153e50f4033, 15c409c5dd9a73303bc2ed6e6fccd4cbf2814027, 573b2b535cc3f230e0a2fb7aa09cb950df28f598, b8b9756fa1e4e58af566e2ad69bf69650e660616
2. cpp_wrapper: reduce memory usage by removing unneeded temporaries: This pull request aims to reduce memory usage in the `cpp_wrapper` of a GitHub project by refactoring `reinterpret_view` calls to return temporary RAII tensor objects, thereby making the function's callers responsible for saving the handle when necessary, and eliminating unnecessary temporary tensor handles to align memory usage with the default inductor mode.
- URL: pull/147403
- Merged: No
- Associated Commits: 01424669b96688c50c271cdca6e8f52a8185bf1d, 67582f8e3028a688d489ab644155862868468250, eb4f8d8ffe626fde86a511f865f60267ee6d33d8, 20c1ad66b21f5ccac2d16853d9e0b953af060d3b, a6f57b847cc9624cac7ffa3e81322aafb31e7eaa, 1c1b4c9270563664983046631b1165d8be8c5b7c, aae0fd8856f2c67dd3ebad7824189c74aae7f15d, 3806a189aa7f4e2124616a7126fa3fa7cf973388, ebf67aeef582fb9e259a9f022f399e9764b71c80, 91ceb7cf1363b3c4590117f5aacb7467ed1b504c, 4d5edaf67e80ca9ca36d301af1ded13967a04790, 35f9d714fc57cfda513c69b1c79c7ee47c5fda4e, b6bf52e56404744de5abd1a2dcfc95271e655835, 7deb4da44f7a2f8b16de60bb680bb65d7062806e, 4ebac8719de768e93930aae722ee10d0ab99bb9e
3. Make Tensor.set_ validate storage_offset when sizes/strides are unchanged: This pull request aims to enhance the `Tensor.set_` function in the PyTorch library by adding validation for the `storage_offset` parameter when the sizes and strides of the tensor remain unchanged, as part of a series of related updates tracked through the ghstack tool.
- URL: pull/147354
- Merged: No
- Associated Commits: d55833c1f1f9078c7033b7bf966a7a8e04c1a4c0, 9677c958516940094f547641847194926baaeb64, 56c4c38b90973f59f428e67e42b5442ca602cf53, 2db7ebf942472df2e40757fd08368aa4db594be4, 649ddb82ef37da61f9ed0ac8323606f03e549915, 439a693f1a5e89b2a37bcf015d49602f1d24f430, 5eb7fb4b69c1453028421355083fcfc6fa6f0c96, 4d08c805b890c3c3125ab2b07b7eb75cbf881328, 6be5a4b237feba23c6efd8be06d43028b7401beb
Other Open Pull Requests
- C++ Standards Compliance on Windows: This topic involves enforcing C++ standards compliance in the build process on Windows by adding the `/permissive-` flag. This change resolves issues like assigning string literals to non-const pointers and aligns the project with Visual Studio's default settings, improving code quality and fixing related compiler errors.
- Input Validation in PyTorch: This topic focuses on enhancing the robustness of the PyTorch codebase by validating inputs to prevent potential buffer overflow issues. The pull requests address validation in functions like `_nested_view_from_buffer`, ensuring safer code execution.
- Error Message Improvements: This topic covers efforts to improve the readability of error messages in PyTorch by auditing and updating "unimplemented" sites. The changes ensure that messages are clear and understandable, enhancing user experience.
- ONNX Module Enhancements: This topic introduces new strategies and features to the ONNX module in PyTorch, such as the "draft_export" strategy and direct utilization of ONNX operations. These changes aim to improve tensor specialization and integration with the existing ecosystem.
- CUDA and ROCm Support Enhancements: This topic includes introducing blockwise MXFP8 support for CUDA devices and enhancements for ROCm MX-FP8 matrix multiplications. These changes improve matrix multiplication efficiency and validation for specific data types.
- Runtime and SACEstimator Modifications: This topic involves modifications to the RuntimeEstimator and SACEstimator in PyTorch, addressing issues and including tests for fake utilities. The changes also fix default arguments, bindings, and resolve linting issues.
- Masked Fill Implementation: This topic covers the implementation of the `masked_fill_scalar` function as a shader, moving existing functions into a new header, and introducing `StridedTensor` and `ConstStridedTensor`. These changes facilitate the implementation of `masked_fill`, addressing a specific issue.
- Dynamo Component Enhancements: This topic introduces generic graph break hints and error message improvements to the Dynamo component. The changes include multiple updates and contributions from various collaborators.
- ROCm and XPU Enhancements: This topic includes updates to the `ck_conv_template` code generation for ROCm CK kernels and improvements to the XPU oneDNN context manager API. These changes enhance flexibility, maintainability, and usability.
- Inductor Component Enhancements: This topic covers improvements to the Inductor component, including handling mismatched outputs and optimizing heuristics for outer loop fusion. These changes enhance performance and compatibility with various operations.
- Overflow and Buffer Issues: This topic addresses overflow issues in various functions, such as `checkInBoundsForStorage` and tensor slice calculations. The changes include implementing fixes to prevent crashes and incorrect tensor returns.
- Experimental Features and Tests: This topic introduces experimental features like delayed compilation and new tests for components like CacheBench. These changes involve multiple updates and collaboration among contributors.
- Tensor and Data Type Handling: This topic addresses issues with tensor and data type handling, such as converting non-standard boolean values and handling mismatched outputs. The changes ensure correct operations and improve compatibility.
- Export and Serialization Enhancements: This topic focuses on improving the export process by eliminating unbacked renamings and introducing new passes for recomputing bindings. These changes enhance compatibility with de/serialization.
- CUDA Graph Partitioning: This topic involves implementing a CUDA graph partition feature, building upon previous work related to inductor graph partitioning. The changes include recording mappings and handling metadata and input index mutations.
- MKLDNN and oneDNN Enhancements: This topic includes migrating from oneDNN Inner Product to MatMul and introducing an `is_available` API for `torch.backends.mkldnn`. These changes improve functionality and allow users to check backend availability.
- Sparse Tensor Validation: This topic addresses the validation of sparse tensors constructed via a legacy constructor, highlighting issues like size inconsistency and storage size calculation overflow. The changes refine the solution for these issues.
- FSDP and FlexAttention Enhancements: This topic includes enabling FSDP tests on XPU devices and addressing error messaging in the FlexAttention module. The changes improve testing and guide users experimenting with small tensors.
- NCCL and TCPStore Enhancements: This topic aims to enhance the NCCL communication library to support uint64 tensor types and improve error handling in TCPStore components. The changes address gaps and improve error message specificity.
- Build Process and Compiler Updates: This topic involves updating the build process for XPU and enabling AddressSanitizer support for CUDA. The changes improve compatibility and collaboration among contributors.
- Documentation and Code Refactoring: This topic covers documentation updates and code refactoring efforts, such as correcting docstrings and renaming options for clarity. The changes enhance readability and maintainability.
- Testing and Continuous Integration: This topic focuses on testing the continuous integration process and addressing issues with test scripts. The changes ensure compatibility with new versions and improve the debugging process.
- Attention Mechanism and Quantization: This topic addresses issues with the attention mechanism for tensors with more than four dimensions and introduces a total quantization target for the P1 INT16 model. The changes ensure proper functionality and include a test plan.
- Error Handling and Logging Enhancements: This topic involves enhancing error handling in various components and introducing context managers for logging. The changes improve error message clarity and logging capabilities.
- Memory and Performance Optimizations: This topic includes optimizations for memory-efficient attention in ROCm and performance improvements for integer matrix multiplication on macOS. The changes enhance performance and provide a performance comparison.
- Type Annotations and Code Introspection: This topic focuses on enhancing type annotations for dynamo methods and refactoring function signatures to improve type safety and code introspection. The changes involve collaboration among contributors.
- Cache and Compilation Enhancements: This topic introduces a caching mechanism for save plans and optimizes the integer matrix multiplication kernel for Metal Performance Shaders. The changes reduce computational costs and improve performance.
- Bug Fixes and Issue Resolutions: This topic addresses various bug fixes and issue resolutions, such as fixing the torch.polygamma() function and correcting the RNN example code. The changes ensure consistency and correct functionality; a short polygamma illustration appears after this list.
- Experimental and Test Submissions: This topic includes test submissions and experimental attempts, such as implementing end-to-end control plane flex_attention. The changes involve collaboration and are open for review.
- Backend and Device Support Enhancements: This topic covers enhancements to backend and device support, such as enabling SDPA on the XPU backend and updating merge rules for oneDNN. The changes improve compatibility and functionality.
- Kernel and Operation Enhancements: This topic involves implementing a Metal kernel for MPS binary operations and enhancing the torch.compile function. The changes improve performance and ensure correct operation handling; a minimal torch.compile sketch appears after this list.
- Error Message and Logging Improvements: This topic focuses on improving error messages and logging capabilities, such as updating error messages related to missing build systems and enhancing logging for AOTI. The changes improve user guidance and debugging.
- Optimization and Performance Improvements: This topic includes optimizations for block radix sort and the matmul_small_brute_force_tunableop unit test. The changes enhance performance and reduce execution time.
- Type and Data Handling Enhancements: This topic addresses enhancements in type and data handling, such as introducing input vectorization in elementwise kernels and supporting unique user kernel names. The changes improve control over naming and data processing.
- Testing and Validation Enhancements: This topic focuses on enhancing testing and validation, such as introducing a new test for the CacheBench component and ensuring the accuracy of the layernorm CUDA backwards pass. The changes improve test coverage and accuracy.
- Build and Compilation Process Enhancements: This topic involves enhancing the build and compilation process, such as enabling the Triton XPU build process on Windows and updating the pybind11 submodule. The changes improve compatibility and collaboration.
- Error Handling and Bug Fixes: This topic addresses error handling and bug fixes, such as fixing the inductor/test_kernel_benchmark.py script and correcting the detection logic for clang++. The changes ensure correct functionality and prevent unexpected failures.
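As referenced in the MKLDNN and oneDNN item above, the summary mentions an is_available API for torch.backends.mkldnn. The following is a minimal, hedged sketch of how such an availability check could be used, assuming the helper is exposed as torch.backends.mkldnn.is_available() as in recent PyTorch builds.

```python
import torch

# Hedged sketch: assumes torch.backends.mkldnn.is_available() is exposed,
# as described in the pull request summary above.
if torch.backends.mkldnn.is_available():
    print("oneDNN (MKLDNN) backend is available on this build")
else:
    print("oneDNN (MKLDNN) backend is not available")
```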
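Below is a small, hedged illustration of the torch.polygamma() function named in the bug-fix item above; it only shows the public API (the n-th derivative of the digamma function, evaluated elementwise), not the specific fix.

```python
import torch

# torch.polygamma(n, input) evaluates the n-th derivative of the digamma
# function elementwise; n=1 gives the trigamma function.
x = torch.tensor([1.0, 2.0, 4.0])
print(torch.polygamma(1, x))  # trigamma(1) is pi**2 / 6, about 1.6449
```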
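The kernel and operation item above also mentions enhancements to torch.compile; the following minimal sketch illustrates only the entry point itself, not the specific changes in those pull requests.

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

# torch.compile wraps a function (or nn.Module) and JIT-compiles it on
# first call; subsequent calls reuse the compiled artifact.
compiled_f = torch.compile(f)
print(compiled_f(torch.randn(8)))
```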
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. All other pull requests are grouped based on similar characteristics for easier analysis.
Pull Requests Closed This Week: 264
Key Closed Pull Requests
1. Fix SEGFAULT when None arg was passed in GraphContext.op(..): This pull request addresses a segmentation fault (SEGFAULT) in the PyTorch project by fixing a bug in the GraphContext.op(..) function that occurred when a None argument was passed, as tracked in issue #145261; a hypothetical sketch of the triggering call pattern appears after the key pull requests below.
- URL: pull/145265
- Merged: No
- Associated Commits: 74c591dd9ed707bd50bca86a5883810102123c51, 1ebec42a552bb117f105facdfff345bb7e6948ba, 529cff0ec7d089468625f232541cbbf6b6113670, 28b7c5fef09cc201f56009c91b856728be4f3ebf, 5548df9541fe76be615c380a18329bc5c2bc3f56, 5c678ea5f303db95c9c59a4a36c8d5cf553d57cd, d8fd5404f24bcbaaaa25fc9eda9bc2eb4966a079, c0e93e46529b4b512b4a9b57ec7266943922da28, d3bc9a0c70ba3c3e263650b00767d4bb3a7e1082, ff743d42efc7bdb01505cea3e6c173e2f4766def, f695f624b2d152c8ed2eb7b8245b5b1d4f329caa, 72d566f36498d70da1e26bae95a56afde888def3, 3056caacf040478428cc123bc3ec289be5dbb9b4, bb8dd29aa831edfb12789e894b9e6f9a9bdc24bb, 36ae184c4464bb21da3dce6e8f87eca18f83ae4b, cd8a42d4cf5ae076a082984b60e4f4c96e2a3fe6, 999d7e3d0cd803c331ac88b05806e67d701d53ed, 891451c9fbb25d437dff680ceacf7018045300ac, fa7cc162055827005a9a5c64c2dff9abdabb3113, d2b0ed86820f0353b34f1f310c8f1a49236e26b1, f843ff7b97a2fabc1f31cb5a2888b18d0577cb62, 260e6700d3d0d80ac77ff4eba92e49e47b1320c3, 8330e02a9b60089bb58c79bfaf6b72fc01c38009, 7b85b7cbf3d95e3328f2a4961add686b0d6139d8, cb52b7731104a9981194bb6aae710f371e5283ac, 55531118fafcc55bc17a6ece7a555e7a7341a4dd, 959bdcfb09612182b52491ca6308d0654ea45668, fdd0570f38185bd6d939b168cbe6d555032c5065, 16bec733680f1e08607b3df814e6f6816e02dddc, 52681ab258a891d30a7a8d4828bf9e6c96a7264c, 1944ac6fbb9e89ae3b6040fe456b2e8125b48e6f, ff5039cb474df01308164c4e521494d5a7b6c43b, 38b54a0d626175203bbd5fd55a647a93c800f631, b88c6014da2a3d4ccbd8cdb6781f98ef5b209464, d8d27556f28eb30b44fe75b9661e417c56539328, cd367141d76f990f7d33ab31d1ffc8cfa21b11c0, 16a76b1780e1e4b59deda850115c3b3432d6d3fa, a4bf85e2961e98f72fb611881c9dbe225d7f5059, a2cc4d869a335c1dcb0dff3750a7945360635c6e, 276b234a888d246285b3333bb498e8beb38b59eb, 107be69f0c19c6388ff30f5ea59b14a53cbfb691, 05d80e717f1b5916a4402820801330bd3c3af3bd, 1824799048358a6a90ee92c30dbcdf36ae5ef8a2, fab8ff9f6f77b7d019077ac370304e21925bfc56, e1111f25e746eb1a3d180a5de84bfcd38ceb9c2c, 2b999e971b6576920eb97d0c9f8c13e08d1a4371, 3e0f847df1bdbaf9aa7b640e9b2cf2834b7a6827, 2b12011385e6ebfe8a7b64abb2b0ee14171f6c9f, c30f665aca710019c958c66bb5f3e6f371b954d0, 2898b5c46586d580f19fc9d8a95a973f09e28d84, a48fe7077378fe63248d1ea0edca91da0bf49a32, 8cb85af956a639df169f5012f14e4b908cbe1caf, 7ad8747f801372f1d60d29604466508217fd5ab1, 406809642a302bf5bd31bc0c9e74ffe17d79399a, 6d4df2d0f724bf999e1d5f712f9b8a7b9f8c3c3d, e6b00b0a180c78674b0b0b74ba0652978663d423, 3377fa1a9bab2db1da2fb678d09cbf0a834f51ea, 3d2cac4d58dc4cbb2a80ccbccbccef2ec651cab9, 025c4feb83bce11e47fe58862ae1eb665fb05276, 4f554b2b7b57070d89b0955b9cde0ffd7a023563, 8256619ef0fde425174d236ec3e6fcba40834898, f4e693aacf8d8c0ad10dc5ae2ba8671f586c981f, 6e7d1887f289506f9e5ec6d92a2096da817843eb, 9cf68285794325e6bd15ffc2001132ea0c59794e, 41fdc2bc77c8b9a0dad60e9a7705b8946f540c86, 847271bc0b0e46e3b858fd022ba7209228d9b33c, 9c315270b149953864b9c09d9e0eba33312a3e80, b6188e3a0e2b32a2c1719ded0c412167e036289a, 936278cd04e6cedef3ede278a06acc8ffb87718b, 38bf3f0af679ced7de5fb8e1bca636a7c76e92cb, 8344307caeb9e25394e9c8736a3b986abba2dbd4, 121b1b00edb3a29267dbd5e8dcaeafd424619a9b, b798a2ea459de2b980f11c1d9ab5768d0808850a, 90e4907bdbb8846c93d9039884d03e97d5574a43, fbb0279af8f8a505ec4e4cf63870c3c342f56061, faf69ee2e2ebadaa40783da4756dd4ed6ff657fd, c454463e2040b513de188a1d66116c7e19e414c2, 81badd5e5ba00405b43e212e12fa0e6820b5b33d, 2fb90ebff7914ecb652f895294457745943e09a2, a853f8b28cb9bb0716305711e1c3c940dde957ee, 1c063cc16c83c62176991f35bac332b77c8979d5, d6f20e458bb9f2c94f7935a97f0c1b93c1ee4c00, da50264a0d6515346c6224fed966015f38fc93a1, a8ce35c5e0e10899df34d41381c9061d2b4c61a3, 
acda04aedc89e43569ebc62f84a62f4dca380014, ee3b47d21a67307dccdf0cb03a266653d481b89f, f8a467c7d513dbb510db316f3cf47d3b884e9c3b, 9a94cfdf35442e6220699b24702b2326b2e4a3b0, 33161d25ab9fd449e3706a4e62aa319d4fa2d44b, 7a30a3b6e5d24db3db0b1a4bb89f273c808114c3, 5350895d3f50550da81d60fc3fa89da265f297e9, 91ddd1bb5ea9be00f59e0d61e50312c946a7d6dd, ac0a78b519da01b75d0a3361c7bc27da0bf92772, 295a452aeee601034df73712b952e4ad65c6ae59, f11ef9186a308624bfde672dd6a6163cbd23fdd6, ed521c354917de722557bad9611a16b518363701, 6c1d9e1a3637366d158851df7c455cab4767e9d3, bbf5f91c7e46b61a8f936ec0d4debdbebf4c1ddb, faa7cdb27ca884d93226a01919907566eb44566d, 58f635b24c8796ce686a8543e4a31039d0feb471, 891b0c2b1aced3792626e087e500182e34fa57d0, 73d3757ca1b5390999f9955f41979fac41240986, 65098d4829234e4251201d545fdb9da20e7c6c3b, c232fb84d86a9a70b81aba18d6cd5c2aaaa72348, 635618997bbdf6811b51afc206dfb55045cbba70, a8618dbf703347891dbde688450500233b622a68, bfa9fbde64d6694b39a29b934ed75ac2a4bdf7ec, 4fd187649b1d9650cd8015573ad3d85b773f4d70, 94b71380f3bb1f571dd977374cfe46f701c09ae4, 500ecd6f077ace9d61545cae4fa6957b9e422ea1, 9badff27465045d2c09b855f9a5b0ede7066b814, 6c31a9fa4fc7c2293310d7dc39783599ec016981, d134e13b23a9af242ade33ecfb9d742acd301006, 0fd7bb67eda25404f80a60840f1d8bddba805433, e2b74da8ef531613d877e0b830627d5e15becbad, 00117e01cd5424fe61330b8d34efea0b2c800d2b, 2b20757653a6a28005152fca012a6fab45eed3cf, 5588711859a5443c30eaef97d3c60b96571f67be, 1f9e47569aa1483dc394dc3a64d2c3060453813a, b55171ce15035af52615f7b7b30748fb96754abb, 54c66961de039870b7fc9eee5518f693c82f1b02, cb4ab95d95f5f8bdc048e376e1a9023b74341752, 9debdd1e298bfb8c57c9c194797719aa5dee6b62, ff69f44795cdff8e3065438f24d4500e51e2d088, 3c5ddf81a3b02adea5ffc80e9394880ff0c4c3d1, 69305841adce6820d47803d9e8d5d4b381f659f6, af03207f65b2e9582c612e52e084ca4155162847, 509d44c70298533c260be0288f5846d780ad21b8, 3da275ef927ae385dfed1910738aba9b95be506b, ecd5586394e56ce16227428d4675f4b2b1e39b8c, d63f2f4cf79e68df9b08f1e7123819746375660d, 6c91bb5bbc364a88b48f02b7923866bd4781db58, 6c856fdc48394314cbf7d2daf19c9a9916909a54, c06f4e9d151baa12ce221cb2230d1408132c58c6, ff3837c5783bc0bcd1af77b7a0227fff99fba48a, d822c097ab39e28ff3dc7b6af32f32e0918c387e, cad53ab196b3a7f9d70128f8aef940ec91b366b0, bc4aadab625e20910b794526ba18ff7081eca2ad, 916c76ca354d7c3617b43ae6956420f218637d29, e9aaf8259dc862a6f6ed37f5320af45cef3393c2, 707026ce570de8a0f8971fc32935eabc30147500, b6995e02134325f1fe139032ec2d783ff6e7f97d, c125784e9a482e1d3973c1bb170277599afc8d4a, 81eed11ba35f2c90cb854033edfe0ba936c96e26, 6749c93756e753bb3923a330bdd53ea182d791c3, a968551a7a96a843ec5dbc1831bfa9bd584976de, 478a3d632655233ca4cccfe588968abd495f4a35, ce891b301a00112c05fe2b5d0f112ca23138550c, 8c0dd71159d19d830502b16c7daeab371804c7e6, 9a84bf0b433296378b56ef5ada93aadec84e27e4, ecb3bc67f64c6be8fdf0abaad92abb658ca96b4c, d79d13ba419e6d5a0e4521b7ace3eb19625975be, f96ff8749fc889b2465bf5121ba3dc0c922de621, 093edc07214b160ad9ff1a31762eda5b449ac5cb, 9b8cb195f756c6a9425dde79d25a23392fcd64a3, ef7b08e77be8201e4c10b42c3126522551a50227, 273425cf3fd0b55b55a40a866b60a2b1c353a9fa, a80b6710422e98309d07ed2fd5aada9784a01ca4, d684663b18c9300abcccab534c6660d0d11a52ae, b643f345c9ce768f8e8f76ea251e375e3bf200cd, b4ee32ebac8ea36614b453bcf222e4072840aa37, 5cf3af652be0f0f8706f39f5c36b011c6d879906, dd503f226124acae865646d3b2a94ede76f78d23, ffd745b87358b64695c8c9367c217114ea7fd7a8, 85ffed4d3eae87ff15051c03f8a69993f53d5333, dee02e2ae36c8f3f7f8b46fcd8dc6ef381b8b17c, 924bb678690a76101f0faa683e030b03191ac459, c18d6bdcb103acd6e4677bdec4927a85bb73b811, 
4aed3114e5d7af57d4aaccb36a267e9b93ce5d5c, 31439c5ab444c8e9d42b43dce79b6e9ad56c4a39, e360fc5b199a73af89d068419fa61a5184702991, dd4e45394e36e8ebdff0ca7ca6c939e8b008ade0, fb5b026696bef0effb401b99eda7b82e9da7cd16, 9ec0f625ddda784e72857ea084f83841387e58a2, b253d088c8f343682df43fa51ee6ca4f888fbf9b, 4d7e097baa50ff2a52a1e61f2b84b5b2b5220765, 7cb389ec7c831f3ab55a7e97d0bb1fb172facd7e, 9340f55c4b86da5cc2ca1c3f22bb5126af59c784, b0fd1368efba0ab58896b53bf196dcd5df9d4d21, 9493d8cec58e2143664f9210ccf744eba25c14e8, 38c6e3f56d6246ace11141a7fab565f0b9842e41, aba148f7e17ee5e1cd40767440b630fed5b9f7b3, ff1e3327a9b6d4e1052d8de0a86ca7a88ba0badc, c7f0e9947199c8d5e6c41064db6952b24a20e742, 4386eabee9f9afb73e482738b00a819aa37a5bd2, 346fa1c75bfb207d1cee5559d85763b67a64d0da, 8094a74f318bfbcb4902e566a90400150c3e45a7, 6aa28553894b938ef2aa4ba2380189f730a929bb, 18028e650dd8b6c953db55d4f0f5e572dc7c899c, e3f205d9a9f46de243146b8395d60b368928006a, 404b79709a079a0c3b6367e86c5b6cfa6fc02d63, 197def12448b884ce136bd83a95fef835221940d, a5af71e74ffa38283ccc10c3e80fc5c8d4da7a21, c2239e92a421edb8ef9b7ab93419226709255fd1, 8d4ec69839fcd1d93e11eb857aeef42ad2f3d69d, b8f87b16c385764b34bcd5786f0cde4e8295eb61, 0f48eea37043935bc7621ca11126adb96d50bb0f, 921c20c7c8b67157c58ed7d0c2307282e804b706, 0b1dc1213054454937319ca98ffabb6550c6d536, 7bdb52a1760faaee243e9eaa255ba5187145c325, 71c54752da66c9e3235497b5a9dec493662a8d02, 8dd8f2a16fd7c9b01e89e5839ffafd11e2d9ec1e, d9c1eae1487f5042693b26ec255a120e10be8455, 896095fbf26a63dd375b31147fa676abdf61c644, d30599fb7a56367586346d743fcf4f5d1954c904, dd1bac7a0e7a26ec823842f2fc38a4b18cb364a1, 3838b8b89ba963d0425ca2128dfb2d86286d84f8, a3f90674e3665f5d888e962fe3ad6758fc1df1dd, d47523621fe3f882a6b9f6d5af656e25abe5ce5f, 5a8ab2b7cc092efd49518997f3ebd5a0b6e86a68, 6ae9bd0fd9acec0fec9c75bd6c1d3635561395cf, 1c22e4a705747604425a23c9337f8558e7fe22fb, 65852eb767952fa3c04ae30f8023de8cf6c02a44, c89d54a5ec05531055822778d05cc1acd061bec9, 772a228c88856c7535be0f224364ff7c91bcf3ff, 5962de4bfff45a4ac17f724c593aef9e33736521, c2fee9c4662e20d176ac72c943463b0d5a552fd9, 5218ddb265969b97a7a2f5c1e617073fe3c7461b, c0e14736fc9c62fa479769e5f5a3fc43992d784d, 78ec889eadf2639515c39c94187a58274cd2e006, cfb0d89b92dc3b73e35f140d4fa62b6825220a0a, a55c0f1a1e0dd4d6aa9df15590ffa0431a7c8222, aae36699aed0de2f04f65115c5303f890848ed04, 92faef39fef41d6422a598530a99ee5b06430c57, 12f270c86ec921983cf3bcc4d656e34669e32014, e84fa3e96d431136b7b178ee5d0479d5ad47a066, 5c3c64e82e3188c7458867c490c3adc3a59b326c, a8d324c0603a9f7e0e58f0cbdc5fdfa5df94a497, 7601d10ec1c6addca9c7bdedad12d4d2e3e937ee, 71b7f13215fba0cae674598935f89636f817f269, 292ff5279ac363dc9d8e70a1180cfc337cc6cc2f, 1e3b901d5ca5b14af469824153d3717fd601bb96, a7266b4d007d96759a59947354371c6e061c2148, aa47fdb069d1ff1fee6a62078554df7bfce77dca, 06742c79f11ef838db1e141028a524aac0b91ab3, 06203ad46ec5f4d7fbd20a521d89096851ce4e37, 24fc38ad1902daa427f0690fe170b520be8183e1, 4938310734aff031c199dce4aa89e92aebd4b3de, 1f7360b537aee313b3e8381241cae55de0734321, f6443982eb93d2f0d94ff7b67fdc68babced9ce3, abe8c36cafc13e9a822cfe178fd2cee3544ddb6a, 452116fab192e1002f57c6c3640528f3e23bd37d, 3959b37eb0b8834b23b3c81962ef7ccbb6f5f925, fb130e4c8389964f29199c9b1a3a57d223aec8e5, b4519bb2466bbb8a93c49abb3f195691e3be930d, 340aa59b085d9167e86c3e2a75a775649861487e, 217e8666ae55f854cb9d6f2932d111ec9d3ab7ea, 513b1cf8ddf74a92fb6f310587dd897a38444d38, ce51a7411716dd1cb2749192ec91debe72ff6f41, f2e7ad6ef2511e02ef5979cf15fa905eff3dd712
2. [ONNX] Bump torchlib opset to 22: This pull request aims to update the ONNX (Open Neural Network Exchange) torchlib opset to version 22, involving multiple commits that address various aspects such as migrating torchlib into PyTorch, updating tests, fixing issues, and making several code adjustments, although it was ultimately not merged.
- URL: pull/146510
- Merged: No
- Associated Commits: d3904c97c91a8a951e10ebc55b2d2c05716b3e54, 837c03511e05c97a5e68a8f7ff90a3cb414fa3b2, caea1897fe02a510eaf0cd56a16b73d9ee32cc5e, 31fc59a9a857adbd07725f3d98ccd8fbde47cb0a, 8c2c5f7235ed77a8d83a338ac0f7d290fdbe2648, 77ed810ccfb4f0f72b48b7ee63c242c1a8858cb8, 00179f73ea9d4530e10ffa0d8bb125fa0fa5cebb, 3cf2eccd62252c890d81e545fed493be3885a422, 4504992f3911729200ef5a5296408e48986bd867, f61eb8bf9dde8b56b9b221b8834945d4ded78b25, 07c7cd3e3ffa7dee2be8f8c86933754d40057558, c263b4d243b245f46ecbbeb31db181318cf25de6, 4927f2fb50f7ff5e1ae964bda0ece4e2c9598c8f, c673722865468eb312a50dd600e8551ffd21a616, 40153a1c991292bf33ca51f3276d0623045d157c, 98e32d21399d9fdaf5f02076b9abdae73e492d39, 702cc78771730a319133503a84d96d7de9b9d3a1, f8f12752493e7b87a7dd03386c23e10bc7cc9a4b, fcb0fce6ac09a10b352f22038f16a84f6fa12d56, 1be91d0f6c696699de76fa31290b613885d363cf, 6009477a78d006e1b41cc6154f68031c52e1f5a5, b8a630a952b58eed83e942cd638055f950bb415b, f590b1a120c6df7a863bd5f2c0862404b5869988, 24b7d88cd7774031f6a1c38c857407d94a35421f, ad46eabe5e7e346da5f42af9dc9c10a44960f133, 4289ec9e1f359535898b1c7febbeca66b7a105ad, 1d488d660c09f6622ed9658d913ed99317189d8f, 985795c20442be5c4b5193c54d91b0290b7066c5, c9ce73eb47a1551eb15983e6830c36f6ed53cbc7, 8cfef16a50f7929c1b1607ba58d0788561669388, e592ade95b744096035856f7b94d330467bc2163, ed814b548760729664aaf781fa8960445a1a2b1b, 95b337a9a969e0f0cdfc02be7f83558935005cb0, 0f7ac12c5be926045dad0f9786ce7c1777f59d21, 8e5743e58e90b13738c1cd07b5bbf65277f76277, 1575d2a986d6d0bd9886453324e2ad1fe45fb17f, 50f0cd2421ebdb3376c8ecb0b28eba0355be4cc2, cdeac69916b21b593e45d8862077165b98474f1e, 90973912b1a91a464254e2085dc2e49f62de2032, 13bcf716d28284f6cf12bb8f528d9e42c9e3964b, b3aab2a83fa6882cf66fabdb864247984f0b34fe, fc259e12823eb3fbbfef9e6c0c2eb28fb4dfb6d7, 805066c9db9a0c67e39515baf1242fdf53f0fc8a, 125c9fb5553d190a34fb25a4363855a4163233fe, da3d02b3603c9a7b4ec1462c35964fc3ce49867f, bba250468a5b6029516d53423f59739a776b6e5b, 2b5bb0ffc66271dfa2e07598288b0a6d4302acf3, 61c9814fb743bc1c0be42cc1e1ae50e561c8bce4, de4210e0e03363dbb855f41a73211d5b42bc659b, d44c0a09afbd90c3386ea4045db436d896cc925b, 713e5f31e730c8e070d8d85e140cc3366c2abbe4, 67a32e35d5f55260d54c273bc38da80038912d4a, 0637eaec23d4de217b6cd0cfcaf52f098858a95b, 4c0ad07020534ef82459306838c1419ab34f98c6, 798383bcf810dd5861eb2a3ce77bc2dd853cca53
3. [Intel GPU] qconv_pointwise.binary XPU support: This pull request aims to enable support for quantized fusion operations, specifically qconv+add and qconv+add+relu, on the Intel GPU backend by registering the operation via the schema TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary") and allowing the signed int8 data type during operation lowering, while reusing existing pattern-matching code from the x86InductorQuantizer.
- URL: pull/135189
- Merged: No
- Associated Commits: 5f711c706717606019478c361bf5543bf23c32fa, 08cde3a39e2ced53a3a4930dcdb629f1edf6dc29, e80985ec2d3e6f55e82e0082ba44c2e909671109, eb1e56c51fafd24452cab1303eb2f72b52d44921, 2fc0e194c2e78f719f9c06acf3354d152749c2fd, e9e3d286d0fcb02a567cae218037c26ce1e0e813, 9e9063118cf74e4ca6b60ac9c695a1bcea7a0987, d7500eb4910fc969f24cdd7fa3b3c325e6303498, 1dfc42d3abd60a221a08e13f57d38d1778e68bb8, 0a39e2023e870627651447a0e6b358334bfacc8f, 2631240afd19e8c2a08dd83d3b02b6832c962d54, 2e6d70d4a339f3d869835b7fa2809c2fe8a2ecb7, e8221e2ff90530071a3312ddc4741386e36e4585, c90b23d03b614e4ad9b1c51d48fe52e32521f028, 97b781e02045b8f5a7f8dba3cc92b8e89006dafd, fef7f32f30073c7b570fd9b891cd52262925c94e, 0b2271be05a5ba015f10ab5dbae427e91449b84e, eefe995d96b7255a9883bbc00349596f3bf58978, 281d9f16c16cf612e529b0cc57ac89b49b71ee2f, a33910bc17ae923ad0d9ec7ba2c09a86bac9de49, d468eccf1d873da541b7b1232e6477edcb7e7a66, b128dd372c3e291d349b92bc84d36d3c8fd28dad, ec34e31a03721f7e535d8605933a9bf90f39c5be, cb30e5a98ac28747f82bdaa01bcc28ad12facbc1, 7c4de3a718d4de03b3975313435fbf55678976a3, 83fb67b26baddd80be914ac4aa80bd67a962e2a9, ac9fd32b3d4d59c80bbcb556d2daad5ed0cbcaef, e339f5335f3e8d0b6429fc632bcec75079b04bc6, 89086c91ea4444cde0512df9dd9cd01d45b86ee1, 2574bc9cb21ed4ed002eaa3243d43490206ad3af, 3a1f769277a75100420d4529d9a20fc68b3a1953, 4f05b786112b76f51ce899ca9f2597b16fb6a941, e6115120ce67dfd3cd98c5d1fceb6b370483cd9f, 5a84844df59934f640124333579a862fb875de75, 768520aa155d189c890acee07ae5a98b67add902, 346d646515bb81918899af645ffbbd2313ecb0ca, 439788af5a44e8e9a7cf5fe5e42d8f47a16f7270, b27a8cc2371514c0e87d68115d8e11f60d384630, ea400594d6e3aed79952178d5f29d6a0970dcca3, 65c14f00ec4f1bd9a0dcbd9e29b3747d0cf4146e, e6322e81a63e163fd225baf4100519ca812b46ec, 122340faa57dd47af51d6618f74a64fb89aea83a, 4a36fbcfab5f6986b28ff8212dd0aed9ccdca328, c5edf493b2c7819b4ed1a44a57dd037f9455cb42, b44bf1756c469408e67e80d2da2fb42c106ed0db, 3b1ee856bc29ef51d333ac12dd99f0e9105bd646, ed8ad3f73866e6b8199644c31521304f57d1144b, 2fa8347b882aa0202a24a6ab666ecd9a5420be40, 5011d90c92af9074e8f32df16d56fd7692bada65, c34e6f18b6e2d716f1a76ed52a74256e56857853, 161c192bc6a3a98750363076011e1aa99783da4b, 34d44776782573e4bf59960adea006c12f5be7f5, a1a326c54a9f016c4157a0628afe567ac9099947, 6a0e37a5e88aa532b0f45e1005b9234fe6565331
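For the first key pull request above (the GraphContext.op SEGFAULT), the sketch below is a hypothetical illustration of the call pattern that could trigger the crash: a custom ONNX symbolic function forwarding a None argument to GraphContext.op. The function and operator names are invented for illustration and are not taken from the pull request.

```python
# Hypothetical sketch only: the names my_symbolic and custom::my_op are
# invented. A custom ONNX symbolic function receives a GraphContext `g`
# and builds graph nodes via g.op(...); forwarding a None positional
# argument to g.op(...) is the kind of call reported to crash before the
# fix. Such a function would typically be registered via
# torch.onnx.register_custom_op_symbolic.
def my_symbolic(g, x, optional_bias=None):
    return g.op("custom::my_op", x, optional_bias)
```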
Other Closed Pull Requests
- Intel GPU Backend Enhancements: This topic covers the enablement of various operations and optimizations for Intel GPUs in the PyTorch project. The pull requests focus on enabling onednn.qlinear operations, quantized fusion of qlinear+add, and implementing the SDPA operator using oneDNN, with improvements in data type support and performance optimizations for Intel GPUs.
- NCCL and CUDA Updates: This topic involves updates to the NVIDIA Collective Communications Library (NCCL) and CUDA support in the PyTorch project. The pull requests include updating NCCL to version 2.25.1 for newer CUDA versions and addressing related issues, as well as adding support for CUDA 12.8 in the libtorch nightly build.
- Torch Export and Dynamo Enhancements: This topic covers improvements to the Torch export functionality and the PyTorch Dynamo component. The pull requests address issues with the export backend, enhance graph break messages, and improve the handling of dataclass instances in Dynamo and FX.
- MX-FP8 Data Type Support: This topic focuses on the introduction and support of MX-FP8 data types in the PyTorch project. The pull requests aim to add support for the Float8_e8m0fnu and Float4_e2m1fn_x2 data types across various components, including CUDA and CPU kernels, and enhance device property handling; a hedged sketch appears after this list.
- Cutlass Backend Improvements: This topic involves enhancements and fixes to the Cutlass backend in the PyTorch project. The pull requests address issues with GEMM template data types, forward fixes for mixed matrix multiplication, and the introduction of subprocess tests for autotuning.
- Inductor and XPU Backend Support: This topic covers the enablement and optimization of the Inductor component for the XPU backend on Windows. The pull requests focus on resolving unit test failures, enabling the XPU backend, and addressing issues with the fft_c2c test case.
- Documentation and Typing Improvements: This topic involves enhancements to the documentation and type annotations in the PyTorch project. The pull requests aim to improve the clarity of method parameter descriptions, add type hints, and address issues with type stubs and annotations.
- ONNX and Export Enhancements: This topic covers improvements to the ONNX module and export functionality in the PyTorch project. The pull requests introduce a framework for ONNX operator test data, enhance the export API, and address issues with dynamic shapes and serialization.
- ROCm and AMD GPU Support: This topic focuses on updates and optimizations for ROCm and AMD GPUs in the PyTorch project. The pull requests address issues with efficient attention mechanisms, optimize the TopK operation, and update CK kernel codegen templates for ROCm.
- Cache and Performance Improvements: This topic involves enhancements to caching mechanisms and performance optimizations in the PyTorch project. The pull requests introduce a new benchmark for PT2 caching, address cache-related issues, and optimize the handling of non-constant weights in BMM operations.
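Relating to the MX-FP8 item above, the following hedged sketch probes for the Float8_e8m0fnu dtype; it assumes the type is (or will be) exposed as torch.float8_e8m0fnu, which may not hold on every build, so the check is guarded.

```python
import torch

# Hedged sketch: float8_e8m0fnu may not exist in a given build, so probe
# for it before using it.
if hasattr(torch, "float8_e8m0fnu"):
    scales = torch.ones(8, dtype=torch.float32).to(torch.float8_e8m0fnu)
    print(scales.dtype)
else:
    print("torch.float8_e8m0fnu is not available in this build")
```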
3.3 Pull Request Discussion Insights
This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
-
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating tension.)
- This GitHub conversation involves several users discussing the implementation of a feature, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
- [dynamo] Save/restore system random state more carefully [attempt 3]
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Increasing tension.)
- This GitHub conversation involves multiple users discussing a technical issue, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as more users join, with some attempting to mediate while others exacerbate the situation.
-
- Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange.)
- This GitHub conversation involves multiple users discussing a series of commits related to an "export method." User1 initially provides a solution, which User2 critiques, expressing dissatisfaction with its effectiveness. User3 attempts to mediate by suggesting improvements, but User1 responds defensively, leading to a tense exchange. The tone shifts from collaborative to confrontational, with User2 and User1 exchanging terse comments.
- demo myst_nb with compile tutorial
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating tension.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
- [Docs] Add OpDTypes.any_common_cpu_cuda_one
- Toxicity Score: 0.55 (Defensive responses, critical feedback, escalating tension.)
- This GitHub conversation involves a discussion between several users, where username1 initially proposes a change, and username2 provides feedback that is perceived as critical. Username1 responds defensively, leading to a back-and-forth exchange that escalates in tension. Other users, such as username3 and username4, attempt to mediate and offer constructive suggestions, but the tone remains strained. The conversation is marked by frustration and a lack of consensus, with users expressing dissatisfaction with the progress and communication style.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 199 | 62 | 2 | 220 |
anijain2305 | 273 | 59 | 3 | 77 |
guilhermeleobas | 337 | 16 | 2 | 34 |
jansel | 223 | 32 | 2 | 119 |
zou3519 | 60 | 19 | 20 | 246 |
justinchuby | 141 | 23 | 8 | 142 |
benjaminglass1 | 241 | 14 | 0 | 41 |
Skylion007 | 46 | 20 | 3 | 205 |
eellison | 96 | 9 | 7 | 160 |
cyyever | 138 | 49 | 0 | 49 |