Weekly GitHub Report for PyTorch: May 05, 2025 - May 12, 2025 (12:02:32)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and FP16 support on X86 CPUs. Notably, the release marks a shift away from publishing on Conda, with a focus on using official wheel packages or conda-forge, and introduces a backward-incompatible change by setting `weights_only=True` as the default for `torch.load`.
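As a quick illustration of the `weights_only` change, here is a minimal sketch (the file name is a placeholder; this is not taken from the release notes):

```python
import torch

# Saving and loading a plain tensor/state_dict works unchanged under the new default.
torch.save({"weight": torch.zeros(2, 2)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # weights_only=True is now the default in 2.6

# Checkpoints that pickle arbitrary Python objects need an explicit opt-out;
# only do this for files from a trusted source.
state = torch.load("ckpt.pt", weights_only=False)
```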
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [CXX11ABI] torch 2.6.0-cu126 and cu124 have different exported symbols: This issue highlights a discrepancy in the exported symbols between the cu124 and cu126 versions of torch 2.6.0, specifically noting the absence of the symbol `_ZN3c105ErrorC2ENS_14SourceLocationESs` in cu126, which has caused compatibility issues with the flash_attention library. The discussion also raises questions about the dependency of flash_attention on this symbol and the implications of the CXX11 ABI migration on the stability and compatibility of PyTorch extensions.
- The comments discuss the unexpected symbol export differences due to the CXX11 ABI migration, suggesting smoke tests for exported symbols and questioning flash_attention's dependency on the missing symbol. There is a consensus that extensions need to update their binaries, and suggestions are made for improving PyTorch's testing and release notes to address such issues. Some users share their experiences and potential solutions, while others seek clarification on installation and compatibility problems.
- Number of comments this week: 17
- Performance Regression nightly 03/11→03/12, on nanogpt speedrun: This issue reports a performance regression in the nanogpt speedrun observed between nightly builds from March 11 to March 12, with a noted increase in runtime from approximately 1469 seconds to 1487 seconds. The user suspects that the regression may be related to a Triton upgrade and provides detailed runtime comparisons and diffs to support the investigation.
- The comments discuss potential causes of the regression, with some attributing it to a Triton update. There is a debate on whether the observed 1.2% performance difference is significant enough to be actionable, given the stable setup and reproducibility of the results. Further profiling indicates a 2% regression per iteration, possibly linked to changes in flex attention, and additional investigation is ongoing.
- Number of comments this week: 11
- Loading sparse tensors in a DataLoader raises CUDA initialization error since 2.5.0 if you have already initialized CUDA: This issue describes a problem with loading sparse tensors in a DataLoader, which raises a CUDA initialization error when CUDA is initialized before the data loader loop, starting from version 2.5.0. The error seems to be related to a validation check for pinned memory in sparse tensors, which was introduced in a recent pull request and causes issues when the CUDA context is improperly copied during a process fork.
- The comments discuss whether the issue is a regression, with some clarifying that the problem is related to sparse tensor validation and not pinned memory. Suggestions include exposing internal checks to handle bad forks, and there is a consensus that the validation check is necessary for consistency, despite the challenges it introduces with CUDA initialization in forked processes.
- Number of comments this week: 10
- Pytorch 2.7 crashes when using flex attention with torch.amp: This issue reports a bug in PyTorch 2.7 where using flex attention with torch.amp.autocast causes a runtime error, specifically when the `enabled` parameter is set to `True`. The problem persists across multiple versions of PyTorch, from 2.5 to 2.7, and is reproducible with a provided code snippet, but it does not occur when the `enabled` parameter is set to `False` (a sketch of the pattern appears after this list).
- The comments discuss attempts to reproduce the issue, with some users unable to replicate the crash on similar hardware and software configurations. There is a suggestion to check the version of Triton being used, as an outdated version might be causing the issue, and it is recommended to use PyTorch-triton 3.3 for compatibility with PyTorch 2.7.
- Number of comments this week: 7
- Process never ends when sending tensors through multiprocessing queues in Python 3.12+ on macOS: This issue describes a bug encountered when sending tensors through multiprocessing queues in Python 3.12+ on macOS, where the process does not terminate as expected and requires manual interruption. The problem appears to be linked to the resource tracker process initiated by Python, as the issue does not occur in Python 3.11 or on other operating systems like Ubuntu, and is resolved by changing the multiprocessing start method to "fork," although this is not recommended.
- The comments discuss attempts to reproduce the issue, with some users unable to replicate it while others confirm the problem on different macOS machines. A stack trace is shared, and the issue is identified as specific to Python 3.12.10, not occurring in 3.12.9. Potential links to recent changes in Python's handling of the resource tracker and subprocess modules are noted, suggesting a possible cause for the hang.
- Number of comments this week: 7
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically affecting the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a significant speedup in processing time, as demonstrated by the provided testing code and performance benchmarks (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached `cuda_utils.so` file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a `tmpfs` permission set to `1777`, and the problem is highlighted by an `ImportError` indicating a failure to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, which currently has approximately 1,500 files that are not formatted according to the UFMT standards. The process requires removing file names from the `exclude_patterns` in the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature request is to reduce the size of model files, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's file size from 6.7MB to 5.6MB by manually removing these debug files.
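The MaxPool2D proposal above is easy to illustrate: with stride 1, a large pooling window can be expressed as a composition of smaller ones, since a max of maxes over overlapping windows covers the same receptive field. A minimal sketch (kernel sizes chosen for illustration, not taken from the issue's benchmark code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# Reference: a single 5x5 max pool with stride 1.
ref = F.max_pool2d(x, kernel_size=5, stride=1)

# Composition: two 3x3 max pools with stride 1 cover the same 5x5 receptive field.
composed = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)

print(torch.equal(ref, composed))  # True
```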
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 115
Summarized Issues:
- ImportError with Torch and CUDA on Linux: This issue involves an ImportError encountered when using Torch 2.7 with CUDA 12.8 on Linux, related to an undefined symbol in `libnvJitLink.so.12`. The problem does not occur on Windows, and a workaround involving CUDA 12.8 is suggested to resolve it.
- Symbol Discrepancy in PyTorch Versions: A discrepancy in exported symbols between cu124 and cu126 versions of PyTorch 2.6.0 is highlighted, causing compatibility issues with the flash_attention library. The absence of a specific symbol in cu126 raises questions about dependencies and CXX11 ABI migration implications.
- Inconsistent Model Export Behavior: A bug in PyTorch causes inconsistent behavior when exporting a model with a `nonzero` call followed by `grid_sample`, with the CUDA backend throwing a data-dependent error. This requires additional manual checks, unlike the CPU/MPS backends where no such checks are needed.
- Performance Discrepancy in Quantization Methods: A significant performance discrepancy is noted between `torchao` and `torch.quantization` dynamic quantization methods. The former results in a 1% metric drop on GPU, while the latter leads to a 35% drop on CPU, prompting an inquiry into the cause.
- Feature Request for TorchRun GPU Specification: A feature request is made to allow users to specify GPUs in TorchRun, as the current method using `CUDA_VISIBLE_DEVICES` is not intuitive. An option like `--bind-devices` is suggested to improve usability and reduce errors.
- Performance Regression in Nanogpt Speedrun: A performance regression is reported in the "nanogpt speedrun" between March 11 and March 12 nightly builds, with runtime increasing from 1469-1470 seconds to 1486-1487 seconds. The issue is potentially linked to a Triton upgrade.
- Intermittent Crashes During Mypy Stage: Intermittent crashes occur during the `mypy` stage of the `lintrunner -a` command, generating a Python traceback 5-10% of the time. The error does not typically reoccur upon rerunning, and a suspicion of `ruff` modifying files is noted.
- Pipeline Parallelism Bug in PyTorch: A bug in pipeline parallelism causes process failure if input to a stage does not produce gradients for all stages. This affects conditional or mixture models and suggests a need for different asynchronous communication operations.
- DTensor Operation Support in Dynamo: The need to remove hardcoded support for DTensor operations in PyTorch's Dynamo is addressed. The current implementation is brittle and hinders caching, suggesting a more generic approach using the `flat_apply` HOP.
- Clarification on Build Stage Function Usage: Clarification is sought on the correct usage of the `build_stage` function in `torch.distributed.pipelining` with `DistributedDataParallel`. The documentation example may be misleading regarding wrapping a stage module.
- Thread Safety Issue with torch.cuda.use_mem_pool: A thread safety problem is identified with the `torch.cuda.use_mem_pool` API due to the non-thread-local nature of `captures_underway`. This leads to potential memory allocation conflicts across threads.
- Failure of Inductor-Periodic ROCm Tests: Inductor-periodic ROCm tests in PyTorch have been failing since April 10th, with ongoing efforts to resolve the problem. A related pull request aims to address some of these failures.
- Improving Clarity of Traceback Frames: The need to improve the clarity of the final traceback frame in error messages generated by `torch.compile` is highlighted. The current frame lacks context about the function's purpose.
- Runtime Error with wrap_triton Function: A runtime error occurs when using the `wrap_triton` function with a Triton kernel in eager mode. The function only supports kernels annotated with `triton.jit` or `triton.autotune`.
- Documentation Update for wrap_triton Function: An update is needed for the `wrap_triton` function documentation in PyTorch. The current explanation is confusing regarding the necessity of registering as a custom operation.
- CUDA Memory Trimming in DeviceCachingAllocator: Incorporating CUDA memory trimming into the DeviceCachingAllocator is proposed to address GPU memory oversubscription on Windows. This involves registering a callback using the `cudaDeviceRegisterAsyncNotification` API.
- Forward Mode AD Documentation Enhancement: PyTorch's forward mode automatic differentiation (AD) documentation needs training examples, particularly for RNNs. This would illustrate the practical application and benefits of forward mode AD.
- Flakiness of Lint URL Checks: The 'UNSTABLE Lint / Link checks / Lint URLs / linux-job' in PyTorch is flaky. Improvements such as implementing retries or checking only links from git diffs are suggested to enhance stability.
- Runtime Error with Cudagraphs and Transformer Blocks: A runtime error occurs when using cudagraphs with individually compiled transformer blocks. The error is triggered by accessing a tensor output overwritten by a subsequent run.
- Segmentation Fault in max_unpool2d Function: A segmentation fault occurs in the `torch.nn.functional.max_unpool2d` function when executed with a specific script. This is similar to a previously reported issue.
- Failing Unit Test in PyTorch: A failing unit test titled "inductor_cpp_wrapper" is reported under the configuration cuda12.6-py3.10-gcc9-sm86. It is marked as unstable while further investigation is conducted.
- Performance Regression in CycleGAN and pix2pix Models: A performance regression is observed in CycleGAN and pix2pix models using AMP with multiple threads. The regression is attributed to a specific commit.
- Tracking Upgrade Process to Version 12.8.1: The upgrade process to version 12.8.1 is tracked, focusing on updating Docker images/builds and Windows AMI. Related pull requests and key contributors are mentioned.
- HIP Environment Detection Bug: A bug is reported where PyTorch fails to detect the HIP environment on ROCm when initialized by CuPy. This suggests a conflict or initialization order problem between the libraries.
- Behavior of aten._scaled_dot_product_efficient_attention: The `aten._scaled_dot_product_efficient_attention` function returns a Log-Sum-Exp tensor with padded sequence length. Clarification is sought on the necessity of this padding.
- Removing Redundant Type Aliases: The removal of redundant type aliases for `_device_t` in favor of `torch.types.Device` is proposed to enhance consistency in PyTorch's typing system.
- DTensor Placement Propagation Bug: A bug in DTensor placement propagation for the `slice` operation is reported. The sharding propagation rule fails due to symbolic integers, causing an assertion error.
- Test Failure in PyTorch: A test failure is reported in `TestNestedTensorOpInfoCUDA.test_compile_backward_matmul_cuda_float32`, where a backend compiler error occurs during the backward pass on CUDA.
- Improving DTensor Support for Dynamic Shapes: Challenges and potential solutions for improving DTensor's support for dynamic shapes are discussed. The focus is on handling symbolic integers in metadata and caching logic.
- Dummy Forward and Backward Passes in FSDP2: The need for dummy forward and backward passes in FSDP2 is addressed to maintain SPMD execution across all ranks, preventing job hangs due to masked activations or empty inputs.
- Avoiding as_strided for Non-Contiguous Reshaping: The use of `as_strided` for non-contiguous in-place reshaping of tensors with unbacked symbols is discouraged due to data-dependent errors. A revisit is suggested once the codebase is more accommodating.
- Refactoring MegaCache Component: The MegaCache component is being refactored to make it generic, enabling the registration of external plugins' caches and eliminating specific cache logic.
- Program Crash with torch.lcm_ Operation: A bug in PyTorch causes a program crash when using `torch.lcm_` between a large int32 tensor and an int16 scalar. The function should either complete successfully or raise a proper exception.
- torch.load Function UnpicklingError: A bug in `torch.load` prevents deserialization of `datetime` objects, resulting in an `UnpicklingError`. This is due to a change in the default `weights_only` argument behavior in PyTorch 2.6.
- 2D Convolution Operation Failure on CUDA: A bug causes a 2D convolution operation using the `int8` data type on CUDA to fail due to the inability to find a suitable engine. The operation works correctly on CPU or with `float32`.
- Enhancing PyTorch Dynamo for functools.lru_cache: Enhancements are proposed for PyTorch Dynamo to fully support `functools.lru_cache`, addressing limitations where only the underlying function is traced (see the sketches after this list).
- lintrunner init Command Failure: The `lintrunner init` command fails with the `--take FLAKE8` option due to an unsupported use of pip's `--user` flag. A virtual environment is suggested as a workaround.
- Gradient Discrepancy in All-Gather Implementations: A gradient discrepancy is noted between two PyTorch all-gather implementations, raising concerns about which produces the correct autograd-compatible gradients.
- A16W4 Quantization on XPU Devices: Enabling A16W4 quantization on XPU devices within the torchAO framework is proposed to optimize memory consumption and inference speed for large language models.
- FlexAttention on Intel GPUs Using XPU: A proposal to enable FlexAttention on Intel GPUs using XPU is discussed, aiming to address performance bottlenecks in large language models.
- Illegal Memory Access with CUDA Graphs: Capturing multiple CUDA graphs using multiple GPUs results in illegal memory access during replay. Only buffers related to the last captured graph are retained.
- Inconsistent Output in F.linear Function: A bug in the `F.linear` function causes inconsistent output when performing matrix multiplications with varying dimensions using zero-padding under bf16 precision.
- Request for Split Softmax Function: A request is made to implement the Split Softmax function in PyTorch to address transformer models forgetting system prompts during long text processing.
- Incorrect Alias Assumptions in gen_alias_from_base: The `gen_alias_from_base` function incorrectly assumes inductor-generated output is an alias, leading to erroneous results when regenerating the alias.
- Runtime Error with Flex Attention and Autocast: A runtime error occurs when using flex attention with `torch.amp.autocast`, causing a program crash. Disabling autocast avoids the issue.
- Precision Problem with AMP and torch.compile: A significant precision problem and unexpected float32 overflow occur when using AMP with `torch.compile`, leading to low precision and overflow issues.
- Process Termination Issue with Multiprocessing Queues: Processes do not terminate as expected when sending tensors through multiprocessing queues in Python 3.12+ on macOS, potentially due to a resource tracker process.
- Device Mismatch Error in Export Function: A bug in the export function causes a runtime error due to device mismatch between CPU and CUDA during tensor operations, as the embedding layer is not moved to the appropriate device.
- Incorrect Handling of cuda.Event in Dynamo: A bug in PyTorch's Dynamo treats `cuda.Event` objects as compile-time constants, leading to a runtime error when calculating elapsed time between events.
- register_constant Function TypeError: The `register_constant` function does not work with simple types like enums unless they have a non-default `__eq__` implementation, raising a `TypeError` during export.
- Fake Tensor Leakage in Non-Strict Export: A bug in PyTorch's non-strict export process fails to detect and prevent fake tensor leakage, as demonstrated by a model pipeline logging a fake tensor in its buffer.
- Assertion Error with Fake Tensors in Export: An assertion error occurs when exporting a model using `vmap` for vectorization, indicating a problem with handling fake tensors at the pre-dispatch level.
- CI Infrastructure Robustness Testing: Two scenarios are executed to test the robustness of PyTorch's CI infrastructure, simulating errors in HUD processing pipelines and AWS EC2 instance creation API failures.
- Incorrect Zero Output in torch.ldexp Function: The `torch.ldexp` function produces an incorrect zero output when multiplying a float16 tensor by a power of two within the representable range, suggesting an implementation flaw.
- Link Check Failure Due to Stack Overflow 403 Error: A link check failure occurs due to a 403 error from a Stack Overflow page, potentially due to rate-limiting or blocking of automated access.
- Mixed Precision Casting Issue in FSDP2: A bug in FSDP2 prevents proper handling of mixed precision casting for tensors within dataclasses, unlike FSDP (version 1), leading to a runtime error.
- AttributeError with functools.partial in Export: An `AttributeError` occurs when exporting a model with a patched forward method using `functools.partial`, as the `partial` object lacks a `__code__` attribute.
- Atomic Operations for Global Amax in Triton: Investigating the use of atomic operations to compute global amax values in Triton code generation for float8 quantization, as atomics outperform the current reduction-based approach.
- Introducing is_known_contiguous Function: A new function, `is_known_contiguous`, is proposed to replace `is_contiguous` in relevant areas of PyTorch, handling cases not known to be contiguous.
- Inconsistent Sizes in torch::unique_consecutive: A bug in `torch::unique_consecutive` with a custom CUDA allocator passes inconsistent sizes to `malloc` and `free`, potentially leading to memory corruption.
- Heuristic Increase for CUDA 12.6 Test: A failure in the `test_c10d_nccl` test necessitates a temporary heuristic increase from 1.5x to 1.7x when replacing CUDA 11.8 distributed jobs with CUDA 12.6.
- torch.nanmean() Function Error on CPU: An inconsistency in `torch.nanmean()` behavior is noted, where it operates correctly on GPU but fails on CPU with a misleading error message.
- torch.arange() Function Discrepancy: A discrepancy in `torch.arange()` behavior is highlighted, where GPU returns an empty tensor for an impossible range, while CPU raises an exception (see the sketches after this list).
- Error Discrepancy in torch.batch_norm: An error discrepancy in the `torch.batch_norm` function is noted, where an error is triggered on CUDA-enabled GPUs but not on CPU when `running_mean` or `running_var` is set to `None`.
- Missing Documentation for torch.segment_reduce: The absence of documentation for `torch.segment_reduce` is highlighted, questioning its performance compared to `torch.scatter_reduce`.
- Floating-Point Exception with NNPACK: A floating-point exception occurs with NNPACK for convolution on systems with disabled SMT, linked to a calculation error in NNPACK.
- CUDA Initialization Error with Sparse Tensors: Loading sparse tensors in a DataLoader raises a CUDA initialization error in PyTorch 2.5.0 and later if CUDA is already initialized, due to a validation check.
- Segmentation Fault in test_flex_attention.py: Running `test/inductor/test_flex_attention.py` results in segmentation faults and memory corruption errors, particularly on Intel Sapphire Rapids CPUs.
- torch.nn.functional.avg_pool2d Documentation Discrepancy: A discrepancy between the documentation and actual behavior of the `kernel_size` parameter in `torch.nn.functional.avg_pool2d` is noted, prompting a request for updates.
- Runtime Error in Qwen2 Model: A runtime error is triggered by an internal assertion failure in the `matmul` operation within the `eager_attention_forward` function of the Qwen2 model.
- Failure of tensor.view() with Dynamic Shapes: The `tensor.view()` function fails with dynamic shapes, despite the size of the first dimension being inferable and an assertion check confirming the condition for reshaping.
- Deferred Runtime Asserts Feature Ineffectiveness: The `prefer_deferred_runtime_asserts_over_guards` feature does not function correctly during end-to-end compilation, as necessary assertions are not added to the graph.
- Export Function Assertion Error: A bug in the export function causes a program to be incorrectly exported without raising an error, despite a runtime assertion that should fail given the input tensor.
- Compilation Error in Windows Inductor Tests: A bug in PyTorch's generated code for Windows inductor tests creates a zero-size array, resulting in a compilation error with MSVC.
- Overflow Error in softshrink Function: The `torch.nn.functional.softshrink` function throws an overflow error on CUDA with a `float16` tensor and a `lambda` value exceeding `float16` limits, unlike on CPU.
- Inconsistency in torch.clamp_min Function: An inconsistency in `torch.clamp_min` behavior is noted, where CPU raises an overflow exception for float16 values, while CUDA converts them to infinity.
- Inductor Expected to Produce Triton Kernel: A discrepancy is noted where PyTorch inductor is expected to produce a Triton kernel, but eager mode is called instead, requiring a codebase update.
- UnboundLocalError with torch.amp.autocast in Export: A bug in PyTorch causes an `UnboundLocalError` during model export with `torch.amp.autocast`, obscuring a `RuntimeError` related to device propagation.
- Floating Point Exception with torch.fmod: The `torch.fmod` function causes a "Floating point exception (core dumped)" error with a very large negative integer divisor, suggesting a proper exception should be raised.
- Avoiding Unnecessary Recompilations with Strides: The challenge of avoiding unnecessary recompilations in PyTorch when handling tensors with strides equal to 1 is addressed, suggesting marking strides as unbacked.
- Inconsistent Error Handling in cosine_similarity: An inconsistency in `cosine_similarity` error handling is noted, where CPU raises a RuntimeError for type conversion overflow, while CUDA returns NaN values.
- Error Handling Discrepancy in torch.addcmul(): A discrepancy in error handling between CPU and CUDA implementations of `torch.addcmul()` is noted, where CPU raises a RuntimeError for integer overflow, while CUDA does not.
- ONNX Export Failure with AutoencoderKLCosmos Model: A failure in exporting the AutoencoderKLCosmos model to ONNX is reported due to a shape inference problem with ConvTranspose operations.
- Post-Mod Feature in FlexAttention Module: A feature request is made to add "post_mod" support in the FlexAttention module to enable post-softmax or head-mix variants of MultiTokenAttention.
- Inconsistent Results with F.instance_norm in JIT: A bug in the PyTorch JIT compiler causes inconsistent results with the `F.instance_norm` function when using custom running statistics, compared to eager mode.
- Symbolic Condition Handling in eval Function: A problem in the `eval` function is reported, where a symbolic condition involving `sym_or` is not handled correctly, despite similar logic working with a direct logical `or`.
- Floating Point Exception in native_channel_shuffle: The `torch.native_channel_shuffle()` function crashes with a "Floating point exception (core dumped)" error when called with a large integer parameter on an int32 tensor.
- Segmentation Fault in torch.lu_unpack(): A segmentation fault occurs in `torch.lu_unpack()` when called with a bfloat16 tensor containing extreme values and an empty pivots tensor.
- Inconsistent Behavior in torch.quantile: A discrepancy in `torch.quantile` behavior is noted when handling infinite and NaN values, producing different outputs on CPU and CUDA.
- Incorrect Contiguity in NestedTensor Representation: A bug in PyTorch causes a NestedTensor to incorrectly show `contiguous=True` after a transpose operation, highlighting a discrepancy in memory layout reporting.
- torch.chunk Function Failure with NestedTensor: The `torch.chunk` function fails to operate on a `NestedTensor` with a jagged layout when the second dimension is not ragged, resulting in an unhelpful exception.
- Decorator Bug in test_decompose_mem_bound_mm.py: A bug in a specific decorator causes the test environment to return an empty list and run zero tests, suggesting a defense mechanism against such issues.
- Online Softmax Disabled in Transformer Inference: Online softmax functionality is disabled during a transformer forward inference pass using `torch.compile()`, due to Inductor's decision to split the reduction.
- Proposal for SlimTensor Representation: A proposal is made to introduce a new lightweight tensor representation called SlimTensor to enable AOTInductor to generate minimal, self-contained binaries for PyTorch models.
- Failure of torch.arange in torch.ops.higher_order.scan: The use of `torch.arange` within `torch.ops.higher_order.scan` fails due to the lack of a guard for unbacked symfloat, causing errors when rewriting a loop.
- Support for Backward Pass in torch.export: A request is made for `torch.export` to handle models requiring backward pass computation during inference, as current limitations prevent its use for latency-sensitive models.
- TorchScript Crash with Fake Tensors in Export: A bug in PyTorch's export function causes a crash when a fake tensor is passed to the TorchScript interpreter, suggesting a warning and disabling TS in such cases.
- Extending CPP Extension API for SYCL Kernels: A proposal is made to extend the PyTorch CPP Extension API to support SYCL kernels, enabling new operator development for Intel GPU platforms.
- Disabled Test on ROCm MI300 Runners: A disabled test, "test_ddp_apply_optim_in_backward," is failing on ROCm MI300 runners in continuous integration, suspected to be a flaky failure.
- Auto Format Lint Not Making Suggestions: A problem in the PyTorch GitHub project where the auto format lint is not making suggestions, possibly due to incompatibility with CI workflows.
- Output Discrepancy in conv_transpose2d Function: A bug in `torch.nn.functional.conv_transpose2d` causes significantly different outputs on CPU and CUDA when using `float16` tensors with specific settings.
- rrelu Function Crash on CPU: The `torch.nn.functional.rrelu` function crashes on a CPU when `training=True` and either `lower` or `upper` is set to infinity, while it works without error on CUDA.
- Change in ProcessGroupNCCL Component: A change in the PyTorch nightly build affects the `ProcessGroupNCCL` component, which now uses merged CUDA streams, impacting tensor lifetime management and profiling.
- Cleanup of autotune_fallback_to_aten Feature: The deprecated `autotune_fallback_to_aten` feature is being cleaned up in PyTorch by removing references in tests and benchmarks and updating configuration settings.
- Accuracy Discrepancy with torch.compile: A discrepancy in accuracy is noted when training with `torch.compile`, where numerical results diverge when using the `inductor` backend compared to `eager` and `aot_eager`.
- OOM Errors with FSDP reshard_after_forward: Using `reshard_after_forward=int` with shared post-forward device meshes in FSDP leads to GPU OOM errors, with discussions on potential long-term solutions.
- Complexity of reshape_view_helper Function: The complexity of the `reshape_view_helper` function in PyTorch's autograd is discussed, questioning whether a simpler version should be used to avoid data-dependent errors.
- Error Handling Discrepancy in fused_moving_avg_obs_fake_quant: A discrepancy in error handling between CPU and GPU implementations of `fused_moving_avg_obs_fake_quant` is noted, suggesting a validity checker for CPU.
- Behavior Inconsistency in lp_pool1d Function: An inconsistency in `lp_pool1d` behavior is noted, where CPU returns a tensor of zeros without error, while GPU throws an "integer out of range" error.
- Segmentation Fault with choose_qparams_optimized: A segmentation fault occurs when `choose_qparams_optimized()` is called with empty tensors and an extremely large `num_bins`, causing a program crash.
- Floating Point Exception with pixel_shuffle: The `torch.pixel_shuffle()` function causes a floating point exception when called with empty tensors and an extremely large `upscale_factor`, resulting in a core dump.
- Segmentation Fault with Sparse COO Tensor Conversion: A segmentation fault occurs when converting a sparse COO tensor with complex128 values to a dense tensor, traced to the complex number addition operator.
- Import Failure with BlockMask and flex_attention: Importing `BlockMask` and `flex_attention` from `torch.nn.attention.flex_attention` fails with `torch.device("meta")`, resulting in a `RuntimeError`.
- Selective Activation Checkpointing on Custom autograd.Function: Implementing Selective Activation Checkpointing on a custom `autograd.Function` is discussed, as the current implementation does not recognize it as an operation.
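Two of the bullets above lend themselves to short sketches. First, the `torch.arange()` discrepancy concerns an impossible range being handled differently per device; the CUDA behavior in the comment below is as reported in the issue, not independently verified:

```python
import torch

# CPU: an impossible range (start > end with the default positive step) raises a RuntimeError.
try:
    torch.arange(10, 0)
except RuntimeError as err:
    print("CPU raised:", err)

# CUDA: the issue reports that the same call silently returns an empty tensor.
if torch.cuda.is_available():
    print("CUDA returned:", torch.arange(10, 0, device="cuda"))
```

Second, the `functools.lru_cache` bullet concerns code of roughly this shape, where Dynamo reportedly traces only the wrapped function and does not model the cache itself (a hypothetical example, not taken from the issue):

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def scale_for(head_dim: int) -> float:
    return head_dim ** -0.5

@torch.compile
def attn_scores(q, k):
    # The cached helper is called from compiled code; per the issue, only the
    # underlying function is traced, so the caching semantics are not captured.
    return (q @ k.transpose(-1, -2)) * scale_for(q.shape[-1])

print(attn_scores(torch.randn(2, 8, 16), torch.randn(2, 8, 16)).shape)
```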
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 69
Summarized Issues:
- Bugs in PyTorch's Triton and CPP backends: This topic covers issues where PyTorch functions produce incorrect results or errors when executed on Triton and CPP backends. The `torch.nn.PairwiseDistance(p=2)` function shows discrepancies in eager mode, and the `torch.cummin` function throws a `NameError` due to an undefined variable in Triton kernel code.
- Documentation and Linter Issues in PyTorch: Several issues highlight problems with PyTorch's documentation and linter. The docstring linter incorrectly requires documentation for overridden methods, and there are inconsistencies in the documentation of functions like `torch.min()` and `torch.max()`.
- Graph Breaks and Compilation Errors in PyTorch: PyTorch faces issues with graph breaks and compilation errors. A graph break occurs with the `.t()` method on tensor subclasses, and a missing function declaration causes compilation failures on MSVC.
- Performance Optimization and Memory Usage in PyTorch: PyTorch projects encounter performance and memory usage issues. Optimizing SymPy expression printing can reduce a 10% performance cost, while a model unexpectedly consumes more GPU memory when loaded.
- Intel GPU Driver and Triton Errors in PyTorch: The latest Intel GPU Driver introduces breaking changes causing `torch.compile` to fail on Windows, and Triton errors occur due to renamed functions and interactions with operations like `select_scatter`.
- Non-picklable Operations and Error Messages in PyTorch: PyTorch faces issues with non-picklable operations and error messages. The graph pickler raises exceptions for non-picklable `torch.ops.aten` operations, and Inductor's error messages need enhancement for debugging.
- ONNX Export and Unit Test Failures in PyTorch: PyTorch needs improvements in ONNX export and faces unit test failures. Users are encouraged to enable `dynamo=True` when exporting to ONNX, and specific tests fail due to mismatched error messages.
- Padding and Pooling Function Documentation in PyTorch: PyTorch's documentation for pooling functions needs updates. The `padding` parameter in `avg_pool2d` should include a note about limitations, and basic usage descriptions for several functions need clarification.
- C++ Code Generation and Attention Mask Errors in PyTorch: PyTorch faces issues with C++ code generation and attention masks. A missing function declaration causes compilation failures, and a custom attention mask causes errors during training with flex attention.
- XPU Support and Memory Errors in PyTorch: PyTorch encounters issues with XPU support and memory errors. The function `torch.xpu.is_bf16_supported()` returns incorrect values, and memory errors occur with the `flex_attention` function for large time series.
- AsyncCollectiveTensor and Kernel Operations in PyTorch: PyTorch faces issues with `AsyncCollectiveTensor` and kernel operations. An `AsyncCollectiveTensor` does not trigger a `wait_tensor` upon a dtype cast, and binary kernel operations produce incorrect results with wrapped scalars.
- Docker Caching and Distribution Gradients in PyTorch: PyTorch projects face issues with Docker caching and distribution gradients. Docker caching does not function correctly on MI300 runners, and cross-mesh DTensor gradient allreduce lacks support.
- Tensor Method Discrepancies and Performance Regressions in PyTorch: PyTorch faces discrepancies in tensor methods and performance regressions. The `torch.Tensor.put_` method behaves differently on CPU and GPU, and a minor performance regression occurs in the modded-nanogpt project.
- Build Errors and CUDA Issues in PyTorch: PyTorch projects encounter build errors and CUDA issues. Building on WSL with specific CUDA architectures fails, and a CUDA error occurs during DDP training with RTX 5090 GPUs.
- Data Type Performance and Segmentation Faults in PyTorch: PyTorch faces performance issues with data types and segmentation faults. The `torch.dot` function is slower with `float16` and `bfloat16`, and a segmentation fault occurs in the `max_unpool2d` function.
- Quantization and Large Tensor Limitations in PyTorch: PyTorch faces issues with quantization and large tensor limitations. A "False INTERNAL ASSERT FAILED" error occurs when saving a quantized model, and a 2D convolutional layer fails with large tensors.
- Documentation Discrepancies and Negative Entropy Values in PyTorch: PyTorch faces documentation discrepancies and negative entropy values. The documentation incorrectly states the use of a nondeterministic `index_add`, and the `Beta.entropy` function returns negative values.
- Code Cleanup and Sharding Visualization Errors in PyTorch: PyTorch projects involve code cleanup and sharding visualization errors. An unnecessary header file inclusion is removed, and the `visualize_sharding` function fails with an "IndexError".
- Non-deterministic Sampling and Test Disabling in PyTorch: PyTorch faces issues with non-deterministic sampling and test disabling. The `torch.multinomial` function produces non-deterministic results, and several tests in the `TestInductorOpInfoXPU` suite are disabled due to failures.
- Overflow and Inference Errors in PyTorch: PyTorch faces overflow and inference errors. The `torch.addcdiv` operation overflows to infinity with a scalar input, and inference outputs produce random 'nan' values on device 'xpu:1'.
- Randomness and Symbolic Operations in PyTorch: PyTorch faces issues with randomness and symbolic operations. The `torch.Tensor.bernoulli_()` function generates different outcomes on CPU and GPU, and the necessity of a symbolic floating-point operation in the ONNX exporter is questioned.
- Test Disabling and Dtype Promotion in PyTorch: PyTorch projects involve test disabling and dtype promotion issues. A test in the `AOTInductorTestABICompatibleGpu` suite is disabled, and dtype promotion behaves unexpectedly when performing operations between a tensor and a scalar.
- Complex Tensor Function and Multihead Attention Issues in PyTorch: PyTorch faces issues with complex tensor functions and multihead attention. The `asin()` function for complex tensors returns inconsistent results, and `in_proj_bias` values for Q and K biases are unexpectedly zeros.
- CI Infrastructure and Build Process in PyTorch: PyTorch projects involve testing CI infrastructure resilience and build process issues. A scenario simulates ignored requests causing runner scarcity, and an unstable build process is resolved by reducing build time.
- CI Job Instability and TensorFloat32 Concerns in PyTorch: PyTorch projects face CI job instability and TensorFloat32 concerns. A CI job is marked unstable due to a parsing failure, and TensorFloat32 is disabled by default on Ampere-class GPUs without warning.
- Pattern Matching and Performance Anomalies in PyTorch: PyTorch faces issues with pattern matching and performance anomalies. A mismatch in the PatternMatcher arises from a user count discrepancy, and a performance anomaly occurs with the `torch.from_numpy().cuda()` pattern.
- Function Behavior and Pretrained Model Issues in PyTorch: PyTorch faces issues with function behavior and pretrained models. The `torch.arange()` function behaves inconsistently on CPU and GPU, and `AutoModel.from_pretrained(...)` fails within a `torch.device("meta")` context.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 193
Key Open Pull Requests
1. [associative_scan] Autograd for additional inputs: This pull request implements the autograd feature for additional inputs in the PyTorch project, building upon a previous pull request, and includes various commits that address issues, improve documentation, and refine the implementation to ensure functionality and compatibility.
- URL: pull/153317
- Merged: No
- Associated Commits: 9c49a, 100c5, 0e7c8, 67d62, 9e5e1, 6b415, d68b3, 9de0c, 653ea, ac1a1, d51b1, dbfd5, 64a8b, 08b72, f3788, b1277, 63350, 89ea8, c8bbd, 1c9fa, 6d835, 3f816, 62770, 67617, b73c5, ea956, 0786e, c628d, e7d63, 95d0e, 3707c, dc03c, 932f3, 2edd6
2. try relanding cublaslt autotuning support for TunableOp #: This pull request aims to reintroduce cublasLt autotuning support for the TunableOp in the PyTorch project, with additional tasks including the addition and testing of ScaledGemmTunableOp and obtaining benchmarking numbers, as evidenced by multiple commits addressing autotuning, support for various data types, and compatibility fixes for CUDA and ROCm platforms.
- URL: pull/153316
- Merged: No
- Associated Commits: 3a6e6, 6ec4b, 260c2, 15e4b, 08bcd, 4d060, 278a8, 62ea6, 0316c, 676b8, 85391, 02a11, 99cc7, 6e555, ce341, cccd3, 14fa0, 2a5d7, 68a55, ba332, c0fe6, 945b2
3. [Set] Add `set.issubset` and `set.issuperset`: This pull request proposes the addition of `set.issubset` and `set.issuperset` methods to the PyTorch project, as part of a series of related changes tracked through a stack of pull requests (a short sketch of the motivation follows this entry).
- URL: pull/152902
- Merged: No
- Associated Commits: 86a00, 45494, 38079, 8e815, 7a75b, 5a90c, 3bfc3, b583a, 3511f, 99fa6, 0ee20, 8d825, ccdb9
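To make the motivation concrete, code like the following exercises `set.issubset` under `torch.compile`; whether it traces without a graph break depends on this stack landing. This is a hypothetical example, not taken from the pull request's tests:

```python
import torch

@torch.compile
def maybe_activate(x, enabled_ops):
    # Dynamo must model set.issubset for this check to be traced rather than
    # falling back to eager execution at a graph break.
    if {"relu"}.issubset(enabled_ops):
        return torch.relu(x)
    return x

print(maybe_activate(torch.randn(4), {"relu", "gelu"}))
```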
Other Open Pull Requests
- Error Handling Enhancements: Several pull requests focus on improving error handling in the PyTorch project. These include raising `KeyError` and `TypeError` for specific conditions, such as when an element is not in a set or when there is a mismatch in the number of arguments provided.
- Set Method Enhancements: Multiple pull requests propose the addition and update of set methods in the PyTorch project. These include methods like `set.difference_update`, `set.intersection_update`, and `set.symmetric_difference_update`, which are part of a series of related changes managed through the ghstack tool.
- Set and Frozenset Initialization: A pull request aims to implement the correct initialization behavior for the `set` and `frozenset` data structures. This involves a series of related commits and pull requests to ensure proper functionality within the PyTorch project.
- Debugging and CI Modifications: A pull request titled "[Don't merge] Debug" involves modifications to the CI models list and torchbench model list. It includes adding and removing specific models for debugging purposes without merging into the main branch.
- FrozenSet Fixes: A pull request addresses fixes for the `FrozenSet` in the PyTorch project. It is part of a larger stack of changes and includes multiple updates marked as "[ghstack-poisoned]" across several commits.
- Exception Handling in ConstantVariable: A pull request addresses the handling of exceptions in the ConstantVariable operation within the PyTorch project. It is part of a series of related changes managed through the ghstack tool, involving multiple commits.
- Testing Enhancements: A pull request aims to add testing to ensure that the stride order in the code matches the expected configuration. This involves multiple updates and contributions from various collaborators within the PyTorch project.
- Higher Order Gradients Support: A pull request aims to enhance the PyTorch library by supporting higher order gradients with the `create_graph=True` option. It addresses potential issues with double backward computations when the forward trace includes non-composite implicit operations.
- Build Tools Update: A pull request updates the build tools version for PyTorch by using the latest Microsoft Visual Studio Compiler (MSVC). It ensures that AVX-512 instructions are correctly configured, with modifications to the findavx module and additional checks for Visual Studio path searches.
- New Operation Introduction: A pull request introduces a new operation, `onednn.qbatch_norm2d`, designed to compute uint8 batch normalization on the CPU using AVX512 instructions. It offers performance comparable to the existing `quantized.batch_norm2d` for the QuantizedCPU device.
- Linter Introduction: A pull request introduces a linter to automatically detect and flag instances where contributors have hardcoded the "cuda" device in test cases. This helps prevent test failures on XPU and disrupts the XPU CI process.
- Codebase Improvements: A pull request involves replacing the `unimplemented` function with `unimplemented_v2` in the `torch/_dynamo/variables/misc.py` file. It is part of a series of updates to improve the PyTorch codebase, addressing issue #147913 and following up on #152274.
- VariableBuilder Support for Sets: A pull request aims to introduce support for sets in the VariableBuilder within the PyTorch project. It is part of a series of stacked changes managed through the ghstack tool and is currently a work in progress.
- Ruff Linter Update: A pull request updates the Ruff linter to version 0.11.8 to address numerous false negatives across the codebase. It improves the validation of NOQA comments and includes various changes such as fixing typos and adjusting external rules in the `pyproject.toml` file.
- Negative Dimension Issue Fix: A pull request addresses a negative dimension issue within the parallel loss context manager. It implements a solution similar to the one proposed in issue #152016, with contributions including normalization additions and review commits.
- Header-Only File Implementation: A pull request proposes moving the `c10/core/DeviceType.h` file to a new `torch/csrc/header_only` directory to implement it as a header-only file. It retains a copy in its original location for backward compatibility and plans to move more header files in the future.
- Output Stride Consistency: A pull request aims to ensure that the output stride in the `invoke_subgraph` function is consistent with the eager execution mode. This is indicated by the title and multiple updates in the commit messages.
- EVT Component Testing: A pull request introduces end-to-end tests for the EVT component within the Cutlass project. It includes multiple updates and revisions to ensure comprehensive testing, while engaging several contributors for review and collaboration.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 273
Key Closed Pull Requests
1. [Release/2.6] Remove --no-deps and --no-index flag when building pytorch in docker build: This pull request addresses the removal of the `--no-deps` and `--no-index` flags when building PyTorch in a Docker environment, as the old setuptools version (<80.0.1) was installing the requirements of the torch and torchvision wheels, but the new setuptools version (>=80.0.1) does not install these requirements, necessitating the change.
- URL: pull/153163
- Merged: No
- Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 93864, 88b97, bbd00, 1f32b, d33dd, f1481, ea546, ac7d6, ed487, 66dfe, 3783d, 8adc1, 5c4fa, 639ee, b445b, 8d72c, 374e5, 6a3b5, e607b, aafc7, d5947, 1b753, ba1ba, 70f30, 3398f, 8354d, 737cf, 4202f, 7c27e, 2e2c7, 3a818, 53ad2, 8eb5d, dbe8c, fcdff, 92b55, f6789, 2e1ed, 13339, 82ac2, 3608e, bfb23, 86b0a, 03714, 34caa, ac032, 5dd61, 7d528, d9a03, 7c072, 73dd0, b08d9, d70a9, 7ad5a, 2fd46, ed8c6, bf084, 20ad8, 2fb0a, 8cfa9, 50a04, 45896, 9d0a4, 1a808, 6fe84, a3632, 68180, fb24f, e53a9, 2cda1, 9d566, a7044, c7ba8, cbd7b, c3733, faf90, a87c9, ce6b7, dc41a, 469ce, 8ccfc, 1290e, 93693, 75628, 011d3
2. [Dynamo] Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/misc.py` [1/2]: This pull request involves replacing the `unimplemented` function with `unimplemented_v2` in the `torch/_dynamo/variables/misc.py` file as part of a larger task referenced by issue #147913, and includes multiple commits, many of which are co-authored by William Wen, although it was ultimately not merged.
- URL: pull/152274
- Merged: No
- Associated Commits: bf165, 2219d, fc60c, fa228, 03059, 3f64d, 49785, b0c01, d8852, 7b1e2, 2fb11, 4c828, 27675, fd03c, e7319, 4b862
3. Broken Links GHA: This pull request introduces a GitHub Action designed to run monthly, which checks for broken links within the repository and automatically creates an issue listing any broken links found, although it was not merged into the main branch.
- URL: pull/151454
- Merged: No
- Associated Commits: 12318, 22c4f, 6cb4b, 8887f, 73a25, 49982, 261ab, 41ccd, f1f70, c6d60, 7618b, 5329e, d59d2, 00b3f
Other Closed Pull Requests
- Manylinux 2.28 and GCC 13 Update: This pull request aimed to update the Manylinux 2.28 images to use GCC version 13, as part of addressing an issue related to the PyTorch project. However, it was ultimately not merged.
- CI Setup Changes: This pull request proposes changing the continuous integration (CI) setup for the PyTorch project by using CMake installed via pip instead of conda within the CI Docker images. Despite the proposed changes, it was ultimately not merged.
- Subgraph Name Refactoring: This pull request involves refactoring the process of fetching subgraph names within the PyTorch project. It was part of a series of changes managed through the ghstack tool, although it was ultimately not merged.
- XPU Backend Convolution Fix: This pull request addresses an issue in the PyTorch project where the XPU backend's convolution operation bypasses the device guard logic. It proposes a solution to ensure proper device management across multiple devices.
- Recursive DCE on Subgraphs: This pull request involves implementing a recursive dead code elimination (DCE) process on subgraphs within the PyTorch project. Despite multiple updates and discussions among contributors, it was ultimately not merged.
- MPS Kernel Improvements: This pull request introduces specialized kernels for the SDPA on MPS to improve performance, achieving significant speed improvements. It also addresses failing tests, adds new tests for specialized paths, and performs code cleanup.
- MPS Unary Operations Optimization: This pull request involves moving additional MPS unary operations to a TensorIterator-based Metal kernel to significantly improve runtime performance. It achieves a 20x speedup for certain operations, with plans for further migration in a subsequent pull request.
- Dtype Promotion Error Fix: This pull request addresses a dtype promotion error in the PyTorch project related to the concatenation decomposition process. It ensures that the cloning of a single tensor adheres to the correct dtype promotion rules.
- Cutlass Key Addition: This pull request aims to add a "cutlass key" for both Facebook's internal codebase and open-source software. It involves multiple updates and revisions as indicated by the series of commits and the associated differential revision.
- Flex Attention Symints Fix: This pull request addresses an issue in the Flex Attention component where symbolic integers from subgraph inputs and outputs were not explicitly stored. It proposes a fix to ensure these symints are properly captured and stored.
- Cutlass Library Enhancement: This pull request aims to enhance the Cutlass library by adding epilogue inputs and outputs to the `def_kernel` function. It is part of a series of updates tracked through the ghstack tool, although it was ultimately not merged.
- Docstring Linter Exemption: This pull request aims to address issue #151692 by proposing changes to exempt overriding methods from the docstring linter in the PyTorch project. However, it was ultimately not merged.
- ROCm CI Environment Upgrade: This pull request aims to upgrade the ROCm CI environment to ROCm version 6.4 and update all related GitHub workflows to use the Jammy distribution. It also modifies the install script for ROCm and addresses specific CI issues.
- Spack Includes Update for ROCm: This pull request focuses on updating the Spack includes for ROCm in the PyTorch project by cleaning up `caffe2/CMakeLists.txt`. It ensures compliance with Spack by using the `rocm-core` package and adjusts the order of `find_package` calls.
- Functorch Interpreters Serialization: This pull request aims to make Functorch interpreters serializable most of the time to enable saving guards on Functorch states. It includes various test cases to ensure functionality across different nested scenarios.
- Unsupported Float8 Multiplications: This pull request addresses the issue of unsupported float8 row-wise scaled matrix multiplications on the Blackwell architecture. It adds skips to the relevant tests to reduce noise on machines with compute capability `sm_120+`.
- Recursive Realization of stack_values: This pull request aims to implement a recursive realization of the `stack_values` in the Dynamo component of the PyTorch project. It potentially addresses issue #135696, as indicated by the series of commits and the involvement of multiple contributors.
- Tuple Iterator Serialization Testing: This pull request addresses the issue of testing serialization for `TUPLE_ITERATOR_LEN` in the PyTorch project. It implements a workaround to prevent tuple iterators from being exhausted during testing and invites insights for a more robust solution.
- `hasattr(tensor, "size")` Bug Fix: This pull request addresses a bug fix in the PyTorch project related to the `hasattr(tensor, "size")` check. It is part of a stack of changes and aims to resolve issue #135696, although it was ultimately not merged.
- Hierarchical Compilation Enhancement: This pull request aims to enhance the hierarchical compilation process by incorporating mutation dependencies into the topological sorting algorithm. It is part of a series of related updates in the PyTorch project.
- Vendor-Neutral Fault Recovery: This pull request aims to make the FR (flight recorder) code in the PyTorch project vendor-neutral by removing the dependency on `USE_C10D_NCCL`. It introduces a generic version with `c10::Event` that can be utilized by other backends like Gloo.
- Code Generation Symbol ID Fix: This pull request addresses an issue in the PyTorch project where the code generation process fails due to out-of-order symbol IDs in forward input expressions. It proposes to skip code generation for SymPy expressions with multiple symbols.
- Proxy Removal for autograd.Function.ctx: This pull request addresses the removal of proxying the `autograd.Function.ctx` into the computational graph due to the adoption of newer designs. It ensures the creation of an `fx.Proxy` for the `autograd.Function.ctx` object when necessary.
- ShapeEnv Cache Key Modification: This pull request addresses an issue in the `ShapeEnv.evaluate_expr()` function by modifying its cache key to include the global "suppress_guards" value. This fix was prompted by a test failure in `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error`.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- MXFP8 Fix broken bias support for mxfp8
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating dissatisfaction.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 227 | 19 | 9 | 120 |
anijain2305 | 213 | 17 | 2 | 11 |
guilhermeleobas | 180 | 17 | 1 | 7 |
Skylion007 | 42 | 13 | 2 | 98 |
swolchok | 112 | 5 | 0 | 13 |
mlazos | 108 | 13 | 0 | 6 |
guangyey | 96 | 10 | 0 | 16 |
laithsakka | 66 | 15 | 15 | 20 |
FFFrog | 99 | 7 | 0 | 1 |
henrylhtsang | 79 | 11 | 7 | 10 |