Weekly GitHub Report for PyTorch: May 05, 2025 - May 12, 2025 (12:02:32)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for `torch.compile` with Python 3.13, a new performance-related feature `torch.compiler.set_stance`, and FP16 support on X86 CPUs. Notably, the release marks a shift away from publishing on Conda, with a focus on using official wheel packages or conda-forge, and introduces a backward-incompatible change by setting `weights_only=True` as the default for `torch.load`.
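As a quick illustration of the `weights_only` change, here is a minimal sketch (the file name is a placeholder; this is not taken from the release notes):

```python
import torch

# Saving and loading a plain tensor/state_dict works unchanged under the new default.
torch.save({"weight": torch.zeros(2, 2)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # weights_only=True is now the default in 2.6

# Checkpoints that pickle arbitrary Python objects need an explicit opt-out;
# only do this for files from a trusted source.
state = torch.load("ckpt.pt", weights_only=False)
```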
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [CXX11ABI] torch 2.6.0-cu126 and cu124 have different exported symbols: This issue highlights a discrepancy in the exported symbols between the cu124 and cu126 versions of torch 2.6.0, specifically noting the absence of the symbol `_ZN3c105ErrorC2ENS_14SourceLocationESs` in cu126, which has caused compatibility issues with the flash_attention library. The discussion also raises questions about the dependency of flash_attention on this symbol and the implications of the CXX11 ABI migration on the stability and compatibility of PyTorch extensions.
- The comments discuss the unexpected symbol export differences due to the CXX11 ABI migration, suggesting smoke tests for exported symbols and questioning flash_attention's dependency on the missing symbol. There is a consensus that extensions need to update their binaries, and suggestions are made for improving PyTorch's testing and release notes to address such issues. Some users share their experiences and potential solutions, while others seek clarification on installation and compatibility problems.
- Number of comments this week: 17
- Performance Regression nightly 03/11→03/12, on nanogpt speedrun: This issue reports a performance regression in the nanogpt speedrun observed between nightly builds from March 11 to March 12, with a noted increase in runtime from approximately 1469 seconds to 1487 seconds. The user suspects that the regression may be related to a Triton upgrade and provides detailed runtime comparisons and diffs to support the investigation.
- The comments discuss potential causes of the regression, with some attributing it to a Triton update. There is a debate on whether the observed 1.2% performance difference is significant enough to be actionable, given the stable setup and reproducibility of the results. Further profiling indicates a 2% regression per iteration, possibly linked to changes in flex attention, and additional investigation is ongoing.
- Number of comments this week: 11
- Loading sparse tensors in a DataLoader raises CUDA initialization error since 2.5.0 if you have already initialized CUDA: This issue describes a problem with loading sparse tensors in a DataLoader, which raises a CUDA initialization error when CUDA is initialized before the data loader loop, starting from version 2.5.0. The error seems to be related to a validation check for pinned memory in sparse tensors, which was introduced in a recent pull request and causes issues when the CUDA context is improperly copied during a process fork.
- The comments discuss whether the issue is a regression, with some clarifying that the problem is related to sparse tensor validation and not pinned memory. Suggestions include exposing internal checks to handle bad forks, and there is a consensus that the validation check is necessary for consistency, despite the challenges it introduces with CUDA initialization in forked processes.
- Number of comments this week: 10
- Pytorch 2.7 crashes when using flex attention with torch.amp: This issue reports a bug in PyTorch 2.7 where using flex attention with torch.amp.autocast causes a runtime error, specifically when the `enabled` parameter is set to `True`. The problem persists across multiple versions of PyTorch, from 2.5 to 2.7, and is reproducible with a provided code snippet, but it does not occur when the `enabled` parameter is set to `False` (a sketch of the pattern appears after this list).
- The comments discuss attempts to reproduce the issue, with some users unable to replicate the crash on similar hardware and software configurations. There is a suggestion to check the version of Triton being used, as an outdated version might be causing the issue, and it is recommended to use PyTorch-triton 3.3 for compatibility with PyTorch 2.7.
- Number of comments this week: 7
- Process never ends when sending tensors through multiprocessing queues in Python 3.12+ on macOS: This issue describes a bug encountered when sending tensors through multiprocessing queues in Python 3.12+ on macOS, where the process does not terminate as expected and requires manual interruption. The problem appears to be linked to the resource tracker process initiated by Python, as the issue does not occur in Python 3.11 or on other operating systems like Ubuntu, and is resolved by changing the multiprocessing start method to "fork," although this is not recommended.
- The comments discuss attempts to reproduce the issue, with some users unable to replicate it while others confirm the problem on different macOS machines. A stack trace is shared, and the issue is identified as specific to Python 3.12.10, not occurring in 3.12.9. Potential links to recent changes in Python's handling of the resource tracker and subprocess modules are noted, suggesting a possible cause for the hang.
- Number of comments this week: 7
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically affecting the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a significant speedup in processing time, as demonstrated by the provided testing code and performance benchmarks (see the sketch after this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when running a PyTorch model within a Docker container, where the execution of a cached `cuda_utils.so` file fails due to a missing execution permission, despite the directories having the correct permissions. The error occurs specifically in a Docker environment with a `tmpfs` permission set to `1777`, and the problem is highlighted by an `ImportError` indicating a failure to map a segment from the shared object, which is crucial for the model's execution.
- Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, which currently has approximately 1,500 files that are not formatted according to the UFMT standards. The process requires removing file names from the `exclude_patterns` in the `.lintrunner.toml` file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the `torch.jit.save()` function in PyTorch to exclude `.debug_pkl` files, which are primarily used for debugging purposes and can significantly increase the file size of JIT archives. The motivation behind this feature request is to reduce the size of model files, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's file size from 6.7MB to 5.6MB by manually removing these debug files.
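The MaxPool2D proposal above is easy to illustrate: with stride 1, a large pooling window can be expressed as a composition of smaller ones, since a max of maxes over overlapping windows covers the same receptive field. A minimal sketch (kernel sizes chosen for illustration, not taken from the issue's benchmark code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# Reference: a single 5x5 max pool with stride 1.
ref = F.max_pool2d(x, kernel_size=5, stride=1)

# Composition: two 3x3 max pools with stride 1 cover the same 5x5 receptive field.
composed = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1), kernel_size=3, stride=1)

print(torch.equal(ref, composed))  # True
```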
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 115
Summarized Issues:
- ImportError with Torch and CUDA on Linux: This issue involves an ImportError encountered when using Torch 2.7 with CUDA 12.8 on Linux, related to an undefined symbol in `libnvJitLink.so.12`. The problem does not occur on Windows, and a workaround involving CUDA 12.8 is suggested to resolve it.
- Symbol Discrepancy in PyTorch Versions: A discrepancy in exported symbols between cu124 and cu126 versions of PyTorch 2.6.0 is highlighted, causing compatibility issues with the flash_attention library. The absence of a specific symbol in cu126 raises questions about dependencies and CXX11 ABI migration implications.
- Inconsistent Model Export Behavior: A bug in PyTorch causes inconsistent behavior when exporting a model with a `nonzero` call followed by `grid_sample`, with the CUDA backend throwing a data-dependent error. This requires additional manual checks, unlike the CPU/MPS backends where no such checks are needed.
- Performance Discrepancy in Quantization Methods: A significant performance discrepancy is noted between `torchao` and `torch.quantization` dynamic quantization methods. The former results in a 1% metric drop on GPU, while the latter leads to a 35% drop on CPU, prompting an inquiry into the cause.
- Feature Request for TorchRun GPU Specification: A feature request is made to allow users to specify GPUs in TorchRun, as the current method using `CUDA_VISIBLE_DEVICES` is not intuitive. An option like `--bind-devices` is suggested to improve usability and reduce errors.
- Performance Regression in Nanogpt Speedrun: A performance regression is reported in the "nanogpt speedrun" between March 11 and March 12 nightly builds, with runtime increasing from 1469-1470 seconds to 1486-1487 seconds. The issue is potentially linked to a Triton upgrade.
- Intermittent Crashes During Mypy Stage: Intermittent crashes occur during the `mypy` stage of the `lintrunner -a` command, generating a Python traceback 5-10% of the time. The error does not typically reoccur upon rerunning, and a suspicion of `ruff` modifying files is noted.
- Pipeline Parallelism Bug in PyTorch: A bug in pipeline parallelism causes process failure if input to a stage does not produce gradients for all stages. This affects conditional or mixture models and suggests a need for different asynchronous communication operations.
- DTensor Operation Support in Dynamo: The need to remove hardcoded support for DTensor operations in PyTorch's Dynamo is addressed. The current implementation is brittle and hinders caching, suggesting a more generic approach using the `flat_apply` HOP.
- Clarification on Build Stage Function Usage: Clarification is sought on the correct usage of the `build_stage` function in `torch.distributed.pipelining` with `DistributedDataParallel`. The documentation example may be misleading regarding wrapping a stage module.
- Thread Safety Issue with torch.cuda.use_mem_pool: A thread safety problem is identified with the `torch.cuda.use_mem_pool` API due to the non-thread-local nature of `captures_underway`. This leads to potential memory allocation conflicts across threads.
- Failure of Inductor-Periodic ROCm Tests: Inductor-periodic ROCm tests in PyTorch have been failing since April 10th, with ongoing efforts to resolve the problem. A related pull request aims to address some of these failures.
- Improving Clarity of Traceback Frames: The need to improve the clarity of the final traceback frame in error messages generated by `torch.compile` is highlighted. The current frame lacks context about the function's purpose.
- Runtime Error with wrap_triton Function: A runtime error occurs when using the `wrap_triton` function with a Triton kernel in eager mode. The function only supports kernels annotated with `triton.jit` or `triton.autotune`.
- Documentation Update for wrap_triton Function: An update is needed for the `wrap_triton` function documentation in PyTorch. The current explanation is confusing regarding the necessity of registering as a custom operation.
- CUDA Memory Trimming in DeviceCachingAllocator: Incorporating CUDA memory trimming into the DeviceCachingAllocator is proposed to address GPU memory oversubscription on Windows. This involves registering a callback using the `cudaDeviceRegisterAsyncNotification` API.
- Forward Mode AD Documentation Enhancement: PyTorch's forward mode automatic differentiation (AD) documentation needs training examples, particularly for RNNs. This would illustrate the practical application and benefits of forward mode AD.
- Flakiness of Lint URL Checks: The 'UNSTABLE Lint / Link checks / Lint URLs / linux-job' in PyTorch is flaky. Improvements such as implementing retries or checking only links from git diffs are suggested to enhance stability.
- Runtime Error with Cudagraphs and Transformer Blocks: A runtime error occurs when using cudagraphs with individually compiled transformer blocks. The error is triggered by accessing a tensor output overwritten by a subsequent run.
- Segmentation Fault in max_unpool2d Function: A segmentation fault occurs in the `torch.nn.functional.max_unpool2d` function when executed with a specific script. This is similar to a previously reported issue.
- Failing Unit Test in PyTorch: A failing unit test titled "inductor_cpp_wrapper" is reported under the configuration cuda12.6-py3.10-gcc9-sm86. It is marked as unstable while further investigation is conducted.
- Performance Regression in CycleGAN and pix2pix Models: A performance regression is observed in CycleGAN and pix2pix models using AMP with multiple threads. The regression is attributed to a specific commit.
- Tracking Upgrade Process to Version 12.8.1: The upgrade process to version 12.8.1 is tracked, focusing on updating Docker images/builds and Windows AMI. Related pull requests and key contributors are mentioned.
- HIP Environment Detection Bug: A bug is reported where PyTorch fails to detect the HIP environment on ROCm when initialized by CuPy. This suggests a conflict or initialization order problem between the libraries.
- Behavior of aten._scaled_dot_product_efficient_attention: The `aten._scaled_dot_product_efficient_attention` function returns a Log-Sum-Exp tensor with padded sequence length. Clarification is sought on the necessity of this padding.
- Removing Redundant Type Aliases: The removal of redundant type aliases for `_device_t` in favor of `torch.types.Device` is proposed to enhance consistency in PyTorch's typing system.
- DTensor Placement Propagation Bug: A bug in DTensor placement propagation for the `slice` operation is reported. The sharding propagation rule fails due to symbolic integers, causing an assertion error.
- Test Failure in PyTorch: A test failure is reported in `TestNestedTensorOpInfoCUDA.test_compile_backward_matmul_cuda_float32`, where a backend compiler error occurs during the backward pass on CUDA.
- Improving DTensor Support for Dynamic Shapes: Challenges and potential solutions for improving DTensor's support for dynamic shapes are discussed. The focus is on handling symbolic integers in metadata and caching logic.
- Dummy Forward and Backward Passes in FSDP2: The need for dummy forward and backward passes in FSDP2 is addressed to maintain SPMD execution across all ranks, preventing job hangs due to masked activations or empty inputs.
- Avoiding as_strided for Non-Contiguous Reshaping: The use of `as_strided` for non-contiguous in-place reshaping of tensors with unbacked symbols is discouraged due to data-dependent errors. A revisit is suggested once the codebase is more accommodating.
- Refactoring MegaCache Component: The MegaCache component is being refactored to make it generic, enabling the registration of external plugins' caches and eliminating specific cache logic.
- Program Crash with torch.lcm_ Operation: A bug in PyTorch causes a program crash when using `torch.lcm_` between a large int32 tensor and an int16 scalar. The function should either complete successfully or raise a proper exception.
- torch.load Function UnpicklingError: A bug in `torch.load` prevents deserialization of `datetime` objects, resulting in an `UnpicklingError`. This is due to a change in the default `weights_only` argument behavior in PyTorch 2.6.
- 2D Convolution Operation Failure on CUDA: A bug causes a 2D convolution operation using the `int8` data type on CUDA to fail due to the inability to find a suitable engine. The operation works correctly on CPU or with `float32`.
- Enhancing PyTorch Dynamo for functools.lru_cache: Enhancements are proposed for PyTorch Dynamo to fully support `functools.lru_cache`, addressing limitations where only the underlying function is traced (see the sketches after this list).
- lintrunner init Command Failure: The `lintrunner init` command fails with the `--take FLAKE8` option due to an unsupported use of pip's `--user` flag. A virtual environment is suggested as a workaround.
- Gradient Discrepancy in All-Gather Implementations: A gradient discrepancy is noted between two PyTorch all-gather implementations, raising concerns about which produces the correct autograd-compatible gradients.
- A16W4 Quantization on XPU Devices: Enabling A16W4 quantization on XPU devices within the torchAO framework is proposed to optimize memory consumption and inference speed for large language models.
- FlexAttention on Intel GPUs Using XPU: A proposal to enable FlexAttention on Intel GPUs using XPU is discussed, aiming to address performance bottlenecks in large language models.
- Illegal Memory Access with CUDA Graphs: Capturing multiple CUDA graphs using multiple GPUs results in illegal memory access during replay. Only buffers related to the last captured graph are retained.
- Inconsistent Output in F.linear Function: A bug in the `F.linear` function causes inconsistent output when performing matrix multiplications with varying dimensions using zero-padding under bf16 precision.
- Request for Split Softmax Function: A request is made to implement the Split Softmax function in PyTorch to address transformer models forgetting system prompts during long text processing.
- Incorrect Alias Assumptions in gen_alias_from_base: The `gen_alias_from_base` function incorrectly assumes inductor-generated output is an alias, leading to erroneous results when regenerating the alias.
- Runtime Error with Flex Attention and Autocast: A runtime error occurs when using flex attention with `torch.amp.autocast`, causing a program crash. Disabling autocast avoids the issue.
- Precision Problem with AMP and torch.compile: A significant precision problem and unexpected float32 overflow occur when using AMP with `torch.compile`, leading to low precision and overflow issues.
- Process Termination Issue with Multiprocessing Queues: Processes do not terminate as expected when sending tensors through multiprocessing queues in Python 3.12+ on macOS, potentially due to a resource tracker process.
- Device Mismatch Error in Export Function: A bug in the export function causes a runtime error due to device mismatch between CPU and CUDA during tensor operations, as the embedding layer is not moved to the appropriate device.
- Incorrect Handling of cuda.Event in Dynamo: A bug in PyTorch's Dynamo treats `cuda.Event` objects as compile-time constants, leading to a runtime error when calculating elapsed time between events.
- register_constant Function TypeError: The `register_constant` function does not work with simple types like enums unless they have a non-default `__eq__` implementation, raising a `TypeError` during export.
- Fake Tensor Leakage in Non-Strict Export: A bug in PyTorch's non-strict export process fails to detect and prevent fake tensor leakage, as demonstrated by a model pipeline logging a fake tensor in its buffer.
- Assertion Error with Fake Tensors in Export: An assertion error occurs when exporting a model using `vmap` for vectorization, indicating a problem with handling fake tensors at the pre-dispatch level.
- CI Infrastructure Robustness Testing: Two scenarios are executed to test the robustness of PyTorch's CI infrastructure, simulating errors in HUD processing pipelines and AWS EC2 instance creation API failures.
- Incorrect Zero Output in torch.ldexp Function: The `torch.ldexp` function produces an incorrect zero output when multiplying a float16 tensor by a power of two within the representable range, suggesting an implementation flaw.
- Link Check Failure Due to Stack Overflow 403 Error: A link check failure occurs due to a 403 error from a Stack Overflow page, potentially due to rate-limiting or blocking of automated access.
- Mixed Precision Casting Issue in FSDP2: A bug in FSDP2 prevents proper handling of mixed precision casting for tensors within dataclasses, unlike FSDP (version 1), leading to a runtime error.
- AttributeError with functools.partial in Export: An `AttributeError` occurs when exporting a model with a patched forward method using `functools.partial`, as the `partial` object lacks a `__code__` attribute.
- Atomic Operations for Global Amax in Triton: Investigating the use of atomic operations to compute global amax values in Triton code generation for float8 quantization, as atomics outperform the current reduction-based approach.
- Introducing is_known_contiguous Function: A new function, `is_known_contiguous`, is proposed to replace `is_contiguous` in relevant areas of PyTorch, handling cases not known to be contiguous.
- Inconsistent Sizes in torch::unique_consecutive: A bug in `torch::unique_consecutive` with a custom CUDA allocator passes inconsistent sizes to `malloc` and `free`, potentially leading to memory corruption.
- Heuristic Increase for CUDA 12.6 Test: A failure in the `test_c10d_nccl` test necessitates a temporary heuristic increase from 1.5x to 1.7x when replacing CUDA 11.8 distributed jobs with CUDA 12.6.
- torch.nanmean() Function Error on CPU: An inconsistency in `torch.nanmean()` behavior is noted, where it operates correctly on GPU but fails on CPU with a misleading error message.
- torch.arange() Function Discrepancy: A discrepancy in `torch.arange()` behavior is highlighted, where GPU returns an empty tensor for an impossible range, while CPU raises an exception (see the sketches after this list).
- Error Discrepancy in torch.batch_norm: An error discrepancy in the `torch.batch_norm` function is noted, where an error is triggered on CUDA-enabled GPUs but not on CPU when `running_mean` or `running_var` is set to `None`.
- Missing Documentation for torch.segment_reduce: The absence of documentation for `torch.segment_reduce` is highlighted, questioning its performance compared to `torch.scatter_reduce`.
- Floating-Point Exception with NNPACK: A floating-point exception occurs with NNPACK for convolution on systems with disabled SMT, linked to a calculation error in NNPACK.
- CUDA Initialization Error with Sparse Tensors: Loading sparse tensors in a DataLoader raises a CUDA initialization error in PyTorch 2.5.0 and later if CUDA is already initialized, due to a validation check.
- Segmentation Fault in test_flex_attention.py: Running `test/inductor/test_flex_attention.py` results in segmentation faults and memory corruption errors, particularly on Intel Sapphire Rapids CPUs.
- torch.nn.functional.avg_pool2d Documentation Discrepancy: A discrepancy between the documentation and actual behavior of the `kernel_size` parameter in `torch.nn.functional.avg_pool2d` is noted, prompting a request for updates.
- Runtime Error in Qwen2 Model: A runtime error is triggered by an internal assertion failure in the `matmul` operation within the `eager_attention_forward` function of the Qwen2 model.
- Failure of tensor.view() with Dynamic Shapes: The `tensor.view()` function fails with dynamic shapes, despite the size of the first dimension being inferable and an assertion check confirming the condition for reshaping.
- Deferred Runtime Asserts Feature Ineffectiveness: The `prefer_deferred_runtime_asserts_over_guards` feature does not function correctly during end-to-end compilation, as necessary assertions are not added to the graph.
- Export Function Assertion Error: A bug in the export function causes a program to be incorrectly exported without raising an error, despite a runtime assertion that should fail given the input tensor.
- Compilation Error in Windows Inductor Tests: A bug in PyTorch's generated code for Windows inductor tests creates a zero-size array, resulting in a compilation error with MSVC.
- Overflow Error in softshrink Function: The `torch.nn.functional.softshrink` function throws an overflow error on CUDA with a `float16` tensor and a `lambda` value exceeding `float16` limits, unlike on CPU.
- Inconsistency in torch.clamp_min Function: An inconsistency in `torch.clamp_min` behavior is noted, where CPU raises an overflow exception for float16 values, while CUDA converts them to infinity.
- Inductor Expected to Produce Triton Kernel: A discrepancy is noted where PyTorch inductor is expected to produce a Triton kernel, but eager mode is called instead, requiring a codebase update.
- UnboundLocalError with torch.amp.autocast in Export: A bug in PyTorch causes an `UnboundLocalError` during model export with `torch.amp.autocast`, obscuring a `RuntimeError` related to device propagation.
- Floating Point Exception with torch.fmod: The `torch.fmod` function causes a "Floating point exception (core dumped)" error with a very large negative integer divisor, suggesting a proper exception should be raised.
- Avoiding Unnecessary Recompilations with Strides: The challenge of avoiding unnecessary recompilations in PyTorch when handling tensors with strides equal to 1 is addressed, suggesting marking strides as unbacked.
- Inconsistent Error Handling in cosine_similarity: An inconsistency in `cosine_similarity` error handling is noted, where CPU raises a RuntimeError for type conversion overflow, while CUDA returns NaN values.
- Error Handling Discrepancy in torch.addcmul(): A discrepancy in error handling between CPU and CUDA implementations of `torch.addcmul()` is noted, where CPU raises a RuntimeError for integer overflow, while CUDA does not.
- ONNX Export Failure with AutoencoderKLCosmos Model: A failure in exporting the AutoencoderKLCosmos model to ONNX is reported due to a shape inference problem with ConvTranspose operations.
- Post-Mod Feature in FlexAttention Module: A feature request is made to add "post_mod" support in the FlexAttention module to enable post-softmax or head-mix variants of MultiTokenAttention.
- Inconsistent Results with F.instance_norm in JIT: A bug in the PyTorch JIT compiler causes inconsistent results with the `F.instance_norm` function when using custom running statistics, compared to eager mode.
- Symbolic Condition Handling in eval Function: A problem in the `eval` function is reported, where a symbolic condition involving `sym_or` is not handled correctly, despite similar logic working with a direct logical `or`.
- Floating Point Exception in native_channel_shuffle: The `torch.native_channel_shuffle()` function crashes with a "Floating point exception (core dumped)" error when called with a large integer parameter on an int32 tensor.
- Segmentation Fault in torch.lu_unpack(): A segmentation fault occurs in `torch.lu_unpack()` when called with a bfloat16 tensor containing extreme values and an empty pivots tensor.
- Inconsistent Behavior in torch.quantile: A discrepancy in `torch.quantile` behavior is noted when handling infinite and NaN values, producing different outputs on CPU and CUDA.
- Incorrect Contiguity in NestedTensor Representation: A bug in PyTorch causes a NestedTensor to incorrectly show `contiguous=True` after a transpose operation, highlighting a discrepancy in memory layout reporting.
- torch.chunk Function Failure with NestedTensor: The `torch.chunk` function fails to operate on a `NestedTensor` with a jagged layout when the second dimension is not ragged, resulting in an unhelpful exception.
- Decorator Bug in test_decompose_mem_bound_mm.py: A bug in a specific decorator causes the test environment to return an empty list and run zero tests, suggesting a defense mechanism against such issues.
- Online Softmax Disabled in Transformer Inference: Online softmax functionality is disabled during a transformer forward inference pass using `torch.compile()`, due to Inductor's decision to split the reduction.
- Proposal for SlimTensor Representation: A proposal is made to introduce a new lightweight tensor representation called SlimTensor to enable AOTInductor to generate minimal, self-contained binaries for PyTorch models.
- Failure of torch.arange in torch.ops.higher_order.scan: The use of `torch.arange` within `torch.ops.higher_order.scan` fails due to the lack of a guard for unbacked symfloat, causing errors when rewriting a loop.
- Support for Backward Pass in torch.export: A request is made for `torch.export` to handle models requiring backward pass computation during inference, as current limitations prevent its use for latency-sensitive models.
- TorchScript Crash with Fake Tensors in Export: A bug in PyTorch's export function causes a crash when a fake tensor is passed to the TorchScript interpreter, suggesting a warning and disabling TS in such cases.
- Extending CPP Extension API for SYCL Kernels: A proposal is made to extend the PyTorch CPP Extension API to support SYCL kernels, enabling new operator development for Intel GPU platforms.
- Disabled Test on ROCm MI300 Runners: A disabled test, "test_ddp_apply_optim_in_backward," is failing on ROCm MI300 runners in continuous integration, suspected to be a flaky failure.
- Auto Format Lint Not Making Suggestions: A problem in the PyTorch GitHub project where the auto format lint is not making suggestions, possibly due to incompatibility with CI workflows.
- Output Discrepancy in conv_transpose2d Function: A bug in `torch.nn.functional.conv_transpose2d` causes significantly different outputs on CPU and CUDA when using `float16` tensors with specific settings.
- rrelu Function Crash on CPU: The `torch.nn.functional.rrelu` function crashes on a CPU when `training=True` and either `lower` or `upper` is set to infinity, while it works without error on CUDA.
- Change in ProcessGroupNCCL Component: A change in the PyTorch nightly build affects the `ProcessGroupNCCL` component, which now uses merged CUDA streams, impacting tensor lifetime management and profiling.
- Cleanup of autotune_fallback_to_aten Feature: The deprecated `autotune_fallback_to_aten` feature is being cleaned up in PyTorch by removing references in tests and benchmarks and updating configuration settings.
- Accuracy Discrepancy with torch.compile: A discrepancy in accuracy is noted when training with `torch.compile`, where numerical results diverge when using the `inductor` backend compared to `eager` and `aot_eager`.
- OOM Errors with FSDP reshard_after_forward: Using `reshard_after_forward=int` with shared post-forward device meshes in FSDP leads to GPU OOM errors, with discussions on potential long-term solutions.
- Complexity of reshape_view_helper Function: The complexity of the `reshape_view_helper` function in PyTorch's autograd is discussed, questioning whether a simpler version should be used to avoid data-dependent errors.
- Error Handling Discrepancy in fused_moving_avg_obs_fake_quant: A discrepancy in error handling between CPU and GPU implementations of `fused_moving_avg_obs_fake_quant` is noted, suggesting a validity checker for CPU.
- Behavior Inconsistency in lp_pool1d Function: An inconsistency in `lp_pool1d` behavior is noted, where CPU returns a tensor of zeros without error, while GPU throws an "integer out of range" error.
- Segmentation Fault with choose_qparams_optimized: A segmentation fault occurs when `choose_qparams_optimized()` is called with empty tensors and an extremely large `num_bins`, causing a program crash.
- Floating Point Exception with pixel_shuffle: The `torch.pixel_shuffle()` function causes a floating point exception when called with empty tensors and an extremely large `upscale_factor`, resulting in a core dump.
- Segmentation Fault with Sparse COO Tensor Conversion: A segmentation fault occurs when converting a sparse COO tensor with complex128 values to a dense tensor, traced to the complex number addition operator.
- Import Failure with BlockMask and flex_attention: Importing `BlockMask` and `flex_attention` from `torch.nn.attention.flex_attention` fails with `torch.device("meta")`, resulting in a `RuntimeError`.
- Selective Activation Checkpointing on Custom autograd.Function: Implementing Selective Activation Checkpointing on a custom `autograd.Function` is discussed, as the current implementation does not recognize it as an operation.
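Two of the bullets above lend themselves to short sketches. First, the `torch.arange()` discrepancy concerns an impossible range being handled differently per device; the CUDA behavior in the comment below is as reported in the issue, not independently verified:

```python
import torch

# CPU: an impossible range (start > end with the default positive step) raises a RuntimeError.
try:
    torch.arange(10, 0)
except RuntimeError as err:
    print("CPU raised:", err)

# CUDA: the issue reports that the same call silently returns an empty tensor.
if torch.cuda.is_available():
    print("CUDA returned:", torch.arange(10, 0, device="cuda"))
```

Second, the `functools.lru_cache` bullet concerns code of roughly this shape, where Dynamo reportedly traces only the wrapped function and does not model the cache itself (a hypothetical example, not taken from the issue):

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def scale_for(head_dim: int) -> float:
    return head_dim ** -0.5

@torch.compile
def attn_scores(q, k):
    # The cached helper is called from compiled code; per the issue, only the
    # underlying function is traced, so the caching semantics are not captured.
    return (q @ k.transpose(-1, -2)) * scale_for(q.shape[-1])

print(attn_scores(torch.randn(2, 8, 16), torch.randn(2, 8, 16)).shape)
```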
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 69
Summarized Issues:
- Bugs in PyTorch's Triton and CPP backends: This topic covers issues where PyTorch functions produce incorrect results or errors when executed on Triton and CPP backends. The `torch.nn.PairwiseDistance(p=2)` function shows discrepancies in eager mode, and the `torch.cummin` function throws a `NameError` due to an undefined variable in Triton kernel code.
- Documentation and Linter Issues in PyTorch: Several issues highlight problems with PyTorch's documentation and linter. The docstring linter incorrectly requires documentation for overridden methods, and there are inconsistencies in the documentation of functions like `torch.min()` and `torch.max()`.
- Graph Breaks and Compilation Errors in PyTorch: PyTorch faces issues with graph breaks and compilation errors. A graph break occurs with the `.t()` method on tensor subclasses, and a missing function declaration causes compilation failures on MSVC.
- Performance Optimization and Memory Usage in PyTorch: PyTorch projects encounter performance and memory usage issues. Optimizing SymPy expression printing can reduce a 10% performance cost, while a model unexpectedly consumes more GPU memory when loaded.
- Intel GPU Driver and Triton Errors in PyTorch: The latest Intel GPU Driver introduces breaking changes causing `torch.compile` to fail on Windows, and Triton errors occur due to renamed functions and interactions with operations like `select_scatter`.
- Non-picklable Operations and Error Messages in PyTorch: PyTorch faces issues with non-picklable operations and error messages. The graph pickler raises exceptions for non-picklable `torch.ops.aten` operations, and Inductor's error messages need enhancement for debugging.
- ONNX Export and Unit Test Failures in PyTorch: PyTorch needs improvements in ONNX export and faces unit test failures. Users are encouraged to enable `dynamo=True` when exporting to ONNX, and specific tests fail due to mismatched error messages.
- Padding and Pooling Function Documentation in PyTorch: PyTorch's documentation for pooling functions needs updates. The `padding` parameter in `avg_pool2d` should include a note about limitations, and basic usage descriptions for several functions need clarification.
- C++ Code Generation and Attention Mask Errors in PyTorch: PyTorch faces issues with C++ code generation and attention masks. A missing function declaration causes compilation failures, and a custom attention mask causes errors during training with flex attention.
- XPU Support and Memory Errors in PyTorch: PyTorch encounters issues with XPU support and memory errors. The function `torch.xpu.is_bf16_supported()` returns incorrect values, and memory errors occur with the `flex_attention` function for large time series.
- AsyncCollectiveTensor and Kernel Operations in PyTorch: PyTorch faces issues with `AsyncCollectiveTensor` and kernel operations. An `AsyncCollectiveTensor` does not trigger a `wait_tensor` upon a dtype cast, and binary kernel operations produce incorrect results with wrapped scalars.
- Docker Caching and Distribution Gradients in PyTorch: PyTorch projects face issues with Docker caching and distribution gradients. Docker caching does not function correctly on MI300 runners, and cross-mesh DTensor gradient allreduce lacks support.
- Tensor Method Discrepancies and Performance Regressions in PyTorch: PyTorch faces discrepancies in tensor methods and performance regressions. The `torch.Tensor.put_` method behaves differently on CPU and GPU, and a minor performance regression occurs in the modded-nanogpt project.
- Build Errors and CUDA Issues in PyTorch: PyTorch projects encounter build errors and CUDA issues. Building on WSL with specific CUDA architectures fails, and a CUDA error occurs during DDP training with RTX 5090 GPUs.
- Data Type Performance and Segmentation Faults in PyTorch: PyTorch faces performance issues with data types and segmentation faults. The `torch.dot` function is slower with `float16` and `bfloat16`, and a segmentation fault occurs in the `max_unpool2d` function.
- Quantization and Large Tensor Limitations in PyTorch: PyTorch faces issues with quantization and large tensor limitations. A "False INTERNAL ASSERT FAILED" error occurs when saving a quantized model, and a 2D convolutional layer fails with large tensors.
- Documentation Discrepancies and Negative Entropy Values in PyTorch: PyTorch faces documentation discrepancies and negative entropy values. The documentation incorrectly states the use of a nondeterministic `index_add`, and the `Beta.entropy` function returns negative values.
- Code Cleanup and Sharding Visualization Errors in PyTorch: PyTorch projects involve code cleanup and sharding visualization errors. An unnecessary header file inclusion is removed, and the `visualize_sharding` function fails with an "IndexError".
- Non-deterministic Sampling and Test Disabling in PyTorch: PyTorch faces issues with non-deterministic sampling and test disabling. The `torch.multinomial` function produces non-deterministic results, and several tests in the `TestInductorOpInfoXPU` suite are disabled due to failures.
- Overflow and Inference Errors in PyTorch: PyTorch faces overflow and inference errors. The `torch.addcdiv` operation overflows to infinity with a scalar input, and inference outputs produce random 'nan' values on device 'xpu:1'.
- Randomness and Symbolic Operations in PyTorch: PyTorch faces issues with randomness and symbolic operations. The `torch.Tensor.bernoulli_()` function generates different outcomes on CPU and GPU, and the necessity of a symbolic floating-point operation in the ONNX exporter is questioned.
- Test Disabling and Dtype Promotion in PyTorch: PyTorch projects involve test disabling and dtype promotion issues. A test in the `AOTInductorTestABICompatibleGpu` suite is disabled, and dtype promotion behaves unexpectedly when performing operations between a tensor and a scalar.
- Complex Tensor Function and Multihead Attention Issues in PyTorch: PyTorch faces issues with complex tensor functions and multihead attention. The `asin()` function for complex tensors returns inconsistent results, and `in_proj_bias` values for Q and K biases are unexpectedly zeros.
- CI Infrastructure and Build Process in PyTorch: PyTorch projects involve testing CI infrastructure resilience and build process issues. A scenario simulates ignored requests causing runner scarcity, and an unstable build process is resolved by reducing build time.
- CI Job Instability and TensorFloat32 Concerns in PyTorch: PyTorch projects face CI job instability and TensorFloat32 concerns. A CI job is marked unstable due to a parsing failure, and TensorFloat32 is disabled by default on Ampere-class GPUs without warning.
- Pattern Matching and Performance Anomalies in PyTorch: PyTorch faces issues with pattern matching and performance anomalies. A mismatch in the PatternMatcher arises from a user count discrepancy, and a performance anomaly occurs with the `torch.from_numpy().cuda()` pattern.
- Function Behavior and Pretrained Model Issues in PyTorch: PyTorch faces issues with function behavior and pretrained models. The `torch.arange()` function behaves inconsistently on CPU and GPU, and `AutoModel.from_pretrained(...)` fails within a `torch.device("meta")` context.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 193
Key Open Pull Requests
1. [associative_scan] Autograd for additional inputs: This pull request implements the autograd feature for additional inputs in the PyTorch project, building upon a previous pull request, and includes various commits that address issues, improve documentation, and refine the implementation to ensure functionality and compatibility.
- URL: pull/153317
- Merged: No
- Associated Commits: 9c49a, 100c5, 0e7c8, 67d62, 9e5e1, 6b415, d68b3, 9de0c, 653ea, ac1a1, d51b1, dbfd5, 64a8b, 08b72, f3788, b1277, 63350, 89ea8, c8bbd, 1c9fa, 6d835, 3f816, 62770, 67617, b73c5, ea956, 0786e, c628d, e7d63, 95d0e, 3707c, dc03c, 932f3, 2edd6
2. try relanding cublaslt autotuning support for TunableOp #: This pull request aims to reintroduce cublasLt autotuning support for the TunableOp in the PyTorch project, with additional tasks including the addition and testing of ScaledGemmTunableOp and obtaining benchmarking numbers, as evidenced by multiple commits addressing autotuning, support for various data types, and compatibility fixes for CUDA and ROCm platforms.
- URL: pull/153316
- Merged: No
- Associated Commits: 3a6e6, 6ec4b, 260c2, 15e4b, 08bcd, 4d060, 278a8, 62ea6, 0316c, 676b8, 85391, 02a11, 99cc7, 6e555, ce341, cccd3, 14fa0, 2a5d7, 68a55, ba332, c0fe6, 945b2
3. [Set] Add `set.issubset` and `set.issuperset`: This pull request proposes the addition of `set.issubset` and `set.issuperset` methods to the PyTorch project, as part of a series of related changes tracked through a stack of pull requests (a short sketch of the motivation follows this entry).
- URL: pull/152902
- Merged: No
- Associated Commits: 86a00, 45494, 38079, 8e815, 7a75b, 5a90c, 3bfc3, b583a, 3511f, 99fa6, 0ee20, 8d825, ccdb9
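To make the motivation concrete, code like the following exercises `set.issubset` under `torch.compile`; whether it traces without a graph break depends on this stack landing. This is a hypothetical example, not taken from the pull request's tests:

```python
import torch

@torch.compile
def maybe_activate(x, enabled_ops):
    # Dynamo must model set.issubset for this check to be traced rather than
    # falling back to eager execution at a graph break.
    if {"relu"}.issubset(enabled_ops):
        return torch.relu(x)
    return x

print(maybe_activate(torch.randn(4), {"relu", "gelu"}))
```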
Other Open Pull Requests
- Error Handling Enhancements: Several pull requests focus on improving error handling in the PyTorch project. These include raising `KeyError` and `TypeError` for specific conditions, such as when an element is not in a set or when there is a mismatch in the number of arguments provided.
- Set Method Enhancements: Multiple pull requests propose the addition and update of set methods in the PyTorch project. These include methods like `set.difference_update`, `set.intersection_update`, and `set.symmetric_difference_update`, which are part of a series of related changes managed through the ghstack tool.
- Set and Frozenset Initialization: A pull request aims to implement the correct initialization behavior for the `set` and `frozenset` data structures. This involves a series of related commits and pull requests to ensure proper functionality within the PyTorch project.
- Debugging and CI Modifications: A pull request titled "[Don't merge] Debug" involves modifications to the CI models list and torchbench model list. It includes adding and removing specific models for debugging purposes without merging into the main branch.
- FrozenSet Fixes: A pull request addresses fixes for the `FrozenSet` in the PyTorch project. It is part of a larger stack of changes and includes multiple updates marked as "[ghstack-poisoned]" across several commits.
- Exception Handling in ConstantVariable: A pull request addresses the handling of exceptions in the ConstantVariable operation within the PyTorch project. It is part of a series of related changes managed through the ghstack tool, involving multiple commits.
- Testing Enhancements: A pull request aims to add testing to ensure that the stride order in the code matches the expected configuration. This involves multiple updates and contributions from various collaborators within the PyTorch project.
- Higher Order Gradients Support: A pull request aims to enhance the PyTorch library by supporting higher order gradients with the `create_graph=True` option. It addresses potential issues with double backward computations when the forward trace includes non-composite implicit operations.
- Build Tools Update: A pull request updates the build tools version for PyTorch by using the latest Microsoft Visual Studio Compiler (MSVC). It ensures that AVX-512 instructions are correctly configured, with modifications to the findavx module and additional checks for Visual Studio path searches.
- New Operation Introduction: A pull request introduces a new operation, `onednn.qbatch_norm2d`, designed to compute uint8 batch normalization on the CPU using AVX512 instructions. It offers performance comparable to the existing `quantized.batch_norm2d` for the QuantizedCPU device.
- Linter Introduction: A pull request introduces a linter to automatically detect and flag instances where contributors have hardcoded the "cuda" device in test cases. This helps prevent test failures on XPU and disrupts the XPU CI process.
- Codebase Improvements: A pull request involves replacing the `unimplemented` function with `unimplemented_v2` in the `torch/_dynamo/variables/misc.py` file. It is part of a series of updates to improve the PyTorch codebase, addressing issue #147913 and following up on #152274.
- VariableBuilder Support for Sets: A pull request aims to introduce support for sets in the VariableBuilder within the PyTorch project. It is part of a series of stacked changes managed through the ghstack tool and is currently a work in progress.
- Ruff Linter Update: A pull request updates the Ruff linter to version 0.11.8 to address numerous false negatives across the codebase. It improves the validation of NOQA comments and includes various changes such as fixing typos and adjusting external rules in the `pyproject.toml` file.
- Negative Dimension Issue Fix: A pull request addresses a negative dimension issue within the parallel loss context manager. It implements a solution similar to the one proposed in issue #152016, with contributions including normalization additions and review commits.
- Header-Only File Implementation: A pull request proposes moving the `c10/core/DeviceType.h` file to a new `torch/csrc/header_only` directory to implement it as a header-only file. It retains a copy in its original location for backward compatibility and plans to move more header files in the future.
- Output Stride Consistency: A pull request aims to ensure that the output stride in the `invoke_subgraph` function is consistent with the eager execution mode. This is indicated by the title and multiple updates in the commit messages.
- EVT Component Testing: A pull request introduces end-to-end tests for the EVT component within the Cutlass project. It includes multiple updates and revisions to ensure comprehensive testing, while engaging several contributors for review and collaboration.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 273
Key Closed Pull Requests
1. [Release/2.6] Remove --no-deps and --no-index flag when building pytorch in docker build: This pull request addresses the removal of the `--no-deps` and `--no-index` flags when building PyTorch in a Docker environment, as the old setuptools version (<80.0.1) was installing the requirements of the torch and torchvision wheels, but the new setuptools version (>=80.0.1) does not install these requirements, necessitating the change.
- URL: pull/153163
- Merged: No
- Associated Commits: c69ea, af92b, aad1c, f3c08, 5363f, 5fbc4, 2b84d, c92f6, 1d3ff, f9e99, 46f55, 0cdf8, 6628b, c953e, 22775, 9b688, 4b9b7, f61bf, 31b52, b1a10, 5eb54, d9eed, 41811, 23e39, f01a6, 929ef, 4e418, 478a9, 7d329, 3a3de, f35ab, 7092d, 8c034, be126, e1858, d155d, 4d9de, a99cc, 51829, 983ea, 47f4e, eb304, 6e304, e2067, 57421, a61b5, 4658a, e19c1, 232eb, 1d2c2, a2639, cd15d, 9c34a, 8d4b8, dcb8a, 7be6b, ca3c3, 32070, 2236d, 1eba9, 93864, 88b97, bbd00, 1f32b, d33dd, f1481, ea546, ac7d6, ed487, 66dfe, 3783d, 8adc1, 5c4fa, 639ee, b445b, 8d72c, 374e5, 6a3b5, e607b, aafc7, d5947, 1b753, ba1ba, 70f30, 3398f, 8354d, 737cf, 4202f, 7c27e, 2e2c7, 3a818, 53ad2, 8eb5d, dbe8c, fcdff, 92b55, f6789, 2e1ed, 13339, 82ac2, 3608e, bfb23, 86b0a, 03714, 34caa, ac032, 5dd61, 7d528, d9a03, 7c072, 73dd0, b08d9, d70a9, 7ad5a, 2fd46, ed8c6, bf084, 20ad8, 2fb0a, 8cfa9, 50a04, 45896, 9d0a4, 1a808, 6fe84, a3632, 68180, fb24f, e53a9, 2cda1, 9d566, a7044, c7ba8, cbd7b, c3733, faf90, a87c9, ce6b7, dc41a, 469ce, 8ccfc, 1290e, 93693, 75628, 011d3
2. [Dynamo] Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/misc.py` [1/2]: This pull request involves replacing the `unimplemented` function with `unimplemented_v2` in the `torch/_dynamo/variables/misc.py` file as part of a larger task referenced by issue #147913, and includes multiple commits, many of which are co-authored by William Wen, although it was ultimately not merged.
- URL: pull/152274
- Merged: No
- Associated Commits: bf165, 2219d, fc60c, fa228, 03059, 3f64d, 49785, b0c01, d8852, 7b1e2, 2fb11, 4c828, 27675, fd03c, e7319, 4b862
3. Broken Links GHA: This pull request introduces a GitHub Action designed to run monthly, which checks for broken links within the repository and automatically creates an issue listing any broken links found, although it was not merged into the main branch.
- URL: pull/151454
- Merged: No
- Associated Commits: 12318, 22c4f, 6cb4b, 8887f, 73a25, 49982, 261ab, 41ccd, f1f70, c6d60, 7618b, 5329e, d59d2, 00b3f
Other Closed Pull Requests
- Manylinux 2.28 and GCC 13 Update: This pull request aimed to update the Manylinux 2.28 images to use GCC version 13, as part of addressing an issue related to the PyTorch project. However, it was ultimately not merged.
- CI Setup Changes: This pull request proposes changing the continuous integration (CI) setup for the PyTorch project by using CMake installed via pip instead of conda within the CI Docker images. Despite the proposed changes, it was ultimately not merged.
- Subgraph Name Refactoring: This pull request involves refactoring the process of fetching subgraph names within the PyTorch project. It was part of a series of changes managed through the ghstack tool, although it was ultimately not merged.
- XPU Backend Convolution Fix: This pull request addresses an issue in the PyTorch project where the XPU backend's convolution operation bypasses the device guard logic. It proposes a solution to ensure proper device management across multiple devices.
- Recursive DCE on Subgraphs: This pull request involves implementing a recursive dead code elimination (DCE) process on subgraphs within the PyTorch project. Despite multiple updates and discussions among contributors, it was ultimately not merged.
- MPS Kernel Improvements: This pull request introduces specialized kernels for the SDPA on MPS to improve performance, achieving significant speed improvements. It also addresses failing tests, adds new tests for specialized paths, and performs code cleanup.
- MPS Unary Operations Optimization: This pull request involves moving additional MPS unary operations to a TensorIterator-based Metal kernel to significantly improve runtime performance. It achieves a 20x speedup for certain operations, with plans for further migration in a subsequent pull request.
- Dtype Promotion Error Fix: This pull request addresses a dtype promotion error in the PyTorch project related to the concatenation decomposition process. It ensures that the cloning of a single tensor adheres to the correct dtype promotion rules.
- Cutlass Key Addition: This pull request aims to add a "cutlass key" for both Facebook's internal codebase and open-source software. It involves multiple updates and revisions as indicated by the series of commits and the associated differential revision.
- Flex Attention Symints Fix: This pull request addresses an issue in the Flex Attention component where symbolic integers from subgraph inputs and outputs were not explicitly stored. It proposes a fix to ensure these symints are properly captured and stored.
- Cutlass Library Enhancement: This pull request aims to enhance the Cutlass library by adding epilogue inputs and outputs to the `def_kernel` function. It is part of a series of updates tracked through the ghstack tool, although it was ultimately not merged.
- Docstring Linter Exemption: This pull request aims to address issue #151692 by proposing changes to exempt overriding methods from the docstring linter in the PyTorch project. However, it was ultimately not merged.
- ROCm CI Environment Upgrade: This pull request aims to upgrade the ROCm CI environment to ROCm version 6.4 and update all related GitHub workflows to use the Jammy distribution. It also modifies the install script for ROCm and addresses specific CI issues.
- Spack Includes Update for ROCm: This pull request focuses on updating the Spack includes for ROCm in the PyTorch project by cleaning up `caffe2/CMakeLists.txt`. It ensures compliance with Spack by using the `rocm-core` package and adjusts the order of `find_package` calls.
- Functorch Interpreters Serialization: This pull request aims to make Functorch interpreters serializable most of the time to enable saving guards on Functorch states. It includes various test cases to ensure functionality across different nested scenarios.
- Unsupported Float8 Multiplications: This pull request addresses the issue of unsupported float8 row-wise scaled matrix multiplications on the Blackwell architecture. It adds skips to the relevant tests to reduce noise on machines with compute capability `sm_120+`.
- Recursive Realization of stack_values: This pull request aims to implement a recursive realization of the `stack_values` in the Dynamo component of the PyTorch project. It potentially addresses issue #135696, as indicated by the series of commits and the involvement of multiple contributors.
- Tuple Iterator Serialization Testing: This pull request addresses the issue of testing serialization for `TUPLE_ITERATOR_LEN` in the PyTorch project. It implements a workaround to prevent tuple iterators from being exhausted during testing and invites insights for a more robust solution.
- `hasattr(tensor, "size")` Bug Fix: This pull request addresses a bug fix in the PyTorch project related to the `hasattr(tensor, "size")` check. It is part of a stack of changes and aims to resolve issue #135696, although it was ultimately not merged.
- Hierarchical Compilation Enhancement: This pull request aims to enhance the hierarchical compilation process by incorporating mutation dependencies into the topological sorting algorithm. It is part of a series of related updates in the PyTorch project.
- Vendor-Neutral Fault Recovery: This pull request aims to make the FR (flight recorder) code in the PyTorch project vendor-neutral by removing the dependency on `USE_C10D_NCCL`. It introduces a generic version with `c10::Event` that can be utilized by other backends like Gloo.
- Code Generation Symbol ID Fix: This pull request addresses an issue in the PyTorch project where the code generation process fails due to out-of-order symbol IDs in forward input expressions. It proposes to skip code generation for SymPy expressions with multiple symbols.
- Proxy Removal for autograd.Function.ctx: This pull request addresses the removal of proxying the `autograd.Function.ctx` into the computational graph due to the adoption of newer designs. It ensures the creation of an `fx.Proxy` for the `autograd.Function.ctx` object when necessary.
- ShapeEnv Cache Key Modification: This pull request addresses an issue in the `ShapeEnv.evaluate_expr()` function by modifying its cache key to include the global "suppress_guards" value. This fix was prompted by a test failure in `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error`.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
- MXFP8 Fix broken bias support for mxfp8
- Toxicity Score: 0.55 (Frustration expressed, Defensive responses, Mediation attempts, Escalating dissatisfaction.)
- This GitHub conversation involves several users discussing a pull request, with username1 expressing frustration over the lack of progress and username2 responding defensively. The tone shifts from collaborative to tense as username3 attempts to mediate, but username1's continued dissatisfaction escalates the tension.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 227 | 19 | 9 | 120 |
anijain2305 | 213 | 17 | 2 | 11 |
guilhermeleobas | 180 | 17 | 1 | 7 |
Skylion007 | 42 | 13 | 2 | 98 |
swolchok | 112 | 5 | 0 | 13 |
mlazos | 108 | 13 | 0 | 6 |
guangyey | 96 | 10 | 0 | 16 |
laithsakka | 66 | 15 | 15 | 20 |
FFFrog | 99 | 7 | 0 | 1 |
henrylhtsang | 79 | 11 | 7 | 10 |