Weekly GitHub Report for PyTorch: April 28, 2025 - May 05, 2025 (12:01:30)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant updates including support for torch.compile with Python 3.13, a new performance-related feature torch.compiler.set_stance, and FP16 support on X86 CPUs. Notably, the release also marks a shift away from publishing on Conda, with a focus on using official wheel packages or conda-forge, and introduces a backward-incompatible change by setting weights_only=True as the default for torch.load to enhance security.
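To make the two user-facing changes concrete, here is a minimal sketch; the checkpoint path is a placeholder and the stance string is one of the values documented for the new API, so treat this as an illustration rather than a definitive recipe:

```python
import torch

# Plain tensors and state_dicts keep loading as before, because
# weights_only=True (now the default) only admits tensor-like payloads.
torch.save({"weight": torch.randn(2, 2)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # weights_only=True is implied in 2.6+

# Checkpoints that pickle arbitrary Python objects now need an explicit,
# trusted opt-out.
legacy = torch.load("ckpt.pt", weights_only=False)

# The new compiler stance API from the same release; "force_eager"
# globally disables compilation for quick debugging.
torch.compiler.set_stance("force_eager")
```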
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Illegal Instruction Caused by grid_sample Under Windows: This issue involves a bug in PyTorch 2.7.0+cu118 on Windows 10, where using grid_sample with float64 tensors causes an "illegal instruction" error, leading to crashes, particularly affecting CI/CD processes. The problem is specific to this PyTorch version and does not occur with previous versions or when using float32 tensors.
- The comments discuss linking the issue to a previous one, but it is determined to be a new problem specific to PyTorch 2.7, possibly due to a switch to VS2022 for building. Debugging reveals the issue is related to AVX512 instructions running on AVX2 machines, and a rollback to VS2019 is suggested. A PR is drafted to revert to VS2019, and testing confirms that this resolves the issue.
- Number of comments this week: 14
- NCCL out of memory error after updating to PyTorch 2.7: This issue describes a problem encountered after updating to PyTorch 2.7, where using init_process_group with NCCL and calling DDP(model, device_ids=[rank]) results in an out-of-memory error, despite using minimal memory. The error did not occur before the update, and the same code worked fine with previous versions of PyTorch.
- The comments discuss potential causes and solutions for the out-of-memory error, including running with NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALLOC for more detailed logs. It is confirmed that the error does not occur when using a single GPU or when using the gloo backend instead of NCCL. A suggestion to run with NCCL_CUMEM_HOST_ENABLE=0 resolves the issue, indicating a potential problem with cuMem host allocations, especially under WSL. The discussion also mentions a forthcoming NCCL patch that might address similar issues (see the environment-variable sketch after this list).
- Number of comments this week: 11
- Newly added lint-urls jobs are very flaky: This issue reports that the newly added lint-urls jobs in the PyTorch project are intermittently failing on pull requests and the main branch, causing instability in the continuous integration process. The issue is being actively discussed and addressed by the development team, with suggestions to mark the job as unstable and investigate potential causes such as rate limiting by external services.
- The comments discuss various examples of the job failures and potential solutions, including marking the job as unstable, investigating rate limits, and applying fixes. There is a suggestion to disable the job temporarily, and discussions on how to handle false positives in the script. The team is working on ensuring stability and considering using tags to skip checks, with some fixes already implemented and under observation.
- Number of comments this week: 10
- setup.py develop command is disappearing soon from setuptools: This issue highlights the impending removal of the setup.py develop command from setuptools, which PyTorch currently relies on for development and continuous integration processes. The deprecation of this command necessitates urgent action to transition to alternative methods, such as using pip install -e . -v --no-build-isolation, and potentially adopting new developer tools to manage builds more effectively.
- The comments discuss the broader impact on related PyTorch projects and suggest limiting the setuptools version as a short-term solution. There is a proposal to adopt a dedicated developer CLI tool for better user experience, and a deadline is noted for transitioning away from the setup.py interface. Additionally, a related issue for dynamic dependency pinning is mentioned.
- Number of comments this week: 7
- Update torch/nn/modules/conv.py to use Literal for support padding modes: This issue proposes updating the torch/nn/modules/conv.py file in the PyTorch project to use typing.Literal for specifying supported padding modes, instead of using a generic str type, to enhance type checking and catch potential bugs. The goal is to improve code reliability by explicitly defining the supported padding modes, such as "valid" and "same", which can be verified by the type checker before the code is executed (see the typing sketch after this list).
- Multiple contributors expressed interest in working on the issue, with some seeking approval to proceed and others offering to resolve related errors. There was a concern about coordination to avoid duplicate efforts, and a contributor highlighted additional files with similar issues, suggesting further improvements.
- Number of comments this week: 7
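To make the typing proposal in the last item concrete, here is a minimal sketch of what a Literal-based annotation could look like; the alias name and helper function are illustrative, not the actual signature in torch/nn/modules/conv.py:

```python
from typing import Literal, Union

import torch.nn as nn

# Hypothetical alias for the string padding values mentioned in the issue;
# Conv2d also accepts int/tuple padding, hence the Union.
PaddingStr = Literal["valid", "same"]

def make_conv(padding: Union[int, PaddingStr] = 0) -> nn.Conv2d:
    return nn.Conv2d(3, 8, kernel_size=3, padding=padding)

conv = make_conv("same")   # accepted by a type checker and at runtime
# make_conv("smae")        # a typo like this is flagged by mypy before runtime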
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs within a Python script that utilizes the OotdPipeline and attempts to compile certain components with Torch's compile function, specifically when using the 'inductor' backend.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing the MaxPool2D operation in PyTorch when the stride is equal to 1, suggesting that using multiple smaller MaxPool2D operations can reduce computational costs on a CPU. The approach involves representing a larger kernel size with multiple smaller ones, which has been shown to yield a speedup in processing time, as demonstrated by the provided testing code and performance comparisons (a short equivalence sketch follows this list).
- cuda_utils.so: failed to map segment from shared object: This issue involves a problem encountered when running a PyTorch model within a Docker container, where the execution of a cached shared object file, cuda_utils.so, fails due to a missing execution permission despite being run as the root user. The error occurs specifically in a Docker environment with a tmpfs permission set to 1777, and the file in question lacks the execution bit, leading to an ImportError when attempting to map a segment from the shared object.
- Enable UFMT on all files in PyTorch: This issue involves enabling the UFMT (Universal Format) tool on all files within the PyTorch codebase, which currently has approximately 1,500 files that are not formatted according to UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
- [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the size of exported models, particularly for deployment on mobile devices, where storage space is limited, as demonstrated by the user's experience of reducing a model's size from 6.7MB to 5.6MB by manually removing these debug files.
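For the MaxPool2D proposal above, the core idea can be sketched in a few lines: when the stride is 1, a large pooling window is equivalent to composing smaller windows, since a sliding 3-wide max applied twice covers a 5-wide window. This only illustrates the equivalence, not the actual patch discussed in the issue:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 64, 64)

# Reference: one MaxPool2d with a 5x5 window and stride 1.
ref = F.max_pool2d(x, kernel_size=5, stride=1)

# Equivalent composition: two 3x3 stride-1 pools. Each pass grows the
# effective window by 2, so 3x3 followed by 3x3 covers 5x5.
composed = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                        kernel_size=3, stride=1)

assert torch.equal(ref, composed)
```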
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 98
Summarized Issues:
- Forward Compatibility in torch.export: The lack of forward compatibility in torch.export for PyTorch is problematic for users converting models to CoreML and Litert, as these conversions require specific and conflicting PyTorch versions. Unlike torchscript, which allows exporting in one environment and converting in another, users are inquiring about plans to address this once torch.export exits beta.
- Runtime Errors and Compatibility Issues: Various runtime errors and compatibility issues have been reported, such as a runtime error with the transformers library on Windows (arm64) due to a missing module, and a bug in PyTorch nightly build PT2.8 causing incorrect outputs on Intel GPUs. These issues highlight the challenges in maintaining compatibility across different systems and versions.
- Type Errors and Regression Bugs: Upgrading to newer PyTorch versions has led to type errors in nn.Module.dtype when using torch.autocast, and discrepancies in torch.sparse.log_softmax outputs between CPU and CUDA. These issues suggest potential regressions in type handling and execution consistency.
- Overflow and Precision Issues: Unexpected overflow behavior in torch.addcmul with mixed precision tensors and incorrect results in torch.compile() for the asinh_() operation highlight precision and overflow challenges in PyTorch. These issues emphasize the need for careful handling of precision in tensor operations.
- Distributed and Parallel Processing Bugs: Bugs in distributed processing, such as dist.all_reduce with dist.ReduceOp.SUM and torch.xpu.is_bf16_supported() returning incorrect values, indicate challenges in distributed and parallel processing. These issues affect the reliability of distributed operations in PyTorch.
- Memory and Performance Issues: Out-of-memory errors with NCCL backend in PyTorch 2.7 and recompilation errors with FP8 support highlight memory and performance challenges. These issues underscore the importance of efficient memory management and compatibility in high-performance computing environments.
- Attribute and Error Handling Bugs: Bugs such as AttributeError in custom module classes and incorrect error messages in torch.compile with torch.nn.functional.multi_head_attention_forward indicate issues in attribute handling and error reporting. These issues affect the usability and debugging experience in PyTorch.
- Checkpointing and Compilation Errors: Distributed checkpointing issues and compilation errors in the torch._dynamo module highlight challenges in checkpointing and code compilation. These issues affect the stability and reliability of PyTorch in distributed and compiled environments.
- Compiler and Code Generation Inefficiencies: Inefficiencies in code generation, such as unnecessary kernel creation and outdated documentation, highlight challenges in optimizing PyTorch's compiler. These issues affect the performance and maintainability of PyTorch's codebase.
- Documentation and Test Failures: Lack of documentation for functions like torch.nonzero_static and test failures on specific platforms indicate challenges in documentation and testing. These issues affect the accessibility and reliability of PyTorch's features and tests.
- Debugging and Logging Issues: Inconsistent debug log generation and outdated Dynamo overview documents highlight challenges in debugging and documentation. These issues affect the ability to effectively troubleshoot and understand PyTorch's internals.
- Instruction and Build Environment Errors: "Illegal instruction" errors on Windows 10 and test failures due to specific commits highlight challenges in build environments and instruction set compatibility. These issues affect the stability and compatibility of PyTorch across different systems.
- Tensor and Execution Errors: Bugs in tensor operations, such as .item() on DTensor and torch.nn.functional.ctc_loss, highlight challenges in tensor handling and execution. These issues affect the correctness and reliability of tensor operations in PyTorch.
- Quantile and Compilation Discrepancies: Inconsistent handling of NaN values in torch.quantile and incorrect tensor outputs in static compilation highlight challenges in numerical accuracy and compilation consistency. These issues affect the reliability of numerical computations in PyTorch (a brief NaN-handling sketch follows this list).
- Build and Dependency Management: Migrating Docker containers to newer GCC versions and managing optional dependencies like "optree" highlight challenges in build and dependency management. These issues affect the maintainability and compatibility of PyTorch's build system.
- Memory Leaks and Optimization Issues: Memory leaks with MPS backend and inefficiencies in Profile-Guided Optimization highlight challenges in memory management and optimization. These issues affect the performance and resource utilization of PyTorch.
- Compilation Time Variance and Kernel Errors: Variance in compilation times and incorrect results from binary kernel operations highlight challenges in compilation efficiency and kernel execution. These issues affect the performance and correctness of PyTorch's compiled code.
- Build and CUDA Detection Issues: Difficulties in building PyTorch from source and CUDA detection errors highlight challenges in build configuration and hardware compatibility. These issues affect the ability to build and run PyTorch on different systems.
- Performance Discrepancies and Regression Issues: Performance discrepancies in scaled dot-product attention and regressions in AOTI for specific models highlight challenges in performance optimization and regression testing. These issues affect the efficiency and stability of PyTorch's features.
- Unexpected Output and CI Testing: Unexpected output logits with hooks and loops, and running CI on Triton pin updates highlight challenges in output consistency and continuous integration testing. These issues affect the reliability and testing of PyTorch's features.
- Python-less Environment and Profiler Crashes: Extending torch.compile to Python-less environments and profiler crashes with PyTorch Lightning highlight challenges in environment compatibility and profiling. These issues affect the usability and debugging of PyTorch in different environments.
- Distributed Operations and Release Tracking: Intermittent hangs with NCCL and release tracking for PyTorch 2.7.1 highlight challenges in distributed operations and release management. These issues affect the stability and coordination of PyTorch's distributed features and releases.
- Strides and Assertion Errors: Incorrect strides in nonzero_static and assertion errors in TestFlexAttentionCUDA highlight challenges in tensor handling and test reliability. These issues affect the correctness and robustness of PyTorch's tensor operations and tests.
- Segmentation Faults and Similarity Checks: Segmentation faults in ProcessGroupGloo.allgather_into_tensor_coalesced and proposals for similarity checks highlight challenges in distributed operations and testing utilities. These issues affect the reliability and testing of PyTorch's distributed features.
- Docker Caching and Error Messages: Docker caching issues on MI300 runners and unclear error messages in infer_size(a, b) highlight challenges in caching and error reporting. These issues affect the efficiency and usability of PyTorch's build and error handling processes.
- Kernel and Utility Package Proposals: Bugs in torch.__group_gemm and proposals for a cuda_tools utility package highlight challenges in kernel execution and device management. These issues affect the performance and usability of PyTorch's kernel operations and device management.
- Documentation and String Support: Lack of documentation for FlexAttention and string support in torch.library.custom_op highlight challenges in documentation and feature support. These issues affect the accessibility and extensibility of PyTorch's features.
- Architecture Support and Foreach Operations: Adding SASS support for NVIDIA architectures and bugs in torch._foreach_pow highlight challenges in architecture compatibility and foreach operations. These issues affect the compatibility and correctness of PyTorch's operations on different architectures.
- CI Workflow and Dependency Issues: CI workflows not triggering and dependency issues in Triton Windows build highlight challenges in continuous integration and dependency management. These issues affect the reliability and maintainability of PyTorch's CI processes and builds.
- Test Failures and Runtime Errors: Test failures on MI200 platforms and runtime errors with torch.func.jacfwd highlight challenges in test reliability and runtime execution. These issues affect the stability and correctness of PyTorch's tests and runtime operations.
- Gradient and Configuration Issues: Gradient backpropagation issues with the Categorical distribution and NO_SHARD configuration proposals highlight challenges in gradient handling and configuration optimization. These issues affect the differentiability and performance of PyTorch's operations and configurations.
- Distributed API and Masking Errors: Disappearance of model parameters in distributed APIs and errors with boolean masks on sharded DTensors highlight challenges in distributed processing and tensor masking. These issues affect the reliability and flexibility of PyTorch's distributed features.
- Mergebot and Fake Implementation Issues: Enhancements to mergebot functionality and bugs with fake implementations highlight challenges in automation and custom operation registration. These issues affect the efficiency and extensibility of PyTorch's automation and custom operations.
- Overflow and Accuracy Issues: Discrepancies in float16 overflow handling and accuracy issues with opmath_t highlight challenges in numerical accuracy and overflow handling. These issues affect the consistency and precision of PyTorch's numerical operations.
- Optimization and Stride Mismatch Errors: Proposals for optimizing all-gather operations and stride mismatch errors in F.scaled_dot_product_attention highlight challenges in optimization and tensor handling. These issues affect the performance and correctness of PyTorch's operations.
- Padding and Error Message Discrepancies: Runtime errors with zero-size tensor padding and unclear error messages in checkpoint_sequential highlight challenges in tensor operations and error reporting. These issues affect the usability and reliability of PyTorch's tensor operations and error handling.
- Tensor Method Discrepancies and Compilation Errors: Discrepancies in the torch.Tensor.put_ method and compilation errors with CUDA 12.9 highlight challenges in tensor method consistency and build compatibility. These issues affect the correctness and compatibility of PyTorch's tensor methods and builds.
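On the quantile NaN point referenced above, the behavior that typically surprises users is that torch.quantile propagates NaN while torch.nanquantile ignores it; the following is a small sketch of that distinction, and the exact reproduction in the reported issue may differ:

```python
import torch

x = torch.tensor([1.0, 2.0, float("nan"), 4.0])

print(torch.quantile(x, 0.5))     # nan: any NaN in the input poisons the result
print(torch.nanquantile(x, 0.5))  # 2.0: NaNs are ignored, median of {1, 2, 4}
```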
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 47
Summarized Issues:
- Quantization Aware Training (QAT) Performance Issues: Users have reported unexpected performance degradation when using QAT with MobileNetV2 for conversion to TFLite, where the quantized model performs worse than the original. The issue may be related to the conversion of the QAT model into ATen operators, which are not optimized for specific devices.
- Inductor CUDA Graph Tree Buffer Allocation: The Inductor CUDA Graph Tree implementation fails to capture buffer allocations in multi-stream programs on side CUDA streams, leading to runtime errors. This is due to storage data pointers not being allocated in the expected CUDA graph pool.
- PyTorch Export Functionality Bugs: There are issues with PyTorch's export functionality, including the loss of custom metadata for constant tensors during export. Additionally, the torch.export function fails with an IndexError when exporting models using InstanceNorm1d unless the strict parameter is set to True.
- Dynamic Axis Specialization Confusion: Users experience confusion when specifying an axis as dynamic in PyTorch, only to find it specialized without clear indication. A warning is suggested to advise users to set example dimension sizes greater than one, especially for dynamic batch sizes.
- ONNX Export Process Simplification: The ONNX export process needs streamlining by primarily using TorchExportNonStrictStrategy and falling back to the TorchScript-based exporter if necessary. Additional strategies do not significantly enhance export coverage and complicate error messaging.
- Circular Dependency in PyTorch Build Process: A circular dependency problem exists in the PyTorch project where building magma tarballs for ROCm or CUDA requires a manylinux image that itself needs the magma tarball. This necessitates a sequence of pull requests to bypass the dependency temporarily.
- Dynamo Compiler Hang with _scaled_grouped_mm: The Dynamo compiler hangs during the processing of the _scaled_grouped_mm operator when use_fast_accum is set to True. This issue appears to be related to a cudaStreamSynchronize() call and is not associated with the auto-tuning feature.
- MPS Backend aten::col2im Operation Support: The lack of support for the aten::col2im operation on the MPS backend in PyTorch causes it to default to CPU execution, potentially affecting performance. While the forward version aten::im2col is implemented, the backward version is not.
- torch.distributed.tensor.debug.visualize_sharding Enhancement: There is a proposal to enhance the torch.distributed.tensor.debug.visualize_sharding function to produce colorized visualizations similar to jax.debug.visualize_array_sharding. A hybrid approach using the rich library is suggested.
- Peak Memory Usage in torch.compile: A bug is reported where the peak memory usage of a torch.compiled model is higher during the first run compared to subsequent fresh runs. This raises questions about expected behavior and potential caching mechanisms at the inductor or dynamo level.
- Regression Error in torch-xpu-ops: An update to the torch-xpu-ops caused a test case for the interpolate_bilinear function on XPU with float32 to fail due to significant discrepancies in tensor values. This resulted in an assertion error during testing.
- FX Graph Cache Utilization Issues: There is difficulty in determining why the FX graph cache is not being utilized or created in the PyTorch project, despite using logging tools. More detailed logging is suggested to diagnose the problem effectively.
- torch.fliplr Function Crash: A bug in PyTorch causes the program to crash with an "Aborted (core dumped)" error when using the torch.fliplr function with invalid data. There is a request to convert potentially unsafe .pkl files into a safer format for reproduction and debugging.
- torch.log1p Gradient Computation Bug: The torch.log1p function's gradient computation incorrectly results in NaN for an excluded element due to masking, instead of the expected zero. This occurs when using a tensor with a condition that should exclude certain elements from contributing to the gradient calculation (a minimal reproduction sketch follows this list).
- MacOS CI Build and Test Script Updates: The CI build and test scripts for MacOS in the PyTorch project need updating to eliminate the dependency on Anaconda. This is part of a broader effort to address a related issue and involves modifying several specific scripts and workflows.
- Torch Profiler Stream Count Bug: A bug in the Torch Profiler shows 40 streams in the trace file despite using only two streams to overlap operations in a CUDA graph. This suggests that each loop iteration creates a new stream, leading to unnecessary overhead and concerns about stream reuse best practices.
- Anaconda-Related Benchmark Removal: The PyTorch project is removing Anaconda-related benchmarks, specifically addressing files such as benchmarks/dynamo/Makefile and others. This is related to a previous issue (#138506).
- Utility Script and Workflow Bugs: There are bugs in the PyTorch project related to utility scripts and workflows, involving files such as python_doc_push_script.sh, upload-test-stats-while-running.yml, and others. These issues are linked to a previous issue (#138506).
- torch.lobpcg() Performance Bug: The torch.lobpcg() function experiences a performance bug where changing the tolerance parameter tol from 1e-07 to 1e-08 causes the function to hang for over 40 minutes. This suggests a problem with achieving higher precision beyond the default float32 capabilities.
- torch.distributions Documentation Update: There is a need to update the documentation for several functions in the torch.distributions module to include a description of the validate_args parameter. This parameter is currently used in the code but not documented.
- C++ Compilation Error in CI Pipeline: A C++ compilation error is encountered in the CI pipeline of a GitHub project following the release of Torch 2.7. This error is inconsistent across different systems and appears to be related to AVX-512 support.
- torch._dynamo.exc.BackendCompilerFailed Error: Upgrading PyTorch nightlies results in a torch._dynamo.exc.BackendCompilerFailed error due to a missing expecttest module. This can be temporarily resolved by installing the expecttest package, although this dependency should not be required.
- PyTorch 2.7 py_limited_api Compilation Bug: A bug in PyTorch 2.7 occurs when setting py_limited_api=True during the compilation of torch extensions, resulting in build errors due to undeclared identifiers in the pybind11 library. This can be resolved by setting py_limited_api=False.
- torch.flipud Function Segmentation Fault: A bug in the PyTorch library causes a segmentation fault (core dumped) error when using the torch.flipud function with certain arguments. This is likely due to incompatibility with quantized tensors, and a workaround is suggested by dequantizing the tensor before applying the function.
- FSDP Peak Memory Usage Spike: The FullyShardedDataParallel (FSDP) implementation in PyTorch experiences an unexpected peak memory usage spike during the initialization phase when training the Llama 4 model. This prevents the model from being loaded due to out-of-memory (OOM) errors.
- scaler.step(optimizer) Enhancement Proposal: There is a proposal to enhance the scaler.step(optimizer) function in PyTorch to return a value indicating whether a step was skipped due to underflow or overflow. This would help users identify when necessary loss functions are not being applied during model training.
- Windows Compilation with CUDA 12.6 Error: A build failure occurs when compiling PyTorch from source on Windows using CUDA 12.6 and MSVC 2019. The compilation of the file cuda_vectorized_test.cu fails due to an ambiguous reference to the std namespace.
- Inductor Unit Test Accuracy Failure: An accuracy failure in the Inductor unit tests related to chunk_cat on XPU is caused by a change that began supporting contiguous inputs. This disrupted the previous assumption that all inputs were contiguous, leading to failures in the torch-xpu-ops implementation.
- MPS Backend Memory Leak: A memory leak occurs in the Metal Performance Shaders (MPS) backend when using scaled dot product attention (SDPA) with float32 tensors in PyTorch. This involves increasing memory usage over iterations and may be due to a bug in the MetalPerformanceShaderGraph framework.
- MPS Backend Memory Allocation Problem: A memory allocation problem is encountered during model training using PyTorch on a Mac, where the MPS backend runs out of memory. The use of the environment variable PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 is suggested to disable the upper memory limit.
- Compiling PyTorch on WSL for RTX 5070 Ti: A user seeks advice on compiling PyTorch from source on Windows Subsystem for Linux (WSL) to support an NVIDIA GeForce RTX 5070 Ti. Challenges include unsupported GPU architecture sm_129 and memory issues, with a suggestion to use sm_90 as a fallback.
- torch.scatter_reduce ONNX Conversion Bug: The torch.scatter_reduce function with a "max" reduction operation fails to convert correctly to ONNX format when the input is a two-dimensional tensor. This results in shape verification errors during ONNX runtime execution.
- test_host_memory_stats Failure: A failing test named 'test_host_memory_stats' in the PyTorch project version 2.7.0+cu118 encounters an error when running test_cuda.py locally. This is potentially due to improper cleanup in some tests and has been resolved by merging an associated pull request.
- Disabling Long List Print Truncation: A feature is requested to disable the truncation of long list prints to facilitate accurate bandwidth calculations for uneven all-to-all operations. All numbers are necessary for these calculations and subsequent post-processing tasks.
- test_inductor_debug Disabling: The test_inductor_debug within the LoggingTests suite is disabled due to its failure on the main branch. The issue has been closed after identifying the cause of the problem.
- CUDA 12.1 Support Request: A request is made for the release of the PyTorch library with support for CUDA 12.1 for version 2.6 and potentially newer versions. This highlights the need for compatibility with the specified CUDA version.
- DTensor Dtype Transmute Operator Bug: A bug in the PyTorch library causes the dtype transmute operator to fail for DTensor due to the absence of a registered sharding strategy for the operator aten.view.dtype. This results in a NotImplementedError.
- Inductor Component Compatibility Problem: The Inductor component of PyTorch uses an outdated API due to recent changes in the Triton project, specifically the removal and renaming of the launch_enter_hook function. This has been addressed by a subsequent fix.
- torch.randint Function Overflow Bug: A bug in the PyTorch library causes the torch.randint function to fail with large high arguments, particularly with the full range of torch.uint64. This leads to runtime errors due to overflow, and documentation updates are suggested.
- CUDAGraphs and Mixed Operations: A user inquires about capturing a model containing both CUDA and CPU operations as a single CUDAGraph in PyTorch. The response indicates that CUDAGraphs are only applicable to CUDA operations, with a suggestion to use the PyTorch forum for further questions.
- Torchtitan CI Failure: A failure in the torchtitan CI is caused by a static CUDA launcher resulting in a RuntimeError: CUDA driver error: invalid device context. This can be resolved by manually disabling the static CUDA launcher.
- test_nvshmem Disabling: The test_nvshmem in the PyTorch project is disabled because NVSHMEM is not yet installed on the continuous integration (CI) machines. Plans are in place to enable it by building PyTorch with NVSHMEM by default and installing NVSHMEM wheels in CI workflows.
- Inductor Component Periodic Failures: Periodic failures in the PyTorch project related to the inductor component occur, where the hf_BigBird model transitions from a fail_to_run state to fail_accuracy. This indicates silent incorrectness and an increase in graph breaks.
- Modded-nanogpt Performance Regression: A minor performance regression is reported in the modded-nanogpt project between nightly builds from February 9th to March 1st. There is a slight increase in runtime, with plans for further investigation and code analysis.
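For the torch.log1p gradient report above, the NaN typically comes from the masked-out branch still being evaluated in the backward pass, where an infinite local derivative is multiplied by a zero incoming gradient. The following is a minimal sketch of that failure mode and the usual double-masking workaround; the exact tensors and condition in the reported issue may differ:

```python
import torch

# Naive masking: log1p is still differentiated at x = -1, where
# d/dx log1p(x) = 1/(1+x) is infinite, and inf * 0 from the masked
# branch yields NaN in the gradient of the "excluded" element.
x = torch.tensor([-1.0, 0.5], requires_grad=True)
mask = x > -1.0
y = torch.where(mask, torch.log1p(x), torch.zeros_like(x))
y.sum().backward()
print(x.grad)  # tensor([nan, 0.6667])

# Common workaround: feed a safe value into log1p so the excluded element
# never produces a non-finite value in either the forward or backward pass.
x2 = torch.tensor([-1.0, 0.5], requires_grad=True)
safe = torch.where(mask, x2, torch.zeros_like(x2))
y2 = torch.where(mask, torch.log1p(safe), torch.zeros_like(x2))
y2.sum().backward()
print(x2.grad)  # tensor([0.0000, 0.6667])
```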
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 185
Key Open Pull Requests
1. [Do not merge] poke CI with FX IR always on: This pull request is focused on testing the continuous integration (CI) system by enabling FX IR conversion at all times to identify and address potential bugs, as indicated by the title "[Do not merge] poke CI with FX IR always on" and the detailed commit history that includes various code refactoring, bug fixes, and testing enhancements related to FX IR and its integration into the PyTorch project.
- URL: pull/152405
- Merged: No
- Associated Commits: d3a36, 6e176, c55fd, f299c, 403b9, fd03d, 2cd01, bd78b, df2f5, 0dfd0, 40011, 20b1d, 91a39, e7335, be417, fc5f0, 6c1cb, c94d1, 2b978, 2b6a4, 5e33e, 6e09c, b4c7a, 41cab, 2a839, 43773, 30cb1, 46e84, e6a12, 1f33f, 04bbe, 8dac9, 69979, a4a87, c5769, 51ed2, 46fed, 32130, 81831, 55f51, 0ca29, 50fca, d4be6, 07ab6, 2fab6, 12b40, 13710, c3e64, 05cf7, aa7a2, 76b12, 2831c, 8678c, fc9ff, 2f17b, fb252, d4ecc, c9a37, feda8, 5fe4e, f5e56, f64ca, 20f97, 6bbee, dcd78, 61388, e2546, 0c1db, 002e9, 3cb46, 43192, 27009, 8f948, 23da2, 328b4, bd632, 6bfc4, 16cb3, 66fad, 597b5, a7d36, c6051, b164d, 81e54, 34736, 7a343, bed61, 2a9ec, 4fbd7, 9a8d1, f4469, 9de81, bcc51, ede4a, 61b83, a13a0, f21d2, 36de7, 72dda, 7cdd2, addc7, a3609, d00c7, 12de0, eb216, fa8c1, a5e72, 0080b, c765c, bb405, 103bc, f8af0, 8d665, 59aaf, b35fb, 5eba4, ac33f, 522fb, e25a1, 079fc, 26b33, 5a0fe, c53a9, b2158, fd88b
2. [inductor][dynamo] Include operator name in size/stride/alignment assertion: This pull request updates the assert_size_stride and assert_alignment functions in the PyTorch project to include an optional op_name argument for enhanced error messaging, modifies the corresponding type stubs, extracts operator names from the FX graph for better debugging in Triton code, and adds unit tests to ensure that both successful and failing assertions include the operator name.
- URL: pull/152353
- Merged: No
3. [CI] Use cmake from pip instead of conda in CI docker images: This pull request proposes changing the continuous integration (CI) process for PyTorch's Docker images by using CMake installed via pip instead of conda, as indicated by the title and supported by multiple commits with the message "tc."
- URL: pull/152537
- Merged: No
Other Open Pull Requests
- Conda Removal from PyTorch Documentation and Development: Multiple pull requests focus on removing Conda references from the PyTorch project. These changes aim to streamline the installation and development processes by eliminating Conda dependencies and optimizing the build process.
- Enhancements to PyTorch's Autotuning and Profiling: Pull requests introduce additional profiling events and code generation for GEMM kernels to improve performance analysis and autotuning capabilities. These updates align with existing features and add new templates for code generation.
- Backend and Compilation Improvements: Several pull requests enhance backend specializations and hierarchical compilation processes. These changes involve incorporating mutation dependencies and introducing new keyword arguments to improve flexibility and performance.
- Vectorization and Quantization Enhancements: Pull requests introduce vectorized operations for FP8 formats and enable vectorized code generation for quantization processes. These updates aim to enhance performance and efficiency in handling specific data types.
- Fault Recovery and Vendor Neutrality: Efforts to make the Fault Recovery code vendor-neutral are evident in multiple pull requests. These changes remove dependencies on specific backends, allowing broader compatibility while maintaining backward compatibility.
- PyTorch Functionality and Interoperability Enhancements: Pull requests focus on improving the functionality and interoperability of PyTorch's features. These include making Functorch interpreters serializable and enhancing tree functions to accept various object types.
- CUDA and ROCm Integration: Pull requests introduce new mechanisms for CUDA graph launches and integrate AITER for ROCm. These changes aim to improve robustness and performance by utilizing advanced features and refining integration paths.
- Miscellaneous Fixes and Updates: Various pull requests address issues such as dead links, numerical instability, and code generation failures. These updates ensure documentation accuracy, consistent results across platforms, and robust code generation processes.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 269
Key Closed Pull Requests
1. Add scripts to check xrefs and urls: This pull request involves adding and updating scripts to traverse the documentation and codebase of the PyTorch project to identify and address any broken cross-references and URLs, although it was ultimately not merged.
- URL: pull/151844
- Merged: No
- Associated Commits: 1d738, dcd6e, 72d59, 9605a, 2a07f, a37df, 150c5, 48c4e, ee06f, 83ed6, 8037b, e963c, 0a041, 90093, a0cbb, 8f72f, 8ad07, 55b02, 4edf5
2. [conda] Remove conda from lint-autoformat.yml: This pull request aims to remove the use of conda from the lint-autoformat.yml file, address a missing setuptools module error by installing it, switch from Python 3.10 to the system's Python 3.9, and utilize a virtual environment to manage dependencies not included in the base setup.
- URL: pull/152433
- Merged: No
- Associated Commits: 6a8ff, 0df51, 153fa, 96e92, d02a7, dc71d, 72e33, 05a46, c0743, edef7, 73a94, fef8f, 1dca9, 179f9, 2cdb5, f0227
3. [conda] Remove conda usage from TD llm retriever job: This pull request aims to eliminate the use of conda in the TD llm retriever job for the PyTorch project, addressing concerns about managing different Python versions, as the base currently uses Python 3.9.
- URL: pull/152338
- Merged: No
- Associated Commits: ff3f5, 764b9, f4167, f00db, 46fb7, ab12e, f821f, 3e4b5, ab8df, b107f, 24e82, b5447
Other Closed Pull Requests
- Subgraph Output Code Simplification: This topic involves simplifying the resulting output code for subgraphs in the PyTorch project. The changes are documented with comparisons of output code before and after the modifications, highlighting the improvements made.
- Assertion Checks Removal: The removal of assertion checks for the outputs of the invoke_subgraph function is covered here. The input assertions are considered sufficient, making output assertions unnecessary.
- TensorIterator Migration: This topic covers the migration of the multiplication operation to use TensorIterator in PyTorch. It includes a refactor of binary operation tensor generators to handle mixed data types correctly.
- Cutlass Epilogue Visitor Code Generation: This involves implementing a Python code generator for the Cutlass epilogue visitor. The pull request is part of a larger stack of related changes in the PyTorch project.
- Incorrect Typing in Inductor Module: This topic addresses incorrect typing issues in the cuda_kernel and cuda_template.py components of the Inductor module. Both pull requests were ultimately not merged.
- Throwaway Changes: This pull request consists of non-mergeable, experimental commits intended for testing or temporary purposes. It is indicated by the repeated "throwaway" label.
- AOTAutogradCache Issue: This pull request addresses an issue with the AOTAutogradCache by saving the bw_module in the cache after stripping it of unserializable metadata. It ensures that both the lowered backward and the bw_module are cached to prevent crashes.
- Enhancing Visualize Sharding: This topic involves enhancing the torch.distributed.tensor.debug.visualize_sharding functionality by adding rich support. It addresses issue #151857 and includes updates for compatibility with systems running on at least 4 GPUs.
- Export Functionality Enhancement: This pull request enhances the export functionality by supporting the export of hops with function schema arguments. It simplifies the implementation by using pytree.register_constant.
- Memory Allocation in CUDA Graph Trees: This topic addresses memory allocation on side streams in CUDA Graph Trees by implementing a safer approach for multithreading. It fixes the error reported in issue #151199.
- Test Configuration Monitoring: This pull request enhances the monitoring capabilities of test configurations by adding parameters for log intervals and incorporating an upload step for all test YAML files.
- Link Linting Process Enhancement: This topic involves enhancing the link linting process by configuring it to run on modified files only or on all files when scheduled. It includes updates to the linting configuration files.
- Metal Kernel Migration: This pull request migrates all addition and subtraction operations to Metal kernels. It addresses a bug related to improper downcasting of CPU double/complexDouble scalars to floats.
- XLA Issue Resolution: This topic aims to address and resolve an issue related to XLA in the PyTorch project. Despite multiple updates and commits, the pull request was ultimately not merged.
- Memory Management Optimization: This pull request optimizes memory management by ensuring that buffers are freed before invoking a subgraph call. It is indicated by the title and multiple updates in the commit messages.
- Profile-Guided Optimization Enhancement: This topic enhances the PGO process by making code state identification independent of file paths. It uses a hash of the file content along with the function name and line number.
- Gather Operation Data Type Support: This pull request enhances the PyTorch library by supporting additional data types for both input and indices in the gather operation. It is part of a series of changes tracked through the ghstack tool.
- Static Input Indexes Collection: This topic addresses an issue related to the collection of static_input_idxs in the cudagraphs component. It is part of a stack of changes tracked via ghstack.
- Tangent Metadata Caching: This pull request implements caching on tangent metadata and enables retracing if necessary. It is part of a series of related changes managed through the ghstack tool.
- Unpacked Operands in Subgraph: This pull request involves unpacking operands within a subgraph. It is part of a stack of related changes managed through the ghstack tool.
- Intermediate Representation Line for Symbolic Call Arguments: This pull request introduces a new wrapper IR line to handle symbolic call arguments. It aims to streamline code between Python and C++ backends.
- Log Message Length Reduction: This topic addresses the issue of generating unnecessarily long log messages for suppressed data-dependent errors. It aims to prevent logging of such errors when recording is not enabled.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
-
- Toxicity Score: 0.55 (Defensive responses, persistent tension, lack of resolution.)
- This GitHub conversation involves username1 and username2, where username1 initially proposes a solution that username2 critiques for not addressing the issue effectively. Username1 responds with a defensive tone, expressing frustration over the feedback, which triggers a tense exchange. Username2 attempts to clarify their point, but the conversation remains strained, with both parties showing signs of impatience.
- [caffe2] Make c10::str works with scoped enum
- Toxicity Score: 0.55 (Defensive responses, critique of solution, tense exchange)
- This GitHub conversation involves username1 providing a solution, which username2 critiques for not addressing the issue effectively. Username1 responds defensively, leading to a tense exchange. Username3 attempts to mediate by suggesting alternative approaches, but the conversation remains strained.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
malfet | 223 | 31 | 8 | 134 |
anijain2305 | 177 | 19 | 2 | 14 |
mlazos | 160 | 21 | 0 | 10 |
FFFrog | 138 | 9 | 0 | 2 |
laithsakka | 78 | 21 | 8 | 29 |
swolchok | 112 | 10 | 0 | 2 |
zou3519 | 40 | 9 | 16 | 44 |
Skylion007 | 12 | 1 | 1 | 92 |
bobrenjc93 | 85 | 13 | 2 | 5 |
guangyey | 82 | 7 | 0 | 16 |