Weekly Project News


Weekly GitHub Report for Pytorch: June 23, 2025 - June 30, 2025 (22:59:45)

Weekly GitHub Report for Pytorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant updates, including support for Python 3.13 with torch.compile, a new performance-related feature (torch.compiler.set_stance), and FP16 support on x86 CPUs. Notably, the release also marks the deprecation of PyTorch's official Anaconda channel, urging users to switch to other package sources, and introduces the Manylinux 2.28 build platform for Linux binaries, setting the stage for a full transition in the upcoming PyTorch 2.7 release.
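
For readers curious about the new stance control, a minimal sketch is shown below; the toy compiled function is an illustrative assumption, not an official example, and the stance names follow the 2.6 release notes.

    import torch

    @torch.compile
    def fn(x):
        return torch.sin(x) + torch.cos(x)

    # Run the compiled function eagerly (skip compilation) for this region,
    # then restore the default behavior afterwards.
    torch.compiler.set_stance("force_eager")
    out_eager = fn(torch.randn(8))

    torch.compiler.set_stance("default")
    out_compiled = fn(torch.randn(8))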

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. NVFp4 Cublas Error: This issue involves a CUDA error encountered when using the torch._scaled_mm function with a bias, resulting in a CUBLAS_STATUS_NOT_SUPPORTED error, particularly when the matrix dimension m is set to 1. Additionally, the issue highlights problems with the NVfp4 double quantization on CUDA 12.8, where certain configurations lead to errors, and there is confusion about the compatibility of matrix dimensions with the cuBLAS library.

    • The comments discuss attempts to reproduce the error, with some users unable to replicate it on newer CUDA and cuBLAS versions, suggesting it might be specific to version 12.8. There are suggestions to check cuBLAS logs for more details, and a user confirms that the issue with bias is resolved in a later version of cuBLAS.
    • Number of comments this week: 10
  2. [v.2.8.0] Release Tracker: This issue is a release tracker for version 2.8.0 of the PyTorch project, detailing the process and criteria for cherry-picking changes to the release branch. It outlines two phases: the first phase allows low-risk changes until July 7, 2025, while the second phase, after this date, only permits release-blocking critical fixes.

    • The comments section includes multiple requests for cherry-picking changes to the release branch, with some changes already merged and others pending due to unresolved CI issues. The criteria for these changes vary, including release-only changes and reverts on the trunk, with one critical change awaiting resolution of CI issues before landing on the trunk.
    • Number of comments this week: 6
  3. FSDP2 + TP does not work: This issue describes a problem encountered when implementing a model using FSDP2 and TP with a regular AdamW optimizer, where the user experiences a NotImplementedError related to cross-mesh operations in DTensor. The user reports that attempts to use gradient norm clipping, both with a custom function and the torchtitan implementation, result in errors, and removing norm clipping leads to further issues during the backward step.

    • The comments discuss the need for code sharing to understand the mesh configurations and suggest applying TP to all modules, using replication instead of sharding where necessary. The user confirms they applied TP selectively due to specific constraints and is advised to apply TP universally to resolve the issue. A reference to a similar approach using NoParallel is provided, and the user plans to try the suggested solution and report back.
    • Number of comments this week: 5
  4. [RFC] Remove the FSDP data copy from compute stream critical path: This issue addresses the latency in training iterations caused by the Fully Sharded Data Parallel (FSDP) data copy being on the critical path during large language model (LLM) training, specifically noting that this data copy accounts for approximately 1.4% of the total iteration time in a single node with 8 GPUs. The discussion suggests that these memory copy operations could potentially be executed in parallel with other compute kernels to reduce latency, although this might require additional memory allocation.

    • The comments discuss the potential impact of moving the FSDP data copy to a separate stream, with some users noting that the overhead is relatively small and discussing the testing setup and results on different platforms. There is also a clarification about the version of FSDP being used, with a plan to test FSDP v2 in the user's cluster.
    • Number of comments this week: 5
  5. Add is_outputs_batched param to autograd.grad: This issue discusses the need for an is_outputs_batched parameter in the autograd.grad function of PyTorch to efficiently compute per-sample gradients without requiring multiple forward passes, which is crucial for memory efficiency in certain machine learning tasks. The user explains that existing solutions, such as using the torch.func API, are not applicable to their specific use case, and they seek a more elegant and efficient method to achieve their goal.

    • The comments discuss potential solutions and alternatives, including using ExpandedWeights and torch.func.vjp with vmap, which the user finds helpful but not entirely sufficient for their needs. The user appreciates the suggestions and shares that they have managed to implement a working solution using the advice, although it still requires two forward passes. They express intent to optimize their code further and provide updates on their progress. A minimal sketch of the torch.func approach referenced here appears after this list.
    • Number of comments this week: 4
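
As context for the per-sample gradient discussion above, the sketch below shows the torch.func route that the comments reference; the tiny linear model and mean-squared-error loss are illustrative assumptions, not code from the issue.

    import torch
    from torch.func import functional_call, grad, vmap

    model = torch.nn.Linear(4, 1)
    params = dict(model.named_parameters())

    def loss_fn(params, x, y):
        # functional_call runs the module with an explicit parameter dict
        pred = functional_call(model, params, (x.unsqueeze(0),)).squeeze(0)
        return torch.nn.functional.mse_loss(pred, y)

    # vmap over the batch dimension of (x, y) yields one gradient per sample
    # from a single forward/backward pass.
    per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(
        params, torch.randn(8, 4), torch.randn(8, 1)
    )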

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue involves an ImportError encountered when attempting to import 'triton_key' from 'triton.compiler.compiler', which is causing a backend compiler failure in a PyTorch environment. The error occurs during the execution of a Python script that utilizes the OotdPipeline and involves compiling components with Torch's compile function, specifically affecting the 'inductor' backend.
  2. Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternative algorithm for computing MaxPool2D in PyTorch when the stride is equal to 1, suggesting that a kernel size of 5 can be represented by two MaxPool2D operations with a kernel size of 3, and similarly for other sizes, to reduce computational cost on the CPU. The approach aims to optimize performance by decreasing the computation for each cell, and testing has shown a speedup of approximately 1.293 times compared to the traditional method. A small sketch illustrating the equivalence appears after this list.
  3. cuda_utils.so: failed to map segment from shared object: This issue involves a bug encountered when executing a compiled model in a Docker environment with a tmpfs permission set to 1777, where the cuda_utils.so file in the /tmp directory fails to execute due to missing execution permissions, despite being run as the root user. The error logs indicate that the problem arises during the execution of a PyTorch model, specifically when attempting to map a segment from the shared object cuda_utils.so, which lacks the necessary execution bit, leading to an ImportError.
  4. Enable UFMT on all files in PyTorch: This issue involves enabling uniform formatting (UFMT) across all files in the PyTorch codebase, as currently, approximately 1,500 files are not formatted according to the UFMT standards. The process requires removing file names from the exclude_patterns in the UFMT section of the .lintrunner.toml file and running a specific command to apply the formatting, with additional preparatory work needed to resolve known issues such as import cycles and misplaced annotations before the UFMT changes can be committed.
  5. [JIT archive] Add a flag to not include debug files: This issue proposes the addition of a flag to the torch.jit.save() function in PyTorch to exclude .debug_pkl files, which are primarily used for debugging purposes and can significantly increase the file size of TorchScript models compared to ONNX models. The motivation behind this feature request is to reduce the file size for deployment, especially on mobile devices, by eliminating unnecessary debug files, as demonstrated by a reduction from 6.7MB to 5.6MB in a test case.
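
To make the MaxPool2D decomposition proposal above concrete, the sketch below checks that, with stride 1, a single 5x5 max pool matches two stacked 3x3 max pools; the input shape and padding choices are illustrative assumptions.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)

    pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
    pool3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    # With stride 1, the max over a 5x5 window equals the max over
    # overlapping 3x3 windows of the 3x3 maxima.
    assert torch.equal(pool5(x), pool3(pool3(x)))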

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 61

Summarized Issues:

  • Error Messages in PyTorch: This topic covers issues related to confusing or incorrect error messages in PyTorch. One issue involves a non-critical error message due to a missing 'CACHE_FILE' attribute in the 'cutlass' module, which could be avoided by checking for the attribute's existence. Another issue highlights an incorrect error message in the torch.binomial function, where the input and expected data types are switched, leading to confusion for users.
    • issues/156670, issues/157195
  • Model Export and Compatibility: This topic includes issues related to exporting models and compatibility with different formats and backends. One issue involves exporting a PyTorch model to ONNX format using torch-dynamo, with challenges in handling conditional inputs. Another issue describes a bug in exporting models with dynamic shapes, resulting in errors due to unsupported operators.
    • issues/156673, issues/156681
  • Bugs in PyTorch Functions: This topic covers various bugs encountered in PyTorch functions. Issues include a bug in torch.compile with SyncBatchNorm, a segmentation fault in torch.repeat_interleave, and a floating point exception in torch.nn.functional.conv_transpose3d. These bugs result in errors or crashes during execution, affecting the reliability of the functions.
    • issues/156680, issues/157097, issues/157098
  • Backend and Performance Issues: This topic includes issues related to backend compatibility and performance in PyTorch. One issue reports a segmentation fault on Apple Silicon when running a transformer model, while another highlights a performance regression in specific benchmarks on CPU. These issues affect the stability and efficiency of PyTorch on different hardware configurations.
    • issues/156723, issues/157077
  • Dependency and Configuration Problems: This topic covers issues related to dependencies and configuration in PyTorch. One issue involves outdated vendored wheels in the PyTorch pip repository, while another describes improper configuration of pybind11, leading to compatibility problems. These issues highlight the need for updated dependencies and consistent configurations across submodules.
    • issues/156694, issues/156725
  • Feature Requests and Enhancements: This topic includes feature requests and proposed enhancements for PyTorch. Issues include adding a generator argument to torch.rand_like for reproducibility, and introducing a unified memory API to standardize memory allocation across hardware backends. These enhancements aim to improve usability and consistency in PyTorch; a sketch of today's workaround for the rand_like request appears after this list.
    • issues/156701, issues/156805
  • Distributed and Parallel Computing: This topic covers issues related to distributed and parallel computing in PyTorch. One issue involves a bug in the XCCL backend causing hangs during distributed operations, while another addresses the need for FSDP2 to support different data types for model parameters. These issues impact the efficiency and flexibility of distributed training in PyTorch.
    • issues/156782, issues/156784
  • Documentation and Usability: This topic includes issues related to documentation and usability improvements in PyTorch. One issue highlights the need for clearer documentation on the use of torch.compiler.save_cache_artifacts(), while another points out an incorrect path in the Contribution.md documentation. These issues emphasize the importance of accurate and comprehensive documentation for users.
    • issues/156797, issues/157101
  • Compilation and Build Issues: This topic covers issues related to compilation and build processes in PyTorch. One issue describes a compilation failure when building with Vulkan support on Fedora, while another involves a persistent OSError when building from source on Windows. These issues highlight challenges in ensuring successful builds across different platforms and configurations.
    • issues/156915, issues/157128
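
As background for the torch.rand_like request above, the sketch below shows the workaround currently available, since rand_like does not accept a generator today; the shape, dtype, and seed are illustrative assumptions.

    import torch

    x = torch.empty(4, 5, dtype=torch.float32, device="cpu")
    gen = torch.Generator(device="cpu").manual_seed(0)

    # torch.rand_like(x) has no generator argument, so reproducible sampling
    # currently means spelling out the shape, dtype, and device by hand.
    sample = torch.rand(x.shape, dtype=x.dtype, device=x.device, generator=gen)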

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 23

Summarized Issues:

  • Apple MPS and BatchNorm1d Bias Issue: This issue describes a bug where using Apple MPS to train a model with a BatchNorm1d -> LSTM -> BatchNorm1d structure results in the bias of the first BatchNorm1d layer becoming extremely large. This problem is not observed when the model is run on a CPU.
    • issues/156555
  • Test Failures on Specific Platforms: This issue pertains to the disabling of the test test_basic_fn_backend_eager_device_cuda within the TestPackage suite on the main branch due to its failure on the xpu platform. Further details are available through a provided link.
    • issues/156576
  • Torch Compile and Backend Issues: Several issues involve bugs in the PyTorch library related to torch.compile, including incorrect strides in traced graphs, misuse of context managers, and failures with symbolic shape parameters. These issues highlight discrepancies between CPU and HPU dispatches and challenges with dynamic shape handling.
    • issues/156578, issues/156720, issues/156724
  • Memory Format and Stride Preservation: This issue highlights an inconsistency in PyTorch's torch.clone() function when using memory_format=torch.preserve_format. It correctly preserves the strides for transposed views but fails to do so for sliced views, resulting in a contiguous tensor instead of maintaining the non-contiguous memory format as expected. A short repro-style sketch appears after this list.
    • issues/156644
  • Hooking Issues in Transformers Library: This issue describes a problem encountered when attempting to hook the inputs of the Qwen2DecoderLayer in the Hugging Face Transformers library. The forward hook only collects positional arguments and not keyword arguments, leading to a loss of input data unless the with_kwargs=True argument is set.
    • issues/156695
  • PyTorch CI and Compilation Errors: Issues in the PyTorch CI include a function with a dictionary-type parameter causing an AddressSanitizer error and a problem with the load_inline function appending legacy flags to the NVCC command line. These issues highlight challenges in maintaining compatibility and performance in the CI environment.
    • issues/156787
  • Numerical Precision and Compatibility Issues: This issue describes a numerical precision bug in PyTorch version 2.7.1 when using torch.float64 on CUDA, where calculations of powers of two are inaccurate. Additionally, a compatibility issue arises when a program packaged with PyInstaller encounters errors on specific GPUs.
    • issues/156929
  • Dependency and Configuration Inconsistencies: This issue addresses the inconsistency in dependency versioning for Numpy across different Python versions in the PyTorch project's configuration files. Different Numpy versions are pinned in .lintrunner.toml and .ci/docker/requirements-ci.txt, leading to confusion about which version should be used.
    • issues/157012
  • FSDP and CPU Offload Issues: This issue involves a bug in the Fully Sharded Data Parallel (FSDP) feature where enabling CPU offload with pin_memory set to True results in a "CUDA error: invalid argument" on certain GPUs. This is specifically observed on A40 but not on H100, and stems from the potentially confusing sequence of moving parameters to the CPU and then pinning that host memory for GPU transfers.
    • issues/157146
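
To illustrate the torch.clone() inconsistency reported above, the repro-style sketch below compares the strides of clones of a transposed view and a sliced view; the tensor sizes are illustrative assumptions, and the comments describe the behavior reported in the issue rather than a documented contract.

    import torch

    base = torch.randn(4, 6)

    # Transposed view: the clone keeps the non-contiguous strides.
    t = base.t()
    t_clone = t.clone(memory_format=torch.preserve_format)
    print(t.stride(), t_clone.stride())  # reported to match

    # Sliced view (every other column): the issue reports that the clone
    # comes back contiguous instead of keeping the view's strides.
    s = base[:, ::2]
    s_clone = s.clone(memory_format=torch.preserve_format)
    print(s.stride(), s_clone.stride())  # reported to differ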

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 165

Key Open Pull Requests

1. [BE] use pathlib.Path instead of os.path.* in setup.py: This pull request proposes replacing the use of os.path.* with pathlib.Path in the setup.py file of the PyTorch project, addressing a specific discussion point from a previous pull request. A small before/after sketch of this kind of conversion appears after the key pull requests below.

  • URL: pull/156742
  • Merged: No
  • Associated Commits: 8b4f8, b1a83, f33e9, 2efee, 34628, 0d207, 7438d, 82524, 7263d, ea6b1, ce797, 476c0, 06764, 8ee83, 07b08, 00735, e6e6a, 742ad, 150d7

2. Adds support for Nested Jagged Tensor in Multihead Attention: This pull request introduces a new implementation of the multi_head_attention_forward function specifically for Nested Jagged Tensors (NJT) in the PyTorch library, addressing issue #153472 by ensuring compatibility with NJT without altering the original function. It includes unit tests to validate the changes while noting certain limitations, such as the lack of support for attn_mask and need_weights.

  • URL: pull/156660
  • Merged: No
  • Associated Commits: b1bf6, d66f9, 6c6b8, cf8fe, b53e1, 6492e, db561, 795b9, 83eaa, 21a80, 7dfca, 7ea3f, dfc74, f2206, d3f48

3. [build] remove cmake cache and reconfigure again if it is invalid: This pull request aims to enhance the build process by removing the CMake cache and reconfiguring it if found to be invalid, as part of a series of changes tracked through the ghstack tool, with references to related issues and contributions from multiple commits.

  • URL: pull/156958
  • Merged: No
  • Associated Commits: 18305, a492c, 4ce58, a01a0, c6e88, 6e204, 7eb8c, 2055d, 246dd, c5e87, 7f224, 02573, 1bdac, 34094
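
As a small illustration of the kind of change the pathlib pull request above makes, the before/after sketch below is hypothetical; the paths and variable names are made up and are not taken from setup.py.

    import os.path
    from pathlib import Path

    # os.path style: nested function calls to build the path.
    build_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "build")

    # pathlib.Path style: the same path built with the / operator.
    build_dir_p = Path(__file__).resolve().parent / "build"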

Other Open Pull Requests

  • SDPA Module Backend Enhancements: This pull request introduces a new OVERRIDEABLE backend to the SDPA module for XPU in PyTorch. It enhances backend selection logic with a fallback mechanism for unsupported FLASH_ATTENTION, and includes updates to error messaging, unit tests, and codebase integration to improve user configurability and ensure correct backend selection.
    • pull/156669
  • CMake Version Consistency and Parsing: Two pull requests focus on CMake version handling in the PyTorch project. One introduces a linter to ensure consistency in specifying the minimum required version of CMake, while the other proposes changing the method of parsing the CMake version to a more machine-friendly JSON format. A small sketch of JSON-based version parsing appears after this list.
    • pull/156961, pull/157073
  • Build Process and Error Messaging Improvements: Enhancements to the build process of C++ extensions in PyTorch include color-coded error messages, specifically highlighting undeclared variables in red. Additionally, CMake shim warnings are redirected to the standard error stream to improve clarity during the build process.
    • pull/157051, pull/157074
  • Windows Path Handling Fix: This pull request addresses a Windows-specific issue by fixing the handling of paths containing Chinese characters in the load_inline function. It includes multiple commits for testing and updating scripts to ensure compatibility.
    • pull/157032
  • AArch64 Architecture and MKLDNN Enhancements: A pull request tests AArch64 architecture failures by refining the FP32 precision API and enabling BF16 as an internal precision for MKLDNN convolution operations. It also enables BF32 tests for MKLDNN convolution and linear pointwise/binary operations in the inductor.
    • pull/156671
  • Autograd Rules for aten::aminmax: This pull request introduces functionally correct backward (VJP) and forward (JVP) autograd rules for the aten::aminmax operator in PyTorch. It updates derivatives.yaml and other relevant files to ensure accurate differentiation in eager mode, while also optimizing tensor allocation and adding comprehensive test cases.
    • pull/156675
  • CUDA 12.9 Integration in CI: This pull request integrates CUDA 12.9 into the continuous integration (CI) system of the PyTorch project. It adds periodic tests, including updates to test configurations and build scripts.
    • pull/156900
  • Command Line Argument Parsing Optimization: This pull request optimizes the PyTorch codebase by stopping the repeated parsing of command line arguments every time the common_utils module is imported. It is part of a series of smaller pull requests that collectively re-submit changes from a larger, previously submitted pull request.
    • pull/156703
  • mi300 Workflow Testing: Focused on testing the mi300 workflow, this pull request involves running tests on ROCm GPUs, using a custom branch, updating references, and avoiding a custom registry. It addresses an unspecified issue and notifies several contributors and support teams.
    • pull/156727
  • ROCm SymmetricMemory Optimization: This pull request enhances performance by de-serializing memory loads specifically for ROCm's SymmetricMemory. Its description leaves the issue reference as a template placeholder (#ISSUE_NUMBER), and it includes multiple commits focused on stopping the serialization of consecutive load vector operations.
    • pull/156746
  • sccache Integration in Manylinux Images: This pull request integrates sccache into the manylinux images used in the project, excluding the sccache-dist binary due to its anticipated non-use. It employs a vendored version of OpenSSL to facilitate sequential binary builds, serving as a reland of a previous attempt.
    • pull/156892
  • Untyped Definitions Removal: A pull request aims to remove untyped definitions in the PyTorch project as part of a batch update. It involves multiple commits that are part of a stack managed by ghstack and has not yet been merged.
    • pull/157011
  • Reinplace Pass Fix for View Input: This pull request addresses issue #153389 by implementing a fix for the reinplace pass handling of view input and mutable custom operations. It uses the approach suggested by Richard and involves multiple commits for updates, comments, cleaning, and linting.
    • pull/156729
  • Triton Kernel Runtime Prediction: A multi-layer perceptron model is introduced to predict Triton kernel runtimes, trained on a comprehensive dataset of matrix multiplication runs. It aims to enhance the efficiency of autotuning by identifying optimal configurations more effectively than the existing max-autotune process.
    • pull/156851
  • Generator Closure in compile_subgraph: This pull request aims to explicitly close all open generators in the compile_subgraph function to ensure that all remaining finally blocks are executed. It leverages CPython's tp_finalize function to trigger genclose.
    • pull/157149
  • Inductor Reordering and FSDP Bucketing Test: A test case for inductor reordering and FSDP bucketing is introduced, utilizing an existing but not fully robust FSDP-bucketing pass. The expectation is that the test will initially fail to serve as a starting point for improving the reordering pass.
    • pull/156749
  • TensorFloat-32 Precision in MKL-DNN: This pull request proposes allowing the use of TensorFloat-32 (tf32) as an internal precision format equivalent to float32 (fp32) for operations such as convolution, matrix multiplication, and recurrent neural networks (RNN) within the MKL-DNN backend.
    • pull/156802
  • New API for Memory Information: A new API, torch.accelerator.get_mem_info, is introduced to the PyTorch project. It is currently a work in progress with multiple commits linked to the development stack, but it has not yet been merged.
    • pull/156812
  • Naming Clarifications in PyTorch: This pull request addresses naming issues by renaming _torchdynamo_orig_callable to _torchdynamo_orig_fn and _torchdynamo_orig_backend to clarify the distinct uses of the original callable and backend in nested decorators and callbacks.
    • pull/156901
  • GPU Memory Allocation Tracking: A feature for tracking GPU memory allocation is implemented, as referenced in issue #6736 of the PyTorch test-infra repository. It includes multiple commits signed by Yang Wang.
    • pull/156907
  • Cholesky Dispatches Deletion: This pull request proposes the deletion of custom Cholesky dispatches in favor of a column-wise approach. It ensures that all operations go through generalized kernels and that Metal kernels are compatible with the same sizes and strides as CPU or CUDA backends.
    • pull/157014
  • Generator Behavior Modification: This pull request aims to modify generator behavior so that the StopIteration exception is raised with the value obtained from the return statement, as indicated by the title and the associated commits.
    • pull/157152
  • Expression Optimization: This pull request optimizes the expression max(1, x) to simply x when it is known that x is greater than or equal to 1. It addresses issues such as failed static assertions and ConstraintViolationErrors in internal tests.
    • pull/157189
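
Related to the CMake version parsing item above, the sketch below shows one machine-friendly approach; it assumes the JSON route goes through cmake -E capabilities, which is an assumption about the pull request, not a confirmed detail.

    import json
    import subprocess

    # `cmake -E capabilities` prints a JSON document that includes the version,
    # avoiding the need to scrape the human-oriented `cmake --version` banner.
    out = subprocess.run(
        ["cmake", "-E", "capabilities"], capture_output=True, text=True, check=True
    )
    caps = json.loads(out.stdout)
    version = caps["version"]["string"]  # e.g. "3.27.4"
    print(version)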

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 196

Key Closed Pull Requests

1. Docfix xavier: This pull request involves removing a section of the documentation that was deemed irrelevant and potentially a result of a copy-paste error, as indicated by the title "Docfix xavier" and the body of the request, which mentions the removal of confusing documentation content.

  • URL: pull/157099
  • Merged: No
  • Associated Commits: d94ea, cdd7a, abebb, b1940, 7bab7, 7173a, 48398, 6926f, c236b, 1b8f4, b23bf, c632e, 4cc43, 89b09, 3b87b, fb027, 644fd, 8c7db, bf727, 3a8e6, f63de, 1f612, 46443, 12a6d, 4268b, 84210, d80af, b04d8, 8d218, d29e4, 9b4f0, 5bed3, ecd43, 18a92, 6b27e, 1b84f, 64ca7, 697cd, 2b73f, e691e, 1a6c1, 71fa7, f2b3b, 60ddc, 5745d, 53a13, d10ff, c4b98, 85229, a3cd7, b766c, dfd39, f2ee3, 7ad8b, 79126, 5416d, 65695, c2cca, 8b6bc, 3b61d, 06c6a, 28ca4, 1cc51, a6321, 35f1e, 3f236, ef2b1, 89490, c7ff7, 0c236, 07391, 13417, 80419

2. load inline user overridable gencode: This pull request addresses the issue of loading inline user-overridable gencode in the PyTorch project by implementing a series of testing strategies, including using cuobjdump, reading stderr for gencode verification, and creating a simpler unit test to ensure the correct flags are added by default. It was ultimately closed without being merged.

  • URL: pull/156850
  • Merged: No
  • Associated Commits: bc688, 32e19, b6600, 0cec7, fa169, 324aa, bccd9, c6816, 49c34, 0bead, 6285a, 5e7d8, 62e90, e7381, 23881

3. Port three dynamo test to Intel GPU: This pull request aims to port three additional dynamo test files to Intel GPU for the PyTorch project, following a previous effort that ported two test files. It does so by utilizing methods such as instantiate_device_type_tests(), determining the accelerator backend with torch.accelerator.current_accelerator(), adding XPU support in decorators, and enabling XPU for specific test paths, while maintaining the original code style. A small device-agnostic sketch appears after the key pull requests below.

  • URL: pull/156575
  • Merged: No
  • Associated Commits: 70ce2, 32982, 454a3, 38010, a95ae, 08f19, fe8cf, 43df2, 8da8e, dd516, 613a5, 0094e, e4b23
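
As a small illustration of the device-agnostic pattern the Intel GPU port above relies on, the sketch below picks whichever accelerator backend the build exposes; it assumes PyTorch 2.6+ (where torch.accelerator is available), and the tensor check is a made-up placeholder rather than code from the pull request.

    import torch

    # torch.accelerator (PyTorch 2.6+) abstracts over backends such as
    # cuda, xpu, and mps; fall back to CPU when no accelerator is present.
    if torch.accelerator.is_available():
        device = torch.accelerator.current_accelerator()
    else:
        device = torch.device("cpu")

    x = torch.randn(16, 16, device=device)
    assert (x @ x.t()).shape == (16, 16)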

Other Closed Pull Requests

  • Type Annotations and Static Checking: This topic involves adding type annotations and running the mypy static type checker on the setup.py file in the PyTorch project. Although part of a stack of related changes, this pull request was ultimately not merged.
    • pull/156741
  • CUDA and Windows Support: The pull request aims to add support for building with CUDA 12.9.1 on Windows by removing the SegmentReduce.cu component. It addresses related issues such as memory limits and out-of-memory errors, and includes multiple commits for fixes, tuning, and linting.
    • pull/156630
  • Cutlass Library Updates: This topic covers renaming the Cutlass Python library to "python-cutlass" and moving the Cutlass key to the cutlass_library within the Cutlass backend. Both pull requests were part of a stack of changes managed by ghstack, but the renaming was not merged.
    • pull/156655, pull/156654
  • NVSHMEM Test Refactoring: The pull request refactors the NVSHMEM tests by moving the Triton-specific tests into a new dedicated file and reorganizes the shared Triton JIT kernels for better reusability. It ensures all original tests pass without any changes in functionality.
    • pull/156685
  • C++ Upgrader for JSON: This pull request involves the addition of a C++ upgrader designed for JSON-based upgrading within the PyTorch project. It was part of a stack of changes managed through the ghstack tool, although it was ultimately not merged.
    • pull/156761
  • Runtime Error Fixes in layer_norm: This pull request addresses an issue where calling the sum() function on a default-constructed tensor in the layer_norm function could lead to runtime errors. It implements null pointer checks and ensures that sum(0) is only called on defined tensors, along with adding and tweaking tests to verify the fix.
    • pull/156600
  • unbind_copy Function Fix: This pull request addresses an issue where the unbind_copy(..., out=...) function was incorrectly returning None instead of the out argument. The patch fixes this by updating the fake kernel to properly handle the out variant.
    • pull/156643
  • H100 CI Testing for Distributed Code: The pull request aims to enable H100 continuous integration (CI) testing for all changes related to distributed code in the PyTorch project. It leverages the existing "oncall:distributed" label to automatically trigger these tests.
    • pull/156721
  • Documentation Updates: This topic includes removing references to TorchScript from the export documentation and updating the documentation for torch.device to officially support its constructor with various methods. The TorchScript removal was not merged, while the torch.device update addresses issue #156519.
    • pull/156969, pull/156686
  • Metal Kernel Optimization: The pull request aims to optimize the performance of cummin and cummax operations in Metal kernels for PyTorch. It results in significant speed improvements across various tensor sizes and data types, as demonstrated by the provided performance metrics.
    • pull/156794
  • Benchmarking Scripts for torch.utils.data: This pull request aims to centralize benchmarking scripts for torch.utils.data components by adding a new sub-folder in the benchmarks directory. It includes a simple script to time samplers, establishing a common standard and preventing redundant script copying.
    • pull/156974
  • CUDA CI Pipeline Testing: This pull request was created to test the b200 configuration using a dummy GitHub Action in the CUDA Continuous Integration (CI) pipeline. It was ultimately not merged.
    • pull/157184
  • Optimized Implementation of Guard Collectives: The pull request proposes an optimized implementation of guard collectives by replacing the configuration option with a set_stance API. This reduces the overhead associated with checking configuration values during the torch.compile process, as highlighted by performance issues in the functorch_maml_omniglot benchmark.
    • pull/156562
  • Allocation Backend Selection: This pull request introduces a programmatic method to select the allocation backend through a new set_backend API. It provides a more dynamic alternative to the existing environment variable method, allowing for easier configuration in continuous integration environments.
    • pull/156661
  • fx_graph_runnable Script Execution: This pull request ensures that fx_graph_runnable scripts can execute simple tensor functions by adding test boilerplate and incorporating review changes from a previous pull request. It is part of a stack of changes managed through ghstack.
    • pull/156870
  • Removal of "gso" from Linear.cpp: This pull request involves the removal of "gso" from the Linear.cpp file, with multiple updates and commits focusing on the sumproduct_pair function. It was ultimately not merged.
    • pull/156899
  • FP64 Scalar Input Issue: This pull request addresses an issue where kernels that take fp64 scalar inputs generate incorrect results by skipping the correctness test for test_floats in the static launcher. It is a temporary measure until the underlying problem is resolved.
    • pull/157023
  • Graph Break Issue in torch.Tensor.data: This pull request addresses a graph break issue related to the assignment of torch.Tensor.data with a mismatched data type. It provides a temporary workaround and is linked to fixing issue #152162.
    • pull/156623
  • Triton NVSHMEM Test Suite Improvements: This pull request removes unnecessary dist.barrier calls from the Triton NVSHMEM test suite and introduces device-side signal operation support. It enhances synchronization efficiency by leveraging NVSHMEM's native ordering guarantees.
    • pull/156684
  • Release-Specific Changes for Version 2.8: This pull request involves release-specific changes for version 2.8 of the PyTorch project. It has fewer modifications than a previous related pull request due to the removal of the need for docker pinning and has been successfully merged.
    • pull/156728
  • ONNX Script API Update: This pull request aims to update the ONNX script API to be compatible with Torch version 2.8. It includes multiple commits with changes to installation scripts and requirements files, although it was ultimately not merged.
    • pull/157017
  • Migration to OpenReg: This pull request aims to migrate the cpp_extensions_open_device_registration to OpenReg. It includes considerations for fake tensors, named tensors, and custom autograd functions, but it was not merged.
    • pull/156588