Weekly GitHub Report for PyTorch: April 06, 2026 - April 13, 2026 (19:29:09)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [MODULE: ONNX] [TRIAGED] Cannot export model with CenterCrop to ONNX: This issue concerns a failure to export a classification model that includes a CenterCrop preprocessing step to the ONNX format, resulting in a TypeError because Python's built-in round() function cannot handle symbolic tensor inputs during export. The user seeks a way to include preprocessing within the ONNX model for ease of distribution, but encounters errors related to symbolic tracing and module registration when using dynamic export and preprocessing transforms from torchvision's v2 API.
- The comments clarify that the error arises because CenterCrop uses Python's round() on symbolic tensors, which is unsupported during ONNX export; the recommended workaround is to exclude preprocessing from the ONNX graph and perform it externally. Attempts to encapsulate preprocessing in torch.nn.Module subclasses and switch to dynamic export still fail due to tracing and dtype issues, with suggestions to use float32 dummy inputs and convert preprocessing transforms into registered submodules, but the user reports ongoing difficulties integrating preprocessing while maintaining dynamic export functionality.
- Number of comments this week: 11
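The failure mechanism can be illustrated without PyTorch: Python's round() dispatches to a __round__ method, and an object standing in for a symbolic size that lacks one raises TypeError. The class below is a minimal stand-in for illustration only, not PyTorch's actual SymInt:

```python
# Illustrative stand-in for a symbolic size seen during export tracing.
# NOT PyTorch's real SymInt -- just a minimal object with no __round__,
# which is enough to reproduce the kind of TypeError the issue reports.
class FakeSymbolicSize:
    def __init__(self, name: str) -> None:
        self.name = name


try:
    round(FakeSymbolicSize("s0"))
except TypeError as exc:
    # round() requires a __round__ method, which this object does not define
    print(f"round() failed as expected: {exc}")
```

This is why the recommended workaround is to keep CenterCrop (and similar size-arithmetic transforms) out of the exported graph and apply them externally.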
- [RFC] Optimize persistent reduction with recompute from cached inputs: This issue proposes optimizing persistent reduction in normalization kernels by caching inputs as float16 and converting to float32 only during computation to reduce register pressure, and by using shared memory with TMA for larger hidden sizes to improve performance. It includes preliminary results showing significant bandwidth improvements with these methods and suggests integrating heuristic routing in Inductor's scheduler to select the appropriate kernel based on reduction dimension.
- The comments discuss implementing heuristic routing in the scheduler to choose between different kernel strategies based on input size, provide example code for the register-tiled path, confirm the use of Gluon for TMA shared memory caching, and conclude with agreement that the proposed optimizations are effective and the issue can be closed.
- Number of comments this week: 5
- torch.compile degrades the performance compared with eager execution: This issue reports that using torch.compile to optimize a token entropy calculation function results in significantly worse performance compared to eager execution when the batch size is small (1 to 64), with improvements only appearing at larger batch sizes like 512. The user suspects that the compilation introduces overhead that slows down execution in low-latency scenarios, and provides benchmark results demonstrating this behavior across different compilation modes.
- The comments suggest trying to disable split reductions via torch._inductor.config.split_reductions = False to improve performance, with one user confirming that this setting alleviates the issue for the default compilation mode but not for others; another explains that overhead from CUDA graph usage in certain modes causes the slowdown, and a final suggestion is made to try the max-autotune-no-cudagraphs mode to potentially mitigate the problem.
- Number of comments this week: 4
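A minimal sketch of the suggested workaround, shown as a configuration fragment. The entropy function below is illustrative, not the reporter's exact code; only the config flag comes from the discussion:

```python
import torch

# Workaround suggested in the comments: disable Inductor's split
# reductions before compiling (reported to help the default mode only).
torch._inductor.config.split_reductions = False


@torch.compile
def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Illustrative token-entropy computation, not the reporter's code.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)
```

Whether this helps is workload- and mode-dependent; per the discussion it did not resolve the slowdown for CUDA-graph-based modes.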
- [HIGH PRIORITY] [MODULE: CORRECTNESS (SILENT)] [MODULE: REDUCTIONS] [ONCALL: PT2] [MODULE: INDUCTOR] [TOPIC: FUZZER] [BOT-TRIAGED] torch.compile produces significantly different output for model with multiple amax/amin reductions on same tensor compared to eager mode: This issue reports that using torch.compile with the Inductor backend on a CNN model performing multiple reduction operations on the same tensor results in significantly different outputs compared to eager mode, with differences large enough to indicate an algorithmic error rather than floating-point precision drift. The problem is linked to Inductor's reuse_partial optimization, which attempts to reuse partial reduction results but appears to incorrectly handle fused reduction kernels when multiple reductions with overlapping dimensions are involved, causing large mismatches in the model's output.
- The comments suggest disabling aggressive reduction splitting/fusion in Inductor as a workaround, explain that near-zero values in the denominator amplify small differences, and recommend increasing the epsilon or clamping the denominator to stabilize computations; additional advice includes using a compiler fallback flag and restarting the environment to resolve transient issues.
- Number of comments this week: 3
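The two effects the commenters describe can be shown in plain Python: floating-point addition is not associative, so a compiler that splits or reorders a reduction can legitimately produce different bits than eager mode, and a near-zero denominator then amplifies the tiny difference:

```python
# Floating-point addition is not associative, so changing the order in
# which a reduction accumulates its terms changes the rounded result.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed when added to 1e16
b = (1e16 - 1e16) + 1.0   # cancellation happens first, so 1.0 survives
print(a, b)               # 0.0 1.0

# A near-zero denominator then amplifies such tiny differences,
# turning a 1-ulp-scale discrepancy into a large output mismatch.
eps = 1e-12
print(1.0 / (a + eps), 1.0 / (b + eps))
```

This is also why the suggested mitigations are to clamp the denominator or increase epsilon rather than to expect bit-identical reductions.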
- [FEATURE] [TRIAGED] [MODULE: FX] [ONCALL: PT2] [MODULE: DYNAMO] [BOT-TRIAGED] Call Hierarchy Metadata: This issue proposes adding a new metadata field called node.meta["call_hierarchy"] to FX nodes, which will capture the complete, interleaved module and function call chain responsible for producing each operation during Dynamo tracing. This enhancement aims to provide detailed provenance information that is not currently available in existing metadata fields, improving traceability and debugging capabilities.
- The comments include a mention of related discussions on provenance information in a recent meeting, a request for a link to the prototype implementation, and the provision of a pull request URL containing the prototype.
- Number of comments this week: 3
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 192
Summarized Issues:
- torch.compile numerical discrepancies and fusion issues: Multiple issues report that torch.compile with the Inductor backend produces numerical discrepancies compared to eager mode due to changes in floating-point accumulation order and fusion optimizations. These discrepancies affect various operations including mm + bias fusion, batch pointwise math ops, quantization pipelines, batched GEMM fusion, and Transformer attention patterns, leading to compounded errors and systematic output differences.
- issues/179520, issues/179561, issues/179562, issues/179567, issues/179568, issues/179569, issues/179570, issues/179574, issues/179575, issues/179577, issues/179578
- torch.compile runtime errors and optimization bugs: Several issues describe runtime errors and bugs in torch.compile with the Inductor backend caused by incorrect optimization passes or graph lowering failures. Problems include shape mismatch errors from redundant view removals, failures in handling multiple consumers of cumsum results, crashes on heterogeneous torch.randint calls, and assertion errors in FFT operations, all while the models run correctly in eager mode.
- issues/179502, issues/179510, issues/179534, issues/179571, issues/179573, issues/179576, issues/179807, issues/179883
- ONNX export and model serialization issues: There are issues with exporting PyTorch models to ONNX format and serializing opaque type constants. Problems include invalid ONNX models generated when reusing Conv2d+BatchNorm2d blocks with optimization enabled, failures exporting models with preprocessing steps due to symbolic dimension handling, and runtime errors when saving or loading exported programs containing opaque constants.
- issues/179559, issues/179709, issues/179870
- Numerical discrepancies between CPU and CUDA implementations: Multiple issues report significant numerical differences between CPU and CUDA outputs for various functions such as torch.special.erfcx, torch.renorm, torch.linalg.eigvalsh, and log determinant calculations. These discrepancies can impact numerical stability and reproducibility in neural network computations.
- issues/179785, issues/179787, issues/179789, issues/180105
- MPS backend numerical and gradient bugs: The MPS backend exhibits numerical correctness bugs including negative values from avg_pool1d despite non-negative inputs due to precision loss, and incorrect or zero gradients in backward passes for scalarized reductions, indicating issues in the MPS Metal shader and autograd implementations.
- issues/179608, issues/180201
- Disabled tests and CI failures on ROCm platform: A large number of tests have been disabled on the ROCm platform due to consistent failures, hangs, or flaky behavior across various test suites including InlineAsmElementwise, ComboKernel, DistTensorRandomOpCompileTest, TestCudaAllocator, and others. These widespread test disables indicate ongoing stability and compatibility challenges with ROCm in the main branch.
- issues/179925, issues/179927, issues/179938, issues/179939, issues/179940, issues/179942, issues/179943, issues/179944, issues/179945, issues/179946, issues/179947, issues/179948, issues/179949, issues/179950, issues/179951, issues/179952, issues/179953, issues/179954, issues/179955, issues/179956, issues/179957, issues/179958, issues/179959, issues/179960, issues/179961, issues/179962, issues/179963, issues/179964, issues/179965, issues/179967, issues/179968, issues/179970, issues/179973, issues/179976, issues/179977, issues/179979, issues/179981, issues/179982, issues/179983, issues/179984, issues/179985, issues/179986, issues/179987, issues/180006, issues/180009, issues/180010, issues/180011, issues/180012, issues/180013, issues/180014, issues/180015, issues/180016, issues/180017, issues/180018, issues/180019, issues/180021, issues/180022, issues/180023, issues/180024, issues/180025, issues/180026, issues/180027, issues/180028, issues/180029, issues/180030, issues/180031, issues/180032, issues/180033, issues/180034, issues/180035, issues/180036, issues/180037, issues/180038, issues/180039, issues/180040, issues/180041, issues/180042, issues/180043, issues/180044, issues/180045, issues/180046, issues/180047, issues/180048, issues/180049, issues/180051, issues/180053, issues/180054, issues/180055, issues/180056, issues/180057, issues/180058, issues/180059, issues/180060, issues/180061, issues/180062, issues/180063, issues/180064, issues/180065, issues/180066, issues/180067, issues/180069, issues/180071, issues/180072, issues/180073, issues/180074, issues/180075, issues/180076, issues/180077, issues/180078
- Precision and numerical accuracy issues on GPU: The GPU implementations of certain operations such as float16 cumulative sum and two-pass variance calculation exhibit significant precision loss or overflow compared to CPU implementations. This results in accuracy degradation by factors of 11-15 and overflow errors due to intermediate squared values exceeding float32 limits.
- issues/180150, issues/180156
- Random number generation and DTensor bugs: DTensor has a bug where the CPU RNG state does not advance on CUDA meshes, causing infinite loops in truncated normal initialization. Additionally, torch.cuda.Stream objects are traced incorrectly by torch.compile, leading to segmentation faults when using non-default CUDA streams.
- issues/180088, issues/180179
- Distributed and collective operation issues: There are bugs related to distributed operations including a race condition in ProcessGroupGloo causing memory corruption, failures in tracing distributed all_reduce due to ProcessGroup serialization issues, and synchronous behavior of all_to_all collectives under torch.compile preventing expected asynchronous overlap.
- issues/179848, issues/179858, issues/179922
- Performance regressions and optimization proposals: Some issues report performance regressions such as slower small-M FP8 fused scaled matmul on SM100 hardware and performance degradation of compiled Token Entropy calculations at small batch sizes. Proposals include optimizing persistent reduction kernels and adding XPUGraph Trees for Intel GPUs to reduce host overhead.
- issues/179697, issues/179711, issues/179770, issues/180168
- Documentation and API improvements: Proposals include adding a new node.meta["call_hierarchy"] field to FX nodes for better provenance tracking, clarifying the documentation of torch.Tensor.index_add_ regarding the index parameter, and adding a stable C API function to obtain AtenTensorHandle from Python tensors.
- issues/179643, issues/180107, issues/180119
- Hardware and platform-specific issues: There are platform-specific problems such as segmentation faults on Intel ARC PRO B70 GPUs due to missing SYCL device info, ROCm packaging errors causing missing TensileLibrary.dat files, and CUDA runtime errors on Windows with RTX 5080 GPUs due to missing kernel images.
- issues/179865, issues/179891, issues/180101
- Miscellaneous bugs and test infrastructure issues: Other issues include flaky test failures due to nondeterministic error ordering, CI jobs not failing despite test failures, and disabled tests due to profiler-dependent failures on aarch64 architecture.
- issues/179703, issues/179723, issues/180002
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 36
Summarized Issues:
- Build and Compilation Failures: Several issues report build or compilation failures across different platforms and configurations. These include Windows build failures with CUDA Toolkit versions above 13.0, ROCm nightly build breakages due to C++20 enforcement and missing dependencies, and segmentation faults on GPU inference caused by outdated GCC and CMake versions.
- Numerical Accuracy and Stability Discrepancies Between CPU and GPU: Multiple issues highlight significant numerical mismatches and stability problems when running operations on GPU compared to CPU. These include discrepancies in NaN counts, cumulative sum/product accuracy, least squares solver results, and conversions of NaN/Inf values to integers, indicating inconsistent precision and rounding errors across devices.
- PyTorch Test Failures and Disabled Tests on ROCm Platform: Several tests have been disabled or fail consistently on the ROCm platform, affecting continuous integration and main branch stability. These include tests related to distributed processing, autotuning, dynamic shapes, and numerical backends, indicating platform-specific reliability issues.
- TorchScript and Serialization Errors: Issues have been reported involving crashes and errors during model serialization and deserialization. These include segmentation faults from corrupted .pt files and serialization failures due to unsupported tensor data types like torch.uint32, highlighting robustness and compatibility problems.
- Torch.compile and Triton Kernel Compilation Bugs: There are bugs affecting torch.compile and Triton kernel compilation, such as conflicts in operator handlers causing crashes, improper handling of multi-buffer epilogues leading to compilation errors, and failures tracing numpy.flat attributes on FakeTensor. These issues cause runtime errors and compilation failures in optimized code paths.
- issues/179233, issues/179418, issues/179582
- Environment and Test Setup Issues Causing Failures: Some tests fail due to improper environment variable management or stale state between tests. For example, environment variables set during test setup are not cleared, causing named pipe errors in subsequent tests, which points to test suite hygiene problems.
- Hash Mismatch and Package Integrity Problems: Users encountered hash mismatch errors when installing PyTorch binaries via pip on Linux, caused by inconsistencies between mirrors. These issues were resolved by verifying and syncing package binaries to ensure consistent installation experiences.
- Precision Loss in Mixed-Type Tensor Operations: A bug causes silent precision loss when adding int32 tensors to bfloat16 tensors due to type promotion rules, resulting in incorrect rounding of integer values above 256. This suggests a need to revise promotion rules to prevent unintended data corruption.
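The mechanism behind that silent loss can be shown without PyTorch: bfloat16 keeps only 8 significand bits, so not every integer above 256 is representable. The helper below is a pure-Python stand-in that rounds a float to bfloat16 and back (round-to-nearest-even); it is not PyTorch's implementation, just an illustration of the precision limit:

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 (round-to-nearest-even) and back to float.

    A pure-Python stand-in: bfloat16 is the top 16 bits of a float32,
    so we pack to float32, round away the low 16 bits, and unpack.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = bits & 0xFFFF          # the 16 bits bfloat16 discards
    upper = bits >> 16             # sign, exponent, top 7 mantissa bits
    # round-to-nearest-even on the truncated half
    if lower > 0x8000 or (lower == 0x8000 and upper & 1):
        upper += 1
    return struct.unpack("<f", struct.pack("<I", upper << 16))[0]


print(to_bfloat16(256.0))  # 256.0: still exactly representable
print(to_bfloat16(257.0))  # 256.0: silently rounded down
```

An int32 tensor promoted to bfloat16 before the add goes through exactly this kind of rounding, which is why integer values above 256 can change silently.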
- ROCm Profiling and Symbol Presence Questions: There is uncertainty about the addition of the "rocprofiler_configure" symbol in the libtorch_cpu.so library, with users seeking clarification on whether it relates to ROCm profiling features.
- PyTorch API and Documentation Issues: Incorrect type annotations in stub files for torch.cat and torch.stack functions have been reported, along with criticism of contribution guidelines and issue templates for being discouraging and poorly written, indicating documentation and developer experience concerns.
- Numerical Function Bugs Producing Incorrect Outputs: The Softplus function returns infinite values instead of finite outputs when given extremely large beta and infinite threshold parameters, indicating a numerical stability bug affecting both CPU and GPU implementations.
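Softplus normally stays finite because of a threshold guard: per the documented formula, softplus(x) = (1/β)·log(1 + exp(βx)), reverting to the linear function x once βx exceeds the threshold. A pure-Python sketch of that guard shows why an effectively infinite threshold removes the protection:

```python
import math


def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    # Documented formula: (1/beta) * log(1 + exp(beta * x)), reverting
    # to the linear function x once beta * x exceeds the threshold.
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta


print(softplus(1000.0))  # 1000.0: the linear branch avoids exp overflow
# With an effectively infinite threshold the guard never fires, so
# exp(beta * x) overflows -- the instability the issue describes.
```

This sketch uses Python's math module, not the actual CPU/GPU kernels, but the guard structure is the same idea the issue's parameter combination defeats.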
- Community and Contribution Workflow Discussions: There is an ongoing discussion about best practices for the contribution workflow, including branch naming, pull request reviews, and testing requirements, aiming to improve project collaboration and code quality.
- Miscellaneous Issues and Feature Suggestions: Some issues include milestone celebration proposals and reports of inconsistent segmentation masks in industrial applications, reflecting community engagement and real-world application challenges.
- Workspace Size and Algorithm Selection Problems: MIOpen Gemm convolution solvers return a workspace size of zero on the gfx1150 GPU when used via PyTorch's ATen convolution path, causing these solvers to be excluded silently from algorithm selection and resulting in fallback to slower convolution methods.
- Checkpoint and Memory History System Errors: A SystemError occurs in PyTorch 2.11 when using compiled checkpointed models with CUDA memory history recording enabled, causing built-in methods to return NULL without exceptions during backward passes, despite working in earlier versions.
- Runtime Errors in FX and Dynamo Optimizations: The skip_nested_compile context manager does not apply properly during checkpoint callbacks, causing runtime errors when symbolically tracing dynamo-optimized functions due to missing decorator application on rematerialization callbacks.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 282
Key Open Pull Requests
1. [dynamo] Cache slice sizes to avoid unrelated unbacked symbols: This pull request addresses an issue in PyTorch's Dynamo where slicing with unbacked symbolic integer indices caused the creation of unrelated unbacked symbols for slice sizes, leading to equality check failures, by introducing a cache in ShapeEnv that reuses slice size symbols keyed by symbolic expressions to ensure consistent and deterministic slice size computation, thereby fixing a graph break in HuggingFace Aria's sequential_experts_gemm (MoE pattern).
- URL: pull/179635
- Associated Commits: 20a61, c6c48, 92eeb, 56073, fd164, 6f694, 3ee71, 23343, 25c0b, 00787, b84a9, 8b427, fb7a9, 05eb4, 3be85, d15b1, bdd26, 811ea, 0c337, 9ba1d, 3e8f1
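The caching idea can be sketched in a few lines. This is a hypothetical illustration, not the real ShapeEnv API: size symbols are keyed by their defining symbolic expression, so two identical slices reuse one symbol and equality checks succeed instead of comparing two unrelated unbacked symbols:

```python
# Hypothetical sketch of the caching idea (not PyTorch's ShapeEnv API):
# key newly minted size symbols by their defining symbolic expression
# so repeated identical slices share one symbol.
class SliceSizeSymbolCache:
    def __init__(self) -> None:
        self._by_expr: dict[str, str] = {}

    def symbol_for(self, expr: str) -> str:
        # Reuse the cached symbol for a previously seen expression;
        # otherwise mint a fresh one deterministically.
        if expr not in self._by_expr:
            self._by_expr[expr] = f"u{len(self._by_expr)}"
        return self._by_expr[expr]


cache = SliceSizeSymbolCache()
s1 = cache.symbol_for("Min(u0, 128)")  # size of x[:u0] the first time
s2 = cache.symbol_for("Min(u0, 128)")  # the same slice seen again
print(s1 == s2)  # True: the equality check no longer fails
```

Keying by expression rather than by call site is what makes the computation deterministic across repeated traces of the same slice.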
2. [OSDC] Enable linux-docs on OSDC pull: This pull request adds support for the OSDC (ARC) runner to the documentation build configuration by enabling a build-docs-osdc job in _docs.yml that uses a matrix for C++ and Python documentation types and runs directly in the container image instead of using docker-in-docker, while preserving and gating the existing EC2 build path.
- URL: pull/179994
- Associated Commits: 0ac48, f92ce, 2f906, daf0f, 68b8a, 05b3d, 90f09, 223f8, aa628, 44283, 5bb7b, 60925, 7a0bc, 6ac0f, 966a6, a029a, 7c4db, 45717
3. [CI] Fix flaky SIGPIPE in backwards_compat test: This pull request fixes a flaky SIGPIPE error in the backwards_compat test by replacing a pipe-based grep command with a here-string to avoid a race condition caused by the -o pipefail setting in the GitHub Actions shell, ensuring reliable detection of the canary module in test output.
- URL: pull/179999
- Associated Commits: 96e77, c1a07, 209d3, ec800, 63ad8, 4c421, e7716, 8101d, e2a8e, 241fd, a6f95, 52e36, 15470, ca98a, e8c0d, 5665a, 009e1
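The pipe-versus-here-string distinction is worth a sketch. Under `set -o pipefail`, a `grep -q` that exits as soon as it matches can leave the upstream writer with a SIGPIPE, failing the whole pipeline spuriously; a here-string avoids the pipe entirely. The string and pattern below are illustrative, not the test's actual output:

```shell
#!/usr/bin/env bash
set -o pipefail

out="... canary module loaded ..."

# Pipe form: grep -q may exit before the writer finishes, and under
# pipefail the writer's SIGPIPE failure fails the whole pipeline.
#   printf '%s\n' "$out" | grep -q canary

# Here-string form: no pipe, so no SIGPIPE race.
if grep -q canary <<< "$out"; then
    echo "canary found"
fi
```

The race is intermittent (it depends on scheduling), which is exactly why the original pipe form showed up as flakiness rather than a consistent failure.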
Other Open Pull Requests
- Dynamo tracing and graph break fixes: Multiple pull requests improve the Dynamo tracing process and fix graph breaks in the HuggingFace VitsModel by making functions like expect_true traceable during compilation and adding constant-folding for importlib functions. These changes enable runtime assertions to be emitted into the graph and treat importlib functions as SkipFunctionVariable, enhancing graph stability.
- Inductor eager tests and Intel GPU support: Several pull requests enable and port inductor unittest tests to run on Intel GPUs by activating specific test suites and defining expected failures for XPU devices. This effort improves test coverage and compatibility for inductor compiled regions on Intel hardware.
- Inductor CUDA graph and kernel annotation enhancements: Pull requests introduce kernel annotations into CUDA graph trees and add a CUDAGraphPolicy base class to Inductor's post_compile pipeline. These changes enable customizable cudagraph wrapping and automatic processing of kernel scopes during CUDA graph capture, supporting advanced use cases and improving graph management.
- Autograd and memory management improvements: Updates include replacing std::shared_ptr<Node> with c10::intrusive_ptr<Node> for better memory management and implementing thread-safe Python wrapping for autograd Node objects. These changes enhance memory handling and thread safety within the autograd system.
- FSDP collective bucketing optimization: A pull request introduces a pre-bucketing strategy for Fully Sharded Data Parallel collectives to improve bucketing efficiency by considering process group bandwidth and latency. This resolves issues caused by extra dependencies that previously hindered effective bucketing in overlap scheduling.
- Debugging tools for CUDA stream and ROCm CI: Pull requests add a context manager to warn about work enqueued on the NULL CUDA stream and introduce debug instrumentation in the ROCm CI pipeline to diagnose incorrect pytest exit codes. These tools help detect and debug CUDA graph capture errors and nondeterministic test failures.
- Dynamo code refactoring and typing improvements: Refactoring efforts include splitting large test files into smaller modules and adding missing type annotations across the torch._dynamo module. These changes improve code readability, maintainability, and typing consistency without affecting runtime behavior.
- OpaqueGenerator replacement and FX integration: One pull request replaces the OpaqueGenerator Python wrapper by making torch._C.Generator a proper opaque reference type with enhanced C++ support and FX codegen integration. This eliminates the wrapper and enables direct registration and seamless flow through FX infrastructure.
- FakeTensor higher order operations registration: A pull request registers higher order operations with the Fake dispatch key in the C++ faketensor implementation and tests them by invoking internal operations directly. This is necessary due to the current lack of integration between the compile function and the C++ faketensor.
- Continuous integration and documentation build fixes: Pull requests introduce a new CI workflow with thread sanitizer enabled and fix Python documentation build hangs by applying multiple optimizations. These improvements reduce build times and detect data races more effectively.
- AOT Inductor debugging and code generation enhancements: A pull request adds a dump_python_module flag to AOT single-pass compilation mode, enabling output of Python compiled modules with Triton kernel source and execution graph info. This facilitates improved debugging and profiling with kernel name consistency for Kineto trace correlation.
- OrderedDict refactoring in Dynamo: One pull request refactors the Dynamo component by introducing OrderedDict as a subclass of UserDefinedDictVariable to better resemble CPython structure and support user-defined object variable-like cases.
- DTensor sharding propagation fix: A pull request fixes sharding propagation for Split(Flatten) operations in DTensor by correctly handling non-first flatten dimensions and adding checks to prevent invalid shard configurations. This ensures accurate tracking of input dimensions and avoids silent incorrect results.
- Dynamo getitem_const self-validation: One pull request makes the getitem_const function in Dynamo self-validating to ensure its behavior matches CPython's subscript functions.
- Multi-grad hook and backward debugging: A pull request introduces register_multi_grad_hook for leaf functions and adds debug print capabilities for backward passes and multi-tensor operations to enhance gradient computation tracing.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 291
Key Closed Pull Requests
1. [OSDC] Enable bazel build+test on OSDC pull: This pull request proposes adding support for the OSDC (ARC) runner in the _bazel-build-test.yml configuration by introducing a build-and-test-osdc job that runs directly in the container image with an inline filter, following the pattern of the existing Linux build setup, while preserving and gating the EC2 build path based on the use-arc input.
- URL: pull/179969
- Associated Commits: ce88f, 69507, 3ef4a, 9472f, de1cf, c1315, 65328, 15464, b9517, 217e4, b4471, 40cd4, 3bc55, 4fc63, 0b016, 8e9ab
2. [dynamo] Support copy.deepcopy via polyfill: This pull request proposes adding support for the copy.deepcopy function in the Dynamo component of the PyTorch project by implementing a polyfill to enable this functionality.
- URL: pull/179611
- Associated Commits: cf24d, 75631, 482cc, 552fd, da48b, 38807, 3b17d, 7fe85, c2172, 2e49b, 5e1e8, 592c1
3. [dynamo] Improve data-dependent branching error messages: This pull request improves error messages for data-dependent branching in Dynamo by showing the FX graph computation chain leading to the branch condition with user source locations, adding hints to use plain Python integers instead of scalar integer tensor buffers to avoid unsupported dynamic control flow, and fixing stack trace population on FX nodes to enable better debugging.
- URL: pull/179640
- Associated Commits: 9c71c, a2f04, 59e5b, 69777, 1088f, 18a7a, c86ea, 298b8, b2155, add01, ee33a, 2a411
Other Closed Pull Requests
- CI Workflow and Build Optimization: This topic covers the introduction of the Runway CI workflow that enables GPU-accelerated build and test tasks for PyTorch using parameterized self-hosted runners. It also includes a periodic build snapshot workflow for fast incremental rebuilds via cached artifacts on S3, enhancing security and usability by allowing custom GitHub tokens and Bedrock OIDC authentication without Anthropic API keys.
- Dynamo Bug Fixes and Refactoring: Multiple pull requests address Dynamo improvements including fixing f-string mutation ordering by eagerly formatting supported Python values, removing special casing for namedtuple and struct sequence objects, and refactoring set tracking by decoupling SetVariable from dictionary tracking. These changes improve Python semantics, tracing consistency, and align Dynamo behavior more closely with CPython.
- Autograd Cache Key Enhancements: Several pull requests add and improve the autograd_cache_key function across different modules, including compile_fx and standalone_compile, with tests verifying consistency and handling of multiple outputs. Additional work extracts prepare_aot_config to optimize cache key computation and rejects unsupported graph shapes to prevent incorrect keys.
- Autograd Backward Node Tagging and Rematerialization: This topic introduces a _patch_autograd_grad context manager that patches autograd.grad to preserve stacktraces and automatically tag backward nodes, improving rematerialization pass detection. It also enforces user annotations when multiple backward regions are ambiguous and migrates existing tests accordingly.
- PyTorch Profiler Metadata Improvements: Multiple pull requests enhance the PyTorch profiler by adding CPU operation and Python function event metadata, including tensor strides, dtypes, shapes for TensorList arguments, and Python module/function launch event details. These changes improve profiling detail while maintaining backward compatibility.
- Callable Object Handling in Dynamo: Several pull requests implement and improve callable detection and invocation in Dynamo by adding a python_type() method to VariableTracker classes, introducing is_callable() and generic_call() methods, and mirroring CPython's PyCallable_Check and PyObject_Call mechanisms. This work replaces hardcoded type checks with a unified callable interface and improves error handling.
- Test Suite Stability and Coverage Enhancements: This topic includes marking specific inductor/test_flex_flash cases as expected failures due to unsupported features and adding a test to ensure SAC ignored operation annotation does not leak. It also covers enabling full AArch64 unit testing by removing custom scripts and adding multi-shard test jobs for m8g and periodic tests for m7g.
- Batch Matrix Multiplication Native API Implementation: This pull request introduces a native API for batch matrix multiplication outer product operations, addressing limitations of the existing SummaryPT approach. It explores performance improvements with autotuning and provides extensive correctness testing across data types, tensor layouts, and shapes compared to cuBLAS.
- Output Handling and Module Splitting: This pull request adds a tuple_return option to the split_module function to ensure submodule outputs are consistently wrapped in tuples, complying with the compile_fx convention. It also simplifies output node generation by removing unnecessary handling of empty tuples for zero-output partitions.
- Set Tracking Refactor in Dynamo: This pull request refactors the SetVariable class to inherit directly from VariableTracker instead of ConstDictVariable, fully decoupling set tracking from dictionary tracking. It reorganizes related classes and functions into a new sets.py module and updates type checks and imports accordingly.
- Higher Order Derivatives for grid_sample: This pull request attempts to implement higher order derivatives for the bilinear case in the grid_sample function, aiming to generalize the approach to higher dimensional inputs while addressing various bugs and code improvements.
- Cache Key Equivalence Testing: This pull request adds a cache key equivalence mixin to the AOTAutogradCacheTests to improve testing of cache key behavior in PyTorch.
- Redirect to Claude Updates: This pull request consists of a series of updates marked with "[ghstack-poisoned]" commits related to redirecting to Claude.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| bobrenjc93 | 646 | 86 | 0 | 4 |
| anijain2305 | 319 | 33 | 0 | 9 |
| malfet | 170 | 22 | 4 | 11 |
| huydhn | 170 | 12 | 1 | 5 |
| mlazos | 148 | 7 | 0 | 0 |
| frgossen | 130 | 20 | 0 | 2 |
| yangw-dev | 142 | 2 | 0 | 1 |
| weifengpy | 128 | 11 | 1 | 3 |
| Skylion007 | 38 | 6 | 0 | 89 |
| aorenste | 84 | 28 | 0 | 19 |
Access Last Week's Newsletter: