Weekly GitHub Report for PyTorch: April 06, 2026 - April 13, 2026 (19:29:09)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [MODULE: ONNX] [TRIAGED] Cannot export model with CenterCrop to ONNX: This issue concerns a failure to export a classification model that includes a CenterCrop preprocessing step to the ONNX format, resulting in a TypeError because Python's built-in round() function cannot handle symbolic tensor inputs during export. The user seeks a way to include preprocessing within the ONNX model for ease of distribution, but encounters errors related to symbolic tracing and module registration when using dynamic export and preprocessing transforms from torchvision's v2 API.
- The comments clarify that the error arises because CenterCrop uses Python's round() on symbolic tensors, which is unsupported during ONNX export; the recommended workaround is to exclude preprocessing from the ONNX graph and perform it externally. Attempts to encapsulate preprocessing in torch.nn.Module subclasses and switch to dynamic export still fail due to tracing and dtype issues, with suggestions to use float32 dummy inputs and convert preprocessing transforms into registered submodules, but the user reports ongoing difficulties integrating preprocessing while maintaining dynamic export functionality.
- Number of comments this week: 11
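The failure mechanism can be illustrated without PyTorch: Python's round() dispatches to a __round__ method, and an object standing in for a symbolic size that lacks one raises TypeError. The class below is a minimal stand-in for illustration only, not PyTorch's actual SymInt:

```python
# Illustrative stand-in for a symbolic size seen during export tracing.
# NOT PyTorch's real SymInt -- just a minimal object with no __round__,
# which is enough to reproduce the kind of TypeError the issue reports.
class FakeSymbolicSize:
    def __init__(self, name: str) -> None:
        self.name = name


try:
    round(FakeSymbolicSize("s0"))
except TypeError as exc:
    # round() requires a __round__ method, which this object does not define
    print(f"round() failed as expected: {exc}")
```

This is why the recommended workaround is to keep CenterCrop (and similar size-arithmetic transforms) out of the exported graph and apply them externally.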
- [RFC] Optimize persistent reduction with recompute from cached inputs: This issue proposes optimizing persistent reduction in normalization kernels by caching inputs as float16 and converting to float32 only during computation to reduce register pressure, and by using shared memory with TMA for larger hidden sizes to improve performance. It includes preliminary results showing significant bandwidth improvements with these methods and suggests integrating heuristic routing in Inductor's scheduler to select the appropriate kernel based on reduction dimension.
- The comments discuss implementing heuristic routing in the scheduler to choose between different kernel strategies based on input size, provide example code for the register-tiled path, confirm the use of Gluon for TMA shared memory caching, and conclude with agreement that the proposed optimizations are effective and the issue can be closed.
- Number of comments this week: 5
- torch.compile degrades the performance compared with eager execution: This issue reports that using torch.compile to optimize a token entropy calculation function results in significantly worse performance compared to eager execution when the batch size is small (1 to 64), with improvements only appearing at larger batch sizes like 512. The user suspects that the compilation introduces overhead that slows down execution in low-latency scenarios, and provides benchmark results demonstrating this behavior across different compilation modes.
- The comments suggest trying to disable split reductions via torch._inductor.config.split_reductions = False to improve performance, with one user confirming that this setting alleviates the issue for the default compilation mode but not for others; another explains that overhead from CUDA graph usage in certain modes causes the slowdown, and a final suggestion is made to try the max-autotune-no-cudagraphs mode to potentially mitigate the problem.
- Number of comments this week: 4
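A minimal sketch of the suggested workaround, shown as a configuration fragment. The entropy function below is illustrative, not the reporter's exact code; only the config flag comes from the discussion:

```python
import torch

# Workaround suggested in the comments: disable Inductor's split
# reductions before compiling (reported to help the default mode only).
torch._inductor.config.split_reductions = False


@torch.compile
def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Illustrative token-entropy computation, not the reporter's code.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)
```

Whether this helps is workload- and mode-dependent; per the discussion it did not resolve the slowdown for CUDA-graph-based modes.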
- [HIGH PRIORITY] [MODULE: CORRECTNESS (SILENT)] [MODULE: REDUCTIONS] [ONCALL: PT2] [MODULE: INDUCTOR] [TOPIC: FUZZER] [BOT-TRIAGED] torch.compile produces significantly different output for model with multiple amax/amin reductions on same tensor compared to eager mode: This issue reports that using torch.compile with the Inductor backend on a CNN model performing multiple reduction operations on the same tensor results in significantly different outputs compared to eager mode, with differences large enough to indicate an algorithmic error rather than floating-point precision drift. The problem is linked to Inductor's reuse_partial optimization, which attempts to reuse partial reduction results but appears to incorrectly handle fused reduction kernels when multiple reductions with overlapping dimensions are involved, causing large mismatches in the model's output.
- The comments suggest disabling aggressive reduction splitting/fusion in Inductor as a workaround, explain that near-zero values in the denominator amplify small differences, and recommend increasing the epsilon or clamping the denominator to stabilize computations; additional advice includes using a compiler fallback flag and restarting the environment to resolve transient issues.
- Number of comments this week: 3
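The two effects the commenters describe can be shown in plain Python: floating-point addition is not associative, so a compiler that splits or reorders a reduction can legitimately produce different bits than eager mode, and a near-zero denominator then amplifies the tiny difference:

```python
# Floating-point addition is not associative, so changing the order in
# which a reduction accumulates its terms changes the rounded result.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed when added to 1e16
b = (1e16 - 1e16) + 1.0   # cancellation happens first, so 1.0 survives
print(a, b)               # 0.0 1.0

# A near-zero denominator then amplifies such tiny differences,
# turning a 1-ulp-scale discrepancy into a large output mismatch.
eps = 1e-12
print(1.0 / (a + eps), 1.0 / (b + eps))
```

This is also why the suggested mitigations are to clamp the denominator or increase epsilon rather than to expect bit-identical reductions.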
- [FEATURE] [TRIAGED] [MODULE: FX] [ONCALL: PT2] [MODULE: DYNAMO] [BOT-TRIAGED] Call Hierarchy Metadata: This issue proposes adding a new metadata field called node.meta["call_hierarchy"] to FX nodes, which will capture the complete, interleaved module and function call chain responsible for producing each operation during Dynamo tracing. This enhancement aims to provide detailed provenance information that is not currently available in existing metadata fields, improving traceability and debugging capabilities.
- The comments include a mention of related discussions on provenance information in a recent meeting, a request for a link to the prototype implementation, and the provision of a pull request URL containing the prototype.
- Number of comments this week: 3
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 192
Summarized Issues:
- torch.compile numerical discrepancies and fusion issues: Multiple issues report that torch.compile with the Inductor backend produces numerical discrepancies compared to eager mode due to changes in floating-point accumulation order and fusion optimizations. These discrepancies affect various operations including mm + bias fusion, batch pointwise math ops, quantization pipelines, batched GEMM fusion, and Transformer attention patterns, leading to compounded errors and systematic output differences.
- issues/179520, issues/179561, issues/179562, issues/179567, issues/179568, issues/179569, issues/179570, issues/179574, issues/179575, issues/179577, issues/179578
- torch.compile runtime errors and optimization bugs: Several issues describe runtime errors and bugs in torch.compile with the Inductor backend caused by incorrect optimization passes or graph lowering failures. Problems include shape mismatch errors from redundant view removals, failures in handling multiple consumers of cumsum results, crashes on heterogeneous torch.randint calls, and assertion errors in FFT operations, all while the models run correctly in eager mode.
- issues/179502, issues/179510, issues/179534, issues/179571, issues/179573, issues/179576, issues/179807, issues/179883
- ONNX export and model serialization issues: There are issues with exporting PyTorch models to ONNX format and serializing opaque type constants. Problems include invalid ONNX models generated when reusing Conv2d+BatchNorm2d blocks with optimization enabled, failures exporting models with preprocessing steps due to symbolic dimension handling, and runtime errors when saving or loading exported programs containing opaque constants.
- issues/179559, issues/179709, issues/179870
- Numerical discrepancies between CPU and CUDA implementations: Multiple issues report significant numerical differences between CPU and CUDA outputs for various functions such as torch.special.erfcx, torch.renorm, torch.linalg.eigvalsh, and log determinant calculations. These discrepancies can impact numerical stability and reproducibility in neural network computations.
- issues/179785, issues/179787, issues/179789, issues/180105
- MPS backend numerical and gradient bugs: The MPS backend exhibits numerical correctness bugs including negative values from avg_pool1d despite non-negative inputs due to precision loss, and incorrect or zero gradients in backward passes for scalarized reductions, indicating issues in the MPS Metal shader and autograd implementations.
- issues/179608, issues/180201
- Disabled tests and CI failures on ROCm platform: A large number of tests have been disabled on the ROCm platform due to consistent failures, hangs, or flaky behavior across various test suites including InlineAsmElementwise, ComboKernel, DistTensorRandomOpCompileTest, TestCudaAllocator, and others. These widespread test disables indicate ongoing stability and compatibility challenges with ROCm in the main branch.
- issues/179925, issues/179927, issues/179938, issues/179939, issues/179940, issues/179942, issues/179943, issues/179944, issues/179945, issues/179946, issues/179947, issues/179948, issues/179949, issues/179950, issues/179951, issues/179952, issues/179953, issues/179954, issues/179955, issues/179956, issues/179957, issues/179958, issues/179959, issues/179960, issues/179961, issues/179962, issues/179963, issues/179964, issues/179965, issues/179967, issues/179968, issues/179970, issues/179973, issues/179976, issues/179977, issues/179979, issues/179981, issues/179982, issues/179983, issues/179984, issues/179985, issues/179986, issues/179987, issues/180006, issues/180009, issues/180010, issues/180011, issues/180012, issues/180013, issues/180014, issues/180015, issues/180016, issues/180017, issues/180018, issues/180019, issues/180021, issues/180022, issues/180023, issues/180024, issues/180025, issues/180026, issues/180027, issues/180028, issues/180029, issues/180030, issues/180031, issues/180032, issues/180033, issues/180034, issues/180035, issues/180036, issues/180037, issues/180038, issues/180039, issues/180040, issues/180041, issues/180042, issues/180043, issues/180044, issues/180045, issues/180046, issues/180047, issues/180048, issues/180049, issues/180051, issues/180053, issues/180054, issues/180055, issues/180056, issues/180057, issues/180058, issues/180059, issues/180060, issues/180061, issues/180062, issues/180063, issues/180064, issues/180065, issues/180066, issues/180067, issues/180069, issues/180071, issues/180072, issues/180073, issues/180074, issues/180075, issues/180076, issues/180077, issues/180078
- Precision and numerical accuracy issues on GPU: The GPU implementations of certain operations such as float16 cumulative sum and two-pass variance calculation exhibit significant precision loss or overflow compared to CPU implementations. This results in accuracy degradation by factors of 11-15 and overflow errors due to intermediate squared values exceeding float32 limits.
- issues/180150, issues/180156
- Random number generation and DTensor bugs: DTensor has a bug where the CPU RNG state does not advance on CUDA meshes, causing infinite loops in truncated normal initialization. Additionally, torch.cuda.Stream objects are traced incorrectly by torch.compile, leading to segmentation faults when using non-default CUDA streams.
- issues/180088, issues/180179
- Distributed and collective operation issues: There are bugs related to distributed operations including a race condition in ProcessGroupGloo causing memory corruption, failures in tracing distributed all_reduce due to ProcessGroup serialization issues, and synchronous behavior of all_to_all collectives under torch.compile preventing expected asynchronous overlap.
- issues/179848, issues/179858, issues/179922
- Performance regressions and optimization proposals: Some issues report performance regressions such as slower small-M FP8 fused scaled matmul on SM100 hardware and performance degradation of compiled Token Entropy calculations at small batch sizes. Proposals include optimizing persistent reduction kernels and adding XPUGraph Trees for Intel GPUs to reduce host overhead.
- issues/179697, issues/179711, issues/179770, issues/180168
- Documentation and API improvements: Proposals include adding a new node.meta["call_hierarchy"] field to FX nodes for better provenance tracking, clarifying the documentation of torch.Tensor.index_add_ regarding the index parameter, and adding a stable C API function to obtain AtenTensorHandle from Python tensors.
- issues/179643, issues/180107, issues/180119
- Hardware and platform-specific issues: There are platform-specific problems such as segmentation faults on Intel ARC PRO B70 GPUs due to missing SYCL device info, ROCm packaging errors causing missing TensileLibrary.dat files, and CUDA runtime errors on Windows with RTX 5080 GPUs due to missing kernel images.
- issues/179865, issues/179891, issues/180101
- Miscellaneous bugs and test infrastructure issues: Other issues include flaky test failures due to nondeterministic error ordering, CI jobs not failing despite test failures, and disabled tests due to profiler-dependent failures on aarch64 architecture.
- issues/179703, issues/179723, issues/180002
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 36
Summarized Issues:
- Build and Compilation Failures: Several issues report build or compilation failures across different platforms and configurations. These include Windows build failures with CUDA Toolkit versions above 13.0, ROCm nightly build breakages due to C++20 enforcement and missing dependencies, and segmentation faults on GPU inference caused by outdated GCC and CMake versions.
- Numerical Accuracy and Stability Discrepancies Between CPU and GPU: Multiple issues highlight significant numerical mismatches and stability problems when running operations on GPU compared to CPU. These include discrepancies in NaN counts, cumulative sum/product accuracy, least squares solver results, and conversions of NaN/Inf values to integers, indicating inconsistent precision and rounding errors across devices.
- PyTorch Test Failures and Disabled Tests on ROCm Platform: Several tests have been disabled or fail consistently on the ROCm platform, affecting continuous integration and main branch stability. These include tests related to distributed processing, autotuning, dynamic shapes, and numerical backends, indicating platform-specific reliability issues.
- TorchScript and Serialization Errors: Issues have been reported involving crashes and errors during model serialization and deserialization. These include segmentation faults from corrupted .pt files and serialization failures due to unsupported tensor data types like torch.uint32, highlighting robustness and compatibility problems.
- Torch.compile and Triton Kernel Compilation Bugs: There are bugs affecting torch.compile and Triton kernel compilation, such as conflicts in operator handlers causing crashes, improper handling of multi-buffer epilogues leading to compilation errors, and failures tracing numpy.flat attributes on FakeTensor. These issues cause runtime errors and compilation failures in optimized code paths.
- issues/179233, issues/179418, issues/179582
- Environment and Test Setup Issues Causing Failures: Some tests fail due to improper environment variable management or stale state between tests. For example, environment variables set during test setup are not cleared, causing named pipe errors in subsequent tests, which points to test suite hygiene problems.
- Hash Mismatch and Package Integrity Problems: Users encountered hash mismatch errors when installing PyTorch binaries via pip on Linux, caused by inconsistencies between mirrors. These issues were resolved by verifying and syncing package binaries to ensure consistent installation experiences.
- Precision Loss in Mixed-Type Tensor Operations: A bug causes silent precision loss when adding int32 tensors to bfloat16 tensors due to type promotion rules, resulting in incorrect rounding of integer values above 256. This suggests a need to revise promotion rules to prevent unintended data corruption.
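The mechanism behind that silent loss can be shown without PyTorch: bfloat16 keeps only 8 significand bits, so not every integer above 256 is representable. The helper below is a pure-Python stand-in that rounds a float to bfloat16 and back (round-to-nearest-even); it is not PyTorch's implementation, just an illustration of the precision limit:

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 (round-to-nearest-even) and back to float.

    A pure-Python stand-in: bfloat16 is the top 16 bits of a float32,
    so we pack to float32, round away the low 16 bits, and unpack.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = bits & 0xFFFF          # the 16 bits bfloat16 discards
    upper = bits >> 16             # sign, exponent, top 7 mantissa bits
    # round-to-nearest-even on the truncated half
    if lower > 0x8000 or (lower == 0x8000 and upper & 1):
        upper += 1
    return struct.unpack("<f", struct.pack("<I", upper << 16))[0]


print(to_bfloat16(256.0))  # 256.0: still exactly representable
print(to_bfloat16(257.0))  # 256.0: silently rounded down
```

An int32 tensor promoted to bfloat16 before the add goes through exactly this kind of rounding, which is why integer values above 256 can change silently.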
- ROCm Profiling and Symbol Presence Questions: There is uncertainty about the addition of the "rocprofiler_configure" symbol in the libtorch_cpu.so library, with users seeking clarification on whether it relates to ROCm profiling features.
- PyTorch API and Documentation Issues: Incorrect type annotations in stub files for torch.cat and torch.stack functions have been reported, along with criticism of contribution guidelines and issue templates for being discouraging and poorly written, indicating documentation and developer experience concerns.
- Numerical Function Bugs Producing Incorrect Outputs: The Softplus function returns infinite values instead of finite outputs when given extremely large beta and infinite threshold parameters, indicating a numerical stability bug affecting both CPU and GPU implementations.
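Softplus normally stays finite because of a threshold guard: per the documented formula, softplus(x) = (1/β)·log(1 + exp(βx)), reverting to the linear function x once βx exceeds the threshold. A pure-Python sketch of that guard shows why an effectively infinite threshold removes the protection:

```python
import math


def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    # Documented formula: (1/beta) * log(1 + exp(beta * x)), reverting
    # to the linear function x once beta * x exceeds the threshold.
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta


print(softplus(1000.0))  # 1000.0: the linear branch avoids exp overflow
# With an effectively infinite threshold the guard never fires, so
# exp(beta * x) overflows -- the instability the issue describes.
```

This sketch uses Python's math module, not the actual CPU/GPU kernels, but the guard structure is the same idea the issue's parameter combination defeats.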
- Community and Contribution Workflow Discussions: There is an ongoing discussion about best practices for the contribution workflow, including branch naming, pull request reviews, and testing requirements, aiming to improve project collaboration and code quality.
- Miscellaneous Issues and Feature Suggestions: Some issues include milestone celebration proposals and reports of inconsistent segmentation masks in industrial applications, reflecting community engagement and real-world application challenges.
- Workspace Size and Algorithm Selection Problems: MIOpen Gemm convolution solvers return a workspace size of zero on the gfx1150 GPU when used via PyTorch's ATen convolution path, causing these solvers to be excluded silently from algorithm selection and resulting in fallback to slower convolution methods.
- Checkpoint and Memory History System Errors: A SystemError occurs in PyTorch 2.11 when using compiled checkpointed models with CUDA memory history recording enabled, causing built-in methods to return NULL without exceptions during backward passes, despite working in earlier versions.
- Runtime Errors in FX and Dynamo Optimizations: The skip_nested_compile context manager does not apply properly during checkpoint callbacks, causing runtime errors when symbolically tracing dynamo-optimized functions due to missing decorator application on rematerialization callbacks.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 282
Key Open Pull Requests
1. [dynamo] Cache slice sizes to avoid unrelated unbacked symbols: This pull request addresses an issue in PyTorch's Dynamo where slicing with unbacked symbolic integer indices caused the creation of unrelated unbacked symbols for slice sizes, leading to equality check failures, by introducing a cache in ShapeEnv that reuses slice size symbols keyed by symbolic expressions to ensure consistent and deterministic slice size computation, thereby fixing a graph break in HuggingFace Aria's sequential_experts_gemm (MoE pattern).
- URL: pull/179635
- Associated Commits: 20a61, c6c48, 92eeb, 56073, fd164, 6f694, 3ee71, 23343, 25c0b, 00787, b84a9, 8b427, fb7a9, 05eb4, 3be85, d15b1, bdd26, 811ea, 0c337, 9ba1d, 3e8f1
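The caching idea can be sketched in a few lines. This is a hypothetical illustration, not the real ShapeEnv API: size symbols are keyed by their defining symbolic expression, so two identical slices reuse one symbol and equality checks succeed instead of comparing two unrelated unbacked symbols:

```python
# Hypothetical sketch of the caching idea (not PyTorch's ShapeEnv API):
# key newly minted size symbols by their defining symbolic expression
# so repeated identical slices share one symbol.
class SliceSizeSymbolCache:
    def __init__(self) -> None:
        self._by_expr: dict[str, str] = {}

    def symbol_for(self, expr: str) -> str:
        # Reuse the cached symbol for a previously seen expression;
        # otherwise mint a fresh one deterministically.
        if expr not in self._by_expr:
            self._by_expr[expr] = f"u{len(self._by_expr)}"
        return self._by_expr[expr]


cache = SliceSizeSymbolCache()
s1 = cache.symbol_for("Min(u0, 128)")  # size of x[:u0] the first time
s2 = cache.symbol_for("Min(u0, 128)")  # the same slice seen again
print(s1 == s2)  # True: the equality check no longer fails
```

Keying by expression rather than by call site is what makes the computation deterministic across repeated traces of the same slice.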
2. [OSDC] Enable linux-docs on OSDC pull: This pull request adds support for the OSDC (ARC) runner to the documentation build configuration by enabling a build-docs-osdc job in _docs.yml that uses a matrix for C++ and Python documentation types and runs directly in the container image instead of using docker-in-docker, while preserving and gating the existing EC2 build path.
- URL: pull/179994
- Associated Commits: 0ac48, f92ce, 2f906, daf0f, 68b8a, 05b3d, 90f09, 223f8, aa628, 44283, 5bb7b, 60925, 7a0bc, 6ac0f, 966a6, a029a, 7c4db, 45717
3. [CI] Fix flaky SIGPIPE in backwards_compat test: This pull request fixes a flaky SIGPIPE error in the backwards_compat test by replacing a pipe-based grep command with a here-string to avoid a race condition caused by the -o pipefail setting in the GitHub Actions shell, ensuring reliable detection of the canary module in test output.
- URL: pull/179999
- Associated Commits: 96e77, c1a07, 209d3, ec800, 63ad8, 4c421, e7716, 8101d, e2a8e, 241fd, a6f95, 52e36, 15470, ca98a, e8c0d, 5665a, 009e1
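The pipe-versus-here-string distinction is worth a sketch. Under `set -o pipefail`, a `grep -q` that exits as soon as it matches can leave the upstream writer with a SIGPIPE, failing the whole pipeline spuriously; a here-string avoids the pipe entirely. The string and pattern below are illustrative, not the test's actual output:

```shell
#!/usr/bin/env bash
set -o pipefail

out="... canary module loaded ..."

# Pipe form: grep -q may exit before the writer finishes, and under
# pipefail the writer's SIGPIPE failure fails the whole pipeline.
#   printf '%s\n' "$out" | grep -q canary

# Here-string form: no pipe, so no SIGPIPE race.
if grep -q canary <<< "$out"; then
    echo "canary found"
fi
```

The race is intermittent (it depends on scheduling), which is exactly why the original pipe form showed up as flakiness rather than a consistent failure.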
Other Open Pull Requests
- Dynamo tracing and graph break fixes: Multiple pull requests improve the Dynamo tracing process and fix graph breaks in the HuggingFace VitsModel by making functions like expect_true traceable during compilation and adding constant-folding for importlib functions. These changes enable runtime assertions to be emitted into the graph and treat importlib functions as SkipFunctionVariable, enhancing graph stability.
- Inductor eager tests and Intel GPU support: Several pull requests enable and port inductor unittest tests to run on Intel GPUs by activating specific test suites and defining expected failures for XPU devices. This effort improves test coverage and compatibility for inductor compiled regions on Intel hardware.
- Inductor CUDA graph and kernel annotation enhancements: Pull requests introduce kernel annotations into CUDA graph trees and add a CUDAGraphPolicy base class to Inductor's post_compile pipeline. These changes enable customizable cudagraph wrapping and automatic processing of kernel scopes during CUDA graph capture, supporting advanced use cases and improving graph management.
- Autograd and memory management improvements: Updates include replacing std::shared_ptr<Node> with c10::intrusive_ptr<Node> for better memory management and implementing thread-safe Python wrapping for autograd Node objects. These changes enhance memory handling and thread safety within the autograd system.
- FSDP collective bucketing optimization: A pull request introduces a pre-bucketing strategy for Fully Sharded Data Parallel collectives to improve bucketing efficiency by considering process group bandwidth and latency. This resolves issues caused by extra dependencies that previously hindered effective bucketing in overlap scheduling.
- Debugging tools for CUDA stream and ROCm CI: Pull requests add a context manager to warn about work enqueued on the NULL CUDA stream and introduce debug instrumentation in the ROCm CI pipeline to diagnose incorrect pytest exit codes. These tools help detect and debug CUDA graph capture errors and nondeterministic test failures.
- Dynamo code refactoring and typing improvements: Refactoring efforts include splitting large test files into smaller modules and adding missing type annotations across the torch._dynamo module. These changes improve code readability, maintainability, and typing consistency without affecting runtime behavior.
- OpaqueGenerator replacement and FX integration: One pull request replaces the OpaqueGenerator Python wrapper by making torch._C.Generator a proper opaque reference type with enhanced C++ support and FX codegen integration. This eliminates the wrapper and enables direct registration and seamless flow through FX infrastructure.
- FakeTensor higher order operations registration: A pull request registers higher order operations with the Fake dispatch key in the C++ faketensor implementation and tests them by invoking internal operations directly. This is necessary due to the current lack of integration between the compile function and the C++ faketensor.
- Continuous integration and documentation build fixes: Pull requests introduce a new CI workflow with thread sanitizer enabled and fix Python documentation build hangs by applying multiple optimizations. These improvements reduce build times and detect data races more effectively.
- AOT Inductor debugging and code generation enhancements: A pull request adds a dump_python_module flag to AOT single-pass compilation mode, enabling output of Python compiled modules with Triton kernel source and execution graph info. This facilitates improved debugging and profiling with kernel name consistency for Kineto trace correlation.
- OrderedDict refactoring in Dynamo: One pull request refactors the Dynamo component by introducing OrderedDict as a subclass of UserDefinedDictVariable to better resemble CPython structure and support user-defined object variable-like cases.
- DTensor sharding propagation fix: A pull request fixes sharding propagation for Split(Flatten) operations in DTensor by correctly handling non-first flatten dimensions and adding checks to prevent invalid shard configurations. This ensures accurate tracking of input dimensions and avoids silent incorrect results.
- Dynamo getitem_const self-validation: One pull request makes the getitem_const function in Dynamo self-validating to ensure its behavior matches CPython's subscript functions.
- Multi-grad hook and backward debugging: A pull request introduces register_multi_grad_hook for leaf functions and adds debug print capabilities for backward passes and multi-tensor operations to enhance gradient computation tracing.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 291
Key Closed Pull Requests
1. [OSDC] Enable bazel build+test on OSDC pull: This pull request proposes adding support for the OSDC (ARC) runner in the _bazel-build-test.yml configuration by introducing a build-and-test-osdc job that runs directly in the container image with an inline filter, following the pattern of the existing Linux build setup, while preserving and gating the EC2 build path based on the use-arc input.
- URL: pull/179969
- Associated Commits: ce88f, 69507, 3ef4a, 9472f, de1cf, c1315, 65328, 15464, b9517, 217e4, b4471, 40cd4, 3bc55, 4fc63, 0b016, 8e9ab
2. [dynamo] Support copy.deepcopy via polyfill: This pull request proposes adding support for the copy.deepcopy function in the Dynamo component of the PyTorch project by implementing a polyfill to enable this functionality.
- URL: pull/179611
- Associated Commits: cf24d, 75631, 482cc, 552fd, da48b, 38807, 3b17d, 7fe85, c2172, 2e49b, 5e1e8, 592c1
3. [dynamo] Improve data-dependent branching error messages: This pull request improves error messages for data-dependent branching in Dynamo by showing the FX graph computation chain leading to the branch condition with user source locations, adding hints to use plain Python integers instead of scalar integer tensor buffers to avoid unsupported dynamic control flow, and fixing stack trace population on FX nodes to enable better debugging.
- URL: pull/179640
- Associated Commits: 9c71c, a2f04, 59e5b, 69777, 1088f, 18a7a, c86ea, 298b8, b2155, add01, ee33a, 2a411
Other Closed Pull Requests
- CI Workflow and Build Optimization: This topic covers the introduction of the Runway CI workflow that enables GPU-accelerated build and test tasks for PyTorch using parameterized self-hosted runners. It also includes a periodic build snapshot workflow for fast incremental rebuilds via cached artifacts on S3, enhancing security and usability by allowing custom GitHub tokens and Bedrock OIDC authentication without Anthropic API keys.
- Dynamo Bug Fixes and Refactoring: Multiple pull requests address Dynamo improvements including fixing f-string mutation ordering by eagerly formatting supported Python values, removing special casing for namedtuple and struct sequence objects, and refactoring set tracking by decoupling SetVariable from dictionary tracking. These changes improve Python semantics, tracing consistency, and align Dynamo behavior more closely with CPython.
- Autograd Cache Key Enhancements: Several pull requests add and improve the autograd_cache_key function across different modules, including compile_fx and standalone_compile, with tests verifying consistency and handling of multiple outputs. Additional work extracts prepare_aot_config to optimize cache key computation and rejects unsupported graph shapes to prevent incorrect keys.
- Autograd Backward Node Tagging and Rematerialization: This topic introduces a _patch_autograd_grad context manager that patches autograd.grad to preserve stacktraces and automatically tag backward nodes, improving rematerialization pass detection. It also enforces user annotations when multiple backward regions are ambiguous and migrates existing tests accordingly.
- PyTorch Profiler Metadata Improvements: Multiple pull requests enhance the PyTorch profiler by adding CPU operation and Python function event metadata, including tensor strides, dtypes, shapes for TensorList arguments, and Python module/function launch event details. These changes improve profiling detail while maintaining backward compatibility.
- Callable Object Handling in Dynamo: Several pull requests implement and improve callable detection and invocation in Dynamo by adding a python_type() method to VariableTracker classes, introducing is_callable() and generic_call() methods, and mirroring CPython's PyCallable_Check and PyObject_Call mechanisms. This work replaces hardcoded type checks with a unified callable interface and improves error handling.
- Test Suite Stability and Coverage Enhancements: This topic includes marking specific inductor/test_flex_flash cases as expected failures due to unsupported features and adding a test to ensure SAC ignored operation annotation does not leak. It also covers enabling full AArch64 unit testing by removing custom scripts and adding multi-shard test jobs for m8g and periodic tests for m7g.
- Batch Matrix Multiplication Native API Implementation: This pull request introduces a native API for batch matrix multiplication outer product operations, addressing limitations of the existing SummaryPT approach. It explores performance improvements with autotuning and provides extensive correctness testing across data types, tensor layouts, and shapes compared to cuBLAS.
- Output Handling and Module Splitting: This pull request adds a tuple_return option to the split_module function to ensure submodule outputs are consistently wrapped in tuples, complying with the compile_fx convention. It also simplifies output node generation by removing unnecessary handling of empty tuples for zero-output partitions.
- Set Tracking Refactor in Dynamo: This pull request refactors the SetVariable class to inherit directly from VariableTracker instead of ConstDictVariable, fully decoupling set tracking from dictionary tracking. It reorganizes related classes and functions into a new sets.py module and updates type checks and imports accordingly.
- Higher Order Derivatives for grid_sample: This pull request attempts to implement higher order derivatives for the bilinear case in the grid_sample function, aiming to generalize the approach to higher dimensional inputs while addressing various bugs and code improvements.
- Cache Key Equivalence Testing: This pull request adds a cache key equivalence mixin to the AOTAutogradCacheTests to improve testing of cache key behavior in PyTorch.
- Redirect to Claude Updates: This pull request consists of a series of updates marked with "[ghstack-poisoned]" commits related to redirecting to Claude.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| bobrenjc93 | 646 | 86 | 0 | 4 |
| anijain2305 | 319 | 33 | 0 | 9 |
| malfet | 170 | 22 | 4 | 11 |
| huydhn | 170 | 12 | 1 | 5 |
| mlazos | 148 | 7 | 0 | 0 |
| frgossen | 130 | 20 | 0 | 2 |
| yangw-dev | 142 | 2 | 0 | 1 |
| weifengpy | 128 | 11 | 1 | 3 |
| Skylion007 | 38 | 6 | 0 | 89 |
| aorenste | 84 | 28 | 0 | 19 |
Access Last Week's Newsletter: