Weekly GitHub Report for PyTorch: February 01, 2026 - February 08, 2026 (15:57:33)
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [ONCALL: DISTRIBUTED] [ONCALL: PT2] [BOT-TRIAGED] [BOT-MISLABELED] torch.compile doesn't trace dist.all_reduce output correctly: This issue reports a bug where torch.compile incorrectly traces the output of dist.all_reduce when called with async_op=False, causing it to return a torch.Tensor instead of None as expected, which leads to an AttributeError when attempting to call .wait() on the result. The problem occurs because the compiled version treats the output as an asynchronous handle, conflicting with the eager behavior, and the discussion reveals that asynchronous operations currently cause graph breaks and are not fully supported in compiled mode, with suggestions to branch logic based on compilation state or use the functional collective API consistently. (A hedged repro sketch appears after this list.)
  - The comments clarify that asynchronous all_reduce is not supported with torch.compile due to graph breaks, explain how Dynamo rewrites the collective calls to return an AsyncCollectiveTensor, and discuss the handling of wait_tensor calls in different compilation backends, noting that some passes move wait_tensor closer to its consumer to optimize execution.
  - Number of comments this week: 7
- [TRIAGE REVIEW] [MODULE: MEMORY FORMAT] [MODULE: CORRECTNESS (SILENT)] [MODULE: POOLING] [MODULE: NORMS AND NORMALIZATION] [MODULE: MPS] [BOT-TRIAGED] [MPS] BatchNorm2d/avg_pool2d produce wrong results for channels_last tensors with storage_offset > 0: This issue reports a bug where BatchNorm2d and avg_pool2d produce incorrect results on the MPS backend when tensors use the channels_last memory format and have a storage_offset greater than zero, indicating a problem with how the offset is handled in these operations. The problem does not occur when the tensor is cloned (which resets the offset), but persists when making the tensor contiguous in channels_last format, suggesting that the bug is related to kernel address calculations or graph caching that ignores the storage_offset. (A setup sketch appears after this list.)
  - The comments discuss the likely cause being either incorrect handling of storage_offset in kernel address math or reuse of cached kernels without accounting for offset, with tests confirming that cloning fixes the issue while making the tensor contiguous does not; the conversation also briefly diverges into a lighthearted exchange about the nature of the diagnostic comments.
  - Number of comments this week: 6
- [TRIAGE REVIEW] [MODULE: NN] [MODULE: CORRECTNESS (SILENT)] [MODULE: MPS] [BOT-TRIAGED] [MPS] Incorrect grid_sample outputs: NHWC kernel correctness + missing memory format in kernel caching keys: This issue reports correctness problems with the torch.nn.functional.grid_sample function on the MPS backend, identifying two separate bugs: one related to the NHWC kernel path and another involving missing memory format keys in kernel caching. The reporter provides a minimal reproducible example demonstrating these bugs and discusses ongoing efforts to isolate additional MPS kernel issues affecting other operations like BatchNorm2d.
  - The comments focus on acknowledging the complexity of the problem, sharing a related pull request that fixes the primary grid_sample bug, and discussing plans to test and address further MPS kernel issues in more complex workflows, highlighting the iterative nature of debugging and fixing these backend problems.
  - Number of comments this week: 5
- [ONCALL: DISTRIBUTED] [MODULE: SYMM_MEM] [BOT-TRIAGED] "CUDA driver error: invalid device ordinal" when calling symm_mem.rendezvous: This issue reports a runtime error "CUDA driver error: invalid device ordinal" encountered when calling symm_mem.rendezvous in a multi-GPU setup using PyTorch's symmetric memory API, despite the user verifying correct GPU ordinals and successful NCCL initialization. The error appears related to hardware or driver limitations, possibly involving the lack of NVLink connectivity between GPUs, which may cause CUDA to reject memory access permissions during the rendezvous operation.
  - The comments include a reproduction attempt on a different environment that did not encounter the issue, a suggestion that the error might be due to RTX 2080 Ti hardware limitations, a user confirming symm_mem works on RTX 2080 Ti with NVLink, and a discussion pointing to the CUDA driver call failing due to missing hardware access paths, with a request to improve the error message for clarity.
  - Number of comments this week: 4
- [HIGH PRIORITY] [TRIAGE REVIEW] [ONCALL: PT2] [MODULE: INDUCTOR] [inductor] A regression bug: argmax outputs wrong when working on transposed and mutated matrix: This issue reports a regression bug in PyTorch where the argmax function produces incorrect results when applied to a transposed and mutated matrix, specifically in the context of the inductor backend. The problem reoccurs in version 2.10.0 and a recent development build, with the original fix only addressing the CUDA backend, leaving the CPU backend still affected.
  - The comments discuss confirming the original fix's limitation to the CUDA backend, provide a test case for CUDA, and highlight the need to extend the fix to the CPU backend, with contributors agreeing on the current status and next steps.
  - Number of comments this week: 4
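As a rough illustration of the first issue above, the following minimal sketch shows the pattern being discussed; it assumes an already-initialized process group and a CUDA device, and sync_reduce is a hypothetical helper rather than the reporter's exact reproducer.

```python
# Hypothetical sketch: with async_op=False, eager dist.all_reduce returns None,
# so the wait() branch is never taken; the issue reports that under torch.compile
# the traced call instead returns a tensor-like handle, and .wait() then fails.
import torch
import torch.distributed as dist

def sync_reduce(t):
    work = dist.all_reduce(t, async_op=False)
    if work is not None:   # eager: work is None; compiled (per the report): it is not
        work.wait()
    return t

# compiled = torch.compile(sync_reduce)
# compiled(torch.ones(4, device="cuda"))
```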
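For the BatchNorm2d/avg_pool2d issue above, a hedged sketch of the reported setup looks roughly like this (it assumes an MPS-capable machine; the tensor sizes are illustrative and not taken from the issue).

```python
# Sketch: slicing a channels_last tensor yields a view with storage_offset() > 0,
# while clone() produces an offset-zero copy that the issue says behaves correctly.
import torch

base = torch.randn(2, 4, 8, 8, device="mps").to(memory_format=torch.channels_last)
view = base[1:]        # storage_offset() > 0, still channels_last
ref = view.clone()     # clone resets the storage offset

bn = torch.nn.BatchNorm2d(4, device="mps").eval()
# The issue reports these disagree on MPS even though they should match.
print(torch.allclose(bn(view).cpu(), bn(ref).cpu()))
```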
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 80
Summarized Issues:
- Precision and Mixed Precision Issues: Several issues report numerical inaccuracies and dtype mismatches related to precision handling in PyTorch. These include a significant numerical divergence in GPT2 Word Position Embedding between float32 and float16 on CUDA, Flex Attention failing with mixed precision due to float and BFloat16 dtype mismatch, and mixed precision training overflow causing NaNs in validation loss with pre-trained models like ResNet18.
- TorchInductor Backend Incorrectness and Crashes: Multiple issues describe incorrect results, assertion failures, or crashes when using the TorchInductor backend with torch.compile. Problems include incorrect argmax indices for boolean tensors, advanced indexing with duplicate indices producing wrong results, dropout on transposed tensors causing mismatches, copysign and remainder operations yielding inconsistent outputs, and crashes during lowering or functionalization fallback.
- Regression and Inconsistent Behavior in Compilation and Dynamo: Several issues report regressions or inconsistent behavior in torch.compile or Dynamo, including failures to handle overridden Python magic methods, incorrect tracing of dist.all_reduce outputs, and errors treating user-defined objects as constants during JIT compilation.
- Memory Leaks and Allocation Issues: There are reports of memory leaks and allocation problems, such as continuous memory growth when using torch.compile, GPU memory not being deallocated during profiling, and the NVIDIA DGX platform hanging on large unified memory allocations instead of raising OOM errors.
- Backend and Hardware Compatibility Problems: Issues include CUDA driver errors with invalid device ordinals in multi-GPU setups, lack of support for NVIDIA RTX 5050 GPUs causing kernel image errors, and memory access violations on AMD Instinct GPUs during FP16 triangular operations.
- Test Failures and Disabled Tests: Multiple tests are failing or disabled across various platforms and modules, including ROCm test failures for cross entropy loss, disabled sparse multiplication tests, and Triton integration test failures after updates.
- Autograd and Gradient Computation Bugs: Issues include custom autograd Functions incorrectly computing gradients during partial backward passes, backward pass errors in LayerNorm with dynamic shapes, and requests for fallback support for undefined higher-order backward passes in composite custom operations.
- Model Export and ONNX Issues: Problems arise when exporting models, such as Alpamayo with Qwen3-VL backbone failing due to unregistered active modes, and BiLSTM ONNX export failing with dynamic time dimension conflicts.
- Operator Fusion and Performance Debugging: There is a request for guidance on preventing operator fusion in TorchInductor to debug slow fused operators, indicating challenges in controlling fusion behavior for performance analysis.
- Numerical and Kernel Implementation Bugs on MPS Backend: Several issues report incorrect outputs or crashes on the MPS backend due to kernel bugs, including grid_sample producing wrong results, BatchNorm2d and avg_pool2d failing with channels_last tensors with storage offsets, and torch.abs overflowing or underflowing for complex inputs.
- Compilation and Runtime Errors with Python Constructs: Bugs include torch.compile crashing on Python try/except blocks due to unhandled AttributeErrors and failures when cloning tensors and applying as_strided with in-place additions causing assertion errors.
- Test Infrastructure and CI Issues: Problems include CI failures due to missing shared libraries, confusion caused by CI outage reports, and proposals to improve test reuse and coverage across hardware backends.
- Error Message Regression Testing Proposals: Multiple issues propose adding module_error_inputs_func for various modules like Linear, Embedding, CrossEntropyLoss, Conv2d, and MaxPool2d to enable regression testing of error messages for edge cases and ensure consistency.
- DTensor and Distributed Tensor Bugs: Issues include DTensor's squeeze_ operation updating metadata but not local tensor shape, and a proposed enhancement to layer_norm strategy to reduce communication overhead by decomposing computations for sharded tensors.
- Hash Map and Data Structure Bugs: A bug in flat_hash_map's find() and emplace() functions causes iterator overruns and crashes due to an off-by-one error, requiring a patch to terminate iteration properly.
- Sorting and Output Order Inconsistencies: The aten.sort function with stable=None produces inconsistent output orders between eager and inductor backends on CUDA, breaking execution consistency.
- Profiling and Memory Tracking Enhancements: Proposals include adding manual synchronization APIs to CUDACachingAllocator to track physical memory usage accurately and improving torch.foreach_copy performance on CUDA by using cudaMemcpyBatchedAsync for mixed data types.
- Compilation Graph and Export Behavior Changes: Changes in torch._dynamo.export behavior from 2.9.1 to 2.10 affect modular FX graph generation and raise questions about stable APIs for preserving call_module calls and handling multiple FX graphs.
- Distributed and NCCL Backend Hangs: A hang occurs in the NCCL backend when using batched asynchronous send/receive with large tensors and many operations per batch in distributed setups.
- Custom Operation Dispatch and FakeTensorMode Bugs: FakeTensorMode fails to dispatch custom operations when an nn.Parameter subclass is used, causing NotImplemented errors despite valid registrations.
- LayerNorm Backward Pass Failures with Dynamic Shapes: Runtime errors occur in LayerNorm backward when using torch.compile with inductor on tensors with symbolic dynamic shapes, due to incorrect workspace slice indices.
- Boolean Operation Bugs in Inductor: Boolean operations involving .data assignments produce incorrect results in the inductor backend compared to eager execution, causing assertion failures.
- Error Message and Exception Handling in Dynamo Tracing: Dynamo tracing incorrectly reports tracing failures for the builtin repr operator instead of raising the expected ValueError, causing confusion between eager and compiled modes.
- Requests for API and Feature Enhancements: Requests include support for runtime dictionary lookups in ConstDictVariable to reduce recompilation, increasing NUM_THREADS for ARM OpenBLAS compilation, and discussions on Stable ABI support for certain C++ APIs.
- DataLoader Pin-Memory Deprecation Warnings: The DataLoader's pin-memory helper passes a deprecated device argument to Tensor.pin_memory(), causing warnings due to internal API changes where the device is now inferred implicitly (a small sketch of the deprecated call appears after this list).
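For the pin-memory item above, the deprecation can be seen by passing a device to Tensor.pin_memory() directly; this small sketch assumes a CUDA-enabled build and is not the DataLoader's internal code path.

```python
# Sketch: passing a device to Tensor.pin_memory() triggers the deprecation warning
# described above; calling it with no argument is the current, non-deprecated form.
import warnings
import torch

t = torch.randn(8)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    t.pin_memory("cuda")                      # deprecated: the device is now inferred
    print([str(w.message) for w in caught])

pinned = t.pin_memory()                       # preferred form
print(pinned.is_pinned())
```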
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 58
Summarized Issues:
- Segmentation Faults in CUDA Operators and AOT Loading: Multiple issues report segmentation faults occurring in CUDA-related operations, including torch._export.aot_load() crashing when loading compiled AOT artifacts with constants, torch.ops.aten.gru crashing with packed sequences, and torch.ops.aten.lstm.data crashing when the batch_sizes tensor is not on CPU. These faults cause abrupt Python process crashes without clear diagnostic messages, indicating critical stability problems in CUDA operator handling. - issues/172739, issues/173623, issues/173944
- Test Disabling Due to Failures on XPU and ROCm Platforms: Several tests have been disabled due to consistent failures on the main branch affecting XPU and ROCm platforms, including test_skip_non_tf32 in SDPAPatternRewriterGpuTests and DynamicTests, test_triton_autotuning_cuda and test_triton_mutated_autotuning_cuda in AOTInductorTestABICompatibleGpu, and multiple SDPA and compile_preserves_metadata_cache tests on ROCm. These disables are temporary measures to maintain CI stability while fixes or skips are prepared. - issues/173336, issues/173352, issues/173619, issues/173620, issues/173712, issues/173713, issues/173714, issues/173715, issues/173717
- Compilation and Runtime Errors in Triton and Inductor Backends: Issues include a Triton kernel test failing with an IndexError due to argument index out of range, assertion errors in Triton-based flex decoding tests caused by LLVM verification failures, and Inductor backend errors such as a TypeError from incorrect CSE class instantiation and incorrect zero gradients in the cumprod backward pass on CUDA. These problems affect kernel generation, autotuning, and model compilation stability. - issues/173795, issues/174306, issues/174311, issues/174016, issues/174094
- Wheel and Packaging Metadata Bugs: There are bugs related to wheel file metadata, including an invalid macOS platform tag cp313-cp313-macosx_110_0_arm64 in version 2.10.0 and a similar issue with the macOS arm64 wheel using macosx_110_0 instead of macosx_11_0. These cause compatibility tools to misidentify the platform and may lead to repeated reinstallations or broken installs. - issues/173462, issues/174265
- Runtime and Memory Errors in Distributed and CUDA Memory APIs: Issues include a RuntimeError triggered by setting TORCH_DISTRIBUTED_DEBUG=DETAIL when accessing backend.mem_allocator with NCCL, a RuntimeError "invalid device ordinal" when allocating symmetric memory across multiple GPUs, and a RuntimeError from torch.cuda.memory_snapshot() due to an invalid mempool_id argument. These errors indicate problems in memory management and distributed backend debugging. - issues/173538, issues/174029, issues/174044
- Regression and Performance Issues in Model Training and Execution: A regression causes illegal memory accesses during Flash Attention backward pass after upgrading to 2.8.3, and upgrading from 2.8.x to 2.9.x causes a slowdown in the Qwen3-VL model's vision-conditioned forward pass due to Conv3d fallback. Additionally, a nightly build regression causes out-of-memory errors on 4x A100 GPUs for previously fitting training runs. These regressions impact model accuracy, speed, and memory usage.
- issues/173953, issues/174051, issues/174244
- Documentation and Link Issues: Broken or incorrect documentation links have been reported, including a broken link to nested tensors documentation in PyTorch tutorials and a link in the PhotoTour dataset documentation redirecting to a gambling advertisement instead of the correct dataset URL. These issues affect user access to accurate resources.
- issues/174380, issues/174542
- Build and Configuration Failures: Problems include no GPU targets being selected when building AOTriton due to incorrect environment variable formatting, and CUDA 12.6 binary builds failing due to unresolved symbols and linker errors related to
linalg_eig_cusolver_xgeev. These issues prevent successful builds and deployments on certain platforms. - issues/174068, issues/174281
- Bugs in PyTorch Core Functions and APIs: Several bugs affect core PyTorch functions, such as a NameError in _wrap_values() due to undefined named_children, a bug in torch.tensordot documentation indexing, a bug in nn.LayerNorm producing NaNs on CPU with extreme float32 inputs, and a bug in PyTorch Dynamo's ONNX export causing bias name conflicts. These bugs impact correctness and usability of core APIs. - issues/173879, issues/173924, issues/174011, issues/174042, issues/174133
- CUDA and GPU Architecture Compatibility Issues: Issues include CUDA errors running Stable Diffusion on new NVIDIA GPUs due to missing kernel image support, AOTInductor generating PTX code targeting sm_120a but nvcc expecting sm_120 causing compilation failure, and ARMv8.1 LSE atomic instructions causing illegal instruction crashes on ARMv8.0 processors. These compatibility problems hinder usage on newer or specific hardware.
- issues/173991, issues/174161, issues/174344
- PyTorch Compile and Autograd Integration Bugs: Bugs include torch.compile wrapping autograd.Function subclasses with an incompatible ApplyTemplate lacking setup_context, causing RuntimeError with torch.func transforms, and AOTAutograd warm caching failing due to unpickleable local functions causing backend compilation errors. These issues affect model compilation and autograd correctness. - issues/174067, issues/174299
- Test Failures and CI Infrastructure Issues: Failures include flaky CUDA memory pool tests due to improper reset of indicator variables, disabling of specific CUDA tests on ROCm after hipify v2 integration, and GitHub runner incidents causing delays and cancellations in PR merges and CI jobs. These issues impact test reliability and development workflow.
- issues/174392, issues/174404, issues/174119
- Numerical Stability and Precision Discrepancies: A significant numerical deviation is reported in nn.Conv2d outputs comparing CUDA FP32 and CPU FP16, especially for 7x7 kernels, where CPU FP16 lacks stability, causing error amplification beyond acceptable thresholds. This discrepancy may degrade model accuracy when using mixed precision across devices (a comparison sketch appears after this list). - issues/174089
- Miscellaneous Bugs and Questions: Other issues include a race condition in the orgqr function for Metal Performance Shaders due to device-side barrier limitations, a crash in the index.Tensor operation with empty indices caused by a failed assertion, and a question about the rationale behind accuracy threshold values in an fp8 ROCm CI test. These highlight diverse minor bugs and community inquiries. - issues/173972, issues/173995, issues/174493
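The Conv2d precision item above compares the same convolution weights across device and dtype. A rough comparison sketch follows (it assumes CUDA is available; the shapes and the lack of a tolerance check are illustrative, not taken from issues/174089).

```python
# Sketch: run identical Conv2d weights in CUDA FP32 and CPU FP16 and report
# the largest absolute difference between the two outputs.
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=7, bias=False)
x = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    ref = conv.to("cuda", torch.float32)(x.to("cuda"))         # CUDA FP32 reference
    test = conv.to("cpu", torch.float16)(x.to(torch.float16))  # CPU FP16 under test

print((ref.cpu().float() - test.float()).abs().max())
```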
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 243
Key Open Pull Requests
1. Fix InputObserver.infer_arguments with empty caches: This pull request refactors the InputObserver class to improve argument inference when caches are empty, significantly expands test coverage to handle optional and mixed arguments as well as dynamic input scenarios, and integrates pandas for enhanced discrepancy analysis between model outputs and ONNX exports.
- URL: pull/174205
- Associated Commits: 24b29, 2a250, 46eca, 2117e, abc2f, 7e2af, fc47e, ecff1, 31004, fea38, 32763, 80655, 20581, 78d6b, 3746c, 68385, 28643, 88d98, e4b5e, eef93, 7d94e, d971f
2. [FSDP2] enable more tests on CPU: This pull request enables additional unit tests to run on CPU for the FSDP2 feature in PyTorch, building on prior support introduced in earlier pull requests.
- URL: pull/174048
- Associated Commits: 2dc0a, 08218, 3fb46, 9444f, 2ab7f, cb274, 1d0dc, 987ab, ec941, a70f2, 17014, c0a64, 5633b, 329e8
3. DStorage for DTensor/DParam: This pull request introduces DStorage functionality for DTensor and DParam in PyTorch, enabling model parameters to be viewed and managed as a unified byte storage representation.
- URL: pull/174267
- Associated Commits: e8826, 52b7d, cd520, 484cb, 101eb, 55905, 18d1c, 04bc9, 08689, 76be5, 5a492, f30b6, e4fa6, c23af
Other Open Pull Requests
- Checkpoint module refactoring and enhancements: Multiple pull requests improve the torch.utils.checkpoint module by refactoring internal implementations and adding new parameters. These changes simplify checkpointing by using SavedTensor objects directly and introduce an explicit device_type parameter to optimize device handling during checkpointing.
- [pull/174327, pull/174328, pull/174333]
- Gradient computation fixes in autograd: A pull request fixes the behavior of ctx.needs_input_grad in custom autograd functions to dynamically update gradient requirements during partial backward passes. This enables more efficient gradient computation, particularly benefiting zero-bubble pipeline parallelism scenarios (see the ctx.needs_input_grad sketch after this list). - [pull/174079]
- Serialization improvements: One pull request changes the serialization approach in serialize.py by replacing comma-based serialization with a dictionary-based method. This update saves serialized results as strings and deserializes them back to appropriate data types without altering function inputs or outputs.
- [pull/174170]
- CI and build infrastructure updates: Several pull requests enhance continuous integration and build workflows by adding Pallas TPU CI testing with private repository access, switching ROCm nightly builds to more reliable gfx942 runners, and adding a dedicated CI job for CPython tests on Python 3.13 to improve test stability.
- [pull/174201, pull/174290, pull/174414]
- Backend and device support migrations: Multiple pull requests migrate functionality and tests to support new hardware backends, including moving grid_sampler_2d to Metal, adapting unit tests for Intel GPU support, and adding frontend Python APIs for XPUGraph to improve capture and replay on XPU devices.
- [pull/174343, pull/174370, pull/174046]
- Documentation restructuring: One pull request reorganizes the C++ documentation by modularizing API files and removing exhale in favor of Doxygen and breathe. This restructuring drastically reduces build time from 5.5 hours to about one minute and eliminates thousands of nearly empty pages.
- [pull/174096]
- DTensor and JIT kernel enhancements: Pull requests add an OpInfo test suite for fullgraph compilation of DTensor operations and extend JIT-compiled CUDA kernels to support uint16, uint32, and uint64 scalar types. These changes address compilation verification and fix crashes in torch.special functions using unsigned integers on CUDA.
- [pull/174142, pull/174303]
- Torch function mode dispatch control: A pull request introduces a mechanism to skip torch function mode dispatch for a single call while keeping the mode active for subsequent operations. This is implemented via a skip_one_hop TLS flag and a new context manager, enabling functions like backward() to bypass mode dispatch internally but maintain mode activity overall.
- [pull/174098]
- Test assertion removals: A series of stacked pull requests focus on removing assert statements from various test files across the PyTorch codebase. These changes aim to improve test code quality and maintainability by eliminating redundant or outdated assertions in top-level, fx, and distributed test directories.
- [pull/174255, pull/174256, pull/174257, pull/174258, pull/174259, pull/174260, pull/174261, pull/174262, pull/174263]
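For context on the ctx.needs_input_grad pull request above, the following is a generic illustration of how a custom autograd Function consults that flag to skip unneeded gradient work; it shows the standard pattern, not the PR's change itself.

```python
# Generic ctx.needs_input_grad pattern: compute a gradient only when the
# corresponding input actually requires one.
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x * w

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        grad_x = grad_out * w if ctx.needs_input_grad[0] else None
        grad_w = grad_out * x if ctx.needs_input_grad[1] else None
        return grad_x, grad_w

x = torch.randn(3, requires_grad=True)
w = torch.randn(3)                       # requires_grad=False, so grad_w is skipped
Scale.apply(x, w).sum().backward()
print(x.grad)
```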
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 284
Key Closed Pull Requests
1. [release/2.7] Enable ROCm for linalg ops - cholesky, lstsq and gels: This pull request is a cherry-pick for release/2.7 that enables the use of the hipSolver backend instead of the previously default magma backend for ROCm in linear algebra operations such as cholesky, lstsq, and gels, allowing users to select the cusolver backend for potentially improved performance.
- URL: pull/174129
- Associated Commits: 79126, 5416d, 65695, c2cca, 8b6bc, 3b61d, 06c6a, 28ca4, 1cc51, a6321, 35f1e, 3f236, ef2b1, 89490, c7ff7, 0c236, 07391, 13417, e294d, 04c7c, 414fc, ff69f, c0fde, 41d64, f31bd, ae842, 4e346, ef226, 40f0d, a80d3, 06077, cced6, de84f, 54d00, 39a79, 8d7ae, 7010d, ad7a2, d458c, 6fd40, 030d6, 7fe67, 62f12, 943cc, 39c25, 55ef4, 47c27, 5df0d, ee96e, cd603, 2dc4b, 24b0c, 99847, cd885, 20d62, dab81, 27e9c, 823b1, 1f8a9, 3f73e, f717b, 1a316, f7721, dec2e, 90091, fa982, 3bfe0, c1423, 800aa, 378a5, 3cddd, 9ebc6, 5beaf, 8af99, f001a, 6e62a, 1cb81, b8d92, 0d98f, ab54c, 70518, 9030e, 2c220, 03c7d, 1fee1, 1d1c7, 92d32, 0073e, 6f2f4, a1599, bdec1, e8f8a, ff4dd, 4c731, 02cee, 4a815, 92d6d, 1ae99, 306ba, 769d5, 94173, e0afc, 83133, 7a876, bbd01, 38f2b, 62ea9, a9d0d, 790cc, 12141, 77a7b, e2d14, f0c1c, 4c858, 5ebff, 17364, 2337d, bf007, ba48d, d17e2, 189aa, e867a, 5631e, 83049, d62a3, 68990, 8a12d, 3fc00, c7ce5, 197c9, b5d59, 9412d, c17ce, 34f3b, 13520, 06c10, 49675, d598f, 8e450, 575e2, 7edf5, c7a1e, 66726, 2a215, 0bd40, df38c, 7a768, 509a6, a4d60, 6fba5, dce73, 4d586, e725e, 7f01c, 4c00e, 866cc, 9434e, b6228, f86d1, 3ea89, b2571, fc756, 22c98, cd0f7, fe3d3, 30508, f07b7, 6b52d, 35dae, d5542, 60111, a929f, be95f, 1cd45, f0534, f0aeb, 6c845, 44c0e, faae1, 6fd45, 5e2f3, a46fe, 699f4, 19431, 5cd45, 39916, 1dfb2, f3ff1, b6098, 359ee, 56383, 55f04, aab10, 2975a, e1c87, 69f40, d9382, 9db1b, f6616, b0c5b, 698b5, 07f41, 56b79, 8d1a0, 1781e, 1f24f, c00d4, 10cbf, 26531, cbf75, 59925, eb99f, 85f25, e3cca, 9e206, b3f84, c02c4, 8d426, 99ccf, 167f7, dcd8e, a033d, 9015d, e8c4b, c2114, 018e5, 975f6, b8b81, 6110a, de2e5, fcbe2, 652c9, 40012, 4b0d5, 63052, a7b6c, 130d9, e3112, 9dc91, 175d5, 7de12, 94afe, 88375, 65632, 5925e, 132ce, bd94a, 262e5, 953e8, 1dc03
2. [reland][ROCm] remove caffe2 from hipify: This pull request relands a previous attempt to remove caffe2 from the hipify tool in the ROCm project by eliminating all "MasqueradingAsCUDA" files and classes and avoiding renaming "CUDA" classes to "HIP," addressing infrastructure issues and incorporating multiple fixes, updates, and mapping improvements to ensure compatibility and build stability.
- URL: pull/172796
- Associated Commits: 4cb19, 5d694, c5a07, 7d7e3, c6f2c, e538b, f0fca, c55ca, 52f55, 7c7be, 6c7cb, 0f0a5, 43be9, fced1, 1bd02, 9cd91, b25bf, 84388, d21a7, fa4ea, bfc83, 3f702, dd3ca, 9b608, 03a40, be812, e7838, 9cf12, 3c4c1, b09bb, c3f73, ee1f7, 71e55, dca58, 5a319, b8641, 6b41f, e23ec, 1970c, 1bfb1, 11e1c, 64210, 7dcbb, 7d2af, e3652, a7c55, 3d3e4, 82a60, 6b226, 47962, f48f0, e259e, 7b449, a15de, 5a66c, 76c18, 64325, 6e55c, 8e093, 7a60c, d34f4, 68086, 61db1
3. Handle List/Dict Comprehension Graph Breaks for Python3.12+: This pull request addresses the changes in Python 3.12+ where list and dict comprehensions are inlined into their surrounding functions by enhancing Dynamo's tracing mechanism to handle graph breaks within comprehensions more precisely, implementing bytecode analysis and checkpointing to skip only the comprehension-related code rather than the entire function, and covering numerous edge cases including nested comprehensions, side effects, and variable scope mutations, with new tests added to ensure correctness.
- URL: pull/173558
- Associated Commits: 56724, c55d7, 5757c, a0eb3, 084e5, 13b84, 9a755, 7ac14, 48d2c, 98374, 9a9db, b31da, dd72d, b4f59, 74fe6, 6c567, 63fd0, a4d89, 8547f, 2129b, c4b09, 4ce41, 3a27d, 8895e, ba25a, d3811, 6cb43, 00a98, 2ee74, 960e9, 1342d, ee671, cf5bd, 411d5, 89182, 3909a, 2b02f, d5fdd, d3015, cb56d, a4524, 8cc33, edc3f, 29785
Other Closed Pull Requests
- Dynamo VariableTracker Construction Consolidation: Multiple pull requests focus on consolidating the construction of VariableTracker instances in various PyTorch Dynamo modules by routing direct variable creation through centralized builders like SourcelessBuilder.create() or VariableBuilder. These changes address import circularity issues and implement the first step of a related issue to improve code structure and maintainability.
- Dynamo Profiler Enhancements: Pull requests introduce a Dynamo-native profiler operating at the tracing layer to measure time spent tracing Python functions, improving user-level visibility beyond cProfile. Additional improvements include recording generator frames during profiling to enhance accuracy and completeness of performance data.
- ROCm Backend Fixes and Improvements: Several pull requests address ROCm-specific issues including fixing unit tests by extending skips and correcting grid value expectations, removing caffe2 dependency from hipify, and attempting fixes for ROCm forward issues with related bug cleanups. These changes improve ROCm compatibility and infrastructure stability.
- Dynamic Shape Support in Linear Algebra Operations: A pull request fixes dimension-dependent errors in 18 linear algebra operations by replacing direct dimension comparisons with runtime checks and handling unbacked symbolic dimensions properly. This enables these operations to support dynamic shapes effectively.
- DTensor Single-Dimension Pointwise Operation Rule: One pull request completes the implementation of the single-dimension pointwise operation rule within the DTensor component, advancing DTensor functionality.
- ONNX Export Support for Higher Order Operators: A pull request implements ONNX export support for
torch.ops.higher_order.invoke_subgraph, preserving functions created by nested compilation as separate entities in the ONNX graph. It notes that further updates to the onnxscript optimizer and version converter are needed to fully prevent inlining.
- XPU Build Lazy Dependency on Intel Level Zero: A pull request implements a lazy dependency on the Intel Level Zero library for the XPU build by linking against a stub that defers loading until runtime. This prevents failures on CPU-only machines lacking libze_loader.so and enforces API calls through an indirection layer.
- Dynamo Cache and Performance Optimizations: Pull requests propose alternative implementations to cache attribute source construction and simplify the variable tracker cache by caching only on the Source object. These changes reduce redundant calls and improve compile time performance in Dynamo.
- pull/174020, [pull/174242](https://github.com/pytorch/pytorch/pull/174242)
- Pallas TPU CI Security Enhancement: A follow-up pull request adds checksum verification for the Bazelisk download in the Pallas TPU continuous integration setup to enhance security and integrity.
- Torch Stable ABI Enhancements: A pull request adds deleter support to torch::stable::from_blob by introducing a new function and necessary scaffolding to facilitate a clean port of TorchCodec to the stable ABI.
- Automated PR Review Skill Implementation: One pull request proposes the initial implementation of a "Claude review skill" to improve automated pull request reviews with a balance of specific examples and general guidance, outlining future enhancements for compatibility and validation.
- Inductor and Triton BlockPatternMatch Improvements: A pull request improves BlockPatternMatch by preventing premature expansion of expressions, removing precomputed sizes for dynamic shapes, defining non-negativity for FloorDiv, and fixing a long-running test related to low memory max pooling.
- Function Saved Tensors Clearing Option: A pull request introduces an option to clear saved_tensors in a Function upon access, addressing a specific issue and including related commits for documentation and test skipping.
- Unit Test Fix for XPU Inductor: A pull request fixes the unit test SDPAPatternRewriterGpuDynamicTests.test_skip_non_tf32 within the [xpu][fix][inductor] scope.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| albanD | 155 | 27 | 1 | 8 |
| malfet | 113 | 15 | 1 | 48 |
| wconstab | 156 | 13 | 0 | 8 |
| pianpwk | 161 | 11 | 0 | 2 |
| ydwu4 | 156 | 15 | 0 | 1 |
| laithsakka | 145 | 18 | 0 | 0 |
| NikhilAPatel | 136 | 0 | 0 | 0 |
| anijain2305 | 107 | 13 | 0 | 13 |
| BenjaminDEMAILLE | 128 | 0 | 0 | 0 |
| kurtamohler | 117 | 2 | 2 | 6 |
Access Last Week's Newsletter: