Weekly GitHub Report for PyTorch: September 22, 2025 - September 29, 2025
Weekly GitHub Report for PyTorch
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v2.6.0
1.2 Version Information:
Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include FP16 support on X86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
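For readers who want to try the two APIs called out above, here is a minimal sketch (assuming PyTorch 2.6 or newer; the checkpoint file name is a placeholder):

```python
import torch

# weights_only=True is the new torch.load default in 2.6; passing it
# explicitly keeps older releases behaving the same way.
torch.save({"w": torch.randn(3)}, "model.pt")  # placeholder checkpoint
state = torch.load("model.pt", weights_only=True)

# torch.compiler.set_stance adjusts how torch.compile behaves at runtime,
# for example forcing already-compiled functions to run eagerly.
torch.compiler.set_stance("force_eager")
```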
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Regressions detected with operator microbenchmarks for Pytorch 2.9_rc1 vs Pytorch2.8: This issue reports significant performance regressions observed in operator microbenchmarks when running PyTorch 2.9 release candidate 1 compared to PyTorch 2.8, specifically in compile mode on various CUDA devices. The regressions affect key matrix multiplication operations, and the discussion focuses on reproducing the issue, verifying benchmark accuracy, and understanding discrepancies between CI results and profiling data.
- The comments clarify that regressions occur only in compile mode, not eager mode, and highlight difficulties reproducing the benchmarks due to missing tests in the release branch; contributors share relevant PRs to enable testing, question the accuracy of reported latency numbers, and provide profiling data showing no clear regressions, suggesting CI measurement flakiness; the consensus is to proceed with the 2.9 release and defer any urgent fixes to the 2.10 milestone while investigating how to align CI results better with profiling outcomes.
- Number of comments this week: 13
- compile error with custom softmax triton kernel: This issue describes a compilation error encountered when using a custom Triton kernel for softmax within a PyTorch function decorated with torch.compile(fullgraph=True). The error arises because the PyTorch Dynamo/AOT-autograd tracer treats the output tensor, produced by in-place mutation of a pre-allocated tensor inside the Triton kernel, as an invalid graph node, causing an assertion failure during graph partitioning.
- The discussion reveals that detaching and cloning the output tensor before saving it avoids the error, though cloning is seen as inefficient. It is explained that Dynamo traces the Python wrapper but not inside the Triton kernel, so it sees the output tensor as an invalid node due to in-place mutation. Wrapping the kernel in a torch.library.custom_op hides its internals from Dynamo, preventing the error (a minimal sketch of this workaround appears after this list). The issue is contrasted with other kernels like RMSNorm, which do not fail without custom ops, and the conversation suggests this is a bug or limitation in how Dynamo handles such kernels under fullgraph compilation.
- Number of comments this week: 13
- GroupedMM triton template IMAs on B200: This issue reports a memory access error occurring in the GroupedMM Triton template when running on a B200 GPU, specifically related to invalid global reads in the kernel code. The problem appears to be linked to non-TMA loads in the Triton kernel, with ongoing efforts to fix the issue and discussions about compatibility with different Triton versions and the future deprecation of block pointer programming models.
- The comments discuss the non-deterministic nature of the error, clarify reproduction steps and hardware used, and confirm a fix is in progress addressing non-TMA loads. There is also a conversation about which Triton versions should be supported, with caution advised against using soon-to-be deprecated block pointer APIs.
- Number of comments this week: 9
- Torch CPU wheels no longer available for download from PyTorch website: This issue reports that the PyTorch website no longer provides downloadable CPU wheel files at a previously available URL, causing disruptions for users who relied on that location for installation. The problem arose suddenly without prior notice, breaking builds that depended on the now-removed links, and users sought clarification and alternative installation methods.
- Commenters confirmed the issue affected multiple users and identified that switching to a different index URL resolved the problem temporarily. The discussion clarified that the previous URL was never officially documented for this use, but many had come to depend on it. Official installation commands were shared, emphasizing the intended method, and the issue was eventually resolved with the original URLs becoming functional again.
- Number of comments this week: 8
- cpu wheel pip index issue: This issue reports a problem where using the provided URL for CPU wheels leads to links redirecting to the regular PyTorch index, causing installation failures for CPU-specific wheels. The user notes that the CPU wheels were still accessible via a direct path using --find-links, and that this functionality was working correctly a few days prior.
- The comments identify a recent change in the test infrastructure as the likely cause, with a partial revert improving the situation. Users confirm the correct installation commands, discuss the URL rendering and indexing behavior, and verify that the issue has been resolved, while also suggesting improvements to CI validation to prevent similar problems in the future.
- Number of comments this week: 8
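As a rough illustration of the workaround discussed in the custom softmax issue above, the Triton kernel's Python wrapper can be registered through torch.library.custom_op so that Dynamo treats it as a single opaque operator. This is only a sketch assuming PyTorch 2.4+; mylib::softmax is a made-up name and plain torch.softmax stands in for the user's Triton kernel:

```python
import torch


@torch.library.custom_op("mylib::softmax", mutates_args=())
def my_softmax(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    # A real implementation would launch the Triton softmax kernel that
    # writes into `out` in place; plain PyTorch stands in for it here.
    out.copy_(torch.softmax(x, dim=-1))
    return out


@my_softmax.register_fake
def _(x):
    # Shape/dtype propagation so the op can be traced without running.
    return torch.empty_like(x)


@torch.compile(fullgraph=True)
def f(x):
    # Dynamo now sees mylib::softmax as one opaque node instead of tracing
    # into the wrapper and its in-place mutation.
    return my_softmax(x)


print(f(torch.randn(4, 8)))
```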
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler': This issue reports an ImportError encountered when attempting to import the name 'triton_key' from the module 'triton.compiler.compiler' during the use of PyTorch's backend compiler 'inductor'. The user provides detailed environment information and code snippets showing that the error arises while compiling certain pipeline components with torch.compile, indicating a potential compatibility or packaging problem between PyTorch, Triton, and their respective versions.
- Alternate algorithm for computing MaxPool2D under specific condition.: This issue proposes an alternate algorithm for computing MaxPool2D when the stride is equal to 1, by representing a larger kernel size (e.g., 5 or 7) as multiple smaller MaxPool2D operations with kernel size 3 (see the sketch after this list). This method aims to reduce computational cost on the CPU by decreasing the number of operations per cell and suggests modifying the MaxPool2D layer directly to avoid additional overhead during backpropagation, with demonstrated speedup in testing.
- cuda_utils.so: failed to map segment from shared object: This issue describes a problem encountered when running a PyTorch model inside a Docker container with a tmpfs mounted at /tmp with permissions set to 1777. Although the model compiles successfully, execution fails with an error indicating that the shared object cuda_utils.so cannot be mapped due to missing execute permissions on the file, despite the script running as root and the directories having appropriate permissions.
- Enable UFMT on all files in PyTorch: This issue addresses the task of enabling uniform formatting (UFMT) across all files in the PyTorch codebase, specifically targeting around 1,500 files that are currently excluded from UFMT enforcement. It outlines the process for removing files from the exclusion list, running the formatter, handling known formatting-related problems, and organizing the work by directory to facilitate manageable and reviewable pull requests.
- [JIT archive] Add a flag to not include debug files: This issue proposes adding a flag to the torch.jit.save() function that allows users to exclude debug files, specifically .debug_pkl files, from the JIT archive to reduce the overall file size. The motivation stems from observations that these debug files, which are only used for debugging purposes, can occupy a significant portion of the archive size without affecting model correctness, making their exclusion particularly beneficial for deploying smaller models on mobile devices.
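As a quick sanity check of the MaxPool2D proposal above (a sketch, not the proposed implementation): for stride 1, a 5x5 max pool matches two chained 3x3 max pools, since each output element still takes the maximum over the same 5x5 input window.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
# Direct 5x5 max pool with stride 1.
out_5x5 = F.max_pool2d(x, kernel_size=5, stride=1)
# Equivalent result from two chained 3x3 max pools with stride 1.
out_3x3_twice = F.max_pool2d(F.max_pool2d(x, kernel_size=3, stride=1),
                             kernel_size=3, stride=1)
torch.testing.assert_close(out_5x5, out_3x3_twice)
```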
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 108
Summarized Issues:
- Compilation and Backend Failures in Inductor and TorchDynamo: Multiple issues report compilation errors, assertion failures, and runtime errors in the Inductor backend and TorchDynamo compiler, including problems with invalid nodes treated as outputs, stride mismatches, PassManager failures during backward passes, and symbolic integer hint mismatches. These bugs cause divergence between eager and compiled execution, prevent successful compilation, and lead to crashes or incorrect results in compiled models.
- Bugs and Numerical Issues in Attention and Matrix Multiplication on GPUs: Several issues highlight numerical inaccuracies and silent errors in GPU computations, including bfloat16 precision discrepancies on Blackwell GPUs, incorrect results in flex_attention with bfloat16 and cross-attention masks, and NaN outputs in scaled_dot_product_attention during evaluation mode. These problems undermine trust in GPU-accelerated attention and matrix operations, especially when using reduced precision formats.
- CUDA and GPU Compatibility and Runtime Errors: Issues report conflicts and runtime errors related to CUDA versions and GPU environments, such as incompatible CUDA runtime libraries when mixing CUDA 12.8 and 13, CUDA driver downgrades causing device detection failures, and NCCL runtime errors during distributed model parallelism initialization. These problems cause build failures, deadlocks, or crashes in GPU-accelerated workflows.
- Distributed and Parallelism Deadlocks and Failures: Several issues describe deadlocks and failures in distributed training setups, including NCCL backend deadlocks when ranks desynchronize, missing gradients in torch.distributed.pipelining with ScheduleGPipe, and pipeline parallelism causing gradient divergence. These issues cause training hangs or incorrect training behavior in multi-GPU or distributed environments.
- Triton Kernel and Tensor Memory Accelerator (TMA) Related Bugs and Improvements: Issues cover bugs in Triton kernels such as out-of-bounds memory reads in GroupedMM templates, compilation failures due to undefined variables, and assertion errors related to TMA store operations. Additionally, proposals include renaming the USE_TMA flag to better reflect tensor descriptor usage and adding tensor descriptor support to Flex Backwards templates. These affect kernel correctness and clarity in the Triton framework.
- Torch.compile and Graph Breaks with Dynamic Shapes and Symbolic Sizes: Multiple issues report that torch.compile with fullgraph=True or dynamic shapes enabled causes graph breaks or compilation failures due to unsupported operations like tolist(), integer division on unbacked sizes, or missing as_proxy() implementations. These bugs prevent successful compilation of models using dynamic or symbolic tensor sizes (a minimal illustration of the tolist() case appears after this list).
- Incorrect or Missing Error Handling in MPS Backend and Other Operations: Issues report silent truncation without errors in torch.nn.functional.one_hot on MPS when targets exceed num_classes, missing out-of-bounds errors in embedding_bag on MPS, and segmentation faults in custom MPS kernels. These bugs cause silent data corruption or crashes on Apple Silicon GPUs.
- Test Failures and CI Instability on ROCm and Other Platforms: Several test modules and suites fail intermittently or are disabled on ROCm platforms due to regressions or flaky behavior, including tests in CudaReproTests and allreduce_inductor_cudagraph_trees. These failures complicate continuous integration and reduce test coverage reliability.
- Documentation, Build, and Packaging Issues: Issues include underspecified and inconsistent documentation for building PyTorch docs, sudden removal of CPU wheel files breaking builds, and proposals to rename the pytorch-triton package to triton for consistency. These problems affect developer experience and package management.
- Performance Regressions and Benchmarking Concerns: Reports include significant performance regressions in operator microbenchmarks on CUDA devices between PyTorch 2.8 and 2.9 release candidates, and discussions on GPU vs CPU performance trade-offs for small matrix operations on Apple M4 hardware. These highlight concerns about recent performance impacts and benchmarking methodology.
- Symbolic and Type System Issues in Compilation and Schema Definitions: Issues describe problems with symbolic shape handling causing runtime errors, missing support for double types in native function schemas, and failures in type signature tracking CI jobs. These affect type safety and correctness in compilation and schema validation.
- Memory and Null Pointer Safety Concerns: One issue highlights a potential null-pointer dereference risk in sycl_runtime_wrappers.h due to unchecked malloc failure, suggesting the need for defensive programming to prevent crashes.
- Miscellaneous Bugs in Operator Behavior and API Usage: Issues include torch.baddbmm failing to validate batch dimension broadcasting with out_dtype, torch.combinations throwing errors with symbolic sizes, and torch.nn.functional.interpolate crashing with bilinear mode in Inductor backend. These bugs cause incorrect outputs or crashes in common tensor operations.
- Autorevert and CI Tooling Enhancements: A request to enable the Autorevert bot to post comments in shadow mode aims to improve monitoring and troubleshooting of automatic reverts, supporting oncall personnel.
- Build and Compilation Speed Issues: An issue reports extremely long compile times (~15 minutes) for generated TraceType_*.cpp files with clang-17 on M4Pro hardware, impacting developer productivity.
- Symbolic Expression and Attribute Errors in Inductor Compiler: Issues report AttributeErrors due to missing attributes in sympy Infinity objects and negative values in symbolic expression evaluation causing Inductor compilation failures during backward passes.
- Proposal for New Features and Enhancements: Suggestions include adding nn.BitLinear for 1-bit transformer models, support for symmetric memory programming for multi-GPU kernels, and unifying build workflows for x86 and aarch64 architectures to streamline development.
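To make the torch.compile graph-break category above concrete, here is a minimal illustration (a sketch assuming recent torch.compile defaults, where capture_scalar_outputs is off): calling .tolist() inside a fullgraph=True function forces a data-dependent host read that Dynamo cannot keep in a single graph, so compilation raises an error instead of silently breaking the graph.

```python
import torch


@torch.compile(fullgraph=True)
def f(x):
    # .tolist() materializes data-dependent Python scalars; with
    # capture_scalar_outputs at its default (False) this is a graph break,
    # which fullgraph=True turns into a hard error.
    return sum(x.tolist())


try:
    f(torch.arange(4))
except Exception as exc:
    print(f"{type(exc).__name__}: {exc}")
```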
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 27
Summarized Issues:
- Eager vs Compiled Execution Discrepancies: Multiple issues report differences in behavior or output between PyTorch's eager mode and compiled mode using torch.compile. These include numerical precision divergences, type errors during backward passes, convolution operation failures due to stride mismatches, and out-of-memory errors occurring only in compiled mode, indicating challenges in maintaining parity between execution paths (a simple parity-check pattern is sketched after this list).
- [issues/163449, issues/163457, issues/163563, issues/163569, issues/163604, issues/163687]
- Distributed and Parallelism Issues: Several issues involve distributed training and parallelism features such as FSDP, process group management, and gradient scaling. Problems include the need for explicit gradient scalers in mixed precision FSDP2, lingering CUDA contexts after destroying process groups, and metadata mismatches during checkpointing with FSDP and activation checkpointing.
- [issues/163519, issues/163690, issues/163741]
- Memory and CUDA Kernel Errors: There are reports of CUDA-related errors including out-of-memory conditions, out-of-bounds shared memory reads in cuDNN kernels, and silent CUDA IMA errors in nvshmem Triton kernels due to naming requirements. These issues highlight challenges in CUDA kernel stability and memory management on specific hardware and configurations.
- [issues/163563, issues/163753, issues/163915]
- CI and Infrastructure Failures: Multiple issues describe failures and outages affecting continuous integration and infrastructure, such as Windows build failures due to missing attributes, flaky or disabled tests on ROCm and NVSHMEM platforms, MacOS runner maintenance causing reduced availability, and network outages impacting job execution.
- [issues/163530, issues/163568, issues/163663, issues/163847, issues/163900]
- Compilation and Model Support Limitations: Some issues report unsupported operations or errors when compiling specific models or using certain inputs, such as boolean masks in TransformerEncoder compilation, failures in conv1d compilation, and flex_attention kernel configuration failures. These indicate gaps in compiler support for certain PyTorch features or input types.
- [issues/163569, issues/163640, issues/163687]
- Documentation and Usability Improvements: A few issues focus on improving documentation clarity and build tooling, including fixing a sentence in torch.matmul docs and resolving C++ documentation build failures due to Python and dependency upgrades.
- [issues/163672, issues/163949]
- Hardware and Platform Compatibility: Issues include initialization failures on new GPUs like the NVIDIA RTX 5090 with CUDA 12.8 on Windows, and unexpected behavior on the Metal Performance Shaders backend with nan inputs, reflecting ongoing challenges in supporting diverse hardware and platforms.
- [issues/163684, issues/163851]
- Feature Requests and Development Guidance: One issue requests guidance on which device operators need improvement for algorithm development, specifically asking about AMD GPU operators, indicating community interest in expanding hardware support.
- [issues/163606]
- New Operator Implementation: There is an issue about implementing the aten_bilinear function, describing its expected behavior and schema, reflecting ongoing operator development efforts.
- [issues/163730]
- Testing and Feature Flags: Some issues relate to test management and feature toggling, such as disabling autorevert functionality and disabling specific tests to improve CI reliability.
- [issues/163857, issues/163847]
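Most of the eager-vs-compiled reports above come down to verifying that torch.compile reproduces eager results. A simple parity-check pattern looks like the sketch below; the helper name, test function, and default tolerances are placeholders, not part of any issue:

```python
import torch


def check_parity(fn, *args, **kwargs):
    # Run the same function eagerly and under torch.compile and compare
    # outputs; mismatches like those reported above surface here.
    eager_out = fn(*args, **kwargs)
    compiled_out = torch.compile(fn)(*args, **kwargs)
    torch.testing.assert_close(eager_out, compiled_out)


check_parity(lambda x: torch.nn.functional.gelu(x) * 2, torch.randn(8, 8))
```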
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 210
Key Open Pull Requests
1. [DO NOT MERGE] Dynamo config set up: This pull request sets up and updates the configuration and documentation for TorchDynamo within the PyTorch project, with multiple commits that enhance the dynamo_config files and config.py to support this setup.
- URL: pull/163517
- Merged: No
- Associated Commits: 14d3a, c288a, 51026, 03719, bddba, 1510f, 6dbbb, 91117, f55be, 76d12, dd273, 630ee
2. [Inductor-FX] Support unbacked symbol definitions: This pull request adds support for unbacked symbol definitions generated by Inductor to the FX backend by introducing a new Wrapper IR line for code generation, implementing FX IR generation from unbacked symbols, handling autotuning limitations with warnings, and improving related helper functions, along with importing relevant tests to ensure proper handling of dynamic control flow scenarios.
- URL: pull/163729
- Merged: No
- Associated Commits: 0ca99, 2675f, be7db, f5b8c, 10d1c, 20f0d, 283a3, d3118, b61e9, 51070, 8d628, 3ab43
3. Add smoke tests to verify that stable ABI FA3 wheel runs w/ newer torch: This pull request adds smoke tests to verify that the stable ABI FA3 wheel is compatible and runs correctly with newer versions of the PyTorch library.
- URL: pull/163782
- Merged: No
- Associated Commits: 87b43, 6aef1, 16bfe, 8fc1b, 1d8b5, a1ece, 42440, d543c, 9437c, ffeb9, cc70f, 99d18
Other Open Pull Requests
- Removal of PythonOpRegistrationTrampoline: This pull request proposes removing the PythonOpRegistrationTrampoline component from the PyTorch backend, simplifying the backend; the author notes the removal turned out to be simpler than initially expected.
pull/163466
- DTensor Export Support: This pull request explores and implements support for exporting models with DTensor parameters and inputs by experimenting with multiple export paths to achieve a joint graph for DTensorized modules. It adds tests to validate these approaches and discusses the preferred method for future use.
pull/163609
- Unifying Export and Dynamo Behavior: This pull request aims to unify the behavior of Export and Dynamo regarding the handling of inline inbuilt neural network modules by enabling the install_free_tensors flag by default for both export and export_v2. This change ensures consistent behavior across these features.
pull/163921
- CUDA addmm Refactor: This pull request proposes refactoring the CUDA addmm implementation to improve code navigation and clarity between the Lt and non-Lt execution paths. The refactor enhances maintainability and understanding of the CUDA addmm code.
pull/163955
- LogAddExp Complex Number Consistency: This pull request addresses inconsistencies between CPU and CUDA implementations of the LogAddExp function for complex numbers by updating the CUDA kernel to support complex data types. It also enhances unit tests to compare CPU and CUDA outputs and adds new tests to ensure correctness and consistency.
pull/163509
- AOTI MPS Shim Implementation: This pull request implements an AOTI MPS shim by updating the MPS shim API with new handle types and function declarations for library management and kernel execution. It modifies MPS shader code generation to produce source constants and updates kernel call generation to utilize the shimified API rather than direct libtorch calls.
pull/163865
- Scaled Dot Product Attention Numerical Matching: This pull request updates the reference implementation of scaled_dot_product_attention to precisely match the numerical behavior of the MATH backend by pre-scaling query and key tensors before matrix multiplication. It also incorporates float32 upcasting for reduced precision inputs and adds a regression test to verify exact bitwise equivalence; a rough sketch of the pre-scaling idea appears after this list.
pull/163508
- Speeding Up __instancecheck__: This pull request aims to speed up the __instancecheck__ method by moving the VariableTrackerMeta implementation to C++ and caching its results. This results in a reduction of tracing time for the test_ziplongest benchmark by approximately 0.7 seconds.
pull/163656
- Graph Break on tolist in Dynamo: This pull request addresses the issue of ensuring a graph break occurs on the tolist operation when capture_scalar_outputs is false in PyTorch's Dynamo tracing. It maintains the current contract and prevents failures during retracing in autograd, as demonstrated by the added unit test.
pull/163807
- Generalizing FloorDiv Conversion in Inductor: This pull request generalizes the FloorDiv conversion in Inductor's FX backend to handle more complex symbolic launch grid expressions by improving the replace_floor_div function. It removes the slower "python_slow" grid mode, enabling more efficient and comprehensive handling of floor division operations without increasing CPU usage.
pull/163828
- Clamping in binary_cross_entropy_with_logits: This pull request updates the binary_cross_entropy_with_logits loss function to clamp input values between -100 and 100 to handle extreme values consistently with binary_cross_entropy. This prevents arbitrarily large loss values that could disrupt training and includes corresponding unit tests and performance optimizations.
pull/163975
- Boxing Helper to Stable API: This pull request adds a boxing helper to the stable API in PyTorch, addressing issue #163346 as part of a series of stacked changes.
pull/163505
- Handling DDE Errors in infer_size_impl: This pull request addresses handling DDE errors in the infer_size_impl function, which were encountered while running VLLM with unbacked memory for the Qwen/Qwen2-1.5B-Instruct model.
pull/163822
- Removal of 'allow-untyped-defs' Directives: These pull requests propose the removal of the 'allow-untyped-defs' directive from multiple files including './torch/utils/benchmark/op_fuzzers/unary.py', './torch/distributions/half_normal.py', and './torch/distributed/optim/zero_redundancy_optimizer.pyi'. This cleanup removes unnecessary type checking directives from the codebase.
pull/163473, pull/163474, pull/163477
- Performance Operator Microbenchmarks: This pull request aims to test and refine performance operator microbenchmarks on the PyTorch 2.9 release candidate, including separating benchmark runs for different hardware configurations such as b100 and h100+a100.
pull/163538
- Round-Robin Load Balancing in ContextParallel: This pull request introduces a process-time based Round-Robin load balancing mechanism to the ContextParallel (CP) module in PyTorch to improve workload distribution.
pull/163617
- Hierarchical All-to-All Inter-node Dispatch: This pull request introduces a hierarchical all-to-all (A2A) communication mechanism for inter-node dispatch by performing a histogram on topk_node_idx, sorting the token sequence accordingly, expanding it by topk_node times, and executing a concurrent 1D A2A operation rail-wise on a 2D mesh. This mimics the initial step of deduplication in A2A.
pull/163814
- Hierarchical All-to-All Intra-node Dispatch: This pull request implements a hierarchical all-to-all (A2A) intra-node dispatch mechanism that shuffles tokens rail-wise to experts within a node using a 2D mesh communication pattern. It expands the token sequence by a top-k factor and converts intra-node top-k indices into splits to closely mimic the Mixture of Experts (MoE) approach.
pull/163815
- Bucketing Method for Communication-Compute Overlap: This pull request introduces a bucketing method that preserves communication-compute overlap by augmenting the computation graph with implicit dependencies between collective operations and hiding compute tasks. It merges collective starts and waits into unified subgraphs to maintain these relationships during bucketing.
pull/163960
- Sparse Tensor Fixes: This pull request addresses fixes related to sparse tensor functionality in the PyTorch project, specifically resolving issues referenced in issue #148324. Named tensor fixes were moved to a separate pull request.
pull/163535
- Using _RecordFunctionFast in Autograd Profiler: This pull request updates the torch.autograd.profiler.record_function to utilize the more efficient _RecordFunctionFast, significantly reducing compiler runtime overhead in smaller graphs as demonstrated by benchmark improvements.
pull/163566
- Migration of Additional Callsites: This pull request aims to migrate additional callsites within the PyTorch codebase as part of an ongoing series of related changes to improve or update the project's internal function calls.
pull/163580
- Fixing local_map Tracing in AP Context: This pull request addresses the issue of incorrectly tracing the local_map body with global shapes in the Automatic Parallelism (AP) context by modifying the tracing process to use local shapes instead. This ensures that local_map redistributes DTensor inputs appropriately and prevents errors related to shape specialization during subsequent tracing after sharding.
pull/163602
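As a rough sketch of the pre-scaling idea behind pull/163508 (not the actual patch; the function name, upcasting placement, and shapes below are assumptions), the usual 1/sqrt(d) softmax scale can be split across the query and key tensors before the matmul, with float32 upcasting limiting reduced-precision error:

```python
import math
import torch


def sdpa_reference_prescaled(q, k, v):
    # Split the 1/sqrt(d) softmax scale across q and k before the matmul,
    # then upcast to float32 for the score computation and softmax.
    d = q.shape[-1]
    scale = 1.0 / math.sqrt(d)
    q = q.float() * math.sqrt(scale)
    k = k.float() * math.sqrt(scale)
    attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
    return (attn @ v.float()).to(v.dtype)


q = k = v = torch.randn(2, 4, 8, 16, dtype=torch.bfloat16)
out = sdpa_reference_prescaled(q, k, v)
print(out.shape, out.dtype)
```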
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 231
Key Closed Pull Requests
1. GEMM-template Horizontal: This pull request implements a horizontal transverse strategy for the CPP GEMM template to improve performance when Matrix A is much larger than Matrix B, adds configuration options to enable this strategy or heuristically choose between vertical and horizontal transverse, and demonstrates significant performance improvements in relevant cases.
- URL: pull/163467
- Merged: No
- Associated Commits: b6e21, 97ecb, 59c42, 23b3b, 36ce9, 00e77, d217d, ba980, 4cda7, ddf44, 2f4bf, b2e8d, 420a8, 0f3c7, f0de5, 87277
2. Allow unbacked to unbacked replacements if rhs unbacked symbols are all inputs: This pull request proposes allowing unbacked-to-unbacked symbol replacements when all right-hand side unbacked symbols are inputs, thereby easing reasoning logic during model tracing by not banning such replacements since those symbols are visible throughout the program.
- URL: pull/163652
- Merged: No
- Associated Commits: 331ba, 74b6f, a6f96, d29f0, ac097, 6707a, c73ed, 04301, 9e27c, b0d95, fd5e2, eaaeb, 9c7da, 73b6e, 8db45, 0c1b8
3. remove allow-untyped-defs from ./torch/utils/benchmark/op_fuzzers/sparse_unary.py: This pull request proposes removing the "allow-untyped-defs" directive from the file "./torch/utils/benchmark/op_fuzzers/sparse_unary.py" in the PyTorch project, but it was not merged.
- URL: pull/163476
- Merged: No
- Associated Commits: 77664, 43d29, 7d64a, 94d9d, 9dca8, b1c1c, c2478, 23465, ebbd5, 132de, e4238, a255d, 0f6ab, 288f9, ae003
Other Closed Pull Requests
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
bobrenjc93 | 215 | 53 | 23 | 26 |
malfet | 114 | 17 | 6 | 119 |
laithsakka | 104 | 17 | 6 | 68 |
kwen2501 | 118 | 30 | 6 | 38 |
Skylion007 | 11 | 6 | 1 | 167 |
ezyang | 63 | 15 | 10 | 96 |
tugsbayasgalan | 121 | 30 | 0 | 23 |
swolchok | 94 | 20 | 1 | 22 |
huydhn | 88 | 8 | 3 | 32 |
FFFrog | 87 | 21 | 1 | 10 |