Weekly Project News

Weekly GitHub Report for PyTorch: February 15, 2026 - February 22, 2026 (14:49:18)

Weekly GitHub Report for PyTorch

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Version Information
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

The current version of this repository is v2.6.0.

1.2 Version Information:

Released on January 29, 2025, PyTorch 2.6 introduces significant enhancements including torch.compile support for Python 3.13, a new dynamic compilation control API torch.compiler.set_stance, and improved AOTInductor packaging and ABI compatibility. Notable highlights also include beta-level FP16 support on x86 CPUs, expanded Intel GPU support with simplified installation, and a backward-incompatible security improvement flipping the default of torch.load to weights_only=True, alongside numerous performance optimizations, bug fixes, and deprecations such as the discontinuation of official Conda packages.
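The weights_only=True flip is worth dwelling on: a plain pickle load executes arbitrary code embedded in the checkpoint file. The stdlib-only sketch below (no torch required) illustrates the attack surface and the restricted-unpickler idea behind weights_only; the SafeUnpickler class and its allow-list are hypothetical illustrations, not PyTorch's actual implementation.

```python
import io
import pickle

# A pickle payload whose mere *loading* runs code: __reduce__ tells the
# unpickler to call print(...) when the object is reconstructed.
class Payload:
    def __reduce__(self):
        return (print, ("arbitrary code ran during load!",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # side effect fires at deserialization time

# The weights_only idea, in miniature: a restricted unpickler that only
# resolves an allow-list of globals and rejects everything else.
class SafeUnpickler(pickle.Unpickler):
    ALLOWED = {("collections", "OrderedDict")}  # tensors/primitives in real life

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

try:
    SafeUnpickler(io.BytesIO(blob)).load()
except pickle.UnpicklingError as e:
    print("rejected:", e)
```

Plain containers still round-trip through the restricted unpickler, since they use native pickle opcodes rather than global lookups.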

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [HIGH PRIORITY] [TRIAGE REVIEW] [MODULE: CRASH] [MODULE: ACTIVATION CHECKPOINTING] [ONCALL: PT2] [MODULE: DTENSOR] [MODULE: PT2-DISPATCHER] [MODULE: FLEX ATTENTION] [BOT-TRIAGED] Block mask caching for flex attention and SAC don't play nicely together (RuntimeError: Only Tensors of floating point and complex dtype can require gradients): This issue describes a runtime error occurring when block mask caching in flex attention interacts poorly with selective activation checkpointing (SAC), specifically causing a RuntimeError related to tensors of integer dtype incorrectly requiring gradients during recomputation. The problem arises because a compiled create_block_mask produces a BlockMask with int tensors that are cached and reused, leading SAC's aot_autograd wrapper to fail when reconstructing DTensors from these int tensors with requires_grad=True inside compiled functions.

    • The comments discuss the difficulty in pinpointing the error, note that the failure is due to branching on global state causing cache mismatches in SAC, and explore a proposed solution to clear the cache during recomputation which ultimately fails. A user shares a workaround involving a recompute tape mechanism to record and replay cache hits during forward and recompute passes, ensuring consistent behavior and preventing the error.
    • Number of comments this week: 7
  2. [ONCALL: PT2] [MODULE: DYNAMIC SHAPES] MobileBertForMaskedLM is 90% slower with unbacked vs backed !: This issue reports that the MobileBertForMaskedLM model runs approximately 90% slower when using unbacked batch processing compared to backed batch processing with the Inductor backend, despite recent optimizations related to size hinting. The author is seeking assistance from the Inductor team to further optimize performance for unbacked batches, sharing preliminary benchmark results and incremental code improvements that have progressively increased speed but have not yet fully resolved the slowdown.

    • The comments detail a series of incremental optimizations to the Inductor codebase, including changes to size hinting heuristics, handling of unbacked symbols, and padding strategies, which collectively improved performance from 1.19x to over 2x speedup, with ongoing efforts to finalize the fixes.
    • Number of comments this week: 5
  3. [HIGH PRIORITY] [TRIAGE REVIEW] [NEEDS REPRODUCTION] [MODULE: MEMORY USAGE] [ONCALL: PT2] torch.compile VRAM usage regression between 2.9.1 and 2.10.0: This issue reports a significant increase in VRAM usage when using torch.compile between PyTorch versions 2.9.1 and 2.10.0, specifically noting that compiled models in 2.10.0 consume substantially more VRAM compared to previous versions. The user provides detailed memory profiling data and suspects changes in activation checkpointing or compiler partitioning decisions as potential causes, requesting assistance to isolate the root cause with further diagnostics.

    • The comments acknowledge the severity of the regression and request additional diagnostic data such as tlparse logs and memory snapshots to compare compilation artifacts between versions; detailed memory profiles and hypotheses about activation checkpointing and compiler behavior are discussed to guide further investigation.
    • Number of comments this week: 4
  4. [TRIAGED] [FUNCTION REQUEST] [ONCALL: PT2] [MODULE: DYNAMO] [MODULE: COMPILE UX] [DYNAMO-TRIAGE-DEC2025] [BOT-TRIAGED] torch.compile(..., name="flex_attention"): This issue proposes adding a name keyword argument to the torch.compile function to assign a name to compile regions, enabling better identification and usage in other contexts such as activation checkpointing and stack traces. The motivation is to improve the ability to distinguish and manage compiled regions, particularly when using features like SAC and inductor compiled code, by associating meaningful names with these regions.

    • The comments generally support the proposal, with one suggesting shipping it immediately, another discussing the trade-offs between using string names versus Python object references for naming to avoid conflicts, and others highlighting the usefulness of named regions for debugging, tracing, and improving model transparency and interpretability.
    • Number of comments this week: 4
  5. [TRIAGED] [RELEASE TRACKER] [v.2.11.0] Release Tracker: This issue is about tracking and managing cherry-picks to the release branch for the PyTorch 2.11.0 release, outlining specific criteria and processes for what changes can be included during different phases of the release cycle. It provides detailed instructions on how to submit cherry-pick requests, the approval workflow, and the types of fixes allowed to ensure stability and correctness before the final release.

    • The comments document multiple cherry-pick requests submitted for the release branch, all of which were reviewed and merged by a release team member, indicating active management and approval of changes to the release branch.
    • Number of comments this week: 3
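The recompute-tape workaround described in the first item above can be illustrated without any PyTorch machinery. In this hedged sketch (all names are invented, not the actual workaround code), cache lookups are recorded during the forward pass and replayed verbatim during recomputation, so a cleared or mismatched cache cannot change behavior between the two passes:

```python
class RecomputeTape:
    """Record cache results during forward; replay them during recompute."""
    def __init__(self):
        self._tape = []       # results in the order they were produced
        self._cursor = 0      # replay position
        self.replaying = False

    def record(self, value):
        self._tape.append(value)
        return value

    def replay(self):
        value = self._tape[self._cursor]
        self._cursor += 1
        return value

def cached_lookup(key, cache, tape, build):
    # During recompute, ignore the live cache entirely and replay the taped
    # result, so a hit/miss mismatch cannot diverge from the forward pass.
    if tape.replaying:
        return tape.replay()
    if key not in cache:
        cache[key] = build(key)
    return tape.record(cache[key])

cache, tape = {}, RecomputeTape()
build = lambda k: f"mask-{k}"

# Forward pass: one miss, then one hit; both results are taped.
a = cached_lookup("q128", cache, tape, build)
b = cached_lookup("q128", cache, tape, build)

# Recompute pass: even with the cache cleared, the taped results replay.
tape.replaying = True
cache.clear()
a2 = cached_lookup("q128", cache, tape, build)
b2 = cached_lookup("q128", cache, tape, build)
assert (a, b) == (a2, b2)
```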

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 53

Summarized Issues:

  • torch.compile bugs and regressions: Multiple issues report bugs and regressions related to torch.compile, including invalid C++ code generation causing compilation failure on RTX 3090 GPUs, increased VRAM usage regressions between versions 2.9.1 and 2.10.0, incorrect interpolation results with certain backends, inconsistencies in random number generation between eager and compiled modes, and a minor regression causing assertion errors with float8 tensors on some GPUs. These problems affect compilation correctness, performance, and reproducibility across different PyTorch versions and hardware.
    • issues/175057, issues/175058, issues/175154, issues/175156, issues/175206
  • MPS backend crashes and incorrect gradients: There are critical bugs on the MPS backend where the backward pass of BatchNorm2d produces incorrect weight gradients with channels_last inputs, and AvgPool2d backward operations crash due to buffer size assertion failures on channels_last formatted tensors. These issues cause training divergence and runtime crashes, although forward passes remain correct, highlighting backend-specific stability problems.
    • issues/175189, issues/175190
  • Distributed and NCCL deadlocks and communication bottlenecks: Deadlocks occur when launching NCCL point-to-point and collective operations concurrently from different threads due to circular dependencies in NCCL's progress engine. Additionally, serialization of point-to-point communication on a single CUDA stream causes head-of-line blocking in pipeline parallel workloads, limiting communication overlap and throughput.
    • issues/175145, issues/175225
  • Inductor backend performance and fusion issues: The Inductor backend faces performance challenges such as MobileBertForMaskedLM running 90% slower with unbacked batch processing, storage offsets not propagating correctly during qkv fusion causing accuracy failures, and runtime OutOfMemoryErrors in Triton due to excessive fusion of operations in backward passes. These issues impact both speed and correctness of compiled models.
    • issues/175167, issues/175325, issues/175250
  • Dynamo component correctness and refactoring: Several issues highlight bugs and needed improvements in the Dynamo codebase, including incorrect insertion of HAS_ATTR guards, auditing and fixing the use of the is operator, lack of metaclass support, and proposals to refactor complex builtins into separate variable trackers to improve maintainability and compile-time performance.
    • issues/175263, issues/175267, issues/175269, issues/175292
  • SAC (Selective Activation Checkpointing) and flex attention issues: Problems with SAC include caching BlockMask objects containing integer tensors leading to backward failures, difficulty in matching flex attention regions idiomatically after compilation, and proposals to assign unique IDs to inductor_compiled_code operators to improve tracking and error detection during recomputation.
    • issues/175258, issues/175229, issues/175306
  • Build and packaging errors: Build failures occur due to missing directory checks in hipification scripts affecting Fedora packages, and Docker builds fail because of outdated URLs for zlib downloads, requiring updates or alternative sources to fix 404 errors.
    • issues/175160, issues/175193
  • Export and DTensor related errors: Exporting tensor-parallel models using DTensor fails due to DTensorSpec not being registered as a pytree constant, and running decompositions on ExportedPrograms with DTensor triggers assertion errors during decomposition, blocking successful export and decomposition workflows.
    • issues/175467, issues/175469
  • Segmentation faults in embedding_bag: The torch.nn.functional.embedding_bag function causes segmentation faults when offsets exceed indices length or when using float64 weights with empty offsets, due to insufficient validation and missing bounds checks in the C++ backend.
    • issues/175368, issues/175370
  • Documentation and usability improvements: Documentation errors include incorrect descriptions of LSTM outputs, and usability issues such as inability to suppress repeated TensorFloat32 warnings and proposals to add a name argument to torch.compile for better region identification and debugging.
    • issues/175479, issues/175484, issues/175390
  • Hardware capability and backend support enhancements: Proposals include replacing get_device_capability() calls with generic feature queries for better runtime hardware verification, and adding a cuTile backend to Inductor for portable tile-based GPU programming, aiming to improve hardware support and performance portability.
    • issues/175211, issues/175311
  • Random number generation and sampling reproducibility: There is a request to update torch.distributions.Gamma().sample() to accept a torch.Generator for deterministic sampling, addressing reproducibility in simulations.
    • issues/175478
  • Tensor subclass and custom_op interaction bug: Using TensorSubclass with custom_op under torch.compile causes runtime failures because the custom_op implementation is incorrectly invoked during Dynamo tracing instead of the fake implementation, leading to errors accessing tensor data pointers.
    • issues/175408
  • Type safety and casting improvements: A discussion is ongoing about whether overrides.is_tensor_like should return a TypeGuard[Tensor] to enable safer type casting and reduce explicit casts in autograd workflows.
    • issues/175324
  • Test failures and disabled tests: Several tests have been disabled or are failing consistently, including a comprehensive linear CUDA test in Inductor, index_put_error_cuda on ROCm, and a custom_op test on the xpu platform, indicating ongoing stability and platform-specific issues.
    • issues/175354, issues/175482, issues/175475
  • Out-of-memory and memory fragmentation issues: An out-of-memory error on ROCm GPUs occurs despite apparent free memory, suggesting memory fragmentation or allocation issues that prevent small allocations from succeeding.
    • issues/175431
  • Intermittent CPU crashes on older hardware: An illegal instruction crash occurs intermittently when running torch.sin on large CPU tensors with PyTorch 2.10.0+ on older Intel Xeon processors, linked to MKL VML kernels and mitigated by limiting PyTorch to a single thread.
    • issues/175436
  • Beam search concurrency test failure: A test for beam search with concurrency limits fails because outputs differ when concurrency limits are applied, indicating correctness issues in concurrent beam search implementations.
    • issues/175437
  • Quantization test failures in vLLM: vLLM quantization tests fail due to incompatibilities with older quantization config versions and unexpected keyword arguments during model initialization, blocking progress on the 2.11 release.
    • issues/175435
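The embedding_bag segfaults reported above come down to missing input validation before the C++ kernel runs. Below is a framework-free sketch of the kind of bounds checks involved; the helper and its exact rules are hypothetical, not PyTorch's actual validation code.

```python
def validate_embedding_bag(indices, offsets, num_embeddings):
    """Hypothetical pre-kernel validation; raises instead of segfaulting."""
    if offsets and offsets[0] != 0:
        raise ValueError("offsets must start at 0")
    prev = 0
    for off in offsets:
        if off < prev:
            raise ValueError("offsets must be non-decreasing")
        if off > len(indices):
            # the reported crash: an offset past the end of `indices`
            # was dereferenced with no bounds check
            raise IndexError(f"offset {off} exceeds indices length {len(indices)}")
        prev = off
    for idx in indices:
        if not 0 <= idx < num_embeddings:
            raise IndexError(f"index {idx} out of range [0, {num_embeddings})")

validate_embedding_bag([1, 2, 3], [0, 2], num_embeddings=10)  # passes silently
try:
    validate_embedding_bag([1, 2], [0, 5], num_embeddings=10)
except IndexError as exc:
    print("rejected:", exc)
```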

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 50

Summarized Issues:

  • Inductor Backend Boolean and Compilation Bugs: Multiple issues report incorrect behavior and bugs in the PyTorch Inductor backend related to boolean tensor operations and compilation. These include incorrect indices returned by argmax on boolean tensors, wrong results from boolean .data assignments, and improper tracing of dist.all_reduce outputs causing runtime errors during compiled execution.
    • issues/174069, issues/174187, issues/174280
  • MPS Backend Numerical and Memory Errors: Several issues highlight numerical inaccuracies and memory access errors on the MPS backend. Problems include incorrect results from BatchNorm2d and avg_pool2d with channels_last tensors having storage offsets, out-of-bounds memory access in scaled dot product attention causing data corruption, and incorrect gradient computations in torch.linalg.solve.
    • issues/174345, issues/174861, issues/175192
  • Instability in Inductor and Periodic-Dynamo Benchmarks: Numerous issues report instability and flaky failures in various inductor-periodic and periodic-dynamo benchmark tests across CPU and CUDA platforms. These instabilities affect multiple test variants and environments, indicating widespread reliability problems in benchmarking infrastructure.
    • issues/175121, issues/175122, issues/175123, issues/175124, issues/175125, issues/175126, issues/175127, issues/175128, issues/175129, issues/175130, issues/175131, issues/175132, issues/175133, issues/175134, issues/175135, issues/175136, issues/175137, issues/175138, issues/175139, issues/175140, issues/175141, issues/175142, issues/175143
  • Non-Contiguous Tensor Handling Bugs on MPS and CUDA: Several issues describe incorrect behavior when using non-contiguous or permuted tensors on MPS and CUDA backends. These include incorrect results from torch.sort, activation functions producing wrong gradients, and scatter_ operations reading source elements in storage order rather than logical order, causing errors unless tensors are made contiguous first.
    • issues/175187, issues/175188, issues/175191
  • Loss Functions and Gradient Backpropagation Issues: There are reports of failures in loss functions and gradient computations, such as nn.CrossEntropyLoss on MPS accepting invalid float labels without error and NLLLoss failing to backpropagate gradients for non-contiguous 4D inputs, resulting in runtime errors.
    • issues/174943, issues/175084
  • Linking and Build Errors Due to Name Mangling: One issue describes inconsistent name mangling between C++ and CUDA for templated functions, causing linking errors during the build process due to mismatched mangled symbols in CPU and CUDA libraries.
    • issues/174898
  • Test Failures and Module Dependency Issues: Some issues report test failures caused by missing dependencies or disabled tests due to persistent errors. Examples include a missing Python module 'dominate' causing a ModuleNotFoundError and disabled tests failing due to segmentation faults or strict tolerance thresholds.
    • issues/174919, issues/174952, issues/175019, issues/175065
  • CUDA Kernel and Memory Access Errors: There are reports of CUDA illegal memory access errors in specific attention mechanisms when compiled with torch.compile, suggesting out-of-bounds or incorrect indexing in generated kernels.
    • issues/174923
  • Feature Requests for Autograd and Slicing Enhancements: Requests include adding per-loss selective upstream gradient blocking in autograd to optimize memory and compute, and supporting negative step sizes in slicing syntax by internally rewriting them to use flip and positive steps for better ergonomics.
    • issues/175165, issues/175240
  • Numerical Accuracy and Linear Algebra Discrepancies: Issues report numerical accuracy failures in CUDA tests for tensor inversion and linear algebra solves, influenced by matrix conditioning and backend differences, as well as undefined behavior in GemmHelper due to improper vector usage causing runtime errors.
    • issues/175282, issues/175302
  • Warnings and Compatibility Discussions: Some issues request suppression of warnings when creating tensors from readonly NumPy arrays to reduce log clutter, and discuss whether to begin work on supporting the upcoming Python 3.15 release to ensure timely compatibility.
    • issues/175395, issues/175402, issues/175407
  • GPU Layout Support Test Bug: One issue describes a test failure where the layouts_supported check incorrectly expects a RuntimeError for certain GPUs that actually support float8 layouts, causing the test to fail erroneously.
    • issues/175182
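The negative-step slicing request mentioned above proposes rewriting step < 0 slices into positive-step slices plus a flip. Here is a small pure-Python sketch of that rewrite; the helper is hypothetical and not the proposed PyTorch implementation.

```python
def rewrite_negative_step(seq, sl):
    """Evaluate seq[sl], but emulate negative steps with a positive-step
    slice followed by a reversal (the proposed 'flip' rewrite)."""
    start, stop, step = sl.indices(len(seq))
    if step > 0:
        return seq[start:stop:step]
    n = (stop - start + step + 1) // step   # how many elements are selected
    if n <= 0:
        return seq[:0]                      # empty slice of the same type
    last = start + (n - 1) * step           # smallest selected index
    return seq[last : start + 1 : -step][::-1]

data = list(range(10))
assert rewrite_negative_step(data, slice(None, None, -2)) == data[::-2]
assert rewrite_negative_step(data, slice(8, 2, -3)) == data[8:2:-3]
assert rewrite_negative_step(data, slice(None, None, 3)) == data[::3]
```

The slice.indices call normalizes None and out-of-range bounds, so the rewrite only has to count the selected elements and find the smallest selected index.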

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 197

Key Open Pull Requests

1. [HOP][print]Add Dtensor support: This pull request adds support for DTensor arguments to the torch._higher_order_ops.print higher-order operator by unwrapping DTensors to their local tensors via to_local(), so that each rank prints its own local shard without introducing collectives. This enables rank-wise printing consistent with jax.debug.print inside compiled functions and ensures compatibility with torch.compile(fullgraph=True) and lowering to Inductor, while leaving global tensor printing and rank filtering to the user.

  • URL: pull/175222
  • Associated Commits: c6714, d4708, 51317, 33ae6, c14e2, 92839, 50f67, 26d7f, 3d82d, ea721, 86014, ad68e, a90bb, dced9

2. [test] Add basic pyrefly infer command: This pull request introduces a basic pyrefly infer command along with initial setup, tests for autotyping, lint fixes, directory renaming to avoid clashes, reference typing completion, integration of a custom directory switcher, and test rewrites to enhance the pyrefly functionality within the PyTorch project.

  • URL: pull/175153
  • Associated Commits: be3d0, 11f5d, 0d4a8, b88e6, d26be, 286e7, cf386, c5eef, c8a85, e4b96, 5270f, 7cfce, a34c5

3. Update vLLM pinned commit: This pull request updates the pinned commit of the vLLM submodule, reorganizes test paths to align with upstream changes in vLLM, and includes fixes to ensure successful building with CUDA 12.8.

  • URL: pull/175238
  • Associated Commits: 98577, 75fdc, 90d75, 4261c, 9306d, 86b8a, 8af69, 77598, a051d, 61cae, d1ba8, b6c37

Other Open Pull Requests

  • powf precision improvements: Multiple pull requests enhance the precision and correctness of the powf function in CUDA and Inductor backends. These changes include adding an inline PTX implementation for powf_cuda with non-flush-to-zero instructions and fixing the pow precision helper to correctly handle fp64 inputs by falling back to libdevice.pow, ensuring numerical accuracy and matching eager execution results.
    • pull/175227, pull/175268
  • ROCm and HIP BFloat16 enhancements: The ROCm environment sees improvements with a refactor to use the native HIP __bf16 type for hardware-accelerated float conversions and arithmetic, replacing software-based conversions. Additionally, ROCm CI stability is improved by skipping sccache PATH wrappers for certain compiler commands to prevent build failures during nightly builds.
    • pull/175303, pull/175443
  • Inductor backend operation decompositions and autotuning: Several pull requests modify the Inductor backend to skip decomposition of addcdiv and addcmul operations, enabling FMA-based lowering for improved bitwise precision parity with eager CUDA. Autotuning is enhanced by integrating custom operator autotuning into aten.mm, adding scoped configuration propagation for specific operations, and extending ND tiling heuristics to better support reduction kernels.
    • pull/175310, pull/175278, pull/175277, pull/175309
  • Pytree API improvements and public exposure: The pytree module is improved with a new simplified API that aligns with JAX's design and reverses argument order for better compatibility. Furthermore, torch.utils.pytree is made a public API, allowing users to switch between Python and C++ implementations and benefit from enhanced Dynamo traceability.
    • pull/175083, pull/175082, pull/175420
  • Debugging and tracing fixes in Dynamo and bdb: Fixes are applied to the InteractiveDebugSession to prevent absorption of test exceptions and to ensure the debugging command 'q' exits immediately without stopping on exceptions or returns. Additionally, an unnecessary unimplemented prefix variable related to comprehension graph breaks is removed to address recursion depth issues in optimized graph break handling.
    • pull/175103, pull/175173, pull/175420
  • Memory management and benchmarking enhancements: An API is added to clean up cuBLAS workspaces during CUDA graph capture to prevent memory leaks in autotuning benchmarks. A new Inductor benchmarker is introduced for ROCm platforms using the Torch profiler to improve kernel timing accuracy during autotuning.
    • pull/175276, pull/175097
  • Expression caching and compilation flag updates: A per-SymNode expression cache keyed on the _replacements_version_counter is introduced to optimize expression handling. The default value of the wrap_inductor_compiled_regions flag is updated to True to modify Inductor backend compiled region behavior.
    • pull/175353, pull/175169
  • ONNX export and inference function updates: The torch.onnx.export function is fixed to handle renamed input names with dynamic shapes by remapping keys to original parameter names, ensuring successful export validation. An inference-only variant of varlen_attn is added that mutates a provided output tensor in place to support pre-allocated outputs during inference.
    • pull/175279, pull/175103
  • CI improvements: The timeout duration for ROCm nightly binary builds in the CI workflow is increased to 360 minutes to address recurring timeout issues during libtorch builds.
    • pull/175152
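The per-SymNode expression cache keyed on _replacements_version_counter (mentioned above) follows a common pattern: cache a derived value together with a version number, and recompute only when the counter has moved. A minimal sketch with invented names follows; ShapeEnv and SymNode here are toy stand-ins, not PyTorch's real classes.

```python
class ShapeEnv:
    """Toy stand-in: holds symbol replacements plus a version counter."""
    def __init__(self):
        self._replacements_version_counter = 0
        self.replacements = {}

    def add_replacement(self, sym, value):
        self.replacements[sym] = value
        self._replacements_version_counter += 1   # invalidates all caches

class SymNode:
    def __init__(self, env, expr):
        self.env, self.expr = env, expr
        self._cache_version = -1   # sentinel: nothing cached yet
        self._cached = None
        self.evaluations = 0       # instrumentation for the demo

    def simplified(self):
        version = self.env._replacements_version_counter
        if self._cache_version != version:     # stale or never computed
            self.evaluations += 1
            self._cached = self.expr.format(**self.env.replacements)
            self._cache_version = version
        return self._cached

env = ShapeEnv()
env.add_replacement("s0", "8")
node = SymNode(env, "{s0} + 1")
node.simplified()
node.simplified()                  # cache hit: same version, no recompute
assert node.evaluations == 1
env.add_replacement("s1", "4")     # counter bump invalidates the cache
node.simplified()
assert node.evaluations == 2
```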

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 232

Key Closed Pull Requests

1. Enable hipSOLVER for supported linalg operations on ROCm: This pull request aims to enable hipSOLVER support for specific linear algebra operations on the ROCm platform by adding appropriate guards and fallbacks to integrate hipSOLVER 64-bit APIs for functions like xpotrf, xpotrs, and xgeqrf, while ensuring unsupported APIs fall back to MAGMA, thereby enhancing ROCm's compatibility and performance for these linalg operations.

  • URL: pull/175367
  • Associated Commits: 8b80a, 34f78, 88129, dd44b, 1c96f, 3d409, 7672d, a7ad4, 36b75, 32803, f5519, 94de7, 7fb50, 61a38, 4b515, d9556, 48007, b9f71, 94c6e, 25087, ac013, b6ee5, 08906, 77acf, 03aab, b2db1

2. More size-hinting cleanups: This pull request focuses on cleaning up size-hinting in the codebase by replacing all size_hint calls with fallback to use optimization_hint, removing fallback parameters from size_hint calls in preparation for its eventual deletion, and substituting calls to symbolic_hint() with replace_backed_with_hints() to improve code clarity and maintainability.

  • URL: pull/174580
  • Associated Commits: 9c8d4, 8dc39, d4b3f, 0a7e5, aa75b, d15eb, 8d4bc, b3c4d, a1277, 3fce3, 6dc29, 63d42, a8cfd, 258b9, 9fc6c

3. [DTensor] Strategy Validation (3/3): strategy querying, orchestrator, and CLI: This pull request adds a comprehensive DTensor sharding rule validation framework including an orchestrator that queries DTensor’s claimed sharding strategies via multiple paths, computes ground truth validity for each placement combination, detects and reports discrepancies such as incorrect or missing rules with false positive mitigations, provides a CLI for running validations on individual or all registered operators, and includes end-to-end tests to ensure reliable detection of DTensor bugs.

  • URL: pull/174800
  • Associated Commits: 9c568, e4070, b8c6b, 0b24b, 97bd9, cc1a2, 6af72, cb134, 12a1b, 18d26, 1516d, 90b67, 889a4, 59049, 34546

Other Closed Pull Requests

  • DTensor sharding validation and improvements: Multiple pull requests enhance DTensor functionality by adding a validation engine for sharding rules, fixing stack dimension normalization, improving decomposition flow with lenient view redistributes, supporting single-dimension bucketize rules for sharded buckets, and updating backward functions to handle None gradients explicitly. These changes collectively improve correctness, validation robustness, and error handling in DTensor operations.
    • pull/174799, pull/174640, pull/175194, pull/173865, pull/174830
  • Pallas TPU backend support and tiling improvements: These pull requests introduce element-wise operation support on the Pallas TPU backend, including initial IR lowering, broadcasting, and code generation fixes, as well as advancing tiling implementation with feature enablement and bug fixes related to CPU and TPU shape handling. The updates enable more efficient and correct execution on the Pallas backend.
    • pull/174743, pull/175027
  • Tracing and nonstrict_trace feature enhancements: Several pull requests improve the nonstrict_trace feature by adding support for neural network modules as inputs and enhancing documentation and tests. Additionally, tracing performance is improved by deferring stack trace symbolization to reduce overhead during node creation.
    • pull/172372, pull/172395, pull/175334
  • ProcessGroup and FakeScriptObject tracing fixes: A pull request modifies the ProcessGroup class to use an ABC metaclass, enabling FakeScriptObject to be registered as a virtual subclass and ensuring correct isinstance behavior during tracing with FakeScriptObjectStack instances.
    • pull/172566
  • Size hint and unbacked tensor support in IR: This pull request rewrites all remaining usages of size_hint in ir.py to support unbacked tensors, replacing ambiguous calls with precise APIs like guard_or_false and optimization_hint, improving correctness and performance especially for vLLM.
    • pull/174937
  • CI improvements for TIMM pretrained model caching: A pull request enhances the CI process by enabling caching of TIMM pretrained models on a shared Hugging Face cache, preventing benchmark failures due to offline mode blocking downloads through a download-only flag, version-pinned cache directory, and a stamp file.
    • pull/174596
  • Flex module and MaskMod equality fix: This pull request attempts to fix equality comparison in the MaskMod component to ensure correct behavior during multiple tracing in ahead-of-time compilation, though it was not merged.
    • pull/175343
  • InputObserver custom empty tensor support: This pull request adds support for specifying a custom empty tensor in InputObserver to handle missing inputs like pixel_values during sequential forward calls, ensuring consistent input observation in multi-modal models.
    • pull/174964
  • CUDA graph benchmarking and autotuning enhancements: These pull requests add CUDA graph benchmarking capabilities to ExternKernelCaller and introduce a min_speedup_threshold parameter to the custom operation autotuning API, enabling more accurate performance comparisons by normalizing kernel launch overhead and filtering algorithms by speedup ratio.
    • pull/175275, pull/173811
  • Pyrefly environment detection fix: A pull request addresses an issue where Pyrefly incorrectly detects its environment in GitHub actions by decoding JSON output to ensure compatibility locally and on CI, pending a more comprehensive fix.
    • pull/175289
  • ShardingPropagator thread safety fix: This pull request fixes a potential hang in ShardingPropagator during multi-threading tests by enabling a lock only in testing, ensuring thread safety with FakeTensorMode without impacting performance.
    • pull/174820
  • DTensorContinuousTestBase GPU device index fix: This pull request modifies _init_pg to call set_device_index(rank) only after verifying sufficient GPUs exist, preventing errors from setting device indices for non-existent devices.
    • pull/174845
  • Dynamo einops version check revert: This pull request partially reverts a previous change by disabling the einops 0.8.2 version check in Dynamo, falling back to earlier behavior to prevent excessive warning logspam caused by tracing einops operations using @lru_cache.
    • pull/175351
  • Autotuning JSON output enhancement: This pull request modifies select_algorithm.py to include kernel name types in JSON output during autotuning, facilitating easier downstream analysis of tuning data.
    • pull/173811
  • MultiMarginLoss error message improvement: This pull request improves the error message for MultiMarginLoss to provide clearer details about target tensor size inconsistencies, aiding user debugging.
    • pull/174072
  • CuteDSL norm kernels introduction: This pull request introduces CuteDSL kernels specifically designed for computing norms in PyTorch.
    • pull/174987
  • Dynamo guard emission optimization: This pull request reduces redundant guards during MRO attribute lookups by installing DICT_CONTAINS absent guards only once per shared intermediate MRO class and attribute pair, skipping unnecessary guards for data descriptors, and caching resolved MRO sources, improving efficiency.
    • pull/175006
  • Unmerged fix attempt for issue 174939: This pull request attempts to fix issue 174939 with a series of updates marked "[ghstack-poisoned]" but was not merged.
    • pull/175063
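The Dynamo guard-emission optimization summarized above hinges on deduplicating "attribute absent" guards along a class MRO. A toy sketch of the idea follows; the names and guard tuples are illustrative, not Dynamo's actual guard API.

```python
emitted_guards = set()   # (guard kind, class name, attribute) already installed

def guard_mro_lookup(cls, attr):
    """Emit guards for resolving `attr` through cls.__mro__, installing each
    'attribute absent' guard at most once per (class, attribute) pair."""
    guards = []
    for klass in cls.__mro__:
        if attr in vars(klass):
            guards.append(("HAS_ATTR", klass.__name__, attr))
            break
        key = ("DICT_CONTAINS_ABSENT", klass.__name__, attr)
        if key not in emitted_guards:     # the deduplication step
            emitted_guards.add(key)
            guards.append(key)
    return guards

class Base:
    def forward(self):
        pass

class Model(Base):
    pass

first = guard_mro_lookup(Model, "forward")    # absent guard + HAS_ATTR
second = guard_mro_lookup(Model, "forward")   # absent guard already installed
assert first == [("DICT_CONTAINS_ABSENT", "Model", "forward"),
                 ("HAS_ATTR", "Base", "forward")]
assert second == [("HAS_ATTR", "Base", "forward")]
```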

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor       Commits   Pull Requests   Issues   Comments
laithsakka        171       18              2        9
albanD            178       3               1        1
wconstab          121       17              0        23
anijain2305       145       8               5        3
pianpwk           131       19              0        5
malfet            111       18              0        25
ydwu4             121       12              0        0
eellison          99        12              0        16
guilhermeleobas   89        11              0        2
weifengpy         99        3               0        0

Access Last Week's Newsletter:

  • Link