Weekly GitHub Report for Xla: December 01, 2025 - December 08, 2025 (12:01:22)
Weekly GitHub Report for Xla
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version release information was available to summarize this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- New `nvshmem` rule breaks the build: This issue reports a build failure caused by a new `nvshmem` rule introduced in a recent pull request, which leads to an error related to the absence of a `getenv` method on the `repository_ctx` object during CUDA configuration. The reporter is seeking guidance on whether they need to update their side to resolve this error, particularly in relation to changes mentioned for JAX, or whether the fix must come from the OpenXLA project, along with an estimated timeline for such a resolution.
- Failed to parse MLIR generated by Torchax: This issue describes a problem encountered when exporting a PyTorch model to MLIR using the torch-xla torchax export API, where the generated MLIR fails to parse due to an unregistered operation 'vhlo.rsqrt_v2' in the VHLO dialect. The user reports that this error prevents deserialization of the portable artifact with StableHLO_v1.9.5, despite using compatible versions of torch, torchxla, and building XLA from the corresponding commit, and provides code and bytecode samples to help diagnose the issue.
- GPU collective performance model bug: This issue concerns a bug in the gpu_collective_performance model, where the lowLatencyBandwidth for AMD links was updated without corresponding changes to the CUDA section. As a result, invoking the gpu_collective_performance model with H100 settings fails, indicating an incomplete or inconsistent update to the performance model code.
- Cross-compile to ARM with custom GCC: This issue concerns difficulties encountered when attempting to cross-compile the XLA project from an x86 architecture to ARM64 using a custom GCC compiler. The user reports that despite using the `--config=cross_compile_linux_arm64` flag in the Bazel build system, the build process continues to produce an x86 binary, indicating a possible misconfiguration or missing step in the cross-compilation setup.

Since there were fewer than 5 open issues, all of the open issues have been listed above.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 3
Summarized Issues:
- XLA Sharded Tensor Concatenation Optimization: The current lowering process for concatenating sharded tensors in XLA generates many small transpose fusion and NCCL kernels, leading to inefficiency. An optimization is requested to reduce these numerous small operations to improve performance, ideally matching the efficiency of the `shard_map` implementation.
- issues/34710
- PJRT Output Sharding Communication: There is a need for clarification on how PJRT implementers should communicate the correct output shardings back to the framework after compilation. The current process is complex and lacks a well-defined interface, causing uncertainty for implementers.
- issues/34830
- macOS Build Failures in XLA Backend: Building the project on macOS fails due to missing definitions such as 'if_google' and undeclared targets like 'tiled_emitter_constraints' in Bazel build files within the XLA backend directories. These missing components prevent successful compilation on macOS systems.
- issues/34935
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 1
Summarized Issues:
- Build failures due to outdated compiler versions: The build failure on a Google Cloud Platform virtual machine occurred because the GCC compiler used was outdated and did not support the required C++ constexpr semantics. This caused errors related to calls to non-constexpr functions within constexpr contexts during the compilation of the XLA GPU ROCm backend.
- issues/34850
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 17
Key Open Pull Requests
1. [ROCm] Add register spilling detection support AMD: This pull request adds support for detecting register spilling on AMD GPUs by analyzing HSACO metadata and disassembling object files, enhancing performance diagnostics in the ROCm backend of the XLA compiler.
- URL: pull/34812
- Merged: No
2. [XLA:GPU] make DYNAMIC_SLICE_COPY_FUSION command default lowered to cuda-graph: This pull request makes the DYNAMIC_SLICE_COPY_FUSION command the default for lowering to cuda-graph by adding it to the enabled GPU command buffers in the default debug options, aiming to improve performance and including updated unit tests.
- URL: pull/34734
- Merged: No
3. [GPU] Update fabric detection failure warning message: This pull request updates the fabric detection failure warning message to be clearer and more actionable by converting it to a verbose log (VLOG) and restricting the warning to only Blackwell+ GPUs, addressing confusion among non-MNNVL users.
- URL: pull/34868
- Merged: No
Other Open Pull Requests
- Bug fixes in GPU backend and tests: Multiple pull requests address critical bug fixes in the GPU backend and related tests. These include fixing descriptor mismatches in fp8 GEMMs for CUDA 13.0, resolving LLVM calling convention errors for AMD GPUs, fixing cuDNN SDPA test workspace size issues, and resolving ROCm platform test failures due to missing libraries or incomplete checks.
- pull/34912, pull/34789, pull/34806, pull/34808, pull/34843
- Performance optimizations and feature enablement for GPU: Several pull requests focus on improving GPU performance and enabling new features. These include fusing shared memory write loops for transposes to reduce execution time, enabling dynamic slice fusion by default for CUDA graph execution, enabling the scatter determinism expander flag by default for better performance, and optimizing bf16 negation and absolute value operations by avoiding unnecessary normalization.
- pull/34633, pull/34735, pull/34917, pull/34898
- Support and compatibility updates for new hardware and software: Updates include enabling `tcgen05` instructions on Thor devices (sm110) to fix test failures and ensure CUDA 13.0 compatibility, and adding buffer type information to the GpuExecutable memory allocation profile to aid optimization.
- pull/34705, pull/34802
- Improvements to debugging and documentation: One pull request updates onednn custom call operation names to include target-specific identifiers for better debugging clarity, and another fixes a broken or moved link in the GPU flag guidance documentation to maintain accurate references.
- pull/34715, pull/34864
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 5
Key Closed Pull Requests
1. [XLA:GPU] Fix the command_buffer_cmd vlog debug message to profile the memory allocation types.: This pull request aims to fix the command_buffer_cmd vlog debug message in order to accurately profile the memory allocation types within the XLA GPU backend.
- URL: pull/34657
- Merged: No
2. [GPU] Fix mixed inputs F8 dot in Triton codegen.: This pull request addresses a bug fix for handling mixed inputs in the F8 dot operation within the Triton code generation for GPU, including unit and execution tests to ensure correctness.
- URL: pull/34704
- Merged: No
- Associated Commits: c8672
3. Preserve existing backend config in WhileLoopTripCountAnnotator: This pull request addresses a bug in the WhileLoopTripCountAnnotator by modifying it to preserve existing backend configuration data instead of overwriting it, thereby preventing the loss of previously set dynamic variable information critical for subsequent optimization passes.
- URL: pull/34707
- Merged: No
- Associated Commits: d1daf
Other Closed Pull Requests
- ROCm Nightly Build Test Execution Separation: This pull request proposes splitting the execution of multi-GPU and single-GPU tests in the ROCm nightly build process to prevent test clashes and enable the use of different run scripts that properly manage GPU visibility. This separation improves test reliability by isolating GPU environments during testing.
- pull/34750
- XLA GPU Backend ScatterDeterminism Bug Fix: This pull request fixes a bug in the XLA GPU backend by correcting the ScatterDeterminismExpander to properly handle batched scatter operations normalized by BatchedGatherScatterNormalizer. It ensures deterministic scatter behavior by accurately computing scalar indices using scatter_dims_to_operand_dims for sorting, preventing index collisions across batches and enabling reliable execution of models utilizing batched scatter such as batched attention and embedding lookups.
- pull/34870
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| alekstheod | 32 | 5 | 0 | 8 |
| amd-songpiao | 11 | 3 | 0 | 2 |
| shawnwang18 | 11 | 5 | 0 | 0 |
| terryysun | 9 | 4 | 0 | 0 |
| sergachev | 6 | 3 | 0 | 0 |
| hsharsha | 2 | 2 | 0 | 5 |
| pemeliya | 5 | 2 | 0 | 0 |
| Cjkkkk | 4 | 3 | 0 | 0 |
| dimitar-asenov | 0 | 0 | 0 | 7 |
| akuegel | 0 | 0 | 0 | 7 |