Weekly GitHub Report for XLA: August 25, 2025 - September 1, 2025
Weekly GitHub Report for XLA
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version release information was available to summarize this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Support LLVM21 + blackwell family: This issue requests support for LLVM version 21 and the Blackwell GPU family, including Spark, Thor, and GB300 architectures. The user reports build failures when attempting to compile JAX with these configurations, encountering errors related to unsupported CUDA GPU architectures and unknown compiler arguments during the build process.
- The comments include a detailed build log showing compilation errors caused by clang not recognizing certain CUDA-related flags and unsupported GPU architectures, followed by a reference to an LLVM21-related commit in an external toolchain repository as a potential lead or solution.
- Number of comments this week: 2
- Formatter checking behavior mismatch between OSS and Google Internal: This issue highlights a discrepancy in formatter checking behavior between the open-source version and Google's internal systems, where a pull request passes external style checks but fails internal tests due to missing trailing newlines in newly added files. The reporter requests alignment of these checks to avoid delays caused by the lack of visibility into internal test failures, which currently require intervention from Google personnel to resolve (a small trailing-newline check is sketched after this list).
- The single comment acknowledges the issue and notes that it is also impacting another related issue, indicating that this inconsistency is a broader problem affecting multiple contributions.
- Number of comments this week: 1
Since there were fewer than 5 active issues, all of them have been listed above.
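As an aside on the formatter mismatch above: the failing internal check reportedly amounts to requiring a trailing newline at the end of each newly added file. Here is a minimal sketch of such a check, assuming nothing about Google's actual tooling:

```python
import sys

def missing_trailing_newline(path: str) -> bool:
    """Return True if the file is non-empty and does not end with a newline."""
    with open(path, "rb") as f:
        data = f.read()
    return bool(data) and not data.endswith(b"\n")

if __name__ == "__main__":
    # Usage: python check_newlines.py FILE [FILE ...]
    bad = [p for p in sys.argv[1:] if missing_trailing_newline(p)]
    for path in bad:
        print(f"{path}: missing trailing newline")
    sys.exit(1 if bad else 0)
```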
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- New `nvshmem` rule breaks the build: This issue reports a build failure caused by a new `nvshmem` rule introduced in a recent update, which leads to an error related to the absence of the `getenv` method in the `repository_ctx` object during the CUDA configuration step. The reporter is seeking guidance on whether they need to make changes on their side to resolve this error, particularly in relation to recent changes affecting JAX, or if the problem requires a fix within the OpenXLA project, along with an estimated timeline for such a fix.
- Failed to Parse MLIR generated by Torchax: This issue describes a problem encountered when exporting a PyTorch model to MLIR using the torch-xla torchax export API, where the generated MLIR fails to parse due to an unregistered operation 'vhlo.rsqrt_v2' in the VHLO dialect. The user reports that this error prevents deserialization of the portable artifact with StableHLO_v1.9.5, despite using compatible versions of torch, torchxla, and building XLA from the corresponding commit, and has provided code snippets and bytecode samples to assist in troubleshooting.
- support bazel modules: This issue requests the adoption of Bazel modules within the project, highlighting that Bazel modules have seen significant adoption in the community. The user points out that XLA is currently the only package in their Bazel build that does not support these modules, implying a need for compatibility improvements.
- Gpu collective performance model bug: This issue addresses a bug in the gpu_collective_performance model where a recent update correctly adjusts the lowLatencyBandwidth for AMD links but fails to apply the same update to the CUDA section. As a result, invoking the gpu_collective_performance model with H100 GPU settings leads to a failure, indicating incomplete handling of bandwidth parameters across different GPU architectures.
Since there were fewer than 5 stale issues, all of them have been listed above.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 3
Summarized Issues:
- Build and Compatibility Issues: Several issues highlight build failures and compatibility problems, including unsupported CUDA GPU architectures and clang errors when targeting newer platforms like LLVM 21 and the Blackwell GPU family. Additionally, a debug build fails due to a template instantiation error involving `absl::log_internal::Check_NEImpl` and `std::unique_ptr`, indicating challenges in maintaining successful builds across different configurations.
- issues/30726, issues/30742
- Formatting and Testing Discrepancies: There is a discrepancy between open-source formatting checks and Google's internal Copybara tests, where newly added files must end with a trailing newline to pass the internal check. This inconsistency causes confusion and delays in merging due to conflicting feedback and lack of visibility into internal test failures.
- issues/30652
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 0
Summarized Issues:
As of our latest update, there were no issues closed in the project this week.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 8
Key Open Pull Requests
1. Expose Multi-Host HLO Runner in Python: This pull request introduces a new feature that exposes the multi-host HLO runner through a nanobind interface, enabling Python programs to register and execute custom calls, thereby addressing the issue of non-executable HLOs containing custom calls due to unlinked custom call targets (a hedged usage sketch follows these key pull requests).
- URL: pull/30706
- Merged: No
2. [XLA:GPU] Lowering command buffer's `WhileCmd` into an unrolled cuda-graph: This pull request introduces a new feature that enables lowering the `WhileCmd` in the command buffer into an unrolled CUDA graph by precomputing the loop trip count and modifying the state management to support multiple unrolled iterations, thereby overcoming limitations of the CUDA graph loop operator and improving support for patterns like NCCL kernels and DynamicSliceFusion in JAX LLM models.
- URL: pull/30695
- Merged: No
3. [XLA:GPU][oneAPI] Stub for IntelGpuCompiler: This pull request introduces a preliminary stub for the IntelGpuCompiler within the XLA GPU oneAPI backend, laying the groundwork for future feature implementations.
- URL: pull/30714
- Merged: No
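To make the first key pull request above concrete, here is a hedged sketch of what registering and running a custom call from Python might look like. The module path and function names (`hlo_runner`, `register_custom_call_target`, `run_hlo`) are illustrative assumptions, not the actual API introduced by pull/30706; only the HLO text uses real syntax.

```python
import numpy as np

# Hypothetical module name; the real nanobind binding may live elsewhere.
from xla import hlo_runner

def double(out, in0):
    # Toy custom-call body; real targets must follow XLA's custom-call ABI.
    np.copyto(out, in0 * 2.0)

# Registering the target is what makes an HLO module containing
# custom-call instructions executable, rather than failing to link.
hlo_runner.register_custom_call_target("double", double)  # hypothetical name

hlo_text = """
HloModule m
ENTRY e {
  p0 = f32[4] parameter(0)
  ROOT c = f32[4] custom-call(p0), custom_call_target="double"
}
"""
result = hlo_runner.run_hlo(hlo_text, [np.ones(4, dtype=np.float32)])  # hypothetical name
print(result)
```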
Other Open Pull Requests
- Kernel Registration Improvements: This pull request proposes using the base address for symmetric kernel registration to improve the registration process, potentially enhancing performance or correctness in relevant workloads. The change aims to streamline kernel registration by leveraging a consistent base address approach.
- pull/30610
- Algebraic Simplifier Enhancements: This update to the algebraic simplifier replaces no-op bitcast-converts with bitcasts even when the bit widths of the types differ. This change refines the simplification process by handling bit width differences more effectively (a JAX example of a bitcast-convert appears after this list).
- pull/30704
- SYCL Context and Device Pool Implementation: This pull request implements a SYCL context and device pool for XLA on GPU with oneAPI, including the addition of corresponding tests. It introduces foundational support for SYCL-based GPU execution without dependencies on other changes.
- pull/30716
- CollectiveBackendAssigner Optimization: This modification applies the size threshold exclusively to AllReduce operations, removing it for CollectivePermute ops to improve performance by leveraging NVSHMEM's consistent low-latency behavior regardless of message size. The change optimizes collective communication by tailoring thresholds to operation types.
- pull/30718
- Scaled Dot Fusion Support: New `LHS_SCALE` and `RHS_SCALE` scopes are introduced to the `TritonFusionAnalysis` to enable support for scaled dot fusion. This facilitates the handling of scaled-dot fusion in upcoming fusion emitters, enhancing fusion capabilities.
- pull/30749
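For background on the algebraic simplifier item in the list above, here is a small JAX example that emits a bitcast-convert. With matching bit widths (float32 to uint32 here) the op is a pure reinterpretation of the bytes, which is the no-op case a simplifier can rewrite into a plain bitcast; the pull request extends that rewrite to types with differing bit widths.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Reinterpret float32 bits as uint32: same bit width, so the stored
    # bytes are untouched and only the element type changes.
    return jax.lax.bitcast_convert_type(x, jnp.uint32)

x = jnp.arange(4, dtype=jnp.float32)
# Print the lowered module text; the bitcast-convert op is visible there.
print(jax.jit(f).lower(x).as_text())
```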
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 15
Key Closed Pull Requests
1. [XLA:GPU] Refactor dynamic slice fusion lowering code to reduce the API calls: This pull request aims to refactor the dynamic slice fusion lowering code in the XLA GPU backend to reduce redundant API calls, specifically minimizing the number of calls to `GetLoopInductionVarTupleIdx`.
- URL: pull/30397
- Merged: No
2. Add ScopedClonedModuleCallInliner class in call_inliner: This pull request introduces the ScopedClonedModuleCallInliner class in the call_inliner module, which clones and inlines a target module during initialization to enable loop analysis on modules that cannot be modified directly, addressing issues with the while_loop_analysis pass on modules parsed by the command buffer rewriter.
- URL: pull/29884
- Merged: No
3. [DOC] Add docs for HLO Dumps: This pull request proposes adding a new documentation page detailing how to obtain HLO Dumps in various formats across different environments, including instructions on filtering, transforming, and replaying these dumps, as well as introducing a new "Debugging" section in the sidebar to organize this and future debugging guides (a minimal dump example follows these key pull requests).
- URL: pull/30414
- Merged: No
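As a minimal illustration of the topic of the documentation pull request above, HLO dumps can typically be obtained with the standard `--xla_dump_to` debug option, passed through the `XLA_FLAGS` environment variable before XLA is initialized. A sketch using JAX as the client:

```python
import os

# Standard XLA debug option: write HLO dumps into this directory.
# It must be set before XLA initializes, hence before importing jax.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import jax
import jax.numpy as jnp

jax.jit(lambda x: x * 2.0)(jnp.ones(4))

# The directory now holds the module's HLO before and after optimization.
print(sorted(os.listdir("/tmp/xla_dump")))
```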
Other Closed Pull Requests
- XLA GPU backend command buffer refactoring and scheduling improvements: Multiple pull requests focus on enhancing the XLA GPU backend's command buffer management by refactoring command dependency implementation with a unified token resource and moving the left-hand side scheduling functionality to the execution graph. These changes aim to simplify dependency management and improve scheduling organization within the GPU backend.
- CUDA graph and command buffer execution enhancements: There is a new feature allowing multiple CommandBufferCmdExecutor instances to be recorded sequentially into a single CUDA graph command buffer, enabling flexible construction with dependencies and optional finalization. This improves the ability to build complex CUDA graphs from multiple command sequences.
- Bug fixes in HLO instruction handling and collective permute cloning: Bug fixes include enabling access to the `precision_config` attribute for the `scaled-dot` HLO operation and correcting the cloning process of the collective permute instruction to clone all operands, which is necessary for variadic operations. These fixes address failures and issues in gradient accumulation scenarios.
- Triton fusion analysis cleanup and GPU command buffer improvements: One pull request removes an unused enum value from Triton fusion analysis as a cleanup step, while another adds the legacy Triton custom call target to the GPU command buffer by default to improve performance and reduce user configuration. These changes maintain code hygiene and enhance GPU command buffer functionality.
- Support for CPU executable debugging and multi-GPU test filtering: A pull request enables dumping optimized HLO when deserializing CPU executables to improve debugging and testing in the JAX persistent compilation cache workflow. Another filters out multi-GPU tagged unit tests from the single-GPU test suite to ensure proper test execution environments.
- CUDA architecture version update for Thor GPU: The SM number for the Thor GPU is renamed from sm_101 to sm_110 to align with CUDA 13 and later versions, reflecting that sm_101 will go unused and that support targets CUDA 13 onward only. This update ensures compatibility with newer CUDA releases.
- Float normalization disabling for bitcasts (unmerged): A pull request proposes disabling float normalization for bitcasts to correctly handle no-op data type conversions, but this change has not been merged.
- IntelGpuCompiler preliminary stub introduction: A preliminary stub for the IntelGpuCompiler is introduced within the XLA GPU oneAPI project as a foundational placeholder for future feature development.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| shawnwang18 | 41 | 9 | 0 | 6 |
| Arech8 | 4 | 1 | 0 | 5 |
| sergey-kozub | 5 | 4 | 0 | 0 |
| Copilot | 0 | 0 | 0 | 9 |
| SandSnip3r | 0 | 0 | 0 | 9 |
| mgoldfarb-nvidia | 6 | 1 | 0 | 1 |
| pemeliya | 6 | 1 | 0 | 0 |
| Zoey-Cheng | 5 | 1 | 0 | 0 |
| beckerhe | 2 | 0 | 0 | 4 |
| mraunak | 6 | 0 | 0 | 0 |