Weekly GitHub Report for Xla: September 01, 2025 - September 08, 2025 (12:01:06)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version information was available to summarize this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- New `nvshmem` rule breaks the build: This issue reports a build failure caused by a new `nvshmem` rule introduced in a recent update, which leads to an error about a missing `getenv` method on the `repository_ctx` object during the CUDA configuration step. The reporter is seeking guidance on whether any changes are needed on their side to resolve the error, particularly in relation to recent pull requests affecting JAX, and is also asking about the timeline for a fix if the problem originates within the OpenXLA project itself.
- Failed to Parse MLIR generated by Torchax: This issue describes a problem encountered when exporting a PyTorch model to MLIR using the torch-xla torchax export API, where the generated MLIR fails to parse due to an unregistered operation `vhlo.rsqrt_v2` in the VHLO dialect. The user is attempting to compile the exported MLIR into an XLA binary using XLA AOT compilation but faces deserialization errors with StableHLO, despite using compatible versions of torch and torch-xla and building XLA from the corresponding commit.
- support bazel modules: This issue requests the adoption of Bazel modules within the project, highlighting that Bazel modules have seen significant usage and benefits. The reporter notes that XLA is currently the only package in their Bazel build that lacks support for these modules, implying a need for improved compatibility.
- Gpu collective performance model bug: This issue concerns a bug in the gpu_collective_performance model where the recent update to the lowLatencyBandwidth for AMD links was not consistently applied to the CUDA section. As a result, invoking the gpu_collective_performance model with H100 GPU settings leads to a failure, indicating incomplete or inconsistent parameter updates within the model. Since there were fewer than 5 open issues, all of the open issues have been listed above.
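The gpu_collective_performance bug above is an instance of a common failure mode: a per-vendor parameter table updated in one branch but not the other. The following is a minimal Python sketch of that failure mode, not XLA's actual code; the table name, platforms, and values are all hypothetical.

```python
# Illustrative sketch only: a per-vendor bandwidth table like the one the
# issue describes. If an update touches only the ROCm entries, a lookup
# for an NVIDIA chip such as the H100 fails at query time.
LOW_LATENCY_BANDWIDTH_GBPS = {
    "rocm": {"mi300": 50.0},  # hypothetical value: the updated entry
    "cuda": {},               # hypothetical: the matching H100 entry was never added
}

def low_latency_bandwidth(platform, chip):
    """Look up bandwidth, failing loudly when the table is inconsistent."""
    try:
        return LOW_LATENCY_BANDWIDTH_GBPS[platform][chip]
    except KeyError:
        raise KeyError(f"no lowLatencyBandwidth entry for {platform}/{chip}")
```

A consistency check over all platforms at table-definition time would turn this class of bug into an immediate failure rather than one triggered only by a particular GPU configuration.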
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 1
Summarized Issues:
- Building and running NumPyro with JAX GPU support on Windows: This issue highlights the difficulty of building and running NumPyro with JAX and jaxlib GPU support on Windows without using WSL. The user needs to integrate with Amazon APIs and is seeking guidance on overcoming build obstacles related to the Bazel build system.
- issues/31038
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 1
Summarized Issues:
- Build Failures Due to Compilation Errors: The build failure occurs during a debug build of the XLA project, caused by a compilation error in `xnn_fusion_thunk.cc`. The error is related to a missing matching function for a call to `Detect` in the Abseil logging library, which triggers a template instantiation failure involving a `std::unique_ptr` and results in a clang compiler error.
- issues/30742
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. [XLA:CPU][oneDNN] Enable oneDNN MatMul Custom Calls in Thunk Runtime: This pull request enables support for oneDNN MatMul operations in the XLA:CPU Thunk runtime by implementing the base OneDnnOpThunk, extending memory utilities to manage asynchronous execution resources, updating the thunk emitter and compiler to handle oneDNN custom call ops for MatMul, and modifying build rules to include the new library and tests.
- URL: pull/30997
- Merged: No
2. [DOC] Update to operation_semantics: This pull request updates the operation_semantics documentation by fixing spelling errors, improving declarations and tables, adding new example code and images, including links to related resources, and detailing limitations and usage information for various operations such as AllReduce, AllToAll, Dot, Recv, ReduceScatter, ReduceWindow, Scatter, Send, Slice, and Convolution.
- URL: pull/30998
- Merged: No
3. [XLA:GPU] Enable RBE for the ONEAPI presubmit: This pull request aims to enable Remote Build Execution (RBE) for the ONEAPI presubmit process by making SYCL autoconfiguration compatible with RBE through replacing repository_ctx.getenv(...) calls with get_host_environ(repository_ctx, ...) to leverage remote caching and parallelism.
- URL: pull/30875
- Merged: No
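The `getenv`-to-`get_host_environ` change in PR 3 above follows a standard pattern for making Bazel repository rules RBE-compatible. The sketch below is illustrative rather than the exact XLA diff: the environment-variable name and rule skeleton are hypothetical, and the load path assumes the helper lives in XLA's remote-config utilities as stated in the PR summary.

```starlark
# Sketch of the described pattern (Starlark for a Bazel repository rule).
load("//third_party/remote_config:common.bzl", "get_host_environ")

def _sycl_autoconf_impl(repository_ctx):
    # Before: repository_ctx.getenv reads the environment of the machine
    # running Bazel, which is invisible to Remote Build Execution and
    # defeats remote caching.
    # path = repository_ctx.getenv("SYCL_TOOLKIT_PATH")  # hypothetical variable

    # After: an RBE-aware lookup that can be resolved consistently,
    # enabling remote caching and parallelism.
    path = get_host_environ(repository_ctx, "SYCL_TOOLKIT_PATH")
    return path
```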
Other Open Pull Requests
- ROCm backend enhancements and fixes: Multiple pull requests improve ROCm backend capabilities by adding custom all_reduce operations with fine-grained memory management and supporting BF16 and FP16 CK-Tile GEMM operations as custom fusions. Additional enhancements include CommandBuffer support for CollectivePermute and Convolution operations with reduced graph fragmentation and a fix for FP8 type handling on AMD gfx942.
pull/30965, pull/30798, pull/30855, pull/30977
- oneDNN runtime and fusion improvements: These pull requests introduce a runtime flag to enable oneDNN custom calls in experimental mode and update the OneDnn threadpool to use asynchronous execution, improving performance for onednn_fusion_thunks. A bug fix in the OneDnnFusion emitter ensures unique logical tensor IDs for parameters, preventing incorrect tensor mappings and enabling better backend fusion of binary operations.
pull/31025, pull/31003, pull/31035
- Fusion compiler and GEMM operation enhancements: Pull requests add support for lowering the scaled dot HLO operation to the cuDNN graph and modify the SplitK GEMM rewriter to enable split-k transformations on block scaled dot fusions. These changes enhance the fusion compiler and triton legacy emitter by handling block scaled dot operations with padding scaling tensors and introducing helper functions and pattern matchers.
pull/30847, pull/30967
- Scheduling and threading fixes: Several pull requests address scheduling improvements by modifying passes to utilize both main host and parallel threads, fixing a deadlock by scheduling cross-host data transfers on XLA launcher threads, and updating test expectations for parallel transpose computations in the thunk runtime. These changes ensure correct async compute pipeline handling and smoother multi-host execution.
pull/30984, pull/31027, pull/31028
- XLA GPU compiler optimization: This pull request moves the ReduceScatterCreator pass to run after the AlgebraicSimplifier, enabling detection and conversion of more all-reduce operations into reduce-scatter operations, which improves performance in the XLA GPU compiler.
pull/31030
- cuDNN and bias shape flexibility: A pull request removes constraints on the cuDNN sdpa dbias bias shape, allowing support for additional bias shape broadcastings without changing HLO and updating relevant tests, thereby enhancing dbias computation flexibility.
pull/30996
- Documentation and typographical fixes: These pull requests fix Markdown table generation in the `flags_guidance.md` file and correct various typographical errors in the codebase without functional changes.
pull/30894, pull/31018
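The ReduceScatterCreator reordering above illustrates a general point about compiler pass pipelines: a canonicalization pass run first lets a later pattern-matching pass recognize more instances of its target. The toy Python sketch below demonstrates the idea with made-up operation names; it is not real XLA passes or HLO.

```python
# Toy model: a "simplifier" canonicalizes ops, after which a rewrite pass
# finds patterns it would otherwise miss. All op names are hypothetical.

def simplify(ops):
    # Canonicalize by folding away a hypothetical "noop(...)" wrapper.
    return [
        op.removeprefix("noop(").removesuffix(")") if op.startswith("noop(") else op
        for op in ops
    ]

def rewrite_all_reduce(ops):
    # Rewrite every canonical "all-reduce" into "reduce-scatter".
    return ["reduce-scatter" if op == "all-reduce" else op for op in ops]

program = ["all-reduce", "noop(all-reduce)"]
```

Running `rewrite_all_reduce` before `simplify` converts only the bare op; running the simplifier first converts both, which mirrors why moving ReduceScatterCreator after the AlgebraicSimplifier lets it detect more all-reduce operations.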
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 24
Key Closed Pull Requests
1. Expose Multi-Host HLO Runner in Python: This pull request proposes a new feature to expose the multi-host HLO runner through a nanobind interface, enabling Python programs to register and execute custom calls that were previously unexecutable due to unlinked custom call targets.
- URL: pull/30706
- Merged: No
2. [XLA:CPU][oneDNN] Add Base Implementation of oneDNN Thunk via Custom Call FFI: This pull request introduces the foundational implementation of the `OneDnnThunk` in XLA:CPU, enabling execution of oneDNN-based operations via a typed FFI custom call interface, including new source files, resource management extensions, basic unit tests, and updated build rules.
- URL: pull/30562
- Merged: No
3. [XLA:GPU] Add command buffer NestedChildCmd unittest.: This pull request proposes adding a unit test to verify the functionality of nested command buffers, specifically testing the creation and update operations of a ChildCmd inside another ChildCmd, to improve testing coverage for the XLA GPU command buffer implementation.
- URL: pull/30799
- Merged: No
Other Closed Pull Requests
- BatchNormGrad documentation updates: Multiple pull requests update the BatchNormGrad operation documentation by renaming parameters from `mean` to `batch_mean` and `variance` to `batch_var` to better align with the function signature. These changes improve clarity and consistency in the operation_semantics.md file.
- Operation semantics documentation improvements for Dot and DotGeneral: Several pull requests enhance the operation_semantics.md documentation by improving table formatting for consistency and adding detailed descriptions of the `precision_config` and `preferred_element_type` fields. These updates clarify the types and semantics of these fields in both Dot and DotGeneral operations.
- Comprehensive operation semantics restructuring and additions: One pull request proposes extensive documentation updates including restructuring and adding details for operations like CompositeCall, Conv, and Reduce, along with formatting improvements and links to the StableHLO specification. This enhances the overall clarity and completeness of the operation semantics documentation.
- GPU backend bitcast-convert and layout assignment fixes and optimizations: Multiple pull requests address GPU backend improvements by optimizing bitcast-convert operations involving different type widths, normalizing layouts for such operations, fixing broadcast shape layout bugs in block scaled dot custom calls, and disabling certain fusions of bitwidth-changing bitcasts to prevent indexing and memory access issues. These changes improve correctness and performance in GPU layout assignment and fusion processes.
- pull/30818, pull/30821, pull/30856, pull/30864
- CollectiveBackendAssigner performance improvement: A pull request removes the size threshold condition from CollectivePermute operations in the CollectiveBackendAssigner, applying size-based backend selection only to AllReduce operations. This change is based on benchmarking showing NVSHMEM provides consistent low latency for CollectivePermute regardless of message size.
- SYCL support and Remote Build Execution (RBE) enablement: Pull requests add support for the sycl_event component with tests, update the sycl_library wrapper for SYCL-specific build and linking options, implement error handling for SYCL exceptions, and enable RBE for the ONEAPI presubmit process by making SYCL autoconfiguration compatible with RBE. These changes facilitate easier maintenance and remote caching capabilities for SYCL targets.
- Test improvements and debugging enhancements: Several pull requests improve testing and debugging by moving HLO-based end-to-end tests for better organization, fixing and re-enabling GPU convolution layout assignment tests, and adding source and sink node counts to CUDA graph debug output. These updates enhance test reliability and developer insight.
- IntelGpuCompiler stub introduction: A pull request introduces a preliminary stub for the IntelGpuCompiler within the XLA GPU oneAPI backend as a foundational placeholder for future feature development.
- TritonFusionAnalysis scaled dot fusion support: A pull request adds `LHS_SCALE` and `RHS_SCALE` scopes to the TritonFusionAnalysis to enable support for scaled dot fusion, facilitating upcoming fusion emitter capabilities.
- cusolver syevBatched function exposure for JAX: A pull request exposes the syevBatched functions from the cusolver library to JAX, enabling faster computation of jax.lax.linalg.eigh for non-trivial batch sizes by leveraging batched operations.
- Bitcast decomposition error handling: A pull request proposes that bitcast decomposition should fail when there are mismatches in element counts or bitwidths, improving error detection during these operations.
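The bitcast decomposition item above describes an invariant that is easy to state concretely: a bitcast is only valid when the total number of bits is preserved across shapes and element types. The helper below is a minimal Python sketch of that check, not XLA's implementation; the function name and error message are hypothetical.

```python
def check_bitcast(src_count, src_bitwidth, dst_count, dst_bitwidth):
    """Validate that a bitcast preserves the total bit count.

    Hypothetical helper illustrating the invariant: element count times
    element bitwidth must match on both sides, otherwise fail loudly.
    """
    src_bits = src_count * src_bitwidth
    dst_bits = dst_count * dst_bitwidth
    if src_bits != dst_bits:
        raise ValueError(
            f"bitcast mismatch: {src_bits} source bits vs {dst_bits} destination bits")
    return dst_bits

# For example, reinterpreting 4 x 32-bit elements as 16 x 8-bit elements
# preserves 128 bits and passes; 4 x 32-bit to 8 x 8-bit drops bits and fails.
```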
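The CollectiveBackendAssigner item above describes a selection policy: always use NVSHMEM for CollectivePermute, and keep size-based selection only for AllReduce. A minimal Python sketch of that policy follows; the function, backend names, and threshold value are hypothetical, not XLA's actual code.

```python
# Illustrative sketch of the selection policy described in the summary.
DEFAULT_BACKEND = "nccl"            # hypothetical fallback backend
SIZE_THRESHOLD_BYTES = 1 << 20      # hypothetical cutoff for AllReduce

def pick_backend(op_kind, message_bytes):
    if op_kind == "collective-permute":
        # Benchmarking showed NVSHMEM gives consistently low latency for
        # CollectivePermute regardless of message size, so no size check.
        return "nvshmem"
    if op_kind == "all-reduce":
        # Size-based backend selection is retained for AllReduce only.
        return "nvshmem" if message_bytes < SIZE_THRESHOLD_BYTES else DEFAULT_BACKEND
    return DEFAULT_BACKEND
```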
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
Contributor | Commits | Pull Requests | Issues | Comments |
---|---|---|---|---|
shawnwang18 | 36 | 4 | 0 | 1 |
othakkar | 13 | 3 | 0 | 7 |
athurdekoos | 12 | 8 | 0 | 1 |
sergachev | 10 | 7 | 0 | 0 |
mraunak | 14 | 2 | 0 | 0 |
sergey-kozub | 8 | 5 | 0 | 0 |
pemeliya | 7 | 2 | 0 | 0 |
penpornk | 1 | 0 | 0 | 7 |
mgoldfarb-nvidia | 6 | 1 | 0 | 1 |
Zoey-Cheng | 5 | 1 | 0 | 0 |