Weekly GitHub Report for Xla: March 23, 2026 - March 30, 2026 (22:23:46)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version release information was available to summarize this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [A/B DIFF BENCHMARKING] BatchNormExpander: missing test coverage for inference, sharding, and disabled-flag paths: This issue highlights missing test coverage in the batchnorm_expander.cc file for several specific code paths, including inference, sharding, and disabled-flag scenarios. It suggests adding tests that mirror existing patterns to ensure these untested paths are properly validated.
  - The comments express openness to contributions and confirm that a pull request has been submitted addressing all the identified coverage gaps, including basic inference expansion, sharding for inference and gradient, and disabled rewrite flag cases.
  - Number of comments this week: 2
- [ERR:BUILD] "no declaration matches 'tsl::Future" when building pjrt helpers: This issue describes a compilation error encountered when building the pjrt_c_api_helpers target in the XLA project, specifically related to a missing declaration for the tsl::Future template's Flatten method. The user provides detailed reproduction steps including Bazel build configurations, workspace setup, and a Dockerfile to replicate the problem on a specific commit.
  - The comment suggests trying a specific commit as a potential fix for the compilation error, but no further discussion or resolution is provided.
  - Number of comments this week: 1
Since fewer than 5 issues received comments this week, all such issues have been listed above.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 19
Summarized Issues:
- StableHLO Runtime Failures Due to Null or Missing Pointers: Multiple issues report runtime failures after converting HLO modules to StableHLO caused by null pointers or missing mesh/device assignments. These failures include null convolution dimension numbers, null window pointers during convolution padding legalization, missing mesh names, and null device assignments in collective runtime components.
- issues/39821, issues/39822, issues/39824, issues/39825
- StableHLO Runtime Check Failures Related to Parameter and Shape Validation: Several issues describe RET_CHECK or check failures triggered by invalid parameter indices, unexpected non-tuple shapes, or mismatched parameter shapes in StableHLO execution. These validation errors cause the translated StableHLO to fail despite the original HLO running successfully.
- issues/39823, issues/39828, issues/39830
- Internal Errors from Shape and Tuple Mismatches in StableHLO: Multiple internal errors arise from mismatched tuple structures or shape mismatches during StableHLO execution, including issues with original_value metadata, tuple operands in post-layout assignment, and async output shape mismatches. These shape inconsistencies cause runtime failures in the translated StableHLO modules.
- issues/39827, issues/39833, issues/39835, issues/39836
- StableHLO Execution Errors Due to Unsupported or Unexpected Operations: Some issues report errors caused by unsupported dynamic input dimensions in reshape operations, unexpected bitcast operations during layout assignment, and invalid argument types in binary operations. These operational mismatches lead to execution failures in StableHLO despite successful original HLO runs.
- issues/39826, issues/39829, issues/39831
- StableHLO Unimplemented Features and Parallelism Conflicts: One issue highlights an "UNIMPLEMENTED" error triggered when a computation is invoked in both parallel and sequential contexts, indicating incomplete support for certain parallelism scenarios in StableHLO execution.
- issues/39832
- Buffer Debug XOR Checksum Kernel Corruption Detection Bug: A bug in the buffer debug XOR checksum kernel causes failure to detect certain sum-preserving corruption patterns, leading to false negatives and masked hardware errors during corruption detection.
- issues/39850
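The failure mode described above can be illustrated with a minimal sketch (hypothetical word values, unrelated to the actual GPU kernel): an XOR-fold checksum cannot detect any corruption whose flipped bits cancel across words, such as the same bit position flipping in two different words.

```python
from functools import reduce

def xor_checksum(words):
    """Fold all buffer words into one checksum word via XOR."""
    return reduce(lambda a, b: a ^ b, words, 0)

original = [0x11, 0x22, 0x33, 0x44]
# Flip the same bit (0x80) in two words: the per-word XORs cancel out,
# so the checksum is unchanged even though the buffer is corrupted.
corrupted = [0x11 ^ 0x80, 0x22 ^ 0x80, 0x33, 0x44]

assert corrupted != original
assert xor_checksum(corrupted) == xor_checksum(original)  # false negative
```

This is why such "XOR-preserving" (and, for additive checksums, sum-preserving) corruption patterns slip past the detector and can mask real hardware errors.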
- Insufficient Test Coverage in BatchNorm Expander: There is a lack of test coverage for several untested code paths in batchnorm_expander.cc, including BatchNorm inference, sharding scenarios, and disabled-flag conditions, necessitating additional tests following existing patterns.
- issues/39930
- Compilation Error Due to Missing Declaration in pjrt_c_api_helpers: A compilation failure occurs in the pjrt_c_api_helpers target caused by a missing declaration for tsl::Future related to the Flatten() method in future.h on a specific commit.
  - issues/40031
- SPMD Partitioner Gradient Bug Causing Incorrect Training Results: A bug in the SPMD partitioner leads to incorrect gradients during model training on certain device counts by failing to insert an all-reduce operation when resharding from unreduced to mixed sharded and replicated axes, resulting in inconsistent sharded matrix multiplication outcomes.
- issues/40034
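The consequence of the missing all-reduce can be seen in a minimal NumPy sketch (hypothetical shapes, not the actual XLA partitioner code): when a matmul's contraction dimension is sharded, each device holds only a partial, unreduced result, and treating a partial as replicated output yields wrong values until the partials are summed, which is exactly what an all-reduce does.

```python
import numpy as np

# C = A @ B with the contraction dimension split across two "devices":
# A is sharded column-wise, B row-wise; each device computes a partial C.
A = np.arange(8.0).reshape(2, 4)
B = np.arange(12.0).reshape(4, 3)

partial0 = A[:, :2] @ B[:2, :]   # device 0's unreduced contribution
partial1 = A[:, 2:] @ B[2:, :]   # device 1's unreduced contribution

# Skipping the reduction (the missing all-reduce) gives a wrong result:
assert not np.allclose(partial0, A @ B)
# Summing the partials, as an all-reduce would, recovers the true product:
assert np.allclose(partial0 + partial1, A @ B)
```

In a training step the same mistake silently corrupts gradients, which matches the inconsistent results reported on certain device counts.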
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 0
Summarized Issues:
As of our latest update, there were no issues closed in the project this week.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 19
Key Open Pull Requests
1. [ROCm] Use hermetic clang for rocm: This pull request updates the ROCm build process to use a hermetic LLVM dependency along with its provided Clang compiler for compiling XLA, thereby removing the reliance on the system-installed LLVM.
- URL: pull/39854
2. Improvements to the HBM OOM Error page (Error E1000): This pull request improves the HBM OOM Error page (Error E1000) by adding a detailed table summarizing potential interventions for out-of-memory errors with rankings on safety, impact, and caveats, along with notes on the drawbacks of specific techniques such as host offloading, microbatching, manual checkpointing, advanced sharding, and additional context on donating input buffers, based on insights from an internal user experience research study.
- URL: pull/39926
3. [ROCm] Refactor Triton backend for AllReduce and Enable ROCm support: This pull request refactors the Triton backend for AllReduce by replacing CUDA-specific inline assembly lowering of atomic operations with a unified approach using Triton extern_elementwise operations and LLVM intrinsics, thereby enabling ROCm support for the AllReduce Triton backend.
- URL: pull/39924
Other Open Pull Requests
- ROCm backend improvements and fixes: Multiple pull requests enhance the ROCm backend by refactoring tests for platform-agnostic use, fixing bf16 upcast/downcast handling for libdevice calls, porting CUB sort FFI handler refactoring, removing deprecated HIP context APIs, updating CI setup to decouple from TensorFlow, fixing ROCm symlink parsing, modifying ScaledDotRewriter for cublasLt compatibility, and adding GPU context activation with error clearing in FFI handlers. These changes collectively improve ROCm support, maintain compatibility with AMD recommendations, and address flaky test failures.
- Memory allocation and tracking improvements: A pull request migrates CUDA memory allocation and deallocation to use the thread-safe MemoryAllocator::AllocationTracker API, replacing legacy free functions and unifying allocation tracking. This prepares the codebase for future migration of collective memory allocation and improves thread safety.
- BatchNormExpander test coverage and behavior: One pull request adds comprehensive test coverage for previously untested BatchNormExpander code paths, including BatchNormInference expansion and sharding propagation for inference and gradient computations. It also verifies that the rewrite pass is a no-op when all related flags are disabled, ensuring robustness of the BatchNormExpander.
- Global hang watchdog consolidation: A pull request introduces a global per-process hang watchdog for XLA, consolidating multiple watchdogs into a single instance. This change efficiently covers most use cases and simplifies hang detection.
- Convolution robustness with fallback plans: One pull request introduces a fallback mechanism that uses BF16 convolution plans when cuDNN autotuning fails to find valid FP8 plans, specifically addressing convolution failures on NVIDIA RTX 5090 GPUs with channels or groups less than 16. This improves robustness and prevents JAX crashes during convolution operations.
- Collective communication and scheduling fixes: A pull request adds missing scheduling for the all-gather-start collective operation in the XLA GPU backend to ensure proper handling of this communication step. Another fixes the nvshmem sendrecv subtest and notes required changes to align with the one-sided nature of nvshmem collectives.
- GPU thunk progress tracking redesign: One pull request redesigns the progress tracker for GPU thunks to correctly record and preserve the complete execution history within loops by appending new event records for every execution. This enables accurate tracking of loop iterations and improves deadlock diagnosis.
- Autotuning client support for cross-compilation: A pull request enables passing a PjRt client for autotuning during cross-compilation by plumbing the autotuning client through XlaCompileOptions to PjRtCompile. This allows autotuning to run on real hardware with identical GPUs for optimal kernel selection.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 27
Key Closed Pull Requests
1. [ROCm] Support hipblaslt group-gemm 3/5: This pull request introduces the middle layers of group-gemm support for ROCm, including runtime, thunk, and stream_executor components, by adding necessary classes and structures, enabling a second path in the cublaslt gemm backend for group-gemm operations, and incorporating unit tests to verify functionality such as RaggedDot rewriting and proto object handling.
- URL: pull/38909
2. [ROCm] Implement empty graph node support and safeguards for HIP command buffer: This pull request implements support for empty graph nodes and adds safeguards in the ROCm HIP command buffer backend to prevent crashes during graph instantiation by enabling empty nodes as dependency synchronization points, fixing priority setting behavior, detecting empty traced graphs, and inserting empty nodes before finalization, accompanied by new platform-agnostic tests to validate these changes.
- URL: pull/39417
3. [xla:gpu] Use generic AsyncStart/Done thunks for host/device memcpy thunks: This pull request aims to replace specific copy thunk async events with generic AsyncStartThunk and AsyncDoneThunk for host-to-device and device-to-host memory copies, remove the CopyDoneThunk, and fix execution stream resolution by using params.stream instead of attribute resolution.
- URL: pull/39428
Other Closed Pull Requests
- ROCm and GPU Backend Support Enhancements: Multiple pull requests improve ROCm platform support by enabling the ROCm Triton backend for AllReduce, fixing ROCm-specific test failures, and extending GPU collective operations to SYCL with oneCCL. These changes also include disabling keep_going in ROCm PR checks for faster failure detection and enabling fusion_emitter_large_test on ROCm with selective test skipping.
- CUDA Virtual Memory Management (VMM) and GPU Memory Allocator Improvements: These pull requests introduce a comprehensive CUDA VMM allocator framework, refactor GPU executable and client code for multi-device VMM support, fix use-after-free bugs in async transfers, and add a VMM-based device memory allocator option to the default XLA memory allocator. This enhances GPU memory management with improved reservation, allocation, and asynchronous deallocation.
- GPU Collective and Async Thunk Infrastructure Refactoring: Pull requests migrate GPU collective thunks to a generic Async Start/Done infrastructure, simplifying concurrency management without changing execution order. Additionally, oneCCL oneAPI collective operations are registered as a SYCL backend to support Intel GPU collectives and fix related unit test failures.
- pull/39382, pull/39190
- Test Stability and Coverage Improvements: Several pull requests add test coverage for untested code paths in AllReduceSimplifier and all_gather_remove_degenerate_dims, skip flaky or environment-dependent tests such as AutoShardingTest in OSS builds, and improve test robustness by wrapping imports and handling platform-specific fingerprints. These changes ensure smoother test execution and better validation.
- Performance and Debugging Enhancements in GPU Execution: Pull requests improve debugging of GPU deadlocks by tracking nested while loops in thunk progress and enhance crash dump readability by logging thunk execution progress chronologically. Additionally, ObjectPool performance is improved by replacing spin-wait with a version counter for lock-free operations.
- ROCm HIPBLAS and GEMM Support Fixes: A pull request fixes hipblasLt Int8 GEMM support and autotuner output comparison by correcting workspace buffer logic, adding missing computation types, and including Int8 typed matmul dispatch entries to prevent crashes and execution errors.
- Build and Dependency Management for ROCm: One pull request extends the select_threshold macro to accept dictionaries for better ROCm interface handling in Bazel dependencies and fixes output validity to prevent build failures on certain ROCm versions.
- Memory Coloring Enhancements: A pull request updates the GPU memory colorer to support assigning memory spaces to custom call operands and results via frontend attributes, enabling the buffer colorer to apply specified memory space requirements during buffer assignment.
- JAX Pipeline Resource Optimization: Two pull requests propose switching the JAX unit test pipeline to Kubernetes runners for better resource availability and reducing GPU runner usage from eight to one to avoid resource waste since multi-GPU tests are not run locally.
- SYCL Test Consistency Update: A pull request updates SYCL tests by replacing hardcoded spirv-binary with an HLO-based spirv-binary to maintain consistency with other SYCL tests like sycl_executor_test and sycl_stream_test.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| shawnwang18 | 62 | 6 | 0 | 0 |
| ezhulenev | 48 | 17 | 0 | 0 |
| alekstheod | 23 | 8 | 0 | 6 |
| mfrancepillois | 28 | 4 | 0 | 0 |
| and-ivanov | 16 | 5 | 0 | 4 |
| zoranjovanovic-ns | 13 | 3 | 0 | 2 |
| housrepository | 0 | 0 | 17 | 0 |
| magaonka-amd | 10 | 3 | 0 | 0 |
| pemeliya | 11 | 2 | 0 | 0 |
| sergachev | 11 | 1 | 0 | 0 |