Weekly Project News


Weekly GitHub Report for XLA: March 16, 2026 - March 23, 2026 (19:48:35)

Weekly GitHub Report for XLA

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Other Noteworthy Updates
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

No recent version releases were found.

1.2 Other Noteworthy Updates:

No other noteworthy updates were found this week.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

As of our latest update, there are no active issues with ongoing comments this week.

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 3

Summarized Issues:

  • Library Symbol Conflicts and Hardware Support: The bundled hwloc library in XLA exports global symbols without PCI discovery support, causing symbol collisions that disrupt third-party libraries relying on hwloc for PCI device enumeration. This leads to failures and performance degradation specifically on AWS EFA-enabled GPU instances.
  • issues/39355
  • Backend Execution Failures: A StableHLO module runs successfully on the CPU backend but fails on the Interpreter backend due to an unimplemented primitive type error related to tuple handling during execution. This indicates inconsistent support for certain operations across different backends.
  • issues/39464
  • Tool Crashes Despite Successful Execution: The hlo-translate tool crashes when processing a specific StableHLO module, even though the module executes correctly on both CPU and CUDA platforms. This suggests issues in the translation tool independent of the module's runtime correctness.
  • issues/39465

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 3

Summarized Issues:

  • IR Printing and Debugging: The user seeks a way to print the intermediate representation (IR) after each compiler pass to observe the transformation of StableHLO dialect operations to XLA or Triton dialects, similar to the --mlir-print-ir-after-all option in MLIR. Current options only allow dumping the HloModule, which does not provide the desired level of detail for debugging transformations.
  • issues/39101
  • Execution Failures on CUDA: A StableHLO module that runs correctly on CPU fails reproducibly on CUDA due to an unsupported operation error during the conversion of HLO to MLIR for GPU execution. This indicates a gap in GPU backend support for certain operations present in the module.
  • issues/39462
  • Numerical Inconsistency Between CPU and CUDA: The StableHLO module produces inconsistent numerical results when executed on CPU versus CUDA, with investigations pointing to numerical instability in the triangular-solve operation as the cause. This instability leads to significant differences in output values across platforms.
  • issues/39463
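A CPU-versus-CUDA discrepancy attributed to a triangular solve is a classic symptom of an ill-conditioned system: tiny rounding differences between backends get amplified during substitution. The sketch below is a generic illustration (not the module from the issue), using NumPy and contrasting float32 with float64 arithmetic as a stand-in for two backends that round differently:

```python
import numpy as np

def solve_lower(L, b):
    """Forward substitution for a lower-triangular system L @ x = b."""
    n = L.shape[0]
    x = np.zeros_like(b)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(0)
n = 20
# Small diagonal entries relative to the off-diagonal ones make the
# triangular system ill-conditioned, so rounding differences are
# amplified at every substitution step.
L = np.tril(rng.standard_normal((n, n)))
np.fill_diagonal(L, 0.5)
x_true = rng.standard_normal(n)
b = L @ x_true

x64 = solve_lower(L.astype(np.float64), b.astype(np.float64))
x32 = solve_lower(L.astype(np.float32), b.astype(np.float32))

# The float32 solve (standing in for a backend with different rounding)
# drifts much further from the true solution than the float64 solve.
print("float64 max error:", np.max(np.abs(x64 - x_true)))
print("float32 max error:", np.max(np.abs(x32.astype(np.float64) - x_true)))
```

The same mechanism explains platform-dependent results even when both backends are individually "correct": each commits rounding errors of its own, and the conditioning of the solve decides how far those errors diverge.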

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 23

Key Open Pull Requests

1. update-free command buffer with va_range multiplexing: This pull request updates the free command buffer implementation by introducing virtual address (VA) range multiplexing with two VA reservation sets per executor, enabling concurrent CPU remapping and GPU execution to improve performance and flexibility, while also adding comprehensive tests and refactoring the VMM allocator into an abstract base and CUDA-specific subclass to support multi-device configurations and VA remapping in command buffer thunks.

  • URL: pull/39672
  • Associated Commits: 5185d, 38e89, 52da0, 2459a, 97241, beef3, eb13a, f4d77, ff4b1, 4d682, d59f2, 96d84, 58674, 2d7e0, b0f7e, 87259, 7f337, 775c2

2. [xla:gpu] Add VA remapping for command buffer thunks: This pull request adds support for virtual address remapping in the GpuExecutable component to enable command buffer thunks to use fixed virtual addresses across executions, allowing command buffers to be recorded once and replayed without updates despite changes in physical buffer allocations, and introduces a new flag to opt into this remapping path along with related changes to memory reservation, thunk execution, and platform-agnostic abstractions.

  • URL: pull/39393
  • Associated Commits: 73b23, 59086, ca416, e6902, abac9, e1bfe, 1665b

3. [XLA:GPU] Adds a new construction mode (kCapture) for command buffer: This pull request introduces a new ConstructionMode called kCapture to the Command base class and CommandExecutor in the XLA GPU backend, enabling inline CUDA stream capture to record commands directly into the outer command buffer for improved performance by eliminating child-graph launch overhead, while maintaining backward compatibility with the existing kExplicit mode and supporting mixed usage within command sequences.

  • URL: pull/39475
  • Associated Commits: 0be42, 7b042, 55590

Other Open Pull Requests

  • ROCm backend fixes and enhancements: Multiple pull requests address issues and improvements in the ROCm backend, including preventing unsupported hipblaslt GEMM calls for certain data types, enabling empty graph nodes to avoid crashes, fixing hipBLASLt autotuner backend disabling bugs, and updating scratch memory allocation to use BFCAllocator for MIOpen autotuning. These changes improve stability, compatibility, and performance on ROCm platforms with added unit tests and platform-agnostic safeguards.
    • pull/39373, pull/39417, pull/39567, pull/39622, pull/39621
  • Async thunk infrastructure and GPU collective improvements: Several pull requests migrate GPU collective thunks to a generic Async Start/Done thunk infrastructure, replace copy thunk async events with generic async thunks, and introduce new async thunk passes to optimize asynchronous execution by removing redundant thunks and expanding async scopes. These changes simplify thunk management and improve asynchronous operation overlap and correctness in the XLA GPU backend.
    • pull/39382, pull/39428, pull/39435
  • Collective operations caching and scheduling: A pull request implements caching for traced command buffers in ROCm collective operations using a Rendezvous gate to synchronize cache decisions across ranks, preventing NCCL deadlocks and improving performance by avoiding redundant graph instantiations. Another pull request adds an annotation to schedule custom communication kernels on NVIDIA GPUs as high-priority native collectives with dedicated streams to prevent overlap.
    • pull/39449, pull/39604
  • Profiling and annotation enhancements: One pull request adds support for capturing and propagating scope_range_id in the ROCm profiler, constructing a hierarchical timeline grouping tree and exporting it for visualization, while uniformly setting scope_range_id across all event types to enhance profiling capabilities.
    • pull/39309
  • GPU memory management and allocator framework: A comprehensive CUDA Virtual Memory Management (VMM) allocator framework is introduced with abstract and CUDA-specific classes, refactoring GPU executable and client code for multi-device support, fixing a use-after-free bug in async host-to-device transfers, and adding end-to-end tests to validate the new allocator.
    • pull/39535
  • Hang watchdog for PJRT GPU client: A hang watchdog is added to abort the PJRT GPU client process if NCCL communicator creation times out, addressing issues where rendezvous termination does not detect timeouts in single local participant scenarios, thus preventing potential infinite hangs.
    • pull/39537
  • SYCL and oneCCL oneAPI collective support: Initial support for Intel GPU collectives is enabled by registering oneCCL oneAPI collective operations as a backend for XLA when built with SYCL, resolving unit test failures related to missing GPU collective support. Additionally, a hardcoded spirv-binary in SYCL tests is replaced with an HLO-based binary for consistency.
    • pull/39497, pull/39508
  • Dynamic memcpy offset and metadata fixes: A pull request fixes dynamic memcpy offset computation for host offloading with collective pipelining by tracking per-variable initialization and step metadata in WhileLoopBackendConfig, ensuring correct dynamic slice offsets and preventing stale metadata issues that could cause data corruption.
    • pull/39302
  • Test coverage and platform-specific test adjustments: Test coverage is added for AllReduceSimplifier no-op removal paths and replicated input reductions, while a test is skipped outside Google-internal builds due to reliance on unavailable autosharding infrastructure. Another test is fixed on ROCm by using platform-conditional fingerprints for autotuning results.
    • pull/39621, pull/39673, pull/39690
  • SE Kernel API and packed argument storage: Support is added for dynamically sized storage of packed arguments in the SE Kernel API within the XLA GPU backend, replacing previously hardcoded NCCL device communication sizes to maintain compatibility across releases.
    • pull/39601
  • Concurrency feature update for GPU scheduling: The [xla:gpu] code is updated to replace ScheduleWhenReady with the standard concurrency feature MakeFutureWhenReady(...).OnReady() for scheduling callbacks, aiming to improve asynchronous operation handling.
    • pull/39516
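The last item describes replacing a bespoke scheduling hook with a future-plus-callback pattern (MakeFutureWhenReady(...).OnReady() in the PR summary names XLA's C++ concurrency utilities; nothing below assumes their actual signatures). As a rough sketch of the same pattern, Python's concurrent.futures can express "run this continuation once all dependencies are ready":

```python
import concurrent.futures as cf
import threading

def make_future_when_ready(futures):
    """Return a future that completes when all input futures complete.

    A stdlib analogue of a when-ready combinator. (Constructing a bare
    Future directly is documented for testing; it suffices for a sketch.)
    """
    done = cf.Future()
    remaining = len(futures)
    lock = threading.Lock()

    def on_one_done(_):
        nonlocal remaining
        with lock:
            remaining -= 1
            if remaining == 0:
                done.set_result(None)

    for f in futures:
        f.add_done_callback(on_one_done)
    return done

results = []
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    deps = [pool.submit(pow, 2, i) for i in range(4)]
    # Instead of polling or manually scheduling "when ready", attach the
    # continuation to the combined future; it runs exactly once, after
    # every dependency has completed.
    make_future_when_ready(deps).add_done_callback(
        lambda _: results.extend(f.result() for f in deps)
    )
print(results)  # -> [1, 2, 4, 8]
```

The appeal of the pattern, in XLA as here, is that the continuation carries no scheduling logic of its own: readiness tracking lives entirely in the future combinator.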

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 39

Key Closed Pull Requests

1. Shawnw/vmm with cpu build fix: This pull request introduces and refactors CUDA Virtual Memory Management (VMM) allocators by adding abstract base and CUDA-specific subclasses for device memory allocation with deferred deallocation, updates related GPU executable and client code to use the new VMM allocator design, adds end-to-end tests for the VMM allocator configuration, and includes conditional compilation guards for CUDA platform-specific code to fix build issues.

  • URL: pull/39226
  • Associated Commits: cc1ef, ba797, bbd5b, 71836, 764b5, f8ddc, 92f44

2. Shawnw/cmd buffer with capturing: This pull request proposes enhancements to the XLA GPU backend by introducing and refining command buffer construction modes that leverage CUDA stream capture for more efficient command recording, including adding a new inline capture mode to reduce CUDA graph nesting, refactoring related code for clarity and performance, and removing redundant capture modes to streamline command execution.

  • URL: pull/39461
  • Associated Commits: eb3cd, 52524, 678a4, 73c79, 1fe07, 9e346

3. [ROCm] CI: Fix containers settings: This pull request addresses a race condition between concurrent workflow runs on the same self-hosted runner by renaming containers to be unique, and it modifies the tmpfs settings, reducing the size from 80GB to prevent container processes from exiting due to out-of-memory errors.

  • URL: pull/39279
  • Associated Commits: aaf4d, f8b76, fecee, 46508, a96fe

Other Closed Pull Requests

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor        Commits  Pull Requests  Issues  Comments
ezhulenev          50       21             0       14
shawnwang18        62       12             0       1
mfrancepillois     25       3              0       0
magaonka-amd       9        3              0       8
sergachev          14       4              0       0
alekstheod         7        5              0       4
zoranjovanovic-ns  11       3              0       1
leo-amd            8        2              0       1
mdfaijul           6        2              0       2
Eetusjo            6        2              0       0
