Weekly Project News

Archives

Weekly GitHub Report for Xla: April 06, 2026 - April 13, 2026 (19:28:29)

Weekly GitHub Report for Xla

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Other Noteworthy Updates
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

No recent version releases were found.

1.2 Other Noteworthy Updates:

No other noteworthy updates were found this week.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be those that have been commented on most frequently within the last week. Bot comments are omitted.

  1. [NVIDIA-GPU] [ERR:PERFORMANCE] GPU performance model significantly overestimates INT8 GEMM speedup for small/medium shapes: This issue addresses the significant overestimation by the GPU performance model of INT8 GEMM speedup compared to FP32, particularly for small and medium tensor shapes, where measured speedups on an RTX 3080 show large prediction errors and even cases where INT8 is slower than FP32. The root causes include missing kernel launch and quantization overhead in the model, peak utilization assumptions that do not hold in practice, and the impact this has on XLA's cost model decisions for lowering operations to INT8, with a suggested improvement involving a shape-dependent sigmoid correction to better fit observed performance.

    • The comments discuss understanding the problem and potential implementation approaches, concluding that a hybrid solution applying a shape-dependent correction post-hoc in the runtime estimation is best. They also highlight that while the sigmoid correction model fits the RTX 3080 data, parameters would need to be adjusted per GPU architecture due to differing peak performance ratios, but the need for shape-dependent correction is consistent across architectures.
    • Number of comments this week: 2
  2. [CPU] Sub-optimal LLVM IR vectorization for reduce over small non-power-of-2 dimensions on CPU (interleaved shufflevector instead of contiguous loads): This issue describes a problem with LLVM IR vectorization generated by XLA:CPU when lowering a reduce operation over small non-power-of-2 innermost dimensions, resulting in inefficient interleaved vector loads and sequential floating-point additions instead of contiguous loads and vectorized reductions. The user provides a minimal reproducible example and contrasts the current sub-optimal LLVM IR with a previously better-performing direct-to-LLVM emitter approach that produced contiguous vector loads and efficient tree reductions.

    • The single comment simply tags another user for awareness, indicating no further discussion or resolution has occurred yet.
    • Number of comments this week: 1

Since fewer than 5 issues were active this week, all of them have been listed above.
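The sigmoid correction proposed in the first issue can be sketched roughly as follows. This is a hypothetical illustration, not the proposal's actual code: the function name, the use of total MACs as the size proxy, and the `midpoint` and `steepness` parameters are all invented for the example; the issue notes that real parameters would need per-architecture fitting.

```python
import math

def corrected_int8_speedup(naive_speedup: float, m: int, n: int, k: int,
                           midpoint: float = 1e6, steepness: float = 4.0) -> float:
    """Scale a peak-utilization INT8-vs-FP32 speedup estimate by a
    shape-dependent sigmoid so small problems approach 1.0x (launch and
    quantization overhead dominate) and large problems approach the naive
    estimate. All parameter values here are illustrative placeholders."""
    work = m * n * k  # total multiply-accumulates as a rough problem-size proxy
    # Sigmoid in log-space of problem size: gate is ~0 for tiny shapes, ~1 for large.
    x = steepness * (math.log10(work) - math.log10(midpoint))
    gate = 1.0 / (1.0 + math.exp(-x))
    # Blend between "no speedup" (1.0) and the peak-utilization estimate.
    return 1.0 + gate * (naive_speedup - 1.0)
```

Applied post hoc to the runtime estimate, as the comments suggest, this leaves the existing model untouched for large shapes while damping the predicted speedup where the measured data shows INT8 winning little or nothing.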

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

As of our latest update, there are no stale issues for the project this week.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 3

Summarized Issues:

  • API Extension Limitations: The issue highlights difficulties in adding metadata extensions to the PJRT_BufferFromHostBuffer function within the PJRTClient to PjRtCApiClient interface without altering existing argument structures or using side channels. It questions whether the current PJRTClient implementation supports custom API extensions or if direct construction of the PJRT API argument struct is required, as demonstrated by torch_xla's method.
  • issues/40504
  • Inefficient LLVM IR Vectorization: This issue describes sub-optimal LLVM IR vectorization generated by XLA:CPU when lowering reduce operations over small non-power-of-2 innermost dimensions. The result is inefficient interleaved shufflevector loads and sequential floating-point additions instead of contiguous vector loads and efficient horizontal reductions.
  • issues/40677
  • GPU Performance Model Overestimation: The problem involves significant overestimation by the GPU performance model of INT8 GEMM speedup over FP32 for small and medium tensor shapes. The model neglects kernel launch and quantization overheads as well as peak utilization limits, causing large prediction errors and incorrect optimization decisions in XLA for these workloads.
  • issues/40680

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 0

Summarized Issues:

As of our latest update, there were no issues closed in the project this week.

2.5 Issue Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 23

Key Open Pull Requests

1. [WIP] [ROCm] Move rocm configure to ml_toolchain repo, enable hermetic llvm usage: This pull request moves the ROCm configuration to the ml_toolchain repository and enables hermetic LLVM usage by compiling XLA under a hermetic Clang setup for ROCm, thereby activating a more isolated and consistent toolchain environment.

  • URL: pull/40807
  • Associated Commits: 893f7, d7290, 23497, ff1db, e49fe, b2a41, 16ece, 9b721, 95257, 5942a, 27cb1

2. Shawnw/cmd buffer capture mode rebase update free cmd buffer v4 rebase: This pull request introduces CUDA Virtual Memory Management (VMM) allocator classes and updates the GPU command buffer handling to support VA remapping with new allocation modes, adds a ConstructionMode enum to optimize CUDA graph building via inline stream capture, renames update mode enums for clarity, and includes comprehensive tests and code cleanups to improve performance and maintainability of GPU command execution in the XLA compiler.

  • URL: pull/40458
  • Associated Commits: a162d, 67871, 86b5b, cb763, 7e2e3, 28a41, 994fc

3. [XLA:GPU] Dockerized Hermetic Tests for oneAPI: This pull request updates the ci_test_xla.sh script for oneAPI and introduces a new ci_docker_test.sh script to enable running all XLA test cases within a fully hermetic Docker environment, while also removing several SYCL-related build and installation scripts.

  • URL: pull/40529
  • Associated Commits: d139c, 1e61e, 7d941, eb142, db545, 384d2, ee20b

Other Open Pull Requests

  • Thunk-as-Command Migration: Multiple pull requests focus on migrating various command classes such as LaunchCmd, EmptyCmd, and MemcpyDeviceToDeviceCmd to have their corresponding Thunk classes implement the Command interface directly. These changes consolidate internal methods, remove redundant wrapper classes, improve error handling, and update tests to reflect the new command structure, streamlining GPU command dependency management.
    • pull/40684, pull/40763, pull/40716
  • ROCm and CUDA Triton Backend Unification: Several pull requests work on unifying the AllReduce collective emitter and atomic operations lowering for both CUDA and ROCm GPU targets by replacing PTX assembly with LLVM intrinsics and extern_elementwise Triton operations. These changes include enabling ROCm support, adding necessary APIs, and helper functions to facilitate atomic operation lowering and improve backend compatibility.
    • pull/40464, pull/40463, pull/40460, pull/40462
  • CUPTI Range Profiling Integration: Two pull requests add NVIDIA CUPTI range profiling support to the XLA GPU profiler infrastructure and multi-host HLO runner. They enable detailed hardware performance counter collection with multi-pass replay, extend profiler interfaces, and improve data processing and testing for accurate per-kernel performance analysis.
    • pull/40450, pull/40449, pull/40751
  • Collective Permute and Backend Improvements: Pull requests introduce a new one-sided collective permute execution mode using NCCL APIs, migrate peer-to-peer collective permute to collective memory APIs, and add a flag to control replica group formation for collective permute operations. These changes enhance backend assigner passes, support multiple collective operation modes, and optimize clique formation for peer-to-peer communication.
    • pull/40761, pull/40676
  • Cross-Host Data Transfer Refactoring: One pull request unifies PreparedSend and PreparedReceive into a single PreparedTransfer struct and introduces a shared helper function for cross-host data transfer buffers. This refactoring improves buffer management and enables better overlap of communication and computation.
    • pull/40585
  • GPU Async and Host Communication Cleanup: A pull request cleans up GPU thunk APIs by creating a separate HostAsyncThunk for asynchronous device-to-host communication, removing legacy async thunk support, and migrating affected files to use modern error handling macros. This simplifies the command buffer converter and removes obsolete control dependencies.
    • pull/40667
  • Performance Optimization for GpuCliques: One pull request optimizes the performance of requesting GpuCliques in large models by caching replica groups and cliques and replacing linear searches with hash map lookups. This reduces delays in kernel launches on the device.
    • pull/40738
  • Integer Overflow Safety in Shape Inference: A pull request adds overflow checks using OverflowSafeMultiply to prevent silent integer overflow in shape inference dimension computations. This ensures errors are returned instead of corrupted shapes when near INT64_MAX.
    • pull/40750
  • Removal of Outdated Command Buffer Call Handling: One pull request removes special case handling for calls to command buffers in the XLA GPU codebase, reflecting the current use of a command buffer pass with thunks that allow calls of any type.
    • pull/40751
  • ROCm Platform and FP8 Support Updates: A pull request updates ROCm platform support by marking FP8 fast accumulation as supported starting from ROCm version 7, setting the minimum version to 6.3, and ensuring relevant HLO tests run successfully.
    • pull/40702
  • AOT Compilation Platform Check and Spec File: One pull request adds a platform check to ensure ahead-of-time compilation only occurs when host and target platforms match, introduces a oneAPI compatible spec file for deviceless compilation testing, and includes a test case to verify rejection on platform mismatch.
    • pull/40593
  • Deadlock Prevention in Async Collective Multi-Streaming: A pull request implements deadlock prevention mechanisms for asynchronous collective multi-streaming on GPUs, fixes a related casting issue, and reactivates and adds unit tests to ensure correctness.
    • pull/40656
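The overflow-check idea from the shape-inference pull request above can be sketched in a few lines. This is a minimal illustration of the technique, not XLA's actual code: the function names are invented here, and XLA implements the check in C++ rather than relying on Python's unbounded integers.

```python
INT64_MAX = 2**63 - 1

def overflow_safe_multiply(a: int, b: int):
    """Return a * b if the product fits in a signed 64-bit integer,
    otherwise None so the caller can report an error instead of
    silently producing a corrupted shape. Names are illustrative."""
    product = a * b  # Python ints are unbounded, so compute then check.
    return product if product <= INT64_MAX else None

def infer_num_elements(dims):
    """Fold dimension sizes with overflow checks; None signals overflow."""
    total = 1
    for d in dims:
        total = overflow_safe_multiply(total, d)
        if total is None:
            return None  # in XLA this would surface as an error status
    return total
```

The point of the pattern is that the multiply either succeeds exactly or fails loudly; dimension products near INT64_MAX can no longer wrap around into a small but wrong element count.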

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 52

Key Closed Pull Requests

1. [ROCm] CI: Add ROCm CI support to GitHub Actions workflow: This pull request adds AMD ROCm GPU continuous integration support to the GitHub Actions workflow and Python build system by configuring sequential single and multi-GPU test suites, migrating build logic from shell scripts to Python, defining new build types targeting modern AMD architectures, and introducing end-to-end CI jobs to ensure OpenXLA functionality on ROCm platforms.

  • URL: pull/36893
  • Associated Commits: 0160f, 33627, 5fd0b, 6c682, f8ef7, 90316, cfd64, dc80f, 5ca07, c0cc5, 64281, 424e2

2. [XLA:GPU] Make Command a subclass of Thunk: This pull request refactors the XLA GPU backend by making the Command class a subclass of Thunk to eliminate duplicated APIs and explicitly represent their conceptual relationship, thereby improving code maintainability without changing existing behavior.

  • URL: pull/40498
  • Associated Commits: 09e17, a21c7, 0eda3, 6637e, 9d5b2, 86cc2

3. [XLA:GPU] Implement PutSignal, Signal, and WaitSignal on NcclCommunicator: This pull request implements the one-sided remote memory access (RMA) operations PutSignal, Signal, and WaitSignal on the NcclCommunicator using NCCL 2.29+ to enable direct remote memory access with symmetric buffers for higher-level communication patterns in the XLA GPU backend.

  • URL: pull/40594
  • Associated Commits: 06c81, c6e7a, 18e2a, ef3b0, 2f299

Other Closed Pull Requests

  • Command and Thunk Refactoring: Multiple pull requests focus on refactoring the command and thunk hierarchies in the XLA GPU backend to reduce redundancy and improve maintainability. These include migrating ComputationIdCmd functionality into the ReplicaOrPartitionIdThunk class hierarchy, making Command a subclass of Thunk, renaming Command to CommandThunk to reflect inheritance, and eliminating separate command classes by having thunks implement the Command interface directly.
    • pull/40609, pull/39254, pull/40570, pull/40722
  • GPU Backend Bug Fixes and Compatibility: Several pull requests address bug fixes and compatibility improvements in the GPU backend, including fixing build failures by correcting function signatures, skipping failing tests on specific hardware, ensuring proper scale settings for fp8 GEMM heuristics, and fixing stale error leaks in ROCm profilers. These changes improve stability and correctness across different GPU platforms.
    • pull/40331, pull/40369, pull/40747, pull/40199
  • GPU Command Buffer and Performance Enhancements: One pull request introduces a three-value enum to replace a boolean flag controlling virtual address remapping in GPU command buffers, enabling partial update-free replay and VA range multiplexing. This reduces per-execution overhead and improves performance in models with repeated identical executions.
    • pull/40173
  • ROCm and Intel GPU Platform Updates: Updates include upgrading the ROCm CI Docker image to fix segfaults, removing deprecated HIP context APIs in ROCm backend, and adding support for 64-bit kernel timestamps in the Intel GPU oneAPI backend. These changes ensure compatibility with newer hardware and software versions.
    • pull/40405, pull/39925, pull/40186
  • Memory Allocation and Thread Safety Improvements: A pull request migrates CUDA memory allocation and deallocation to use a new thread-safe AllocationTracker API, integrating various CUDA allocators into CudaExecutor and removing legacy functions. This unifies allocation tracking and adds hashing support for device addresses.
    • pull/39723
  • Testing and Profiling Enhancements: Improvements include adding TraceMe for while loop iterations to enhance host profiling data, separating FP8 collective operation tests by variant for ROCm platforms, and updating test file copying to use symlinks preserving RUNPATHS for oneAPI builds. These changes improve test coverage and profiling accuracy.
    • pull/40376, pull/40490, pull/40383
  • Communication and Execution Stream Improvements: One pull request replaces a boolean flag with a strongly-typed CommunicationId integer to support multiple overlapping communication types on the same devices. Another enables passing additional execution stream IDs to XLA:GPU FFI handlers for dedicated collective streams, enhancing communication flexibility and execution control.
    • pull/40418, pull/40406
  • Code Cleanup and API Modernization: Updates include replacing ScheduleWhenReady with MakeFutureWhenReady(...).OnReady() for scheduling callbacks, simplifying the RocmContext class to a header-only value type, and adding a target configuration file for Intel's Data Center GPU Max (PVC). These changes modernize the codebase and improve maintainability.
    • pull/39516, pull/39925, pull/40603

3.3 Pull Request Discussion Insights

This section analyzes the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

Contributor            Commits   Pull Requests   Issues   Comments
shawnwang18                 89              13        0         10
ezhulenev                   47              20        0         24
alekstheod                  33               1        0          0
mfrancepillois              18               6        0          1
and-ivanov                  20               0        0          0
sfvaroglu                    8               2        0          8
akhilgoe                    12               4        0          0
phambinhfin                 11               2        0          0
bhavani-subramanian          8               4        0          0
mraunak                     10               2        0          0

Don't miss what's next. Subscribe to Weekly Project News:
Powered by Buttondown, the easiest way to start and grow your newsletter.