Weekly GitHub Report for Xla: March 09, 2026 - March 16, 2026 (19:44:08)
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version release information was available to summarize this week.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- [BUG] [STAT:AWAITING OPENXLA-ENG] `bitcast_convert` of packed constant `U8` (uint8) to "packed" types (S4, S2, U4, U2) fails: This issue describes a problem where the `bitcast_convert` operation on a packed constant of type `U8`, when converted to smaller packed types like `S4`, `S2`, `U4`, and `U2`, produces inconsistent results compared to when the same value is provided as an input parameter, leading to incorrect output values for constants. The root cause appears to be that the internal `Literal` data structure used for constants does not properly support packed subbyte elements, causing the constant-folding process to perform a simple memcpy rather than a correct bitcast conversion, whereas inputs are handled correctly.
- The comments discuss the underlying cause being the lack of packed subbyte support in the `Literal` structure, confirm that providing the value as an input works correctly as a workaround, and note the inconsistency in behavior between constants and inputs; additional context about PJRT buffer handling and potential API improvements is also shared.
- Number of comments this week: 4
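To make the memcpy-vs-bitcast distinction concrete, here is a small standalone Python sketch (illustrative only, not XLA code; the low-nibble-first element ordering is an assumption for illustration):

```python
import numpy as np

def bitcast_u8_to_u4(v) -> list[int]:
    """Split one uint8 into the two 4-bit values a bitcast should yield.

    The point is that a correct subbyte bitcast must decompose the byte
    into its two elements, which a plain memcpy into an unpacked
    one-element-per-byte buffer does not do.
    """
    return [int(v) & 0x0F, (int(v) >> 4) & 0x0F]

print(bitcast_u8_to_u4(np.uint8(0xAB)))  # [11, 10]
```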
- [QUESTION] [GPU] [STAT:AWAITING RESPONSE FROM CONTRIBUTOR] How to print IR after each passes?: This issue is about a user seeking guidance on how to print the intermediate representation (IR) after each pass when running a command on a StableHLO input file, similar to the `--mlir-print-ir-after-all` option in MLIR. The user specifically wants to observe how StableHLO dialect operations are transformed into the XLA or Triton dialects and has tried certain dump options that only output the HloModule.
- The comment explains that the file named with the suffix `before_optimizations.txt` shows the transformation result from StableHLO to XLA, and clarifies that HloModule is the internal representation used within XLA.
- Number of comments this week: 1
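For readers with a similar need, XLA's generic dump flags come close to a per-pass IR printer. A hedged sketch (the output path and regex are examples, and these flags dump HLO, not MLIR):

```python
import os

# Set the flags before JAX (or another XLA client) initializes the compiler.
# --xla_dump_to chooses the output directory; --xla_dump_hlo_pass_re requests
# a dump of the HLO module after every pass whose name matches the regex.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump --xla_dump_hlo_pass_re=.*"
```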
- Delay kernel timer timeout on Blackwell (SM 12.0a) — hardcoded cycle count assumes 2GHz: This issue addresses a problem with the delay kernel timeout in the CUDA code for Blackwell GPUs, where a hardcoded cycle count assumes a 2 GHz clock, causing the timeout to be shorter than intended on RTX 5090 GPUs running at 3.09 GHz. The user reports frequent delay kernel timeout warnings and suggests that the timeout should either scale with the actual SM clock frequency or use a larger fixed value to accommodate higher clock speeds.
- The single comment questions the original contributor about the possibility of updating the code to fix the timeout issue, indicating a prompt for further investigation or modification.
- Number of comments this week: 1
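The suggested fix direction amounts to simple arithmetic, sketched below (function and parameter names are hypothetical, not the actual XLA code; CUDA reports device clock rates in kHz):

```python
def timeout_cycles(timeout_seconds: float, sm_clock_khz: int) -> int:
    """Scale the delay-kernel timeout with the actual SM clock.

    A hardcoded cycle count sized for 2 GHz expires early on a 3.09 GHz
    part; deriving cycles from the reported clock keeps the wall-clock
    timeout constant across GPUs.
    """
    return int(timeout_seconds * sm_clock_khz * 1000)

# The same 1-second budget on a 2 GHz vs. a 3.09 GHz SM clock:
print(timeout_cycles(1.0, 2_000_000))  # 2000000000
print(timeout_cycles(1.0, 3_090_000))  # 3090000000
```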
Since there were fewer than 5 open issues, all of the open issues have been listed above.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
As of our latest update, there are no stale issues for the project this week.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 4
Summarized Issues:
- StableHLO bitcast_convert issues: The `bitcast_convert` operation in StableHLO mishandles packed constant `U8` values when converting to packed subbyte types like `S4`, `S2`, `U4`, and `U2`, leading to inconsistent outputs compared to when these values are provided as input parameters. This inconsistency affects the correctness of conversions involving packed subbyte types.
- issues/38964
- IR printing and debugging support: Users are seeking ways to print the intermediate representation (IR) after each compiler pass to observe transformations from StableHLO dialect operations to the XLA or Triton dialects, similar to the `--mlir-print-ir-after-all` option in MLIR. Currently, only the HloModule can be dumped, limiting debugging and inspection capabilities.
- issues/39101
- HLO snapshot generation regression on GPU: The `xla_dump_hlo_snapshots` flag stopped producing snapshot artifacts for JAX scripts running on GPU (jax[cuda12]) starting from version 0.5.3, although it continues to work correctly on CPU and in earlier versions. This regression impacts users who rely on snapshot artifacts for debugging GPU executions.
- issues/39194
- CUDA kernel timeout scaling issue: A hardcoded delay kernel timeout in CUDA code assumes a 2GHz clock speed, causing premature timeout warnings on NVIDIA RTX 5090 GPUs with higher clock speeds. The timeout should be adjusted to scale with the actual SM clock frequency or increased to accommodate faster GPUs to prevent false warnings.
- issues/39255
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 1
Summarized Issues:
- NVIDIA RTX 5090 f32 Matrix Multiplication Accuracy: This issue investigates the increasing divergence in f32 matrix multiplication accuracy on NVIDIA RTX 5090 GPUs compared to CPU results as matrix size grows. It identifies that the root cause is the use of TF32 tensor cores in cuBLAS/Triton GEMM kernels rather than a hardware bug, and also highlights related autotuner timer timeouts and missing configuration hints specific to this architecture.
- issues/39250
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 21
Key Open Pull Requests
1. Shawnw/vmm with cpu build fix: This pull request introduces a refactored CUDA Virtual Memory Management (VMM) allocator with an abstract base and CUDA-specific subclass, adds new VMM-based device memory allocator classes, updates related GPU executable and client code to support virtual address remapping, includes end-to-end tests for the VMM allocator configuration, and applies conditional compilation guards for CUDA platform-specific includes to fix build issues.
- URL: pull/39226
2. [ROCm] CI: Add GHA workflow for ROCm JAX unit tests: This pull request adds a new GitHub Actions workflow to migrate the JAX unit test stage for ROCm from Jenkins CI to GitHub Actions, configuring it to run on specialized AMD GPU runners within a ROCm TensorFlow build container and delegating test execution to JAX's upstream script while ensuring integration with the XLA repository.
- URL: pull/38919
3. [XLA:GPU] Add Put, Signal, and WaitSignal APIs to Communicator for NCCL: This pull request adds the Put, Signal, and WaitSignal APIs to the Collectives/Communicator interface specifically for the NCCL backend, enabling the implementation of the NCCL one-sided put path (such as ragged all-to-all with symmetric buffers) on top of existing CollectiveMemory and SymmetricMemory APIs, with initial stub implementations provided.
- URL: pull/38943
Other Open Pull Requests
- Async Thunk Refactoring and Generic Async Execution: Multiple pull requests replace specific copy thunk async events with generic AsyncStartThunk and AsyncDoneThunk implementations, streamlining asynchronous execution in the XLA GPU backend. These changes also introduce a generic mechanism for asynchronous execution of nested thunk sequences with event-based synchronization and replace the WaitForStreamsThunk and SequentialThunk pattern with generic async thunks for all asynchronous compute operations.
- Thunk and Command Class Refactoring: The Command class is refactored to be a subclass of Thunk to eliminate duplicated APIs and resolve diamond inheritance issues, improving code structure without changing behavior. Additionally, the CustomCallThunk class is split into two distinct classes to separate XLA FFI custom calls from legacy custom calls, simplifying the codebase by removing legacy code paths.
- AsyncWorkRunner API Unification and Bug Fixes: The AsyncWorkRunner is refactored to unify its API with tsl::Executor by owning thread pools by value and replacing multiple scheduling methods with a single Execute(Task) method. This eliminates use-after-free bugs and updates all relevant callers accordingly, improving concurrency and performance.
- CUDA and GPU Backend Improvements: Several pull requests improve GPU backend support by conditionally including CuDNN compiler headers only when GOOGLE_CUDA is defined, fixing autotuning of GEMM fusions with the cuDNN backend, and cleaning up kernel dynamic shared memory size handling to call CUDA APIs only when necessary.
- Triton GEMM and Backend Enhancements: New default Triton GEMM configurations are added for consumer Blackwell GPUs to fix autotuner issues, and the ROCm Triton backend is enabled for the AllReduce collective emitter by adding generic passes and necessary API support.
- CUDA Memory Allocator and Virtual Memory Management: Standalone CUDA memory allocator implementations are introduced for all CUDA-supported memory kinds by extracting allocation APIs from the CudaExecutor. A new CUDA Virtual Memory Management based device memory allocator option is also added, including RAII wrappers and refactoring to support virtual address remapping for command buffer thunks.
- Build and Testing Infrastructure Enhancements: Bant macro expansions are added for XLA test targets to enable buildcleaner-like analysis for missing dependencies, and a Reset() method is implemented in the SyclStreamPool class to clean up shared stream pools between tests, preventing resource leaks.
- Cloud Storage CLI Migration: The migration from the legacy gsutil tool to the modern gcloud storage CLI is automated to improve performance, authentication, and command consistency, with advisories for thorough testing due to potential behavioral differences.
- PDL GPU Support Enablement: PDL is enabled by default for GPU support in the project as detailed in the referenced prior pull request.
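The "own the thread pool by value, single Execute(Task) entry point" pattern described under the AsyncWorkRunner item above can be sketched as follows (names are illustrative, not XLA's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

class AsyncWorkRunner:
    """Minimal sketch of a runner that owns its thread pool outright.

    Owning the pool (rather than holding a borrowed reference) ties its
    lifetime to the runner, which is the property that rules out the
    use-after-free class of bugs; a single execute() entry point replaces
    a family of scheduling methods.
    """

    def __init__(self, num_threads: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)

    def execute(self, task):
        return self._pool.submit(task)

    def close(self):
        self._pool.shutdown(wait=True)

runner = AsyncWorkRunner()
future = runner.execute(lambda: 2 + 2)
print(future.result())  # 4
runner.close()
```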
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 33
Key Closed Pull Requests
1. update free command buffer v2: This pull request updates the free command buffer implementation by introducing new CUDA Virtual Memory Management classes—including MemoryReservation, CudaRawMemoryAllocation, and CudaMemoryReservation—with RAII wrappers and a StreamExecutorVmmAllocator for improved device memory allocation, alongside refactorings, debug enhancements, and comprehensive unit tests to ensure correctness and performance.
- URL: pull/38644
2. [ROCm] Refactor cub_sort_kernel_rocm.cu.cc: This pull request refactors the cub_sort_kernel_rocm.cu.cc file by splitting it into separate files for kernels, FFI registration, and a header, significantly reducing compilation times on various ROCm versions, especially those prior to ROCm 7.1.
- URL: pull/38400
3. [GPU] Implement Programmatic Dependent Launch (PDL).: This pull request introduces an initial implementation of Programmatic Dependent Launch (PDL) for GPU kernels in XLA by adding griddepcontrol.wait instructions at the start of all XLA-generated MLIR kernels to reduce kernel launch latencies on Hopper+ GPUs, with a design that supports future enhancements such as moving wait instructions based on data dependencies and adding dependent launch controls, resulting in measured performance improvements across various benchmarks.
- URL: pull/38544
Other Closed Pull Requests
- ROCm Backend and ROCm-Specific Features: Multiple pull requests enhance ROCm support by adding TF32 computation type in hipBLASLt GEMM operations, enabling FissionBackend autotuning, introducing minimal ExecutableAbiVersion ROCm support, and adding ROCm-specific accuracy budgets to fix intrinsic accuracy tests. These changes improve compatibility, performance, and reliability of ROCm within the XLA framework.
- Autotuning and Kernel Configuration Fixes: Several pull requests fix invalid split_k and block_k configurations in AMD GPU autotuning, improve autotuner error logging, and add missing ROCm version checks for RCCL fp8 support. These updates ensure more reliable autotuning and backward compatibility with older ROCm versions.
- GPU and ROCm Code Maintenance and Optimization: Pull requests include deduplication of the LaunchCudaKernel function, updating ROCm backend to use f16 libdevice functions to prevent precision loss, and fixing the fusion_emitter_int4_device_test expected output for ROCm. These changes improve code maintainability and test correctness on GPU and ROCm platforms.
- Async Execution and Concurrency Improvements: One pull request introduces the AsyncExecution library to fix a critical bug in CollectiveThunk::AsyncEvents by managing asynchronous execution events properly, while another updates the concurrency package to use standard concurrency primitives and deprecates TskTaskExecutor. These changes enhance parallel execution handling and thread control.
- Triton and ROCm Compatibility Fixes: Pull requests address ROCm compatibility by skipping a failing unit test on pre-Ampere GPUs, fixing crashes in the Triton pipeline using addNestedPass, and integrating a new Triton version with fixes for ROCm. These updates ensure stable Triton operation on ROCm platforms.
- Documentation and Build Configuration Enhancements: One pull request updates the Error Codes documentation for improved readability and consistency, while another proposes enabling the `--config=warnings` build option by default to treat compiler warnings as errors. These changes improve developer experience and code quality enforcement.
- Bug Fixes and Test Improvements: Pull requests fix undefined behavior in FFI state get/set functions, improve the RunAndCompareThreeIterations function in command_buffer_test to handle input arguments correctly, and add synchronization points to prevent CUDA deadlocks during NCCL and cuBLAS initialization. These fixes enhance stability and correctness in various components.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
|---|---|---|---|---|
| ezhulenev | 44 | 15 | 0 | 11 |
| mfrancepillois | 29 | 7 | 0 | 0 |
| benknutson-google | 29 | 0 | 0 | 0 |
| shawnwang18 | 22 | 6 | 0 | 0 |
| sergachev | 13 | 7 | 0 | 0 |
| zoranjovanovic-ns | 9 | 7 | 0 | 0 |
| blasphemetheus | 2 | 1 | 2 | 9 |
| meteorcloudy | 13 | 0 | 0 | 0 |
| Eetusjo | 5 | 4 | 0 | 0 |
| sfvaroglu | 3 | 1 | 0 | 5 |