Weekly GitHub Report for XLA: August 04, 2025 - August 11, 2025 (22:39:30)
Weekly GitHub Report for XLA
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
No recent version releases were found.
1.2 Version Information:
No version information was available to summarize for this period.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
As of our latest update, there are no active issues with ongoing comments this week.
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- New `nvshmem` rule breaks the build: This issue reports a build failure caused by a new `nvshmem` rule introduced in a recent pull request, which leads to an error related to the `repository_ctx` object lacking a `getenv` method during the CUDA configuration step. The reporter is seeking guidance on whether they need to update their side to resolve this error, particularly in relation to changes mentioned for JAX, or if the fix must come from the open_xla project, along with an estimated timeline for addressing the problem.
- Failed to Parse MLIR generated by Torchax: This issue describes a problem encountered when exporting a PyTorch model to MLIR using the torch-xla torchax export API, where the generated MLIR fails to parse due to an unregistered operation 'vhlo.rsqrt_v2' in the VHLO dialect. The user is attempting to compile the exported model with XLA AOT but faces deserialization errors with StableHLO, despite using compatible versions of torch, torchxla, and building XLA from the corresponding commit, and has provided code snippets and bytecode samples to assist in troubleshooting.
- support bazel modules: This issue discusses the potential adoption of Bazel modules within the project, highlighting that Bazel modules have seen significant adoption in the community. The reporter points out that XLA is currently the only package in their Bazel build that does not support Bazel modules and inquires about any plans to integrate this support.
- Gpu collective performance model bug: This issue addresses a bug in the gpu_collective_performance model where the recent update to lowLatencyBandwidth for AMD links was not applied to the CUDA section, causing failures when using H100 settings. As a result, the model call with these settings does not function correctly, indicating an inconsistency in how bandwidth parameters are handled across different GPU architectures.
- Possibility to specify strides when sending the data from buffer to host: This issue addresses the limitation in specifying byte strides when transferring data from a PJRT Buffer back to the host, particularly for data originally in column-major format. It highlights that while the `byte_strides` argument facilitates this conversion when creating the buffer, a similar mechanism is not generally supported for the reverse operation due to inconsistent plugin support, and requests the addition of a `byte_strides` field to enable this functionality (see the sketch after this list).
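To make the byte-strides request above concrete, here is a minimal numpy sketch (ours, not taken from the issue) of what byte strides encode for a column-major array; the issue notes that PJRT accepts such strides when creating a buffer and asks for the symmetric option when copying data back to the host.

```python
import numpy as np

# A (2, 3) float32 array stored in column-major (Fortran) order.
a = np.asfortranarray(np.arange(6, dtype=np.float32).reshape(2, 3))

# Byte strides give the number of bytes between consecutive elements
# along each dimension. Column-major (2, 3) float32 -> (4, 8): stepping
# one row moves 4 bytes, stepping one column moves 8 bytes.
print(a.strides)                        # (4, 8)

# The same shape in row-major order has strides (12, 4) instead, which
# is the layout a host copy produces when strides cannot be specified.
print(np.ascontiguousarray(a).strides)  # (12, 4)
```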
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 2
Summarized Issues:
- Performance Optimization Techniques: This topic covers efforts to improve computational efficiency in XLA operations, including requests for benchmark data on GPU MFU performance for Llama models and optimization of loop operations for partial prefix sums. The issues highlight challenges such as replacing an inefficient ReduceWindow with sliding-window techniques (see the sketch after this list) and exploring parallelization of WhileOp to improve performance.
- [issues/29836, issues/29857]
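As a purely illustrative sketch, and not code from either issue, the following JAX snippet shows one common way a sliding-window (ReduceWindow-style) sum can be recast as a single cumulative sum, which is the general flavor of rewrite the prefix-sum issue is exploring:

```python
import jax.numpy as jnp

def sliding_window_sum(x, window):
    # Sum over each length-`window` window ending at position i (with
    # implicit zero padding on the left), expressed through one
    # cumulative sum instead of a separate reduction per window.
    c = jnp.cumsum(x)
    padded = jnp.concatenate([jnp.zeros(window, x.dtype), c])
    return padded[window:] - padded[:-window]

x = jnp.arange(8.0)
print(sliding_window_sum(x, 3))  # [ 0.  1.  3.  6.  9. 12. 15. 18.]
```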
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 0
Summarized Issues:
As of our latest update, there were no issues closed in the project this week.
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 6
Key Open Pull Requests
1. [XLA:GPU] Add nested command buffer support: This pull request proposes adding support for nested command buffers to the XLA GPU backend, aiming to enhance command buffer management and execution.
- URL: pull/29787
- Merged: No
2. while_loop_analysis supports module that has been parsed by command buffer rewriter (has nested call): This pull request enhances the while_loop_analysis to support modules that have been parsed by the command buffer rewriter, including those with nested calls, by introducing a new HloModule clone API, adding a call inliner, and performing various related fixes and build system cleanups.
- URL: pull/29854
- Merged: No
3. Communication Fusion via Nvshmem: allreduce softmax: This pull request introduces a communication fusion mechanism using Nvshmem to optimize the allreduce softmax operation, including the addition of an ar-softmax fusion pass, integration of ar in optimization, TTIR graph corrections, implementation of the nvshmemx API call, and setting up the nvshmem linker.
- URL: pull/30028
- Merged: No
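The allreduce-softmax fusion in pull/30028 is summarized above only at a high level. The JAX sketch below is our own illustration, under assumed shard layout and names, of why an allreduce and a softmax end up adjacent in distributed workloads: a softmax over a sharded axis interleaves two cross-device reductions (max and sum) with elementwise work. It is not the pull request's implementation.

```python
import jax
import jax.numpy as jnp

def distributed_softmax(x_local, axis_name):
    # Softmax over an axis that is sharded across devices: both the max
    # and the normalizing sum require a cross-device reduction.
    m = jax.lax.pmax(jnp.max(x_local), axis_name)  # all-reduce (max)
    e = jnp.exp(x_local - m)
    z = jax.lax.psum(jnp.sum(e), axis_name)        # all-reduce (sum)
    return e / z

# One shard of 4 elements per visible device.
x = jnp.arange(4.0 * jax.device_count()).reshape(jax.device_count(), 4)
y = jax.pmap(lambda s: distributed_softmax(s, "i"), axis_name="i")(x)
print(y)
```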
Other Open Pull Requests
- bf16 Support and ROCm Device Optimization: This pull request adds support for bf16 starting from gfx11, fixes bugs, and optimizes the RocmComputeCapability in device_description.h. It also enables the ALG_DOT_BF16 operator on ROCm hardware that supports it.
- pull/29766
- Integration of rocprofiler-sdk and roctracer for GPU Profiling: This pull request integrates rocprofiler-sdk (v3) and roctracer (v1) into the XLA project to replace older profiling tools. It enables improved GPU event profiling on AMD GPUs with support for both time-based and step-based profiling, conditional compilation based on ROCm version, and includes new unit tests for ROCm version 6.3 and above.
- pull/29769
- ScopedClonedModuleCallInliner for Loop Analysis: This pull request introduces the ScopedClonedModuleCallInliner class in the call_inliner module, which clones a target module and performs inlining during initialization. This addresses limitations with the while_loop_analysis pass on modules parsed by the command buffer rewriter by enabling loop analysis on modules that cannot be modified directly.
- pull/29884
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 7
Key Closed Pull Requests
1. [XLA:CPU][oneDNN] Add build flag to enable asynchronous support in oneDNN: This pull request proposes adding a build flag to optionally enable asynchronous execution support in the oneDNN library for XLA on CPU, allowing users to compile oneDNN with this feature.
- URL: pull/28883
- Merged: No
2. [GPU] Bubble up mismatched buffer color from donation: This pull request aims to improve error handling in XLA by bubbling up buffer assignment check messages when users specify donation of an input buffer with a mismatched output memory space via `out_shardings`, thereby preventing silent failures and providing clear feedback in cases where buffer donation is not possible.
- URL: pull/29270
- Merged: No
3. SPMD Dot Tests: This pull request proposes adding end-to-end tests for the Single Program Multiple Data (SPMD) partitioning of dot operations to ensure correctness and reliability, although it has not been merged.
- URL: pull/29511
- Merged: No
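For the donation change in pull/29270, it helps to have the user-facing pattern in view. The JAX sketch below is our own minimal illustration (the mesh setup and function name are hypothetical) of donating an input while requesting an out_sharding, which is the situation in which the pull request makes a mismatched memory space surface as a clear buffer-assignment message rather than a silent failure:

```python
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Illustrative one-axis mesh over whatever devices are visible.
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))
out_sharding = NamedSharding(mesh, PartitionSpec("x"))

# donate_argnums asks XLA to reuse the input buffer for the output;
# that only works when the donated buffer is compatible with the
# output's memory space, which is what the check reports on.
@partial(jax.jit, donate_argnums=0, out_shardings=out_sharding)
def scale(a):
    return a * 2.0

x = jnp.ones((2 * jax.device_count(),))  # evenly shardable along "x"
print(scale(x))
```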
Other Closed Pull Requests
- Stream management improvements: These pull requests focus on enhancing stream handling within the XLA framework. One introduces a round-robin stream assignment algorithm for asynchronous collective operations on NVIDIA GPUs as a preparatory step for future pipeline integration, while another removes the Stream ID from the command buffer implementation due to the adoption of a DAG for dependency specification, making the Stream ID redundant.
- pull/28919, pull/29204
- HloModule cloning API enhancement: This pull request adds a new `CloneWithContext` API to `HloModule`, allowing users to retrieve the mapped instruction or computation within the cloned module. The change is primarily a refactor of existing code and does not introduce new tests.
- pull/29852
- AUTHORS file update attempt: This pull request proposes adding NVIDIA Corporation to the AUTHORS file but was ultimately not merged.
- pull/29894
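The round-robin stream assignment mentioned for pull/28919 is an algorithmic idea that is easy to show in isolation. The sketch below is our own schematic Python, not XLA's implementation, and the operation names are made up:

```python
from itertools import cycle

def assign_streams(collective_ops, num_streams):
    # Hand out stream ids in round-robin order so that independent
    # asynchronous collectives can overlap instead of serializing on a
    # single stream.
    streams = cycle(range(num_streams))
    return {op: next(streams) for op in collective_ops}

ops = ["all-reduce.0", "all-gather.1", "reduce-scatter.2", "all-reduce.3"]
print(assign_streams(ops, num_streams=2))
# {'all-reduce.0': 0, 'all-gather.1': 1, 'reduce-scatter.2': 0, 'all-reduce.3': 1}
```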
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| shawnwang18 | 18 | 6 | 0 | 2 |
| othakkar | 5 | 2 | 0 | 5 |
| mraunak | 11 | 0 | 0 | 0 |
| Copilot | 0 | 0 | 0 | 11 |
| Arech8 | 4 | 1 | 0 | 5 |
| frgossen | 0 | 0 | 0 | 10 |
| penpornk | 0 | 0 | 0 | 8 |
| philipphack | 3 | 2 | 0 | 2 |
| jaro-sevcik | 2 | 2 | 1 | 1 |
| Zoey-Cheng | 5 | 1 | 0 | 0 |