Weekly Project News


Weekly GitHub Report for XLA: October 27, 2025 - November 03, 2025 (12:01:22)

Weekly GitHub Report for XLA

Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.


Table of Contents

  • I. News
    • 1.1. Recent Version Releases
    • 1.2. Other Noteworthy Updates
  • II. Issues
    • 2.1. Top 5 Active Issues
    • 2.2. Top 5 Stale Issues
    • 2.3. Open Issues
    • 2.4. Closed Issues
    • 2.5. Issue Discussion Insights
  • III. Pull Requests
    • 3.1. Open Pull Requests
    • 3.2. Closed Pull Requests
    • 3.3. Pull Request Discussion Insights
  • IV. Contributors
    • 4.1. Contributors

I. News

1.1 Recent Version Releases:

No recent version releases were found.

1.2 Other Noteworthy Updates:

No other noteworthy updates were found.

II. Issues

2.1 Top 5 Active Issues:

We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.

  1. 🚨 [GPU][AOT/JIT] Runtime crash when executing bounded dynamic shape HLO: slice extent must be smaller than buffer size: This issue reports a runtime crash occurring when executing GPU AOT- or JIT-compiled executables that use bounded dynamic shapes, triggered by a buffer size mismatch error during a slice operation. The problem is traced to a bug in the runtime implementation of the PadToStatic custom call, where device-side parameters lack the necessary metadata for dynamic dimension sizes, causing the buffer to be too small and leading to a failed check and crash.

    • The comments clarify that unbounded dynamic shapes are not supported, while bounded dynamic shapes rely on padding to a maximum size with an additional parameter specifying the actual size at runtime. A workaround avoiding the PadToStatic custom call is suggested, and it is explained that dynamic shapes do not significantly increase JIT overhead since the padded shape is compiled once and reused. However, compiled kernels for bounded dynamic shapes may be less efficient due to padding to the upper bound.
    • Number of comments this week: 5
  2. [XLA:GPU] Jax example produces CUDA_ERROR_INVALID_VALUE: invalid argument: This issue reports a CUDA error encountered when running a specific JAX program on GPU, where the XLA:GPU backend generates a kernel with invalid arguments, resulting in a CUDA_ERROR_INVALID_VALUE during execution. The user provides a minimal reproducible example along with system details and an extensive HLO (High-Level Optimizer) dump to aid in diagnosing the problem.

    • The single comment in the thread shares a detailed HLO dump of the JAX computation, which includes multiple regions and operations, aiming to help identify the source of the CUDA invalid argument error during kernel execution.
    • Number of comments this week: 1
  3. [XLA:GPU] Very low utilization for lax.ragged_all_to_all on 8x B200, but near optimal utilization cross-host: This issue discusses the unexpectedly low bandwidth utilization of the lax.ragged_all_to_all operation on a single host with 8 B200 GPUs connected via NVLink, contrasted with near-optimal bandwidth utilization observed across multiple hosts. The user seeks confirmation on whether this behavior is expected and requests potential optimizations, especially regarding scaling beyond two hosts.

    • The single comment simply tags two other contributors, likely to bring their attention to the issue, without providing further analysis or solutions.
    • Number of comments this week: 1

Since there were fewer than 5 open issues, all of the open issues have been listed above.
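
For the bounded dynamic-shape crash in issue 1 above, a common user-side workaround is to pad inputs to a static upper bound outside the compiled function and carry the true length as a separate argument, so the executable only ever sees static shapes and the PadToStatic custom call is never involved. The sketch below is a minimal, hypothetical illustration; the MAX_LEN bound, shapes, and masking scheme are assumptions for illustration, not details taken from the issue.

```python
import jax
import jax.numpy as jnp

MAX_LEN = 1024  # assumed static upper bound for the dynamic dimension

@jax.jit
def masked_sum(padded, true_len):
    # `padded` always has the static shape (MAX_LEN,); `true_len` says how much is real.
    mask = jnp.arange(MAX_LEN) < true_len
    return jnp.where(mask, padded, 0.0).sum()

def run(x):
    # Pad the variable-length input up to MAX_LEN outside the compiled function,
    # so XLA never has to materialize a bounded dynamic shape.
    padded = jnp.zeros(MAX_LEN, x.dtype).at[: x.shape[0]].set(x)
    return masked_sum(padded, x.shape[0])

print(run(jnp.ones(300)))  # 300.0; the jitted function compiles once for MAX_LEN
```

Every input length shares the single padded compilation, which matches the trade-off described in the thread: one compile, but kernels sized for the upper bound.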

2.2 Top 5 Stale Issues:

We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.

  1. New nvshmem rule breaks the build: This issue reports a build failure caused by a new nvshmem rule introduced in a recent update, which leads to an error related to the absence of a getenv method in the repository_ctx object during the CUDA configuration step. The reporter is seeking guidance on whether they need to make changes on their side to resolve this problem or if the fix must come from the OpenXLA project, specifically regarding the timing and details of a potential update to the cuda_configure settings.
  2. Failed to Parse MLIR generated by Torchax: This issue reports a failure to parse MLIR code generated by the Torchax export API when attempting to compile a Torch model using XLA AOT compilation, specifically encountering an error related to an unregistered operation 'vhlo.rsqrt_v2' in the VHLO dialect. The user provides detailed reproduction steps, including code snippets and environment information, and seeks assistance resolving the deserialization error caused by unsupported operations in the StableHLO artifact.
  3. support bazel modules: This issue requests the adoption of Bazel modules within the project, highlighting that Bazel modules have seen significant adoption in the community. The reporter notes that XLA is currently the only package in their Bazel build that does not support these modules, implying a need for improved compatibility.
  4. Gpu collective performance model bug: This issue addresses a bug in the gpu_collective_performance model where the update to lowLatencyBandwidth for AMD links was applied without corresponding changes to the CUDA section. As a result, invoking the gpu_collective_performance model with H100 settings leads to a failure, indicating incomplete handling of bandwidth parameters across different GPU architectures.
  5. Cross compile to ARM with custom gcc: This issue concerns difficulties encountered when attempting to cross-compile the XLA project from an x86 architecture to ARM64 using a custom GCC compiler. The user reports that despite using the --config=cross_compile_linux_arm64 flag, the Bazel build system continues to produce an x86 binary, indicating a potential misconfiguration or missing step in the cross-compilation process.

2.3 Open Issues

This section lists, groups, and then summarizes issues that were created within the last week in the repository.

Issues Opened This Week: 3

Summarized Issues:

  • GPU Execution Errors: Multiple issues describe failures during GPU execution, including runtime crashes and internal CUDA errors. These problems stem from buffer size mismatches causing segmentation faults and invalid kernel arguments triggering CUDA errors, both leading to unsuccessful program runs on GPU hardware.
  • [issues/33194, issues/33220]
  • GPU Bandwidth Utilization: One issue highlights very low bandwidth utilization of the lax.ragged_all_to_all operation on a single host with multiple GPUs connected via NVLink. This contrasts with near-optimal utilization across multiple hosts, raising questions about expected behavior and possible optimizations.
  • [issues/33386]
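
For the bandwidth question above, a rough way to sanity-check single-host utilization is to time a dense collective of known payload size and divide bytes moved by wall time. The sketch below uses the dense lax.all_to_all under pmap as a stand-in for the ragged variant discussed in the issue; the payload size is an arbitrary assumption and the resulting number is only a coarse lower bound on link bandwidth.

```python
import time
import jax
import jax.numpy as jnp
from jax import lax

n_dev = jax.local_device_count()
chunk = 1 << 22  # float32 elements per (device, peer) slice; arbitrary assumption
x = jnp.ones((n_dev, n_dev, chunk), dtype=jnp.float32)

# Each device holds an (n_dev, chunk) block and exchanges one slice with every peer.
exchange = jax.pmap(
    lambda v: lax.all_to_all(v, "d", split_axis=0, concat_axis=0),
    axis_name="d",
)

exchange(x).block_until_ready()   # warm-up / compile
t0 = time.perf_counter()
exchange(x).block_until_ready()
dt = time.perf_counter() - t0

sent = 4 * chunk * (n_dev - 1)    # bytes each device sends off-device
print(f"~{sent / dt / 1e9:.1f} GB/s per device across {n_dev} devices")
```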

2.4 Closed Issues

This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.

Issues Closed This Week: 1

Summarized Issues:

  • Compilation errors in XLA:GPU matrix multiplication: This topic covers issues related to compilation errors encountered during matrix multiplication on XLA:GPU, specifically when the contracting dimension is too small to be properly divided by the split_k_gemm_rewriter. These errors result in the message "Too small divisible part of the contracting dimension," indicating a problem with how the matrix shapes are handled during compilation.
  • issues/33157
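
For context on the shape pattern behind this error, the split-K rewrite tries to split the contracting (inner) dimension of a GPU matmul, which fails when that dimension is too small to divide. A hypothetical tall-and-skinny matmul in JAX looks like the sketch below; the concrete sizes are illustrative assumptions, not the ones from issues/33157, and may or may not trigger the error on a given build.

```python
import jax
import jax.numpy as jnp

# Large output dimensions, tiny contracting dimension (2): the shape class that
# leaves the split-K rewriter with too small a divisible part to work with.
a = jnp.ones((4096, 2), dtype=jnp.bfloat16)
b = jnp.ones((2, 4096), dtype=jnp.bfloat16)

out = jax.jit(jnp.dot)(a, b)
print(out.shape, out.dtype)  # (4096, 4096) bfloat16
```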

2.5 Issue Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.


III. Pull Requests

3.1 Open Pull Requests

This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Opened This Week: 17

Key Open Pull Requests

1. Adding a delayMoveToHost heuristic to LHS and related tests. : This pull request introduces a new delayMoveToHost heuristic to the GPU latency-hiding scheduler (LHS) to improve the overlap of device-host data transfers with computation, better hiding D2H/H2D data movement, and includes related unit and execution tests to validate the feature.

  • URL: pull/33240
  • Merged: No
  • Associated Commits: b4d69, 91f1a, b94cc, 19bce, df83f

2. [ROCm] make rbe docker image parametrized move to local config rocm: This pull request makes the ROCm RBE Docker image configurable by moving the RBE platform settings to a local configuration file, enabling the image to be parameterized to align with the CI worker node requirements.

  • URL: pull/33261
  • Merged: No
  • Associated Commits: 3054e, fc63f, 1948a, bafaf, 10984

3. [XLA:GPU][oneAPI] Enable RBE for the ONEAPI presubmit: This pull request aims to enable Remote Build Execution (RBE) for the ONEAPI presubmit process in the XLA GPU project to improve build efficiency through remote caching and parallelism.

  • URL: pull/33388
  • Merged: No
  • Associated Commits: b31cc, cda82, dac54, a66be
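
As background for the delayMoveToHost work in key pull request 1, JAX exposes device-to-host movement through memory kinds, and whether the resulting D2H copy actually overlaps with compute is left to XLA's latency-hiding scheduler. The sketch below is a minimal, hypothetical illustration of such an offload; it assumes a recent JAX build with pinned_host memory-kind support on a single GPU and does not reflect the heuristic itself.

```python
import jax
import jax.numpy as jnp
from jax.sharding import SingleDeviceSharding

dev = jax.devices()[0]
host = SingleDeviceSharding(dev, memory_kind="pinned_host")  # assumes pinned_host support

x = jnp.ones((4096, 4096), dtype=jnp.float32)

@jax.jit
def step(w, activations):
    # The D2H copy of `activations` and the matmul on `w` are independent,
    # so a latency-hiding scheduler is free to overlap them.
    offloaded = jax.device_put(activations, host)
    return w @ w.T, offloaded

y, saved = step(x, x)
print(y.shape, saved.sharding.memory_kind)
```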

Other Open Pull Requests

  • Error Documentation Pages: Multiple pull requests add documentation pages for specific error codes, including Error 0100, Error 0101, and Error 0102. These contributions include creating skeleton pages and removing detailed error information to be handled separately, enhancing the project's error documentation structure.
    • pull/33199, pull/33200, pull/33201
  • Complex Number Operations and Tests: Enhancements to the HloEvaluator enable support for more complex number operations such as trigonometric and hyperbolic functions, accompanied by tests verifying constant folding. Additionally, improvements to the complex exponential function implementation include tests and recommendations for merging order to ensure correctness.
    • pull/33212, pull/33361
  • GPU Performance Optimizations: Several pull requests focus on GPU-related improvements, including optimizing all-gather operations by reducing transposes, introducing cross-host data transfer APIs for better communicator caching and NCCL group calls, and adding a configurable knob to limit asynchronous compute resources. These changes aim to improve GPU performance and resource management.
    • pull/33260, pull/33269, pull/33284
  • Build and CI Improvements: Pull requests address build issues such as fixing the IntelGpuCompiler after API changes, assigning GPU-specific pools for test actions in CI, cleaning up rpath definitions for hermetic builds, and adding keepalive timeout settings to prevent connection drops during remote build execution. These changes enhance build stability and CI efficiency.
    • pull/33211, pull/33363, pull/33413, pull/33414
  • Debugging and Testing Utilities: A dummy test pull request is created to evaluate RBE performance with commits for initial testing and forced reruns, and another pull request enables dumping command buffer graph dot files to a configurable directory for easier debugging in environments like slurm.
    • pull/33282, pull/33363
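
As a small illustration of the complex-exponential behavior exercised by the HloEvaluator pull requests above, Euler's formula exp(a + bi) = exp(a)·(cos b + i·sin b) can be checked numerically against jnp.exp; the sample values and tolerances below are assumptions chosen only for illustration.

```python
import jax.numpy as jnp

z = jnp.array([0.5 + 1.2j, -3.0 + 0.1j, 2.0 - 2.0j], dtype=jnp.complex64)

# Reference via Euler's formula: exp(a + bi) = exp(a) * (cos b + i sin b).
expected = jnp.exp(z.real) * (jnp.cos(z.imag) + 1j * jnp.sin(z.imag))
actual = jnp.exp(z)

print(jnp.allclose(actual, expected, rtol=1e-5, atol=1e-6))  # expected: True
```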

3.2 Closed Pull Requests

This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.

Pull Requests Closed This Week: 6

Key Closed Pull Requests

1. [ROCm] Fix hermetic tests when executing on rbe worker: This pull request aims to fix hermetic tests when executing on remote build execution (rbe) workers by correcting dependency handling to ensure the correct ROCm libraries are loaded from the appropriate data directories, preventing test failures caused by improper local RPATH settings.

  • URL: pull/33188
  • Merged: No
  • Associated Commits: 43913, ffb62, 339c4, edd57

2. Rename "forward compatible" capabilities to "family compatible", per NVIDIA naming.: This pull request proposes renaming the "forward compatible" capabilities to "family compatible" in accordance with NVIDIA's updated naming conventions for architecture features introduced in CUDA 12.9.

  • URL: pull/33117
  • Merged: No
  • Associated Commits: 331d4

3. Update .clang-format: This pull request attempts to update the .clang-format file to correct invalid values that were causing YAML parsing errors and preventing proper formatting on the contributor's machine.

  • URL: pull/33125
  • Merged: No
  • Associated Commits: 5db99

Other Closed Pull Requests

  • CUDA graph debugging improvements: This pull request introduces a CUDA graph dump option that restricts VLOG output to the primary command buffer graph only. This change helps simplify debugging by avoiding overwhelming the screen log with nested CUDA graph details.
  • pull/33149
  • GPU reduce-precision simplification bug fix: This pull request fixes a bug where the GPU reduce-precision simplification was unintentionally disabled by a previous commit. The fix restores the intended functionality of the simplification process.
  • pull/33205
  • GemmFusionAutotuner bug fix: This pull request resolves a bug in the GemmFusionAutotuner by clearing multi-stream attributes when extracting HLO instructions into new modules for autotuning. It resets stream IDs to the default stream to prevent segmentation faults caused by invalid stream references in the autotuner's isolated execution context.
  • pull/33238

3.3 Pull Request Discussion Insights

This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.

Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.


IV. Contributors

4.1 Contributors

Active Contributors:

We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.

If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.

| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| alekstheod | 49 | 10 | 0 | 2 |
| rao-ashish | 3 | 2 | 0 | 20 |
| shawnwang18 | 7 | 6 | 0 | 0 |
| sergachev | 7 | 3 | 0 | 2 |
| emilyfertig | 0 | 0 | 0 | 12 |
| dimvar | 6 | 5 | 0 | 0 |
| mingxu1067 | 6 | 2 | 0 | 3 |
| mtsokol | 6 | 4 | 0 | 0 |
| mmakevic-amd | 7 | 0 | 0 | 0 |
| olegshyshkov | 0 | 0 | 0 | 7 |
