
Can PyTorch/XLA wheel for release branch build with cxx_abi disabled? #5325

Closed
vanbasten23 opened this issue Jul 19, 2023 · 5 comments

@vanbasten23
Collaborator

Hi @mateuszlewko ,

I wonder if the new wheel build process (with Ansible) can disable cxx_abi when it builds a torch_xla wheel for a release branch (such as r2.0). We recently built a torch_xla wheel (on pt/xla branch r2.0, cuda 11.8, python=3.10). From the log, it seems cxx_abi is still enabled (I see -D_GLIBCXX_USE_CXX11_ABI=1 in the log above, which makes me think it is enabled; please correct me if I'm wrong). Building an official torch_xla wheel with cxx_abi enabled makes it incompatible with torch's wheel.

What we used to do in the release branch is to first apply a torch patch (as in this pr), then disable cxx_abi (as in this pr). So my questions are:

  1. With Ansible, does the wheel have cxx_abi enabled?
  2. If so, is it possible with Ansible to apply the torch patch first and then set the flag to false?

Thanks.

cc: @JackCaoG @miladm

@mateuszlewko
Collaborator

Hey,

First some background information.

Building process

  1. All build-related tasks are present in this role:
    https://github.com/pytorch/xla/blob/master/infra/ansible/roles/build_srcs/tasks/main.yaml.
    They should be self-descriptive, but please reach out if something is unclear.
  2. For non-nightly releases, you should look at the Ansible setup at a given tag, branch, or commit, so in this case branch r2.0: infra/ansible/roles/build_srcs/tasks/main.yaml#L32-L36. This is also the step that builds torch_xla.
  3. Each ansible.builtin.command task in Ansible gets a separate shell
    environment, i.e. previous tasks do not pollute the env vars of other tasks.
  4. Having said that, most tasks load the env_vars Ansible dict
    (https://github.com/pytorch/xla/blob/r2.0/infra/ansible/roles/build_srcs/tasks/main.yaml#L36), which is a combination of vars from the config: infra/ansible/config/env.yaml#L21-L48
    (depending on the arch and accelerator, in this case: common + amd64 + cuda). Implementation detail: the dicts are combined here.
  5. The task that builds the XLA computation client library always sets the parameter -D_GLIBCXX_USE_CXX11_ABI=1 (in addition to env_vars). This was carried over from the pre-Ansible setup and can be changed easily. This task does not run on the master branch (nightly releases) thanks to the Bazel migration.
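To make the env_vars mechanism above concrete, here is a rough sketch of what such a build task looks like; the task name, command, and paths are illustrative, not copied from the repo:

```yaml
# Illustrative sketch only; see infra/ansible/roles/build_srcs/tasks/main.yaml
# for the real tasks. Each command task runs in a fresh shell environment
# populated from the merged env_vars dict.
- name: Build torch_xla wheel          # hypothetical task name
  ansible.builtin.command:
    cmd: python setup.py bdist_wheel   # hypothetical build command
    chdir: /src/pytorch/xla            # hypothetical source directory
  environment: "{{ env_vars }}"        # common + arch + accelerator vars
```

Because each task declares its environment explicitly, changing a single key in the shared env_vars dict propagates to every task that loads it.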

To answer your first question: cxx_abi is explicitly enabled for the XLA computation client library. It's not explicitly set for PyTorch/XLA, but I see in the logs that it's set anyway (search for "Determined _GLIBCXX_USE_CXX11_ABI=1" in the logs).
I think you need to set it explicitly to 0.
Now the question is: do you want to disable it for all builds or just 2.0? If just 2.0, then push a new commit to the r2.0 branch with the following modifications:

  1. Remove -D_GLIBCXX_USE_CXX11_ABI=1 from the "Build XLA computation client
    library" task: infra/ansible/roles/build_srcs/tasks/main.yaml#L27.

  2. Add the common env var _GLIBCXX_USE_CXX11_ABI=0 in https://github.com/pytorch/xla/blob/r2.0/infra/ansible/config/env.yaml#L22.
    This will be picked up by all tasks that have environment: "{{ env_vars }}".
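Step 2 amounts to an addition along these lines in env.yaml; the surrounding key structure is a sketch, and the exact nesting in the real file may differ:

```yaml
# infra/ansible/config/env.yaml (sketch; exact surrounding keys may differ)
build_env:
  common:
    # Disable the C++11 ABI for every task that loads env_vars
    _GLIBCXX_USE_CXX11_ABI: 0
```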

Applying patches

Sure, it's easy to apply patches with Ansible. An example of applying TF
patches: https://github.com/pytorch/xla/blob/r2.0/infra/ansible/roles/fetch_srcs/tasks/main.yaml#L29-L40.
Simply add another task there with the correct directory.
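Such a task, modeled loosely on the TF patch example linked above, could look like this; the task name, patch path, and checkout directory are all hypothetical:

```yaml
- name: Apply torch patches            # hypothetical task name
  ansible.builtin.shell: |
    # torch_cxx_abi.patch is a hypothetical patch file name
    git apply /src/patches/torch_cxx_abi.patch
  args:
    chdir: /src/pytorch                # hypothetical checkout directory
```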

Testing your changes locally

You can test your changes locally (assuming you have Docker installed) by
running the same docker build command as in the cloud build step:
https://screenshot.googleplex.com/6imoM249wTWp2NF.

In the infra/ansible directory, run:

docker build -f=Dockerfile . --build-arg=accelerator=cuda \
--build-arg=arch=amd64 --build-arg=cuda_version=11.8 \
--build-arg=git_tag=v2.0.0 --build-arg=package_version=2.0 \
--build-arg=python_version=3.10 \
--build-arg=ansible_vars='{"accelerator":"cuda","arch":"amd64","cuda_version":"11.8","git_tag":"v2.0.0","package_version":"2.0","python_version":"3.10","pytorch_git_rev":"v2.0.0","xla_git_rev":"v2.0.0"}' -t=local_image
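After installing the resulting wheels, one quick sanity check for ABI compatibility is to ask torch which ABI it was compiled with (torch_xla must be built with the same setting). This assumes a Python environment where torch is already installed:

```shell
# Prints True if torch was compiled with -D_GLIBCXX_USE_CXX11_ABI=1,
# False otherwise; torch_xla must match this setting to be compatible.
python -c "import torch; print(torch.compiled_with_cxx11_abi())"
```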

Hope it helps,
Mateusz

@vanbasten23
Collaborator Author

Thanks @mateuszlewko. I'll give it a try.

@vanbasten23
Collaborator Author

The pr has been merged and we started the build, but the build seems to be failing.

@vanbasten23
Collaborator Author

vanbasten23 commented Jul 25, 2023

It looks like I need to update the tag v2.0.0, which I did. But the build still failed: log:

  • At first, it was able to check out the correct commit:
Initialized empty Git repository in /workspace/.git/
From https://github.com/pytorch/xla
 * branch            3b7798db3dd6ee1fc0550a332f13d06db3e8d169 -> FETCH_HEAD
HEAD is now at 3b7798d Disable cxx abi in ansible when building pt/xla for branch r2.0 (#5332)
BUILD
Starting Step #0 - "git_fetch"
Step #0 - "git_fetch": Already have image (with digest): gcr.io/cloud-builders/git

Notice that commit 3b7798db3dd6ee1fc0550a332f13d06db3e8d169 is the one I recently pushed to the r2.0 branch.

cc: @ManfeiBai

@vanbasten23
Collaborator Author

I'm able to create a new r2.0 wheel now.
