pin pytorch source builds nightly to speed up CI #1328

Closed

powderluv opened this issue Aug 31, 2022 · 17 comments

@powderluv
Collaborator

Pin the PyTorch bump to a nightly builder -- that also populates the ccache for the day.

@silvasean
Contributor

The other option here is to do what we do for LLVM: a deliberate ~weekly bump of the PyTorch dependency.

@powderluv
Collaborator Author

I am hoping the infra to bump will be the same -- we can adjust how frequently it runs (nightly / weekly, etc.). When it fails it will still need someone to fix it up, but hopefully we keep the green commit moving without hand-holding.

We will also have to stop the CI jobs from uploading their own caches.

@powderluv
Collaborator Author

@ashay I have a WIP on how I think we could pin PyTorch and roll forward only when our tests pass. That way we keep the benefit of automatically staying very close to PyTorch like we do today -- but we don't roll into a broken master.

I have a pseudo-code branch here: https://github.com/llvm/torch-mlir/tree/RollPyTorch if you want to give it a crack.

Basically we add a nightly job that runs our current OOT + PyTorch source build against top-of-master PyTorch. If it passes, we commit the torch_mlir_pytorch_version file.

On the CI side we then load the pinned SHA. Happy to discuss any other suggestions -- I just wanted to dump what I had in mind.
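A rough sketch of that flow (hypothetical script and file names, assuming a scheduled GitHub Actions job; this is not the actual content of the RollPyTorch branch):

    #!/usr/bin/env bash
    # Sketch of the nightly roll: build against top-of-master PyTorch and only
    # advance the pin when our own tests pass. Script names are placeholders.
    set -euo pipefail

    # 1. Resolve the current top-of-master PyTorch commit.
    PYTORCH_SHA=$(git ls-remote https://github.com/pytorch/pytorch.git HEAD | cut -f1)

    # 2. Run the OOT + PyTorch source build and tests against that commit
    #    (stand-in for the real build/test invocation).
    ./build_tools/build_and_test_against_pytorch.sh "${PYTORCH_SHA}"

    # 3. Only on success, commit the new pin; regular CI reads this file
    #    instead of chasing PyTorch master directly.
    echo "${PYTORCH_SHA}" > torch_mlir_pytorch_version
    git add torch_mlir_pytorch_version
    git commit -m "Roll PyTorch to ${PYTORCH_SHA}"
    git push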

FYI @sjain-stanford @makslevental

@ashay
Collaborator

ashay commented Sep 23, 2022

Building on top of what you had, I pushed a new commit to the RollPyTorch branch to fetch the binary release so that we don't need to build from source. Let me know if I'm veering too far off from what you had in mind! Thanks much!

This commit uses the pip index versions command and some bash commands, but we could change it to use Python if necessary. The fetched PyTorch version is written to pytorch-requirements.txt, which I've kept separate from the top-level requirements.txt so that CI can write to just this one file. Of course, pytorch-requirements.txt is referenced by requirements.txt, so end users can continue to run pip install -r requirements.txt and they'll get all the modules required to build Torch-MLIR. I also changed the GitHub Actions script to do a Release build instead of an out-of-tree build to get better confidence.
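For reference, the split could look roughly like this (illustrative contents; the exact query and file layout in the branch may differ):

    # Resolve the newest nightly torch version from the CPU nightly index.
    VERSION=$(python -m pip index versions \
        -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html --pre torch \
      | grep "Available versions" | tr ' ' '\n' | sort -nr | head -n1 | tr -d ',')

    # Pin it in the one file that CI is allowed to rewrite...
    printf '%s\n' \
      "-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html --pre" \
      "torch==${VERSION}" > pytorch-requirements.txt

    # ...while the top-level requirements.txt simply includes it, so end users
    # keep running `pip install -r requirements.txt` unchanged:
    #   -r pytorch-requirements.txt
    #   <other build and test dependencies>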

I haven't tested a full CI run, but if this approach is generally along the lines of what we want, I'll go ahead and try this out in CI.

@powderluv
Collaborator Author

Oh thank you for the commit. I think pinning the binary is good too, but I think we still need pure source builds for downstream customers. They will have long-supported PyTorch forks (long after the PyTorch nightlies are removed) to enable a hermetic build environment without an external binary.

Do you know any way we can find the SHA of the particular binary you get with
python -m pip index versions -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html --pre torch | grep "Available versions" | tr ' ' '\n' | sort -nr | head -n1 | tr -d ',' ?

We could use that SHA for the source builds CI and also pin the exact same binary for binary builds.
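One way to recover that SHA (an assumption about the wheel layout, though torch nightly wheels do ship a torch/version.py with a git_version field) is to download the pinned wheel and read it directly:

    # Sketch: fetch the matching nightly wheel and read the PyTorch commit it
    # was built from, without installing anything. The version below is a
    # hypothetical pin.
    VERSION="1.13.0.dev20220926+cpu"
    python -m pip download --no-deps --pre -d /tmp/torch-wheel \
        -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \
        "torch==${VERSION}"

    # Wheels are zip archives; print torch/version.py and pull out the commit.
    unzip -p /tmp/torch-wheel/torch-*.whl torch/version.py | grep git_version
    # -> git_version = '<40-character PyTorch commit SHA>'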

Feel free to add any commits to the branch. We can clean up / review / commit once we figure it out.

@ashay
Collaborator

ashay commented Sep 26, 2022

Here's a quick summary of our discussion from the Developer Hour meeting, just so that it's easier to access later.

I have updated the branch to fetch the commit hash for the nightly release from the whl file. Assuming that the GitHub Actions script in RollPyTorch runs once a day, it should be able to pick up the most recent whl file, write the pytorch-requirements.txt file (which is referenced by the top-level requirements.txt file), and validate whether an out-of-tree build works.

Both the Torch-MLIR release version and the PyTorch commit hash are written to a file called release-info at the root of the repository (although, if we want to write this information to separate files, we should consider putting it into a separate directory).

The part that remains, however, is pushing the updated pytorch-requirements.txt file and the release-info file to the main branch. Since the main branch is protected, I presume we'll need to create a PR from within the GitHub Actions script, which one of us will have to approve. Is there another way that doesn't require creating a PR?
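One possible shape for that last step, assuming the workflow is given a token that may bypass the protection (option A) or that a PR turns out to be acceptable (option B); this is a sketch, not the code in the branch:

    # Hypothetical final step of the RollPyTorch workflow.
    git config user.name  "torch-mlir rollbot"
    git config user.email "rollbot@users.noreply.github.com"
    git add pytorch-requirements.txt release-info
    git commit -m "Bump PyTorch pin" || exit 0   # nothing to push if unchanged

    # Option A: direct push (needs bypass rights on the protected branch).
    git push origin HEAD:main

    # Option B: open a PR instead, so presubmits still run.
    # git push origin HEAD:rollpytorch-update
    # gh pr create --title "Bump PyTorch pin" --body "Automated nightly roll."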

@powderluv
Collaborator Author

Thanks @ashay

I don't think we want to save our Torch-MLIR version in the GitHub repo. That is calculated by the release build and named accordingly when the release is created, and we don't want the RollPyTorch.yml builder to conflict with that. So just the PyTorch SHA is sufficient, I would think. I would also suggest pytorch-version.txt, just like https://github.com/pytorch/pytorch/blob/master/version.txt, since it is not "release info" yet -- just a PyTorch pin, like a git submodule.

BTW, curious if we can drop the +cpu suffix in pytorch-requirements.txt, since we point to the CPU nightly page anyway? That would fix #1381.
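If we do drop it, the change could be as small as stripping the local-version suffix from the generated pin (illustrative; the actual edit in the branch may differ):

    # Turn "torch==1.13.0.dev20220926+cpu" into "torch==1.13.0.dev20220926";
    # pip then resolves the CPU build from the nightly index given via -f.
    sed -i 's/+cpu$//' pytorch-requirements.txt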

@silvasean
Contributor

@GMNGeoffrey @stellaraccident -- any idea how to push an updated SHA1 hash directly into the repo? Do we need to disable branch protection somehow? We currently have branch protection for CI and unresolved conversations.

@GMNGeoffrey
Member

In IREE we have it set to require pull request reviews, but admins can override. That means that admins can make direct pushes to the repo:

[screenshot of the branch protection settings showing the admin-override option]

This is nice because it gives us a break-glass option when CI is stuck or something, but also a bit terrifying because running the wrong git push command can break things, which is why I have a little git command to set my upstream push URL to something invalid except when I explicitly activate it: https://gist.github.com/42dd9a9792390094a43bdb69659320c0
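The gist itself isn't reproduced in this thread, but the usual shape of that trick is roughly the following (an assumed sketch, with a placeholder repo URL):

    # Make accidental pushes to the upstream remote fail by default.
    git remote set-url --push upstream DISABLED_INTENTIONALLY

    # When a direct push is really intended: restore, push, disable again.
    git remote set-url --push upstream git@github.com:<org>/<repo>.git
    git push upstream main
    git remote set-url --push upstream DISABLED_INTENTIONALLY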

@GMNGeoffrey
Member

GMNGeoffrey commented Sep 26, 2022

(Also note that from a security perspective, allowing admins to do this makes zero difference because if we check the box saying they can't, then as an admin they can always go uncheck it)

@ashay
Collaborator

ashay commented Sep 26, 2022

Thanks all! I have included a rudimentary change to push the updated files to the main branch (see commit 22eb82c), but I have not tried it at my end, for fear that it might do bad things (like recursive commits). If someone can vet it and/or run this in a controlled environment, that'd be very nice. Thanks!

I dropped the +cpu suffix, and everything seems to work well, although I'm wary of the change because regexes often create more problems than they solve. If something breaks, I'll be sure to revert the change. I've also dropped the release-info file. Instead, the hash is written to pytorch-version.txt.

@GMNGeoffrey
Member

Ah, I've gone back and read the context here. A PR might not be a bad idea anyway, since it will run presubmits. You could even set up a bot to auto-approve it so it just needs to pass required checks :-) I think the latter would require a second bot account. If you want to push directly, you'll need to use a bot account or app with the ability to bypass checks (you can configure which users can do that in settings):

[screenshot of the branch protection settings for bypassing required checks]

@powderluv
Collaborator Author

@ashay for testing we can push to the same RollPyTorch branch ref. Once it is working we can open a new PR.

@ashay
Collaborator

ashay commented Sep 30, 2022

PRs #1419 and #1438 address the pinning and periodic update of the PyTorch release, but we don't seem to have resolved the caching issue. We have had the auto-update in place for the last two days, and the LLVM builds hit the cache, but the libtorch builds don't, causing feature PRs in Torch-MLIR to spend about an hour and a half in CI.

@powderluv
Collaborator Author

I think now we turn off cache saving from the CI builds and have the nightly roll create the cache. I have another bug open with details on the cache issue (AFK, so I can't link it right now).
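A minimal sketch of what "turn off cache saving from the CI builds" could look like with ccache (assuming the jobs can tell whether they were triggered by the nightly schedule):

    # In per-PR CI jobs, treat ccache as read-only so they consume the cache
    # but never overwrite what the nightly roll populated.
    if [ "${GITHUB_EVENT_NAME:-}" != "schedule" ]; then
      export CCACHE_READONLY=1
    fi
    # The scheduled RollPyTorch job leaves ccache writable and uploads the
    # refreshed cache for the next day's builds.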

@powderluv
Collaborator Author

Is the cache issue #1323?

@powderluv
Collaborator Author

Let's close this and track the cache issue separately (since that happens with / without pinning).
