CI: ICC compiler package name change #1566
Conversation
Thank you for the update @rscohn2!
@rscohn2 looking at the latest GitHub Actions runs, it looks like the high-level DPC++ packages we currently use are getting too large to run on public CI. Would you recommend selecting a smaller sub-package in oneAPI that pulls in fewer dependencies?
Thank you for your help! |
```
/usr/bin/ld: final link failed: No space left on device
LLVM ERROR: IO failure on output stream: No space left on device
```
#1566 (comment)
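For debugging such failures, a throwaway step that reports disk usage when a job fails can help pinpoint what fills the runner's disk. This is a generic sketch (step name and paths are illustrative, not taken from this thread):

```yaml
- name: Report disk usage (debug only)
  if: failure()
  run: |
    # Overall free space per filesystem on the runner
    df -h
    # Largest items under the oneAPI install prefix, if present
    sudo du -sh /opt/intel/oneapi/* 2>/dev/null | sort -h | tail -n 20
```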
I don't think there are smaller packages. @mmzakhar: Any ideas? You are probably only using a few libraries from: |
Thank you @rscohn2! I just deleted …, but it turns out that the MKL package is the only one shipping … Potentially, creating more sub-packages that provide individual compilers and sub-aspects of MKL could be a way to reduce the binary size.
You mean to …? In case it is helpful, Nvidia's CUDA apt repo provides similarly large packages for their toolkit as oneAPI does. With those, the facade packages mainly declare dependencies on very fine-grained sub-packages that one can also select manually. I found this super helpful when building containers and pulling dependencies in resource-constrained environments, such as continuous integration: … For AMD, we also pull libraries like rocFFT manually to keep the install footprint small (see the sketch below):
WarpX/.github/workflows/dependencies/hip.sh
Lines 32 to 33 in faf15f0
|
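Since the referenced hip.sh lines are not embedded here, a minimal sketch of such a fine-grained install follows. It assumes the ROCm apt repository is already configured on the runner, and the package names are illustrative rather than the exact contents of hip.sh:

```yaml
- name: Install only the ROCm pieces we need
  run: |
    # Illustrative package selection; check the ROCm apt repo for exact names.
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends \
      rocm-dev \
      rocfft \
      rocrand
```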
Yes. I think it is the only short term solution. I will share a link to this topic with the package people. The assumption that installs are done once and go to very large disks no longer holds with virtualization/containerization. |
Here is a workaround for the sporadic "no space left on device" issue:
With this, the pipeline needs to succeed once, and the required files will be restored from the cache in subsequent runs, skipping the large install. The cache can be auto-updated when MKL changes in the APT repo. An example of such CI-cache usage can be found in https://github.com/oneapi-src/oneapi-ci/blob/master/.github/workflows/build_all.yml#L239 |
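For context, a minimal sketch of that pattern with actions/cache; the install script path and cache key are illustrative and not the exact workflow from the linked oneapi-ci example:

```yaml
steps:
- name: Restore oneAPI install from cache
  id: oneapi-cache
  uses: actions/cache@v2
  with:
    path: /opt/intel/oneapi
    # Key on something that changes when the desired package set changes;
    # the linked oneapi-ci example keys on the requested package versions.
    key: oneapi-${{ hashFiles('.github/workflows/dependencies/dpcpp.sh') }}
- name: Install oneAPI (only on cache miss)
  if: steps.oneapi-cache.outputs.cache-hit != 'true'
  run: .github/workflows/dependencies/dpcpp.sh
```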
@rscohn2 @mmzakhar The caching is a good hint, yet I think it is an orthogonal step that speeds up our CI install (it otherwise faces the same challenges in terms of temporary size). |
I could not find the link line in your logs to see what you use; I was expecting to see -lmkl_core and similar. If you are not using the SYCL interfaces to MKL, then these are a good candidate for removal: … And/or remove either the *.a or the *.so files in that directory, depending on which ones you use. |
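A rough sketch of such pruning after the apt install; the paths assume the default /opt/intel/oneapi layout and may need adjusting per release:

```yaml
- name: Prune unused MKL files to save disk space
  run: |
    # If MKL is linked dynamically, the static archives can go (or vice versa):
    sudo rm -rf /opt/intel/oneapi/mkl/latest/lib/intel64/*.a
    # If the SYCL/DPC++ interfaces to MKL are not used, drop their libraries too:
    sudo rm -rf /opt/intel/oneapi/mkl/latest/lib/intel64/*sycl*
```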
Thank you for the hint! Yes, the link line is not shown since it aborts before that step. I tried this in #1743 and got close: now we only run out of disk space in the linking step itself. Is there anything else we can remove from the compiler or MKL package? |
Maybe these FPGA files: |
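For illustration only, removing such FPGA support files could look roughly like this; the exact path is an assumption and depends on the installed compiler version:

```yaml
- name: Remove FPGA components (unused in this build)
  run: |
    # Illustrative path only; verify against the installed layout first.
    sudo rm -rf /opt/intel/oneapi/compiler/latest/linux/lib/oclfpga
```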
Oh nice, those are great chunks to slice off, and that brought us over the edge. Thank you! 🎉 |
Oh no, this is still too large and hits us again now. |
I did a PR against the branch for #1783 to remove the MKL .a files. It removes 1 GB, but I am still waiting for it to finish. We have not released anything new, and I verified that your deletes work as expected. Maybe not all GitHub VMs are the same size and you got lucky before. |
Oh, great idea! Thank you, that works! 🎉
Thanks for the info! Yes, I think we were pretty close and small fluctuations caused this. |
@ax3l: You can free up 1-2 GB by cleaning packages from the apt cache: https://github.com/oneapi-src/oneapi-ci/pull/43/files |
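In workflow terms, that cleanup is a small extra step after the install; a sketch (the linked PR may differ in detail):

```yaml
- name: Clean the apt package cache
  run: |
    # Drop downloaded .deb files and cached package lists
    sudo apt-get clean
    sudo rm -rf /var/lib/apt/lists/*
```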
Thank you for the hint! Added in #1841 |
@rscohn2 I noted that since about a day or two ago, and still ongoing today, CI jobs crash on apt with messages of the form:
Is there potentially a oneAPI release in progress such that the CDN servers are not updated in a single transaction? It's certainly a bit disruptive on 4 of the repos that I co-maintain; do you have a hint how we can make this more robust? :) |
There is a release in progress, but I would not expect it to cause errors over an extended period of time. I sent an email to the owner.
…On Mon, Mar 29, 2021 at 3:34 PM Axel Huebl ***@***.***> wrote:
@rscohn2 <https://github.com/rscohn2> I noted that since about a day or two and still ongoing today, CI jobs crash on apt with messages of the form:
Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-all/Packages.bz2 File has unexpected size (10383 != 6571). Mirror sync in progress? [IP: 104.127.244.160 443]
Hashes of expected file:
- Filesize:6571 [weak]
- SHA512:16f21c4f8c2b6ce59434685a5f0598a8a5328f321528e565ab0bba9c773d67011a27922832205e8303857520dd7678c17d7ea4fced3efcee432701c2a33404ae
- SHA256:b3204b9762e33c522c5a2c50160cd87227eee8607c8efcab76557324e7678eb7
- SHA1:050ca40f7cf212295f82df347bf9ad253024b8d5 [weak]
- MD5Sum:e45e36ac6473d85cce567d5f1bf9cdc8 [weak]
Release file created at: Wed, 24 Mar 2021 09:16:21 +0000
E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-amd64/Packages.bz2
E: Some index files failed to download. They have been ignored, or old ones used instead.
Error: Process completed with exit code 100.
Is there potentially a oneAPI release in process and the CDN servers are not updated in a single transaction?
|
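Not discussed further in this thread, but one generic way to make the apt step more tolerant of brief mirror inconsistencies is to retry it; a sketch:

```yaml
- name: Update apt package index (with retries)
  run: |
    # Let apt retry transient fetch failures, and retry the whole update
    # once more after a pause before failing the job.
    sudo apt-get update -o Acquire::Retries=3 \
      || { sleep 30; sudo apt-get update -o Acquire::Retries=3; }
```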
@rscohn2 With the latest DPC++ release that rolled out yesterday for the apt packages (2021.3.0), it looks like compile time increased significantly: our DPC++ build now takes >5 hrs on CI, where it took 34 min before (ICC and ICX are not affected). I wonder if that's a DPC++ compiler performance regression or some oversubscription of resources, e.g. more compile parallelism in the latest compiler release? We compile with |
They would notice if there was a broad regression, so it is probably something that needs a specific source to trigger. If you can provide a source file (preprocessed with -E) that shows the 20x slowdown, I can report it. Here is a suggestion for determining which file is slow: https://stackoverflow.com/a/5973540/2525421 |
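A rough way to follow that suggestion and locate the slow translation unit; the compiler invocation, flags, and source layout are illustrative, and a real build needs the project's include flags:

```yaml
- name: Time each translation unit (debug only)
  run: |
    # Compile each file on its own and print its wall time; the slowest
    # one is the candidate for a preprocessed reproducer.
    for f in $(find Source -name '*.cpp'); do
      /usr/bin/time -f "%e s  $f" dpcpp -c "$f" -o /dev/null || true
    done
    # The reproducer for the slow file can then be generated with -E:
    # dpcpp -E Source/slow_file.cpp -o slow_file.preprocessed.cpp
```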
@rscohn2, thank you.
It looks like this already hangs in the preprocessor. When I try to reduce this line to a
According to my process manager, the process that hangs (full CPU load, regular memory usage) is:
I attached
re-attaching a couple of times also shows this tip of the stack:
|
I reproduced the problem and filed a ticket. |
Thanks a lot! 👍 |
@ax3l : Here is the fix for the slowdown intel/llvm#4065 |
@rscohn2 that's fantastic, thanks a lot for triaging this! 🚀 ✨ |
@ax3l: I verified that it is working with a compiler release from github: https://github.com/intel/llvm/releases/tag/sycl-nightly%2F20210718 It is likely to be in update 4 because it was fixed relatively early in the release cycle. |
@rscohn2 thank you, that's fantastic! @WeiqunZhang also found a way to reduce our usage of recursive functions at the end of last week via #2063. So for testing, definitely use the same commit as before, since we just merged a commit that changed the behavior of the code for that routine. |
@rscohn2 Since about yesterday, we quite often see the error
when starting up our Ubuntu CI on GitHub Actions and downloading the oneAPI deb packages. Is it possible that the Intel CDN servers are not in a consistent state or that a release is going on? Our current setup looks like this: |
I asked about the failures and will let you know. GitHub Actions has a feature where you can cache an install. It will speed up the install, and I suspect it will make it more reliable since it saves a tarball in Azure storage. Here is an example: … Then make the install conditional on the cache restore failing: … Here is a successful restore: … |
No updates going on. Could it be the disk filling up? It may be that you sometimes get a VM with less disk space. Can you do a |
It turns out some packages (not the ones you are using) were updated, and the index file can be out of sync. They are trying to improve the reliability. If GitHub caching works for you, it would probably avoid the problem. |
@rscohn2 thank you, that's good to know! Thank you also for looking into the sync/reliability on updates of the index. I was not using caching yet, so that we get a new release when it drops, but that's a good idea to consider. |
@ax3l: I published a GitHub Action that installs oneAPI with caching and pruning of unnecessary files. Unless you request a specific version, you will get the latest: https://github.com/marketplace/actions/setup-oneapi. It will cache /opt/intel/oneapi at the end of the action, so you can manually prune as well. |