Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hangs for 2 minutes while storing buildx in cache #293

Closed
3 tasks done
nicks opened this issue Jan 25, 2024 · 9 comments
Closed
3 tasks done

hangs for 2 minutes while storing buildx in cache #293

nicks opened this issue Jan 25, 2024 · 9 comments

Comments

@nicks
Copy link
Contributor

nicks commented Jan 25, 2024

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

setup-buildx-action consistently hangs for 2 minutes.

Expected behaviour

No pause

Actual behaviour

pause

Repository URL

private

Workflow run URL

private

YAML workflow

Relevant snippet:

      - name: Docker Login
        uses: docker/login-action@v2
        with:
          username: my_username
          password: ${{ secrets.MY_PAT }}

      - name: Setup Hydrobuild
        uses: docker/setup-buildx-action@v3
        with:
          version: "lab:latest"
          driver: cloud
          endpoint: my-endpoint
          install: true

Workflow logs

::group::Inspect builder
Thu, 25 Jan 2024 15:18:40 GMT
Inspect builder
...
Thu, 25 Jan 2024 15:18:40 GMT
  ::endgroup::
Thu, 25 Jan 2024 15:20:48 GMT
##[debug]Node Action run completed with exit code 0

(Note the long pause between inspecting the builder and the node action completing)

BuildKit logs

No response

Additional info

@crazy-max spent some time poking at this and couldn't figure it out. but some additional data points:

  • The pause is consistently reproducible on the first run
  • If you re-run the action, the problem goes away
@nicks
Copy link
Contributor Author

nicks commented Jan 29, 2024

hmmm moving this out of vpn didn't fix the problem, not i'm even more confused. need to dig in more.

@nicks
Copy link
Contributor Author

nicks commented Jan 29, 2024

(i think there's some sort of caching issue at play that made it seem like doing it outside of vpn fixed it)

@nicks
Copy link
Contributor Author

nicks commented Jan 29, 2024

OK, i'm 90% sure the problem is this line -- https://github.com/docker/actions-toolkit/blob/9282d3e13b5c0ff919207d30e173b0504a8cd8df/src/buildx/install.ts#L334

if i comment out that line, the problem goes away. when it uploads the file, GHA does some sort of work in the background to keeps the step open. maybe the cache save is just slow?

@nicks nicks changed the title hangs for 2 minutes when run inside a VPN hangs for 2 minutes while storing buildx in cache Jan 29, 2024
@nicks
Copy link
Contributor Author

nicks commented Jan 29, 2024

(updated the description to make it clear this is not a vpn problem, but now i don't really know why i can consistently reproduce this in one repo but not others)

@nicks
Copy link
Contributor Author

nicks commented Jan 29, 2024

ahhh i think this might be a github actions bug

Screenshot from 2024-01-29 17-29-33

my new theory is that

  1. cache.save() is triggering a GC
  2. the GC is borked and just causes pauses

@nicks
Copy link
Contributor Author

nicks commented Jan 30, 2024

what if we move the cache.save() into the post part of the job? i think that's technically more correct anyway -- it's what other github actions stuff does

nicks added a commit to docker/actions-toolkit that referenced this issue Jan 30, 2024
i mocked this up while investigating
docker/setup-buildx-action#293

i guess this is SLIGHTLY better in that it makes the 2 minutes pause
that i'm seeing happen AFTER tests, but the pause is still there.

Signed-off-by: Nick Santos <nick.santos@docker.com>
@nicks
Copy link
Contributor Author

nicks commented Jan 30, 2024

posted this change that at least isolates the weird behavior - docker/actions-toolkit#241

nicks added a commit to docker/actions-toolkit that referenced this issue Jan 30, 2024
i mocked this up while investigating
docker/setup-buildx-action#293

i guess this is SLIGHTLY better in that it makes the 2 minutes pause
that i'm seeing happen AFTER tests, but the pause is still there.

Signed-off-by: Nick Santos <nick.santos@docker.com>
nicks added a commit to docker/actions-toolkit that referenced this issue Jan 30, 2024
i mocked this up while investigating
docker/setup-buildx-action#293

i guess this is SLIGHTLY better in that it makes the 2 minutes pause
that i'm seeing happen AFTER tests, but the pause is still there.

Signed-off-by: Nick Santos <nick.santos@docker.com>
@nicks
Copy link
Contributor Author

nicks commented Jan 31, 2024

one thing i don't really understand about setup-buildx-action's caching strategy is that it's creating a separate cache for each PR branch. (see screenshot)

Screenshot from 2024-01-31 12-26-07

wouldn't it make more sense for the cache to be global for the repo?

@nicks
Copy link
Contributor Author

nicks commented Feb 2, 2024

ok, we figured out the root cause. we don't think this is a setup-buildx-action issue. writing down the problem in case we see it again:

github actions uses a layered cache approach. each branch can write to exactly
one cache (the cache "owned" by the branch). But each branch implicitly reads
from multiple caches based on the branch structure. So if you're on a feature
branch, that branch can read from main and feature.

for more info on this, see:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache

this works really well if you run a github action on both pull and push (which
most ci systems built on github actions do)

the repo where we were having this problem only ran setup-buildx-action on pull, not on push.

this leads to weird and pathological cache behavior, because every branch starts
with an empty cache. Common values never get written back to the main branch
cache. And I think it's leading to the github actions cache implementation
behaving very strangely with weird pauses, etc because it has hundreds of
disconnected caches, each with their own duplicate copy of buildx.

i still think it might make sense to do move the cache save call until the post-action step (like other github actions do), but let's close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants