Uploading artifacts from GH actions consistently fails with 503 error #4185

Open
ericvergnaud opened this issue Mar 15, 2023 · 14 comments

@ericvergnaud
Contributor

GH builds regularly fail with the 2 cpp targets using gcc.
The error occurs not during the build/test itself, but when uploading the artifacts to GH.
The error is 503 (Service Unavailable).
Could it be that the artifact is too large (the log says 150994943 bytes)?
See https://github.com/antlr/antlr4/actions/runs/4423203675/jobs/7755704404
@hs-apotell would you be able to look into this?

@ericvergnaud ericvergnaud changed the title Uploading artifacts from GH actions consistently fail with 503 errors Uploading artifacts from GH actions consistently fails with 503 errors Mar 15, 2023
@ericvergnaud ericvergnaud changed the title Uploading artifacts from GH actions consistently fails with 503 errors Uploading artifacts from GH actions consistently fails with 503 error Mar 15, 2023
@ericvergnaud
Contributor Author

ericvergnaud commented Mar 15, 2023

Interestingly, re-running the failed jobs succeeds, and the last artifact size in a successful build is 'only' 104579386 bytes.
This shows inconsistency across builds and smells like a polluted reuse of a previous build...

@ericvergnaud
Contributor Author

Also it seems no tests are run for cpp builds... very weird

@hs-apotell
Collaborator

hs-apotell commented Mar 15, 2023

Notably, these builds weren't always failing; it seems to have started happening more consistently in recent times. Has anything substantial changed in the past few weeks that could correlate with the failures?

Digging into a few failed builds, the error is not always consistent either - sometimes 400, sometimes 503. But the errors are always network related, so rebuilds succeeding isn't surprising or unexpected.

> Could it be that the artifact is too large (log says 150994943 bytes)?

Size wouldn't matter here. We have other builds producing and uploading artifacts that are over 3GB; antlr doesn't generate anywhere close to that size. Also, the size of the uploaded artifact vs. the files on disk will differ because the upload action zips them.

> This shows inconsistency across builds and smells like a polluted reuse of a previous build...

Every build runs on a pristine VM. There is no pollution. If the sizes are different across builds, then the generated file sizes on disk are different. How, why, which - those are questions we can follow up on. But VM pollution is not an issue.

> Also it seems no tests are run for cpp builds... very weird

Having no tests for the cpp builds is intentional. The cpp natives are built twice, once using cmake directly (i.e. not using the java wrappers) so that warnings/errors can be captured. Tests are not a concern for these builds; they are run as part of the other builds.
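
(For illustration only, a rough sketch of that split - the job names, runner, and commands below are hypothetical, not the actual antlr4 workflow:)

```yaml
jobs:
  cpp-lib-build:             # hypothetical job: build the C++ runtime directly with cmake
    runs-on: ubuntu-latest   # so compiler warnings/errors land in this job's log
    steps:
      - uses: actions/checkout@v3
      - run: |
          cmake -S runtime/Cpp -B build
          cmake --build build
  cpp-runtime-tests:         # hypothetical job: the runtime tests run here, via the java wrappers
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B test     # placeholder for the actual test invocation
```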

I will investigate further to narrow down the root cause of the failure.

@hs-apotell
Collaborator

I hope this explains it - actions/upload-artifact#270

The failures started happening when I upgraded the specific GitHub action from v2 to v3 on 11/27/2022.

I will create a new PR with the recommended fix for the issue.
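
(A minimal sketch of what that pin looks like in a workflow step; the step name, artifact name, and path are assumptions, not the actual antlr4 configuration:)

```yaml
- name: Upload artifacts
  # was: actions/upload-artifact@v3 -- the v3 release fails intermittently with 400/503
  uses: actions/upload-artifact@v2
  with:
    name: cpp-build-output       # assumed artifact name
    path: antlr4-cpp-build.tgz   # assumed path to the packaged output
```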

@ericvergnaud
Contributor Author

Thanks for this.
Not sure I understand your comments re testing. Can you point me to a cpp job that does run the tests?

@ericvergnaud
Contributor Author

Ah, I get it now: the cpp job is for building the lib, and then the regular job uses it for testing. And the segregation is for building with different 'flavors'... thanks.

@hs-apotell
Collaborator

Maybe the jobs could use some renaming to drive the intent home. Any suggestions?

@ericvergnaud
Contributor Author

`build-cpp-library`?

hs-apotell added a commit to hs-apotell/antlr4 that referenced this issue Mar 15, 2023
GitHub action for upload was upgraded to v3 recently and the release is
unstable, causing too many uploads to fail. Reverting that change to go
back to using v2.

Unfortunately, this change also downgrades the use of Node.js to 12, which
is deprecated, generating too many warnings in build output. Favoring
warnings over failed builds.

Signed-off-by: HS <hs@apotell.com>
@jimidle
Collaborator

jimidle commented Mar 16, 2023

If this is truly a network issue, should we not report this to GitHub?

@kaby76
Contributor

kaby76 commented Mar 16, 2023

I've had a ton of network errors with GitHub Actions in grammars-v4. It was particularly bad for the Mac servers, which I believe are sub-par hardware (but there's no `/proc/cpuinfo`, and `arch` and `uname -a` don't give squat). To get around all the network mess, I had to write code to do builds with retries. I also try to avoid certain times of the day with some big PRs.

(Eventually, the only thing that really, really fixed the problem was to make the builds only work on the changed grammars, so the network wasn't being pounded to death by all the simultaneous builds. I can only guess that GitHub probably virtualizes multiple machines on one piece of hardware, which still has only one shared network link. Your workflow spawns 33 builds!)
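
(For reference, a build-with-retries step might look like the sketch below; the build command and retry count are placeholders, not the actual grammars-v4 setup.)

```yaml
- name: Build with retries
  shell: bash
  run: |
    # Retry the build a few times to ride out transient network failures.
    for attempt in 1 2 3 4 5; do
      if mvn -B test; then      # placeholder build command
        exit 0
      fi
      echo "Attempt ${attempt} failed; retrying in 30s..."
      sleep 30
    done
    exit 1
```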

I looked at the code for upload-artifact. The error is raised here. Perhaps you could fork a copy, create your own "antlr-upload-archive", and employ a retry of the crappy retry. Maybe if you retry a good number of times, things might eventually work.

Unfortunately, the toolkit hardwires the retry count to 5, and does not offer an API to modify the value.

There was some issue somewhere in GitHub Actions that mentioned the last "chunk" was having problems. Maybe this is it? But you don't do an `ls -l *.tgz` in the "Prepare artifacts" step, so there's no way to know how big the file really is, or whether the last chunk is being sent.
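
(Something like the following at the end of that step would at least log the real size; everything other than the `ls -l *.tgz` line is assumed:)

```yaml
- name: Prepare artifacts
  shell: bash
  run: |
    # ... existing packaging commands ...
    ls -l *.tgz   # log the exact size of the file(s) that will be uploaded
```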

@hs-apotell
Collaborator

Yes, it is a network issue but not a GitHub issue. It seems to be related to the implementation of the upload-artifact action itself: this worked in the previous version but fails with the latest one. You can follow the bug report I pointed out on the upload-artifact repository. Unfortunately, it is not the only reported issue about this problem; it has been reported numerous times with no resolution.

I am unsure whether I want to fork/clone the repository and take ownership of it. I neither have the time to maintain it nor see an immediate need for it. If this continues to be a problem, there are other, similar actions that we can use.

I introduced a PR with the version rollback; however, that also failed with a similar problem. I will try other options to see if I can swap the action for something more reliable.

hs-apotell added a commit to hs-apotell/antlr4 that referenced this issue Mar 19, 2023
GitHub action for upload was upgraded to v3 recently and the release is
unstable, causing too many uploads to fail. Downgrading back to the previous
version hasn't made a significant improvement either.

Since the artifacts aren't used by any chained job, failures to upload
them can be ignored. The artifacts are used mostly for debugging, so if
needed the user can trigger a specific build again to get the artifact.

Signed-off-by: HS <hs@apotell.com>
@ericvergnaud
Contributor Author

Since the artifacts are not necessary, how about disabling that step altogether?
If people complain, we can look again at a solution?

@hs-apotell
Collaborator

The option to `continue-on-error` has the same effect - ignoring the result if the upload fails.
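
(A minimal sketch of that option on the upload step; the artifact name and path are assumptions:)

```yaml
- name: Upload artifacts
  uses: actions/upload-artifact@v3
  continue-on-error: true        # a failed upload no longer fails the whole job
  with:
    name: cpp-build-output       # assumed artifact name
    path: antlr4-cpp-build.tgz   # assumed path
```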

parrt pushed a commit that referenced this issue Mar 22, 2023
jimidle pushed a commit to jimidle/antlr4 that referenced this issue Mar 28, 2023