🌱 Improve release speed #9392
Conversation
Skipping CI for Draft Pull Request.
There are a couple of simpler ways to achieve this just using Make. Firstly, you can define release-staging as:

```make
.PHONY: release-staging
release-staging: ## Build and push container images to the staging bucket
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-build-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag
```

Then calling `make release-staging -j8` will pass the parallel argument to the sub-makes, but still execute them sequentially. Alternatively, you could define docker-build-% as a dependency of docker-push-%, i.e.:

```make
docker-push-%: docker-build-%
	$(MAKE) ARCH=$* docker-push
```

You can then remove a step from release-staging:

```make
release-staging: ## Build and push container images to the staging bucket
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag
```

This will allow make to start a push as soon as the relevant build completes, rather than waiting for all builds to complete, which is probably a marginal speed improvement.

Incidentally, I rewrote cloud-provider-openstack's cloudbuild to use buildx for multi-arch recently and was very happy with the results, both in terms of performance and maintainability: https://github.com/kubernetes/cloud-provider-openstack/blob/master/Dockerfile Although note that CPO's build is somewhat more complicated than normal because we build multiple images for each architecture, and for performance we build them all in the same buildx builder. cloud-provider-aws has a somewhat simpler multi-arch cloudbuild example.
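To make the dependency variant concrete, here is a minimal self-contained sketch; the `ARCHES` list and the leaf `docker-build`/`docker-push` recipes are stand-ins, not the real cluster-api Makefile:

```make
# Sketch of the docker-push-% -> docker-build-% dependency idea.
# ARCHES and the leaf recipes below are assumptions for illustration.
ARCHES ?= amd64 arm64 ppc64le s390x

.PHONY: docker-build-all docker-push-all docker-build docker-push
docker-build-all: $(addprefix docker-build-,$(ARCHES))
docker-push-all: $(addprefix docker-push-,$(ARCHES))

docker-build-%:
	$(MAKE) ARCH=$* docker-build

# Because each push depends only on its own arch's build,
# "make docker-push-all -j4" starts pushing an image as soon as
# that architecture's build finishes.
docker-push-%: docker-build-%
	$(MAKE) ARCH=$* docker-push

docker-build docker-push: # placeholder leaves so the sketch runs standalone
	@echo "$@ (ARCH=$(ARCH))"
```

With this shape a single top-level `-j` is inherited by every `$(MAKE)` recursion via the jobserver, so one invocation parallelizes the whole graph.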
Do we have any data on the impact of this on the speed? I'd prefer to only merge performance improvements where there is a significant improvement.
I believe this has the potential to improve performance by a factor of the number of build architectures, which is currently 5. However, it will be dominated by the build stage, which is already heavily multi-process because it's Go. Running all 5 of them simultaneously on the same system is not likely to give a 5x performance benefit.
@mdbooth thanks for the insights -- i thought i was missing something here. Cloudbuild doesn't provide much telemetry on resource consumption during the build jobs, but your findings are consistent with a lot of the questions i had while messing around with the parallelism and machine size during testing.

@killianmuldoon here is the slack thread for ref. so far yes, our build times from the previous release were ~28 mins. here are the most recent results from the builds I have been testing with (the fastest I have gotten it to run is ~6 mins, which is about consistent with the 5x improvement mdbooth speculated): https://a.cl.ly/6quw9NJX
here is a screenshot from the most recent run overview with the improvements suggested in the draft + a larger machine (bumped to E2_HIGHCPU_32). For comparison, here is the screenshot of the run overview from the original config: https://a.cl.ly/4guRAqQK (the job ran in 26 minutes)
Thanks for adding the data @cahillsf! Take this out of draft once you think it's ready to fully review and I'll come back to it.
force-pushed from 2888de1 to d2fd286
/lgtm
LGTM label has been added. Git tree hash: a83d2023e475d565f960b12d55c621dc5f033b5d
Thanks for working on it @cahillsf, this is already a huge improvement IMO. /lgtm
👋 bumping this for when you have a chance to review @killianmuldoon
One question, but this looks generally good to me.
revert to second variant
force-pushed from 3d6c94c to 1b037fd
Thanks for this!
/lgtm
LGTM label has been added. Git tree hash: 8505157bdf71b000815b83acfccda376cb83a988
/lgtm
This would be nice to try out before upcoming patch releases next week! Hopefully, we will have enough reviews by then from another maintainer 🙂
Thx everyone for working on this and especially on providing data!!
/approve
@cahillsf Thank you very much for your work on this!!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@furkatgofurov7 @killianmuldoon I assume we have to cherry-pick this to get an actual improvement for 1.5.x and 1.4.x? (fine from my side, just making sure we don't forget it)
Hm, this should be the first job running with this make target: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296
Hm, one failed with a timeout after 46 min. No idea if it's related: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296
the logs don't provide much to go off of unless i am missing something. same error on the subsequent run as well. any way for us to view the additional context from the Cloud Build UI?
There should be a cloud build log in the artifacts folder, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296/artifacts/build.log (unless you already looked at this one)
There is nothing except BUILD TIMEOUT. Am I blind, or missing something that is causing it? If not, why did it not happen when testing it locally?
Not that I am aware of, unfortunately.
yep, not much in the way of other indications of errors; it's just timing out. the failed runs are not even completing all of the builds: there should be 35 calls to
Should we revert this PR to get the image pushes on main back? Independent of that, it might be good to get some timestamps into these logs.
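One hedged sketch of the timestamp idea (assuming a plain POSIX shell is available in the build step; this is not what the job currently does):

```sh
#!/bin/sh
# Prefix every line of the build output with a wall-clock timestamp so
# queue time vs. build time becomes visible in build.log. Portable
# (no moreutils/ts dependency), at the cost of one date(1) call per line.
make release-staging -j8 2>&1 | while IFS= read -r line; do
  printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
done
```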
here is the PR to revert: #9465. sorry for breaking this
that is a good question, will have to try to dig into this
No worries :) I'm open to further experiments, just wanted to avoid pressure to have to fix this ASAP while it's broken
Yeah, no worries, thanks for the quick response!
hey @sbueringer looking into this piece:
trying to track down where the best place to make this modification would be; thinking that it might require something like this - piping

If so, I think the modification would have to be made in the
I would have assumed all of this would be in the image-builder repo, but can't seem to find the
The nightly cloudbuild job is defined here: https://github.com/kubernetes/test-infra/blob/00527b5641ac6d7e5b99249c8c8c6c8a4a636999/config/jobs/image-pushing/k8s-staging-cluster-api.yaml#L379-L412 It uses
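The comment above is cut off; for orientation, here is a rough sketch of the kind of command k8s image-pushing jobs generally end up running against that config. The substitution names follow the common k8s cloudbuild pattern and are assumptions here, not verified against this exact job:

```sh
# Submit the repo's cloudbuild.yaml to GCP Cloud Build, roughly what the
# image-builder tooling does on behalf of the prow job. The substitution
# values are illustrative placeholders.
gcloud builds submit . \
  --config cloudbuild.yaml \
  --substitutions "_GIT_TAG=v20230919-abcdef0,_PULL_BASE_REF=main"
```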
```diff
@@ -3,7 +3,7 @@
 timeout: 2700s
 options:
   substitution_option: ALLOW_LOOSE
-  machineType: 'E2_HIGHCPU_8'
+  machineType: 'E2_HIGHCPU_32'
```
Was the timeout maybe due to this machine type increase? Maybe it was not able to get scheduled?

Do the other changes make sense without this line (+ setting `8` in line 15)?
yes, that seems most likely and would also explain why the timeout did not occur in my environment. the Cloudbuild queue time is visible via the GCP console but is not clear from our current logging. Having the timestamps would help us confirm that theory, but given several runs timed out, I am not sure we could take any action to improve this situation anyway.

> Do the other changes make sense without this line (+ setting `8` in line 15)?

Yes, they do, and based on my previous testing they should give us ~10 mins improvement in job time.
wanted to touch back on here @chrischdi -- what do you think is the best path forward? i'll open a new PR leaving the machine type as `E2_HIGHCPU_8` and pass in `-j 8`?
I think that's a viable path forward 👍
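A hypothetical sketch of the follow-up cloudbuild.yaml shape being discussed; the step image and entrypoint below are assumptions for illustration, not the repository's actual config:

```yaml
timeout: 2700s
options:
  substitution_option: ALLOW_LOOSE
  # Keep the default machine type that is known to get scheduled promptly...
  machineType: 'E2_HIGHCPU_8'
steps:
  # ...and match make's parallelism to its 8 vCPUs instead of overshooting.
  - name: 'gcr.io/k8s-staging-test-infra/gcb-docker-gcloud'
    entrypoint: make
    args: ['-j', '8', '-O', 'release-staging']
```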
Nice catch!
What this PR does / why we need it:
Last release cut (slack thread for ref) we were blocked pretty early on in the process waiting for the post-cluster-api-push-images job to complete. We cut two patch releases (1.4.6 and 1.5.1), with each taking ~28 minutes to complete.

This PR improves the speed of the release process by:

- passing `-j 32` to the `make` command to parallelize each sub-make, and `-O` to unify the logging output for each job (see the sketch below)

The results of my local testing have been promising, with the most recent job with the current changes completing in under 6 minutes: https://a.cl.ly/5zubWldx
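For concreteness, a sketch of the invocation those flags describe (the goal name is taken from the Makefile discussion above):

```sh
# -j 32 lets make run up to 32 recipes at once (shared with sub-makes via
# the jobserver); -O (--output-sync) buffers each target's output so the
# interleaved logs of parallel jobs stay readable.
make -j 32 -O release-staging
```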
Here is some historical testing data from my sandbox:
- various `-j` option values and machine sizes: https://a.cl.ly/6quw9NJX

/area release