🌱 Improve release speed #9392
Conversation
Skipping CI for Draft Pull Request.
There are a couple of simpler ways to achieve this just using Make. Firstly, you can define release-staging as:

```make
.PHONY: release-staging
release-staging: ## Build and push container images to the staging bucket
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-build-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag
```

Then calling `make release-staging -j8` will pass the parallel argument to the sub-makes, but still execute them sequentially. Alternatively, you could define docker-build-% as a dependency of docker-push-%, i.e.:

```make
docker-push-%: docker-build-%
	$(MAKE) ARCH=$* docker-push
```

You can then remove a step from release-staging:

```make
release-staging: ## Build and push container images to the staging bucket
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
	REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag
```

This will allow make to start a push as soon as the relevant build completes, rather than waiting for all builds to complete, which is probably a marginal speed improvement.

Incidentally, I rewrote cloud-provider-openstack's cloudbuild to use buildx for multi-arch recently and was very happy with the results, both in terms of performance and maintainability: https://github.com/kubernetes/cloud-provider-openstack/blob/master/Dockerfile Although note that CPO's build is somewhat more complicated than normal because we build multiple images for each architecture, and for performance we build them all in the same buildx builder. cloud-provider-aws has a somewhat simpler multi-arch cloudbuild example.
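To make the dependency variant concrete, here is a minimal self-contained sketch; the `ARCHES` list and the leaf `docker-build`/`docker-push` recipes are stand-ins, not the real cluster-api Makefile:

```make
# Sketch of the docker-push-% -> docker-build-% dependency idea.
# ARCHES and the leaf recipes below are assumptions for illustration.
ARCHES ?= amd64 arm64 ppc64le s390x

.PHONY: docker-build-all docker-push-all docker-build docker-push
docker-build-all: $(addprefix docker-build-,$(ARCHES))
docker-push-all: $(addprefix docker-push-,$(ARCHES))

docker-build-%:
	$(MAKE) ARCH=$* docker-build

# Because each push depends only on its own arch's build,
# "make docker-push-all -j4" starts pushing an image as soon as
# that architecture's build finishes.
docker-push-%: docker-build-%
	$(MAKE) ARCH=$* docker-push

docker-build docker-push: # placeholder leaves so the sketch runs standalone
	@echo "$@ (ARCH=$(ARCH))"
```

With this shape a single top-level `-j` is inherited by every `$(MAKE)` recursion via the jobserver, so one invocation parallelizes the whole graph.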
Do we have any data on the impact of this on the speed? I'd prefer to only merge performance improvements where there is a significant improvement.
I believe this has the potential to improve performance by a factor of the number of build architectures, which is currently 5. However, it will be dominated by the build stage, which is already heavily multi-process because it's Go. Running all 5 of them simultaneously on the same system is not likely to give a 5x performance benefit.
@mdbooth thanks for the insights -- i thought i was missing something here. Cloudbuild doesn't provide much telemetry on resource consumption during the build jobs, but your findings are consistent with a lot of the questions i had while messing around with the parallelism and machine size during testing.

@killianmuldoon here is the slack thread for ref. so far yes, our build times from the previous release were ~28 mins. here are the most recent results from the builds I have been testing with (the fastest I have gotten it to run is ~6 mins, which is about consistent with the 5x improvement mdbooth speculated): https://a.cl.ly/6quw9NJX
here is a screenshot from the most recent run overview with the improvements suggested in the draft + a larger machine (bumped to E2_HIGHCPU_32). For comparison, here is the screenshot of the run overview from the original config: https://a.cl.ly/4guRAqQK (the job ran in 26 minutes)
Thanks for adding the data @cahillsf! Take this out of draft once you think it's ready to fully review and I'll come back to it.
force-pushed from 2888de1 to d2fd286
/lgtm
LGTM label has been added. Git tree hash: a83d2023e475d565f960b12d55c621dc5f033b5d
Thanks for working on it @cahillsf, this is already a huge improvement IMO. /lgtm
👋 bumping this for when you have a chance to review @killianmuldoon
One question, but this looks generally good to me.
revert to second variant
force-pushed from 3d6c94c to 1b037fd
Thanks for this!
/lgtm
LGTM label has been added. Git tree hash: 8505157bdf71b000815b83acfccda376cb83a988
/lgtm
This would be nice to try out before upcoming patch releases next week! Hopefully, we will have enough reviews by then from another maintainer 🙂
Thx everyone for working on this and especially on providing data!!
/approve
@cahillsf Thank you very much for your work on this!!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@furkatgofurov7 @killianmuldoon I assume we have to cherry-pick this to get an actual improvement for 1.5.x and 1.4.x? (fine from my side, just making sure we don't forget it)
Hm, this should be the first job running with this make target: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296
Hm, one failed with a timeout after 46 min. No idea if it's related: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296
the logs don't provide much to go off of unless i am missing something. same error on the subsequent run as well. any way for us to view the additional context from the Cloud Build UI?
There should be a cloud build log in the artifacts folder, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296/artifacts/build.log (unless you already looked at this one)
There is nothing except BUILD TIMEOUT. Am I blind, or missing something that is causing it? If not, why did it not happen when testing it locally?
Not that I am aware of, unfortunately.
yep, not much in the way of other indications of errors; it's just timing out. the failed runs are not even completing all of the builds: there should be 35 calls to
Should we revert this PR to get the image pushes on main back? Independent of that, it might be good to get some timestamps into these logs.
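One hedged sketch of the timestamp idea (assuming a plain POSIX shell is available in the build step; this is not what the job currently does):

```sh
#!/bin/sh
# Prefix every line of the build output with a wall-clock timestamp so
# queue time vs. build time becomes visible in build.log. Portable
# (no moreutils/ts dependency), at the cost of one date(1) call per line.
make release-staging -j8 2>&1 | while IFS= read -r line; do
  printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
done
```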
here is the PR to revert: #9465. sorry for breaking this
that is a good question, will have to try to dig into this
No worries :) I'm open to further experiments, just wanted to avoid pressure to have to fix this ASAP while it's broken
Yeah, no worries, thanks for the quick response!
hey @sbueringer looking into this piece:
trying to track down where the best place to make this modification would be; thinking that it might require something like this - piping

If so, I think the modification would have to be made in the
I would have assumed all of this would be in the image-builder repo, but can't seem to find the
The nightly cloudbuild job is defined here: https://github.com/kubernetes/test-infra/blob/00527b5641ac6d7e5b99249c8c8c6c8a4a636999/config/jobs/image-pushing/k8s-staging-cluster-api.yaml#L379-L412 It uses
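The comment above is cut off; for orientation, here is a rough sketch of the kind of command k8s image-pushing jobs generally end up running against that config. The substitution names follow the common k8s cloudbuild pattern and are assumptions here, not verified against this exact job:

```sh
# Submit the repo's cloudbuild.yaml to GCP Cloud Build, roughly what the
# image-builder tooling does on behalf of the prow job. The substitution
# values are illustrative placeholders.
gcloud builds submit . \
  --config cloudbuild.yaml \
  --substitutions "_GIT_TAG=v20230919-abcdef0,_PULL_BASE_REF=main"
```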
```diff
@@ -3,7 +3,7 @@
 timeout: 2700s
 options:
   substitution_option: ALLOW_LOOSE
-  machineType: 'E2_HIGHCPU_8'
+  machineType: 'E2_HIGHCPU_32'
```
Was the timeout maybe due to this machine type increase? Maybe it was not able to get scheduled?

Do the other changes make sense without this line (+ setting `8` in line 15)?
yes, that seems most likely and would also explain why the timeout did not occur in my environment. the Cloudbuild queue time is visible via the GCP console but is not clear from our current logging. Having the timestamps would help us confirm that theory, but given several runs timed out, I am not sure we could take any action to improve this situation anyway.

> Do the other changes make sense without this line (+ setting `8` in line 15)?

Yes, they do, and based on my previous testing they should give us ~10 mins improvement in job time.
wanted to touch back on here @chrischdi -- what do you think is the best path forward? i'll open a new PR leaving the machine type as `E2_HIGHCPU_8` and pass in `-j 8`?
I think that's a viable path forward 👍
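A hypothetical sketch of the follow-up cloudbuild.yaml shape being discussed; the step image and entrypoint below are assumptions for illustration, not the repository's actual config:

```yaml
timeout: 2700s
options:
  substitution_option: ALLOW_LOOSE
  # Keep the default machine type that is known to get scheduled promptly...
  machineType: 'E2_HIGHCPU_8'
steps:
  # ...and match make's parallelism to its 8 vCPUs instead of overshooting.
  - name: 'gcr.io/k8s-staging-test-infra/gcb-docker-gcloud'
    entrypoint: make
    args: ['-j', '8', '-O', 'release-staging']
```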
Nice catch!
What this PR does / why we need it:
Last release cut (slack thread for ref) we were blocked pretty early on in the process waiting for the post-cluster-api-push-images job to complete. We cut two patch releases (1.4.6 and 1.5.1), with each taking ~28 minutes to complete.

This PR improves the speed of the release process by:

- passing `-j 32` to the `make` command to parallelize each sub-make, and `-O` to unify the logging output for each job (see the sketch below)

The results of my local testing have been promising, with the most recent job with the current changes completing in under 6 minutes: https://a.cl.ly/5zubWldx
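For concreteness, a sketch of the invocation those flags describe (the goal name is taken from the Makefile discussion above):

```sh
# -j 32 lets make run up to 32 recipes at once (shared with sub-makes via
# the jobserver); -O (--output-sync) buffers each target's output so the
# interleaved logs of parallel jobs stay readable.
make -j 32 -O release-staging
```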
Here is some historical testing data from my sandbox:
- various `-j` option values and machine sizes: https://a.cl.ly/6quw9NJX

/area release