-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate standalone deployment to workload identity on GCP #2619
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IronPan Can you help me briefly review?
Focus on https://github.com/kubeflow/pipelines/pull/2619/files#diff-e5921efca5c992d5c5ab4e91927c9ff3.
It is the script I will let users run.
/lgtm |
/test |
/retest |
Is this PR ready to be reviewed? |
Yes, but only review the script used to set up workload identity. I ran samples successfully with it. However, still struggling to migrate test infra on workload identity. |
It turns out that tests failed before because our image builder docker image was built a year ago with google/cloud-sdk. The old version of cloud sdk doesn't work well with workload identity. |
UPDATE, I removed all usages of use_gcp_secret to pass sample tests. |
Verify testing again for flakiness |
/lgtm |
/test kubeflow-pipeline-sample-test |
container builder seems to be flaky at fetching some image: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/2619/kubeflow-pipeline-sample-test/1206752832519147523#1:build-log.txt%3A4165 Test again to verify |
Verify again for flakiness |
/lgtm /retest |
local gsa_full="$gsa@$project.iam.gserviceaccount.com" | ||
local namespace=${4:-$NAMESPACE} | ||
|
||
gcloud iam service-accounts add-iam-policy-binding $gsa_full \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add retries here? This command is a source of some test flakiness:
++ gcloud iam service-accounts add-iam-policy-binding test-argo@ml-pipeline-test.iam.gserviceaccount.com '--member=serviceAccount:ml-pipeline-test.svc.id.goog[kubeflow/test-runner]' --role=roles/iam.workloadIdentityUser
ERROR: (gcloud.iam.service-accounts.add-iam-policy-binding) ABORTED: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me put up a PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name=yaml_spec['spec']['containers'][0]['env'][0]['name'], | ||
value=yaml_spec['spec']['containers'][0]['env'][0]['value'], | ||
)]) | ||
args = yaml_spec['spec']['containers'][0]['args']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about this change to the container-building functions for non-GKE deployments.
For container-building functions there might be no way for the user to supply secrets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I also think it's important to fix. However, I don't currently have bandwidth. Let's get this scheduled soon then.
@@ -57,11 +56,8 @@ def use_gcp_api_op(): | |||
def secret_op_pipeline(url='gs://ml-pipeline-playground/shakespeare1.txt'): | |||
"""A pipeline that uses secret to access cloud hosted resouces.""" | |||
|
|||
gcs_read_task = gcs_read_op(url).apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of secrets seems to be the core feature demonstrated by this sample. Perhaps this should not be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. Current situation is that we are not running tests that use a service account token. Shall we remove this sample from presubmit/postsubmit tests?
) * Script to set up workload identity for standalone deployment * Migrate tests to run on standalone + workload identity * Fix test script * Switch to static GSAs for testing, because they have name length limit * Add workload identity binding for argo * Fix argo workload identity bindings * Remove user-gcp-sa from tests * Remove use_gcp_secret from xgboost sample * Allow debugging tests locally * Wait for policies to take effect * Update deploy-pipeline-lite.sh * Update deploy-pipeline-lite.sh * [WIP] test gcloud auth list with test-runner sa * Add namespace * test again * Use new image builder * test again * Remove debug code * Remove usages of use_gcp_secret * Fix unit test and tensorboard pod template * Add debug code again to test * Try waiting until workload identity bindings are ready * Fix some other samples * Fix parameterized tfx oss sample * Add retry to image building * Try fixing tfx oss sample * Fix compiled tfx oss sample * Update all google/cloud-sdk to latest * Try fixing parameterized tfx oss sample again * Also verify pipeline-runner ksa is working * Fix parameterized_tfx_oss sample * Update gcp-workload-identity-setup.sh * Revert unneeded change * Pin to new google/cloud-sdk * Remove wrongly commited binaries
Signed-off-by: ddelange <14880945+ddelange@users.noreply.github.com> Signed-off-by: ddelange <14880945+ddelange@users.noreply.github.com>
Part of #1691
Changes:
use_gcp_secret
Note, I didn't completely remove test infra for other than workload identity, because I think it still makes sense to also test KFP in for example full scope cluster. So that we can easily add another test configuration in prow with different ENV to use a different one.
/assign @IronPan
/cc @gaoning777
/cc @rmgogogo
This change is