
Auto-deployments of Kubeflow should use unique names and not recycle names - Upgrade to v0.7 #444

Closed
jlewi opened this issue Aug 19, 2019 · 5 comments


jlewi commented Aug 19, 2019

Our auto-deployments currently recycle names; e.g.
kf-vmaster-n??...

We originally did this because, with IAP, we used to need to register each hostname manually. With universal redirect that should no longer be necessary.

In the past, reusing hostnames led to Let's Encrypt quota errors. We worked around this by generating self-signed certificates.

But on master we have now switched to GKE managed certificates, and uploading our own self-signed certificate no longer works.

It would be nice if GKE managed certificates avoided the quota issues when we reuse an endpoint.

However, we should now be able to use unique hostnames for each auto-deployment. That should simplify the auto-deployment logic.

We could use unique names, e.g.

kf-v0-6-%YYYY%MM%DD-%HH%MM%SS

We will need to fix our cleanup logic to delete older deployments and only keep the N most recent.

We can reuse our existing cleanup_ci.py script and just modify that.
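
For illustration, a rough sketch of those two pieces, assuming the timestamp suffix above and a keep-count of 3; the helper names and prefix are hypothetical, not the actual cleanup_ci.py code:

```python
import datetime
import re

def unique_deployment_name(prefix="kf-v0-6"):
    """Build a name like kf-v0-6-YYYYMMDD-HHMMSS (prefix is an assumption)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return "{}-{}".format(prefix, now.strftime("%Y%m%d-%H%M%S"))

def deployments_to_delete(names, prefix="kf-v0-6", keep=3):
    """Return every matching auto-deployment except the `keep` most recent.

    The timestamp suffix sorts lexicographically in chronological order,
    so a plain sort is enough.
    """
    pattern = re.compile(r"^{}-\d{{8}}-\d{{6}}$".format(re.escape(prefix)))
    matching = sorted(n for n in names if pattern.match(n))
    return matching[:-keep] if len(matching) > keep else []
```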

k8s-ci-robot pushed a commit that referenced this issue Oct 23, 2019
* Auto deploy job needs to use the new kfctl syntax; also use unique names

Related to #471

* Don't set name in the spec because we want to infer it from the directory.

* Create a new script to deploy with a unique name

* Related to: #444

* Update cleanup script to clean up new auto-deployed clusters

* In cron job get code from master.

* Fix lint.

* Revert changes to create_kf_instance

* update to v1beta1 spec.

* We need to use a self-signed certificate with the auto-deployed clusters
  because otherwise we hit lets-encrypt rate limiting.

jlewi commented Oct 25, 2019

The cron job is now failing.

INFO|2019-10-25T00:12:40|/src/kubeflow/testing/py/kubeflow/testing/util.py|71| Error:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 400: Invalid value for field 'resource.labels': ''.  Label key 'GIT_LABEL' violates format constraints. The key must start with a lowercase character, can only contain lowercase letters, numeric characters, underscores and dashes. The key can be at most 63 characters long. International characters are allowed., invalid
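
The failure is the GIT_LABEL value violating GCP's label constraints (start with a lowercase letter; only lowercase letters, digits, underscores, and dashes; at most 63 characters). A minimal sketch of the kind of sanitization that would satisfy those constraints; it is illustrative, not the exact fix in the PR referenced below:

```python
import re

def to_gcp_label(value, max_len=63):
    """Lowercase, replace disallowed characters, and truncate to 63 chars."""
    label = value.lower()
    label = re.sub(r"[^a-z0-9_-]", "-", label)  # keep only allowed characters
    if not label or not label[0].islower():     # must start with a lowercase letter
        label = "x" + label
    return label[:max_len]

# e.g. to_gcp_label("GIT_LABEL") -> "git_label"
```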

jlewi pushed a commit to jlewi/testing that referenced this issue Oct 25, 2019

jlewi commented Oct 25, 2019

We need to update
https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/get_kf_testing_cluster.py

To work with the new naming pattern.
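
A minimal sketch, assuming the timestamped naming scheme proposed in the issue description, of how the lookup could select the right cluster; the regex, prefix, and minimum-age value are assumptions rather than the actual get_kf_testing_cluster.py code:

```python
import datetime
import re

# Assumed name format: kf-vmaster-YYYYMMDD-HHMMSS
AUTO_DEPLOY_RE = re.compile(r"^kf-vmaster-(\d{4})(\d{2})(\d{2})-(\d{2})(\d{2})(\d{2})$")

def pick_cluster(names, min_age=datetime.timedelta(minutes=30), now=None):
    """Return the newest matching cluster that is at least `min_age` old."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    candidates = []
    for name in names:
        m = AUTO_DEPLOY_RE.match(name)
        if not m:
            continue
        created = datetime.datetime(*map(int, m.groups()), tzinfo=datetime.timezone.utc)
        if now - created >= min_age:  # skip clusters that may still be setting up
            candidates.append((created, name))
    return max(candidates)[1] if candidates else None
```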

k8s-ci-robot pushed a commit that referenced this issue Oct 25, 2019
* Fix auto-deployment labels to be valid GCP labels.

Related to:
 #444

* Update to use PR in oneoff.

* Fix labels.

jlewi commented Oct 26, 2019

We need to update cleanup_ci.py: the max duration of clusters is shorter than the frequency with which we are auto-deploying clusters from master.
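
In other words, the GC needs per-type expiration times. A sketch assuming test clusters expire after a couple of hours while auto-deployed clusters must outlive the ~8 hour auto-deploy cron; the concrete values and helper name are illustrative:

```python
import datetime

# Assumed values: E2E test clusters can be GC'd shortly after the test finishes,
# while auto-deployed clusters must outlive the ~8 hour auto-deploy cron.
E2E_MAX_AGE = datetime.timedelta(hours=2)
AUTO_DEPLOY_MAX_AGE = datetime.timedelta(hours=12)

def is_expired(created_at, is_auto_deployed, now=None):
    """Return True if the cluster is old enough for the GC to delete.

    `created_at` must be a timezone-aware datetime.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    max_age = AUTO_DEPLOY_MAX_AGE if is_auto_deployed else E2E_MAX_AGE
    return now - created_at > max_age
```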

jlewi changed the title from "Auto-deployments of Kubeflow should use unique names and not recycle names" to "Auto-deployments of Kubeflow should use unique names and not recycle names - Upgrade to v0.7" on Oct 30, 2019
k8s-ci-robot pushed a commit that referenced this issue Nov 1, 2019
* cleanup_ci needs to use different expiration times for auto deployed and E2E clusters (#512)

* The auto deployed clusters are now using unique names rather than being
  recycled and we rely on cleanup_ci.py to GC old auto-deployments (#444)

* To support this we need to use variable expiration times.

  * Deployments created by tests should expire as soon as the test is done (so 1-2 hours)

  * But auto-deployed clusters need to live longer

    * They are only refreshed periodically by a cron job (~8 hours); we don't
      want to delete the cluster before a new one is deployed because we need
      a cluster for the example tests.

    * We want to leave the clusters up as long as we can to facilitate debugging
      by people working on example tests.

Related to #444

* Address reviewer comments.
k8s-ci-robot pushed a commit to kubeflow/examples that referenced this issue Nov 5, 2019
* Fix the xgboost_synthetic test so it actually runs and produces signal (#674)

* The test wasn't actually running because we were passing arguments that
  were unknown to pytest

* Remove the old role.yaml; we don't use it anymore

* Wait for the Job to finish and properly report status; kubeflow/testing#514
  contains the new routine

* The test still isn't passing because of #673

* In addition we need to fix the auto deployments kubeflow/testing#444

Related to #665

* Fix lint.
k8s-ci-robot pushed a commit that referenced this issue Nov 5, 2019
* Auto deployed clusters are no longer recycling names; instead each
  auto deployed cluster will have a unique name

* Use regexes to identify the appropriate auto deployed cluster

* Only consider clusters with a minimum age; this is a hack to ensure
  clusters are properly setup.

* Related to: #444
@jtfogarty

/kind feature


jlewi commented Feb 3, 2020

This should be fixed.

jlewi closed this as completed Feb 3, 2020