
Auto-deployments of Kubeflow should use unique names and not recycle names - Upgrade to v0.7 #444

Closed
jlewi opened this issue Aug 19, 2019 · 5 comments


jlewi commented Aug 19, 2019

Our auto-deployments currently recycle names; e.g.
kf-vmaster-n??...

We originally did this because, with IAP, we used to need to register each hostname manually. With universal redirect that should no longer be necessary.

In the past, reusing hostnames led to Let's Encrypt quota errors. We worked around this by generating self-signed certificates.

But on master we have now switched to GKE managed certificates, and uploading our own self-signed certificate no longer works.

It would be nice if GKE managed certificates avoided the quota issues when we reuse an endpoint.

However, we should now be able to use unique hostnames for each auto-deployment. That should simplify the auto-deployment logic.

We could use unique names, e.g.

kf-v0-6-%YYYY%MM%DD-%HH%MM%SS

We will need to fix our cleanup logic to delete older deployments and only keep the N most recent.

We can reuse our existing cleanup_ci.py script and just modify that.
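
For illustration, a rough sketch of those two pieces, assuming the timestamp suffix above and a keep-count of 3; the helper names and prefix are hypothetical, not the actual cleanup_ci.py code:

```python
import datetime
import re

def unique_deployment_name(prefix="kf-v0-6"):
    """Build a name like kf-v0-6-YYYYMMDD-HHMMSS (prefix is an assumption)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return "{}-{}".format(prefix, now.strftime("%Y%m%d-%H%M%S"))

def deployments_to_delete(names, prefix="kf-v0-6", keep=3):
    """Return every matching auto-deployment except the `keep` most recent.

    The timestamp suffix sorts lexicographically in chronological order,
    so a plain sort is enough.
    """
    pattern = re.compile(r"^{}-\d{{8}}-\d{{6}}$".format(re.escape(prefix)))
    matching = sorted(n for n in names if pattern.match(n))
    return matching[:-keep] if len(matching) > keep else []
```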

k8s-ci-robot pushed a commit that referenced this issue Oct 23, 2019
* Auto deploy job needs to use the new kfctl syntax; also use unique names

Related to #471

* Don't set name in the spec because we want to infer it from the directory.

* Create a new script to deploy with a unique name

* Related to: #444

* Update cleanup script to clean up new auto-deployed clusters

* In cron job get code from master.

* Fix lint.

* Revert changes to create_kf_instance

* update to v1beta1 spec.

* We need to use a self-signed certificate with the auto-deployed clusters
  because otherwise we hit lets-encrypt rate limiting.

jlewi commented Oct 25, 2019

The cron job is now failing.

INFO|2019-10-25T00:12:40|/src/kubeflow/testing/py/kubeflow/testing/util.py|71| Error:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 400: Invalid value for field 'resource.labels': ''.  Label key 'GIT_LABEL' violates format constraints. The key must start with a lowercase character, can only contain lowercase letters, numeric characters, underscores and dashes. The key can be at most 63 characters long. International characters are allowed., invalid
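
The failure is the GIT_LABEL value violating GCP's label constraints (start with a lowercase letter; only lowercase letters, digits, underscores, and dashes; at most 63 characters). A minimal sketch of the kind of sanitization that would satisfy those constraints; it is illustrative, not the exact fix in the PR referenced below:

```python
import re

def to_gcp_label(value, max_len=63):
    """Lowercase, replace disallowed characters, and truncate to 63 chars."""
    label = value.lower()
    label = re.sub(r"[^a-z0-9_-]", "-", label)  # keep only allowed characters
    if not label or not label[0].islower():     # must start with a lowercase letter
        label = "x" + label
    return label[:max_len]

# e.g. to_gcp_label("GIT_LABEL") -> "git_label"
```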

jlewi pushed a commit to jlewi/testing that referenced this issue Oct 25, 2019

jlewi commented Oct 25, 2019

We need to update
https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/get_kf_testing_cluster.py

To work with the new naming pattern.
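
A minimal sketch, assuming the timestamped naming scheme proposed in the issue description, of how the lookup could select the right cluster; the regex, prefix, and minimum-age value are assumptions rather than the actual get_kf_testing_cluster.py code:

```python
import datetime
import re

# Assumed name format: kf-vmaster-YYYYMMDD-HHMMSS
AUTO_DEPLOY_RE = re.compile(r"^kf-vmaster-(\d{4})(\d{2})(\d{2})-(\d{2})(\d{2})(\d{2})$")

def pick_cluster(names, min_age=datetime.timedelta(minutes=30), now=None):
    """Return the newest matching cluster that is at least `min_age` old."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    candidates = []
    for name in names:
        m = AUTO_DEPLOY_RE.match(name)
        if not m:
            continue
        created = datetime.datetime(*map(int, m.groups()), tzinfo=datetime.timezone.utc)
        if now - created >= min_age:  # skip clusters that may still be setting up
            candidates.append((created, name))
    return max(candidates)[1] if candidates else None
```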

k8s-ci-robot pushed a commit that referenced this issue Oct 25, 2019
* Fix auto-deployment labels to be valid GCP labels.

Related to:
 #444

* Update to use PR in oneoff.

* Fix labels.

jlewi commented Oct 26, 2019

We need to update cleanup_ci.py: the max duration of clusters is shorter than the frequency with which we are auto-deploying clusters from master.
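
In other words, the GC needs per-type expiration times. A sketch assuming test clusters expire after a couple of hours while auto-deployed clusters must outlive the ~8 hour auto-deploy cron; the concrete values and helper name are illustrative:

```python
import datetime

# Assumed values: E2E test clusters can be GC'd shortly after the test finishes,
# while auto-deployed clusters must outlive the ~8 hour auto-deploy cron.
E2E_MAX_AGE = datetime.timedelta(hours=2)
AUTO_DEPLOY_MAX_AGE = datetime.timedelta(hours=12)

def is_expired(created_at, is_auto_deployed, now=None):
    """Return True if the cluster is old enough for the GC to delete.

    `created_at` must be a timezone-aware datetime.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    max_age = AUTO_DEPLOY_MAX_AGE if is_auto_deployed else E2E_MAX_AGE
    return now - created_at > max_age
```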

jlewi changed the title from "Auto-deployments of Kubeflow should use unique names and not recycle names" to "Auto-deployments of Kubeflow should use unique names and not recycle names - Upgrade to v0.7" on Oct 30, 2019
k8s-ci-robot pushed a commit that referenced this issue Nov 1, 2019
* cleanup_ci needs to use different expiration times for auto deployed and E2E clusters (#512)

* The auto deployed clusters are now using unique names rather than being
  recycled and we rely on cleanup_ci.py to GC old auto-deployments (#444)

* To support this we need to use variable expiration times.

  * Deployments created by tests should expire as soon as the test is done (so 1-2 hours)

  * But auto-deployed clusters need to live longer

    * They are only refreshed periodically by a cron job (~8 hours); we don't
      want to delete the cluster before a new one is deployed because we need
      a cluster for the example tests.

    * We want to leave the clusters up as long as we can to facilitate debugging
      by people working on example tests.

Related to #444

* Address reviewer comments.
k8s-ci-robot pushed a commit to kubeflow/examples that referenced this issue Nov 5, 2019
* Fix the xgboost_synthetic test so it actually runs and produces signal (#674)

* The test wasn't actually running because we were passing arguments that
  were unknown to pytest

* Remove the old role.yaml; we don't use it anymore

* Wait for the Job to finish and properly report status; kubeflow/testing#514
  contains the new routine

* The test still isn't passing because of #673

* In addition we need to fix the auto deployments kubeflow/testing#444

Related to #665

* Fix lint.
k8s-ci-robot pushed a commit that referenced this issue Nov 5, 2019
* Auto deployed clusters are no longer recycling names; instead each
  auto deployed cluster will have a unique name

* Use regexes to identify the appropriate auto deployed cluster

* Only consider clusters with a minimum age; this is a hack to ensure
  clusters are properly setup.

* Related to: #444
@jtfogarty

/kind feature


jlewi commented Feb 3, 2020

This should be fixed.

jlewi closed this as completed Feb 3, 2020