Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update xgboost_synthetic test infra; preliminary updates to work with 0.7.0 #666

Merged
merged 3 commits into from
Oct 25, 2019

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Oct 20, 2019

Update xgboost_synthetic test infra to use pytest and pyfunc.

  • Related to Update xgboost_synthetic to 0.7  #655 update xgboost_synthetic for 0.7

  • Related to to No signal about xgboost_synthetic test in periodic dashboard and its failing #665 no signal about xgboost_synthetic

  • We need to update the xgboost_synthetic example to work with 0.7.0;
    e.g. workload identity

  • This PR focuses on updating the test infra and some preliminary
    updates the notebook

  • More fixes to the test and the notebook are probably needed in order
    to get it to actually pass

  • Update job spec for 0.7; remove the secret and set the default service
    account.

    • This is to make it work with workload identity
  • Instead of using kustomize to define the job to run the notebook we can just modify the YAML spec using python.

  • Use the python API for K8s to create the job rather than shelling out.

  • Notebook should do a 0.7 compatible check for credentials

    • We don't want to assume GOOGLE_APPLICATION_CREDENTIALS is set
      because we will be using workload identity.
  • Take in repos as an argument akin to what checkout_repos.sh requires

  • Convert xgboost_test.py to a pytest.

    • This allows us to mark it as expected to fail so we can start to get
      signal without blocking

    • We also need to emit junit files to show up in test grid.

  • Convert the jsonnet workflow for the E2E test to a python function to
    define the workflow.

    • Remove the old jsonnet workflow.

This change is Reviewable

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

* Related to kubeflow#655 update xgboost_synthetic to use workload identity

* Related to to kubeflow#665 no signal about xgboost_synthetic

* We need to update the xgboost_synthetic example to work with 0.7.0;
  e.g. workload identity

* This PR focuses on updating the test infra and some preliminary
  updates the notebook

* More fixes to the test and the notebook are probably needed in order
  to get it to actually pass

* Update job spec for 0.7; remove the secret and set the default service
  account.

  * This is to make it work with workload identity

* Instead of using kustomize to define the job to run the notebook we can just modify the YAML spec using python.
* Use the python API for K8s to create the job rather than shelling out.

* Notebook should do a 0.7 compatible check for credentials

  * We don't want to assume GOOGLE_APPLICATION_CREDENTIALS is set
    because we will be using workload identity.

* Take in repos as an argument akin to what checkout_repos.sh requires

* Convert xgboost_test.py to a pytest.

  * This allows us to mark it as expected to fail so we can start to get
    signal without blocking

  * We also need to emit junit files to show up in test grid.

* Convert the jsonnet workflow for the E2E test to a python function to
  define the workflow.

  * Remove the old jsonnet workflow.
@jlewi jlewi changed the title Update xgboost_synthetic test to work with 0.7 Update xgboost_synthetic test infra; preliminary updates to work with 0.7.0 Oct 24, 2019
@jlewi jlewi marked this pull request as ready for review October 24, 2019 04:32
@jlewi
Copy link
Contributor Author

jlewi commented Oct 24, 2019

/assign @jinchihe
/assign @kunmingg

Copy link
Member

@jinchihe jinchihe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlewi Great starting to update test to use py function.

with open("job.yaml") as hf:
job = yaml.load(hf)

job["metadata"]["namespace"] = name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch; should be setting name.

# TODO(jlewi): This test is currently failing because various things
# need to be updated to work with 0.7.0. Until that's fixed we mark it
# as expected to fail so we can begin to get signal.
@pytest.mark.xfail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to have a ticket to trace this before merging.

end_time = datetime.datetime.now() + datetime.timedelta(
minutes=15)

namespace = job["metadata"]["namespace"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use namespace var directly? the job["metadata"]["namespace"] from namespace and no if..else.. for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the if you are referring to?
Do you mean setting of name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the job["metadata"]["namespace"] = name above, so no need to namespace = job["metadata"]["namespace"], we can use the namespace var directly.
Yes I noticed you have updated that. Great!

continue
# ready_replicas could be None
if not job.conditions:
logging.info("Job missing condition")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should continue to next loop here if the job.conditions is not out (in very shart time the job conditions may be generated), otherwise will traceback by next last_condition = job.conditions[-1]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch


if not last_condition or last_condition["type"] not in ["Failed", "Complete"]:
logging.error("Timeout waiting for job %s.%s to finish.", namespace, name)
assert last_condition["type"] in ["Failed", "Complete"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should raise RuntimeError if last_condition["type"] in ["Failed"]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1,14 +0,0 @@
apiVersion: rbac.authorization.k8s.io/v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can create Job if not define Role? I remember need this otherwise we cannot submit that, just reminder here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespaces should be provisioned with the service account default-editor which should have sufficient priveleges. We shouldn't need to do any additional role creation. If it is then its a bug in Kubeflow somewhere.

@jinchihe
Copy link
Member

Seems there are pylint error in examples/xgboost_synthetic/util.py , others LGTM

INFO|2019-10-25T00:03:20|util.py:71| ************* Module util
INFO|2019-10-25T00:03:20|util.py:71| C: 15, 0: Line too long (102/100) (line-too-long)
INFO|2019-10-25T00:03:20|util.py:71| C: 37, 0: Trailing whitespace (trailing-whitespace)
INFO|2019-10-25T00:03:20|util.py:71| C: 45, 0: Trailing whitespace (trailing-whitespace)
INFO|2019-10-25T00:03:20|util.py:71| C:  9, 0: standard import "from pathlib import Path" should be placed before "import requests" (wrong-import-order)

* Install pip packages in user space
  * 0.7.0 images are based on TF images and they have different permissions
* Install a newer version of fairing sdk that works with workload identity

* Split pip installing dependencies out of util.py and into notebook_setup.py

  * That's because util.py could depend on the packages being installed by
    notebook_setup.py

* After pip installing the modules into user space; we need to add the local
  path for pip packages to the python otherwise we get import not found
  errors.
@jinchihe
Copy link
Member

/lgtm

@jlewi
Copy link
Contributor Author

jlewi commented Oct 25, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 7e28cd6 into kubeflow:master Oct 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants