-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update xgboost_synthetic test infra; preliminary updates to work with 0.7.0 #666
Conversation
Check out this pull request on You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB. |
* Related to kubeflow#655 update xgboost_synthetic to use workload identity * Related to to kubeflow#665 no signal about xgboost_synthetic * We need to update the xgboost_synthetic example to work with 0.7.0; e.g. workload identity * This PR focuses on updating the test infra and some preliminary updates the notebook * More fixes to the test and the notebook are probably needed in order to get it to actually pass * Update job spec for 0.7; remove the secret and set the default service account. * This is to make it work with workload identity * Instead of using kustomize to define the job to run the notebook we can just modify the YAML spec using python. * Use the python API for K8s to create the job rather than shelling out. * Notebook should do a 0.7 compatible check for credentials * We don't want to assume GOOGLE_APPLICATION_CREDENTIALS is set because we will be using workload identity. * Take in repos as an argument akin to what checkout_repos.sh requires * Convert xgboost_test.py to a pytest. * This allows us to mark it as expected to fail so we can start to get signal without blocking * We also need to emit junit files to show up in test grid. * Convert the jsonnet workflow for the E2E test to a python function to define the workflow. * Remove the old jsonnet workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlewi Great starting to update test to use py function.
with open("job.yaml") as hf: | ||
job = yaml.load(hf) | ||
|
||
job["metadata"]["namespace"] = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; should be setting name.
# TODO(jlewi): This test is currently failing because various things | ||
# need to be updated to work with 0.7.0. Until that's fixed we mark it | ||
# as expected to fail so we can begin to get signal. | ||
@pytest.mark.xfail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to have a ticket to trace this before merging.
end_time = datetime.datetime.now() + datetime.timedelta( | ||
minutes=15) | ||
|
||
namespace = job["metadata"]["namespace"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use namespace var directly? the job["metadata"]["namespace"]
from namespace and no if..else.. for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the if you are referring to?
Do you mean setting of name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the job["metadata"]["namespace"] = name
above, so no need to namespace = job["metadata"]["namespace"]
, we can use the namespace var directly.
Yes I noticed you have updated that. Great!
continue | ||
# ready_replicas could be None | ||
if not job.conditions: | ||
logging.info("Job missing condition") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should continue
to next loop here if the job.conditions is not out (in very shart time the job conditions may be generated), otherwise will traceback by next last_condition = job.conditions[-1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
|
||
if not last_condition or last_condition["type"] not in ["Failed", "Complete"]: | ||
logging.error("Timeout waiting for job %s.%s to finish.", namespace, name) | ||
assert last_condition["type"] in ["Failed", "Complete"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should raise RuntimeError
if last_condition["type"] in ["Failed"]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -1,14 +0,0 @@ | |||
apiVersion: rbac.authorization.k8s.io/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can create Job if not define Role? I remember need this otherwise we cannot submit that, just reminder here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Namespaces should be provisioned with the service account default-editor which should have sufficient priveleges. We shouldn't need to do any additional role creation. If it is then its a bug in Kubeflow somewhere.
Seems there are pylint error in
|
* Install pip packages in user space * 0.7.0 images are based on TF images and they have different permissions * Install a newer version of fairing sdk that works with workload identity * Split pip installing dependencies out of util.py and into notebook_setup.py * That's because util.py could depend on the packages being installed by notebook_setup.py * After pip installing the modules into user space; we need to add the local path for pip packages to the python otherwise we get import not found errors.
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Update xgboost_synthetic test infra to use pytest and pyfunc.
Related to Update xgboost_synthetic to 0.7 #655 update xgboost_synthetic for 0.7
Related to to No signal about xgboost_synthetic test in periodic dashboard and its failing #665 no signal about xgboost_synthetic
We need to update the xgboost_synthetic example to work with 0.7.0;
e.g. workload identity
This PR focuses on updating the test infra and some preliminary
updates the notebook
More fixes to the test and the notebook are probably needed in order
to get it to actually pass
Update job spec for 0.7; remove the secret and set the default service
account.
Instead of using kustomize to define the job to run the notebook we can just modify the YAML spec using python.
Use the python API for K8s to create the job rather than shelling out.
Notebook should do a 0.7 compatible check for credentials
because we will be using workload identity.
Take in repos as an argument akin to what checkout_repos.sh requires
Convert xgboost_test.py to a pytest.
This allows us to mark it as expected to fail so we can start to get
signal without blocking
We also need to emit junit files to show up in test grid.
Convert the jsonnet workflow for the E2E test to a python function to
define the workflow.
This change is