WIP: Resolve two problems in ci/cd testing. #668
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved by any approvers. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 0da4120 to 8117985.
The first problem has been worked around, but we hit another problem.
Force-pushed from 1f6b9c2 to c8d29dc.
/cc @jlewi
@@ -20,7 +20,7 @@ local defaultParams = {
   // Which Kubeflow cluster to use for running TFJobs on.
   kfProject: "kubeflow-ci-deployment",
   kfZone: "us-east1-b",
-  kfCluster: "kf-vmaster-n01",
+  kfCluster: "kf-v0-6-n00",
This is causing the tests to run against a v0-6 cluster. Do we really want to do that?
Don't we want to test against a cluster that's running master?
@jlewi I updated this because of issue kubeflow/testing#499; once that issue is resolved, of course, we can go back to using the master cluster. Thanks.
Making the test pass doesn't help us if the test doesn't test what we care about. A passing dashboard for periodic master should mean the tests are passing on clusters deployed from master.
So if we actually run on v0-6 clusters, we are creating a misleading signal.
There are two possible paths forward
- Fix the test so it passes on master (ideal)
- Mark the test as expected to fail, so that the dashboard correctly reflects that the test is failing on master while the expected failure doesn't block presubmits
One way to achieve that is to convert the test to use pytest. Then we can just annotate it to indicate it's expected to fail.
We will need to update the workflow to invoke it using pytest.
You could either update the existing jsonnet file or wait for #666 and then add it to the pyfunc, which might be easier than modifying ksonnet.
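For illustration, a minimal sketch of what such an annotation could look like; the test name and body here are hypothetical, and the real test would live in the examples workflow:

```python
import pytest


# Hypothetical sketch: marking a test as expected to fail lets the dashboard
# record the failure on master without the failure blocking presubmits.
@pytest.mark.xfail(
    reason="Known failure on master clusters; see kubeflow/testing#499",
    strict=False,
)
def test_mnist_e2e():
    # The real test would submit the TFJob and check its status; this stub
    # simply fails to show how pytest reports an xfail instead of an error.
    raise RuntimeError("workload identity credentials not available")
```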
So I would prefer to fix the test problem on master; I think it may be caused by the master cluster problem kubeflow/testing#499 or by a workload identity change. Will take a look later.
Force-pushed from e274bc1 to 44a8b20.
Force-pushed from 44a8b20 to 05334ad.
Force-pushed from 05334ad to 0f38dec.
@jinchihe regarding the master clusters against which we run the example tests: we need to fix that; the clusters named vmaster-n?? are no longer being updated. The issue has more info on the work that needs to be done to fix things. It's on my to-do list, but I probably won't get to it until next week.
Force-pushed from 0f38dec to b4f4dfc.
Force-pushed from b4f4dfc to 7711103.
#677 is tracking updating the mnist tests. The auto-deployed clusters should now be working, and the script to get the credentials of the latest master cluster should be working as well.
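The actual credentials helper lives in kubeflow/testing; as a rough sketch, the underlying step it performs amounts to something like the following (the cluster name below is a placeholder, and the zone and project are taken from the param diff above):

```python
import subprocess

# Rough sketch: fetch kubeconfig credentials for a CI cluster via gcloud.
# The project and zone mirror the defaultParams shown in the diff above;
# the cluster name is a placeholder -- the real helper script in
# kubeflow/testing looks up the latest auto-deployed master cluster.
subprocess.run(
    [
        "gcloud", "container", "clusters", "get-credentials",
        "kf-vmaster-n01",  # hypothetical/placeholder cluster name
        "--zone", "us-east1-b",
        "--project", "kubeflow-ci-deployment",
    ],
    check=True,
)
```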
Closing the ticket; this has been resolved in #684.
The PR is going to fix the two issues below:
- The root cause of the first issue is that the kubectl version in the `test-worker` image is v1.10.0. kubectl has been upgraded in the PR kubeflow/testing#500, but the `test-worker` image has not been refreshed. Once `test-worker` is refreshed, we can remove the code change in this PR that upgrades kubectl (see the sketch after this list).
- On the `kf-vmaster-n00` cluster, there is a problem: "Workload identity not working TF; kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access" (kubeflow/testing#499). @jlewi is going to investigate, but we can change to another cluster to work around this for now, to avoid blocking other PRs from merging.