
WIP: Resolve two problems in ci/cd testing. #668

Closed

Conversation

@jinchihe (Member) commented Oct 23, 2019

This PR fixes the two issues below:

  1. In PR #658 (pin the web-ui version of TF to 1.7, same as training) and issue #665 (no signal about the xgboost_synthetic test in the periodic dashboard, and it is failing), we hit the error below:

     error: SchemaError(io.k8s.api.certificates.v1beta1.CertificateSigningRequestList): invalid object doesn't have additional properties

     The root cause is that the test-worker image still ships kubectl v1.10.0; kubectl was upgraded in kubeflow/testing#500, but the test-worker image has not been rebuilt. Once the test-worker image is refreshed, the kubectl upgrade in this PR can be removed. (See the version-guard sketch after this list.)

  2. On the cluster kf-vmaster-n00, workload identity is not working (kubeflow/testing#499: "kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access"). @jlewi is going to investigate; for now we can switch to another cluster to work around it and avoid blocking other PRs from merging.
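
As a rough illustration of the first fix (a sketch only, not the actual change in this PR; the minimum version below is an assumption), a test harness can fail fast when the worker image ships a kubectl client too old to validate current manifests:

    # Hypothetical sketch: fail fast if the test worker's kubectl client is too
    # old to validate current manifests. The minimum version is an assumption.
    import json
    import subprocess

    MIN_CLIENT = (1, 14)  # assumed floor; kubectl v1.10.0 triggers the SchemaError

    def check_kubectl_version():
      out = subprocess.check_output(
          ["kubectl", "version", "--client", "-o", "json"])
      info = json.loads(out)["clientVersion"]
      version = (int(info["major"]), int(info["minor"].rstrip("+")))
      if version < MIN_CLIENT:
        raise RuntimeError(
            "kubectl %d.%d in the test-worker image is too old; "
            "expect SchemaError during manifest validation" % version)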


@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jinchihe
You can assign the PR to them by writing /assign @jinchihe in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jinchihe (Member Author)

The first problem has been worked around, but we hit another problem:

   log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 384, in MonitoredTrainingSession
   stop_grace_period_secs=stop_grace_period_secs)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 795, in __init__
   stop_grace_period_secs=stop_grace_period_secs)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 511, in __init__
   h.begin()
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 424, in begin
   self._summary_writer = SummaryWriterCache.get(self._checkpoint_dir)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer_cache.py", line 63, in get
   logdir, graph=ops.get_default_graph())
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 352, in __init__
   filename_suffix)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
   gfile.MakeDirs(self._logdir)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 374, in recursive_create_dir
   pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
   c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
 "error": {
   "code": 403,
   "message": "Primary: /namespaces/kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access to kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219.",
   "errors": [
     {
       "message": "Primary: /namespaces/kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access to kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219.",
       "domain": "global",
'
         when reading metadata of gs://kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219
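
For reference, the 403 can be checked outside of TensorFlow. A minimal sketch, assuming the google-cloud-storage client is available in the pod; the bucket and object names are taken from the error above:

    # Minimal sketch: verify whether the pod's credentials can read the object
    # from the error above. Requires the google-cloud-storage package.
    from google.api_core.exceptions import Forbidden
    from google.cloud import storage

    client = storage.Client(project="kubeflow-ci-deployment")
    blob = client.bucket("kubeflow-ci-deployment_ci-temp").blob(
        "mnist/models/1186838838111113219")
    try:
      blob.reload()  # metadata GET; needs storage.objects.get, like the failing call
      print("storage.objects.get OK for", blob.name)
    except Forbidden as e:  # a 403 here mirrors the workload identity failure
      print("access denied:", e)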

@jinchihe (Member Author)

/cc @jlewi

@k8s-ci-robot k8s-ci-robot requested a review from jlewi October 24, 2019 03:01
@@ -20,7 +20,7 @@ local defaultParams = {
 // Which Kubeflow cluster to use for running TFJobs on.
 kfProject: "kubeflow-ci-deployment",
 kfZone: "us-east1-b",
-kfCluster: "kf-vmaster-n01",
+kfCluster: "kf-v0-6-n00",
Contributor:

This is causing the tests to run against a v0-6 cluster. Do we really want to do that?
Don't we want to test against a cluster that's running master?

Member Author:

@jlewi I updated this due to kubeflow/testing#499; once that issue is resolved, we can of course go back to the master cluster. Thanks.

Contributor:

Making the test pass doesn't help us if the test doesn't test what we care about. A green periodic-master dashboard should mean the tests are passing on clusters deployed from master.

So if we actually run on v0-6 clusters, we are creating a misleading signal.

There are two possible paths forward:

  1. Fix the test so it passes on master (ideal)
  2. Mark the test as expected to fail, so that the dashboard correctly reflects that the test is failing on master while not blocking presubmits.

One way to achieve that is to convert the test to use pytest. Then we can just annotate it to indicate it's expected to fail, for example (a sketch only; the test name is hypothetical):
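
    # Sketch of the pytest annotation; the test name is hypothetical.
    import pytest

    @pytest.mark.xfail(
        reason="fails on master clusters, see kubeflow/testing#499")
    def test_mnist_on_master():
      ...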

We will need to update the workflow to invoke it using pytest.

You could either update the existing jsonnet file or wait for #666 and then add it to the pyfunc, which might be easier than modifying ksonnet.

@jinchihe (Member Author) Oct 29, 2019

I'd prefer to fix the test on master. I think the failure may be caused by the master cluster problem (kubeflow/testing#499) or by a workload identity change. Will take a look later.

@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch 2 times, most recently from e274bc1 to 44a8b20 on October 29, 2019 08:53
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 44a8b20 to 05334ad on October 29, 2019 08:55
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 05334ad to 0f38dec on October 29, 2019 13:37
@jinchihe jinchihe changed the title from "Resolve two problems in ci/cd testing." to "WIP: Resolve two problems in ci/cd testing." on Oct 29, 2019
@jlewi (Contributor) commented Oct 30, 2019

@jinchihe Regarding the master clusters against which we run the example tests: we need to fix
kubeflow/testing#444

The clusters named vmaster-n?? are no longer being updated. The issue has more info on the work that needs to be done to fix things. It's on my todo list, but I probably won't get to it until next week.

@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 0f38dec to b4f4dfc on October 30, 2019 09:53
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from b4f4dfc to 7711103 on October 30, 2019 09:57
@jlewi (Contributor) commented Nov 8, 2019

#677 is tracking updating the mnist tests.

The auto-deployed clusters should now be working, and the script to get the credentials of the latest master cluster should be working as well. (A rough sketch of that flow is below.)
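
A rough sketch of what that flow might look like from a test harness; the kf-vmaster name prefix and the newest-first selection are assumptions, not the actual script:

    # Hypothetical sketch: fetch credentials for the newest auto-deployed master
    # cluster. The name prefix and newest-first selection are assumptions.
    import json
    import subprocess

    def use_latest_master_cluster(project="kubeflow-ci-deployment"):
      out = subprocess.check_output([
          "gcloud", "container", "clusters", "list",
          "--project", project, "--format", "json"])
      masters = [c for c in json.loads(out)
                 if c["name"].startswith("kf-vmaster")]
      latest = max(masters, key=lambda c: c["createTime"])
      subprocess.check_call([
          "gcloud", "container", "clusters", "get-credentials", latest["name"],
          "--project", project, "--zone", latest["zone"]])
      return latest["name"]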

@jinchihe (Member Author) commented Dec 7, 2019

Closing this; it has been resolved in #684.

@jinchihe jinchihe closed this Dec 7, 2019