
WIP: Resolve two problems in ci/cd testing. #668

Closed

Conversation

@jinchihe (Member) commented Oct 23, 2019

This PR fixes the two issues below:

  1. In PR #658 (pin the web-ui version of TF to 1.7, same as training) and issue #665 (no signal about the xgboost_synthetic test in the periodic dashboard, and it is failing), we hit the error below:

     error: SchemaError(io.k8s.api.certificates.v1beta1.CertificateSigningRequestList): invalid object doesn't have additional properties

     The root cause is that the test-worker image still ships kubectl v1.10.0; kubectl was upgraded in kubeflow/testing#500, but the test-worker image has not been rebuilt. Once the test-worker image is refreshed, the kubectl upgrade in this PR can be removed. (See the version-guard sketch after this list.)

  2. On the cluster kf-vmaster-n00, workload identity is not working (kubeflow/testing#499: "kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access"). @jlewi is going to investigate; for now we can switch to another cluster to work around it and avoid blocking other PRs from merging.
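
As a rough illustration of the first fix (a sketch only, not the actual change in this PR; the minimum version below is an assumption), a test harness can fail fast when the worker image ships a kubectl client too old to validate current manifests:

    # Hypothetical sketch: fail fast if the test worker's kubectl client is too
    # old to validate current manifests. The minimum version is an assumption.
    import json
    import subprocess

    MIN_CLIENT = (1, 14)  # assumed floor; kubectl v1.10.0 triggers the SchemaError

    def check_kubectl_version():
      out = subprocess.check_output(
          ["kubectl", "version", "--client", "-o", "json"])
      info = json.loads(out)["clientVersion"]
      version = (int(info["major"]), int(info["minor"].rstrip("+")))
      if version < MIN_CLIENT:
        raise RuntimeError(
            "kubectl %d.%d in the test-worker image is too old; "
            "expect SchemaError during manifest validation" % version)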


@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jinchihe
You can assign the PR to them by writing /assign @jinchihe in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jinchihe (Member Author)

The first problem has been worked around, but we hit another problem:

   log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 384, in MonitoredTrainingSession
   stop_grace_period_secs=stop_grace_period_secs)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 795, in __init__
   stop_grace_period_secs=stop_grace_period_secs)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 511, in __init__
   h.begin()
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 424, in begin
   self._summary_writer = SummaryWriterCache.get(self._checkpoint_dir)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer_cache.py", line 63, in get
   logdir, graph=ops.get_default_graph())
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 352, in __init__
   filename_suffix)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
   gfile.MakeDirs(self._logdir)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 374, in recursive_create_dir
   pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
   c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
 "error": {
   "code": 403,
   "message": "Primary: /namespaces/kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access to kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219.",
   "errors": [
     {
       "message": "Primary: /namespaces/kubeflow-ci-deployment.svc.id.goog with additional claims does not have storage.objects.get access to kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219.",
       "domain": "global",
'
         when reading metadata of gs://kubeflow-ci-deployment_ci-temp/mnist/models/1186838838111113219
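
For reference, the 403 can be checked outside of TensorFlow. A minimal sketch, assuming the google-cloud-storage client is available in the pod; the bucket and object names are taken from the error above:

    # Minimal sketch: verify whether the pod's credentials can read the object
    # from the error above. Requires the google-cloud-storage package.
    from google.api_core.exceptions import Forbidden
    from google.cloud import storage

    client = storage.Client(project="kubeflow-ci-deployment")
    blob = client.bucket("kubeflow-ci-deployment_ci-temp").blob(
        "mnist/models/1186838838111113219")
    try:
      blob.reload()  # metadata GET; needs storage.objects.get, like the failing call
      print("storage.objects.get OK for", blob.name)
    except Forbidden as e:  # a 403 here mirrors the workload identity failure
      print("access denied:", e)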

@jinchihe (Member Author)

/cc @jlewi

@k8s-ci-robot k8s-ci-robot requested a review from jlewi October 24, 2019 03:01
@@ -20,7 +20,7 @@ local defaultParams = {
 // Which Kubeflow cluster to use for running TFJobs on.
 kfProject: "kubeflow-ci-deployment",
 kfZone: "us-east1-b",
-kfCluster: "kf-vmaster-n01",
+kfCluster: "kf-v0-6-n00",
Contributor:

This is causing the tests to run against a v0-6 cluster. Do we really want to do that?
Don't we want to test against a cluster that's running master?

Member Author:

@jlewi I updated this due to kubeflow/testing#499; once that issue is resolved, we can of course go back to the master cluster. Thanks.

Contributor:

Making the test pass doesn't help us if the test doesn't test what we care about. A green periodic-master dashboard should mean the tests are passing on clusters deployed from master.

So if we actually run on v0-6 clusters, we are creating a misleading signal.

There are two possible paths forward:

  1. Fix the test so it passes on master (ideal)
  2. Mark the test as expected to fail, so that the dashboard correctly reflects that the test is failing on master while not blocking presubmits.

One way to achieve that is to convert the test to use pytest. Then we can just annotate it to indicate it's expected to fail, for example (a sketch only; the test name is hypothetical):
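
    # Sketch of the pytest annotation; the test name is hypothetical.
    import pytest

    @pytest.mark.xfail(
        reason="fails on master clusters, see kubeflow/testing#499")
    def test_mnist_on_master():
      ...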

We will need to update the workflow to invoke it using pytest.

You could either update the existing jsonnet file or wait for #666 and then add it to the pyfunc, which might be easier than modifying ksonnet.

@jinchihe (Member Author) Oct 29, 2019

I'd prefer to fix the test on master. I think the failure may be caused by the master cluster problem (kubeflow/testing#499) or by a workload identity change. Will take a look later.

@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch 2 times, most recently from e274bc1 to 44a8b20 on October 29, 2019 08:53
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 44a8b20 to 05334ad on October 29, 2019 08:55
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 05334ad to 0f38dec on October 29, 2019 13:37
@jinchihe jinchihe changed the title from "Resolve two problems in ci/cd testing." to "WIP: Resolve two problems in ci/cd testing." on Oct 29, 2019
@jlewi (Contributor) commented Oct 30, 2019

@jinchihe Regarding the master clusters against which we run the example tests: we need to fix
kubeflow/testing#444

The clusters named vmaster-n?? are no longer being updated. The issue has more info on the work that needs to be done to fix things. It's on my todo list, but I probably won't get to it until next week.

@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from 0f38dec to b4f4dfc on October 30, 2019 09:53
@jinchihe jinchihe force-pushed the resolve_ci_test_problem branch from b4f4dfc to 7711103 on October 30, 2019 09:57
@jlewi (Contributor) commented Nov 8, 2019

#677 is tracking updating the mnist tests.

The auto-deployed clusters should now be working, and the script to get the credentials of the latest master cluster should be working as well. (A rough sketch of that flow is below.)
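
A rough sketch of what that flow might look like from a test harness; the kf-vmaster name prefix and the newest-first selection are assumptions, not the actual script:

    # Hypothetical sketch: fetch credentials for the newest auto-deployed master
    # cluster. The name prefix and newest-first selection are assumptions.
    import json
    import subprocess

    def use_latest_master_cluster(project="kubeflow-ci-deployment"):
      out = subprocess.check_output([
          "gcloud", "container", "clusters", "list",
          "--project", project, "--format", "json"])
      masters = [c for c in json.loads(out)
                 if c["name"].startswith("kf-vmaster")]
      latest = max(masters, key=lambda c: c["createTime"])
      subprocess.check_call([
          "gcloud", "container", "clusters", "get-credentials", latest["name"],
          "--project", project, "--zone", latest["zone"]])
      return latest["name"]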

@jinchihe (Member Author) commented Dec 7, 2019

Closing this; it has been resolved in #684.

@jinchihe jinchihe closed this Dec 7, 2019