-
Notifications
You must be signed in to change notification settings - Fork 742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent multiple versions of an E2E test from clobbering each other. #749
Conversation
Hey @jlewi, TravisCI finished with status |
Travis tests have failedHey @jlewi, 1st Buildgoveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
|
* It turns out that although we running the v1alpha2 tests, failures were not being properly reported in Prow because the junit xml files had the same names for the v2 pipeline as the v1 pipeline and the v2 results were being clobbered by v1. * Ensure the artifacts for each run of the E2E test have a unix name based on the TFJob version so that the E2E tests for the different TFJob versions won't clobber each other. * Log the exception in wait for condition. * Need to pass --tfjob_version to the tests so it uses the proper client. * run_gpu and run_test stage need to use a v1alpha2 version of the test workflow. * Update the tf_smoke program to accept chief as a valid worker type so that it works with v1alpha2. * In v1alpha2 we need to terminate all workers. It looks like there was a regression in v1alpha2 kubeflow#751 and we require all workers to terminate as opposed to just worker 0. * Delete a bunch of environments for the test app that shouldn't have been committed. Fix kubeflow#748
Here is a passing test run The tests passed when I patched in We will need that PR to be submitted first. |
kubeflow/testing#183 which I was using to test.
/assign @gaocegege |
/retest |
#751 has been submitted. So lets try rerunning the test. /retest |
@gaocegege @kunmingg Tests are passing now that kubeflow/testing#183 was submitted so this is ready for review. |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi, kunmingg The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* kubeflow/trainer#749 renamed the prototype * ksonnet just fails silently if the component doesn't exist so the test just ends up failing with 404 trying to wait for the job. * Fix 1266
* kubeflow/trainer#749 renamed the prototype * ksonnet just fails silently if the component doesn't exist so the test just ends up failing with 404 trying to wait for the job. * Fix 1266 * Update the test to use v1alpha2 for tf_job_version. * Remove code to run v1alpha1 and v1alpha2; we can only run one version because only 1 CRD version can be installed in the cluster. * The default should be v1alpha2 so lets just test that. * Don't use the mnist image with v1alpha2; use tf_smoke and use the image set in the job prototype.
* Fix tfjob test; the simple tfjob test prototype was renamed. * kubeflow/trainer#749 renamed the prototype * ksonnet just fails silently if the component doesn't exist so the test just ends up failing with 404 trying to wait for the job. * Fix 1266 * Update the test to use v1alpha2 for tf_job_version. * Remove code to run v1alpha1 and v1alpha2; we can only run one version because only 1 CRD version can be installed in the cluster. * The default should be v1alpha2 so lets just test that. * Don't use the mnist image with v1alpha2; use tf_smoke and use the image set in the job prototype. * Fix ksonnet error.
…w#1267) * Fix tfjob test; the simple tfjob test prototype was renamed. * kubeflow/trainer#749 renamed the prototype * ksonnet just fails silently if the component doesn't exist so the test just ends up failing with 404 trying to wait for the job. * Fix 1266 * Update the test to use v1alpha2 for tf_job_version. * Remove code to run v1alpha1 and v1alpha2; we can only run one version because only 1 CRD version can be installed in the cluster. * The default should be v1alpha2 so lets just test that. * Don't use the mnist image with v1alpha2; use tf_smoke and use the image set in the job prototype. * Fix ksonnet error.
…w#1267) * Fix tfjob test; the simple tfjob test prototype was renamed. * kubeflow/trainer#749 renamed the prototype * ksonnet just fails silently if the component doesn't exist so the test just ends up failing with 404 trying to wait for the job. * Fix 1266 * Update the test to use v1alpha2 for tf_job_version. * Remove code to run v1alpha1 and v1alpha2; we can only run one version because only 1 CRD version can be installed in the cluster. * The default should be v1alpha2 so lets just test that. * Don't use the mnist image with v1alpha2; use tf_smoke and use the image set in the job prototype. * Fix ksonnet error.
It turns out that although we running the v1alpha2 tests, failures
were not being properly reported in Prow because the junit xml files
had the same names for the v2 pipeline as the v1 pipeline and the v2
results were being clobbered by v1.
Ensure the artifacts for each run of the E2E test have a unix name
based on the TFJob version so that the E2E tests for the different
TFJob versions won't clobber each other.
Log the exception in wait for condition.
Need to pass --tfjob_version to the tests so it uses the proper client.
run_gpu and run_test stage need to use a v1alpha2 version of the test
workflow.
Update the tf_smoke program to accept chief as a valid worker type so that
it works with v1alpha2.
In v1alpha2 we need to terminate all workers. It looks like there was a
regression in v1alpha2
[v1alpha2] Job should be marked completed when worker 0 exits but other workers are still running #751
and we require all workers to terminate as opposed to just worker 0.
Delete a bunch of environments for the test app that shouldn't have been
committed.
Fix #748
This change is