diff --git a/mnist/README.md b/mnist/README.md index 88032710a..c44626d9f 100644 --- a/mnist/README.md +++ b/mnist/README.md @@ -6,6 +6,7 @@ - [Prerequisites](#prerequisites) - [Deploy Kubeflow](#deploy-kubeflow) - [Local Setup](#local-setup) + - [GCP Setup](#gcp-setup) - [Modifying existing examples](#modifying-existing-examples) - [Prepare model](#prepare-model) - [Build and push model image.](#build-and-push-model-image) @@ -53,6 +54,9 @@ You also need the following command line tools: **Note:** kustomize [v2.0.3](https://github.com/kubernetes-sigs/kustomize/releases/tag/v2.0.3) is recommended because of a [problem](https://github.com/kubernetes-sigs/kustomize/issues/1295) in kustomize v2.1.0. +### GCP Setup + +If you are using GCP, you need to enable [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) to execute the steps below. ## Modifying existing examples @@ -225,94 +229,6 @@ kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${B kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export ``` -In order to write to GCS we need to supply the TFJob with GCP credentials. We do -this by telling our training code to use a [Google service account](https://cloud.google.com/docs/authentication/production#obtaining_and_providing_service_account_credentials_manually). - -If you followed the [getting started guide for GKE](https://www.kubeflow.org/docs/started/getting-started-gke/) -then a number of steps have already been performed for you - - 1. We created a Google service account named `${DEPLOYMENT}-user` - - * You can run the following command to list all service accounts in your project - - ``` - gcloud --project=${PROJECT} iam service-accounts list - ``` - - 2. We stored the private key for this account in a K8s secret named `user-gcp-sa` - - * To see the secrets in your cluster - - ``` - kubectl get secrets - ``` - - 3. 
We granted this service account permission to read/write GCS buckets in this project - - * To see the IAM policy you can do - - ``` - gcloud projects get-iam-policy ${PROJECT} --format=yaml - ``` - - * The output should look like the following - - ``` - bindings: - ... - - members: - - serviceAccount:${DEPLOYMENT}-user@${PROJEC}.iam.gserviceaccount.com - ... - role: roles/storage.admin - ... - etag: BwV_BqSmSCY= - version: 1 - ``` - -To use this service account we perform the following steps - - 1. Mount the secret `user-gcp-sa` into the pod and configure the mount path of the secret. - ``` - kustomize edit add configmap mnist-map-training --from-literal=secretName=user-gcp-sa - kustomize edit add configmap mnist-map-training --from-literal=secretMountPath=/var/secrets - ``` - - * Note: ensure your envrionment is pointed at the same `kubeflow` namespace as the `user-gcp-sa` secret - - 2. Next we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` so that our code knows where to look for the service account key. - - ``` - kustomize edit add configmap mnist-map-training --from-literal=GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json - ``` - - * If we look at the spec for our job we can see that the environment variable `GOOGLE_APPLICATION_CREDENTIALS` is set. - - ``` - kustomize build . - ``` - ``` - apiVersion: kubeflow.org/v1beta2 - kind: TFJob - metadata: - ... - spec: - tfReplicaSpecs: - Chief: - replicas: 1 - template: - spec: - containers: - - command: - .. - env: - ... - - name: GOOGLE_APPLICATION_CREDENTIALS - value: /var/secrets/user-gcp-sa.json - ... - ... - ... - ``` - You can now submit the job @@ -385,21 +301,21 @@ In order to write to S3 we need to supply the TensorFlow code with AWS credentia export S3_MODEL_EXPORT_URI=s3://${BUCKET_NAME}/export ``` - 1. Create a K8s secret containing your AWS credentials + 2. 
Create a K8s secret containing your AWS credentials ``` kustomize edit add secret aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \ --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY} ``` - 1. Pass secrets as environment variables into pod + 3. Pass secrets as environment variables into pod ``` kustomize edit add configmap mnist-map-training --from-literal=awsAccessKeyIDName=awsAccessKeyID kustomize edit add configmap mnist-map-training --from-literal=awsSecretAccessKeyName=awsSecretAccessKey ``` - 1. Next we need to set a whole bunch of S3 related environment variables so that TensorFlow knows how to talk to S3 + 4. Next we need to set a whole bunch of S3 related environment variables so that TensorFlow knows how to talk to S3 ``` kustomize edit add configmap mnist-map-training --from-literal=S3_ENDPOINT=${S3_ENDPOINT} diff --git a/mnist/serving/GCS/deployment_patch.yaml b/mnist/serving/GCS/deployment_patch.yaml deleted file mode 100644 index 7e48223e7..000000000 --- a/mnist/serving/GCS/deployment_patch.yaml +++ /dev/null @@ -1,17 +0,0 @@ -- op: add - path: /spec/template/spec/containers/0/volumeMounts/- - value: - mountPath: /secret/gcp-credentials - name: user-gcp-sa - readOnly: true -- op: add - path: /spec/template/spec/volumes/- - value: - name: user-gcp-sa - secret: - secretName: user-gcp-sa -- op: add - path: /spec/template/spec/containers/0/env/- - value: - name: GOOGLE_APPLICATION_CREDENTIALS - value: /secret/gcp-credentials/user-gcp-sa.json diff --git a/mnist/serving/GCS/kustomization.yaml b/mnist/serving/GCS/kustomization.yaml index f3be7609b..ccb97fdb4 100644 --- a/mnist/serving/GCS/kustomization.yaml +++ b/mnist/serving/GCS/kustomization.yaml @@ -3,11 +3,3 @@ kind: Kustomization bases: - ../base - -patchesJson6902: -- path: deployment_patch.yaml - target: - group: extensions - kind: Deployment - name: $(svcName) - version: v1beta1 diff --git a/mnist/testing/conftest.py b/mnist/testing/conftest.py index f70a03d30..f22694330 100644 
--- a/mnist/testing/conftest.py +++ b/mnist/testing/conftest.py @@ -1,14 +1,62 @@ +import os import pytest def pytest_addoption(parser): + parser.addoption( - "--master", action="store", default="", help="IP address of GKE master") + "--tfjob_name", help="Name for the TFJob.", + type=str, default="mnist-test-" + os.getenv('BUILD_ID')) + + parser.addoption( + "--namespace", help=("The namespace to run in. This should correspond to " + "a namespace associated with a Kubeflow namespace."), + type=str, default="kubeflow-kubeflow-testing") + + parser.addoption( + "--repos", help="The repos to checkout; leave blank to use defaults", + type=str, default="") + + parser.addoption( + "--trainer_image", help="TFJob training image", + type=str, default="gcr.io/kubeflow-ci/mnist/model:build-" + os.getenv('BUILD_ID')) + + parser.addoption( + "--train_steps", help="train steps for mnist testing", + type=str, default="10") + + parser.addoption( + "--batch_size", help="batch size for mnist training", + type=str, default="10") parser.addoption( - "--namespace", action="store", default="", help="namespace of server") + "--learning_rate", help="mnist learning rate", + type=str, default="0.01") parser.addoption( - "--service", action="store", default="", + "--num_ps", help="The number of PS", + type=str, default="1") + + parser.addoption( + "--num_workers", help="The number of workers", + type=str, default="2") + + parser.addoption( + "--model_dir", help="Path for model saving", + type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID')) + + parser.addoption( + "--export_dir", help="Path for model exporting", + type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID')) + + parser.addoption( + "--deploy_name", help="Name for the service deployment", + type=str, default="mnist-test-" + os.getenv('BUILD_ID')) + + parser.addoption( + "--master", action="store", default="", help="IP address of GKE master") + + 
parser.addoption( + "--service", action="store", default="mnist-test-" + os.getenv('BUILD_ID'), help="The name of the mnist K8s service") @pytest.fixture @@ -22,3 +70,47 @@ def namespace(request): @pytest.fixture def service(request): return request.config.getoption("--service") + +@pytest.fixture +def tfjob_name(request): + return request.config.getoption("--tfjob_name") + +@pytest.fixture +def repos(request): + return request.config.getoption("--repos") + +@pytest.fixture +def trainer_image(request): + return request.config.getoption("--trainer_image") + +@pytest.fixture +def train_steps(request): + return request.config.getoption("--train_steps") + +@pytest.fixture +def batch_size(request): + return request.config.getoption("--batch_size") + +@pytest.fixture +def learning_rate(request): + return request.config.getoption("--learning_rate") + +@pytest.fixture +def num_ps(request): + return request.config.getoption("--num_ps") + +@pytest.fixture +def num_workers(request): + return request.config.getoption("--num_workers") + +@pytest.fixture +def model_dir(request): + return request.config.getoption("--model_dir") + +@pytest.fixture +def export_dir(request): + return request.config.getoption("--export_dir") + +@pytest.fixture +def deploy_name(request): + return request.config.getoption("--deploy_name") \ No newline at end of file diff --git a/mnist/testing/deploy_test.py b/mnist/testing/deploy_test.py index f4f408e0e..0cc4b063b 100644 --- a/mnist/testing/deploy_test.py +++ b/mnist/testing/deploy_test.py @@ -10,80 +10,75 @@ * Provides utilities for testing Manually running the test - 1. Configure your KUBECONFIG file to point to the desired cluster - 2. Set --params=name=${NAME},namespace=${NAMESPACE} - * name should be the name for your job - * namespace should be the namespace to use - 3. Use the modelBasePath parameter to the model to test. 
- --params=...,modelBasePath=${MODEL_BASE_PATH} + pytest deploy_test.py \ + --deploy_name=mnist-deploy-test-${BUILD_ID} \ + --namespace=${namespace} \ + --model_dir=${modelDir} \ + --export_dir=${modelDir} """ import logging import os -import subprocess +import pytest +from kubernetes.config import kube_config from kubernetes import client as k8s_client -from kubeflow.tf_operator import test_runner #pylint: disable=no-name-in-module -from kubeflow.testing import test_util from kubeflow.testing import util -# TODO(jlewi): Should we refactor this to use pytest like predict_test -# and not depend on test_runner. -class MnistDeployTest(test_util.TestCase): - def __init__(self, args): - namespace, name, env = test_runner.parse_runtime_params(args) - self.app_dir = args.app_dir - - if not self.app_dir: - self.app_dir = os.path.join(os.path.dirname(__file__), "..", - "serving/GCS") - self.app_dir = os.path.abspath(self.app_dir) - logging.info("--app_dir not set defaulting to: %s", self.app_dir) - - self.env = env - self.namespace = namespace - self.params = args.params - super(MnistDeployTest, self).__init__(class_name="MnistDeployTest", - name=name) - - def test_serve(self): - # We repeat the test multiple times. - # This ensures that if we delete the job we can create a new job with the - # same name. - api_client = k8s_client.ApiClient() - - # TODO (jinchihe) beflow code will be removed once new test-worker image - # is publish in https://github.com/kubeflow/testing/issues/373. 
- kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ - 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' - util.run(['wget', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=self.app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=self.app_dir) - - # Apply the components - configmap = 'mnist-map-serving' - for pair in self.params.split(","): - k, v = pair.split("=", 1) - if k == "namespace": - util.run(['kustomize', 'edit', 'set', k, v], cwd=self.app_dir) - else: - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=' + k + '=' + v], cwd=self.app_dir) - - # Seems the util.run cannot handle pipes case, using check_call. - subCmd = 'kustomize build ' + self.app_dir + '| kubectl apply -f -' - subprocess.check_call(subCmd, shell=True) - - util.wait_for_deployment(api_client, self.namespace, self.name, - timeout_minutes=4) - - # We don't delete the resources. We depend on the namespace being - # garbage collected. + +def test_deploy(record_xml_attribute, deploy_name, namespace, model_dir, export_dir): + + util.set_pytest_junit(record_xml_attribute, "test_deploy") + + util.maybe_activate_service_account() + + app_dir = os.path.join(os.path.dirname(__file__), "../serving/GCS") + app_dir = os.path.abspath(app_dir) + logging.info("--app_dir not set defaulting to: %s", app_dir) + + # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue: + # https://github.com/kubernetes-sigs/kustomize/issues/1295 + kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ + 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' + util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir) + util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir) + + # TODO (@jinchihe): The kubectl need to be upgraded to 1.14.0 due to below issue. + # Invalid object doesn't have additional properties ... 
+ kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \ + 'release/v1.14.0/bin/linux/amd64/kubectl' + util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir) + util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir) + + # Configure custom parameters using kustomize + configmap = 'mnist-map-serving' + util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir) + util.run(['kustomize', 'edit', 'add', 'configmap', configmap, + '--from-literal=name' + '=' + deploy_name], cwd=app_dir) + + util.run(['kustomize', 'edit', 'add', 'configmap', configmap, + '--from-literal=modelBasePath=' + model_dir], cwd=app_dir) + util.run(['kustomize', 'edit', 'add', 'configmap', configmap, + '--from-literal=exportDir=' + export_dir], cwd=app_dir) + + # Apply the components + util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir) + util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir) + + kube_config.load_kube_config() + api_client = k8s_client.ApiClient() + util.wait_for_deployment(api_client, namespace, deploy_name, timeout_minutes=4) + + # We don't delete the resources. We depend on the namespace being + # garbage collected. if __name__ == "__main__": - # TODO(jlewi): It looks like using test_runner we don't exit with an error - # if the deployment doesn't succeed. So the Argo workflow continues which - # isn't what we want. Might be a good reason to switch to using - # pytest. - test_runner.main(module=__name__) + logging.basicConfig(level=logging.INFO, + format=('%(levelname)s|%(asctime)s' + '|%(pathname)s|%(lineno)d| %(message)s'), + datefmt='%Y-%m-%dT%H:%M:%S', + ) + logging.getLogger().setLevel(logging.INFO) + pytest.main() diff --git a/mnist/testing/tfjob_test.py b/mnist/testing/tfjob_test.py index cd82ee05e..52422d6f9 100644 --- a/mnist/testing/tfjob_test.py +++ b/mnist/testing/tfjob_test.py @@ -13,129 +13,130 @@ * Provides utilities for testing Manually running the test - 1. 
Configure your KUBECONFIG file to point to the desired cluster - 2. Set --params=name=${NAME},namespace=${NAMESPACE} - * name should be the name for your job - * namespace should be the namespace to use - 3. To test a new image set the parameter image e.g - --params=name=${NAME},namespace=${NAMESPACE},image=${IMAGE} - 4. To control how long it trains set sample_size and num_epochs - --params=trainSteps=10,batchSize=10,... + pytest tfjob_test.py \ + --tfjob_name=tfjobs-test-${BUILD_ID} \ + --namespace=${test_namespace} \ + --trainer_image=${training_image} \ + --train_steps=10 \ + --batch_size=10 \ + --learning_rate=0.01 \ + --num_ps=1 \ + --num_workers=2 \ + --model_dir=${model_dir} \ + --export_dir=${model_dir} """ import json import logging import os -import subprocess +import pytest +from kubernetes.config import kube_config from kubernetes import client as k8s_client from kubeflow.tf_operator import tf_job_client #pylint: disable=no-name-in-module -from kubeflow.tf_operator import test_runner #pylint: disable=no-name-in-module -from kubeflow.testing import test_util from kubeflow.testing import util -class TFJobTest(test_util.TestCase): - def __init__(self, args): - namespace, name, env = test_runner.parse_runtime_params(args) - self.app_dir = args.app_dir - - if not self.app_dir: - self.app_dir = os.path.join(os.path.dirname(__file__), "..", - "training/GCS") - self.app_dir = os.path.abspath(self.app_dir) - logging.info("--app_dir not set defaulting to: %s", self.app_dir) - - self.env = env - self.namespace = namespace - self.params = args.params - super(TFJobTest, self).__init__(class_name="TFJobTest", name=name) - - def test_train(self): - # We repeat the test multiple times. - # This ensures that if we delete the job we can create a new job with the - # same name. - api_client = k8s_client.ApiClient() - - # TODO (jinchihe) beflow code will be removed once new test-worker image - # is publish in https://github.com/kubeflow/testing/issues/373. 
- kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ - 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' - util.run(['wget', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=self.app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=self.app_dir) - - # Setup parameters for kustomize - configmap = 'mnist-map-training' - for pair in self.params.split(","): - k, v = pair.split("=", 1) - if k == "namespace": - util.run(['kustomize', 'edit', 'set', k, v], cwd=self.app_dir) - elif k == "image": - util.run(['kustomize', 'edit', 'set', k, 'training-image=' + v], cwd=self.app_dir) - elif k == "numPs": - util.run(['../base/definition.sh', '--numPs', v], cwd=self.app_dir) - elif k == "numWorkers": - util.run(['../base/definition.sh', '--numWorkers', v], cwd=self.app_dir) - elif k == "secret": - secretName, secretMountPath = v.split("=", 1) - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=secretName=' + secretName], cwd=self.app_dir) - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=secretMountPath=' + secretMountPath], cwd=self.app_dir) - elif k == "envVariables": - var_k, var_v = v.split("=", 1) - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=' + var_k + '=' + var_v], cwd=self.app_dir) - else: - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=' + k + '=' + v], cwd=self.app_dir) - - # Create the TF job - # Seems the util.run cannot handle pipes case, using check_call. - subCmd = 'kustomize build ' + self.app_dir + '| kubectl apply -f -' - subprocess.check_call(subCmd, shell=True) - logging.info("Created job %s in namespaces %s", self.name, self.namespace) - - # Wait for the job to complete. 
- logging.info("Waiting for job to finish.") - results = tf_job_client.wait_for_job( - api_client, - self.namespace, - self.name, - status_callback=tf_job_client.log_status) - logging.info("Final TFJob:\n %s", json.dumps(results, indent=2)) - - # Check for errors creating pods and services. Can potentially - # help debug failed test runs. - creation_failures = tf_job_client.get_creation_failures_from_tfjob( - api_client, self.namespace, results) - if creation_failures: - logging.warning(creation_failures) - - if not tf_job_client.job_succeeded(results): - self.failure = "Job {0} in namespace {1} in status {2}".format( # pylint: disable=attribute-defined-outside-init - self.name, self.namespace, results.get("status", {})) - logging.error(self.failure) - - # if the TFJob failed, print out the pod logs for debugging. - pod_names = tf_job_client.get_pod_names( - api_client, self.namespace, self.name) - logging.info("The Pods name:\n %s", pod_names) - - core_api = k8s_client.CoreV1Api(api_client) - - for pod in pod_names: - logging.info("Getting logs of Pod %s.", pod) - try: - pod_logs = core_api.read_namespaced_pod_log(pod, self.namespace) - logging.info("The logs of Pod %s log:\n %s", pod, pod_logs) - except k8s_client.rest.ApiException as e: - logging.info("Exception when calling CoreV1Api->read_namespaced_pod_log: %s\n", e) - return - - # We don't delete the jobs. We rely on TTLSecondsAfterFinished - # to delete old jobs. Leaving jobs around should make it - # easier to debug. 
+def test_training(record_xml_attribute, tfjob_name, namespace, trainer_image, num_ps, #pylint: disable=too-many-arguments + num_workers, train_steps, batch_size, learning_rate, model_dir, export_dir): + + util.set_pytest_junit(record_xml_attribute, "test_mnist") + + util.maybe_activate_service_account() + + app_dir = os.path.join(os.path.dirname(__file__), "../training/GCS") + app_dir = os.path.abspath(app_dir) + logging.info("--app_dir not set defaulting to: %s", app_dir) + + # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue: + # https://github.com/kubernetes-sigs/kustomize/issues/1295 + kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ + 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' + util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir) + util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir) + + # TODO (@jinchihe): The kubectl need to be upgraded to 1.14.0 due to below issue. + # Invalid object doesn't have additional properties ... 
+ kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \ + 'release/v1.14.0/bin/linux/amd64/kubectl' + util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir) + util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir) + + # Configure custom parameters using kustomize + configmap = 'mnist-map-training' + util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir) + util.run(['kustomize', 'edit', 'set', 'image', trainer_image], cwd=app_dir) + + util.run(['../base/definition.sh', '--numPs', num_ps], cwd=app_dir) + util.run(['../base/definition.sh', '--numWorkers', num_workers], cwd=app_dir) + + training_config = { + "name": tfjob_name, + "trainSteps": train_steps, + "batchSize": batch_size, + "learningRate": learning_rate, + "modelDir": model_dir, + "exportDir": export_dir, + } + + for key, value in training_config.items(): + util.run(['kustomize', 'edit', 'add', 'configmap', configmap, + '--from-literal=' + key + '=' + value], cwd=app_dir) + + # Create the TFJob. + util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir) + util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir) + logging.info("Created job %s in namespace %s", tfjob_name, namespace) + + kube_config.load_kube_config() + api_client = k8s_client.ApiClient() + + # Wait for the job to complete. + logging.info("Waiting for job to finish.") + results = tf_job_client.wait_for_job( + api_client, + namespace, + tfjob_name, + status_callback=tf_job_client.log_status) + logging.info("Final TFJob:\n %s", json.dumps(results, indent=2)) + + # Check for errors creating pods and services. Can potentially + # help debug failed test runs. 
+ creation_failures = tf_job_client.get_creation_failures_from_tfjob( + api_client, namespace, results) + if creation_failures: + logging.warning(creation_failures) + + if not tf_job_client.job_succeeded(results): + failure = "Job {0} in namespace {1} in status {2}".format( + tfjob_name, namespace, results.get("status", {})) + logging.error(failure) + + # If the TFJob failed, print out the pod logs for debugging. + pod_names = tf_job_client.get_pod_names( + api_client, namespace, tfjob_name) + logging.info("Pod names:\n %s", pod_names) + + core_api = k8s_client.CoreV1Api(api_client) + + for pod in pod_names: + logging.info("Getting logs of Pod %s.", pod) + try: + pod_logs = core_api.read_namespaced_pod_log(pod, namespace) + logging.info("Logs of Pod %s:\n %s", pod, pod_logs) + except k8s_client.rest.ApiException as e: + logging.info("Exception when calling CoreV1Api->read_namespaced_pod_log: %s\n", e) + # Fail the test explicitly; a bare return would report the test as passed. + pytest.fail(failure) + + # We don't delete the jobs. We rely on TTLSecondsAfterFinished + # to delete old jobs. Leaving jobs around should make it + # easier to debug. 
if __name__ == "__main__": - test_runner.main(module=__name__) + logging.basicConfig(level=logging.INFO, + format=('%(levelname)s|%(asctime)s' + '|%(pathname)s|%(lineno)d| %(message)s'), + datefmt='%Y-%m-%dT%H:%M:%S', + ) + logging.getLogger().setLevel(logging.INFO) + pytest.main() diff --git a/mnist/training/GCS/Chief_patch.yaml b/mnist/training/GCS/Chief_patch.yaml deleted file mode 100644 index 8d3e6c221..000000000 --- a/mnist/training/GCS/Chief_patch.yaml +++ /dev/null @@ -1,17 +0,0 @@ -- op: add - path: /spec/tfReplicaSpecs/Chief/template/spec/containers/0/volumeMounts - value: - - mountPath: $(secretMountPath) - name: user-gcp-sa - readOnly: true -- op: add - path: /spec/tfReplicaSpecs/Chief/template/spec/volumes - value: - - name: user-gcp-sa - secret: - secretName: $(secretName) -- op: add - path: /spec/tfReplicaSpecs/Chief/template/spec/containers/0/env/- - value: - name: GOOGLE_APPLICATION_CREDENTIALS - value: $(GOOGLE_APPLICATION_CREDENTIALS) diff --git a/mnist/training/GCS/Ps_patch.yaml b/mnist/training/GCS/Ps_patch.yaml deleted file mode 100644 index e0258c208..000000000 --- a/mnist/training/GCS/Ps_patch.yaml +++ /dev/null @@ -1,17 +0,0 @@ -- op: add - path: /spec/tfReplicaSpecs/Ps/template/spec/containers/0/volumeMounts - value: - - mountPath: $(secretMountPath) - name: user-gcp-sa - readOnly: true -- op: add - path: /spec/tfReplicaSpecs/Ps/template/spec/volumes - value: - - name: user-gcp-sa - secret: - secretName: $(secretName) -- op: add - path: /spec/tfReplicaSpecs/Ps/template/spec/containers/0/env/- - value: - name: GOOGLE_APPLICATION_CREDENTIALS - value: $(GOOGLE_APPLICATION_CREDENTIALS) diff --git a/mnist/training/GCS/Worker_patch.yaml b/mnist/training/GCS/Worker_patch.yaml deleted file mode 100644 index 6e0bcaedf..000000000 --- a/mnist/training/GCS/Worker_patch.yaml +++ /dev/null @@ -1,17 +0,0 @@ -- op: add - path: /spec/tfReplicaSpecs/Worker/template/spec/containers/0/volumeMounts - value: - - mountPath: $(secretMountPath) - name: user-gcp-sa 
- readOnly: true -- op: add - path: /spec/tfReplicaSpecs/Worker/template/spec/volumes - value: - - name: user-gcp-sa - secret: - secretName: $(secretName) -- op: add - path: /spec/tfReplicaSpecs/Worker/template/spec/containers/0/env/- - value: - name: GOOGLE_APPLICATION_CREDENTIALS - value: $(GOOGLE_APPLICATION_CREDENTIALS) diff --git a/mnist/training/GCS/kustomization.yaml b/mnist/training/GCS/kustomization.yaml index a0d85ce9d..5d9fcd414 100644 --- a/mnist/training/GCS/kustomization.yaml +++ b/mnist/training/GCS/kustomization.yaml @@ -4,9 +4,6 @@ kind: Kustomization bases: - ../base -configurations: -- params.yaml - # TBD (jinchihe) Need move the image to base file once. # the issue addressed: kubernetes-sigs/kustomize/issues/1040 # TBD (jinchihe) Need to update the image once @@ -16,33 +13,3 @@ images: newName: gcr.io/kubeflow-examples/mnist/model newTag: v20190111-v0.2-148-g313770f -vars: -- fieldref: - fieldPath: data.GOOGLE_APPLICATION_CREDENTIALS - name: GOOGLE_APPLICATION_CREDENTIALS - objref: - apiVersion: v1 - kind: ConfigMap - name: mnist-map-training -- fieldref: - fieldPath: data.secretName - name: secretName - objref: - apiVersion: v1 - kind: ConfigMap - name: mnist-map-training -- fieldref: - fieldPath: data.secretMountPath - name: secretMountPath - objref: - apiVersion: v1 - kind: ConfigMap - name: mnist-map-training - -patchesJson6902: -- path: Chief_patch.yaml - target: - group: kubeflow.org - kind: TFJob - name: $(trainingName) - version: v1beta2 diff --git a/mnist/training/GCS/params.yaml b/mnist/training/GCS/params.yaml deleted file mode 100644 index 6da474775..000000000 --- a/mnist/training/GCS/params.yaml +++ /dev/null @@ -1,15 +0,0 @@ -varReference: -- path: metadata/name - kind: TFJob -- path: spec/tfReplicaSpecs/Chief/template/spec/volumes/secret/secretName - kind: TFJob -- path: spec/tfReplicaSpecs/Chief/template/spec/containers/volumeMounts/mountPath - kind: TFJob -- path: 
spec/tfReplicaSpecs/Worker/template/spec/volumes/secret/secretName - kind: TFJob -- path: spec/tfReplicaSpecs/Worker/template/spec/containers/volumeMounts/mountPath - kind: TFJob -- path: spec/tfReplicaSpecs/Ps/template/spec/volumes/secret/secretName - kind: TFJob -- path: spec/tfReplicaSpecs/Ps/template/spec/containers/volumeMounts/mountPath - kind: TFJob diff --git a/mnist/training/base/Chief.yaml b/mnist/training/base/Chief.yaml index ed8c0d8c6..6b8991e7a 100644 --- a/mnist/training/base/Chief.yaml +++ b/mnist/training/base/Chief.yaml @@ -1,4 +1,4 @@ -apiVersion: kubeflow.org/v1beta2 +apiVersion: kubeflow.org/v1 kind: TFJob metadata: name: $(trainingName) diff --git a/mnist/training/base/Ps.yaml b/mnist/training/base/Ps.yaml index 859de9fb6..e694337e9 100644 --- a/mnist/training/base/Ps.yaml +++ b/mnist/training/base/Ps.yaml @@ -1,4 +1,4 @@ -apiVersion: kubeflow.org/v1beta2 +apiVersion: kubeflow.org/v1 kind: TFJob metadata: name: $(trainingName) diff --git a/mnist/training/base/Worker.yaml b/mnist/training/base/Worker.yaml index 11630e2a4..9692dcb9b 100644 --- a/mnist/training/base/Worker.yaml +++ b/mnist/training/base/Worker.yaml @@ -1,4 +1,4 @@ -apiVersion: kubeflow.org/v1beta2 +apiVersion: kubeflow.org/v1 kind: TFJob metadata: name: $(trainingName) diff --git a/mnist/training/base/definition.sh b/mnist/training/base/definition.sh index f73c30d7f..62def4bb9 100755 --- a/mnist/training/base/definition.sh +++ b/mnist/training/base/definition.sh @@ -40,7 +40,7 @@ if [ "x${numPs}" != "x" ]; then \ group: kubeflow.org \ \ kind: TFJob \ \ name: \$(trainingName) \ -\ version: v1beta2 \ +\ version: v1 \ \ ' kustomization.yaml else @@ -59,7 +59,7 @@ if [ "x${numWorkers}" != "x" ]; then \ group: kubeflow.org \ \ kind: TFJob \ \ name: \$(trainingName) \ -\ version: v1beta2 \ +\ version: v1 \ \ ' kustomization.yaml else diff --git a/prow_config.yaml b/prow_config.yaml index 981dd20b1..5dc93e196 100644 --- a/prow_config.yaml +++ b/prow_config.yaml @@ -21,17 +21,6 @@ 
workflows: include_dirs: - code_search/* - # E2E test for mnist example - - app_dir: kubeflow/examples/test/workflows - component: mnist - name: mnist - job_types: - - periodic - - presubmit - - postsubmit - include_dirs: - - mnist/* - # E2E test for github issue summarization example - app_dir: kubeflow/examples/test/workflows component: gis @@ -75,3 +64,14 @@ workflows: include_dirs: - xgboost_synthetic/* - py/kubeflow/examples/create_e2e_workflow.py + + # E2E test for mnist example + - py_func: kubeflow.examples.create_e2e_workflow.create_workflow + name: mnist + job_types: + - periodic + - presubmit + - postsubmit + include_dirs: + - mnist/* + - py/kubeflow/examples/create_e2e_workflow.py diff --git a/py/kubeflow/examples/create_e2e_workflow.py b/py/kubeflow/examples/create_e2e_workflow.py index 61dcb9555..6f391373a 100644 --- a/py/kubeflow/examples/create_e2e_workflow.py +++ b/py/kubeflow/examples/create_e2e_workflow.py @@ -51,7 +51,9 @@ MAIN_REPO = "kubeflow/examples" -EXTRA_REPOS = ["kubeflow/testing@HEAD"] +EXTRA_REPOS = ["kubeflow/testing@HEAD", "kubeflow/tf-operator@HEAD"] + +PROW_DICT = argo_build_util.get_prow_dict() class Builder: def __init__(self, name=None, namespace=None, test_target_name=None, @@ -97,6 +99,10 @@ def __init__(self, name=None, namespace=None, test_target_name=None, # py scripts to use. self.kubeflow_testing_py = self.src_root_dir + "/kubeflow/testing/py" + # The directory within the tf-operator submodule containing + # py scripts to use. + self.kubeflow_tfjob_py = self.src_root_dir + "/kubeflow/tf-operator/py" + # The class name to label junit files. # We want to be able to group related tests in test grid. 
# Test grid allows grouping by target which corresponds to the classname @@ -199,7 +205,7 @@ def _build_task_template(self): common_env = [ {'name': 'PYTHONPATH', 'value': ":".join([self.kubeflow_py, self.kubeflow_py + "/py", - self.kubeflow_testing_py,])}, + self.kubeflow_testing_py, self.kubeflow_tfjob_py])}, {'name': 'KUBECONFIG', 'value': os.path.join(self.test_dir, 'kfctl_test/.kube/kubeconfig')}, ] @@ -229,7 +235,7 @@ def _build_step(self, name, workflow, dag_name, task_template, return None - def _build_tests_dag(self): + def _build_tests_dag_notebooks(self): """Build the dag for the set of tests to run against a KF deployment.""" task_template = self._build_task_template() @@ -253,6 +259,82 @@ def _build_tests_dag(self): "xgboost_synthetic", "testing") + def _build_tests_dag_mnist(self): + """Build the dag for the set of tests to run mnist TFJob tests.""" + + task_template = self._build_task_template() + + # *************************************************************************** + # Build mnist image + step_name = "build-image" + train_image_base = "gcr.io/kubeflow-ci/mnist" + train_image_tag = "build-" + PROW_DICT['BUILD_ID'] + command = ["/bin/bash", + "-c", + "gcloud auth activate-service-account --key-file=$(GOOGLE_APPLICATION_CREDENTIALS) \ + && make build-gcb IMG=" + train_image_base + " TAG=" + train_image_tag, + ] + dependencies = [] + build_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, + command, dependencies) + build_step["container"]["workingDir"] = os.path.join(self.src_dir, "mnist") + + # *************************************************************************** + # Test mnist TFJob + step_name = "tfjob-test" + # Using python2 to run the test to avoid dependency error. + command = ["python2", "-m", "pytest", "tfjob_test.py", + # Increase the log level so that info level log statements show up. 
+               "--log-cli-level=info",
+               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
+               # Test timeout in seconds.
+               "--timeout=1800",
+               "--junitxml=" + self.artifacts_dir + "/junit_tfjob-test.xml",
+              ]
+
+    dependencies = [build_step['name']]
+    tfjob_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
+                                  command, dependencies)
+    tfjob_step["container"]["workingDir"] = os.path.join(self.src_dir,
+                                                         "mnist",
+                                                         "testing")
+
+    # ***************************************************************************
+    # Test mnist deploy
+    step_name = "deploy-test"
+    command = ["pytest", "deploy_test.py",
+               # Increase the log level so that info level log statements show up.
+               "--log-cli-level=info",
+               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
+               # Test timeout in seconds.
+               "--timeout=1800",
+               "--junitxml=" + self.artifacts_dir + "/junit_deploy-test.xml",
+              ]
+
+    dependencies = [tfjob_step["name"]]
+    deploy_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
+                                   command, dependencies)
+    deploy_step["container"]["workingDir"] = os.path.join(self.src_dir,
+                                                          "mnist",
+                                                          "testing")
+    # ***************************************************************************
+    # Test mnist predict
+    step_name = "predict-test"
+    command = ["pytest", "predict_test.py",
+               # Increase the log level so that info level log statements show up.
+               "--log-cli-level=info",
+               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
+               # Test timeout in seconds.
+               "--timeout=1800",
+               "--junitxml=" + self.artifacts_dir + "/junit_predict-test.xml",
+              ]
+
+    dependencies = [deploy_step["name"]]
+    predict_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
+                                    command, dependencies)
+    predict_step["container"]["workingDir"] = os.path.join(self.src_dir,
+                                                           "mnist",
+                                                           "testing")

   def _build_exit_dag(self):
     """Build the exit handler dag"""
@@ -337,7 +419,12 @@ def build(self):

     #**************************************************************************
     # Run a dag of tests
-    self._build_tests_dag()
+    if self.test_target_name == "notebooks":
+      self._build_tests_dag_notebooks()
+    elif self.test_target_name == "mnist":
+      self._build_tests_dag_mnist()
+    else:
+      raise RuntimeError('Invalid test_target_name')

     # Add a task to run the dag
     dependencies = [credentials["name"]]
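
Review note: the `tfjob-test`, `deploy-test`, and `predict-test` steps in `_build_tests_dag_mnist` build nearly identical pytest commands that differ only in the test file, the junit file name, and the optional `python2 -m` prefix. The sketch below shows one way the shared part could be expressed. The helper name `build_pytest_command` and its parameters are hypothetical and not part of this change; the flags are copied from the diff above.

```python
# Hypothetical helper (not in the PR): builds the argv list that each mnist
# pytest step in the diff assembles by hand.
LOG_FORMAT = "'%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'"


def build_pytest_command(test_file, artifacts_dir, step_name,
                         timeout_seconds=1800, python=None):
  """Return the pytest command for one test step.

  Args:
    test_file: Test module to run, e.g. "tfjob_test.py".
    artifacts_dir: Directory where the junit XML file is written.
    step_name: Step name used in the junit file name, e.g. "tfjob-test".
    timeout_seconds: Value passed to --timeout.
    python: Optional interpreter; when set, pytest runs as "<python> -m pytest".
  """
  prefix = [python, "-m", "pytest"] if python else ["pytest"]
  return prefix + [
    test_file,
    # Increase the log level so that info level log statements show up.
    "--log-cli-level=info",
    "--log-cli-format=" + LOG_FORMAT,
    # Test timeout in seconds.
    "--timeout=" + str(timeout_seconds),
    "--junitxml=" + artifacts_dir + "/junit_" + step_name + ".xml",
  ]


if __name__ == "__main__":
  # The artifacts path here is a placeholder for illustration only.
  for arg in build_pytest_command("tfjob_test.py", "/tmp/artifacts",
                                  "tfjob-test", python="python2"):
    print(arg)
```

With a helper along these lines, each step would only need to supply its test file and step name, and the `dependencies` chaining from the diff would stay unchanged.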