
Add a ResNet example from NVIDIA #964

Merged (3 commits, Apr 11, 2019)
Conversation

@khoa-ho (Contributor) commented Mar 13, 2019

Add an end-to-end training & serving Kubeflow pipeline for ResNet on CIFAR10, using various NVIDIA technologies



@googlebot (Collaborator):

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@k8s-ci-robot (Contributor):

Hi @khoa-ho. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@khoa-ho khoa-ho marked this pull request as ready for review March 13, 2019 01:08
@khoa-ho (Contributor, Author) commented Mar 13, 2019


I signed it!

@googlebot (Collaborator):

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@khoa-ho (Contributor, Author) commented Mar 13, 2019

/assign @gaoning777

@@ -0,0 +1,25 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Reviewer:

Would it be possible to match the license file in the other directories (Apache 2.0)?

Example: https://github.com/kubeflow/pipelines/blob/master/samples/resnet-cmle/resnet-train-pipeline.py

Author (@khoa-ho):

Yeah Apache should be fine too. I'm confirming with legal and will update that.

Reviewer:

Thank you.

Reviewer:

Thanks.

Author (@khoa-ho):

License has been changed to Apache


@@ -0,0 +1,84 @@
#!/bin/bash
Reviewer:

Is it possible to re-use the launch process for Kubeflow and Kubeflow Pipeline? We probably don't want each pipeline to provide its own installation of Kubeflow on Minikube.

Author (@khoa-ho):

Because of some GPU dependencies for the Docker runtime and Kubernetes, we wanted to provide a one-time installation script for anyone trying this example on a new system. After that, for every new pipeline, the user only has to run build_pipeline.py again, which rebuilds the images for each pipeline component and recompiles the pipeline definition. If there are better practices for this process, please let me know.

Reviewer:

/cc @IronPan

Yang, is there a way we could modify the launcher scripts so that the NVIDIA use case is supported?

Author (@khoa-ho):

Do you have any update on this? Also which launcher scripts are you referring to?

@@ -0,0 +1,11 @@
kind: PersistentVolumeClaim
Reviewer:

Would it make sense for the pipeline itself to perform these steps?

Author (@khoa-ho):

Yeah, it would be nice if we could mount persistent volumes within the pipeline, but I'm not sure how to do that besides running kubectl create -f volume.yaml on the host system, as we did with the mount_persistent_volume.sh script.
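For reference, a minimal volume.yaml of the kind mentioned above might look like the following; the claim name and storage size here are illustrative, not taken from the PR.

```yaml
# Illustrative PersistentVolumeClaim manifest (name and size assumed)
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: resnet-cifar10-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Applying it with kubectl create -f volume.yaml creates the claim that the pipeline steps can then mount.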


@@ -0,0 +1,93 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Reviewer:

Is there a way to add this visualization as an output of the pipeline so that the user does not need to run a new webserver? (The Kubeflow/Pipeline UI can display static HTML).

Author (@khoa-ho):

It might be possible. We'll look into it.

Author (@khoa-ho):

We've integrated the webapp component into the pipeline itself. In addition, the webapp UI is routed to a subpath of the Kubeflow UI (e.g. localhost:[kubeflow-port]/webapp/).
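For context, Kubeflow at the time routed its UIs through Ambassador, so subpath routing like this was commonly done with a Mapping annotation on the Service. A sketch only; the service name, prefix, and port below are assumptions, not taken from the PR.

```yaml
# Illustrative only: service name, prefix, and port are assumptions
apiVersion: v1
kind: Service
metadata:
  name: webapp
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: webapp-mapping
      prefix: /webapp/
      service: webapp:8080
spec:
  selector:
    app: webapp
  ports:
    - port: 8080
      targetPort: 8080
```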

@vicaire vicaire self-assigned this Mar 15, 2019
@k8s-ci-robot k8s-ci-robot requested a review from IronPan March 19, 2019 02:12

op_dict['train'] = train_op(
persistent_volume_path, processed_data_dir, model_dir)
op_dict['train'].after(op_dict['preprocess'])
@Ark-kun (Contributor) commented Mar 21, 2019:

Having to manually specify dependencies using .after is usually a symptom of components that are not fully composable. Making components composable is a bit complicated now due to the lack of artifact-passing support, but it's still pretty easy to do.

If one component produces data that another component will use, that data dependency must be made explicit: the producer component must declare an output, and a reference to that output is then passed to the consumer component as an input. When you pass an output reference as an argument, there is no need for .after.

preprocess_op should output processed_data_dir, which is then passed to train_op
train_op should output model_dir, which is then passed to serve_op
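The point about output references can be illustrated with a toy model. This is not the real kfp API, just a self-contained sketch of the idea that passing one op's output reference as another op's argument records the dependency edge automatically, so no .after() call is needed:

```python
# Toy model (not the real kfp API) of composable pipeline ops.

class Op:
    def __init__(self, name, arguments=(), file_outputs=()):
        self.name = name
        # Each declared file output becomes a reference that other ops can consume.
        self.outputs = {key: (self, key) for key in file_outputs}
        # Any argument that is another op's output reference implies a dependency.
        self.dependencies = [arg[0] for arg in arguments
                             if isinstance(arg, tuple) and isinstance(arg[0], Op)]

preprocess = Op('preprocess', file_outputs=['processed_data_dir'])
train = Op('train',
           arguments=[preprocess.outputs['processed_data_dir']],
           file_outputs=['model_dir'])
serve = Op('serve', arguments=[train.outputs['model_dir']])

# The edges fall out of the data flow -- no .after() calls anywhere.
print([d.name for d in train.dependencies])  # ['preprocess']
print([d.name for d in serve.dependencies])  # ['train']
```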

Author (@khoa-ho):

Data dependency is now made explicit (i.e. passing output between ops instead of using .after)

PIPELINE_NAME = 'resnet_cifar10_pipeline'


def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
Reviewer:

It might be better to remove the persistent_volume_path parameter and just pass the full paths in input_dir and output_dir.

@khoa-ho (Author) commented Apr 1, 2019:

Ops use full paths now.


def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
return dsl.ContainerOp(
name=step_name,
Reviewer:

There is no need to pass the step name through a function parameter. (It was needed in the past, when names had to be unique.) Just use a constant name here, e.g. name='nvidian/sae/ananths - Preprocess' (or a better name).

Author (@khoa-ho):

Updated.

return dsl.ContainerOp(
name=step_name,
image='nvcr.io/nvidian/sae/ananths:kubeflow-preprocess',
command=['python'],
Reviewer:

Writing something along these lines will add the processed_data_dir output to this component which can then be passed to the next component.

command=[
  'sh', '-c', 'echo "$0" > "$1"; "$*"', output_dir, '/tmp/output_dir',  
  'python',
],
file_outputs={'processed_data_dir': '/tmp/output_dir'}

Author (@khoa-ho):

Updated.
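The wrapper pattern suggested in that review thread can be sketched on its own: write the output path into a small metadata file first, then run the real command, so the pipeline system can read the path back as the step's declared output (via file_outputs). The paths and the preprocess command below are illustrative, not from the PR.

```shell
# Sketch of the output-file wrapper (paths illustrative).
set -e
output_dir="/data/processed"
meta_file="$(mktemp)"
# Record the output location where file_outputs can find it.
echo "$output_dir" > "$meta_file"
# ...the real work would run here, e.g.:
#   python preprocess.py --output "$output_dir"
cat "$meta_file"   # prints /data/processed
```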

@googlebot (Collaborator):

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@googlebot (Collaborator):

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@khoa-ho (Contributor, Author) commented Apr 1, 2019

@vicaire @Ark-kun Please have a look at the 2nd commit and my replies above. Thank you!

@khoa-ho (Contributor, Author) commented Apr 8, 2019

@vicaire @Ark-kun Hi, would you be able to review the code this week? We're trying to have a public release for a demo at Google Cloud Next '19.

)
logging.getLogger().setLevel(logging.INFO)
logging.info("Starting flask.")
app.run(debug=True, host='0.0.0.0', port=8080)
Reviewer:

Is it possible to avoid checking-in all the images below in the repo? Could the images be provided in a public dataset instead?

Author (@khoa-ho):

I agree. Commit 85c2ce9 removes all the images and adds a script to download them from a Cloud Storage bucket.

@vicaire (Contributor) commented Apr 9, 2019

/ok-to-test
/lgtm
/approve

khoa-ho added 2 commits April 8, 2019 22:19
* Integrate webapp into the pipeline

* Change license from BSD to Apache

* Route webapp UI to Kubeflow UI subpath

* Passing output between ops to establish flow

* Explicit input & output dir path

* Restructure folder
@khoa-ho (Contributor, Author) commented Apr 9, 2019

@vicaire Images will now be downloaded from a bucket. Also, I don't think we need the do-not-merge/work-in-progress label anymore.

@khoa-ho (Contributor, Author) commented Apr 9, 2019

@vicaire Please kindly review by tomorrow if possible. Thanks!

@vicaire (Contributor) commented Apr 10, 2019

/lgtm
/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vicaire

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@khoa-ho (Contributor, Author) commented Apr 10, 2019

@vicaire @Ark-kun The do-not-merge/work-in-progress label is blocking the merge. Can you help remove it? Thanks.

@khoa-ho (Contributor, Author) commented Apr 11, 2019

@IronPan @gaoning777 @hongye-sun Sorry for asking again, but can someone help remove the do-not-merge/work-in-progress label? It's blocking the merge. Thanks.

@vicaire (Contributor) commented Apr 11, 2019

/lgtm

@vicaire (Contributor) commented Apr 11, 2019

done

@vicaire vicaire merged commit ef39385 into kubeflow:master Apr 11, 2019
Linchin pushed a commit to Linchin/pipelines that referenced this pull request Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this pull request Mar 11, 2024