
Add a ResNet example from NVIDIA #964

Merged (3 commits, Apr 11, 2019)
Conversation

@khoa-ho (Contributor) commented Mar 13, 2019

Add an end-to-end training & serving Kubeflow pipeline for ResNet on CIFAR10, using various NVIDIA technologies



@googlebot (Collaborator):

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@k8s-ci-robot (Contributor):

Hi @khoa-ho. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@khoa-ho khoa-ho marked this pull request as ready for review March 13, 2019 01:08
@khoa-ho (Contributor, Author) commented Mar 13, 2019


I signed it!

@googlebot (Collaborator):

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@khoa-ho (Contributor, Author) commented Mar 13, 2019

/assign @gaoning777

@@ -0,0 +1,25 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Reviewer:

Would it be possible to match the license file in the other directories (Apache 2.0)?

Example: https://github.com/kubeflow/pipelines/blob/master/samples/resnet-cmle/resnet-train-pipeline.py

Author (@khoa-ho):

Yeah Apache should be fine too. I'm confirming with legal and will update that.

Reviewer:

Thank you.

Reviewer:

Thanks.

Author (@khoa-ho):

License has been changed to Apache


@@ -0,0 +1,84 @@
#!/bin/bash
Reviewer:

Is it possible to re-use the launch process for Kubeflow and Kubeflow Pipeline? We probably don't want each pipeline to provide its own installation of Kubeflow on Minikube.

Author (@khoa-ho):

Because of some GPU dependencies for the Docker runtime and Kubernetes, we wanted to provide a one-time installation script for anyone trying this example on a new system. After that, for every new pipeline, the user only has to run build_pipeline.py again, which rebuilds the images for each pipeline component and recompiles the pipeline definition. If there are better practices for this process, please let me know.

Reviewer:

/cc @IronPan

Yang, is there a way we could modify the launcher scripts so that the NVIDIA use case is supported?

Author (@khoa-ho):

Do you have any update on this? Also which launcher scripts are you referring to?

@@ -0,0 +1,11 @@
kind: PersistentVolumeClaim
Reviewer:

Would it make sense for the pipeline itself to perform these steps?

Author (@khoa-ho):

Yeah, it would be nice if we could mount persistent volumes within the pipeline, but I'm not sure how to do that besides running kubectl create -f volume.yaml on the host system, as we did with the mount_persistent_volume.sh script.
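For reference, a minimal volume.yaml of the kind mentioned above might look like the following; the claim name and storage size here are illustrative, not taken from the PR.

```yaml
# Illustrative PersistentVolumeClaim manifest (name and size assumed)
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: resnet-cifar10-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Applying it with kubectl create -f volume.yaml creates the claim that the pipeline steps can then mount.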


@@ -0,0 +1,93 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Reviewer:

Is there a way to add this visualization as an output of the pipeline so that the user does not need to run a new webserver? (The Kubeflow/Pipeline UI can display static HTML).

Author (@khoa-ho):

It might be possible. We'll look into it.

Author (@khoa-ho):

We've integrated the webapp component into the pipeline itself. In addition, the webapp UI is routed to a subpath of the Kubeflow UI (e.g. localhost:[kubeflow-port]/webapp/).
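For context, Kubeflow at the time routed its UIs through Ambassador, so subpath routing like this was commonly done with a Mapping annotation on the Service. A sketch only; the service name, prefix, and port below are assumptions, not taken from the PR.

```yaml
# Illustrative only: service name, prefix, and port are assumptions
apiVersion: v1
kind: Service
metadata:
  name: webapp
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: webapp-mapping
      prefix: /webapp/
      service: webapp:8080
spec:
  selector:
    app: webapp
  ports:
    - port: 8080
      targetPort: 8080
```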

@vicaire vicaire self-assigned this Mar 15, 2019
@k8s-ci-robot k8s-ci-robot requested a review from IronPan March 19, 2019 02:12

op_dict['train'] = train_op(
persistent_volume_path, processed_data_dir, model_dir)
op_dict['train'].after(op_dict['preprocess'])
@Ark-kun (Contributor) commented Mar 21, 2019:

Having to manually specify dependencies using .after is usually a symptom of components that are not fully composable. Making components composable is a bit complicated now due to the lack of artifact-passing support, but it's still pretty easy to do.

If one component produces data that another component will use, that data dependency must be made explicit: the producer component must declare an output, and a reference to that output is then passed to the consumer component as an input. When you pass an output reference as an argument, there is no need for .after.

preprocess_op should output processed_data_dir, which is then passed to train_op
train_op should output model_dir, which is then passed to serve_op
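The point about output references can be illustrated with a toy model. This is not the real kfp API, just a self-contained sketch of the idea that passing one op's output reference as another op's argument records the dependency edge automatically, so no .after() call is needed:

```python
# Toy model (not the real kfp API) of composable pipeline ops.

class Op:
    def __init__(self, name, arguments=(), file_outputs=()):
        self.name = name
        # Each declared file output becomes a reference that other ops can consume.
        self.outputs = {key: (self, key) for key in file_outputs}
        # Any argument that is another op's output reference implies a dependency.
        self.dependencies = [arg[0] for arg in arguments
                             if isinstance(arg, tuple) and isinstance(arg[0], Op)]

preprocess = Op('preprocess', file_outputs=['processed_data_dir'])
train = Op('train',
           arguments=[preprocess.outputs['processed_data_dir']],
           file_outputs=['model_dir'])
serve = Op('serve', arguments=[train.outputs['model_dir']])

# The edges fall out of the data flow -- no .after() calls anywhere.
print([d.name for d in train.dependencies])  # ['preprocess']
print([d.name for d in serve.dependencies])  # ['train']
```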

Author (@khoa-ho):

Data dependency is now made explicit (i.e. passing output between ops instead of using .after)

PIPELINE_NAME = 'resnet_cifar10_pipeline'


def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
Reviewer:

It might be better to remove the persistent_volume_path parameter and just pass the full paths in input_dir and output_dir.

@khoa-ho (Author) commented Apr 1, 2019:

Ops use full paths now.


def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
return dsl.ContainerOp(
name=step_name,
Reviewer:

There is no need to pass the step name through a function parameter. (It was needed in the past, when names had to be unique.) Just use a constant name here, e.g. name='nvidian/sae/ananths - Preprocess' (or a better name).

Author (@khoa-ho):

Updated.

return dsl.ContainerOp(
name=step_name,
image='nvcr.io/nvidian/sae/ananths:kubeflow-preprocess',
command=['python'],
Reviewer:

Writing something along these lines will add the processed_data_dir output to this component which can then be passed to the next component.

command=[
  'sh', '-c', 'echo "$0" > "$1"; "$*"', output_dir, '/tmp/output_dir',  
  'python',
],
file_outputs={'processed_data_dir': '/tmp/output_dir'}

Author (@khoa-ho):

Updated.
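The wrapper pattern suggested in that review thread can be sketched on its own: write the output path into a small metadata file first, then run the real command, so the pipeline system can read the path back as the step's declared output (via file_outputs). The paths and the preprocess command below are illustrative, not from the PR.

```shell
# Sketch of the output-file wrapper (paths illustrative).
set -e
output_dir="/data/processed"
meta_file="$(mktemp)"
# Record the output location where file_outputs can find it.
echo "$output_dir" > "$meta_file"
# ...the real work would run here, e.g.:
#   python preprocess.py --output "$output_dir"
cat "$meta_file"   # prints /data/processed
```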

@googlebot (Collaborator):

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@googlebot (Collaborator):

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@khoa-ho (Contributor, Author) commented Apr 1, 2019

@vicaire @Ark-kun Please have a look at the 2nd commit and my replies above. Thank you!

@khoa-ho (Contributor, Author) commented Apr 8, 2019

@vicaire @Ark-kun Hi, would you be able to review the code this week? We're trying to have a public release for a demo at Google Cloud Next '19.

)
logging.getLogger().setLevel(logging.INFO)
logging.info("Starting flask.")
app.run(debug=True, host='0.0.0.0', port=8080)
Reviewer:

Is it possible to avoid checking-in all the images below in the repo? Could the images be provided in a public dataset instead?

Author (@khoa-ho):

I agree. Commit 85c2ce9 removes all the images and adds a script to download them from a Cloud Storage bucket.

@vicaire (Contributor) commented Apr 9, 2019

/ok-to-test
/lgtm
/approve

khoa-ho added 2 commits April 8, 2019 22:19
* Integrate webapp into the pipeline

* Change license from BSD to Apache

* Route webapp UI to Kubeflow UI subpath

* Passing output between ops to establish flow

* Explicit input & output dir path

* Restructure folder
@khoa-ho (Contributor, Author) commented Apr 9, 2019

@vicaire Images will now be downloaded from a bucket. Also, I don't think we need the do-not-merge/work-in-progress label anymore.

@khoa-ho (Contributor, Author) commented Apr 9, 2019

@vicaire Please kindly review by tomorrow if possible. Thanks!

@vicaire (Contributor) commented Apr 10, 2019

/lgtm
/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vicaire

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@khoa-ho (Contributor, Author) commented Apr 10, 2019

@vicaire @Ark-kun The do-not-merge/work-in-progress label is blocking the merge. Can you help remove it? Thanks.

@khoa-ho (Contributor, Author) commented Apr 11, 2019

@IronPan @gaoning777 @hongye-sun Sorry for asking again, but can someone help remove the do-not-merge/work-in-progress label? It's blocking the merge. Thanks.

@vicaire (Contributor) commented Apr 11, 2019

/lgtm

@vicaire (Contributor) commented Apr 11, 2019

done

@vicaire vicaire merged commit ef39385 into kubeflow:master Apr 11, 2019
Linchin pushed a commit to Linchin/pipelines that referenced this pull request Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this pull request Mar 11, 2024