Add a ResNet example from NVIDIA #964
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
Hi @khoa-ho. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I signed it!
CLAs look good, thanks!
/assign @gaoning777
samples/nvidia-resnet/LICENSE (outdated)
@@ -0,0 +1,25 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Would it be possible to match the license file in the other directories (Apache 2.0)?
Yeah Apache should be fine too. I'm confirming with legal and will update that.
Thank you.
Thanks.
License has been changed to Apache
@@ -0,0 +1,84 @@
#!/bin/bash
Is it possible to re-use the launch process for Kubeflow and Kubeflow Pipeline? We probably don't want each pipeline to provide its own installation of Kubeflow on Minikube.
Because of some GPU dependencies for the Docker runtime and Kubernetes, we wanted to provide a one-time installation script for when someone tries this example on a new system. After that, for every new pipeline, the user only has to run `build_pipeline.py` again, which just rebuilds the images for each pipeline component and recompiles the pipeline definition. If there are better practices for this process, please let me know.
/cc @IronPan
Yang, is there a way we could modify the launcher scripts so that the NVIDIA use case is supported?
Do you have any update on this? Also which launcher scripts are you referring to?
@@ -0,0 +1,11 @@
kind: PersistentVolumeClaim
Would it make sense for the pipeline itself to perform these steps?
Yeah, it would be nice if we could mount persistent volumes within the pipeline, but I'm not sure how to do that besides running `kubectl create -f volume.yaml` in the host system, like what we did with the `mount_persistent_volume.sh` script.
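For reference, a minimal sketch of what a `volume.yaml` applied with `kubectl create -f volume.yaml` might look like — the claim name, storage class, and size here are hypothetical, not taken from the PR:

```yaml
# Hypothetical PVC manifest; adjust name and size to your cluster.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nvidia-workspace   # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```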
/cc @hongye-sun
@@ -0,0 +1,93 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Is there a way to add this visualization as an output of the pipeline so that the user does not need to run a new webserver? (The Kubeflow/Pipeline UI can display static HTML).
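For context, a hedged sketch of how a step could expose static HTML to the Kubeflow Pipelines UI: the step writes an output-viewer metadata file (conventionally `/mlpipeline-ui-metadata.json` at the container root) describing a `web-app` viewer. The exact schema below reflects our understanding of the KFP convention and should be checked against the KFP docs for your version; the HTML content and the `/tmp` path used for this demo are placeholders.

```python
import json

# Describe an inline "web-app" output viewer for the KFP UI.
# In a real component this would be written to /mlpipeline-ui-metadata.json;
# /tmp is used here only so the demo can run anywhere.
metadata = {
    "outputs": [{
        "type": "web-app",
        "storage": "inline",
        "source": "<h1>ResNet training report</h1>",  # placeholder HTML
    }]
}
with open("/tmp/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```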
It might be possible. We'll look into it.
We've integrated the webapp component into the pipeline itself. In addition, the webapp UI is routed to a subpath of the Kubeflow UI (e.g. localhost:[kubeflow-port]/webapp/).
```python
op_dict['train'] = train_op(
    persistent_volume_path, processed_data_dir, model_dir)
op_dict['train'].after(op_dict['preprocess'])
```
Having to manually specify dependencies using `.after` is usually a symptom of components that are not fully composable. Making components composable is a bit complicated now due to the lack of artifact passing support, but it's still pretty easy to do. If some component produces data that another component will use, this data dependency must be made explicit: the producer component must have an output, a reference to which is then passed to the consumer component as input. When you pass an output reference to an input as an argument, there is no need for `.after`:

- `preprocess_op` should output `processed_data_dir`, which will be passed to `train_op`
- `train_op` should output `model_dir`, which will be passed to `serve_op`
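The data-flow idea can be sketched outside kfp in plain Python (this is not the kfp API — just an illustration that when each step's output feeds the next step's input, execution order follows from the data flow and no explicit `.after()` is needed; all paths are hypothetical):

```python
def preprocess(input_dir):
    processed_data_dir = input_dir + "/processed"   # producer output
    return processed_data_dir

def train(processed_data_dir):
    model_dir = processed_data_dir + "/model"       # consumes preprocess output
    return model_dir

def serve(model_dir):
    return "serving model from " + model_dir        # consumes train output

# The call chain itself encodes the dependency order.
result = serve(train(preprocess("/mnt/workspace")))
print(result)  # → serving model from /mnt/workspace/processed/model
```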
Data dependency is now made explicit (i.e. passing output between ops instead of using `.after`).
```python
PIPELINE_NAME = 'resnet_cifar10_pipeline'

def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
```
It might be better to remove the `persistent_volume_path` parameter and just pass the full paths in `input_dir` and `output_dir`.
Ops use full paths now.
```python
def preprocess_op(persistent_volume_path, input_dir, output_dir, step_name='preprocess'):
    return dsl.ContainerOp(
        name=step_name,
```
There is no need to pass the step name through a function parameter (it was needed in the past when names had to be unique). Just use a constant name here, e.g. `name='nvidian/sae/ananths - Preprocess'` (or a better name).
Updated.
```python
    return dsl.ContainerOp(
        name=step_name,
        image='nvcr.io/nvidian/sae/ananths:kubeflow-preprocess',
        command=['python'],
```
Writing something along these lines will add the `processed_data_dir` output to this component, which can then be passed to the next component:

```python
command=[
    'sh', '-c', 'echo "$0" > "$1"; "$*"', output_dir, '/tmp/output_dir',
    'python',
],
file_outputs={'processed_data_dir': '/tmp/output_dir'}
```
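A quick standalone demo of the `sh -c` trick in the suggestion above: arguments after the command string become `$0`, `$1`, …, so the output-path value (`$0`) can be written to a metadata file (`$1`) before the real command runs. The paths here are illustrative, not from the PR.

```shell
# $0 = the output-dir value, $1 = where file_outputs will read it from.
sh -c 'echo "$0" > "$1"' '/mnt/workspace/processed' /tmp/demo_output_dir
cat /tmp/demo_output_dir   # prints /mnt/workspace/processed
```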
Updated.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
CLAs look good, thanks!
```python
)
logging.getLogger().setLevel(logging.INFO)
logging.info("Starting flask.")
app.run(debug=True, host='0.0.0.0', port=8080)
```
Is it possible to avoid checking-in all the images below in the repo? Could the images be provided in a public dataset instead?
I agree. Commit 85c2ce9 removes all the images and adds a script to download them from a Cloud Storage bucket.
/ok-to-test
* Integrate webapp into the pipeline
* Change license from BSD to Apache
* Route webapp UI to Kubeflow UI subpath
* Passing output between ops to establish flow
* Explicit input & output dir path
* Restructure folder
@vicaire Images will now be downloaded from a bucket. Also, I don't think we still need the "do-not-merge/work-in-progress" label.
@vicaire Please kindly review by tomorrow if possible. Thanks!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vicaire. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@IronPan @gaoning777 @hongye-sun Sorry for asking again, but can someone help remove the "do-not-merge/work-in-progress" label?
/lgtm
done
* handle step from task results
* address review comments
Add an end-to-end training & serving Kubeflow pipeline for ResNet on CIFAR10, using various NVIDIA technologies