[Component] Add Managed Spot Training Support for SageMaker #2219

RedbackThomson · 2019-09-24T16:55:22Z

Adds the ability to enable managed spot training as a configuration option for the SageMaker component.

Added unit tests for the new feature; they are in the tests/ folder in the repository root.
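
For readers unfamiliar with the feature, here is a minimal sketch of how the three new options could map onto a SageMaker CreateTrainingJob request. The field names (EnableManagedSpotTraining, StoppingCondition, MaxWaitTimeInSeconds, CheckpointConfig) come from the SageMaker API; the function name `spot_training_fields` and the exact mapping are illustrative assumptions, not the component's actual code.

```python
def spot_training_fields(spot_instance, max_run_time, max_wait_time, checkpoint_config):
    """Build the spot-related part of a CreateTrainingJob request (hypothetical helper)."""
    fields = {
        "EnableManagedSpotTraining": spot_instance,
        "StoppingCondition": {"MaxRuntimeInSeconds": max_run_time},
    }
    if spot_instance:
        # SageMaker requires the wait time to be at least the run time.
        if max_wait_time < max_run_time:
            raise ValueError("max_wait_time must be >= max_run_time")
        fields["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_wait_time
    if checkpoint_config:
        # Assumption: an empty dict means "no checkpointing requested".
        fields["CheckpointConfig"] = checkpoint_config
    return fields


request_part = spot_training_fields(
    spot_instance=True,
    max_run_time=3600,
    max_wait_time=86400,
    checkpoint_config={"S3Uri": "s3://example-bucket/checkpoints/"},
)
print(request_part)
```

The S3 URI above is a placeholder. The resulting dictionary would be merged into the rest of the training job request before it is sent.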

googlebot · 2019-09-24T16:55:25Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

k8s-ci-robot · 2019-09-24T16:55:36Z

Hi @RedbackThomson. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

RedbackThomson · 2019-09-24T16:57:01Z

@googlebot I signed it!

googlebot · 2019-09-24T16:57:05Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

RedbackThomson · 2019-09-24T16:57:19Z

/assign @Jeffwan

RedbackThomson · 2019-09-24T17:19:41Z

I still need to upload the new container and reference it in the configuration. Not ready to merge yet.

Jeffwan · 2019-09-24T17:24:51Z

@RedbackThomson Could you make sure the interface changes are included in samples/contribute/aws/ ?

Ark-kun · 2019-09-24T18:10:35Z

components/aws/sagemaker/train/component.yaml

+    default: '86400'
+  - name: checkpoint_config
+    description: 'Dictionary of information about the output location for managed spot training checkpoint data.'
+    default: '{}'


JFYI: The recent SDK (0.1.29+) supports the JsonObject type. Constant list arguments passed to type: JsonObject inputs will be serialized as JSON.

Does this update support nicer/easier input in the UI for JSON fields?

Not yet. For now it only works for constant lists passed to components. The auto-conversion does not work in the UI or when passing arguments to a pipeline during submission.

So this will only work for this case:

aws_sagemaker_trainop(..., checkpoint_config={'a': 1, 'b': 1}, ...)

Previously this would have just resulted in some wrong string (produced by str()) being passed. Now it would be a json.dumps-serialized string.
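
The difference between the two serializations can be sketched with the standard library alone. The dictionary contents and the helper name `is_valid_json` are illustrative assumptions; only str() and json.dumps come from the comment above.

```python
import json

# Placeholder checkpoint settings; the bucket name is made up.
checkpoint_config = {"S3Uri": "s3://example-bucket/checkpoints/", "LocalPath": "/opt/ml/checkpoints"}

as_str = str(checkpoint_config)          # Python repr with single quotes
as_json = json.dumps(checkpoint_config)  # JSON text with double quotes


def is_valid_json(text):
    """Return True if text parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(as_str)
print(as_json)
print(is_valid_json(as_str), is_valid_json(as_json))
```

A component that parses its argument with json.loads would reject the str() form, which is why the json.dumps serialization matters for dictionary inputs.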

This is a great addition. Is there any way to suppress the warning that dsl-compile throws when it infers the JsonObject type?

The warning is shown every time a non-string argument is passed to an untyped input.
The easiest way to make it go away is to add the type to the input. Then there would be no inference.

Through explicit cast?

No. Just add the type attribute to the input annotation:

  - name: checkpoint_config
    type: JsonObject
    description: 'Dictionary of information about the output location for managed spot training checkpoint data.'
    default: '{}'

Jeffwan · 2019-09-26T01:19:25Z

/hold

RedbackThomson · 2019-09-30T17:06:43Z

@Jeffwan While we wait for a more permanent container location, I wanted to push this (and another feature) through. I've built this container in my own personal repository.

Jeffwan · 2019-09-30T22:45:37Z

/hold cancel

Jeffwan · 2019-09-30T22:45:50Z

/ok-to-test

RedbackThomson · 2019-10-01T20:39:42Z

I hope to create a separate PR for JsonObject types so as to test it separately.

Jeffwan · 2019-10-01T20:41:14Z

samples/contrib/aws-samples/mnist-kmeans-sagemaker/mnist-classification-pipeline.py

@@ -72,6 +75,9 @@ def mnist_classification(region='us-west-2',
                "CompressionType": "None", \
                "RecordWrapperType": "None", \
                "InputMode": "File"}]',
+    train_spot_instance='False',


Ok. I see the changes on the example side.

Jeffwan · 2019-10-01T20:42:06Z

components/aws/sagemaker/train/src/train.py

+  ### Start spot instance support
+  parser.add_argument('--spot_instance', type=_utils.str_to_bool, required=False, help='Use managed spot training.', default=False)
+  parser.add_argument('--max_wait_time', type=_utils.str_to_int, required=False, help='The maximum time in seconds you are willing to wait for a managed spot training job to complete.', default=86400)
+  parser.add_argument('--checkpoint_config', type=_utils.str_to_json_dict, required=False, help='Dictionary of information about the output location for managed spot training checkpoint data.', default='{}')


I am curious: is checkpoint_config only for spot training? Isn't it a generic config?

Correct. The script will only create a checkpoint if it is interrupted by spot instance downtime. Otherwise, it will run from start to finish.
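
To make the argument handling in the diff above easier to follow, here is a minimal sketch of how such flags can be parsed. The converters `str_to_bool` and `str_to_json_dict` are hypothetical stand-ins for the `_utils` helpers, whose real implementations are not shown in this thread, and the built-in `int` is used in place of `_utils.str_to_int`.

```python
import argparse
import json


def str_to_bool(value):
    """Hypothetical stand-in for _utils.str_to_bool."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def str_to_json_dict(value):
    """Hypothetical stand-in for _utils.str_to_json_dict."""
    parsed = json.loads(value) if value else {}
    if not isinstance(parsed, dict):
        raise argparse.ArgumentTypeError("expected a JSON object")
    return parsed


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--spot_instance", type=str_to_bool, default=False)
    parser.add_argument("--max_wait_time", type=int, default=86400)
    # argparse runs string defaults through the type converter, so '{}' becomes {}.
    parser.add_argument("--checkpoint_config", type=str_to_json_dict, default="{}")
    return parser


# An explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(
    ["--spot_instance", "True", "--checkpoint_config", '{"S3Uri": "s3://example-bucket/ckpt/"}']
)
print(args.spot_instance, args.max_wait_time, args.checkpoint_config)
```

Because pipeline parameters arrive as strings, converters like these let the same values be passed either from the command line or from a component definition.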

Jeffwan · 2019-10-01T20:45:56Z

/lgtm

I'll wait one day to see whether other reviewers have follow-up comments. @Ark-kun

RedbackThomson · 2019-10-02T20:44:47Z

Is this ready to merge?

Ark-kun · 2019-10-02T23:18:54Z

/approve

Ark-kun · 2019-10-02T23:20:46Z

/approve cancel
I'll let the AWS team approve this.

Jeffwan · 2019-10-03T17:55:37Z

/approve

k8s-ci-robot · 2019-10-03T17:56:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~components/aws/OWNERS~~ [Jeffwan]
~~samples/contrib/aws-samples/OWNERS~~ [Jeffwan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Added Spot Instance Support

7297d49

k8s-ci-robot added the do-not-merge/work-in-progress label Sep 24, 2019

k8s-ci-robot requested review from Jeffwan and mameshini September 24, 2019 16:55

k8s-ci-robot added the size/L label Sep 24, 2019

k8s-ci-robot added the needs-ok-to-test label Sep 24, 2019

k8s-ci-robot assigned Jeffwan Sep 24, 2019

RedbackThomson marked this pull request as ready for review September 24, 2019 17:02

k8s-ci-robot removed the do-not-merge/work-in-progress label Sep 24, 2019

Fixed missing output configuration

987a3cc

Added spot instance support to example pipelines

a8209c1

Ark-kun reviewed Sep 24, 2019

k8s-ci-robot added the do-not-merge/hold label Sep 26, 2019

Updated image to new repository

4c70843

k8s-ci-robot removed the do-not-merge/hold label Sep 30, 2019

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Sep 30, 2019

Jeffwan reviewed Oct 1, 2019

k8s-ci-robot added the lgtm label Oct 1, 2019

Jeffwan approved these changes Oct 1, 2019

k8s-ci-robot added the approved label Oct 2, 2019

k8s-ci-robot removed the approved label Oct 2, 2019

k8s-ci-robot added the approved label Oct 3, 2019

k8s-ci-robot merged commit 12dde37 into kubeflow:master Oct 3, 2019
