feat(sdk): add runtime resource requests. Fixes #1956 #5447

Merged

Conversation

NikeNano
Member

@NikeNano NikeNano commented Apr 8, 2021

Signed-off-by: NikeNano niklas.sven.hansson@gmail.com

Description of your changes:

UPDATE: Linking doesn't seem to work; relates to #1956.

Checklist:

@google-cla google-cla bot added the cla: yes label Apr 8, 2021
@NikeNano NikeNano marked this pull request as draft April 8, 2021 18:33
@NikeNano NikeNano marked this pull request as ready for review April 8, 2021 18:50
@NikeNano NikeNano changed the title feat(sdk): add runtime resource requests. Fixes #1956 [WIP] feat(sdk): add runtime resource requests. Fixes #1956 Apr 8, 2021
@NikeNano NikeNano changed the title [WIP] feat(sdk): add runtime resource requests. Fixes #1956 feat(sdk): add runtime resource requests. Fixes #1956 Apr 8, 2021
@NikeNano
Member Author

NikeNano commented Apr 8, 2021

/assign @hongye-sun

@NikeNano
Member Author

Maybe you could take a look at this, @Ark-kun? Thanks.

@NikeNano
Member Author

@numerology could you please take a look? Thanks

@NikeNano
Member Author

NikeNano commented May 3, 2021

@Bobgy would you be able to look at this? It has been hard to get a review on this one :(

@NikeNano
Member Author

@capri-xiyue do you have the opportunity to look at this?

@capri-xiyue
Contributor

capri-xiyue commented May 12, 2021

> @capri-xiyue do you have the opportunity to look at this?

This PR looks good to me. But I don't have a lot of experience on the sdk side.
@chensun @Ark-kun Can you help review this PR?

@capri-xiyue
Contributor

/retest

@capri-xiyue capri-xiyue requested a review from chensun May 12, 2021 19:21
@NikeNano
Member Author

> > @capri-xiyue do you have the opportunity to look at this?
>
> This PR looks good to me. But I don't have a lot of experience on the sdk side.
> @chensun @Ark-kun Can you help review this PR?

Thanks @capri-xiyue :)

@Ark-kun
Contributor

Ark-kun commented May 13, 2021

Thank you for this PR. I've left a couple of comments.

@Bobgy Bobgy assigned Ark-kun and chensun and unassigned hongye-sun May 13, 2021
@NikeNano
Member Author

Pushed some changes, will double check the tests as well before it is ready for a new review.

@NikeNano
Member Author

NikeNano commented Jun 3, 2021

@chensun and @Ark-kun could you please take a look again?

@tiru1930

tiru1930 commented Jun 3, 2021

@NikeNano is this merged to master? Which SDK version can I use to benefit from this approach?

@NikeNano
Member Author

NikeNano commented Jun 3, 2021

> @NikeNano is this merged to master? Which SDK version can I use to benefit from this approach?

Not yet @tiru1930 but hopefully soon :)

Member

@chensun chensun left a comment

Thanks @NikeNano. Only a couple of nitpicks; otherwise LGTM.
The test case being added looks awesome.

.set_cpu_limit(resouce_task.outputs['cpu'])\
.set_cpu_request('200m')

# Disable cache for KFP v1 mode.
Member

nit: indentation for the comment is off.

@@ -288,6 +288,24 @@ def _op_to_template(op: BaseOp):
template['volumes'] = [convert_k8s_obj_to_json(volume) for volume in processed_op.volumes]
template['volumes'].sort(key=lambda x: x['name'])

# Runtime resource requests
if isinstance(op, dsl.ContainerOp) and ("resources" in op.container.keys()):
Member

nit: use consistent quotes.
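
For readers following the hunk above: the new branch only matters when a resource value is a pipeline parameter rather than a literal. The sketch below is a minimal, stdlib-only illustration of that distinction. It assumes resource values are stored as strings and that runtime values appear as {{inputs.parameters.<name>}} placeholders (the style quoted later in this thread). The helper name split_runtime_resources and the sample dict are hypothetical; this is not the compiler's actual code.

```python
import re

# Assumed placeholder shape, matching the style quoted in this thread.
_PLACEHOLDER = re.compile(r"^\{\{inputs\.parameters\.[^{}]+\}\}$")


def split_runtime_resources(resources):
    """Split a container 'resources' dict into static and runtime parts.

    `resources` is assumed to look like
    {'limits': {'cpu': '...'}, 'requests': {'cpu': '...'}}.
    """
    static, runtime = {}, {}
    for setting, values in resources.items():  # 'limits' or 'requests'
        for resource, value in values.items():  # e.g. 'cpu', 'memory'
            # Placeholder values are only known at run time; literals are static.
            target = runtime if _PLACEHOLDER.match(str(value)) else static
            target.setdefault(setting, {})[resource] = value
    return static, runtime


resources = {
    "limits": {
        "cpu": "{{inputs.parameters.generate-resouce-request-cpu}}",
        "memory": "{{inputs.parameters.generate-resouce-request-memory}}",
    },
    "requests": {"cpu": "200m"},
}
static, runtime = split_runtime_resources(resources)
print(static)
print(runtime)
```

Here the literal '200m' request ends up in the static part, while both limits land in the runtime part because their values are placeholders.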

@chensun
Member

chensun commented Jun 4, 2021

/retest

@chensun
Member

chensun commented Jun 4, 2021

@NikeNano: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| kubeflow-pipelines-samples-v2 | 2888390 | link | /test kubeflow-pipelines-samples-v2 |

The test added in this PR failed. The error was:

invalid spec: templates.runtime-resource-request-pipeline.tasks.training-op templates.training-op: failed to resolve {{inputs.parameters.generate-resouce-request-memory}}

@NikeNano
Member Author

NikeNano commented Jun 4, 2021

> invalid spec: templates.runtime-resource-request-pipeline.tasks.training-op templates.training-op: failed to resolve {{inputs.parameters.generate-resouce-request-memory}}

Where in the logs did you find this, @chensun? But it makes sense; I have not updated it for V2 and will do so as well.

@chensun
Member

chensun commented Jun 6, 2021

> Where in the logs did you find this, @chensun?

I'm not sure whether you have access to it. Right above the call stack at the end, there's this:

Run details page URL:
https://4e18c21c9d33d20f-dot-datalab-vm-staging.googleusercontent.com/#/runs/details/973f40b0-f9ec-4c9d-9bea-49e8db4f6f2e

Open that link and you will find a failed test among a list of ParallelFor tasks.

> But it makes sense; I have not updated it for V2 and will do so as well.

Note that the test runs in v2-compatible mode on v1. I haven't debugged this, but I think the change you need is probably in the v1 compiler code.

@NikeNano
Member Author

NikeNano commented Jun 7, 2021

> I'm not sure whether you have access to it. Right above the call stack at the end, there's this

Will run it locally for debugging.

@NikeNano
Member Author

NikeNano commented Jun 9, 2021

> invalid spec: templates.runtime-resource-request-pipeline.tasks.training-op templates.training-op: failed to resolve {{inputs.parameters.generate-resouce-request-memory}}

After debugging this, I found that the issue comes from

if kfp.COMPILING_FOR_V2:
which means that the following:

dependencies: [generate-resouce-request]

,

        arguments:
          parameters:
          - {name: generate-resouce-request-cpu, value: '{{tasks.generate-resouce-request.outputs.parameters.generate-resouce-request-cpu}}'}
          - {name: generate-resouce-request-memory, value: '{{tasks.generate-resouce-request.outputs.parameters.generate-resouce-request-memory}}'}

and

    inputs:
      parameters:
      - {name: generate-resouce-request-cpu}
      - {name: generate-resouce-request-memory}

are removed from the generated pipeline compared to not setting mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE, which gives the error mentioned above.
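
The failure mode described above can be reproduced in miniature. The sketch below is an assumption-laden stand-in for the workflow validation, not Argo's actual implementation: it scans a template for {{inputs.parameters.X}} references and reports any that the template's inputs section does not declare. The helper name unresolved_placeholders and the dict layout are hypothetical.

```python
import json
import re

# Matches references such as {{inputs.parameters.some-name}}.
_REF = re.compile(r"\{\{inputs\.parameters\.([^{}]+)\}\}")


def unresolved_placeholders(template):
    """Return referenced input parameters that the template never declares."""
    declared = {
        p["name"] for p in template.get("inputs", {}).get("parameters", [])
    }
    # Serialize everything except the declarations and search it for references.
    body = {k: v for k, v in template.items() if k != "inputs"}
    referenced = set(_REF.findall(json.dumps(body)))
    return sorted(referenced - declared)


container = {
    "resources": {
        "limits": {
            "cpu": "{{inputs.parameters.generate-resouce-request-cpu}}",
            "memory": "{{inputs.parameters.generate-resouce-request-memory}}",
        }
    }
}

# Template with its inputs declared, as when V2_COMPATIBLE mode is not set.
with_inputs = {
    "name": "training-op",
    "container": container,
    "inputs": {
        "parameters": [
            {"name": "generate-resouce-request-cpu"},
            {"name": "generate-resouce-request-memory"},
        ]
    },
}
# Same template with the inputs section dropped, as described above.
without_inputs = {"name": "training-op", "container": container}

print(unresolved_placeholders(with_inputs))
print(unresolved_placeholders(without_inputs))
```

With the inputs declared nothing is reported; with them dropped, both resource placeholders are left dangling, which is the shape of the "failed to resolve" error.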

I believe this is a design decision, so before I run off and change something, could you elaborate on what we are trying to accomplish and why we "# Override command and arguments if compiling to v2"? Thank you, @chensun.

@chensun
Member

chensun commented Jun 10, 2021

> I believe this is a design decision, so before I run off and change something, could you elaborate on what we are trying to accomplish and why we "# Override command and arguments if compiling to v2"? Thank you, @chensun.

In v2, we use a different set of placeholders in command and arguments. For instance:

"command": [
  "sh",
  "-c",
  "set -e -x\necho \"$0\" | gsutil cp - \"$1\"\n",
  "{{$.inputs.parameters['text']}}",
  "{{$.outputs.artifacts['output_gcs_path'].uri}}"
]

I think the issue is this line:

# limit this to v2 compiling only to avoid possible behavior change in v1.
task.inputs = list(input_params_set)

I'll think about how to fix this properly.
For now, to unblock your change, how about you create an issue assigned to me and comment out the V2_COMPATIBLE test case in your PR with a reference to the issue? Then let's merge this PR to complete the feature in V1 mode.
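
As a rough illustration of the two placeholder styles that appear in this thread, the sketch below pulls the parameter name out of a v1-style {{inputs.parameters.name}} reference and a v2-style {{$.inputs.parameters['name']}} reference. The regular expressions and the helper name classify_placeholder are assumptions made for this example, not the SDK's parser.

```python
import re

# v1 style, as in the error message earlier in this thread.
V1_PARAM = re.compile(r"\{\{inputs\.parameters\.([^{}]+)\}\}")
# v2 style, as in the command snippet above.
V2_PARAM = re.compile(r"\{\{\$\.inputs\.parameters\['([^']+)'\]\}\}")


def classify_placeholder(text):
    """Return (style, parameter_name), or (None, None) if nothing matches."""
    for style, pattern in (("v1", V1_PARAM), ("v2", V2_PARAM)):
        match = pattern.search(text)
        if match:
            return style, match.group(1)
    return None, None


print(classify_placeholder("{{inputs.parameters.generate-resouce-request-memory}}"))
print(classify_placeholder("{{$.inputs.parameters['text']}}"))
print(classify_placeholder("{{$.outputs.artifacts['output_gcs_path'].uri}}"))
```

The artifact placeholder in the last call is an input parameter in neither style, so it falls through to (None, None).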

@chensun
Member

chensun commented Jun 10, 2021

/lgtm
/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit 5db8431 into kubeflow:master Jun 10, 2021
jagadeeshi2i pushed a commit to chauhang/pipelines that referenced this pull request Jun 12, 2021
…ow#5447)

* added resource request at runtime

* fixed things

* Update to use read only parameter insteadt

* added test case and better example

* Updated again

* add the validation

* add to the test suit

* work in progress

* update after feedback

* fix the test

* clean up

* clean up

* fix the path

* add the test again

* clean up

* fix tests

* feedback fix

* comment out and clean up

8 participants