chore: tuning resource usage for operator pod #1120

Merged

Conversation

@zdtsw (Member) commented Jul 15, 2024

  • reduce CPU and memory usage
  • request values come from testing in a large cluster (psi-04); see the sketch below
  • limit values come from the file attached to Jira RHOAIENG-9806
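As a rough illustration of where these values live, here is a minimal sketch of an operator Deployment resources stanza. The names, image reference, and all numbers are placeholders for illustration, not the exact values merged in this PR:

```yaml
# Minimal sketch only: shows where requests/limits are set for the operator pod.
# Names, image, and numbers are illustrative placeholders, not the merged values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opendatahub-operator-controller-manager   # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
        - name: manager
          image: quay.io/opendatahub/opendatahub-operator:latest   # assumed image ref
          resources:
            requests:          # reserved at scheduling time; drives node placement
              cpu: 25m         # placeholder
              memory: 128Mi    # placeholder
            limits:            # runtime ceiling: CPU throttling / memory OOM kill
              cpu: 500m        # placeholder
              memory: 1Gi      # placeholder
```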

Description

related to https://issues.redhat.com/browse/RHOAIENG-9806

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@openshift-ci openshift-ci bot requested review from asanzgom and VaishnaviHire July 15, 2024 09:23
@adelton (Contributor) commented Jul 18, 2024

Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?

> some data from testing in a large cluster psi-04

Wouldn't data from small clusters be more relevant for the actual value we aim for?

Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?

And vice versa -- do we need to lower the limit at all?

@zdtsw (Member, Author) commented Jul 18, 2024

> Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?
>
> > some data from testing in a large cluster psi-04
>
> Wouldn't data from small clusters be more relevant for the actual value we aim for?
>
> Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?
>
> And vice versa -- do we need to lower the limit at all?

I would take this step by step: first see whether these values make the "large" cluster work, then we can fine-tune further to find the lower boundary for a "small" cluster.

> Do we need for example CPU requests at all?

I am not sure I understand this question. Do you mean not setting requests.cpu at all? Then the operator pod would be among the first to get throttled or evicted; is that what we want?

Keeping a high "limit" (i.e. what we have now) would not do much harm, I'd say, but it impacts k8s node selection. Of course, if we are talking about SNO, I guess there is no need for that consideration; a lower or higher "limit" is the same there.
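For readers following the requests-vs-limits discussion, here is a rough sketch of the generic Kubernetes semantics being debated (not specific to this operator; the pod and image below are arbitrary examples):

```yaml
# Generic Kubernetes behavior, sketched for illustration:
# - requests feed scheduling (node placement) and the pod's QoS class; a pod
#   with no requests/limits at all is BestEffort and tends to be evicted first
#   under node pressure.
# - limits are enforced at runtime: CPU above the limit is throttled, memory
#   above the limit gets the container OOM-killed.
apiVersion: v1
kind: Pod
metadata:
  name: qos-example                     # hypothetical pod, illustration only
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9  # arbitrary placeholder image
      resources:
        requests:
          cpu: 25m                      # reserved on the chosen node
          memory: 128Mi
        limits:
          cpu: 500m                     # throttled above this
          memory: 1Gi                   # OOM-killed above this
# With requests set lower than limits, this pod's QoS class is Burstable;
# equal requests and limits (for all resources) would make it Guaranteed.
```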

```diff
 requests:
-  cpu: 500m
+  cpu: 25m
```
A Member commented on this change:

I think we should only update requests instead of limits.

@zdtsw (Member, Author) replied:

I am confused about this part. I've left some questions/comments in the Jira and Slack; it would be good to understand that data first.
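On the "only update requests" suggestion above: assuming the manager Deployment is maintained through kustomize (an assumption for this sketch, not confirmed in this thread), the requests-only change could be expressed as a strategic-merge patch that sets requests and leaves any existing limits untouched:

```yaml
# Hypothetical strategic-merge patch: only requests are specified, so limits
# already present on the Deployment remain unchanged after the merge.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opendatahub-operator-controller-manager   # assumed name
spec:
  template:
    spec:
      containers:
        - name: manager          # containers are merged by name
          resources:
            requests:
              cpu: 25m           # value from the diff above
              memory: 128Mi      # placeholder
```

Such a patch would be referenced from a kustomization.yaml (via patchesStrategicMerge or the newer patches field); whether that matches this repository's layout is an assumption here.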

@VaishnaviHire (Member) commented:

I agree with @adelton that we should use data from small clusters to set the defaults. The linked Jira issue has data from the PSAP team.

@zdtsw (Member, Author) commented Jul 19, 2024

To be honest, when I started this PR I did not know about this Jira ticket.
It mainly came from some tests we did for another case. Then I recalled we had an old issue about improving resource utilization, so I submitted this PR after we finalized those tests.

One thing on my mind after reading your comments:
for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH?
I would assume that data was collected from a downstream build. We can use it to set values downstream, but how do you feel about using the same values in ODH (other than for the sake of keeping the code in sync)?

@zdtsw zdtsw requested review from adelton and removed request for asanzgom July 19, 2024 07:35
@zdtsw zdtsw marked this pull request as draft July 19, 2024 07:36
@adelton (Contributor) commented Jul 19, 2024

> One thing on my mind after reading your comments: for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH? I would assume that data was collected from a downstream build. We can use it to set values downstream, but how do you feel about using the same values in ODH (other than for the sake of keeping the code in sync)?

For the benefit of the folks who might not have access to the internal information, it might be useful to get the data from an ODH installation and share it here or in some other public place, so that the reasons for the numerical changes are documented. I would assume the numbers from ODH and downstream don't differ much, so whether we can use and publish the numbers we got for downstream really depends on whether they are considered internal-only or not.

@adelton (Contributor) commented Jul 25, 2024

Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

zdtsw added 2 commits July 30, 2024 12:57

- reduce cpu and mem usage in requests from profiling data
  Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- update: data from 2.11 with all default components up
  Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@zdtsw zdtsw marked this pull request as ready for review August 1, 2024 16:03
@openshift-ci openshift-ci bot requested review from AjayJagan and grdryn August 1, 2024 16:04
@zdtsw zdtsw removed request for AjayJagan and grdryn August 1, 2024 16:04
@zdtsw (Member, Author) commented Aug 2, 2024

> Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

@adelton (Contributor) commented Aug 5, 2024

> I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

And specifically https://issues.redhat.com/browse/RHOAIENG-10889, it seems.

@adelton (Contributor) left a review comment:

LGTM.

@openshift-ci openshift-ci bot added the lgtm label Aug 5, 2024
openshift-ci bot commented Aug 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adelton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 5, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit d6ae248 into opendatahub-io:incubation Aug 5, 2024
8 checks passed
zdtsw referenced this pull request in zdtsw-forking/rhods-operator Aug 15, 2024
…1120)

* chore: tuning resource usage for operator pod

- reduce cpu and mem usage in requests from profiling data

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update: data from 2.11 with all default components up

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit d6ae248)
openshift-merge-bot bot referenced this pull request in red-hat-data-services/rhods-operator Aug 16, 2024
* chore: tuning resource usage for operator pod

- reduce cpu and mem usage in requests from profiling data

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update: data from 2.11 with all default components up

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit d6ae248)
Projects
Status: Done

4 participants