chore: tuning resource usage for operator pod #1120

Merged

Conversation

@zdtsw (Member) commented Jul 15, 2024

  • reduce CPU and memory usage
  • request values come from testing in a large cluster (psi-04); see the sketch below
  • limit values come from the file attached to Jira RHOAIENG-9806
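As a rough illustration of where these values live, here is a minimal sketch of an operator Deployment resources stanza. The names, image reference, and all numbers are placeholders for illustration, not the exact values merged in this PR:

```yaml
# Minimal sketch only: shows where requests/limits are set for the operator pod.
# Names, image, and numbers are illustrative placeholders, not the merged values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opendatahub-operator-controller-manager   # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
        - name: manager
          image: quay.io/opendatahub/opendatahub-operator:latest   # assumed image ref
          resources:
            requests:          # reserved at scheduling time; drives node placement
              cpu: 25m         # placeholder
              memory: 128Mi    # placeholder
            limits:            # runtime ceiling: CPU throttling / memory OOM kill
              cpu: 500m        # placeholder
              memory: 1Gi      # placeholder
```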

Description

related to https://issues.redhat.com/browse/RHOAIENG-9806

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@openshift-ci openshift-ci bot requested review from asanzgom and VaishnaviHire July 15, 2024 09:23
@adelton (Contributor) commented Jul 18, 2024

Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?

> some data from testing in a large cluster psi-04

Wouldn't data from small clusters be more relevant for the actual value we aim for?

Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?

And vice versa -- do we need to lower the limit at all?

@zdtsw (Member, Author) commented Jul 18, 2024

> Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?
>
> > some data from testing in a large cluster psi-04
>
> Wouldn't data from small clusters be more relevant for the actual value we aim for?
>
> Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?
>
> And vice versa -- do we need to lower the limit at all?

I would take this step by step: first see whether these values make the "large" cluster work, then we can fine-tune further to find the lower boundary for a "small" cluster.

> Do we need for example CPU requests at all?

I am not sure I understand this question. Do you mean not setting requests.cpu at all? Then the operator pod would be among the first to get throttled or evicted; is that what we want?

Keeping a high "limit" (i.e. what we have now) would not do much harm, I'd say, but it impacts k8s node selection. Of course, if we are talking about SNO, I guess there is no need for that consideration; a lower or higher "limit" is the same there.
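For readers following the requests-vs-limits discussion, here is a rough sketch of the generic Kubernetes semantics being debated (not specific to this operator; the pod and image below are arbitrary examples):

```yaml
# Generic Kubernetes behavior, sketched for illustration:
# - requests feed scheduling (node placement) and the pod's QoS class; a pod
#   with no requests/limits at all is BestEffort and tends to be evicted first
#   under node pressure.
# - limits are enforced at runtime: CPU above the limit is throttled, memory
#   above the limit gets the container OOM-killed.
apiVersion: v1
kind: Pod
metadata:
  name: qos-example                     # hypothetical pod, illustration only
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9  # arbitrary placeholder image
      resources:
        requests:
          cpu: 25m                      # reserved on the chosen node
          memory: 128Mi
        limits:
          cpu: 500m                     # throttled above this
          memory: 1Gi                   # OOM-killed above this
# With requests set lower than limits, this pod's QoS class is Burstable;
# equal requests and limits (for all resources) would make it Guaranteed.
```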

```diff
 requests:
-  cpu: 500m
+  cpu: 25m
```
A Member commented on this change:

I think we should only update requests instead of limits.

@zdtsw (Member, Author) replied:

I am confused about this part. I've left some questions/comments in the Jira and Slack; it would be good to understand that data first.
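On the "only update requests" suggestion above: assuming the manager Deployment is maintained through kustomize (an assumption for this sketch, not confirmed in this thread), the requests-only change could be expressed as a strategic-merge patch that sets requests and leaves any existing limits untouched:

```yaml
# Hypothetical strategic-merge patch: only requests are specified, so limits
# already present on the Deployment remain unchanged after the merge.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opendatahub-operator-controller-manager   # assumed name
spec:
  template:
    spec:
      containers:
        - name: manager          # containers are merged by name
          resources:
            requests:
              cpu: 25m           # value from the diff above
              memory: 128Mi      # placeholder
```

Such a patch would be referenced from a kustomization.yaml (via patchesStrategicMerge or the newer patches field); whether that matches this repository's layout is an assumption here.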

@VaishnaviHire (Member) commented:

I agree with @adelton that we should use data from small clusters to set the defaults. The linked Jira issue has data from the PSAP team.

@zdtsw (Member, Author) commented Jul 19, 2024

To be honest, when I started this PR I did not know about this Jira ticket.
It mainly came from some tests we did for another case. Then I recalled we had an old issue about improving resource utilization, so I submitted this PR after we finalized those tests.

One thing on my mind after reading your comments:
for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH?
I would assume that data was collected from a downstream build. We can use it to set values downstream, but how do you feel about using the same values in ODH (other than for the sake of keeping the code in sync)?

@zdtsw zdtsw requested review from adelton and removed request for asanzgom July 19, 2024 07:35
@zdtsw zdtsw marked this pull request as draft July 19, 2024 07:36
@adelton (Contributor) commented Jul 19, 2024

> One thing on my mind after reading your comments: for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH? I would assume that data was collected from a downstream build. We can use it to set values downstream, but how do you feel about using the same values in ODH (other than for the sake of keeping the code in sync)?

For the benefit of the folks who might not have access to the internal information, it might be useful to get the data from an ODH installation and share it here or in some other public place, so that the reasons for the numerical changes are documented. I would assume the numbers from ODH and downstream don't differ much, so whether we can use and publish the numbers we got for downstream really depends on whether they are considered internal-only or not.

@adelton (Contributor) commented Jul 25, 2024

Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

zdtsw added 2 commits July 30, 2024 12:57

- reduce cpu and mem usage in requests from profiling data
  Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- update: data from 2.11 with all default components up
  Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@zdtsw zdtsw marked this pull request as ready for review August 1, 2024 16:03
@openshift-ci openshift-ci bot requested review from AjayJagan and grdryn August 1, 2024 16:04
@zdtsw zdtsw removed request for AjayJagan and grdryn August 1, 2024 16:04
@zdtsw (Member, Author) commented Aug 2, 2024

> Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

@adelton (Contributor) commented Aug 5, 2024

> I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

And specifically https://issues.redhat.com/browse/RHOAIENG-10889, it seems.

@adelton (Contributor) left a review comment:

LGTM.

@openshift-ci openshift-ci bot added the lgtm label Aug 5, 2024
openshift-ci bot commented Aug 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adelton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 5, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit d6ae248 into opendatahub-io:incubation Aug 5, 2024
8 checks passed
zdtsw referenced this pull request in zdtsw-forking/rhods-operator Aug 15, 2024
…1120)

* chore: tuning resource usage for operator pod

- reduce cpu and mem usage in requests from profiling data

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update: data from 2.11 with all default components up

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit d6ae248)
openshift-merge-bot bot referenced this pull request in red-hat-data-services/rhods-operator Aug 16, 2024
* chore: tuning resource usage for operator pod

- reduce cpu and mem usage in requests from profiling data

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update: data from 2.11 with all default components up

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit d6ae248)
Projects
Status: Done

4 participants