Add configurable QPS and burst settings for kube API client #2411

Open
wants to merge 11 commits into
base: release-1.9

Conversation

ronk21runai

What this PR does / why we need it:
Currently, the Kubernetes API client uses the controller-runtime defaults for QPS (20) and Burst (30), and users cannot adjust them. This PR lets users tune these values, which can improve the controller's performance.
It introduces new flags to configure QPS and Burst for the Kubernetes API client, giving better control over API rate limits.

Proposed Changes:
This PR introduces two new argument flags:

flag.IntVar(&clientQps, "kube-api-qps", 20, "QPS indicates the maximum QPS to the master from this client.")
flag.IntVar(&clientBurst, "kube-api-burst", 30, "Maximum burst for throttle.")

These flags allow users to set the API rate limits at startup instead of relying on the default values.
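As a rough illustration of how the two flags could flow into the client configuration, here is a minimal standard-library sketch. The `clientConfig` struct is a hypothetical stand-in for the `QPS` and `Burst` fields of client-go's `rest.Config`, and the `parseRateLimitFlags` helper and the FlagSet name are assumptions made for this example, not code from the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// clientConfig is a hypothetical stand-in for the two rate-limit fields
// of client-go's rest.Config (QPS float32, Burst int).
type clientConfig struct {
	QPS   float32
	Burst int
}

// parseRateLimitFlags registers the two flags from this PR on a fresh
// FlagSet (the name "training-operator" is illustrative) and copies the
// parsed values into a clientConfig.
func parseRateLimitFlags(args []string) (clientConfig, error) {
	var clientQps, clientBurst int
	fs := flag.NewFlagSet("training-operator", flag.ContinueOnError)
	fs.IntVar(&clientQps, "kube-api-qps", 20, "QPS indicates the maximum QPS to the master from this client.")
	fs.IntVar(&clientBurst, "kube-api-burst", 30, "Maximum burst for throttle.")
	if err := fs.Parse(args); err != nil {
		return clientConfig{}, err
	}
	// Mirrors the assignments reviewed in this PR: the QPS flag is an int
	// and is converted to float32 for the config.
	return clientConfig{QPS: float32(clientQps), Burst: clientBurst}, nil
}

func main() {
	cfg, err := parseRateLimitFlags([]string{"--kube-api-qps=50", "--kube-api-burst=100"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("QPS=%v Burst=%d\n", cfg.QPS, cfg.Burst)
}
```

Calling `parseRateLimitFlags(nil)` in this sketch falls back to the defaults of 20 and 30.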

Checklist:

  • Docs included if any changes are user facing

andreyvelich and others added 10 commits January 7, 2025 02:18
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Changelog for Training Operator v1.9.0-rc.0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Group PR for new features

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…w#2379)

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Add MNIST example with SPMD for JAX

Illustrate how to use JAX's `pmap` to express and execute
single-program multiple-data (SPMD) programs for data parallelism
along a batch dimension

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>

* Update CONTRIBUTING.md

Use --server-side to install the latest local changes of Training
Operator control plane

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>

* Add JAXJob output

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>

* Update JAXJob CI images

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>

* Adjust jaxjob spmd example batch size

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>

* Add JAX Example Docker Image Build in CI

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Fix script name typo

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Update script permissions

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add KIND_CLUSTER env var

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Increase timeouts

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Test higher resources

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Increase Timeout

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* remove resource reqs

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* test low batch size

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* test small batch size

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Hardcode number of batches

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

---------

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>
Co-authored-by: Sandipan Panda <samparksandipan@gmail.com>
…lizers (kubeflow#2323)

* KEP-2170: Add unit and integration tests for model and dataset initializers

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>

* refactor tests

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>

---------

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.30.0 to 0.33.0.
- [Commits](golang/net@v0.30.0...v0.33.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: ChristianZaccaria <christian.zaccaria.cz@gmail.com>
* KEP-2170: Deploy JobSet in kubeflow-system namespace

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove namespace from base

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove label from namespace

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Create third-party dir for JobSet

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Bump JobSet to v0.7.3

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Drop namespace from JobSet config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oferla oferla force-pushed the support-configure-qps-and-burst branch from 25eb8f6 to 5bf1d1e Compare February 4, 2025 11:19
Introduce new flags to configure `QPS` and `Burst` for the Kubernetes API client, enabling better control over API rate limits.

Signed-off-by: R.K <ron.kahn@run.ai>
@ronk21runai ronk21runai force-pushed the support-configure-qps-and-burst branch from 8a8ce28 to a5c93da Compare February 4, 2025 12:16
@coveralls

Pull Request Test Coverage Report for Build 13135248279

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 13092942826: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

@tenzen-y tenzen-y changed the base branch from master to release-1.9 February 5, 2025 14:21
@google-oss-prow google-oss-prow bot added size/XXL and removed size/S labels Feb 5, 2025
@tenzen-y
Member

tenzen-y commented Feb 5, 2025

@ronk21runai Could you rebase this PR on top of the release-1.9 branch? We have already removed the v1 code from the master branch.

Member

@tenzen-y tenzen-y left a comment

Mostly lgtm
Thank you

@ronk21runai Once you rebase and address my comment, we can include this in release-1.9.

Comment on lines +138 to +139
cfg.QPS = float32(clientQps)
cfg.Burst = clientBurst
Member

Suggested change
cfg.QPS = float32(clientQps)
cfg.Burst = clientBurst
cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(float32(clientQps), clientBurst)

Due to the controller-runtime specification, IIUC, we need to specify those parameters through the RateLimiter.
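For readers unfamiliar with how QPS and burst interact in a token-bucket limiter, the following is a simplified, single-goroutine sketch in plain Go. It is an illustration only and not client-go's implementation; the `tokenBucket` type, its fields, and the use of seconds as a plain `float64` clock are all assumptions made for the example.

```go
package main

import "fmt"

// tokenBucket is a simplified illustration of token-bucket throttling.
// It is NOT client-go's flowcontrol implementation and is not safe for
// concurrent use.
type tokenBucket struct {
	qps    float64 // tokens added per second
	burst  float64 // bucket capacity
	tokens float64 // tokens currently available
	last   float64 // time of the last refill, in seconds
}

// newTokenBucket starts with a full bucket at time nowSec.
func newTokenBucket(qps float64, burst int, nowSec float64) *tokenBucket {
	return &tokenBucket{qps: qps, burst: float64(burst), tokens: float64(burst), last: nowSec}
}

// tryAccept refills the bucket for the time elapsed since the last call,
// caps it at the burst size, and reports whether one request may proceed.
func (b *tokenBucket) tryAccept(nowSec float64) bool {
	b.tokens += (nowSec - b.last) * b.qps
	b.last = nowSec
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	// Defaults discussed in this PR: QPS 20, burst 30.
	b := newTokenBucket(20, 30, 0)
	accepted := 0
	for i := 0; i < 40; i++ {
		if b.tryAccept(0) {
			accepted++
		}
	}
	fmt.Println("accepted at t=0:", accepted)
	fmt.Println("accepted after 1s:", b.tryAccept(1))
}
```

In this sketch, 40 back-to-back requests at the same instant let only the burst of 30 through; one second later the bucket has refilled by 20 tokens and requests are accepted again.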

@andreyvelich
Member

Thank you for this great contribution!
I am wondering, should we support it in the TrainJob controller as well?

@tenzen-y
Member

tenzen-y commented Feb 5, 2025

Thank you for this great contribution! I am wondering, should we support it in the TrainJob controller as well?

Ideally, we want to support those, along with manager-specific parameters, in the Config API for v2.

@andreyvelich
Member

I agree, I will create an issue to add Config API support into Kubeflow Trainer V2.

8 participants