Add support for kube_scheduler SLI metrics #15731

Merged

Member

@jennchenn jennchenn commented Aug 31, 2023

What does this PR do?

This PR introduces two new scheduler metrics: kube_scheduler.slis.kubernetes_healthcheck and kube_scheduler.slis.kubernetes_healthchecks_total.

Motivation

Kubernetes v1.26 exposed a new /metrics/slis endpoint. This PR adds support for capturing the new metrics the scheduler exposes there: kubernetes_healthcheck and kubernetes_healthchecks_total.
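For illustration only, here is a minimal sketch of how lines scraped from /metrics/slis could map to the namespaced metric names and tags. The sample payload is an assumption reconstructed from the tag values asserted in this PR's tests, and `parse_sli_line` is a hypothetical helper, not how the check itself parses the endpoint.

```python
import re

NAMESPACE = 'kube_scheduler'

# Hypothetical sample payload; label values mirror the tags asserted in the tests.
SAMPLE = '''\
kubernetes_healthcheck{name="ping",type="healthz"} 1
kubernetes_healthchecks_total{name="ping",status="success",type="healthz"} 2450
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sli_line(line):
    """Return (metric_name, sorted_tags, value) for one labelled sample line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name, labels, value = match.groups()
    # Turn each label pair into a "key:value" tag.
    tags = sorted('{}:{}'.format(k, v) for k, v in LABEL_RE.findall(labels))
    return '{}.slis.{}'.format(NAMESPACE, name), tags, float(value)


# Skip blank lines and "# HELP" / "# TYPE" comment lines.
parsed = [parse_sli_line(line) for line in SAMPLE.splitlines() if line and not line.startswith('#')]
```

Under these assumptions, the two sample lines become kube_scheduler.slis.kubernetes_healthcheck and kube_scheduler.slis.kubernetes_healthchecks_total, the two metric names this PR introduces.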

Additional Notes

  • Open to suggestions if the current metric names are unclear!

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Changelog entries must be created for modifications to shipped code
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.

@ghost ghost added the documentation label Aug 31, 2023
@jennchenn jennchenn force-pushed the jenn/CONT-4201-add-support-for-kube-scheduler-sli-metrics branch from b5f71d1 to f43d0be Compare August 31, 2023 15:23
codecov bot commented Aug 31, 2023

Codecov Report

Merging #15731 (ca0570e) into master (d64642e) will increase coverage by 0.05%.
The diff coverage is 97.70%.

Flag Coverage Δ
activemq ?
cassandra ?
hive ?
hivemq ?
hudi ?
ignite ?
jboss_wildfly ?
kafka ?
kube_scheduler 97.50% <97.70%> (+0.04%) ⬆️
presto ?
solr ?
tomcat ?

Flags with carried forward coverage won't be shown.

github-actions bot commented Aug 31, 2023

Test Results

 4 files   4 suites   12s ⏱️
11 tests  11 ✔️  0 💤  0 ❌
24 runs   22 ✔️  2 💤  0 ❌

Results for commit ca0570e.

♻️ This comment has been updated with latest results.

@jennchenn jennchenn marked this pull request as ready for review August 31, 2023 15:43
@jennchenn jennchenn requested review from a team as code owners August 31, 2023 15:43
cswatt previously approved these changes Aug 31, 2023
Contributor

@cswatt cswatt left a comment

changelog approved by docs

Member

@sblumenthal sblumenthal left a comment

Looks good so far, just a couple of comments on my side

    except Exception as e:
        self.log.debug("Unable to collect query slis endpoint: %s", e)
        return False
    self._slis_available = r.status_code != 404 and r.status_code != 403
Member

If this function is supposed to be called, then I think we want to ignore a 404 but at the very least log an error for a 403. A 403 means that their agent or environment is not properly configured, and they should be made aware of that.
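A minimal sketch of that suggestion, assuming the availability decision is factored into a standalone function. The name `slis_available`, the logger name, and the log messages are all hypothetical, and the code that was eventually merged may differ.

```python
import logging

log = logging.getLogger('kube_scheduler_sketch')  # hypothetical logger name


def slis_available(status_code, logger=log):
    """Decide whether to scrape /metrics/slis based on a probe's HTTP status."""
    if status_code == 404:
        # The endpoint does not exist on this cluster: skip quietly.
        logger.debug("The /metrics/slis endpoint was not found; skipping SLI metrics")
        return False
    if status_code == 403:
        # The endpoint exists but access is denied: surface it as a configuration problem.
        logger.error("Access to /metrics/slis is forbidden; check the Agent's permissions")
        return False
    # Any other status keeps the endpoint enabled, as in the original condition.
    return True
```

The return values match the original `r.status_code != 404 and r.status_code != 403` condition; only the logging differs between the two failure cases.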

sblumenthal previously approved these changes Sep 7, 2023
Contributor

@yzhan289 yzhan289 left a comment

Just a few small comments!


def assert_metric(name, **kwargs):
    # Wrapper to keep assertions < 120 chars
    aggregator.assert_metric(NAMESPACE + name, **kwargs)
Contributor

Suggested change
-    aggregator.assert_metric(NAMESPACE + name, **kwargs)
+    aggregator.assert_metric(f"{NAMESPACE}.{name}", **kwargs)

Member Author

Since we run tests in Python 2, I couldn't use f-strings, so I used format instead.
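A rough sketch of what the format-based variant could look like. The helper name `full_metric_name` is hypothetical, and the wrapper in the merged test file may be written differently.

```python
NAMESPACE = 'kube_scheduler'


def full_metric_name(name):
    # str.format works on both Python 2.7 and Python 3, unlike f-strings (3.6+).
    return '{}.{}'.format(NAMESPACE, name)
```

Because the helper supplies the dot, callers would pass names without a leading dot, for example `full_metric_name('slis.kubernetes_healthcheck')`.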

Comment on lines 44 to 48
    assert_metric('.slis.kubernetes_healthcheck', value=1, tags=['name:ping', 'type:healthz'])
    assert_metric(
        '.slis.kubernetes_healthchecks_total', value=2450, tags=['name:ping', 'status:success', 'type:healthz']
    )
Contributor

Suggested change
-    assert_metric('.slis.kubernetes_healthcheck', value=1, tags=['name:ping', 'type:healthz'])
-    assert_metric(
-        '.slis.kubernetes_healthchecks_total', value=2450, tags=['name:ping', 'status:success', 'type:healthz']
-    )
+    assert_metric('slis.kubernetes_healthcheck', value=1, tags=['name:ping', 'type:healthz'])
+    assert_metric(
+        'slis.kubernetes_healthchecks_total', value=2450, tags=['name:ping', 'status:success', 'type:healthz']
+    )

Comment on lines 17 to 18
CHECK_NAME = 'kube_scheduler'
NAMESPACE = 'kube_scheduler'
Contributor

Nit: personally I think you can just remove NAMESPACE and use CHECK_NAME.

kube_scheduler/tests/test_sli_metrics.py (outdated)

@pytest.fixture()
def mock_metrics():
    f_name = os.path.join(os.path.dirname(__file__), 'fixtures', 'metrics_slis_1.27.3.txt')
Contributor

Instead of os.path.dirname(__file__), you can call get_here(). Example:

HERE = get_here()

github-actions bot commented Sep 7, 2023

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

@jennchenn jennchenn requested a review from yzhan289 September 7, 2023 20:32
yzhan289 previously approved these changes Sep 8, 2023
Contributor

@yzhan289 yzhan289 left a comment

LGTM for agent integ!

@jennchenn jennchenn force-pushed the jenn/CONT-4201-add-support-for-kube-scheduler-sli-metrics branch from 4b479a2 to 563d9ef Compare September 8, 2023 15:24