[MRG] Serve pod usage information in `/health` handler #912

betatim · 2019-08-01T20:21:15Z

This adds a new endpoint /_usage that returns information about the number of build and user pods currently running and their sum.

It introduces a new config variable called max_pod_capacity that will be served to indicate the maximum capacity the cluster can handle. Currently this value isn't used to enforce an upper limit during launches, so it is more indicative.

The idea is to use information from this endpoint as part of the decision to send users to a cluster from the federation redirect. We could implement a strategy that sends users to the least full cluster (total_pods/max_pod_capacity) or considers clusters were total_pods > max_pod_capacity to be unavailable for more launches.

I am not super happy with the naming. Currently it is /_usage and the handler is UsageHandler but that somehow suggests that you get historical usage information as well. So maybe "load" is a better name, but then is that the load on the system or an endpoint where you can load something?

I used the _ prefix in _usage to signal that this is somehow an "internal" endpoint.

closes #874

WDYT?

betatim · 2019-08-07T05:05:59Z

@choldgraf do you have thoughts on naming and what information to return?

I am trying to think of a word to replace max_capacity with. Want something that signals that this isn't a hard cap more like a goal/desire/design capacity.

nuest · 2019-08-07T12:35:06Z

Since the docs say "Serve statistics about", how about/stats as the name for the endpoint?

I suggest quota for the maximum capacity. You may go over the quota, but that has costs, so it should be avoided.

binderhub/metrics.py

minrk · 2019-08-08T09:34:36Z

Does it make sense to include this info in /health rather than a new endpoint? Health is the logical place to put things we've thought about like failure rate, etc.

I'd also be happy with /stats

betatim · 2019-08-08T11:03:30Z

/health or new endpoint

I started a new endpoint because I wasn't sure how I'd integrate it with /health. In particular I was wondering if being over quota means the service is unhealthy or not. Right now the health endpoint only knows "healthy" and "sick".

I don't think that "over quota" should mean "sick", because at least right now it is a "soft quota".

I then pondered if we should have an intermediate state "so-so" for /health which means "please don't send traffic but if you have to I'll take it". So a bit like a traffic light system instead of the yes/no we have right now. However I couldn't think of a nice way to deal with that while keeping {"status": true} (or false) as simple JSON response. Best guess is we'd have to switch to words ("green", "orange", "red').

In the end I decided to punt on all that confusion and made a new endpoint so that we could prototype the actual code. Especially if someone can help resolve my confusion over how to reflect "kinda sick" in /health I'd integrate it there instead of making a new end point.

betatim · 2019-08-10T07:02:24Z

I added the pod quota information to the health handler. Right now it ignores the quota information when determining the overall health of the hub. WDYT?

consideRatio · 2019-08-10T08:06:12Z

If the usage/load/stats info returned relates to health it makes sense to me to included it in the health endpoint. Or hmmm wait a second...

So there is health and readiness in k8s. The health saying "not ok" means the pod needs to restart. The readiness saying no means "dont restart but also dont pass more traffic to me" i think.

So, including it in a readiness endpoint may be suitable, but perhaps not in a health one?

Hmmmm, im not confident about what makes sense, but it is relevant to understand the potential confusion that could arise by mismatching with how health/readiness endpoints have been used in k8s on pods

betatim · 2019-08-10T08:43:19Z

Maybe this should be in /ready (or what ever the standard name is for the readyness endpoint)? Thanks for reminding us that there is already a concept for "ready" and "healthy" -> someone already solved this for us.

To be super explicit the /health endpoint is about the health of the whole "binderhub", not just this individual pod. For example it will show "unhealthy" when it can't reach the JupyterHub API or the docker registry. Restarting the binderhub pod might or might not fix this, so for now I am not sure if we should use this endpoint to determine the health of the pod.

The goal of the existing /health end point is to provide information for upstream services like the federation proxy or a status page, not to be a endpoint for pod health checks. If that is incompatible with being a good endpoint for pod health checks then we should introduce a second end point for pod health (and maybe rename things).

consideRatio · 2019-08-10T12:20:21Z

I think this endpoint could be a good readiness endpoint for the main entrypoint to a binderhub in both kubernetes sense and a cluster federation sense.

a kubernetes service would be able to target multiple binderhub entrypoint for example, and delegate traffic but exclude those that are not in ready state.

Is it both about the health of the who binderhub and the readiness of the whole binderhub, or only one of these? Health example to me: critical pods in system not healthy or similar. Readiness example to me: hub is ready for more incomming traffic.

Hmmm... new vs old traffic may be of relevance to consider... Yikes gtg for now

betatim · 2019-08-16T08:16:07Z

merged this so we can get it deployed and start using it.

jupyterhub/binderhub#912

Add a handler to serve usage information

3644945

betatim added 2 commits August 8, 2019 08:26

Rename usage handler to pod quota handler

8d0ab0e

Improve documentation for pod quota setting

a74e7e3

betatim force-pushed the load-endpoint branch from 73213a4 to a74e7e3 Compare August 8, 2019 06:32

betatim changed the title ~~[WIP] Add a handler to serve usage information~~ [MRG] Add a handler to serve usage information Aug 8, 2019

minrk reviewed Aug 8, 2019

View reviewed changes

binderhub/metrics.py Outdated Show resolved Hide resolved

minrk reviewed Aug 8, 2019

View reviewed changes

binderhub/metrics.py Outdated Show resolved Hide resolved

betatim added 2 commits August 8, 2019 13:44

Always innclude pod quota in response

0d25a67

Integrate pod quota information into /health enndpoint

08bad66

betatim changed the title ~~[MRG] Add a handler to serve usage information~~ [MRG] Serve pod usage information in /health handler Aug 10, 2019

Fix comparison of None and integer

fa4bd76

betatim force-pushed the load-endpoint branch from d0aac5d to fa4bd76 Compare August 16, 2019 06:38

Update unit tests for health handler

5054913

betatim merged commit 501f767 into jupyterhub:master Aug 16, 2019

betatim deleted the load-endpoint branch August 16, 2019 08:15

yuvipanda pushed a commit to jupyterhub/helm-chart that referenced this pull request Aug 16, 2019

[binderhub] Automatic update for commit 501f767

53da191

jupyterhub/binderhub#912

henchbot mentioned this pull request Aug 16, 2019

binderhub: 8f31091...501f767 jupyterhub/mybinder.org-deploy#1132

Merged

choldgraf added the enhancement label Oct 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Serve pod usage information in `/health` handler #912

[MRG] Serve pod usage information in `/health` handler #912

betatim commented Aug 1, 2019 •

edited

Loading

betatim commented Aug 7, 2019

nuest commented Aug 7, 2019

minrk commented Aug 8, 2019

betatim commented Aug 8, 2019

betatim commented Aug 10, 2019

consideRatio commented Aug 10, 2019 •

edited

Loading

betatim commented Aug 10, 2019

consideRatio commented Aug 10, 2019

betatim commented Aug 16, 2019

[MRG] Serve pod usage information in /health handler #912

[MRG] Serve pod usage information in /health handler #912

Conversation

betatim commented Aug 1, 2019 • edited Loading

betatim commented Aug 7, 2019

nuest commented Aug 7, 2019

minrk commented Aug 8, 2019

betatim commented Aug 8, 2019

betatim commented Aug 10, 2019

consideRatio commented Aug 10, 2019 • edited Loading

betatim commented Aug 10, 2019

consideRatio commented Aug 10, 2019

betatim commented Aug 16, 2019

[MRG] Serve pod usage information in `/health` handler #912

[MRG] Serve pod usage information in `/health` handler #912

betatim commented Aug 1, 2019 •

edited

Loading

consideRatio commented Aug 10, 2019 •

edited

Loading