Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Serve pod usage information in /health handler #912

Merged
merged 7 commits into from
Aug 16, 2019

Conversation

betatim
Copy link
Member

@betatim betatim commented Aug 1, 2019

This adds a new endpoint /_usage that returns information about the number of build and user pods currently running and their sum.

It introduces a new config variable called max_pod_capacity that will be served to indicate the maximum capacity the cluster can handle. Currently this value isn't used to enforce an upper limit during launches, so it is more indicative.

The idea is to use information from this endpoint as part of the decision to send users to a cluster from the federation redirect. We could implement a strategy that sends users to the least full cluster (total_pods/max_pod_capacity) or considers clusters were total_pods > max_pod_capacity to be unavailable for more launches.

I am not super happy with the naming. Currently it is /_usage and the handler is UsageHandler but that somehow suggests that you get historical usage information as well. So maybe "load" is a better name, but then is that the load on the system or an endpoint where you can load something?

I used the _ prefix in _usage to signal that this is somehow an "internal" endpoint.

closes #874

WDYT?

@betatim
Copy link
Member Author

betatim commented Aug 7, 2019

@choldgraf do you have thoughts on naming and what information to return?

I am trying to think of a word to replace max_capacity with. Want something that signals that this isn't a hard cap more like a goal/desire/design capacity.

@nuest
Copy link
Contributor

nuest commented Aug 7, 2019

Since the docs say "Serve statistics about", how about/stats as the name for the endpoint?

I suggest quota for the maximum capacity. You may go over the quota, but that has costs, so it should be avoided.

@betatim betatim changed the title [WIP] Add a handler to serve usage information [MRG] Add a handler to serve usage information Aug 8, 2019
binderhub/metrics.py Outdated Show resolved Hide resolved
binderhub/metrics.py Outdated Show resolved Hide resolved
@minrk
Copy link
Member

minrk commented Aug 8, 2019

Does it make sense to include this info in /health rather than a new endpoint? Health is the logical place to put things we've thought about like failure rate, etc.

I'd also be happy with /stats

@betatim
Copy link
Member Author

betatim commented Aug 8, 2019

/health or new endpoint

I started a new endpoint because I wasn't sure how I'd integrate it with /health. In particular I was wondering if being over quota means the service is unhealthy or not. Right now the health endpoint only knows "healthy" and "sick".

I don't think that "over quota" should mean "sick", because at least right now it is a "soft quota".

I then pondered if we should have an intermediate state "so-so" for /health which means "please don't send traffic but if you have to I'll take it". So a bit like a traffic light system instead of the yes/no we have right now. However I couldn't think of a nice way to deal with that while keeping {"status": true} (or false) as simple JSON response. Best guess is we'd have to switch to words ("green", "orange", "red').

In the end I decided to punt on all that confusion and made a new endpoint so that we could prototype the actual code. Especially if someone can help resolve my confusion over how to reflect "kinda sick" in /health I'd integrate it there instead of making a new end point.

@betatim betatim changed the title [MRG] Add a handler to serve usage information [MRG] Serve pod usage information in /health handler Aug 10, 2019
@betatim
Copy link
Member Author

betatim commented Aug 10, 2019

I added the pod quota information to the health handler. Right now it ignores the quota information when determining the overall health of the hub. WDYT?

@consideRatio
Copy link
Member

consideRatio commented Aug 10, 2019

If the usage/load/stats info returned relates to health it makes sense to me to included it in the health endpoint. Or hmmm wait a second...

So there is health and readiness in k8s. The health saying "not ok" means the pod needs to restart. The readiness saying no means "dont restart but also dont pass more traffic to me" i think.

So, including it in a readiness endpoint may be suitable, but perhaps not in a health one?

Hmmmm, im not confident about what makes sense, but it is relevant to understand the potential confusion that could arise by mismatching with how health/readiness endpoints have been used in k8s on pods

@betatim
Copy link
Member Author

betatim commented Aug 10, 2019

Maybe this should be in /ready (or what ever the standard name is for the readyness endpoint)? Thanks for reminding us that there is already a concept for "ready" and "healthy" -> someone already solved this for us.

To be super explicit the /health endpoint is about the health of the whole "binderhub", not just this individual pod. For example it will show "unhealthy" when it can't reach the JupyterHub API or the docker registry. Restarting the binderhub pod might or might not fix this, so for now I am not sure if we should use this endpoint to determine the health of the pod.

The goal of the existing /health end point is to provide information for upstream services like the federation proxy or a status page, not to be a endpoint for pod health checks. If that is incompatible with being a good endpoint for pod health checks then we should introduce a second end point for pod health (and maybe rename things).

@consideRatio
Copy link
Member

I think this endpoint could be a good readiness endpoint for the main entrypoint to a binderhub in both kubernetes sense and a cluster federation sense.

a kubernetes service would be able to target multiple binderhub entrypoint for example, and delegate traffic but exclude those that are not in ready state.

Is it both about the health of the who binderhub and the readiness of the whole binderhub, or only one of these? Health example to me: critical pods in system not healthy or similar. Readiness example to me: hub is ready for more incomming traffic.

Hmmm... new vs old traffic may be of relevance to consider... Yikes gtg for now

@betatim betatim merged commit 501f767 into jupyterhub:master Aug 16, 2019
@betatim betatim deleted the load-endpoint branch August 16, 2019 08:15
@betatim
Copy link
Member Author

betatim commented Aug 16, 2019

merged this so we can get it deployed and start using it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Advertise hub's current usage and capacity
5 participants