Add /-/healthy endpoint #1159

simonpasquier · 2017-12-21T11:33:31Z

This partially addresses #991. Regarding the readiness probe, I'm not sure that we can reliably evaluate whether the mesh is ready or not. Thoughts?

iksaif · 2017-12-21T11:44:22Z

ui/web.go

@@ -76,6 +77,11 @@ func Register(r *route.Router, reloadCh chan<- struct{}, logger log.Logger) {
 		reloadCh <- struct{}{}
 	})

+	r.Get("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
+		w.WriteHeader(http.StatusOK)


Could /healthy check that we can:

- read data from memory
- read data from disk

Healthiness and readiness are often hard to distinguish, but in my opinion you nailed it: anything regarding its local operability defines its healthiness. Readiness describes whether requests can be successfully served, meaning that the services it depends on can be reached or, in the mesh case, that a mesh network could be established.

Wow, my comment was for Prometheus, not Alertmanager. Sorry for the confusion.

For Alertmanager, we could just make sure that nothing is horribly wrong with the nflog or silences?

both of those points are applicable for Alertmanager as well though :)

> For Alertmanager, we could just make sure that nothing is horribly wrong with the nflog or silences?

Looking at the code I don't see anything that would stop silences and nflog from working once they have been instantiated. We could check that the maintenance hasn't failed for the last X calls but it seems a bit overkill to me.
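For illustration only, the "hasn't failed for the last X calls" idea could be sketched roughly as below. This is not code from Alertmanager: `failureTracker`, `record`, `healthy`, and `errMaintenance` are hypothetical names, and the threshold plays the role of "X".

```go
package main

import (
	"errors"
	"sync"
)

// errMaintenance is a hypothetical sentinel standing in for whatever error a
// maintenance run might return.
var errMaintenance = errors.New("maintenance failed")

// failureTracker counts consecutive failed maintenance runs (hypothetical type).
type failureTracker struct {
	mu          sync.Mutex
	consecutive int // failed runs since the last success
	threshold   int // the "X" from the discussion
}

// record notes the outcome of one maintenance run; a success resets the count.
func (t *failureTracker) record(err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		t.consecutive++
		return
	}
	t.consecutive = 0
}

// healthy reports whether fewer than threshold runs have failed in a row.
func (t *failureTracker) healthy() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.consecutive < t.threshold
}
```

A health handler could consult `healthy()` before writing its status code, although the thread leans towards exposing this kind of runtime failure as a metric instead.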

> We could check that the maintenance hasn't failed for the last X calls but it seems a bit overkill to me.

I want real metrics to alert on something like that, since it happens at runtime. If instantiation is successful and nothing stands in the way of setting up the HTTP server to serve requests, I'd say it's healthy. Readiness expresses that, operationally, we assume requests will actually be served successfully, basically that all other dependencies are "ready".

brancz · 2017-12-21T12:12:40Z

> I'm not sure that we can reliably evaluate whether the mesh is ready or not. Thoughts?

If mesh is enabled, I would expect readiness to express that a successful connection to at least one peer has been established and, if it is possible to inspect this, even that an initial sync of data has been performed.

simonpasquier · 2017-12-21T13:47:07Z

> If mesh is enabled, then I would expect readiness to express that at least a successful connection to at least one peer has been established

Hmm, I imagine that is OK for a mesh with at least 3 nodes where you tolerate one node being down. But with a 2-node cluster, it wouldn't work, right?
Maybe we can check the mesh status only when the number of configured mesh peers is > 2?
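To make the two proposals concrete, here is a minimal sketch that combines them: readiness requires at least one connected peer, but the check only applies when more than 2 mesh peers are configured. `meshReady` and its parameters are hypothetical, and nothing like this is part of the PR.

```go
package main

// meshReady is a hypothetical predicate combining the suggestions above.
// It treats the instance as ready when:
//   - the mesh is disabled, or
//   - 2 or fewer peers are configured (the mesh check is skipped), or
//   - at least one peer connection has been established.
func meshReady(meshEnabled bool, configuredPeers, connectedPeers int) bool {
	if !meshEnabled {
		return true
	}
	if configuredPeers <= 2 {
		// Small clusters: skip the mesh check, as floated in the comment above.
		return true
	}
	return connectedPeers >= 1
}
```

Whether "configured peers" counts the local node is left open here, as it is in the discussion.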

stuartnelson3 · 2018-01-08T10:09:32Z

post holiday poke to @brancz, if you have time to follow up on this

brancz · 2018-01-08T10:24:15Z

As the Alertmanager process exits if the nflog or silencelog cannot be loaded from disk, this lgtm for healthiness.

Add /-/healthy endpoint

e8661f5

simonpasquier mentioned this pull request Dec 21, 2017

Provide a HTTP Health / Readiness Check #991

Closed

iksaif reviewed Dec 21, 2017

brancz merged commit 0b5af75 into prometheus:master Jan 8, 2018

simonpasquier deleted the add-healthy-probes branch January 23, 2018 17:19
