Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add /-/healthy endpoint #1159

Merged
merged 1 commit into from
Jan 8, 2018
Merged

Conversation

simonpasquier
Copy link
Member

This partially addresses #991. Regarding the readiness probe, I'm not sure that we can reliably evaluate whether the mesh is ready or not. Thoughts?

@@ -76,6 +77,11 @@ func Register(r *route.Router, reloadCh chan<- struct{}, logger log.Logger) {
reloadCh <- struct{}{}
})

r.Get("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could /healthy check that we can:

  • read data from memory
  • read data from disk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Healthiness and readiness is often hard to distinguish, but I in my opinion you nailed it: anything regarding it's local operability defines it's healthiness. Readiness describes whether requests can be successfully served, meaning depending services can be reached, or in the mesh case, a mesh network was able to be established.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, my comment was for prometheus, not alertmanager, sorry for the confusion.

For alertmanager we could just make sure that nothing is horribly wrong with the nflog or silences ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both of those points are applicable for Alertmanager as well though :)

Copy link
Member Author

@simonpasquier simonpasquier Dec 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For alertmanager we could just make sure that nothing is horribly wrong with the nflog or silences ?

Looking at the code I don't see anything that would stop silences and nflog from working once they have been instantiated. We could check that the maintenance hasn't failed for the last X calls but it seems a bit overkill to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check that the maintenance hasn't failed for the last X calls but it seems a bit overkill to me.

I want real metrics to alert on something like that, that happens at runtime. If instantiation is successful and there is nothing standing in the way of setting up the HTTP server to serve requests I'd say it's healthy. Readiness is to express that operationally we assume requests will actually successfully served including, basically if all other dependencies are "ready".

@brancz
Copy link
Member

brancz commented Dec 21, 2017

I'm not sure that we can reliably evaluate whether the mesh is ready or not. Thoughts?

If mesh is enabled, then I would expect readiness to express that at least a successful connection to at least one peer has been established, if possible to inspect this, then even that an initial sync of data has been performed.

@simonpasquier
Copy link
Member Author

If mesh is enabled, then I would expect readiness to express that at least a successful connection to at least one peer has been established

hmm I imagine that it is ok for a mesh with at least 3 nodes where you tolerate one node to be down. But with 2 nodes cluster, it wouldn't work, right?
Maybe we can check the mesh status only when number of configured mesh peers > 2?

@stuartnelson3
Copy link
Contributor

post holiday poke to @brancz, if you have time to follow up on this

@brancz
Copy link
Member

brancz commented Jan 8, 2018

As the Alertmanager process exists if the nflog or silencelog from disk is not loadable, this lgtm for healthiness.

@brancz brancz merged commit 0b5af75 into prometheus:master Jan 8, 2018
@simonpasquier simonpasquier deleted the add-healthy-probes branch January 23, 2018 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants