Add /-/healthy endpoint #1159
Conversation
@@ -76,6 +77,11 @@ func Register(r *route.Router, reloadCh chan<- struct{}, logger log.Logger) {
		reloadCh <- struct{}{}
	})

	r.Get("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
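For context, the added handler boils down to the following standalone sketch. It uses the standard library mux instead of Alertmanager's `route.Router` so it can run on its own; the port is assumed to be Alertmanager's default 9093.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Liveness only: answer 200 OK whenever the process is up and able
	// to serve HTTP, without inspecting any internal state.
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// 9093 is Alertmanager's default listen port; any port works here.
	log.Fatal(http.ListenAndServe(":9093", mux))
}
```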
could /healthy check that we can:
- read data from memory
- read data from disk
Healthiness and readiness are often hard to distinguish, but in my opinion you nailed it: anything regarding its local operability defines its healthiness. Readiness describes whether requests can be served successfully, meaning dependent services can be reached, or, in the mesh case, that a mesh network could be established.
Wow, my comment was for Prometheus, not Alertmanager, sorry for the confusion.
For Alertmanager we could just make sure that nothing is horribly wrong with the nflog or silences?
Both of those points are applicable to Alertmanager as well though :)
> For Alertmanager we could just make sure that nothing is horribly wrong with the nflog or silences?

Looking at the code, I don't see anything that would stop silences and nflog from working once they have been instantiated. We could check that maintenance hasn't failed for the last X calls, but that seems a bit overkill to me.
> We could check that maintenance hasn't failed for the last X calls, but that seems a bit overkill to me.

For something like that, which happens at runtime, I want real metrics to alert on. If instantiation is successful and there is nothing standing in the way of setting up the HTTP server to serve requests, I'd say it's healthy. Readiness expresses that operationally we expect requests to actually be served successfully, basically that all other dependencies are "ready".
If mesh is enabled, then I would expect readiness to express that a successful connection to at least one peer has been established and, if it's possible to inspect this, even that an initial sync of data has been performed.
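A rough sketch of what such a readiness gate could look like, not part of this PR: `establishedPeers` and the `/-/ready` path are assumptions for illustration, and the real mesh API may expose peer state differently.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// establishedPeers is a hypothetical gauge that the mesh layer would
// update as peer connections are established and dropped.
var establishedPeers int64

func readyHandler(w http.ResponseWriter, _ *http.Request) {
	// Readiness as discussed above: refuse to report ready until at
	// least one mesh peer connection has been established.
	if atomic.LoadInt64(&establishedPeers) == 0 {
		http.Error(w, "no mesh peers established", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/-/ready", readyHandler)
	log.Fatal(http.ListenAndServe(":9093", nil))
}
```

The initial-sync condition could be folded in the same way, by only flipping the gauge (or a separate flag) once the first sync has completed.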
Hmm, I imagine that is OK for a mesh with at least 3 nodes, where you can tolerate one node being down. But with a 2-node cluster it wouldn't work, right?
Post-holiday poke to @brancz, if you have time to follow up on this.
As the Alertmanager process exits if the nflog or silence log from disk is not loadable, this LGTM for healthiness.
This partially addresses #991. Regarding the readiness probe, I'm not sure that we can reliably evaluate whether the mesh is ready or not. Thoughts?