silence|alerts: add metrics about current silences and alerts #998
4dbe09d to 40f1b18
cmd/alertmanager/main.go
```go
func newAlertMetricByState(marker types.Marker, st types.AlertState) prometheus.GaugeFunc {
	return prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "alertmanager_alerts_count",
```
The `_count` suffix is meant for summary and histogram time series.
Removed.
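For context on the naming point: histograms expose `_bucket`, `_sum` and `_count` series and summaries expose `_sum` and `_count`, so a plain gauge ending in `_count` can be mistaken for one of those. The sketch below is a stdlib-only illustration, not the client_golang API: `gaugeSpec` is a hypothetical stand-in for `prometheus.GaugeOpts` plus the `NewGaugeFunc` callback, and `newAlertGaugeByState` and `hasReservedSuffix` are hypothetical helpers showing one gauge per state under the suffix-free name.

```go
package main

import (
	"fmt"
	"strings"
)

// gaugeSpec is a hypothetical stand-in for prometheus.GaugeOpts plus the
// callback passed to prometheus.NewGaugeFunc; it is not the real API.
type gaugeSpec struct {
	Name        string
	ConstLabels map[string]string
	Value       func() float64
}

// hasReservedSuffix reports whether name ends in a suffix that histograms
// or summaries already use for their derived series.
func hasReservedSuffix(name string) bool {
	for _, s := range []string{"_count", "_sum", "_bucket"} {
		if strings.HasSuffix(name, s) {
			return true
		}
	}
	return false
}

// newAlertGaugeByState builds one gauge per alert state. count is a
// hypothetical callback returning the number of alerts in that state.
func newAlertGaugeByState(count func(state string) int, state string) gaugeSpec {
	return gaugeSpec{
		Name:        "alertmanager_alerts", // no _count suffix
		ConstLabels: map[string]string{"state": state},
		Value:       func() float64 { return float64(count(state)) },
	}
}

func main() {
	counts := map[string]int{"active": 6, "suppressed": 0}
	g := newAlertGaugeByState(func(s string) int { return counts[s] }, "active")
	fmt.Println(g.Name, g.ConstLabels["state"], g.Value(), hasReservedSuffix(g.Name))
}
```

The state lives in a label rather than in the metric name, so all states share one metric family.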
28a3adc to 1dcd1fe
`_count` suffix removed, tests fixed.
A few comments, but it largely looks OK.
types/types.go
```go
func (m *memMarker) Count(states ...AlertState) int {
	count := 0

	m.mtx.Lock()
```
This just needs a read lock.
done
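A minimal sketch of the read-lock version, using only the standard library. `marker`, `alertState` and `SetState` are hypothetical simplifications of `memMarker` and `types.AlertState`, not the Alertmanager code. Writers take the exclusive lock, while `Count` only reads the map, so it takes `RLock` and releases it with the matching `RUnlock`.

```go
package main

import (
	"fmt"
	"sync"
)

// alertState is a hypothetical stand-in for types.AlertState.
type alertState string

// marker is a simplified, hypothetical version of memMarker: it maps alert
// fingerprints to their current state.
type marker struct {
	mtx    sync.RWMutex
	states map[uint64]alertState
}

// SetState mutates the map, so it needs the exclusive write lock.
func (m *marker) SetState(fp uint64, s alertState) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.states[fp] = s
}

// Count only reads the map, so a read lock is enough and concurrent Count
// calls (e.g. from overlapping scrapes) do not block each other.
// RLock must be paired with RUnlock, not Unlock.
func (m *marker) Count(states ...alertState) int {
	m.mtx.RLock()
	defer m.mtx.RUnlock()

	count := 0
	for _, s := range m.states {
		for _, want := range states {
			if s == want {
				count++
				break
			}
		}
	}
	return count
}

func main() {
	m := &marker{states: map[uint64]alertState{}}
	m.SetState(1, "active")
	m.SetState(2, "active")
	m.SetState(3, "suppressed")
	fmt.Println(m.Count("active"), m.Count("suppressed"), m.Count("active", "suppressed"))
}
```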
silence/silence.go
```go
		ConstLabels: prometheus.Labels{"state": string(st)},
	},
	func() float64 {
		count, _ := s.CountState(st)
```
We should probably at least log this error.
done
@@ -584,6 +610,16 @@ func (s *Silences) Query(params ...QueryParam) ([]*pb.Silence, error) {

```go
	return sils, err
}

// Count silences by state.
func (s *Silences) CountState(states ...SilenceState) (int, error) {
	// This could probably be optimized.
```
I'm hoping that one day we'll be able to index the in-memory silence and alert databases, so we can search, count, and query more efficiently.
1dcd1fe to b2f7f8e
Should be good now.
It would probably be good if @fabxc and/or @stuartnelson3 also reviewed this; I'm unsure about the performance implications of the locking for counting. It's probably fine, but I'd like a second opinion.
types/types.go
```go
	count := 0

	m.mtx.RLock()
	defer m.mtx.Unlock()
```
This should be `RUnlock` when using `RLock`.
True, sorry about that. Fixed.
This adds metrics that look like this:

```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```

This can be used to monitor Alertmanager's usage and validate that Alertmanagers in a mesh have a similar number of silences and alerts.
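As an illustration of the mesh check, a sketch that takes one of these per-state values as scraped from each peer and reports the spread between the highest and lowest value; a large spread suggests the peers have not converged. `spread` and the instance names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// spread returns max-min of a per-instance metric value, e.g.
// alertmanager_silences{state="active"} scraped from each mesh peer.
func spread(byInstance map[string]float64) float64 {
	if len(byInstance) == 0 {
		return 0
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, v := range byInstance {
		lo = math.Min(lo, v)
		hi = math.Max(hi, v)
	}
	return hi - lo
}

func main() {
	// Hypothetical instance names and values.
	active := map[string]float64{"am-0:9093": 1, "am-1:9093": 1, "am-2:9093": 0}
	fmt.Println(spread(active))
}
```

In Prometheus itself, the same check can be expressed as `max(alertmanager_silences{state="active"}) - min(alertmanager_silences{state="active"})`.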
b2f7f8e to bff889b
This should be no worse than accessing the API or the UI.
Fair enough. Thanks for the contributions!
* vendor: Update prometheus/procfs

  Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* mountstats: Use new NFS protocol field

  In prometheus/procfs#100, the NFSTransportStats struct was extended with a field called protocol that specifies the NFS protocol in use, either "tcp" or "udp". This commit adds the protocol as a label to all NFS metrics exported via the mountstats collector.

  Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* Update fixtures for UDP mount

  Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>