silence|alerts: add metrics about current silences and alerts #998

Merged
merged 1 commit into from
Oct 5, 2017

Conversation

Contributor

@iksaif iksaif commented Sep 15, 2017

This adds metrics that look like this:

alertmanager_alerts_count{state="active"} 6
alertmanager_alerts_count{state="suppressed"} 0
alertmanager_silences_count{state="active"} 1
alertmanager_silences_count{state="expired"} 1
alertmanager_silences_count{state="pending"} 0

This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.

@iksaif iksaif force-pushed the silence-alert-counters branch from 4dbe09d to 40f1b18 Compare September 15, 2017 13:36
func newAlertMetricByState(marker types.Marker, st types.AlertState) prometheus.GaugeFunc {
	return prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "alertmanager_alerts_count",
Contributor

The _count suffix is for summaries/histograms time series

Contributor Author

Removed.
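
For context on the naming comment above: summaries and histograms generate their own `_count`, `_sum`, and `_bucket` series, so a plain gauge that ends in one of those suffixes can be mistaken for part of one. A minimal sketch of a check for this, using only the standard library; `hasReservedSuffix` is a hypothetical helper written for illustration, not something from the Prometheus client or from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// reservedSuffixes lists suffixes that summaries and histograms
// generate for their own series.
var reservedSuffixes = []string{"_count", "_sum", "_bucket"}

// hasReservedSuffix is a hypothetical lint-style helper: it reports
// whether a metric name ends in one of the reserved suffixes.
func hasReservedSuffix(name string) bool {
	for _, s := range reservedSuffixes {
		if strings.HasSuffix(name, s) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"alertmanager_alerts_count", "alertmanager_alerts"} {
		fmt.Printf("%s reserved=%v\n", name, hasReservedSuffix(name))
	}
}
```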

@iksaif iksaif force-pushed the silence-alert-counters branch 3 times, most recently from 28a3adc to 1dcd1fe Compare September 19, 2017 12:37
Contributor Author

iksaif commented Sep 20, 2017

count removed, tests fixed

Member

@brancz brancz left a comment

A few comments but largely looks ok.

types/types.go Outdated
func (m *memMarker) Count(states ...AlertState) int {
	count := 0

	m.mtx.Lock()
Member

This just needs a read lock.

Contributor Author

done

			ConstLabels: prometheus.Labels{"state": string(st)},
		},
		func() float64 {
			count, _ := s.CountState(st)
Member

We should probably at least log this error.

Contributor Author

done
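
A rough sketch of what "at least log this error" could look like, written against the standard library only so it runs on its own. The names `gaugeValue`, `logf`, and `errDemo` are assumptions made for illustration; the merged code may wire its logger differently:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDemo is a made-up failure used only to exercise the error path.
var errDemo = errors.New("demo failure")

// gaugeValue wraps an error-returning count function (hypothetical shape,
// modelled on CountState in the hunk above) into the func() float64 that a
// gauge callback needs. The error is handed to logf instead of being
// discarded; the count is still returned.
func gaugeValue(count func() (int, error), logf func(format string, args ...interface{})) func() float64 {
	return func() float64 {
		n, err := count()
		if err != nil {
			logf("counting silences failed: %v", err)
		}
		return float64(n)
	}
}

func main() {
	ok := gaugeValue(func() (int, error) { return 3, nil }, log.Printf)
	failing := gaugeValue(func() (int, error) { return 0, errDemo }, log.Printf)
	fmt.Println(ok(), failing())
}
```

Passing the logger in as a function keeps the callback easy to test and avoids a package-level logger, which is a design choice of this sketch rather than something the thread prescribes.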

@@ -584,6 +610,16 @@ func (s *Silences) Query(params ...QueryParam) ([]*pb.Silence, error) {
	return sils, err
}

// Count silences by state.
func (s *Silences) CountState(states ...SilenceState) (int, error) {
	// This could probably be optimized.
Member

I'm hoping that one day we'll be able to index the silence and alert in-memory databases, so we can search, count, and query a bit more efficiently.
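
To illustrate the unindexed approach being discussed, here is a minimal sketch of counting silences by state with a linear scan. It assumes a silence's state is derived from its start and end times relative to "now"; the `silence` type, `stateAt`, `countByState`, and `demoSilences` are hypothetical stand-ins, not the types from the `silence` package:

```go
package main

import (
	"fmt"
	"time"
)

type SilenceState string

const (
	StatePending SilenceState = "pending"
	StateActive  SilenceState = "active"
	StateExpired SilenceState = "expired"
)

// silence is a simplified stand-in: only the time window matters here.
type silence struct {
	StartsAt, EndsAt time.Time
}

// stateAt derives the state of a silence at a given instant.
func stateAt(s silence, now time.Time) SilenceState {
	switch {
	case now.Before(s.StartsAt):
		return StatePending
	case now.After(s.EndsAt):
		return StateExpired
	default:
		return StateActive
	}
}

// countByState walks every silence once, so each call is O(n).
func countByState(sils []silence, now time.Time) map[SilenceState]int {
	counts := map[SilenceState]int{StatePending: 0, StateActive: 0, StateExpired: 0}
	for _, s := range sils {
		counts[stateAt(s, now)]++
	}
	return counts
}

// demoSilences returns one active and one expired silence at a fixed instant.
func demoSilences() ([]silence, time.Time) {
	now := time.Date(2017, 10, 2, 12, 0, 0, 0, time.UTC)
	return []silence{
		{StartsAt: now.Add(-time.Hour), EndsAt: now.Add(time.Hour)},
		{StartsAt: now.Add(-2 * time.Hour), EndsAt: now.Add(-time.Hour)},
	}, now
}

func main() {
	sils, now := demoSilences()
	fmt.Println(countByState(sils, now))
}
```

Because the state depends on the current time, a per-state index would have to account for silences moving from pending to active to expired without any write happening, which is part of why a plain scan is the simple starting point.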

@iksaif iksaif force-pushed the silence-alert-counters branch from 1dcd1fe to b2f7f8e Compare September 29, 2017 15:01
Contributor Author

iksaif commented Oct 2, 2017

Should be good now

Member

@brancz brancz left a comment

It would probably be good if @fabxc and/or @stuartnelson3 also reviewed this. I'm unsure about the performance implications that the locking for counting may have. It's probably fine, but I would like a second opinion.

types/types.go Outdated
	count := 0

	m.mtx.RLock()
	defer m.mtx.Unlock()
Member

This should be RUnlock, when using RLock.

Contributor Author

True, sorry for that, fixed
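
For reference, a self-contained sketch of the pairing the two lock comments ask for: writers take `Lock`/`Unlock`, while the read-only count takes `RLock` and releases it with `RUnlock`. The `marker` type here is a simplified, hypothetical stand-in for the in-memory marker, and treating a call with no states as "count everything" is an assumption of this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

type AlertState string

const (
	AlertStateActive     AlertState = "active"
	AlertStateSuppressed AlertState = "suppressed"
)

// marker is a simplified stand-in that maps an alert fingerprint to its state.
type marker struct {
	mtx sync.RWMutex
	m   map[string]AlertState
}

func newMarker() *marker {
	return &marker{m: map[string]AlertState{}}
}

// Set mutates the map, so it needs the exclusive write lock.
func (m *marker) Set(fingerprint string, st AlertState) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.m[fingerprint] = st
}

// Count only reads, so a read lock is enough and concurrent readers do not
// block each other. RLock must be released with RUnlock, not Unlock.
func (m *marker) Count(states ...AlertState) int {
	m.mtx.RLock()
	defer m.mtx.RUnlock()

	if len(states) == 0 {
		return len(m.m)
	}
	count := 0
	for _, st := range m.m {
		for _, want := range states {
			if st == want {
				count++
				break
			}
		}
	}
	return count
}

func main() {
	mk := newMarker()
	mk.Set("a", AlertStateActive)
	mk.Set("b", AlertStateActive)
	mk.Set("c", AlertStateSuppressed)
	fmt.Println(mk.Count(AlertStateActive), mk.Count(AlertStateSuppressed), mk.Count())
}
```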

This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```

This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
@iksaif iksaif force-pushed the silence-alert-counters branch from b2f7f8e to bff889b Compare October 2, 2017 11:33
Contributor Author

iksaif commented Oct 2, 2017

> I'm unsure about the performance implications the locking for counting may have

Should not be worse than accessing the API or the UI.

Member

brancz commented Oct 5, 2017

Fair enough. Thanks for the contributions!

@brancz brancz merged commit d47f8b9 into prometheus:master Oct 5, 2017
hh pushed a commit to ii/alertmanager that referenced this pull request Aug 5, 2018
* vendor: Update prometheus/procfs

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* mountstats: Use new NFS protocol field

In prometheus/procfs#100, the NFSTransportStats
struct was expanded by a field called protocol that specifies the NFS
protocol in use, either "tcp" or "udp". This commit adds the protocol as
a label to all NFS metrics exported via the mountstats collector.

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* Update fixtures for UDP mount

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
3 participants