-
Notifications
You must be signed in to change notification settings - Fork 12.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alerting: Use alerting.GrafanaAlertmanager
instead of initialising Alertmanager components directly
#61230
Conversation
b1355ee
to
234c73e
Compare
e33ccb6
to
206e2a3
Compare
206e2a3
to
339f3eb
Compare
339f3eb
to
297ddc2
Compare
@@ -54,6 +54,7 @@ func setupAMTest(t *testing.T) *Alertmanager { | |||
} | |||
|
|||
func TestPutAlert(t *testing.T) { | |||
t.SkipNow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why this no longer passes? Or are there just outdated references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is asserting on the Alertmanager internals. It uses what the Alertmanager calls the "marker" to assert for alerts that have been received but not processed.
These components are private and are no longer accessible in any form from grafana/alerting
(as a consumer shouldn't have a need to access them).
The test was ported 1:1 to grafana/alerting in https://github.com/grafana/alerting/blob/main/alerting/grafana_alertmanager_test.go#L34 so I reckon is safe to skip for now. There are several integration tests that asserts this functionality anyway.
@@ -1,3 +1,4 @@ | |||
//nolint:golint,unused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remember to drop before merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't drop it - In order to keep the diff minimal, there are now a lot of attributes and functions that remain unused. My next PR shortly after this one will take care of removing this linter skip and delete all the dead code.
alerting.GrafanaAlertmanager
instead of initialising Alertmanager components directly
@@ -260,61 +200,48 @@ func (am *Alertmanager) Ready() bool { | |||
// We consider AM as ready only when the config has been | |||
// applied at least once successfully. Until then, some objects | |||
// can still be nil. | |||
am.reloadConfigMtx.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This no longer needs a lock as Base
already has one.
} | ||
|
||
// SaveAndApplyDefaultConfig saves the default configuration the database and applies the configuration to the Alertmanager. | ||
// It rollbacks the save if we fail to apply the configuration. | ||
func (am *Alertmanager) SaveAndApplyDefaultConfig(ctx context.Context) error { | ||
am.reloadConfigMtx.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff here is ruined because we need to take Base
's lock, which causes nesting - there's no functional change here. We just need to make sure that we lock (with the same lock) all the way down to calling applyConfig
which does not use Base's lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It preserved it well enough to follow. These lock changes seem straightforward 👍
@@ -325,26 +252,27 @@ func (am *Alertmanager) SaveAndApplyConfig(ctx context.Context, cfg *apimodels.P | |||
return fmt.Errorf("failed to serialize to the Alertmanager configuration: %w", err) | |||
} | |||
|
|||
am.reloadConfigMtx.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff here is ruined because we need to take Base's lock, which causes nesting - there's no functional change here. We just need to make sure that we lock (with the same lock) all the way down to calling applyConfig which does not use Base's lock.
@@ -355,16 +283,18 @@ func (am *Alertmanager) ApplyConfig(dbCfg *ngmodels.AlertConfiguration) error { | |||
return fmt.Errorf("failed to parse Alertmanager config: %w", err) | |||
} | |||
|
|||
am.reloadConfigMtx.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not conflate ApplyConfig
with applyConfig
- this one also needs a lock because all public functions should be safe to be called concurrently.
if err != nil { | ||
return err | ||
} | ||
|
||
// Finally, build the integrations map using the receiver configuration and templates. | ||
integrationsMap, err := am.buildIntegrationsMap(cfg.AlertmanagerConfig.Receivers, tmpl) | ||
err = am.Base.ApplyConfig(AlertingConfiguration{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diff is somewhat clear but the gist is that everything where we doing is now done by Base.ApplyConfig
- whatever we need from here - we inject via AlertingConfiguration
@@ -179,28 +179,28 @@ func TestMultiOrgAlertmanager_SyncAlertmanagersForOrgsWithFailures(t *testing.T) | |||
{ | |||
require.NoError(t, mam.LoadAndSyncAlertmanagersForOrgs(ctx)) | |||
require.Len(t, mam.alertmanagers, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer have ready
as this was originally designed to be an internal non-safe version of Ready
- this now lives in in grafana/alerting
.
That aside, there's no reason why this test should rely on ready
except to make it a tiny bit faster.
greceivers := make([]*alerting.GrafanaReceiver, 0, len(r.GrafanaManagedReceivers)) | ||
for _, gr := range r.PostableGrafanaReceivers.GrafanaManagedReceivers { | ||
var settings map[string]string | ||
//TODO: We shouldn't need to do this marshalling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as grafana/alerting#38 but inverse.
@@ -80,3 +80,29 @@ func TestProcessNotifierError(t *testing.T) { | |||
require.Equal(t, err, processNotifierError(r, err)) | |||
}) | |||
} | |||
|
|||
// TODO: Copied from Alerting, needs to be made public. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed in grafana/alerting#39
apimodels "github.com/grafana/grafana/pkg/services/ngalert/api/tooling/definitions" | ||
) | ||
|
||
// TODO: We no longer do apimodels at this layer, move it to the API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issue as is part of a very large, very obvious effort - don't have a concrete plan yet will create one once I do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if am.Base.Ready() { | ||
if err := json.Unmarshal(am.Base.GetStatus(), config); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the lock here? there is a gap between am.Base.Ready()
and am.Base.GetStatus()
, the config could be concurrently updated there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout -- it was a bit confusing, so I've removed the am.Base.Ready()
check.
If you look at Base.GetStatus
it does hold a lock for getting the status from the configuration - so really all that we need is to execute the call against this check and safely remove the check to Base.Ready()
as you correctly guessed it could change between those two calls so we don't gain anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to answer your question concretely - no, we don't need a lock and we also don't need the Ready
check as we can rely on Base.GetStatus
to give us the accurate information.
In this PR we're introducing the replacement of Grafana's Alertmanager components with a unified struct directly from https://github.com/grafana/alerting.
Grafana's multi-tenancy and configuration CRUD (e.g. list configurations from the database) remain untouched as those are Grafana-specific operations that do not belong within
grafana/alerting
Configuration is injected as part of the struct initialisation to ensure a 1:1 match with what the Alertmanager in here is currently doing. There should be no functional change as part of this PR.