Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor nflog configuration options to make it similar to Silences. #3220

Merged
merged 9 commits into from
Jan 19, 2023

Conversation

gotjosh
Copy link
Member

@gotjosh gotjosh commented Jan 18, 2023

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and run to be similar to the silences.

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh force-pushed the gotjosh/nflog-similar-to-silences branch from 675ee0c to d17244d Compare January 18, 2023 19:49
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh force-pushed the gotjosh/nflog-similar-to-silences branch from 13aaeeb to 5e9a323 Compare January 18, 2023 19:52
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh requested a review from simonpasquier January 18, 2023 20:14
type Log struct {
clock clock.Clock
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most, if not all, of the diff in the tests, has to do with the fact that we now mock the clock instead of injecting. Technically, this can be done in separate PR, but this PR seems simple enough to follow so I decided to include it here.

Comment on lines -171 to -176
func WithNow(f func() time.Time) Option {
return func(l *Log) error {
l.now = f
return nil
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this plumbing and the now func() time.Time on the manager struct was only to help with tests -- this is all replaced by using clock.Clock

// If not nil, the last argument is an override for what to do as part of the maintenance - for advanced usage.
func (l *Log) Maintenance(interval time.Duration, snapf string, stopc <-chan struct{}, override MaintenanceFunc) {
if interval == 0 || stopc == nil {
level.Error(l.logger).Log("msg", "interval or stop signal are missing - not running maintenance")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new - it kind of annoyed me that we'd return from this function on silences when you misconfigured the maintenance but fail silently. I don't think we should do this -- ideally, return an error but I settled with a log line to keep the diff sane.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. I'd suggest that we check the validity of the maintenance interval in cmd/alertmanager/main.go in a later PR. As of today, a negative maintenance interval triggers a panic...

@@ -377,6 +381,7 @@ func (s *Silences) Maintenance(interval time.Duration, snapf string, stopc <-cha
return size, err
}
if size, err = s.Snapshot(f); err != nil {
f.Close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed like a bug - we only close on the second return, but we also return here without closing the file descriptor. This case should be very rare but possible nonetheless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Signed-off-by: gotjosh <josue.abreu@gmail.com>
nflog/nflog.go Outdated

if o.SnapshotFile != "" {
if r, err := os.Open(o.SnapshotFile); err != nil {
if !os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe log at info-level that a previous log to load up didn't exist, so it'll create one? Or, maybe the inverse path, something like "Loading a previous snapshot..."

At a minimum, we should log it at debug level, since there are sort of two independent paths here. If for some reason a file can't be accessed, things could end up in a weird state, and there is no evidence trail left behind that it happened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A debug log sounds good to me. Alertmanager would hit this code path when it starts from scratch.

nflog.WithMaintenance(*maintenanceInterval, stopc, wg.Done, nil),
nflog.WithMetrics(prometheus.DefaultRegisterer),
nflog.WithLogger(log.With(logger, "component", "nflog")),
notificationLogOpts := nflog.Options{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice change 👍

Comment on lines +221 to +222
SnapshotReader io.Reader
SnapshotFile string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a doc-comment to each, saying only one of these fields should be set. That way, callers can see this in their editors directly - just makes it a little more convenient to use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top-level comment reads as:

// A snapshot file or reader from which the initial state is loaded.
// None or only one of them must be set.

In my editor it shows as:

Screenshot 2023-01-19 at 15 23 58

Which indicates:

// None or only one of them must be set.

Are you thinking of something different? Happy to do it, but unsure of what this means for a different editor 🤔

Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -377,6 +381,7 @@ func (s *Silences) Maintenance(interval time.Duration, snapf string, stopc <-cha
return size, err
}
if size, err = s.Snapshot(f); err != nil {
f.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// If not nil, the last argument is an override for what to do as part of the maintenance - for advanced usage.
func (l *Log) Maintenance(interval time.Duration, snapf string, stopc <-chan struct{}, override MaintenanceFunc) {
if interval == 0 || stopc == nil {
level.Error(l.logger).Log("msg", "interval or stop signal are missing - not running maintenance")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. I'd suggest that we check the validity of the maintenance interval in cmd/alertmanager/main.go in a later PR. As of today, a negative maintenance interval triggers a panic...

Signed-off-by: gotjosh <josue.abreu@gmail.com>
nflog/nflog.go Outdated Show resolved Hide resolved
gotjosh and others added 2 commits January 19, 2023 16:25
Co-authored-by: Simon Pasquier <pasquier.simon@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh merged commit f59460b into main Jan 19, 2023
@gotjosh gotjosh deleted the gotjosh/nflog-similar-to-silences branch January 19, 2023 16:39
hoperays pushed a commit to hoperays/alertmanager that referenced this pull request Apr 23, 2023
…rometheus#3220)

* Refactor nflog configuration options to make it similar to Silences.

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.
radek-ryckowski pushed a commit to goldmansachs/alertmanager that referenced this pull request Nov 6, 2023
…rometheus#3220)

* Refactor nflog configuration options to make it similar to Silences.

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants