Return no error when deleting expired silence #2817

soonping-amzn
Contributor

Delete silence currently returns HTTP 500 if called on an expired silence. This makes it non-idempotent, as the following call sequence can result:

T0 initial: DELETE /api/v1/silence/MySilence => Timeout (but actually succeeded)
T1 retry:   DELETE /api/v1/silence/MySilence => 500

After this sequence, the user is left uncertain whether there's a problem in AlertManager/Prometheus. Even fetching the silence to verify its state won't confirm whether the 500 was indicative of a separate underlying problem that could manifest in other ways later.
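The retry sequence can be reproduced with a small model. This is only a sketch: `silenceHandler` and its `idempotent` switch are hypothetical stand-ins, not Alertmanager's actual handler, and the sketch answers unknown IDs with a 500 simply because the endpoint is described here as returning 500 on any error.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// silenceHandler is a hypothetical model of the delete-silence endpoint.
// It only reproduces the status codes discussed in this PR.
type silenceHandler struct {
	expired    map[string]bool // silence ID -> already expired?
	idempotent bool            // false: old behaviour, true: proposed behaviour
}

func (h *silenceHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/api/v1/silence/")
	alreadyExpired, known := h.expired[id]
	switch {
	case r.Method != http.MethodDelete:
		w.WriteHeader(http.StatusMethodNotAllowed)
	case !known:
		// Assumption: unknown silences fail with a generic 500 in this model.
		w.WriteHeader(http.StatusInternalServerError)
	case alreadyExpired && !h.idempotent:
		// Old behaviour: expiring an expired silence is an error.
		w.WriteHeader(http.StatusInternalServerError)
	default:
		h.expired[id] = true
		w.WriteHeader(http.StatusOK)
	}
}

// deleteTwice issues the initial DELETE and one retry against a fresh handler
// and returns both status codes.
func deleteTwice(idempotent bool) (first, retry int) {
	h := &silenceHandler{
		expired:    map[string]bool{"MySilence": false},
		idempotent: idempotent,
	}
	codes := [2]int{}
	for i := range codes {
		req := httptest.NewRequest(http.MethodDelete, "/api/v1/silence/MySilence", nil)
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		codes[i] = rec.Code
	}
	return codes[0], codes[1]
}

func main() {
	first, retry := deleteTwice(false)
	fmt.Println("old behaviour:", first, retry)
	first, retry = deleteTwice(true)
	fmt.Println("idempotent:", first, retry)
}
```

With the idempotent variant, a client that timed out on the first call can retry without having to interpret a 500.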

Prometheus exposes the delete_series API which does exhibit idempotent behaviour -- calling delete_series on an already deleted series returns the same status as the initial delete_series call: HTTP 200.

This PR changes Silences.expire(id) to return nil for expired silences, which impacts APIs as follows:

  • DELETE silence will no longer fail if the silence was already expired
  • PUT silence will no longer fail if the silence exists and was already expired
    if err := s.expire(prev.Id); err != nil {
    return "", errors.Wrap(err, "expire previous silence")

…ilence

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
@soonping-amzn soonping-amzn force-pushed the 20211207-expire-silence-idempotency-fix branch from 91d6855 to d389091 Compare January 12, 2022 17:11
@soonping-amzn
Contributor Author

Note: a PR (#2815) was posted shortly before this one addressing a related but distinct issue: return a 404 instead of a 500 for non-existent silences.

The two PRs address different problems, both arising from delete silence returning 500 on any error.

Member

@roidelapluie roidelapluie left a comment

I think it makes sense. I would suggest adding a comment in the code to make it clear that we want this method to be idempotent.

Note that this is not guaranteed to work indefinitely, because expired silences get garbage collected at some point.

cc @simonpasquier @gotjosh for a second look.

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
@alvinlin123
Contributor

@simonpasquier @gotjosh, sorry to interrupt, but would you be able to take some time to take a look at this PR soon?

@@ -622,7 +622,8 @@ func (s *Silences) expire(id string) error {

 	switch getState(sil, now) {
 	case types.SilenceStateExpired:
-		return errors.Errorf("silence %s already expired", id)
+		// returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd
Member

Suggested change
- // returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd
+ // Returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd.

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
@@ -622,7 +622,8 @@ func (s *Silences) expire(id string) error {

 	switch getState(sil, now) {
 	case types.SilenceStateExpired:
-		return errors.Errorf("silence %s already expired", id)
+		// Returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd
Member

Suggested change
- // Returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd
+ // Returning nil ensures idempotent behaviour, at least in the short term before the silence is gc'd.

@roidelapluie
Member

Sorry for the nits, our styleguide for comments is capitalized sentences with full stop. Once correct I'd merge this pull request.

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
@soonping-amzn
Contributor Author

No problem. Apologies for missing the full stop in your previous recommended change.

@@ -622,7 +622,8 @@ func (s *Silences) expire(id string) error {

 	switch getState(sil, now) {
 	case types.SilenceStateExpired:
-		return errors.Errorf("silence %s already expired", id)
Member

@gotjosh gotjosh Jan 19, 2022

Apologies for coming in late - I feel that this kind of comment about the semantics of the function is best put as part of the function description.

How do you feel about changing the top-level comment from:

// Expire the silence with the given ID immediately.

to

// Expire the silence with the given ID immediately. It is idempotent, nil is returned if the silence already expired before it is GC'd.
// If the silence is not found an error is returned.

and removing the comment on the body.

Contributor Author

Makes sense. I'll make the change.

@gotjosh
Member

gotjosh commented Jan 19, 2022

I'm a bit surprised that the CI passes here - if we don't have a test at an API level, we should create them. I'd love to see two tests:

  • One that confirms that subsequent retries to delete a silence don't result in a 500
  • One that confirms that creating a new silence that already existed and was expired doesn't result in a 500

This affects both v1 and v2 APIs, but I believe we should only fix v2 as v1 should be close to getting deprecated per #2469

Member

@simonpasquier simonpasquier left a comment

Agreed with all the comments from @roidelapluie and @gotjosh.

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
@soonping-amzn
Contributor Author

I've updated the function comment, and added tests for DeleteSilence and PostSilences.

During testing I found that PostSilences probably won't change in behaviour, because expire() is only called when the existing silence has not yet expired. In other words, the test passed even without my change to the behaviour of expire(). If you want me to remove that test, let me know.

	if getState(prev, s.now()) != types.SilenceStateExpired {
		// We cannot update the silence, expire the old one.
		if err := s.expire(prev.Id); err != nil {
			return "", errors.Wrap(err, "expire previous silence")
		}
	}

@soonping-amzn
Contributor Author

@simonpasquier @gotjosh Please let me know if you think I need to add to or change the tests I've added, and if anything else is needed. Thank you

@gotjosh
Member

gotjosh commented Feb 10, 2022

Sorry - I've been on PTO for the past few weeks. Will take a look at my earliest convenience.

@simonpasquier
Member

LGTM

@simonpasquier simonpasquier merged commit a2d18c9 into prometheus:main Feb 22, 2022
@simonpasquier
Member

thanks!
