Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restoring snoozeEndTime to AlertAttributesExcludedFromAAD #135663

Merged
merged 8 commits into from
Jul 5, 2022

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Jul 4, 2022

Re-submit of #135602.
Resolves #135740

Summary

This PR fixes a bug introduced in 8.3.0 which causes any rule created or updated by an 8.2.x cluster to fail on migration on upgrade to 8.3.0 or 8.3.1.

The PR also fixes an issue where Kibana could crash while running snoozed rules. This was caused by the task runner updating the rule twice after a run when snoozed, one of the updates is partial while the other takes OCC into consideration. If the OCC failed, Kibana would crash due to missing await.

The root cause of the bug is that the snoozeEndTime field which was added in 8.2.0 and subsequently removed in 8.3.0 was omitted from the AlertAttributesExcludedFromAAD list, which meant they could no longer be decrypted.
As a result, any rule Saved Object which was created or updated in 8.2.x would try to use the snoozeEndTime field to decrypt the rule as part of the 8.3.0 migration - which consistently fails as the SO uses the field (even if empty) as part of AAD.

When this migration fails for such a rule the following error message is visible in the server log:

Failed to decrypt "apiKey" attribute: Unsupported state or unable to authenticate data

Once Kibana runs it might still try to run these failed rules, which should result in an error such as this:

<rule-type>:<UUID>: execution failed - security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""

Remediation for customers who have already upgraded to 8.3.0 or 8.3.1

If a customer upgrades to 8.3.0 or 8.3.1 and their alerting rules have stopped running with an error similar to the example above, they will need to use the Update API key operation in Stack Management > Rules and Connectors to restore the alerting rule to a functional state. It should be noted that the Update API Key operation causes the alerting rule to run as the user that performed the operation.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk Probability Severity Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space. Low High Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. High Low Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled. Medium High Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

To verify

You will notice the tests are skipped for the time being. They are currently flaky until further enhancements are made to make them part of CI. In the meantime, it is best to test manually, reproduce the issue with main and then confirm this branch fixes the issue.

Scenario 1

  1. Create a rule in 8.2
  2. Upgrade to 8.3
  3. Notice the rule passed the migration and continues running successfully (this fails in main)

Scenario 2

  1. Create a rule in 8.1
  2. Upgrade to 8.3
  3. Notice the rule passed the migration and is still running

Testing scenarios

  • Rule created in 8.1 still runs when upgrading to 8.3.2
  • Rule created in 8.2 still runs when upgrading to 8.3.2
  • Rule created in 8.1, upgraded to 8.2, edited, still runs when upgrading to 8.3.2
  • Rule created in 8.1, muted, still runs when upgrading to 8.3.2
  • Rule created in 8.2, muted, still runs when upgrading to 8.3.2
  • Rule created in 8.2, snoozed w/ end time, still runs when upgrading to 8.3.2

Release Note

Fixes bug where alerting rules that were created or edited in 8.2 will stop running on upgrade to 8.3.0 or 8.3.1. Users upgrading directly to 8.3.2 will not experience this bug. For additional details and remediation steps, see https://www.elastic.co/guide/en/kibana/current/release-notes-8.3.0.html#known-issues-8.3.0

@mikecote mikecote added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.4.0 v8.3.2 labels Jul 4, 2022
@mikecote mikecote self-assigned this Jul 4, 2022
@mikecote mikecote marked this pull request as ready for review July 4, 2022 18:49
@mikecote mikecote requested a review from a team as a code owner July 4, 2022 18:49
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote mikecote added the release_note:skip Skip the PR/issue when compiling release notes label Jul 4, 2022
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ersin-erdal ersin-erdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikecote mikecote added the auto-backport Deprecated - use backport:version if exact versions are needed label Jul 5, 2022
@mikecote mikecote merged commit 2d29934 into elastic:main Jul 5, 2022
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.3 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 135663

Questions ?

Please refer to the Backport tool documentation

mikecote added a commit to mikecote/kibana that referenced this pull request Jul 5, 2022
…5663)

* Restoring snoozeEndTime to AlertAttributesExcludedFromAAD

* Apply suggestions from code review

Co-authored-by: Gidi Meir Morris <github@gidi.io>

* Temp fix migration

* New way to do migrations

* Add two scenarios

* Skip functional test

* Revert some archive changes

Co-authored-by: Ying Mao <ying.mao@elastic.co>
Co-authored-by: Gidi Meir Morris <github@gidi.io>
(cherry picked from commit 2d29934)

# Conflicts:
#	x-pack/plugins/alerting/server/task_runner/task_runner.ts
@gmmorris
Copy link
Contributor

gmmorris commented Jul 5, 2022

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

ignoring request to update branch, pull request is closed

mikecote added a commit that referenced this pull request Jul 5, 2022
) (#135714)

* Restoring snoozeEndTime to AlertAttributesExcludedFromAAD (#135663)

* Restoring snoozeEndTime to AlertAttributesExcludedFromAAD

* Apply suggestions from code review

Co-authored-by: Gidi Meir Morris <github@gidi.io>

* Temp fix migration

* New way to do migrations

* Add two scenarios

* Skip functional test

* Revert some archive changes

Co-authored-by: Ying Mao <ying.mao@elastic.co>
Co-authored-by: Gidi Meir Morris <github@gidi.io>
(cherry picked from commit 2d29934)

# Conflicts:
#	x-pack/plugins/alerting/server/task_runner/task_runner.ts

* Fix backport

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed ci:cloud-deploy Create or update a Cloud deployment Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.3.2 v8.4.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Rules failing after upgrading to 8.3.0/8.3.1 from 8.2.*
9 participants