Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restoring snoozeEndTime to AlertAttributesExcludedFromAAD #135602

Closed
wants to merge 2 commits into from

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Jun 30, 2022

Summary

This PR fixes a bug introduced in 8.3.0 which causes any rule created or updated by an 8.2.x cluster to fail on migration on upgrade to 8.3.0 or 8.3.1.

The root cause of the bug is that the snoozeEndTime field which was added in 8.2.0 and subsequently removed in 8.3.0 was omitted from the AlertAttributesExcludedFromAAD list, which meant they could no longer be decrypted.
As a result, any rule Saved Object which was created or updated in 8.2.x would try to use the snoozeEndTime field to decrypt the rule as part of the 8.3.0 migration - which consistently fails as the SO uses the field (even if empty) as part of AAD.

When this migration fails for such a rule the following error message is visible in the server log:

Failed to decrypt "apiKey" attribute: Unsupported state or unable to authenticate data

Once Kibana runs it might still try to run these failed rules, which should result in an error such as this:

<rule-type>:<UUID>: execution failed - security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""

Remediation for customers who have already upgraded to 8.3.0 or 8.3.1

If a customer upgrades to 8.3.0 or 8.3.1 and their alerting rules have stopped running with an error similar to the example above, they will need to use the Update API key operation in Stack Management > Rules and Connectors to restore the alerting rule to a functional state. It should be noted that the Update API Key operation causes the alerting rule to run as the user that performed the operation.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk Probability Severity Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space. Low High Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. High Low Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled. Medium High Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

expect(response.body._source?.alert?.snoozeSchedule).not.to.be(undefined);

const snoozeSchedule = response.body._source?.alert.snoozeSchedule!;
expect(snoozeSchedule.length).to.eql(1);
Copy link
Contributor Author

@ymao1 ymao1 Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we should add expect(response.body._source?.alert?.apiKey).not.to.be(undefined);

"throttle":null,
"notifyWhen" : "onActiveAlert",
"apiKeyOwner" : "elastic",
"apiKey" : "y86eFOuG06GNd4p/90Zh4V4C+4wF8rFHiLcOeNqHQFFAbagYCXIuggq/N/AB0HBncCsWrh5PsCs4RiX53+LP2o+oLtoQkdTK9m+iv0BnzP6aavoypIudKtnElN3SotI3XpIRm7c3MEdM/C/zZXHgCmn4UMExhkulOO1S/Q//FwvIr2/SKw/mp8Vx5qP41hqDX4YFfSMoFkPSLw==",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a correct-er way to generate this data so that the api key can be decrypted by the functional test server. This is just one I've copied from local data and it gives a decryption error during the test :(

Comment on lines +153 to +155
"notifyWhen": {
"type": "keyword"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary here? Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any field that's inside a document in data.json needs to have a mapping defined. It just so happened that none of the other rule SO examples in data.json had notifyWhen but the SO I just added does, so I added the mapping. It's not strictly needed for this PR, I could remove it from the new doc I added to data.json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh gotcha.
Thanks Ying.... enjoy the weekend! ;)

Co-authored-by: Gidi Meir Morris <github@gidi.io>
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@@ -443,5 +443,22 @@ export default function createGetTests({ getService }: FtrProviderContext) {
expect(response.statusCode).to.equal(200);
expect(response.body._source?.alert?.tags).to.eql(['test-tag-1', 'foo-tag']);
});

it('8.3.0 removes snoozeEndTime in favor of snoozeSchedule', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't seem to be working.
I commented out the return of the snoozeEndTime field in the AlertAttributesExcludedFromAAD variable and this test is still passing - it should be failing.

I know there's more work to be done by @mikecote today, so this might be a known issue, but I wanted to flag just in case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is because the migration will succeed even if decryption fails. I left a comment that we should add an expect here to ensure the api key still exists after migration but that can't be done until this (loading the test data in the correct manner)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant, thanks, I'll work with Mike to get that sorted asap.

@ymao1 ymao1 closed this Sep 21, 2022
@ymao1 ymao1 deleted the alerting/restore-snoozeEndTime branch September 21, 2022 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants