[CI] TimeSeriesLifecycleActionsIT testHistoryIsWrittenWithFailure #50353
Comments
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM) |
Muted in c1580dc |
This changes the `putAsync` method in `ILMHistoryStore` so that it never blocks. Previously, because of the way `BulkProcessor` works, `BulkProcessor#add` could block while executing a bulk request. This was bad because we may be adding items to the history store from cluster state update threads. This also moves the index creation so that it happens prior to the bulk request execution, rather than being checked every time an operation is added to the queue. This lessens the chance of the index being created, then deleted (by some external force), and then recreated via a bulk indexing request. Resolves elastic#50353
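The non-blocking idea described here can be illustrated with a minimal sketch. This is not the actual `ILMHistoryStore` code: the class `NonBlockingHistoryStoreSketch`, the `blockingAdd` method (a stand-in for a call such as `BulkProcessor#add` that may block), and the single-thread executor are all assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a history store; not the real ILMHistoryStore.
final class NonBlockingHistoryStoreSketch implements AutoCloseable {
    // A single background thread, so queued items are handed over in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<String> indexed = new CopyOnWriteArrayList<>();

    // Stand-in for a call that may block while a bulk request is executing.
    private void blockingAdd(String item) {
        try {
            Thread.sleep(200); // simulate a slow bulk flush
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        indexed.add(item);
    }

    // Returns immediately; the potentially blocking work runs on the executor thread.
    CompletableFuture<Void> putAsync(String item) {
        return CompletableFuture.runAsync(() -> blockingAdd(item), executor);
    }

    List<String> indexed() {
        return indexed;
    }

    @Override
    public void close() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (NonBlockingHistoryStoreSketch store = new NonBlockingHistoryStoreSketch()) {
            long start = System.nanoTime();
            CompletableFuture<Void> done = store.putAsync("history-item");
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("putAsync returned after ~" + elapsedMillis + " ms");
            // Waiting is done only in this demo; a cluster state update thread would not wait.
            done.get(5, TimeUnit.SECONDS);
            System.out.println("indexed: " + store.indexed());
        }
    }
}
```

The caller gets a future back right away and can attach failure handling to it, while the slow call is confined to a thread where blocking is acceptable.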
I also muted |
* Make ILMHistoryStore.putAsync truly async
This changes the `putAsync` method in `ILMHistoryStore` so that it never blocks. Previously, because of the way `BulkProcessor` works, `BulkProcessor#add` could block while executing a bulk request. This was bad because we may be adding items to the history store from cluster state update threads. This also moves the index creation so that it happens prior to the bulk request execution, rather than being checked every time an operation is added to the queue. This lessens the chance of the index being created, then deleted (by some external force), and then recreated via a bulk indexing request. Resolves #50353
* Add ILM history store index (#50287)
* Add ILM history store index

This commit adds an ILM history store that tracks the lifecycle execution state as an index progresses through its ILM policy. ILM history documents store output similar to what the ILM explain API returns.

An example document with ALL fields (not all documents will have all fields) would look like:

```json
{
  "@timestamp": 1203012389,
  "policy": "my-ilm-policy",
  "index": "index-2019.1.1-000023",
  "index_age": 123120,
  "success": true,
  "state": {
    "phase": "warm",
    "action": "allocate",
    "step": "ERROR",
    "failed_step": "update-settings",
    "is_auto-retryable_error": true,
    "creation_date": 12389012039,
    "phase_time": 12908389120,
    "action_time": 1283901209,
    "step_time": 123904107140,
    "phase_definition": "{\"policy\":\"ilm-history-ilm-policy\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"rollover\":{\"max_size\":\"50gb\",\"max_age\":\"30d\"}}},\"version\":1,\"modified_date_in_millis\":1576517253463}",
    "step_info": "{... etc step info here as json ...}"
  },
  "error_details": "java.lang.RuntimeException: etc\n\tcaused by:etc etc etc full stacktrace"
}
```

These documents go into the `ilm-history-1-00000N` index to provide an audit trail of the operations ILM has performed.

This history storage is enabled by default but can be disabled by setting `index.lifecycle.history_index_enabled` to `false`.

Resolves #49180

* Make ILMHistoryStore.putAsync truly async (#50403)

This changes the `putAsync` method in `ILMHistoryStore` so that it never blocks. Previously, because of the way `BulkProcessor` works, `BulkProcessor#add` could block while executing a bulk request. This was bad because we may be adding items to the history store from cluster state update threads.

This also moves the index creation so that it happens prior to the bulk request execution, rather than being checked every time an operation is added to the queue. This lessens the chance of the index being created, then deleted (by some external force), and then recreated via a bulk indexing request.

Resolves #50353
This happened again on 7.x: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=centos-6&&immutable/471/consoleFull. A build scan is available at https://gradle-enterprise.elastic.co/s/u7p4epuphegx6. However, I was unable to reproduce this locally in ~300 runs.
Another failure in a PR build check against master-ish : https://gradle-enterprise.elastic.co/s/aryfl6vkivmcm |
Just got failures in
Could not immediately reproduce with:
|
Another on master (TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithSuccess). Maybe muting makes sense again. Will wait a bit longer but this seems to get frequent.
|
Another one:
Will mute on master and 7.x, please revert those commits if you need more logs etc... |
Also muting TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure. Tracked in #50353
These tests use the same index name, making it hard to read logs when diagnosing the failures. Additionally more information about the current state of the index could be retrieved when failing. This changes these two things in the hope of capturing more data about why this fails on some CI nodes but not others. Relates to elastic#50353
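One way to give each test its own index name, so that log lines can be attributed to a single test, is sketched below. The helper `UniqueIndexNames.forTest` is hypothetical and is not taken from the Elasticsearch test framework; it uses only the JDK.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper; not part of the Elasticsearch test framework.
final class UniqueIndexNames {
    private UniqueIndexNames() {}

    // Builds a lowercase name from the test name plus a random suffix,
    // so two tests never share an index and log lines can be told apart.
    static String forTest(String testName) {
        String suffix = UUID.randomUUID().toString().replace("-", "").substring(0, 10);
        return (testName + "-" + suffix).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(forTest("testHistoryIsWrittenWithSuccess"));
        System.out.println(forTest("testHistoryIsWrittenWithFailure"));
    }
}
```

The name is lowercased because Elasticsearch index names must be lowercase.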
This commit adds additional logging as well as applies some test fixes in an attempt to address elastic#50353.
I pushed yet more logging and un-muted a single test on master only, so if it fails again please do update this issue and I'll re-mute. |
@dakrone here are a few more failures that I assume include your additional logging: https://gradle-enterprise.elastic.co/s/b6om55dj2eqk2/tests/ao2nxxbijves6-iqisezgxfz7vq |
Given we've gotten a few more failures since adding the additional logging, I've re-muted this test with bd6d6f7. I've also noticed this test failing recently with a different assertion error. Since it's not clear whether these are related, I've opened a separate issue (#52853) to make it easier for folks to discover. |
@mark-vieira thanks, these do have more information, thanks for re-muting for me. |
This change modifies ILMHistoryStore to always apply the correct settings and mappings, even if the template has been deleted and not yet recreated. This ensures that the ILM history index is correctly managed by ILM and also fixes flaky history tests that were prone to triggering this race. This commit also refactors and simplifies the ILM history tests. Closes elastic#50353 and elastic#52853
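The race can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `HistoryIndexCreationSketch`, its maps, and the values in `REQUIRED_SETTINGS` are illustrative stand-ins, not the real ILMHistoryStore logic or its actual settings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative in-memory model; not the real ILMHistoryStore or cluster state.
final class HistoryIndexCreationSketch {
    // Assumed settings the history index always needs (values are illustrative).
    static final Map<String, String> REQUIRED_SETTINGS = Map.of(
        "index.lifecycle.name", "ilm-history-ilm-policy",
        "index.number_of_shards", "1");

    private final Map<String, Map<String, String>> indices = new ConcurrentHashMap<>();
    // null models "template deleted and not yet recreated".
    private volatile Map<String, String> template;

    void putTemplate(Map<String, String> settings) {
        template = Map.copyOf(settings);
    }

    void deleteTemplate() {
        template = null;
    }

    // Racy variant: the index only gets settings if the template exists right now.
    Map<String, String> createFromTemplateOnly(String name) {
        Map<String, String> current = template;
        return indices.computeIfAbsent(name, n -> current == null ? Map.of() : current);
    }

    // Robust variant: settings are applied explicitly, whatever the template state.
    Map<String, String> createWithExplicitSettings(String name) {
        return indices.computeIfAbsent(name, n -> REQUIRED_SETTINGS);
    }

    public static void main(String[] args) {
        HistoryIndexCreationSketch cluster = new HistoryIndexCreationSketch();
        cluster.putTemplate(REQUIRED_SETTINGS);
        cluster.deleteTemplate(); // e.g. a test cleanup removes the template
        System.out.println("template only: " + cluster.createFromTemplateOnly("history-a"));
        System.out.println("explicit:      " + cluster.createWithExplicitSettings("history-b"));
    }
}
```

In the racy variant an index created while the template is missing ends up without a lifecycle policy, which is the situation the change above is meant to rule out.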
* Avoid race condition in ILMHistoryStore
This change modifies ILMHistoryStore to always apply the correct settings and mappings, even if the template has been deleted and not yet recreated. This ensures that the ILM history index is correctly managed by ILM and also fixes flaky history tests that were prone to triggering this race. This commit also refactors and simplifies the ILM history tests. Closes #50353 and #52853
* Review comment
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Avoid race condition in ILMHistoryStore (#53039)
* Avoid race condition in ILMHistoryStore
This change modifies ILMHistoryStore to always apply the correct settings and mappings, even if the template has been deleted and not yet recreated. This ensures that the ILM history index is correctly managed by ILM and also fixes flaky history tests that were prone to triggering this race. This commit also refactors and simplifies the ILM history tests. Closes #50353 and #52853
* Review comment
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* fixed tests
* backport #53306
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Bad news: this failure came up again on
|
This is continuing to fail with some regularity (a dozen times in the past week on |
Another failure on 7.x. I'm keeping it unmuted for now, as it is only the first failure since Jan 2nd.
Build scan: https://gradle-enterprise.elastic.co/s/ehgyqqem7x2r4
The failure doesn't reproduce for me on 7.x.
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure"
|
Numerous failures on 7.x. I muted the test again in a38e5ca. |
This test has been un-muted (and fixed) in #64521 |
A number of failures on the master branch in the last day:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2012-r2/331/console
https://gradle-enterprise.elastic.co/s/clxgk64w274hw
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-18.04&&immutable/460/console
https://gradle-enterprise.elastic.co/s/ptsngwrgxzn64
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-6&&immutable/460/console
https://gradle-enterprise.elastic.co/s/jdmkulxql5ris
Does not reproduce: