
Move metadata #280

Merged Mar 8, 2021 (22 commits)

Conversation

@bowenlan-amzn (Contributor) commented Aug 7, 2020

Issue #, if available:
#207

Description of changes:

  • Move ManagedIndexMetadata from being saved in the cluster state to being saved in the config index.
    • Use a flag to stop new nodes from running jobs while an old node still exists in the cluster (see the sketch after this list): this ensures metadata in the cluster state is never newer than metadata in the config index, so we can safely move metadata from the cluster state to the config index once every node in the cluster runs this version of ISM.
  • Expose step metadata in the Explain API.
    • Some tests now deliberately check or wait for step metadata that shows the step status.
  • Refactor some variable names in the Retry and ChangePolicy actions to be more meaningful to me :)
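
A minimal sketch of the version-check idea behind that flag, assuming a listener on cluster change events; the class and field names are illustrative, not the exact plugin code (the plugin checks node info for plugin versions, as discussed in the review below):

import org.elasticsearch.cluster.ClusterChangedEvent
import org.elasticsearch.cluster.ClusterStateListener

class SkipExecutionSketch : ClusterStateListener {
    // When true, this (new) node skips job execution, because an older node
    // may still be writing metadata into the cluster state.
    @Volatile var flag: Boolean = false
        private set

    override fun clusterChanged(event: ClusterChangedEvent) {
        if (event.nodesChanged() || event.isNewCluster) {
            val versions = event.state().nodes().map { it.version }.toSet()
            // More than one version in the cluster means a mixed cluster.
            flag = versions.size > 1
        }
    }
}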

Tests

MetadataRegressionIT

  1. Deliberately add some metadata to the cluster state, then verify it can be moved to the config index and the job can pick that metadata up.
  2. Set up a mixed cluster with one old node and one new node, and test the SkipExecution flag.
    • Run it with: ./gradlew mixedCluster --tests "com.amazon.opendistroforelasticsearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test new node skip execution when old node exist in cluster" -PnumNodes=2 -DmixCluster=true

SkipExecutionTests

  1. Mock cluster changed events to test setting of the SkipExecution flag.

Example Response

Metadata saved in the config index

"managed_index_metadata": {
    "index": "test_index",
    "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
    "policy_id": "policy_1",
    "policy_seq_no": 0,
    "policy_primary_term": 1,
    "policy_completed": false,
    "rolled_over": false,
    "transition_to": null,
    "state": {
        "name": "warm",
        "start_time": 1613015158272
    },
    "action": {
        "name": "replica_count",
        "start_time": 1613015217265,
        "index": 0,
        "failed": false,
        "consumed_retries": 0,
        "last_retry_time": 0
    },
    "step": {
        "name": "attempt_set_replica_count",
        "start_time": 1613015218330,
        "step_status": "completed"
    },
    "retry_info": {
        "failed": false,
        "consumed_retries": 0
    },
    "info": {
        "message": "Successfully set number_of_replicas to 0 [index=test_index]"
    }
}

Explain API output

{
  "test_index": {
    "index.opendistro.index_state_management.policy_id": "policy_1",
    "index": "test_index",
    "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
    "policy_id": "policy_1",
    "policy_seq_no": 0,
    "policy_primary_term": 1,
    "state": {
      "name": "warm",
      "start_time": 1613015158272
    },
    "action": {
      "name": "replica_count",
      "start_time": 1613015217265,
      "index": 0,
      "failed": false,
      "consumed_retries": 0,
      "last_retry_time": 0
    },
    "step": {
      "name": "attempt_set_replica_count",
      "start_time": 1613015218330,
      "step_status": "completed"
    },
    "retry_info": {
      "failed": false,
      "consumed_retries": 0
    },
    "info": {
      "message": "Successfully set number_of_replicas to 0 [index=test_index]"
    }
  }
}

History index content

    "hits": [
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-L0yj3cBlTj35GPfFMW_",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "state": {
              "name": "warm",
              "start_time": 1613015158272
            },
            "retry_info": {
              "failed": false,
              "consumed_retries": 0
            },
            "info": {
              "message": "Successfully initialized policy: policy_1"
            },
            "history_timestamp": 1613015159998
          }
        }
      },
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-b0yj3cBlTj35GPf_MUM",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "state": {
              "name": "warm",
              "start_time": 1613015158272
            },
            "action": {
              "name": "replica_count",
              "start_time": 1613015217265,
              "index": 0,
              "failed": false,
              "consumed_retries": 0,
              "last_retry_time": 0
            },
            "step": {
              "name": "attempt_set_replica_count",
              "start_time": 1613015218330,
              "step_status": "completed"
            },
            "retry_info": {
              "failed": false,
              "consumed_retries": 0
            },
            "info": {
              "message": "Successfully set number_of_replicas to 0 [index=test_index]"
            },
            "history_timestamp": 1613015219212
          }
        }
      },
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-r0zj3cBlTj35GPf48Wa",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "policy_completed": true,
            "history_timestamp": 1613015278490
          }
        }
      }
    ]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov (codecov bot) commented Aug 7, 2020

Codecov Report

Merging #280 (56252bb) into main (34735cf) will decrease coverage by 0.32%.
The diff coverage is 69.98%.


@@             Coverage Diff              @@
##               main     #280      +/-   ##
============================================
- Coverage     77.39%   77.07%   -0.33%     
- Complexity     1509     1550      +41     
============================================
  Files           196      198       +2     
  Lines          7831     8217     +386     
  Branches       1246     1317      +71     
============================================
+ Hits           6061     6333     +272     
- Misses         1095     1172      +77     
- Partials        675      712      +37     
Impacted Files Coverage Δ Complexity Δ
...ndexstatemanagement/IndexStateManagementHistory.kt 80.00% <ø> (+1.73%) 27.00 <0.00> (+1.00)
...exstatemanagement/resthandler/RestExplainAction.kt 100.00% <ø> (ø) 4.00 <0.00> (ø)
...sport/action/addpolicy/TransportAddPolicyAction.kt 71.60% <0.00%> (-0.90%) 3.00 <0.00> (ø)
...asticsearch/indexmanagement/rollup/model/Rollup.kt 89.17% <ø> (ø) 53.00 <0.00> (ø)
...icsearch/indexmanagement/IndexManagementIndices.kt 58.75% <40.00%> (-2.68%) 10.00 <1.00> (+1.00) ⬇️
...action/removepolicy/TransportRemovePolicyAction.kt 68.42% <55.55%> (-2.60%) 2.00 <0.00> (ø)
...agement/indexstatemanagement/ManagedIndexRunner.kt 55.00% <55.73%> (-0.66%) 46.00 <6.00> (+5.00) ⬇️
...nt/indexstatemanagement/ManagedIndexCoordinator.kt 74.45% <61.19%> (-2.75%) 47.00 <4.00> (+4.00) ⬇️
...management/indexstatemanagement/MetadataService.kt 62.36% <62.36%> (ø) 12.00 <12.00> (?)
...gedindex/TransportRetryFailedManagedIndexAction.kt 72.97% <64.00%> (-2.27%) 2.00 <0.00> (ø)
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bowenlan-amzn added the enhancement label Aug 10, 2020
add routing to get request
Base automatically changed from master to main February 9, 2021 07:16
@@ -48,11 +48,12 @@ data class StepMetaData(
}

override fun toXContent(builder: XContentBuilder, params: ToXContent.Params): XContentBuilder {
return builder.startObject(STEP)
@bowenlan-amzn (Contributor, Author):
Previously we didn't render the step metadata as JSON. The change here makes it consistent with the ActionMetadata code (code link).
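
For illustration, a hedged sketch of what the full method might look like, assuming field constants NAME, START_TIME, and STEP_STATUS that correspond to the JSON in the description (a sketch, not the exact diff):

override fun toXContent(builder: XContentBuilder, params: ToXContent.Params): XContentBuilder {
    // Render step metadata as a nested "step" object, mirroring ActionMetaData.
    return builder.startObject(STEP)
        .field(NAME, name)
        .field(START_TIME, startTime)
        .field(STEP_STATUS, stepStatus.toString())
        .endObject()
}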

@dbbaughe (Contributor):

Moving this comment to this PR from the other one.

High-level comment: I'm worried about perf issues from moving this into the config index. Job Scheduler has a postIndex subscriber on the config index that executes every time a doc is added or updated. Imagine someone with 10k indices running every 1 minute: that's 20k updates per minute (each ISM execution does 2 metadata updates, for starting and closing a transaction), all triggering that callback, which then has to check whether the doc is a job type to schedule (which it isn't). Alternatively, we could consider another ISM index purely for metadata, which wouldn't have this listener attached; both rollup and ISM metadata could live there instead of the config index. Some testing would be good to see if this is a real concern.
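
To make the concern concrete, a rough sketch of the kind of listener being described, using Elasticsearch's IndexingOperationListener interface (illustrative only, not the Job Scheduler source):

import org.elasticsearch.index.engine.Engine
import org.elasticsearch.index.shard.IndexingOperationListener
import org.elasticsearch.index.shard.ShardId

class ConfigIndexSweeperSketch : IndexingOperationListener {
    override fun postIndex(shardId: ShardId, index: Engine.Index, result: Engine.IndexResult) {
        // Fires for every write to the config index. With metadata stored here,
        // each ISM metadata update also lands in this callback, which must
        // inspect the doc only to discover it is not a schedulable job.
    }
}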

val bulkRequest = BulkRequest().add(deleteRequests)
val bulkResponse: BulkResponse = client.suspendUntil { bulk(bulkRequest, it) }
bulkResponse.forEach {
if (it.isFailed) logger.error("Failed to clear ManagedIndexMetadata for [index=${it.index}]. " +
Contributor:

Are there any particular failure reasons in which we want to collect the failures and throw an exception with a retryCause to trigger the retryPolicy on them here?

Something similar to this
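
A hedged sketch of what that could look like, assuming rejected executions (429s) are the retryable case; the calls are standard Elasticsearch APIs, but this is not the linked code:

// Assumes org.elasticsearch.ExceptionsHelper and org.elasticsearch.rest.RestStatus are imported.
val retryableFailures = bulkResponse.items.filter {
    it.isFailed && it.status() == RestStatus.TOO_MANY_REQUESTS
}
if (retryableFailures.isNotEmpty()) {
    // Rethrow with the first retryable failure as the cause so the retry policy kicks in.
    throw ExceptionsHelper.convertToElastic(retryableFailures.first().failure.cause)
}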

@bowenlan-amzn (Contributor, Author):

I think since this just removes metadata, we may not need to do that. @dbbaughe, any opinion?

val request = NodesInfoRequest().clear().addMetric("plugins")
client.execute(NodesInfoAction.INSTANCE, request, object : ActionListener<NodesInfoResponse> {
override fun onResponse(response: NodesInfoResponse) {
flag = false
Contributor:

We probably should not reset the flag at the start. What happens when:
The cluster has multiple versions when a new node is added, this logic executes, and the flag is set to true.
While the job is running, another node is added and this sets the flag back to false during the check, so the running job goes through. A small race condition, but it could happen.

@bowenlan-amzn (Contributor, Author):

That's some good thinking! I moved it to:

if (versionSet.size > 1) {
    flag = true
} else flag = false

@dbbaughe (Contributor):

Can you post examples of any API response additions/changes, and a history document example too?

PolicyRetryInfoMetaData.RETRY_INFO -> retryInfo = PolicyRetryInfoMetaData.parse(xcp)
TRANSITION_TO -> transitionTo = if (xcp.currentToken() == Token.VALUE_NULL) null else xcp.text()
StateMetaData.STATE -> {
// check null for invalid policy situation
Contributor:

Can you explain more about this situation?

@bowenlan-amzn (Contributor, Author):

I'm not sure what this comment means, but since we now save state metadata even when it's null, we need corresponding parsing to handle null state metadata.
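
For example, a minimal sketch mirroring the TRANSITION_TO null check above (field and variable names assumed from the surrounding parser, not the exact diff):

StateMetaData.STATE -> {
    // State metadata may have been persisted as an explicit null.
    state = if (xcp.currentToken() == Token.VALUE_NULL) null else StateMetaData.parse(xcp)
}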

private val listOfMetadata: MutableList<ManagedIndexMetaData> = mutableListOf()
private val listOfIndexToMetadata: MutableList<Pair<Index, ManagedIndexMetaData>> = mutableListOf()
private val mapOfItemIdToIndex: MutableMap<Int, Index> = mutableMapOf()
private lateinit var clusterState: ClusterState
Contributor:

Generic comment for whole file:

What happens when you have ManagedIndexMetadata from the cluster state and ManagedIndexMetadata from the config index (i.e. the migration started but failed on delete part)?

Also what happens when only cluster state has the ManagedIndexMetadata (i.e. migration hasn't happened yet)?

@bowenlan-amzn (Contributor, Author):

We now use the MetadataService to move metadata, and ISM relies only on metadata saved in (or moved to) the config index. If metadata is still in the cluster state, ISM waits until it has been moved to the config index.

updateManagedIndexConfigStartTime(managedIndexConfig)

// check no job has been run
wait { assertEquals(null, getExistingManagedIndexConfig(indexName).policy) }
Contributor:

So best case (in successful path) this just runs the assert block every 100ms for 10 seconds to verify nothing ran? Why not just sleep the thread for 10 seconds and assert it never ran instead of a new function etc.?
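
The suggested alternative would be roughly (timing value assumed from the comment):

// Sleep through the whole window, then assert the job never initialized a policy.
Thread.sleep(10_000)
assertEquals(null, getExistingManagedIndexConfig(indexName).policy)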

finishFlag = false; runningLock = false
return
}
if (counter++ > 2) {
Contributor:

It looks like this if check is watching for this code path to be hit a certain number of times. If I'm not mistaken, since counter++ is a postfix increment, it returns the original value and then increments, so we have to hit if (clusterStateMetadata.isEmpty()) 4 times in a row (the else sets counter back to 0) before we enter this if statement. Is there a reason that condition was chosen?
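
A tiny self-contained demonstration of the postfix semantics described above:

fun main() {
    var counter = 0
    // counter++ compares the old value first, then increments,
    // so the condition is first true on the fourth consecutive hit.
    repeat(4) { println(counter++ > 2) } // prints false, false, false, true
}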

@bowenlan-amzn (Contributor, Author):

I wanted to pick 3. No specific reason for that; I just want a way to cancel the scheduled job at some point. Please let me know if you have any suggestions.

@@ -138,6 +139,9 @@ class ManagedIndexCoordinator(
indexStateManagementEnabled = it
if (!indexStateManagementEnabled) disable() else enable()
}
clusterService.clusterSettings.addSettingsUpdateConsumer(METADATA_SERVICE_ENABLED) {
Contributor:

Seems like it keeps scheduling in the other functions even if disabled?

@bowenlan-amzn (Contributor, Author):

Yeah. I added a metadataServiceEnabled variable to save this setting, and use it to return early when it's false.
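
A minimal sketch of that guard, assuming the variable name from the comment (the surrounding method is illustrative):

@Volatile private var metadataServiceEnabled = true

private fun scheduleMetadataMove() {
    // Return early so nothing gets scheduled while the service is disabled.
    if (!metadataServiceEnabled) return
    // ...schedule the metadata move job...
}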
