
Move metadata #280

Merged Mar 8, 2021 (22 commits)

Conversation

@bowenlan-amzn (Contributor) commented Aug 7, 2020

Issue #, if available:
#207

Description of changes:

  • Move ManagedIndexMetadata from being saved in the cluster state to being saved in the config index.
    • Use a flag to stop new nodes from running jobs while an old node still exists in the cluster (see the sketch after this list): this ensures metadata in the cluster state is never newer than metadata in the config index, so we can safely move metadata from the cluster state to the config index once every node in the cluster runs this version of ISM.
  • Expose step metadata in the Explain API.
    • Some tests now deliberately check or wait for step metadata that shows the step status.
  • Refactor some variable names in the Retry and ChangePolicy actions to be more meaningful to me :)
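
A minimal sketch of the version-check idea behind that flag, assuming a listener on cluster change events; the class and field names are illustrative, not the exact plugin code (the plugin checks node info for plugin versions, as discussed in the review below):

import org.elasticsearch.cluster.ClusterChangedEvent
import org.elasticsearch.cluster.ClusterStateListener

class SkipExecutionSketch : ClusterStateListener {
    // When true, this (new) node skips job execution, because an older node
    // may still be writing metadata into the cluster state.
    @Volatile var flag: Boolean = false
        private set

    override fun clusterChanged(event: ClusterChangedEvent) {
        if (event.nodesChanged() || event.isNewCluster) {
            val versions = event.state().nodes().map { it.version }.toSet()
            // More than one version in the cluster means a mixed cluster.
            flag = versions.size > 1
        }
    }
}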

Tests

MetadataRegressionIT

  1. Deliberately add some metadata to the cluster state, then verify it can be moved to the config index and the job can pick that metadata up.
  2. Set up a mixed cluster with one old node and one new node, and test the SkipExecution flag.
    • Run it with: ./gradlew mixedCluster --tests "com.amazon.opendistroforelasticsearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test new node skip execution when old node exist in cluster" -PnumNodes=2 -DmixCluster=true

SkipExecutionTests

  1. Mock cluster changed events to test setting of the SkipExecution flag.

Example Response

Metadata saved in the config index

"managed_index_metadata": {
    "index": "test_index",
    "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
    "policy_id": "policy_1",
    "policy_seq_no": 0,
    "policy_primary_term": 1,
    "policy_completed": false,
    "rolled_over": false,
    "transition_to": null,
    "state": {
        "name": "warm",
        "start_time": 1613015158272
    },
    "action": {
        "name": "replica_count",
        "start_time": 1613015217265,
        "index": 0,
        "failed": false,
        "consumed_retries": 0,
        "last_retry_time": 0
    },
    "step": {
        "name": "attempt_set_replica_count",
        "start_time": 1613015218330,
        "step_status": "completed"
    },
    "retry_info": {
        "failed": false,
        "consumed_retries": 0
    },
    "info": {
        "message": "Successfully set number_of_replicas to 0 [index=test_index]"
    }
}

Explain API output

{
  "test_index": {
    "index.opendistro.index_state_management.policy_id": "policy_1",
    "index": "test_index",
    "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
    "policy_id": "policy_1",
    "policy_seq_no": 0,
    "policy_primary_term": 1,
    "state": {
      "name": "warm",
      "start_time": 1613015158272
    },
    "action": {
      "name": "replica_count",
      "start_time": 1613015217265,
      "index": 0,
      "failed": false,
      "consumed_retries": 0,
      "last_retry_time": 0
    },
    "step": {
      "name": "attempt_set_replica_count",
      "start_time": 1613015218330,
      "step_status": "completed"
    },
    "retry_info": {
      "failed": false,
      "consumed_retries": 0
    },
    "info": {
      "message": "Successfully set number_of_replicas to 0 [index=test_index]"
    }
  }
}

History index content

    "hits": [
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-L0yj3cBlTj35GPfFMW_",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "state": {
              "name": "warm",
              "start_time": 1613015158272
            },
            "retry_info": {
              "failed": false,
              "consumed_retries": 0
            },
            "info": {
              "message": "Successfully initialized policy: policy_1"
            },
            "history_timestamp": 1613015159998
          }
        }
      },
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-b0yj3cBlTj35GPf_MUM",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "state": {
              "name": "warm",
              "start_time": 1613015158272
            },
            "action": {
              "name": "replica_count",
              "start_time": 1613015217265,
              "index": 0,
              "failed": false,
              "consumed_retries": 0,
              "last_retry_time": 0
            },
            "step": {
              "name": "attempt_set_replica_count",
              "start_time": 1613015218330,
              "step_status": "completed"
            },
            "retry_info": {
              "failed": false,
              "consumed_retries": 0
            },
            "info": {
              "message": "Successfully set number_of_replicas to 0 [index=test_index]"
            },
            "history_timestamp": 1613015219212
          }
        }
      },
      {
        "_index": ".opendistro-ism-managed-index-history-2021.02.11-1",
        "_type": "_doc",
        "_id": "-r0zj3cBlTj35GPf48Wa",
        "_score": 1.0,
        "_source": {
          "managed_index_meta_data": {
            "index": "test_index",
            "index_uuid": "WfU-3nzQQ6CbHsph0wn3UQ",
            "policy_id": "policy_1",
            "policy_seq_no": 0,
            "policy_primary_term": 1,
            "policy_completed": true,
            "history_timestamp": 1613015278490
          }
        }
      }
    ]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov (codecov bot) commented Aug 7, 2020

Codecov Report

Merging #280 (56252bb) into main (34735cf) will decrease coverage by 0.32%.
The diff coverage is 69.98%.


@@             Coverage Diff              @@
##               main     #280      +/-   ##
============================================
- Coverage     77.39%   77.07%   -0.33%     
- Complexity     1509     1550      +41     
============================================
  Files           196      198       +2     
  Lines          7831     8217     +386     
  Branches       1246     1317      +71     
============================================
+ Hits           6061     6333     +272     
- Misses         1095     1172      +77     
- Partials        675      712      +37     
Impacted Files Coverage Δ Complexity Δ
...ndexstatemanagement/IndexStateManagementHistory.kt 80.00% <ø> (+1.73%) 27.00 <0.00> (+1.00)
...exstatemanagement/resthandler/RestExplainAction.kt 100.00% <ø> (ø) 4.00 <0.00> (ø)
...sport/action/addpolicy/TransportAddPolicyAction.kt 71.60% <0.00%> (-0.90%) 3.00 <0.00> (ø)
...asticsearch/indexmanagement/rollup/model/Rollup.kt 89.17% <ø> (ø) 53.00 <0.00> (ø)
...icsearch/indexmanagement/IndexManagementIndices.kt 58.75% <40.00%> (-2.68%) 10.00 <1.00> (+1.00) ⬇️
...action/removepolicy/TransportRemovePolicyAction.kt 68.42% <55.55%> (-2.60%) 2.00 <0.00> (ø)
...agement/indexstatemanagement/ManagedIndexRunner.kt 55.00% <55.73%> (-0.66%) 46.00 <6.00> (+5.00) ⬇️
...nt/indexstatemanagement/ManagedIndexCoordinator.kt 74.45% <61.19%> (-2.75%) 47.00 <4.00> (+4.00) ⬇️
...management/indexstatemanagement/MetadataService.kt 62.36% <62.36%> (ø) 12.00 <12.00> (?)
...gedindex/TransportRetryFailedManagedIndexAction.kt 72.97% <64.00%> (-2.27%) 2.00 <0.00> (ø)
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bowenlan-amzn added the enhancement label Aug 10, 2020
add routing to get request
Base automatically changed from master to main February 9, 2021 07:16
@@ -48,11 +48,12 @@ data class StepMetaData(
}

override fun toXContent(builder: XContentBuilder, params: ToXContent.Params): XContentBuilder {
return builder.startObject(STEP)
@bowenlan-amzn (Contributor, Author):
Previously we didn't render the step metadata as JSON. The change here makes it consistent with the ActionMetadata code (code link).
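
For illustration, a hedged sketch of what the full method might look like, assuming field constants NAME, START_TIME, and STEP_STATUS that correspond to the JSON in the description (a sketch, not the exact diff):

override fun toXContent(builder: XContentBuilder, params: ToXContent.Params): XContentBuilder {
    // Render step metadata as a nested "step" object, mirroring ActionMetaData.
    return builder.startObject(STEP)
        .field(NAME, name)
        .field(START_TIME, startTime)
        .field(STEP_STATUS, stepStatus.toString())
        .endObject()
}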

@dbbaughe (Contributor):

Moving this comment to this PR from the other one.

High-level comment: I'm worried about perf issues from moving this into the config index. Job Scheduler has a postIndex subscriber on the config index that executes every time a doc is added or updated. Imagine someone with 10k indices running every 1 minute: that's 20k updates per minute (each ISM execution does 2 metadata updates, for starting and closing a transaction), all triggering that callback, which then has to check whether the doc is a job type to schedule (which it isn't). Alternatively, we could consider another ISM index purely for metadata, which wouldn't have this listener attached; both rollup and ISM metadata could live there instead of the config index. Some testing would be good to see if this is a real concern.
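
To make the concern concrete, a rough sketch of the kind of listener being described, using Elasticsearch's IndexingOperationListener interface (illustrative only, not the Job Scheduler source):

import org.elasticsearch.index.engine.Engine
import org.elasticsearch.index.shard.IndexingOperationListener
import org.elasticsearch.index.shard.ShardId

class ConfigIndexSweeperSketch : IndexingOperationListener {
    override fun postIndex(shardId: ShardId, index: Engine.Index, result: Engine.IndexResult) {
        // Fires for every write to the config index. With metadata stored here,
        // each ISM metadata update also lands in this callback, which must
        // inspect the doc only to discover it is not a schedulable job.
    }
}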

val bulkRequest = BulkRequest().add(deleteRequests)
val bulkResponse: BulkResponse = client.suspendUntil { bulk(bulkRequest, it) }
bulkResponse.forEach {
if (it.isFailed) logger.error("Failed to clear ManagedIndexMetadata for [index=${it.index}]. " +
Contributor:

Are there any particular failure reasons in which we want to collect the failures and throw an exception with a retryCause to trigger the retryPolicy on them here?

Something similar to this
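
A hedged sketch of what that could look like, assuming rejected executions (429s) are the retryable case; the calls are standard Elasticsearch APIs, but this is not the linked code:

// Assumes org.elasticsearch.ExceptionsHelper and org.elasticsearch.rest.RestStatus are imported.
val retryableFailures = bulkResponse.items.filter {
    it.isFailed && it.status() == RestStatus.TOO_MANY_REQUESTS
}
if (retryableFailures.isNotEmpty()) {
    // Rethrow with the first retryable failure as the cause so the retry policy kicks in.
    throw ExceptionsHelper.convertToElastic(retryableFailures.first().failure.cause)
}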

@bowenlan-amzn (Contributor, Author):

I think since this just removes metadata, we may not need to do that. @dbbaughe, any opinion?

val request = NodesInfoRequest().clear().addMetric("plugins")
client.execute(NodesInfoAction.INSTANCE, request, object : ActionListener<NodesInfoResponse> {
override fun onResponse(response: NodesInfoResponse) {
flag = false
Contributor:

We probably should not reset the flag at the start. What happens when:
The cluster has multiple versions when a new node is added, this logic executes, and the flag is set to true.
While the job is running, another node is added and this sets the flag back to false during the check, so the running job goes through. A small race condition, but it could happen.

@bowenlan-amzn (Contributor, Author):

That's some good thinking! I moved it to:

if (versionSet.size > 1) {
    flag = true
} else flag = false

@dbbaughe (Contributor):

Can you post examples of any API response additions/changes, and a history document example too?

PolicyRetryInfoMetaData.RETRY_INFO -> retryInfo = PolicyRetryInfoMetaData.parse(xcp)
TRANSITION_TO -> transitionTo = if (xcp.currentToken() == Token.VALUE_NULL) null else xcp.text()
StateMetaData.STATE -> {
// check null for invalid policy situation
Contributor:

Can you explain more about this situation?

@bowenlan-amzn (Contributor, Author):

I'm not sure what this comment means, but since we now save state metadata even when it's null, we need corresponding parsing to handle null state metadata.
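
For example, a minimal sketch mirroring the TRANSITION_TO null check above (field and variable names assumed from the surrounding parser, not the exact diff):

StateMetaData.STATE -> {
    // State metadata may have been persisted as an explicit null.
    state = if (xcp.currentToken() == Token.VALUE_NULL) null else StateMetaData.parse(xcp)
}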

private val listOfMetadata: MutableList<ManagedIndexMetaData> = mutableListOf()
private val listOfIndexToMetadata: MutableList<Pair<Index, ManagedIndexMetaData>> = mutableListOf()
private val mapOfItemIdToIndex: MutableMap<Int, Index> = mutableMapOf()
private lateinit var clusterState: ClusterState
Contributor:

Generic comment for whole file:

What happens when you have ManagedIndexMetadata from the cluster state and ManagedIndexMetadata from the config index (i.e. the migration started but failed on delete part)?

Also what happens when only cluster state has the ManagedIndexMetadata (i.e. migration hasn't happened yet)?

@bowenlan-amzn (Contributor, Author):

We now use the MetadataService to move metadata, and ISM relies only on metadata saved in (or moved to) the config index. If metadata is still in the cluster state, ISM waits until it has been moved to the config index.

updateManagedIndexConfigStartTime(managedIndexConfig)

// check no job has been run
wait { assertEquals(null, getExistingManagedIndexConfig(indexName).policy) }
Contributor:

So best case (in successful path) this just runs the assert block every 100ms for 10 seconds to verify nothing ran? Why not just sleep the thread for 10 seconds and assert it never ran instead of a new function etc.?
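
The suggested alternative would be roughly (timing value assumed from the comment):

// Sleep through the whole window, then assert the job never initialized a policy.
Thread.sleep(10_000)
assertEquals(null, getExistingManagedIndexConfig(indexName).policy)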

finishFlag = false; runningLock = false
return
}
if (counter++ > 2) {
Contributor:

It looks like this if check is watching for this code path to be hit a certain number of times. If I'm not mistaken, since counter++ is a postfix increment, it returns the original value and then increments, so we have to hit if (clusterStateMetadata.isEmpty()) 4 times in a row (the else sets counter back to 0) before we enter this if statement. Is there a reason that condition was chosen?
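
A tiny self-contained demonstration of the postfix semantics described above:

fun main() {
    var counter = 0
    // counter++ compares the old value first, then increments,
    // so the condition is first true on the fourth consecutive hit.
    repeat(4) { println(counter++ > 2) } // prints false, false, false, true
}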

@bowenlan-amzn (Contributor, Author):

I wanted to pick 3. No specific reason for that; I just want a way to cancel the scheduled job at some point. Please let me know if you have any suggestions.

@@ -138,6 +139,9 @@ class ManagedIndexCoordinator(
indexStateManagementEnabled = it
if (!indexStateManagementEnabled) disable() else enable()
}
clusterService.clusterSettings.addSettingsUpdateConsumer(METADATA_SERVICE_ENABLED) {
Contributor:

Seems like it keeps scheduling in the other functions even if disabled?

@bowenlan-amzn (Contributor, Author):

Yeah. I added a metadataServiceEnabled variable to save this setting, and use it to return early when it's false.
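
A minimal sketch of that guard, assuming the variable name from the comment (the surrounding method is illustrative):

@Volatile private var metadataServiceEnabled = true

private fun scheduleMetadataMove() {
    // Return early so nothing gets scheduled while the service is disabled.
    if (!metadataServiceEnabled) return
    // ...schedule the metadata move job...
}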
