
[Draft] Identify stats for remote store feature #6789

Closed
sachinpkale opened this issue Mar 22, 2023 · 12 comments
Labels
enhancement, Storage:Durability, v2.8.0

Comments

@sachinpkale
Member

sachinpkale commented Mar 22, 2023

This is a work in progress; we will keep adding more stats/metrics around the remote store as we identify them.

Goal

Get visibility into remote-store-related operations. These stats would help in debugging an issue or monitoring the cluster for potential problems. As data is ingested into a remote-store-backed index, a user would like to know whether segment and translog files are being uploaded successfully to the configured remote store, whether there are any failures, whether the remote store is lagging, etc.

Changes to existing APIs

  • Index Stats API response should provide remote_store and remote_translog stats similar to store and translog stats
  • Cat Segments API should take a query parameter to provide details of segments in remote store
  • Index Segments API should take a query parameter to provide details of segments in remote store
  • Cat Recovery API should provide details on the recovery from remote store and remote translog

New APIs

Cat Remote Store

  • Query Parameters
    • Index Name - required
    • Shard ID - optional
Remote Segment Store Stats
  1. number of segment files that are uploaded to remote segment store

    • Provides number of uploaded segments at the time of the API call
    • This metric will not consider inactive segments
  2. remote segment store lag with respect to local store

    • number of segments
      • Provides diff between number of segments on local and remote
      • This will be used to understand if remote store is in sync with local or not
    • size in bytes
      • Provides diff between size of segments on local and those uploaded to remote.
    • time in millis
      • Diff between the creation time of the last file created on local and the max creation time of files uploaded to the remote store
    • number of refresh checkpoints since the last successful upload
  3. timestamp of last successful file upload

  4. time taken to upload a segment file (total, avg, max, min, P90)

  5. time taken to delete a segment file (total, avg, max, min, P90)

  6. size of a segment file in bytes (avg, max, min, P90)

  7. total upload failures

  8. live/current upload failures

  9. total delete failures

  10. live/current delete failures

  11. total successful uploads

  12. total successful deletes

  13. time spent in remote store uploads during refresh (total, avg, max, min, P90)

Remote Translog Stats
  • Mostly the same as above (translog-specific stats will be added below); see the sketch below for how such per-shard stats might be tracked.
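To make the shape of these per-shard stats concrete, here is a minimal sketch of a holder for the segment upload counters above. All class and field names (e.g. RemoteSegmentUploadStats) are hypothetical and do not correspond to an existing OpenSearch class; the "live/current failures" field shows one possible interpretation (failures since the last success), which is still an open question in this thread.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-shard holder for the remote segment store stats listed above.
// All names are illustrative only; this is not an existing OpenSearch class or API.
public class RemoteSegmentUploadStats {

    private final AtomicLong uploadedSegmentFiles = new AtomicLong();   // stat 1: files uploaded so far
    private final AtomicLong totalUploadsSucceeded = new AtomicLong();  // stat 11
    private final AtomicLong totalUploadsFailed = new AtomicLong();     // stat 7
    private final AtomicLong currentUploadFailures = new AtomicLong();  // stat 8: assumed here to mean failures since the last success
    private final AtomicLong totalBytesUploaded = new AtomicLong();     // feeds size and transfer-rate style stats
    private volatile long lastSuccessfulUploadTimestampMillis;          // stat 3
    private volatile long refreshCheckpointsBehindRemote;               // lag: refresh checkpoints since the last successful upload
    private volatile long bytesBehindRemote;                            // lag: size in bytes

    public void onUploadSuccess(long fileSizeBytes, long timestampMillis) {
        uploadedSegmentFiles.incrementAndGet();
        totalUploadsSucceeded.incrementAndGet();
        totalBytesUploaded.addAndGet(fileSizeBytes);
        lastSuccessfulUploadTimestampMillis = timestampMillis;
        currentUploadFailures.set(0); // reset the "live" failure counter on a successful upload
    }

    public void onUploadFailure() {
        totalUploadsFailed.incrementAndGet();
        currentUploadFailures.incrementAndGet();
    }

    public void updateLag(long refreshCheckpointsBehind, long lagBytes) {
        this.refreshCheckpointsBehindRemote = refreshCheckpointsBehind;
        this.bytesBehindRemote = lagBytes;
    }

    public long getLastSuccessfulUploadTimestampMillis() {
        return lastSuccessfulUploadTimestampMillis;
    }

    public long getRefreshCheckpointsBehindRemote() {
        return refreshCheckpointsBehindRemote;
    }
}

The avg/max/min/P90 timing and size stats would need a histogram or reservoir on top of such counters; that is omitted here for brevity.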
sachinpkale added enhancement, untriaged, Storage:Durability, v2.7.0 and removed untriaged labels Mar 22, 2023
@ashking94
Member

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints. This also ties in with segment replication.

@sachinpkale
Member Author

sachinpkale commented Mar 29, 2023

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints.

By checkpoint, do you mean refresh checkpoint or checkpoint used in Segment Replication for publishing to replicas?

This also ties in with segment replication.

I am not sure why we want to tie it up with Segment Replication. Can you explain a scenario where this would be useful?

@ashking94
Member

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints.

By checkpoint, do you mean refresh checkpoint or checkpoint used in Segment Replication for publishing to replicas?

Refresh checkpoint. Since the refresh checkpoint and the publish checkpoint are the same today, I just called it a checkpoint.

This also ties in with segment replication.

I am not sure why we want to tie it up with Segment Replication. Can you explain a scenario where this would be useful?

Segment replication relies on the checkpoint lag today for defining stale replicas. We might want similar metrics for ease of understanding, and also to correlate remote segment uploads with the segment replication publish checkpoint on the stats front.

@sachinpkale
Member Author

Refresh checkpoint.

Makes sense to include the number of refreshes since the last successful upload. This will give a sense of progress on local.

Segment replication relies on the checkpoint lag today for defining stale replicas

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two. For example, what happens if the checkpoint publish fails? Does it fail the refresh, or is it retried on the next refresh?

@ashking94
Member

ashking94 commented Mar 29, 2023

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two.

The two stats would be different, but together they can be used to understand what is ultimately causing the delay in search freshness. For example, say the segment upload is lagging behind the local segments by "x" refresh checkpoints and segment replication is lagging behind local by "y" segrep checkpoints. Looking at "x" and "y" helps identify the source of the issue: whether it is the segment upload or the segrep segment download from the remote store.

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two.

I never said these should be the same; my point is that having these two stats (segrep checkpoint lag and remote store segment upload refresh checkpoint lag) can help give visibility into what could ultimately be causing the lag in search freshness on the replica.
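As a rough illustration of this reasoning, comparing the two lags points at the slower stage; the class and parameter names below are hypothetical and not an existing API.

// Hypothetical helper comparing the two lag stats; names are illustrative only.
// uploadRefreshCheckpointLag = "x": refresh checkpoints the remote segment upload is behind local
// segRepCheckpointLag        = "y": segrep checkpoints the replica is behind the primary
final class FreshnessLagAttribution {
    static String likelyBottleneck(long uploadRefreshCheckpointLag, long segRepCheckpointLag) {
        return uploadRefreshCheckpointLag >= segRepCheckpointLag
            ? "segment upload from the primary to the remote store"
            : "segment replication download on the replica";
    }
}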

For example, what happens if the checkpoint publish fails? Does it fail the refresh, or is it retried on the next refresh?

Today it looks to be harmless if the checkpoint publish fails:

public void handleException(TransportException e) {
    timer.stop();
    logger.trace("[shardId {}] Failed to publish checkpoint, timing: {}", indexShard.shardId().getId(), timer.time());
    task.setPhase("finished");
    taskManager.unregister(task);
    if (ExceptionsHelper.unwrap(
        e,
        NodeClosedException.class,
        IndexNotFoundException.class,
        AlreadyClosedException.class,
        IndexShardClosedException.class,
        ShardNotInPrimaryModeException.class
    ) != null) {
        // Node is shutting down or the index was deleted or the shard is closed
        return;
    }
    logger.warn(
        new ParameterizedMessage("{} segment replication checkpoint publishing failed", indexShard.shardId()),
        e
    );
}

The checkpoint would then probably be published on the next refresh.

@sachinpkale
Member Author

Added the number of refresh checkpoints since the last successful upload under the lag category.

@mgodwan
Member

mgodwan commented Apr 4, 2023

size of a segment file in bytes (avg, max, min, P90)

Will this include all the files, i.e. segments_N, si, cfs, etc.? Asking since the time may vary a lot based on file sizes, and averages may not make much sense.

live/current upload failures

What is the reference for "current" here?

  • Can we include sum stat for size and time so that we are able to get an idea of total data transferred off the node?
  • Also, it may be a good idea to determine the rate of transfer between the remote store and the node so that customers can decide whether they need instances with enough network bandwidth. Can we see if this value can be derived from the stats? (A rough sketch follows after this list.)
  • There will also be nodes downloading from the remote store (e.g. during recovery). Are we including stats for downloads as well?
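A minimal sketch of the derivation suggested above, assuming cumulative sum counters exist; the names below are hypothetical, not existing OpenSearch stats.

// Hypothetical derivation of an average transfer rate from the proposed sum stats.
// totalBytesUploaded and totalUploadTimeMillis are assumed cumulative counters.
final class TransferRate {
    static double averageUploadRateBytesPerSec(long totalBytesUploaded, long totalUploadTimeMillis) {
        if (totalUploadTimeMillis <= 0) {
            return 0.0; // no uploads recorded yet
        }
        return totalBytesUploaded / (totalUploadTimeMillis / 1000.0);
    }
}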

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

timestamp of last successful file upload

Should we also add timestamps for the last upload started and the last upload failed? This would help check whether uploads are failing completely after a certain point in time.

live/current upload failures

Is this more like a moving average?

time spent in remote store uploads during refresh (total, avg, max, min, P90)

Do we want an absolute value here, a percentage of the total refresh time, or both? I think the absolute upload time alone won't provide much insight (see the sketch below).
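A minimal sketch of the percentage variant, assuming cumulative counters for both values; the names are hypothetical.

// Hypothetical calculation of upload time as a share of total refresh time,
// given assumed cumulative counters for both; names are illustrative only.
final class RefreshUploadShare {
    static double uploadPercentOfRefresh(long uploadTimeDuringRefreshMillis, long totalRefreshTimeMillis) {
        if (totalRefreshTimeMillis <= 0) {
            return 0.0; // no refreshes recorded yet
        }
        return 100.0 * uploadTimeDuringRefreshMillis / totalRefreshTimeMillis;
    }
}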

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

Can we include sum stat for size and time so that we are able to get an idea of total data transferred off the node?

@sachinpkale @mgodwan Having node-level stats would be helpful. We would be able to easily identify whether any particular node is handling a large amount of remote store transfers.

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

@sachinpkale @gbbafna @ashking94 Another point that comes to mind while thinking about node-level stats: have we discussed distributing remote store shards/indices equally across nodes? Currently, can we end up with an uneven distribution of remote transfer load across nodes?

@DarshitChanpura
Member

Hey @sachinpkale, this issue will be marked for the next release, v2.8.0, on Apr 17 as that is the code-freeze date for v2.7.0. Please let me know if that should not be the case.

DarshitChanpura added v2.8.0 and removed v2.7.0 labels Apr 17, 2023
@DarshitChanpura
Member

Tagging it for next release: v2.8.0
