
[Draft] Identify stats for remote store feature #6789

Closed
sachinpkale opened this issue Mar 22, 2023 · 12 comments
Labels
enhancement, Storage:Durability, v2.8.0

Comments

@sachinpkale
Member

sachinpkale commented Mar 22, 2023

This is a work in progress; we will keep adding more stats/metrics around the remote store as we identify them.

Goal

Get visibility into remote-store-related operations. These stats would help in debugging an issue or monitoring the cluster for potential problems. As data is ingested into a remote-store-backed index, a user would like to know whether segment and translog files are being uploaded successfully to the configured remote store, whether there are any failures, whether the remote store is lagging, etc.

Changes to existing APIs

  • Index Stats API response should provide remote_store and remote_translog stats similar to store and translog stats
  • Cat Segments API should take a query parameter to provide details of segments in remote store
  • Index Segments API should take a query parameter to provide details of segments in remote store
  • Cat Recovery API should provide details on the recovery from remote store and remote translog

New APIs

Cat Remote Store

  • Query Parameters
    • Index Name - required
    • Shard ID - optional
Remote Segment Store Stats
  1. number of segment files that are uploaded to remote segment store

    • Provides number of uploaded segments at the time of the API call
    • This metric will not consider inactive segments
  2. remote segment store lag with respect to local store

    • number of segments
      • Provides diff between number of segments on local and remote
      • This will be used to understand if remote store is in sync with local or not
    • size in bytes
      • Provides diff between size of segments on local and those uploaded to remote.
    • time in millis
      • Diff between the creation time of the last file created on local and the max creation time of files uploaded to the remote store
    • number of refresh checkpoints since the last successful upload
  3. timestamp of last successful file upload

  4. time taken to upload a segment file (total, avg, max, min, P90)

  5. time taken to delete a segment file (total, avg, max, min, P90)

  6. size of a segment file in bytes (avg, max, min, P90)

  7. total upload failures

  8. live/current upload failures

  9. total delete failures

  10. live/current delete failures

  11. total successful uploads

  12. total successful deletes

  13. time spent in remote store uploads during refresh (total, avg, max, min, P90)

Remote Translog Stats
  • Mostly the same as above (translog-specific stats will be added below); see the sketch below for how such per-shard stats might be tracked.
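To make the shape of these per-shard stats concrete, here is a minimal sketch of a holder for the segment upload counters above. All class and field names (e.g. RemoteSegmentUploadStats) are hypothetical and do not correspond to an existing OpenSearch class; the "live/current failures" field shows one possible interpretation (failures since the last success), which is still an open question in this thread.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-shard holder for the remote segment store stats listed above.
// All names are illustrative only; this is not an existing OpenSearch class or API.
public class RemoteSegmentUploadStats {

    private final AtomicLong uploadedSegmentFiles = new AtomicLong();   // stat 1: files uploaded so far
    private final AtomicLong totalUploadsSucceeded = new AtomicLong();  // stat 11
    private final AtomicLong totalUploadsFailed = new AtomicLong();     // stat 7
    private final AtomicLong currentUploadFailures = new AtomicLong();  // stat 8: assumed here to mean failures since the last success
    private final AtomicLong totalBytesUploaded = new AtomicLong();     // feeds size and transfer-rate style stats
    private volatile long lastSuccessfulUploadTimestampMillis;          // stat 3
    private volatile long refreshCheckpointsBehindRemote;               // lag: refresh checkpoints since the last successful upload
    private volatile long bytesBehindRemote;                            // lag: size in bytes

    public void onUploadSuccess(long fileSizeBytes, long timestampMillis) {
        uploadedSegmentFiles.incrementAndGet();
        totalUploadsSucceeded.incrementAndGet();
        totalBytesUploaded.addAndGet(fileSizeBytes);
        lastSuccessfulUploadTimestampMillis = timestampMillis;
        currentUploadFailures.set(0); // reset the "live" failure counter on a successful upload
    }

    public void onUploadFailure() {
        totalUploadsFailed.incrementAndGet();
        currentUploadFailures.incrementAndGet();
    }

    public void updateLag(long refreshCheckpointsBehind, long lagBytes) {
        this.refreshCheckpointsBehindRemote = refreshCheckpointsBehind;
        this.bytesBehindRemote = lagBytes;
    }

    public long getLastSuccessfulUploadTimestampMillis() {
        return lastSuccessfulUploadTimestampMillis;
    }

    public long getRefreshCheckpointsBehindRemote() {
        return refreshCheckpointsBehindRemote;
    }
}

The avg/max/min/P90 timing and size stats would need a histogram or reservoir on top of such counters; that is omitted here for brevity.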
sachinpkale added enhancement, untriaged, Storage:Durability, v2.7.0 and removed untriaged labels Mar 22, 2023
@ashking94
Member

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints. This also ties in with segment replication.

@sachinpkale
Member Author

sachinpkale commented Mar 29, 2023

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints.

By checkpoint, do you mean refresh checkpoint or checkpoint used in Segment Replication for publishing to replicas?

This also ties in with segment replication.

I am not sure why we want to tie it up with Segment Replication. Can you explain a scenario where this would be useful?

@ashking94
Member

for remote segment store lag with respect to local store, we can also have stats around how much remote store is lagging behind the local store in terms of "N" checkpoints.

By checkpoint, do you mean refresh checkpoint or checkpoint used in Segment Replication for publishing to replicas?

Refresh checkpoint. Since the refresh checkpoint and the publish checkpoint are the same today, I just called it a checkpoint.

This also ties in with segment replication.

I am not sure why we want to tie it up with Segment Replication. Can you explain a scenario where this would be useful?

Segment replication relies on the checkpoint lag today for defining stale replicas. We might want similar metrics for ease of understanding, and also to correlate remote segment uploads with the segment replication publish checkpoint on the stats front.

@sachinpkale
Member Author

Refresh checkpoint.

Makes sense to include the number of refreshes since the last successful upload. This will give a sense of progress on local.

Segment replication relies on the checkpoint lag today for defining stale replicas

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two. For example, what happens if the checkpoint publish fails? Does it fail the refresh, or is it retried on the next refresh?

@ashking94
Member

ashking94 commented Mar 29, 2023

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two.

The two stats would be different, but together they can be used to understand what is ultimately causing the delay in search freshness. For example, say the segment upload is lagging behind the local segments by "x" refresh checkpoints and segment replication is lagging behind local by "y" segrep checkpoints. Looking at "x" and "y" helps identify the source of the issue: whether it is the segment upload or the segrep segment download from the remote store.

I assume this would be on the Segment replication checkpoint. I am trying to explicitly differentiate refresh and SegRep checkpoint as I don't know the consistency semantics between these two.

I never said these should be the same; my point is that having these two stats (segrep checkpoint lag and remote store segment upload refresh checkpoint lag) can help give visibility into what could ultimately be causing the lag in search freshness on the replica.
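As a rough illustration of this reasoning, comparing the two lags points at the slower stage; the class and parameter names below are hypothetical and not an existing API.

// Hypothetical helper comparing the two lag stats; names are illustrative only.
// uploadRefreshCheckpointLag = "x": refresh checkpoints the remote segment upload is behind local
// segRepCheckpointLag        = "y": segrep checkpoints the replica is behind the primary
final class FreshnessLagAttribution {
    static String likelyBottleneck(long uploadRefreshCheckpointLag, long segRepCheckpointLag) {
        return uploadRefreshCheckpointLag >= segRepCheckpointLag
            ? "segment upload from the primary to the remote store"
            : "segment replication download on the replica";
    }
}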

For example, what happens if the checkpoint publish fails? Does it fail the refresh, or is it retried on the next refresh?

Today it looks to be harmless if the checkpoint publish fails:

public void handleException(TransportException e) {
    timer.stop();
    logger.trace("[shardId {}] Failed to publish checkpoint, timing: {}", indexShard.shardId().getId(), timer.time());
    task.setPhase("finished");
    taskManager.unregister(task);
    if (ExceptionsHelper.unwrap(
        e,
        NodeClosedException.class,
        IndexNotFoundException.class,
        AlreadyClosedException.class,
        IndexShardClosedException.class,
        ShardNotInPrimaryModeException.class
    ) != null) {
        // Node is shutting down or the index was deleted or the shard is closed
        return;
    }
    logger.warn(
        new ParameterizedMessage("{} segment replication checkpoint publishing failed", indexShard.shardId()),
        e
    );
}

The checkpoint would then probably be published on the next refresh.

@sachinpkale
Member Author

Added the number of refresh checkpoints since the last successful upload under the lag category.

@mgodwan
Member

mgodwan commented Apr 4, 2023

size of a segment file in bytes (avg, max, min, P90)

Will this include all the files, i.e. segments_N, si, cfs, etc.? Asking since the time may vary a lot based on file sizes, and averages may not make much sense.

live/current upload failures

What is the reference for "current" here?

  • Can we include sum stat for size and time so that we are able to get an idea of total data transferred off the node?
  • Also, it may be a good idea to determine the rate of transfer between the remote store and the node so that customers can decide whether they need instances with enough network bandwidth. Can we see if this value can be derived from the stats? (A rough sketch follows after this list.)
  • There will also be nodes downloading from the remote store (e.g. during recovery). Are we including stats for downloads as well?
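A minimal sketch of the derivation suggested above, assuming cumulative sum counters exist; the names below are hypothetical, not existing OpenSearch stats.

// Hypothetical derivation of an average transfer rate from the proposed sum stats.
// totalBytesUploaded and totalUploadTimeMillis are assumed cumulative counters.
final class TransferRate {
    static double averageUploadRateBytesPerSec(long totalBytesUploaded, long totalUploadTimeMillis) {
        if (totalUploadTimeMillis <= 0) {
            return 0.0; // no uploads recorded yet
        }
        return totalBytesUploaded / (totalUploadTimeMillis / 1000.0);
    }
}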

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

timestamp of last successful file upload

Should we also add timestamps for the last upload started and the last upload failed? This would help check whether uploads are failing completely after a certain point in time.

live/current upload failures

Is this more like a moving average?

time spent in remote store uploads during refresh (total, avg, max, min, P90)

Do we want an absolute value here, a percentage of the total refresh time, or both? I think the absolute upload time alone won't provide much insight (see the sketch below).
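A minimal sketch of the percentage variant, assuming cumulative counters for both values; the names are hypothetical.

// Hypothetical calculation of upload time as a share of total refresh time,
// given assumed cumulative counters for both; names are illustrative only.
final class RefreshUploadShare {
    static double uploadPercentOfRefresh(long uploadTimeDuringRefreshMillis, long totalRefreshTimeMillis) {
        if (totalRefreshTimeMillis <= 0) {
            return 0.0; // no refreshes recorded yet
        }
        return 100.0 * uploadTimeDuringRefreshMillis / totalRefreshTimeMillis;
    }
}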

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

Can we include sum stat for size and time so that we are able to get an idea of total data transferred off the node?

@sachinpkale @mgodwan Having node-level stats would be helpful. We would be able to easily identify whether any particular node is handling a large amount of remote store transfers.

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

@sachinpkale @gbbafna @ashking94 Another point that comes to mind while thinking about node-level stats: have we discussed distributing remote store shards/indices equally across nodes? Currently, can we end up with an uneven distribution of remote transfer load across nodes?

@DarshitChanpura
Member

Hey @sachinpkale, this issue will be marked for the next release, v2.8.0, on Apr 17 as that is the code-freeze date for v2.7.0. Please let me know if that should not be the case.

DarshitChanpura added v2.8.0 and removed v2.7.0 labels Apr 17, 2023
@DarshitChanpura
Member

Tagging it for next release: v2.8.0
