[Draft] Identify stats for remote store feature #6789
I am not sure why we want to tie it to Segment Replication. Can you explain a scenario where this would be useful?
Refresh checkpoint. Since the refresh checkpoint and the publish checkpoint are the same today, I just called it a checkpoint.
Segment replication relies on checkpoint lag today to define stale replicas. We might want similar metrics for ease of understanding, and also to correlate remote segment uploads with checkpoint publishing for segment replication in the stats.
It makes sense to include the number of refreshes since the last successful upload. This will give a sense of progress on the local side.
I assume this would be on the Segment Replication checkpoint. I am trying to explicitly differentiate the refresh checkpoint from the SegRep checkpoint, as I don't know the consistency semantics between the two. For example, what happens if the checkpoint publish fails? Does it fail the refresh, or is it retried on the next refresh?
The 2 stats would be different, but they can be used to find out what is ultimately causing the delay in search freshness. For example, say that segment upload is lagging behind the local segments by "x" refresh checkpoints and segment replication is lagging behind local by "y" SegRep checkpoints. In such cases, looking at "x" and "y" will help identify the source of the issue: the segment upload, or the SegRep download of segments from the remote store.
I never said these should be the same. My point is that having these 2 stats (SegRep checkpoint lag and remote store segment upload refresh checkpoint lag) gives visibility into what could ultimately be causing the lag in search freshness on the replica.
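The "x" versus "y" reasoning above can be sketched as a small decision helper. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical and not part of OpenSearch, and it assumes that replicas only obtain segments from the remote store, so a replica lag larger than the upload lag points at the download side.

```java
// Hypothetical helper, not an OpenSearch class.
final class FreshnessLagDiagnosis {

    enum Bottleneck { NONE, REMOTE_UPLOAD, REPLICA_DOWNLOAD }

    /**
     * @param uploadLag  "x": refresh checkpoints by which remote segment upload trails local
     * @param replicaLag "y": SegRep checkpoints by which the replica trails local
     */
    static Bottleneck diagnose(long uploadLag, long replicaLag) {
        if (uploadLag <= 0 && replicaLag <= 0) {
            return Bottleneck.NONE; // nothing is lagging
        }
        if (uploadLag > 0 && replicaLag <= uploadLag) {
            // Assumption: the replica is no further behind than the remote store,
            // so the upload is what holds back search freshness.
            return Bottleneck.REMOTE_UPLOAD;
        }
        // The replica is behind even what has already been uploaded.
        return Bottleneck.REPLICA_DOWNLOAD;
    }
}
```

Whether the two lags are directly comparable depends on the consistency semantics between refresh and SegRep checkpoints, which is still an open question in this thread.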
Today it looks to be harmless if checkpoint publish fails - Lines 154 to 173 in bd9b00d
The checkpoints would probably be published on the next refresh.
Added |
Will this include all the files, i.e. segments_N, si, cfs, etc.? I ask because the time may vary a lot with file size, so averages may not make much sense.
What is the reference for "current" here?
Should we also add the timestamps of the last upload started and the last upload failed? This will help check whether uploads are failing completely after a certain timestamp.
Is this more like a moving avg?
Do we want an absolute value here, a percentage of the total refresh time, or both? I think the absolute upload time alone won't provide much insight.
@sachinpkale @mgodwan Having node-level stats would be helpful. We would be able to easily identify whether any particular node is handling a large amount of remote store transfers.
@sachinpkale @gbbafna @ashking94 Another point that comes to mind while thinking about node-level stats: have we discussed distributing remote store shards/indices equally across nodes? Currently, can we end up with an uneven distribution of remote transfer load across nodes?
Hey @sachinpkale, This issue will be marked for next-release |
Tagging it for next release: |
Goal
Get visibility into remote store related operations. These stats would help in debugging issues and in monitoring the cluster for potential issues. As a user who starts ingesting data into a remote store backed index, I would like to know whether the segment and translog files are being uploaded successfully to the configured remote store, whether there are any failures, whether the remote store is lagging, etc.
Changes to existing APIs
remote_store and remote_translog stats, similar to store and translog stats
New APIs
Cat Remote Store
Remote Segment Store Stats
number of segment files that are uploaded to remote segment store
remote segment store lag with respect to local store
timestamp of last successful file upload
time taken to upload a segment file (total, avg, max, min, P90)
time taken to delete a segment file (total, avg, max, min, P90)
size of a segment file in bytes (avg, max, min, P90)
total upload failures
live/current upload failures
total delete failures
live/current delete failures
total successful uploads
total successful deletes
time spent in remote store uploads during refresh (total, avg, max, min, P90)
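The (total, avg, max, min, P90) aggregates listed above could be tracked roughly as follows. This is a minimal sketch, not the implementation proposed in this issue: `UploadTimeStats` and its methods are hypothetical names, P90 uses the nearest-rank method, and samples are kept in an unbounded list for simplicity. A real tracker would likely use a bounded window or a moving average.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not an OpenSearch class: aggregates per-file upload durations.
final class UploadTimeStats {
    private final List<Long> samplesMillis = new ArrayList<>();
    private long totalMillis = 0;
    private long totalFailures = 0;

    // Record one successful upload and how long it took.
    void recordSuccess(long tookMillis) {
        samplesMillis.add(tookMillis);
        totalMillis += tookMillis;
    }

    // Record one failed upload.
    void recordFailure() {
        totalFailures++;
    }

    long successfulUploads() { return samplesMillis.size(); }
    long failures() { return totalFailures; }
    long total() { return totalMillis; }

    double avg() {
        return samplesMillis.isEmpty() ? 0.0 : (double) totalMillis / samplesMillis.size();
    }

    long max() { return samplesMillis.isEmpty() ? 0 : Collections.max(samplesMillis); }
    long min() { return samplesMillis.isEmpty() ? 0 : Collections.min(samplesMillis); }

    // Nearest-rank P90: the smallest sample such that at least 90% of samples are <= it.
    long p90() {
        if (samplesMillis.isEmpty()) {
            return 0;
        }
        List<Long> sorted = new ArrayList<>(samplesMillis);
        Collections.sort(sorted);
        int n = sorted.size();
        int rank = (int) ((9L * n + 9) / 10); // ceil(0.9 * n) in integer arithmetic, 1-based
        return sorted.get(rank - 1);
    }
}
```

The same shape would apply to delete times and file sizes. As noted in the comments, averages across files of very different sizes (segments_N, si, cfs) may be of limited use, which is one reason to expose max and P90 alongside avg.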
Remote Translog stats