[RFC] [Remote Store] Remote Store Stats API #7153

Closed
linuxpi opened this issue Apr 14, 2023 · 9 comments · Fixed by #7441
Labels: enhancement, RFC, Storage:Durability

Comments

@linuxpi
Collaborator

linuxpi commented Apr 14, 2023

Is your feature request related to a problem? Please describe.
#6789 proposes adding stats related to Remote Store, which will help us see how Remote Store enabled indices are performing. We need a REST API to expose these stats.

The scope of this issue is to add the required API(s) for exposing Remote Store related stats.

Describe the solution you'd like

GET /_remote_store/stats/<index>/<shardId>
{
      "shard_id" : "[my-index-1][0]",
      "local_refresh_timestamp_in_millis" : 196439653,
      "local_refresh_cumulative_count" : 0,
      "remote_refresh_timestamp_in_millis" : 196439653,
      "remote_refresh_cumulative_count" : 0,
      "bytes_lag" : 0,
      "rejection_count" : 0,
      "consecutive_failure_count" : 0,
      "total_remote_refresh" : {
        "started" : 0,
        "succeeded" : 0,
        "failed" : 0
      },
      "total_uploads_in_bytes" : {
        "started" : 0,
        "succeeded" : 0,
        "failed" : 0
      },
      "remote_refresh_size_in_bytes" : {
        "last_successful" : 0,
        "moving_avg" : 0.0
      },
      "upload_latency_in_bytes_per_sec" : {
        "moving_avg" : 0.0
      },
      "remote_refresh_latency_in_nanos" : {
        "moving_avg" : 0.0
      }
}
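
As an illustration of how a client might read this response, here is a minimal Python sketch. It assumes that the local_* and remote_* fields describe the latest local refresh and the latest refresh uploaded to the remote store; the helper name refresh_lag and the derived keys are hypothetical and not part of the proposal.

```python
import json

# Trimmed copy of the sample shard-level response from the proposal.
SAMPLE = """
{
  "shard_id": "[my-index-1][0]",
  "local_refresh_timestamp_in_millis": 196439653,
  "local_refresh_cumulative_count": 0,
  "remote_refresh_timestamp_in_millis": 196439653,
  "remote_refresh_cumulative_count": 0,
  "bytes_lag": 0
}
"""


def refresh_lag(stats):
    """Derive how far the remote store trails the local shard.

    Assumption: subtracting the remote_* value from the matching local_*
    value is a meaningful lag measure. The proposal does not state this.
    """
    return {
        "shard_id": stats["shard_id"],
        "time_lag_in_millis": stats["local_refresh_timestamp_in_millis"]
        - stats["remote_refresh_timestamp_in_millis"],
        "refreshes_behind": stats["local_refresh_cumulative_count"]
        - stats["remote_refresh_cumulative_count"],
        "bytes_lag": stats["bytes_lag"],
    }


lag = refresh_lag(json.loads(SAMPLE))
print(lag)
```

For the sample values above every derived lag is zero, since the local and remote fields are identical.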

Index Level Stats

GET /_remote_store/stats/<index>
[
  {
     "shard_id": <>,
     ...
  },
  {
     "shard_id": <>,
     ...
  }
  ...
]
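
A rough sketch of how the index-level list could be consumed, for example to find shards whose remote copy is behind. It assumes each element carries the same fields as the shard-level response, including shard_id and bytes_lag; the function name lagging_shards, the threshold parameter, and the sample values are made up for illustration.

```python
def lagging_shards(index_stats, max_bytes_lag=0):
    """Return the ids of shards whose bytes_lag exceeds the threshold.

    Assumption: every element has "shard_id" and "bytes_lag" keys, as in
    the shard-level example.
    """
    return [
        shard["shard_id"]
        for shard in index_stats
        if shard["bytes_lag"] > max_bytes_lag
    ]


# Illustrative data only, not real output from a cluster.
sample_index_stats = [
    {"shard_id": "[my-index-1][0]", "bytes_lag": 0},
    {"shard_id": "[my-index-1][1]", "bytes_lag": 2048},
]

behind = lagging_shards(sample_index_stats)
print(behind)
```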

@linuxpi linuxpi added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 14, 2023
@minalsha minalsha added RFC Issues requesting major changes and removed untriaged labels Apr 14, 2023
@ashking94
Member

@linuxpi Couple of points -

  • We can add a field bytes_behind that reports how many bytes the remote store is lagging behind the local store.
  • We could allow passing '*' as the index value in the stats API, /cat/remote_store/{index}. This would fetch stats for all shards in one API call.
  • We could have the API return the total sum across all shards in the cluster.
  • We would also want the information aggregated per node. This would show the outgoing traffic to the remote store and serve as feedback for reallocating shards across nodes manually or programmatically.
  • Currently each remote-backed index lets the user set the translog and segments repositories. We should also aggregate at the repository level, which would give insight into when a repository is not behaving as usual.

@sachinpkale
Member

@linuxpi Can you please also provide details around the API permissions?

@sachinpkale sachinpkale added the Storage:Durability Issues and PRs related to the durability framework label Apr 24, 2023
@linuxpi
Collaborator Author

linuxpi commented Apr 25, 2023

@ashking94

We can add a field bytes_behind which gives info on how many bytes are we lagging behind the local store.

Yes, updated the structure.

We could allow passing '*' as the value in the stats api - /cat/remote_store/{index". This will allow to fetch stats for all the shards in one api call.

Not sure this is a good idea. If a cluster has many shards, the API response becomes huge.

We could have the api return total sum across all shards present in the cluster.

Sum of all metrics? I don't think every metric would make sense when summed.

We would want the aggregated information per node basis as well. This will provide data on the outgoing traffic to remote store and be a feedback for reallocating the shard across nodes manually/programmatically.

Node-level aggregation would be very useful, but I am planning to add it incrementally.

Currently each remote-backed index allows user to set translog and segments repository. We should also have aggregate on repository level. This will give insights on when a repository is not acting as usual.

That's a good point. We can implement all aggregate-level metrics incrementally: cluster, node, and repository level.
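
To make the point about summing concrete, here is a small sketch of an aggregation that adds up only counter-like fields and leaves moving averages out. The choice of which fields are summable is an assumption made for this example, and the aggregate helper and sample values are hypothetical.

```python
# Assumption: these fields are plain counters or byte totals, so a sum
# across shards is meaningful. Moving averages are deliberately excluded.
COUNTER_FIELDS = ("rejection_count", "bytes_lag")
COUNTER_GROUPS = ("total_remote_refresh", "total_uploads_in_bytes")


def aggregate(shards):
    """Sum counter-like stats across shard-level stats objects."""
    totals = {field: 0 for field in COUNTER_FIELDS}
    for group in COUNTER_GROUPS:
        totals[group] = {"started": 0, "succeeded": 0, "failed": 0}
    for shard in shards:
        for field in COUNTER_FIELDS:
            totals[field] += shard[field]
        for group in COUNTER_GROUPS:
            for key in totals[group]:
                totals[group][key] += shard[group][key]
    return totals


# Illustrative data only, not real output from a cluster.
shards = [
    {
        "rejection_count": 1,
        "bytes_lag": 512,
        "total_remote_refresh": {"started": 3, "succeeded": 3, "failed": 0},
        "total_uploads_in_bytes": {"started": 4096, "succeeded": 4096, "failed": 0},
        "remote_refresh_latency_in_nanos": {"moving_avg": 1.5},
    },
    {
        "rejection_count": 0,
        "bytes_lag": 0,
        "total_remote_refresh": {"started": 2, "succeeded": 1, "failed": 1},
        "total_uploads_in_bytes": {"started": 1024, "succeeded": 512, "failed": 512},
        "remote_refresh_latency_in_nanos": {"moving_avg": 2.5},
    },
]

totals = aggregate(shards)
print(totals)
```

The same helper could be applied to any grouping of shards (per node, per repository, or cluster-wide), provided the caller knows which shards belong to each group.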

@sachinpkale
Member

@linuxpi What does started signify under upload_bytes, total_uploads and total_deletes? Can we check existing stats API and use the same naming conventions?

@linuxpi
Collaborator Author

linuxpi commented Apr 25, 2023

@sachinpkale started signifies the bytes/objects sent for upload. succeeded and failed reflect how many of those succeeded or failed, so started should equal succeeded + failed.
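
The relationship described here can be written as a simple check. This is only a sketch: the function name counters_consistent is hypothetical, and the comment does not say how uploads that are still in flight are counted, so the equality may only hold once all started uploads have finished.

```python
def counters_consistent(counters):
    """Check that started == succeeded + failed for one counter group.

    Assumption: no upload is in flight when the stats are read.
    """
    return counters["started"] == counters["succeeded"] + counters["failed"]


ok = counters_consistent({"started": 5, "succeeded": 4, "failed": 1})
mismatch = counters_consistent({"started": 5, "succeeded": 3, "failed": 1})
print(ok, mismatch)
```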

@linuxpi
Collaborator Author

linuxpi commented Apr 25, 2023

Can we check existing stats API and use the same naming conventions?

@sachinpkale I checked various stats objects that are part of ClusterStatsIndices. What I noticed there is that each stat is suffixed with _in_bytes wherever applicable. I'll check more and try to comply as much as possible, but if you had anything specific in mind, do let me know.

@sachinpkale
Member

@linuxpi Which metric/s will be used to determine the time it takes to complete one run of segment uploads post a refresh? Reference: #7474

@linuxpi
Collaborator Author

linuxpi commented May 16, 2023

@linuxpi Which metric/s will be used to determine the time it takes to complete one run of segment uploads post a refresh? Reference: #7474

Should be covered by remote_refresh_latency_in_nanos.

@linuxpi
Collaborator Author

linuxpi commented Mar 5, 2024

Closing this as the API was released in 2.10.

@linuxpi linuxpi closed this as completed Mar 5, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Mar 5, 2024
5 participants