Capture and report replication lag metric #339

Closed
mch2 opened this issue Jun 20, 2023 · 5 comments · Fixed by #346
Assignees
Labels
enhancement New feature or request

Comments

@mch2
Member

mch2 commented Jun 20, 2023

Is your feature request related to a problem? Please describe.

OpenSearch has recently launched Segment Replication. Segrep includes a new API exposing metrics specific to replication performance. The most critical of these metrics is replication lag, which measures the time between a primary refreshing on a new set of segments and a replica refreshing on the same set of segments. It would be valuable to capture and report on this metric when benchmarking using the segment replication strategy.

The service currently returns the last completed lag and any ongoing lag (shard is currently syncing to a new set of segments) through its /_cat/segment_replication API.

Describe the solution you'd like

Capture and report the min/max/avg lags for a benchmark run.
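As a rough illustration of the requested summary, the sketch below reduces a list of lag samples (in milliseconds) to min/max/avg. The names `LagSummary` and `summarize_lags` are hypothetical and are not part of OpenSearch Benchmark; how the samples get collected during a run is left open here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LagSummary:
    """Min/max/avg of replication lag samples, all in milliseconds."""
    min_ms: int
    max_ms: int
    avg_ms: float


def summarize_lags(samples_ms: Iterable[int]) -> LagSummary:
    """Reduce lag samples collected during a benchmark run to a summary."""
    samples = list(samples_ms)
    if not samples:
        # An empty run has no meaningful min/max/avg.
        raise ValueError("no lag samples collected")
    return LagSummary(
        min_ms=min(samples),
        max_ms=max(samples),
        avg_ms=sum(samples) / len(samples),
    )


# Example with three illustrative lag values in milliseconds.
summary = summarize_lags([1316, 1664, 2018])
print(summary)
```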

Describe alternatives you've considered
Aggregate the metrics on the service side and invoke the API to fetch them after a benchmark has completed.

Additional context

https://opensearch.org/docs/latest/api-reference/cat/cat-segment-replication/

@tlfeng
Contributor

tlfeng commented Jun 21, 2023

I will work on this feature request by adding a new telemetry device for segment replication statistics. The code will go into the file https://github.com/opensearch-project/opensearch-benchmark/blob/1.0.0/osbenchmark/telemetry.py#L43 and will capture the response of the /_cat/segment_replication API.

@gkamat
Collaborator

gkamat commented Jun 28, 2023

Please raise a PR when your work is complete. Thanks.

gkamat removed the untriaged label Jun 28, 2023
@tlfeng
Contributor

tlfeng commented Jul 10, 2023

I have temporarily put the code I wrote here: tlfeng@f2eaf20
For an unknown reason, it produces the error below. I remember there was no error last Friday, but I can't recall what changed to make the error keep appearing.

2023-07-10 17:35:59,56 ActorAddr-(T|:36707)/PID:156715 osbenchmark.telemetry ERROR Could not determine segment replication stats
Traceback (most recent call last):

  File "/home/ftianli/github/opensearch-benchmark/osbenchmark/telemetry.py", line 161, in run
    self.recorder.record()

  File "/home/ftianli/github/opensearch-benchmark/osbenchmark/telemetry.py", line 1767, in record
    stats_api_endpoint = "/_cat/segment_replication"

NameError: name 'index' is not defined

@tlfeng
Contributor

tlfeng commented Jul 12, 2023

The mysterious error above was resolved after restarting the system; it might have been an issue with the Python environment. 😅
Nothing was wrong with the code, and I added some comments to it: tlfeng@aa3d2de

The existing "telemetry device" code for the "legacy" searchable snapshots stats and the new CCR stats serves as a direct reference.
https://github.com/opensearch-project/opensearch-benchmark/blob/1.0.0/osbenchmark/telemetry.py#L969
https://github.com/opensearch-project/opensearch-benchmark/blob/1.0.0/osbenchmark/telemetry.py#L317 & commit 29b1c48

@tlfeng
Contributor

tlfeng commented Aug 9, 2023

There are 2 problems in the above PR #346.

  1. I didn't realize that there may be more than one space between adjacent values, so using split(" ") is incorrect in
    stats_arr.append(line_of_shard_stats.split(" "))
  2. I didn't realize that adding the bytes parameter to the URL removes the byte size unit from the response, so the value becomes numeric and comparable.
    I have created a PR to update the documentation: Add bytes into "Query parameters" for CAT Segment Replication API documentation-website#4731
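To illustrate the first problem, here is a minimal sketch comparing `split(" ")` with the argument-free `split()` on one row of the plain-text output. The helper `parse_cat_line` is hypothetical and is not the code in PR #346; the column names are taken from the header row that the `v` parameter prints.

```python
# Column names copied from the header row printed by the ?v query parameter.
COLUMNS = [
    "shardId", "target_node", "target_host", "checkpoints_behind",
    "bytes_behind", "current_lag", "last_completed_lag", "rejected_requests",
]


def parse_cat_line(line: str) -> dict:
    """Split one data row on runs of whitespace and label each field."""
    fields = line.split()  # no argument: consecutive spaces count as one separator
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


line = ("[logs-231998][0] ip-10-0-3-170.us-west-2.compute.internal "
        "10.0.3.170 1 7kb    688ms 0s   0")

naive = line.split(" ")       # column padding produces empty strings
row = parse_cat_line(line)    # padding is collapsed

print("empty fields from split(' '):", naive.count(""))
print("current_lag:", row["current_lag"])
```

With `split(" ")`, every extra padding space yields an empty string, so field positions shift from row to row; `split()` avoids that because it treats any run of whitespace as a single separator.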

I also found out that adding the format=JSON parameter returns the response in JSON format, which may make parsing the response more robust.

Examples of the output:

$ curl "opens-clust-xxx.elb.us-west-2.amazonaws.com/_cat/segment_replication"
[logs-211998][9] ip-10-0-3-36.us-west-2.compute.internal  10.0.3.36  1 6.9kb  1.2s  0s   0
[logs-211998][9] ip-10-0-4-86.us-west-2.compute.internal  10.0.4.86  1 6.9kb  1.2s  0s   0
[logs-231998][0] ip-10-0-3-170.us-west-2.compute.internal 10.0.3.170 1 7kb    688ms 0s   0
[logs-231998][0] ip-10-0-5-230.us-west-2.compute.internal 10.0.5.230 1 7kb    688ms 0s   0
$ curl "opens-clust-xxx.elb.us-west-2.amazonaws.com/_cat/segment_replication?v"
shardId          target_node                              target_host checkpoints_behind bytes_behind current_lag last_completed_lag rejected_requests
[logs-221998][7] ip-10-0-4-86.us-west-2.compute.internal  10.0.4.86   1                  47.3kb       988ms       823ms              0
[logs-221998][7] ip-10-0-3-170.us-west-2.compute.internal 10.0.3.170  1                  39.2kb       988ms       1.2s               0
[logs-231998][0] ip-10-0-3-66.us-west-2.compute.internal  10.0.3.66   1                  31.6kb       791ms       699ms              0
[logs-231998][0] ip-10-0-4-187.us-west-2.compute.internal 10.0.4.187  1                  40kb         791ms       1.1s               0
[logs-231998][0] ip-10-0-3-170.us-west-2.compute.internal 10.0.3.170  1                  40kb         791ms       342ms              0
[logs-231998][0] ip-10-0-4-86.us-west-2.compute.internal  10.0.4.86   1                  31.6kb       791ms       343ms              0
$ curl "opens-clust-xxx.elb.us-west-2.amazonaws.com/_cat/segment_replication?time=ms&bytes=b&v"
shardId          target_node                              target_host checkpoints_behind bytes_behind current_lag last_completed_lag rejected_requests
[logs-221998][0] ip-10-0-5-102.us-west-2.compute.internal 10.0.5.102  1                  7480         1116        1664               0
[logs-221998][0] ip-10-0-3-36.us-west-2.compute.internal  10.0.3.36   1                  16034        1116        1316               0
[logs-221998][0] ip-10-0-4-86.us-west-2.compute.internal  10.0.4.86   1                  16034        1116        2018               0
[logs-221998][2] ip-10-0-3-36.us-west-2.compute.internal  10.0.3.36   2                  15179        1753        2282               0
[logs-221998][2] ip-10-0-5-155.us-west-2.compute.internal 10.0.5.155  2                  15179        1753        2673               0
$ curl "opens-clust-xxx.elb.us-west-2.amazonaws.com/_cat/segment_replication?time=ms&format=JSON&v"
[
  {
    "shardId" : "[logs-191998][8]",
    "target_node" : "ip-10-0-4-187.us-west-2.compute.internal",
    "target_host" : "10.0.4.187",
    "checkpoints_behind" : "1",
    "bytes_behind" : "35.6kb",
    "current_lag" : "2158",
    "last_completed_lag" : "1335",
    "rejected_requests" : "0"
  },
  {
    "shardId" : "[logs-201998][8]",
    "target_node" : "ip-10-0-5-230.us-west-2.compute.internal",
    "target_host" : "10.0.5.230",
    "checkpoints_behind" : "1",
    "bytes_behind" : "40.3kb",
    "current_lag" : "889",
    "last_completed_lag" : "2040",
    "rejected_requests" : "0"
  },
  {
    "shardId" : "[logs-201998][8]",
    "target_node" : "ip-10-0-5-163.us-west-2.compute.internal",
    "target_host" : "10.0.5.163",
    "checkpoints_behind" : "1",
    "bytes_behind" : "47.7kb",
    "current_lag" : "889",
    "last_completed_lag" : "1381",
    "rejected_requests" : "0"
  }
]
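
A JSON response like the one above could be parsed roughly as follows. This is only a sketch: `extract_lags_ms` is a hypothetical helper, not code from PR #346, and it assumes the request used time=ms so that both lag fields are plain integer strings. Note that every value in the response is a JSON string, so the numeric fields still need converting.

```python
import json


def extract_lags_ms(body: str) -> list:
    """Turn a format=JSON response body into rows with integer lag values.

    Assumes the request included time=ms, so the lag fields carry no unit.
    """
    rows = json.loads(body)
    return [
        {
            "shard": row["shardId"],
            "node": row["target_node"],
            "current_lag_ms": int(row["current_lag"]),
            "last_completed_lag_ms": int(row["last_completed_lag"]),
        }
        for row in rows
    ]


# Two records abbreviated from the example response above.
sample = """
[
  {"shardId": "[logs-191998][8]",
   "target_node": "ip-10-0-4-187.us-west-2.compute.internal",
   "current_lag": "2158", "last_completed_lag": "1335"},
  {"shardId": "[logs-201998][8]",
   "target_node": "ip-10-0-5-230.us-west-2.compute.internal",
   "current_lag": "889", "last_completed_lag": "2040"}
]
"""

lags = extract_lags_ms(sample)
print(max(r["last_completed_lag_ms"] for r in lags))
```

In the example response, bytes_behind still carries a unit ("35.6kb") because bytes=b was not passed; adding that parameter as well should make that field numeric too, as described in point 2 above.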


3 participants