Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish recall as a kpi metric #581

Merged
merged 8 commits into from
Jul 24, 2024

Conversation

finnroblin
Copy link
Contributor

Description

Displays recall as part of the final benchmarking stats.
i.e.:

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                     Task |       Value |   Unit |
|---------------------------------------------------------------:|-------------------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                          |     16.1933 |    min |
.............................................................................................      .....................................................................
| Mean recall@k |             prod-queries |        0.99 |        |

|                                                  Mean recall@1 |             prod-queries |           1 |        |
|                           50th percentile recall_k percentiles |             prod-queries |           1 |        |
|                           90th percentile recall_k percentiles |             prod-queries |           1 |        |
|                           99th percentile recall_k percentiles |             prod-queries |           1 |        |
|                          100th percentile recall_k percentiles |             prod-queries |           1 |        |

Issues Resolved

(From workloads repo): 282
While this PR is not a general solution to #199, it does address the specific need mentioned in that issue of displaying the recall metadata as a summary.

I have added a kpi_metrics field to the GlobalStats metric collector where additional metrics could be added. Users would then need to add display logic to the results_publisher.py file in order for the metrics to show up in the summary.

Testing

  • New functionality includes testing

I added some unit tests to assert that recall metrics were collected. I also altered the OsResultsStoreTest to account for the new kpi_metrics field.

I manually tested the train-test workload, which contains a vector search operation, and the no-train-test-index-only workload, which does not contain a vector search operation. Both workloads ran to completion and displayed the expected statistics (recall for train-test, and no recall metrics for no-train-test-index-only).


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
@@ -13,7 +13,7 @@
# not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
# http://www.apache.org/licenses/LICENSE-2.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why my code formatter keeps on changing code outside of the selected lines. I'll revert this and the other spacing changes to out-of-scope code in the next revision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend reverting for this PR, just so its more clear what functional changes are being made.

"operation": operation,
"recall@k": recall_at_k_stats,
"recall@1":recall_at_1_stats,
"recall_time_ms": recall_time_ms_stats,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke with @VijayanB and recall_time_ms is not a stat we want to display. I'll remove it from the kpi_metrics method in the next revision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove from this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry not sure what "this" is referring to. Do you mean remove the add_kpi_metrics function (and instead append directly to the kpi_metrics list), remove the recall_time_ms metric, or remove kpi_metrics all together and go a different direction for passing recall from metadata to the metrics store?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry meant the recall_time_ms - not sure if this is needed

name=recall_metric_name,
value=sample.request_meta_data[recall_metric_name],
unit="",
task=sample.task.name, # todo change unit to segment count unit...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove todo comment in next revision

@@ -1436,97 +1436,105 @@ def test_store_results(self):
"benchmark-version": "0.4.4",
"benchmark-revision": "123abc",
"environment": "unittest",
"test-execution-id": OsResultsStoreTests.TEST_EXECUTION_ID,
"test-execution-id": "6ebc6e53-ee20-4b0c-99b4-09697987e9f4",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why modify so much in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ordering of the json got scrambled when I added the kpi_metrics variable. I agree though that I shouldn't have changed so much; I'll revert most of the unit test and just add a single extra element in the json array with the kpi_metrics and associated data.

@@ -13,7 +13,7 @@
# not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
# http://www.apache.org/licenses/LICENSE-2.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend reverting for this PR, just so its more clear what functional changes are being made.

@@ -1966,6 +1986,7 @@ def single_latency(self, task, operation_type, metric_name="latency"):
class GlobalStats:
def __init__(self, d=None):
self.op_metrics = self.v(d, "op_metrics", default=[])
self.kpi_metrics = self.v(d, "kpi_metrics", default=[])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scope for kpi metrics? Aside from recall and maybe ndcg, what else would go in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any performance metric for k-NN is in scope for this. Right now it seems like recall is the most major one but if k-NN adds ndcg we could also track that in OSB. (The process for adding new metrics would be to calculate/add it to the metadata in the search query runner and then pull it from the metadata into this kpi_metrics list). Likewise if we or the anomaly detection plugin want to track false positives rate/specificity.

I mainly wanted to explicitly distinguish between algorithm metrics like recall and op_metrics like latency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. KPI though might include latency right? It seems like a good category here would be correctness_metrics or relevance_metrics

"operation": operation,
"recall@k": recall_at_k_stats,
"recall@1":recall_at_1_stats,
"recall_time_ms": recall_time_ms_stats,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove from this PR?

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
…bug)

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
@finnroblin finnroblin requested a review from jmazanec15 July 18, 2024 18:09
@@ -13,7 +13,7 @@
# not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
# http://www.apache.org/licenses/LICENSE-2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove unrelated formatting changes from this PR?


keys = record.keys()
recall_keys_in_task_dict = "recall@1" in keys and "recall@k" in keys
if recall_keys_in_task_dict and record["recall@1"] and record["recall@k"]:
Copy link
Member

@VijayanB VijayanB Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't check for record['recall@k"], if recall is 0, it will be skipped. just check only for keys

Comment on lines 222 to 223
except KeyError:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to catch this? can you return from like 214 if keys doesn't exist?

@@ -1369,7 +1413,7 @@ def calculate_task_throughput(self, task, current_samples, bucket_interval_secs)
self.task_stats[task] = ThroughputCalculator.TaskStats(bucket_interval=bucket_interval_secs,
sample_type=first_sample.sample_type,
start_time=first_sample.absolute_time - first_sample.time_period)
current = self.task_stats[task]
current = self.task_stats[task] # TaskStats object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this comment here and as part of this PR?

@@ -1190,6 +1233,7 @@ def __init__(self, start_timestamp, buffer_size=16384):
def add(self, task, client_id, sample_type, meta_data, absolute_time, request_start, latency, service_time,
client_processing_time, processing_time, throughput, ops, ops_unit, time_period, percent_completed,
dependent_timing=None):
self.logger.debug("Logging with metadata: [%s]", meta_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this?

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
@finnroblin finnroblin requested a review from VijayanB July 22, 2024 19:26
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
sample.request_meta_data,
)

self.logger.debug(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this to log?

Copy link
Contributor Author

@finnroblin finnroblin Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this -- It's debug level so it shouldn't show up with normal logging settings. I was using it to figure out how samples worked, so this log has served its purpose and I'll remove it now.

Copy link
Member

@VijayanB VijayanB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@VijayanB
Copy link
Member

@jmazanec15 Can you take a look at this PR? Thanks

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! great job @finnroblin !

@VijayanB VijayanB merged commit 5886866 into opensearch-project:main Jul 24, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants