Allow users to specify number of docs per index when creating workloads #291

Merged
7 commits merged into opensearch-project:main on May 11, 2023

Conversation

IanHoang
Collaborator

@IanHoang IanHoang commented May 1, 2023

Description

create-workload currently fetches all documents from the indices specified in --indices. Users should be able to extract only a subset of the documents if they do not want all of them. With this PR, OSB supports running --number-of-docs in conjunction with --indices. It takes index:count pairs, each naming an index from --indices and the number of documents to extract from it.

# Example: if movies and actors have total document counts of 3000 and 2000 respectively, users can request only 1500 documents from the movies index and 1200 documents from the actors index with these parameters.
--indices=movies,actors --number-of-docs movies:1500 actors:1200
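
For illustration, the index:count pairs in the example above could be turned into a mapping roughly as follows. This is a minimal sketch, not the code in this PR: the function name `parse_number_of_docs` and its error messages are hypothetical, and it assumes each pair has the form `<index>:<count>` with a positive integer count.

```python
def parse_number_of_docs(pairs):
    """Turn ["movies:1500", "actors:1200"] into {"movies": 1500, "actors": 1200}.

    Hypothetical helper for illustration; not part of OSB.
    """
    counts = {}
    for pair in pairs:
        # Split on the last colon so the count is always the final segment.
        index, sep, count = pair.rpartition(":")
        if not sep or not index:
            raise ValueError(f"expected <index>:<count>, got {pair!r}")
        try:
            value = int(count)
        except ValueError:
            raise ValueError(f"count for {index!r} is not an integer: {count!r}") from None
        if value <= 0:
            raise ValueError(f"count for {index!r} must be positive, got {value}")
        counts[index] = value
    return counts


print(parse_number_of_docs(["movies:1500", "actors:1200"]))
# {'movies': 1500, 'actors': 1200}
```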

See #289 for more details.

Issues Resolved

#289

Testing

  • New functionality includes testing

Tested it with a few indices in a private cluster. Tested the following cases:

  • with single index and single doc count
  • with multiple indices and multiple doc counts
  • with multiple indices and single doc count (should result in error as both doc counts need to be provided)
  • with single index (should get all documents)
  • with multiple indices (should get all documents)
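
The five cases above can be summarized in a small sketch. It assumes a hypothetical helper `docs_to_extract` that receives the index list, the total document count per index, and an optional mapping of requested counts; none of these names come from the PR. What should happen when a requested count exceeds the documents available is not covered here.

```python
def docs_to_extract(indices, available, requested=None):
    """Return how many documents to pull from each index.

    Hypothetical helper for illustration:
    - no requested counts: extract every document from every index
    - requested counts: every index in `indices` must have one
    """
    if not requested:
        return {index: available[index] for index in indices}
    missing = [index for index in indices if index not in requested]
    if missing:
        raise ValueError(f"no document count provided for: {', '.join(missing)}")
    return {index: requested[index] for index in indices}


available = {"movies": 3000, "actors": 2000}

# Multiple indices with a count for each.
print(docs_to_extract(["movies", "actors"], available, {"movies": 1500, "actors": 1200}))
# {'movies': 1500, 'actors': 1200}

# Multiple indices with no counts: all documents.
print(docs_to_extract(["movies", "actors"], available))
# {'movies': 3000, 'actors': 2000}
```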

See #289 for outputs and more details


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@IanHoang IanHoang requested a review from gkamat as a code owner May 1, 2023 21:05
@IanHoang IanHoang added Minor Release Backport to minor version branch and removed Minor Release Backport to minor version branch labels May 2, 2023
@gkamat
Collaborator

gkamat commented May 7, 2023

Also consider modifying the user interface to options of this sort: --number-of-docs idx1:count1 --number-of-docs idx2:count2 which might be more intuitive.
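
A repeatable option of this kind could be sketched with the standard library's `argparse` and `action="append"`. This is only an illustration of the suggested interface under that assumption; the parser below, its `prog` name, and the `parse` helper are hypothetical and are not OSB's actual argument definitions.

```python
import argparse

# Hypothetical stand-alone parser, used only to illustrate the repeated-flag style.
parser = argparse.ArgumentParser(prog="create-workload-sketch")
parser.add_argument("--indices", type=lambda s: s.split(","), required=True)
parser.add_argument("--number-of-docs", action="append", metavar="INDEX:COUNT")


def parse(argv):
    """Return (indices, {index: count}) for the given command-line arguments."""
    args = parser.parse_args(argv)
    counts = {}
    for pair in args.number_of_docs or []:
        # Split on the last colon; int() raises ValueError on a malformed count.
        index, _, count = pair.rpartition(":")
        counts[index] = int(count)
    return args.indices, counts


print(parse(["--indices=movies,actors",
             "--number-of-docs", "movies:1500",
             "--number-of-docs", "actors:1200"]))
# (['movies', 'actors'], {'movies': 1500, 'actors': 1200})
```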

Ian Hoang added 4 commits May 8, 2023 13:39
Signed-off-by: Ian Hoang <hoangia@amazon.com>
Signed-off-by: Ian Hoang <hoangia@amazon.com>
Signed-off-by: Ian Hoang <hoangia@amazon.com>
Signed-off-by: Ian Hoang <hoangia@amazon.com>
@IanHoang IanHoang force-pushed the create-workload-with-docs branch from 1e90e73 to 174da48 Compare May 9, 2023 17:15
Ian Hoang added 3 commits May 9, 2023 12:21
Signed-off-by: Ian Hoang <hoangia@amazon.com>
… of comma separated values that need to match --indices list

Signed-off-by: Ian Hoang <hoangia@amazon.com>
Signed-off-by: Ian Hoang <hoangia@amazon.com>
@IanHoang IanHoang requested a review from gkamat May 9, 2023 20:59
osbenchmark/workload_generator/corpus.py
dump_documents(client, index, get_doc_outpath(output_path, index, "-1k"), min(total_docs, 1000), " for test mode")
dump_documents(client, index, docs_path, total_docs)
return template_vars(index, docs_path, total_docs)
dump_documents(client, index, get_doc_outpath(output_path, index, "-1k"), min(ndocs_to_extract, 1000), " for test mode")
Collaborator

IMO, the doc count specified should not be respected for test mode -- it should always be min(1000, total_docs). Users generally don't know the internals of test mode and it is unlikely they intend to specify that value.
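
The behavior suggested here can be sketched as follows. The helper `extraction_counts` and the constant `TEST_MODE_MAX_DOCS` are hypothetical names for illustration; the sketch only assumes that the full corpus honors the requested count while the test-mode corpus always uses min(1000, total_docs).

```python
TEST_MODE_MAX_DOCS = 1000  # size cap of the "-1k" test-mode corpus


def extraction_counts(total_docs, requested_docs=None):
    """Return (full_corpus_count, test_mode_count) for one index.

    Hypothetical helper: the requested count applies to the full corpus
    only; test mode ignores it and depends solely on the index size.
    """
    full = total_docs if requested_docs is None else requested_docs
    test_mode = min(TEST_MODE_MAX_DOCS, total_docs)
    return full, test_mode


print(extraction_counts(3000, 1500))  # (1500, 1000)
print(extraction_counts(3000, 500))   # (500, 1000)
print(extraction_counts(800))         # (800, 800)
```

One consequence of this rule is that the test-mode corpus can hold more documents than the full corpus when fewer than 1000 documents are requested, as in the second call.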

@IanHoang IanHoang merged commit 8dedfb6 into opensearch-project:main May 11, 2023
Collaborator Author

@IanHoang IanHoang left a comment

Meant to post this comment on a different issue.
