add support for zstd compressed corpora #542

Merged
merged 2 commits into from
Jan 21, 2025

Member

@OVI3D0 OVI3D0 commented Jan 6, 2025

Description

Adds a workload parameter that lets users opt in to the new zstd-compressed corpora by passing:
--workload-params=use_zst:true
This applies only to the larger workloads for which zstd-compressed corpora have been added.
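
As a rough illustration of how a key:value parameter string like the one above could become a flag that a workload template branches on, here is a minimal Python sketch. The helper name parse_workload_params and its parsing rules are assumptions made for illustration; this is not OpenSearch Benchmark's actual parser.

```python
def parse_workload_params(raw):
    """Parse a comma-separated 'key:value' string into a dict (hypothetical helper)."""
    params = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {pair!r}")
        value = value.strip()
        lowered = value.lower()
        # Map the literal strings "true"/"false" to booleans; keep anything else as text.
        if lowered in ("true", "false"):
            params[key.strip()] = lowered == "true"
        else:
            params[key.strip()] = value
    return params


params = parse_workload_params("use_zst:true")
# A template could branch on this flag, treating it as False when it is absent.
use_zst = params.get("use_zst", False)
```

Defaulting the flag to False keeps the existing bzip2 corpora as the behavior for users who do not pass the parameter.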

Issues Resolved

#357

Testing

  • New functionality includes testing

Tested by running workloads with the new parameter.

Backport to Branches:

  • 6
  • 7
  • 1
  • 2
  • 3

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0 OVI3D0 added the backport 1, backport 2, backport 3, and backport 7 labels Jan 6, 2025
}
{% if use_zst %}
{
"source-file": "documents-100.json.zst",
Collaborator

Would zstd be a better suffix, since that seems to be the commonly used abbreviation for Zstandard? If this is changed, the workload parameter should be changed to correspond as well.

Member Author

I think you're right. The param is now use_zstd in the latest revision

{% else %}
{
"source-file": "documents-1000.json.bz2",
"source-file-parts": [ { "name": "documents-1000-part0", "size": 20189061054 }, { "name": "documents-1000-part1", "size": 20189061054 }, { "name": "documents-1000-part2", "size": 20189061055 } ],
Collaborator

These names would probably need to change, since the documents-1000-part* are the contributing chunks for the bzip2 corpus.

Member Author

I uploaded new chunks to the s3 bucket for the ZSTD compressed corpora. They're now called documents-1000-zstd-part*
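
To make the naming concern concrete, the sketch below generates distinct part names for each compression variant and sums the part sizes quoted in the diff above. The helper names are hypothetical, and the number of zstd parts (3) is an assumption, since the thread only gives the pattern documents-1000-zstd-part*.

```python
def part_names(prefix, count):
    """Build part names such as 'documents-1000-part0' (hypothetical helper)."""
    return [f"{prefix}-part{i}" for i in range(count)]


def total_size(parts):
    """Sum the 'size' field over a list of source-file-parts entries."""
    return sum(part["size"] for part in parts)


# Part entries for the bzip2 corpus, copied from the diff under review.
bz2_parts = [
    {"name": "documents-1000-part0", "size": 20189061054},
    {"name": "documents-1000-part1", "size": 20189061054},
    {"name": "documents-1000-part2", "size": 20189061055},
]

# Assumed part count of 3 for the zstd corpus; the thread does not state it.
zstd_names = part_names("documents-1000-zstd", 3)

# The two variants must not share part names, or one corpus would be
# assembled from the other's chunks.
bz2_names = {part["name"] for part in bz2_parts}
names_are_distinct = bz2_names.isdisjoint(zstd_names)
bz2_total = total_size(bz2_parts)
```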

Comment on lines 42 to 46
{
"source-file": "documents-880.json.zst",
"document-count": 1020000000,
"compressed-bytes": 27685953536,
"uncompressed-bytes": 943679382267
Collaborator

Perhaps the document-count and uncompressed-bytes should be moved out of the Jinja if markup? That would eliminate the duplication. Might be a bit more work for http_logs but still probably worth it.

Member Author

You're right, this looks a lot cleaner. Reduced quite a few lines of code as well
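
The refactoring suggested here can be sketched in Python: fields that do not depend on the compression format are defined once, and only the format-specific fields sit inside the conditional. The function name corpus_entry and the bzip2 file name documents-880.json.bz2 are assumptions for illustration; the zstd values are the ones shown in the diff above, and the bzip2 compressed size is left out because it does not appear in this thread.

```python
# Fields shared by both variants: the uncompressed data is the same corpus.
SHARED_FIELDS = {
    "document-count": 1020000000,
    "uncompressed-bytes": 943679382267,
}


def corpus_entry(use_zstd):
    """Return a corpus description dict for the chosen compression (hypothetical helper)."""
    if use_zstd:
        specific = {
            "source-file": "documents-880.json.zst",
            "compressed-bytes": 27685953536,
        }
    else:
        # Assumed file name; the bzip2 compressed size is not given in the thread.
        specific = {"source-file": "documents-880.json.bz2"}
    # Merge the format-specific fields with the shared ones.
    return {**specific, **SHARED_FIELDS}


zstd_entry = corpus_entry(True)
bz2_entry = corpus_entry(False)
```

Keeping the shared values in one place means a corrected document count only has to be changed once, which is the duplication the review comment points at.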

{% if use_zst %}
{
"source-file": "documents.json.zst",
"#COMMENT": "ML benchmark rely on the fact that the document count stays constant.",
Collaborator

Nit: "relies on"

@IanHoang
Collaborator

IanHoang commented Jan 8, 2025

@OVI3D0 @gkamat When users list out workloads with opensearch-benchmark list workloads, should we include a new column listing out all the compressed extensions each workload supports? This would help users understand quickly which workloads can use zstd

…1tb workload

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0
Member Author

OVI3D0 commented Jan 9, 2025

> @OVI3D0 @gkamat When users list out workloads with opensearch-benchmark list workloads, should we include a new column listing out all the compressed extensions each workload supports? This would help users understand quickly which workloads can use zstd

I think this would be helpful. I can open up a separate PR for this
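
One way such a column could be derived is from the source-file names a workload declares. The sketch below is only an illustration of that idea; the helper names are hypothetical and nothing here reflects how opensearch-benchmark list workloads is actually implemented.

```python
import os


def compression_extension(source_file):
    """Return the trailing extension without the dot, e.g. 'zst' (hypothetical helper)."""
    return os.path.splitext(source_file)[1].lstrip(".")


def supported_extensions(source_files):
    """Collect the distinct compression extensions across a workload's source files."""
    return sorted({compression_extension(name) for name in source_files})


# File names taken from the diff snippets earlier in this conversation.
extensions = supported_extensions(
    ["documents-100.json.zst", "documents-1000.json.bz2"]
)
```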

@OVI3D0 OVI3D0 requested a review from gkamat January 9, 2025 22:46
@OVI3D0 OVI3D0 merged commit a593f0c into opensearch-project:main Jan 21, 2025
2 checks passed
@OVI3D0 OVI3D0 deleted the add-zstd-support branch January 21, 2025 18:37
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 21, 2025
* add support for zstd compressed corpora

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>

* revise jinja templating + rename workload param + add new chunks for 1tb workload

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>

---------

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
(cherry picked from commit a593f0c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@opensearch-trigger-bot

The backport to 7 failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-7 7
# Navigate to the new working tree
pushd ../.worktrees/backport-7
# Create a new branch
git switch --create backport/backport-542-to-7
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 a593f0ce7099550c2ccaa65ef8d45447877e36e5
# Push it to GitHub
git push --set-upstream origin backport/backport-542-to-7
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-7

Then, create a pull request where the base branch is 7 and the compare/head branch is backport/backport-542-to-7.

OVI3D0 pushed a commit that referenced this pull request Jan 22, 2025
* add support for zstd compressed corpora



* revise jinja templating + rename workload param + add new chunks for 1tb workload



---------


(cherry picked from commit a593f0c)

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>