Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding zstd compression and decompression support for workload corpora #410

Merged
merged 2 commits into from
Nov 20, 2023

Conversation

beaioun
Copy link
Collaborator

@beaioun beaioun commented Nov 8, 2023

Description

This PR adds support for zstd compression and decompression of workload corpora

Issues Resolved

This PR is aiming to solve issue #385

Testing

  • New functionality includes testing

REQUEST FOR HELP:
I need help understanding compression in the io module. I have not found where the io.compress function is referenced thus not really sure how the zstd compression will work in this case. But I included compress_zstd as a separate function to be called in actual use cases.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…orpora

Signed-off-by: beaioun <mshi@ucsd.edu>
@@ -344,6 +370,18 @@ def _do_decompress_manually_with_lib(target_directory, filename, compressed_file
compressed_file.close()


def _do_decompress_zstd(target_directory, filename):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a separate function for this, is it possible to implement this through the _do_decompress_manually as is done with other decompressors like bz2? This would remove boilerplate code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll be working on that, thanks

@IanHoang
Copy link
Collaborator

IanHoang commented Nov 9, 2023

@beaioun You are correct, there is no io.compress(). As of now, the only time compressing is used is in the metrics.py and corpus.py, but these call the libraries zlib and bz2 directly. Adding a separate function to io.py to compress like you did doesn't hurt. The issue I wrote should've clarified this but I can create a separate issue on this: it'd be good to include different compression options in corpus.py. I'll explain this further in a separate issue since this is out of scope for this PR!

Copy link
Collaborator

@IanHoang IanHoang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to fix the pylint error in the CI Unittests and we should be good

@beaioun
Copy link
Collaborator Author

beaioun commented Nov 9, 2023

@beaioun You are correct, there is no io.compress(). As of now, the only time compressing is used is in the metrics.py and corpus.py, but these call the libraries zlib and bz2 directly. Adding a separate function to io.py to compress like you did doesn't hurt. The issue I wrote should've clarified this but I can create a separate issue on this: it'd be good to include different compression options in corpus.py. I'll explain this further in a separate issue since this is out of scope for this PR!

Thank you @IanHoang for the clarification. I'll take a look at corpus.py and see if we can implement that.

@beaioun
Copy link
Collaborator Author

beaioun commented Nov 9, 2023

Just need to fix the pylint error in the CI Unittests and we should be good

Thanks @IanHoang, will do

@IanHoang
Copy link
Collaborator

Just need to fix the pylint error in the CI Unittests and we should be good

Thanks @IanHoang, will do

@beaioun I think it's fine to go ahead and fix the CI unittests here and we can merge this in. We can open a separate issue for corpus.py and create a separate PR for it. This PR has a good amount of changes

… If not I will fix from this point

Signed-off-by: beaioun <mshi@ucsd.edu>
@beaioun
Copy link
Collaborator Author

beaioun commented Nov 18, 2023

Just need to fix the pylint error in the CI Unittests and we should be good

Thanks @IanHoang, will do

@beaioun I think it's fine to go ahead and fix the CI unittests here and we can merge this in. We can open a separate issue for corpus.py and create a separate PR for it. This PR has a good amount of changes

Hey @IanHoang , it's been a few days. I'll go ahead and commit the recent changes. Even though I am still left with this throughput throttled error unfixed, I feel like it has something to do with my PC setup. Let's see how the tests go and I will fix it from there. 4f37b4b
image

@beaioun beaioun requested a review from IanHoang November 20, 2023 06:39
Copy link
Collaborator

@IanHoang IanHoang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

@IanHoang IanHoang merged commit 20d4928 into opensearch-project:main Nov 20, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants