[FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP Logs, and Big5 Workloads #357
Comments
I can take this one on 🖐️
I decided to run some tests comparing the current BZ2 compression format to ZSTD, as well as LZ4. I used OSB's [...] I then used each EC2 instance to compare the time taken to decompress each format at a given file size: one instance compared all the 60 GB files, another all the 100 GB files, and so on.
The tests were run single-threaded. BZ2 and ZSTD currently do not support multi-threaded decompression. LZ4 recently added support for multi-threaded decompression, but it is not widely available yet and does not make a large difference for decompression (LZ4 claims a 60% speed boost in decompression using multiple cores).
Here are my results:
So, on average, ZSTD decompression seems to outperform LZ4 by 2x and BZ2 by 3x. The ZSTD-compressed files are also 1.2x the size of the BZ2-compressed files, so there is a bit of a trade-off, but I believe the faster decompression speeds make up for it.
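A single-threaded comparison along these lines could be sketched roughly as follows. This is not the harness used for the numbers above; the function names, the synthetic sample line, and the repeat count are all assumptions for illustration. Only standard-library codecs are guaranteed to be present, so `zlib` appears purely as a stand-in, and ZSTD is included only if the third-party `zstandard` package happens to be installed.

```python
import bz2
import time
import zlib


def time_decompress(decompress, payload, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` single-threaded runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(payload)
        best = min(best, time.perf_counter() - start)
    return best


def compare_formats(raw, codecs):
    """Compress `raw` with each codec, then record compressed size and decompression time."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        payload = compress(raw)
        assert decompress(payload) == raw  # sanity check: the round trip is lossless
        results[name] = {
            "compressed_bytes": len(payload),
            "decompress_seconds": time_decompress(decompress, payload),
        }
    return results


# zlib is a standard-library stand-in, not one of the formats compared in this thread.
CODECS = {
    "bz2": (bz2.compress, bz2.decompress),
    "zlib": (zlib.compress, zlib.decompress),
}

try:
    import zstandard  # third-party and optional; skipped if not installed

    CODECS["zstd"] = (
        zstandard.ZstdCompressor().compress,
        zstandard.ZstdDecompressor().decompress,
    )
except ImportError:
    pass


if __name__ == "__main__":
    # Hypothetical log-like line, repeated to get a few megabytes of text.
    sample = b'{"clientip": "10.0.0.1", "request": "GET /index.html", "status": 200}\n' * 50_000
    for name, stats in compare_formats(sample, CODECS).items():
        print(name, stats)
```

In-memory timings on a few megabytes will not reproduce the ratios seen on multi-gigabyte corpora, where disk I/O also matters, so this is only a starting point for a real measurement.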
Another thing to note is that we can also involve [...]
Also, these tests were run on the structure derived from the http_logs workload, which is text, similar to most of the workloads available. However, these tests were also run on other workloads such as [...]
Overall, I think the next steps should be to compress the text-based workloads we currently have into the ZSTD format, keeping the BZ2 versions as fallback, backwards-compatible options. We can then measure and document improvements in overall benchmark runtime using ZSTD decompression. I think we should also consider whether parallelized ZSTD can be leveraged for further performance improvements.
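To make the "ZSTD first, BZ2 as fallback" idea concrete, here is a minimal sketch of how a loader might choose between archives. It is not how OSB resolves corpora today; `pick_corpus_archive`, the extension order, and the file names are hypothetical.

```python
from pathlib import Path

# Assumed preference order: the faster-to-decompress format first, BZ2 as fallback.
PREFERRED_EXTENSIONS = (".zst", ".bz2")


def pick_corpus_archive(data_dir, base_name, extensions=PREFERRED_EXTENSIONS):
    """Return the first existing archive for `base_name`, or None if none exist."""
    for ext in extensions:
        candidate = Path(data_dir) / (base_name + ext)
        if candidate.is_file():
            return candidate
    return None
```

With this ordering, a data directory that only has `documents.json.bz2` keeps working unchanged, and the ZSTD file is picked up as soon as it is published alongside it.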
Is your feature request related to a problem?
A while back, @beaioun added support for ZSTD compression and decompression in OSB (opensearch-project/opensearch-benchmark#385). He suggested that we create ZSTD-compressed versions of the larger corpora, such as NYC Taxis, HTTP Logs, and Big5.
What solution would you like?
Create a compressed version of each of these workloads by doing something along the lines of the following:
~/.benchmark/benchmarks/data
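One possible way to produce a ZSTD file from an existing BZ2 corpus, without writing the uncompressed data to disk, is to stream it from one format into the other. The sketch below is an assumption rather than the procedure proposed in this issue: `recompress` and the file names are hypothetical, and the ZSTD writer comes from the third-party `zstandard` package, which must be installed separately.

```python
import bz2
import shutil


def recompress(src_path, dst_path, open_dst, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and write it through `open_dst(dst_path, "wb")`."""
    with bz2.open(src_path, "rb") as src, open_dst(dst_path, "wb") as dst:
        # Copy in fixed-size chunks so memory use stays flat for large corpora.
        shutil.copyfileobj(src, dst, chunk_size)


if __name__ == "__main__":
    import zstandard  # third-party; assumed to be installed

    # Hypothetical file names inside the data directory mentioned above.
    recompress("documents.json.bz2", "documents.json.zst", zstandard.open)
```

Passing the opener as a parameter keeps the copy loop independent of the target format, so the same function can be exercised with a standard-library opener such as `gzip.open` when `zstandard` is not available.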