
[FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP Logs, and Big5 Workloads #357

Open
IanHoang opened this issue Jul 25, 2024 · 3 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)


IanHoang commented Jul 25, 2024

Is your feature request related to a problem?

A while back, @beaioun added support for ZSTD compression and decompression in OSB (opensearch-project/opensearch-benchmark#385). He suggested that we create compressed versions of the larger corpora, such as NYC Taxis, HTTP Logs, and Big5.

What solution would you like?

Create a compressed version of each of these workloads along the following lines:

  1. Use a virtual machine and run OSB against a cluster for workloads such as nyc_taxis, http_logs, and big5.
  2. Compress the files for each workload in ~/.benchmark/benchmarks/data (a sketch is included below).
  3. Upload the compressed files to shareable cloud storage (such as S3).
  4. Share the links with the maintainers.
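For reference, here's a rough sketch of what step 2 could look like in Python, assuming the `zstandard` package is installed; the data directory layout and the `*.json` glob are illustrative assumptions, not OSB's guaranteed structure:

```python
from pathlib import Path

import zstandard as zstd

# Illustrative location; OSB downloads corpora under ~/.benchmark/benchmarks/data
DATA_DIR = Path.home() / ".benchmark" / "benchmarks" / "data"


def compress_corpus(src: Path) -> Path:
    """Stream-compress src to src + '.zst' without loading the file into memory."""
    dst = src.parent / (src.name + ".zst")
    cctx = zstd.ZstdCompressor(level=22)  # 22 is zstd's maximum compression level
    with src.open("rb") as fin, dst.open("wb") as fout:
        cctx.copy_stream(fin, fout)
    return dst


if __name__ == "__main__":
    # Assumes corpus documents are uncompressed .json files
    for corpus in DATA_DIR.glob("**/*.json"):
        print("wrote", compress_corpus(corpus))
```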
@IanHoang IanHoang added enhancement New feature or request untriaged labels Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] [FEATURE] Add ZSTD Compressed Files of NYC Taxis, HTTP_Logs, and Big5 Workloads Jul 25, 2024
@IanHoang IanHoang added good first issue Good for newcomers and removed untriaged labels Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] Add ZSTD Compressed Files of NYC Taxis, HTTP_Logs, and Big5 Workloads [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP_Logs, and Big5 Workloads Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP_Logs, and Big5 Workloads [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP Logs, and Big5 Workloads Jul 25, 2024

OVI3D0 commented Oct 1, 2024

I can take this one on 🖐️

@IanHoang IanHoang moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Oct 1, 2024
@OVI3D0 OVI3D0 moved this from 🏗 In progress to 👀 In review in Engineering Effectiveness Board Oct 15, 2024
@OVI3D0 OVI3D0 moved this from 👀 In review to 🏗 In progress in Engineering Effectiveness Board Oct 15, 2024

OVI3D0 commented Dec 2, 2024

I decided to run some tests comparing the current BZ2 compression format with ZSTD and LZ4.

I used OSB's expand-data-corpora script to create 60 GB, 100 GB, 300 GB, and 500 GB files from the http_logs workload. I then set up four c5.large EC2 instances, a commonly used instance type with the fewest CPU cores. Each instance contained the compressed versions of each file in BZ2, ZSTD, and LZ4 formats. The maximum compression level available was used for each format, since keeping file sizes small reduces download times.

I then used each EC2 instance to compare the time taken to decompress each format at a given file size: one instance compared all the 60 GB files, one compared all the 100 GB files, and so on. The tests were run single-threaded. BZ2 and ZSTD currently do not support multi-threaded decompression. LZ4 recently added support for multi-threaded decompression, but it is not widely available yet and does not make a large difference for decompression (LZ4 claims a 60% decompression speed boost using multiple cores).
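For anyone who wants to reproduce this, a minimal sketch of the single-threaded timing loop might look like the following, assuming the `zstandard` and `lz4` Python packages are installed; the file names are placeholders:

```python
import bz2
import time

import lz4.frame
import zstandard as zstd

CHUNK = 1 << 20  # read 1 MiB at a time


def time_stream(open_fn, path):
    """Stream-decompress path via open_fn, discard the output, return seconds."""
    start = time.perf_counter()
    with open_fn(path) as f:
        while f.read(CHUNK):
            pass
    return time.perf_counter() - start


def time_zstd(path):
    """zstandard wraps a raw file object rather than opening by name."""
    start = time.perf_counter()
    with open(path, "rb") as raw:
        reader = zstd.ZstdDecompressor().stream_reader(raw)
        while reader.read(CHUNK):
            pass
    return time.perf_counter() - start


print("bz2 :", time_stream(bz2.open, "http_logs-60gb.json.bz2"))
print("lz4 :", time_stream(lz4.frame.open, "http_logs-60gb.json.lz4"))
print("zstd:", time_zstd("http_logs-60gb.json.zst"))
```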

Here were my results:

| Original file size | BZ2 compressed size | ZSTD compressed size | LZ4 compressed size |
|---|---|---|---|
| 60 GB | 2.34 GB | 2.84 GB | 5.53 GB |
| 100 GB | 3.82 GB | 4.66 GB | 9.03 GB |
| 300 GB | 10.93 GB | 13.44 GB | 25.85 GB |
| 500 GB | 17.94 GB | 22.01 GB | 42.43 GB |

| Original file size | BZ2 decompression time (s) | ZSTD decompression time (s) | LZ4 decompression time (s) |
|---|---|---|---|
| 60 GB | 1451.87 | 590.19 | 1145.35 |
| 100 GB | 2302.63 | 1039.74 | 2005.86 |
| 300 GB | 6729.51 | 2773.78 | 5371.01 |
| 500 GB | 11456.41 | 4414.46 | 8371.00 |

So, on average, ZSTD decompression seems to outperform LZ4 by roughly 2x and BZ2 by roughly 3x. The ZSTD-compressed files are about 1.2x the size of the BZ2-compressed files, so there is a trade-off there, but I believe the faster decompression speeds more than make up for it.


OVI3D0 commented Dec 4, 2024

Another thing to note is that we could also use pzstd (parallelized ZSTD) for even faster speeds.
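As a rough illustration, the `zstandard` package already exposes multi-threaded compression via a `threads` parameter; the file names below are placeholders, and note this parallelizes compression only, since pzstd's parallel decompression relies on archives that pzstd itself wrote as independent frames:

```python
import zstandard as zstd

# threads=-1 asks for one worker per logical CPU; this speeds up
# *compression* only. Parallel decompression with pzstd requires
# archives written by pzstd as a series of independent frames.
cctx = zstd.ZstdCompressor(level=19, threads=-1)
with open("documents.json", "rb") as fin, open("documents.json.zst", "wb") as fout:
    cctx.copy_stream(fin, fout)
```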

Also, these tests were run on data derived from the http_logs workload, which is text, like most of the available workloads. The same tests were also run on other workloads such as percolator, big5, and geonames, with similar results (geonames actually decompressed much more than 3x faster in ZSTD format).

Overall, I think the next steps should be to compress the text-based workloads we currently have into ZSTD format, keeping the BZ2 versions as a backwards-compatible fallback. We can then measure and document the improvement in overall benchmark runtime from ZSTD decompression.

I also think we should consider whether parallelized ZSTD can be leveraged for further performance improvements.

@OVI3D0 OVI3D0 moved this from 🏗 In progress to 👀 In Review in Engineering Effectiveness Board Jan 7, 2025
@OVI3D0 OVI3D0 moved this from 👀 In Review to ✅ Done in Engineering Effectiveness Board Jan 21, 2025