
[FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP Logs, and Big5 Workloads #357

Open
IanHoang opened this issue Jul 25, 2024 · 3 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)


IanHoang commented Jul 25, 2024

Is your feature request related to a problem?

A while back, @beaioun added support for ZSTD compression and decompression in OSB (opensearch-project/opensearch-benchmark#385). He suggested that we create compressed versions of the larger corpora, such as NYC Taxis, HTTP Logs, and Big5.

What solution would you like?

Create a compressed version of each of these workloads along the following lines:

  1. Use a virtual machine and run OSB against a cluster for workloads such as nyc_taxis, http_logs, and big5.
  2. Compress the files for each workload in ~/.benchmark/benchmarks/data (a sketch is included below).
  3. Upload the compressed files to shareable cloud storage (such as S3).
  4. Share the links with the maintainers.
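For reference, here's a rough sketch of what step 2 could look like in Python, assuming the `zstandard` package is installed; the data directory layout and the `*.json` glob are illustrative assumptions, not OSB's guaranteed structure:

```python
from pathlib import Path

import zstandard as zstd

# Illustrative location; OSB downloads corpora under ~/.benchmark/benchmarks/data
DATA_DIR = Path.home() / ".benchmark" / "benchmarks" / "data"


def compress_corpus(src: Path) -> Path:
    """Stream-compress src to src + '.zst' without loading the file into memory."""
    dst = src.parent / (src.name + ".zst")
    cctx = zstd.ZstdCompressor(level=22)  # 22 is zstd's maximum compression level
    with src.open("rb") as fin, dst.open("wb") as fout:
        cctx.copy_stream(fin, fout)
    return dst


if __name__ == "__main__":
    # Assumes corpus documents are uncompressed .json files
    for corpus in DATA_DIR.glob("**/*.json"):
        print("wrote", compress_corpus(corpus))
```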
@IanHoang IanHoang added enhancement New feature or request untriaged labels Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] [FEATURE] Add ZSTD Compressed Files of NYC Taxis, HTTP_Logs, and Big5 Workloads Jul 25, 2024
@IanHoang IanHoang added good first issue Good for newcomers and removed untriaged labels Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] Add ZSTD Compressed Files of NYC Taxis, HTTP_Logs, and Big5 Workloads [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP_Logs, and Big5 Workloads Jul 25, 2024
@IanHoang IanHoang changed the title [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP_Logs, and Big5 Workloads [FEATURE] Add ZSTD Compressed Corpora of NYC Taxis, HTTP Logs, and Big5 Workloads Jul 25, 2024

OVI3D0 commented Oct 1, 2024

I can take this one on 🖐️

@IanHoang IanHoang moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Oct 1, 2024
@OVI3D0 OVI3D0 moved this from 🏗 In progress to 👀 In review in Engineering Effectiveness Board Oct 15, 2024
@OVI3D0 OVI3D0 moved this from 👀 In review to 🏗 In progress in Engineering Effectiveness Board Oct 15, 2024

OVI3D0 commented Dec 2, 2024

I decided to run some tests comparing the current BZ2 compression format with ZSTD and LZ4.

I used OSB's expand-data-corpora script to create 60 GB, 100 GB, 300 GB, and 500 GB files from the http_logs workload. I then set up four c5.large EC2 instances, a commonly used instance type with the fewest CPU cores. Each instance contained the compressed versions of each file in BZ2, ZSTD, and LZ4 formats. The maximum compression level available was used for each format, since keeping file sizes small reduces download times.

I then used each EC2 instance to compare the time taken to decompress each format at a given file size: one instance compared all the 60 GB files, one compared all the 100 GB files, and so on. The tests were run single-threaded. BZ2 and ZSTD currently do not support multi-threaded decompression. LZ4 recently added support for multi-threaded decompression, but it is not widely available yet and does not make a large difference for decompression (LZ4 claims a 60% decompression speed boost using multiple cores).
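For anyone who wants to reproduce this, a minimal sketch of the single-threaded timing loop might look like the following, assuming the `zstandard` and `lz4` Python packages are installed; the file names are placeholders:

```python
import bz2
import time

import lz4.frame
import zstandard as zstd

CHUNK = 1 << 20  # read 1 MiB at a time


def time_stream(open_fn, path):
    """Stream-decompress path via open_fn, discard the output, return seconds."""
    start = time.perf_counter()
    with open_fn(path) as f:
        while f.read(CHUNK):
            pass
    return time.perf_counter() - start


def time_zstd(path):
    """zstandard wraps a raw file object rather than opening by name."""
    start = time.perf_counter()
    with open(path, "rb") as raw:
        reader = zstd.ZstdDecompressor().stream_reader(raw)
        while reader.read(CHUNK):
            pass
    return time.perf_counter() - start


print("bz2 :", time_stream(bz2.open, "http_logs-60gb.json.bz2"))
print("lz4 :", time_stream(lz4.frame.open, "http_logs-60gb.json.lz4"))
print("zstd:", time_zstd("http_logs-60gb.json.zst"))
```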

Here were my results:

| Original file size | BZ2 compressed size | ZSTD compressed size | LZ4 compressed size |
|---|---|---|---|
| 60 GB | 2.34 GB | 2.84 GB | 5.53 GB |
| 100 GB | 3.82 GB | 4.66 GB | 9.03 GB |
| 300 GB | 10.93 GB | 13.44 GB | 25.85 GB |
| 500 GB | 17.94 GB | 22.01 GB | 42.43 GB |

| Original file size | BZ2 decompression time (s) | ZSTD decompression time (s) | LZ4 decompression time (s) |
|---|---|---|---|
| 60 GB | 1451.87 | 590.19 | 1145.35 |
| 100 GB | 2302.63 | 1039.74 | 2005.86 |
| 300 GB | 6729.51 | 2773.78 | 5371.01 |
| 500 GB | 11456.41 | 4414.46 | 8371.00 |

So, on average, ZSTD decompression seems to outperform LZ4 by roughly 2x and BZ2 by roughly 3x. The ZSTD-compressed files are about 1.2x the size of the BZ2-compressed files, so there is a trade-off there, but I believe the faster decompression speeds more than make up for it.


OVI3D0 commented Dec 4, 2024

Another thing to note is that we could also use pzstd (parallelized ZSTD) for even faster speeds.
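As a rough illustration, the `zstandard` package already exposes multi-threaded compression via a `threads` parameter; the file names below are placeholders, and note this parallelizes compression only, since pzstd's parallel decompression relies on archives that pzstd itself wrote as independent frames:

```python
import zstandard as zstd

# threads=-1 asks for one worker per logical CPU; this speeds up
# *compression* only. Parallel decompression with pzstd requires
# archives written by pzstd as a series of independent frames.
cctx = zstd.ZstdCompressor(level=19, threads=-1)
with open("documents.json", "rb") as fin, open("documents.json.zst", "wb") as fout:
    cctx.copy_stream(fin, fout)
```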

Also, these tests were run on data derived from the http_logs workload, which is text, like most of the available workloads. The same tests were also run on other workloads such as percolator, big5, and geonames, with similar results (geonames actually decompressed much more than 3x faster in ZSTD format).

Overall, I think the next steps should be to compress the text-based workloads we currently have into ZSTD format, keeping the BZ2 versions as a backwards-compatible fallback. We can then measure and document the improvement in overall benchmark runtime from ZSTD decompression.

I also think we should consider whether parallelized ZSTD can be leveraged for further performance improvements.

@OVI3D0 OVI3D0 moved this from 🏗 In progress to 👀 In Review in Engineering Effectiveness Board Jan 7, 2025
@OVI3D0 OVI3D0 moved this from 👀 In Review to ✅ Done in Engineering Effectiveness Board Jan 21, 2025