Investigate parquet performance improvements #137

hermanschaaf · 2023-04-20T13:02:31Z

Recent benchmarks on v2 show that performance of the Parquet writer has degraded when compared to v1 (which used a different Parquet library). We should investigate and see how performance can be improved--this may need to happen upstream in Arrow.

NB: this requires a lot of investigation work, so needs to be approached if there's nothing with higher priority.

candiduslynx · 2023-06-14T12:22:34Z

Closing for now, as the following is currently true:

The lib (xitongsys/parquet-go) we used previously is no longer actively supported, and it doesn't support all required Apache Arrow types
Another lib (segmentio/parquet-go) that's out there for Parquet in Go requires concrete structs, so creating schema from arrow schema & reconstructing data into structs might be a rabbit hole on its own

hermanschaaf · 2023-06-14T12:28:29Z

@candiduslynx I don't agree that's really a reason to close this? Yes, those libraries are different, but this issue is not saying we should migrate away from Arrow Parquet; it's saying we observed a ~10x performance regression in direct benchmarks after migrating to the Arrow parquet library, compared to xitongsys/parquet-go, and we should look into why there is such a large performance gap and find out if there is anything that can be done to improve it in the upstream Arrow Parquet library.

candiduslynx · 2023-06-14T12:32:56Z

@hermanschaaf I do agree that we might want to track this in more detail, but I propose doing this only after the reports from the users, as the rabbit hole of tracking why the write is slower will require a bit of hacking.

disq · 2023-06-14T12:34:13Z

This PR has some findings about segmentio's parquet package (it wasn't looking great at the time)

hermanschaaf · 2023-06-14T12:45:35Z

@candiduslynx Let's leave it open if you don't mind; we don't have to do it now, but I think it's worth doing a comparative benchmark at some point that we can share with the upstream Arrow maintainers for feedback. We kind-of have one already in https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L70-L98, which was previously working with xitongsys/parquet-go; for a 1:1 comparison we can maybe use only very basic types like int and string to keep it simple. If there's still a difference we can report it upstream

Closes #137 #### Benchmarks Performed by the following command in `parquet` dir: ```sh go test \ -test.run=BenchmarkWrite \ -test.bench=BenchmarkWrite \ -test.count 10 -test.benchmem -test.benchtime 10000x ``` <details><summary>Before this update</summary> ``` goos: darwin goarch: arm64 pkg: github.com/cloudquery/filetypes/v3/parquet BenchmarkWrite-10 10000 4628263 ns/op 5796480 B/op 44245 allocs/op BenchmarkWrite-10 10000 4480788 ns/op 5796474 B/op 44245 allocs/op BenchmarkWrite-10 10000 4591783 ns/op 5796471 B/op 44245 allocs/op BenchmarkWrite-10 10000 4610580 ns/op 5796477 B/op 44245 allocs/op BenchmarkWrite-10 10000 4524806 ns/op 5796473 B/op 44245 allocs/op BenchmarkWrite-10 10000 4557667 ns/op 5796466 B/op 44245 allocs/op BenchmarkWrite-10 10000 4869530 ns/op 5796476 B/op 44245 allocs/op BenchmarkWrite-10 10000 4894571 ns/op 5796474 B/op 44245 allocs/op BenchmarkWrite-10 10000 4700499 ns/op 5796468 B/op 44245 allocs/op BenchmarkWrite-10 10000 4793868 ns/op 5796473 B/op 44245 allocs/op PASS ok github.com/cloudquery/filetypes/v3/parquet 539.889s ``` </details> <details><summary>After this update</summary> ``` goos: darwin goarch: arm64 pkg: github.com/cloudquery/filetypes/v3/parquet BenchmarkWrite-10 10000 923740 ns/op 1146573 B/op 15695 allocs/op BenchmarkWrite-10 10000 970047 ns/op 1146193 B/op 15695 allocs/op BenchmarkWrite-10 10000 920979 ns/op 1146542 B/op 15695 allocs/op BenchmarkWrite-10 10000 923738 ns/op 1146486 B/op 15695 allocs/op BenchmarkWrite-10 10000 918581 ns/op 1146055 B/op 15694 allocs/op BenchmarkWrite-10 10000 906547 ns/op 1146690 B/op 15695 allocs/op BenchmarkWrite-10 10000 912946 ns/op 1146381 B/op 15695 allocs/op BenchmarkWrite-10 10000 921024 ns/op 1146378 B/op 15695 allocs/op BenchmarkWrite-10 10000 905637 ns/op 1146371 B/op 15695 allocs/op BenchmarkWrite-10 10000 919410 ns/op 1146494 B/op 15695 allocs/op PASS ok github.com/cloudquery/filetypes/v3/parquet 158.831s ``` </details>

Follow-up for cloudquery/filetypes#137

yevgenypats added this to Cloud Governance May 8, 2023

yevgenypats moved this to Ready in Cloud Governance May 8, 2023

candiduslynx self-assigned this Jun 9, 2023

candiduslynx moved this from Ready to Assigned in Cloud Governance Jun 9, 2023

candiduslynx removed their assignment Jun 9, 2023

candiduslynx moved this from Assigned to Ready in Cloud Governance Jun 9, 2023

candiduslynx self-assigned this Jun 9, 2023

candiduslynx moved this from Ready to Assigned in Cloud Governance Jun 9, 2023

candiduslynx closed this as not planned Won't fix, can't repro, duplicate, stale Jun 14, 2023

github-project-automation bot moved this from Assigned to ✅ Done in Cloud Governance Jun 14, 2023

candiduslynx reopened this Jun 14, 2023

github-project-automation bot moved this from ✅ Done to Need Engineering Input in Cloud Governance Jun 14, 2023

candiduslynx removed their assignment Jun 14, 2023

candiduslynx mentioned this issue Jun 14, 2023

fix: Use buffered writing mode in parquet #203

Merged

candiduslynx self-assigned this Jun 14, 2023

candiduslynx moved this from Need Engineering Input to 👀 In review in Cloud Governance Jun 14, 2023

kodiakhq bot closed this as completed in #203 Jun 14, 2023

github-project-automation bot moved this from 👀 In review to ✅ Done in Cloud Governance Jun 14, 2023

candiduslynx mentioned this issue Jun 14, 2023

fix: Write buffered parquet files cloudquery/cloudquery#11546

Merged

kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this issue Jun 14, 2023

fix: Write buffered parquet files (#11546)

77699d2

Follow-up for cloudquery/filetypes#137

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate parquet performance improvements #137

Investigate parquet performance improvements #137

hermanschaaf commented Apr 20, 2023 •

edited by candiduslynx

Loading

candiduslynx commented Jun 14, 2023

hermanschaaf commented Jun 14, 2023 •

edited

Loading

candiduslynx commented Jun 14, 2023

disq commented Jun 14, 2023

hermanschaaf commented Jun 14, 2023

Investigate parquet performance improvements #137

Investigate parquet performance improvements #137

Comments

hermanschaaf commented Apr 20, 2023 • edited by candiduslynx Loading

candiduslynx commented Jun 14, 2023

hermanschaaf commented Jun 14, 2023 • edited Loading

candiduslynx commented Jun 14, 2023

disq commented Jun 14, 2023

hermanschaaf commented Jun 14, 2023

hermanschaaf commented Apr 20, 2023 •

edited by candiduslynx

Loading

hermanschaaf commented Jun 14, 2023 •

edited

Loading