Investigate parquet performance improvements #137
Closing for now, as the following is currently true:
@candiduslynx I don't agree that this is really a reason to close the issue. Yes, those libraries are different, but this issue is not saying we should migrate away from Arrow Parquet; it's saying we observed a ~10x performance regression in direct benchmarks after migrating to the Arrow parquet library, compared to xitongsys/parquet-go, and we should look into why there is such a large performance gap and find out whether anything can be done to improve it in the upstream Arrow Parquet library.
@hermanschaaf I do agree that we might want to track this in more detail, but I propose doing so only after we get reports from users, as going down the rabbit hole of tracking why the write is slower will require a bit of hacking.
This PR has some findings about segmentio's parquet package (it wasn't looking great at the time).
@candiduslynx Let's leave it open if you don't mind; we don't have to do it now, but I think it's worth doing a comparative benchmark at some point that we can share with the upstream Arrow maintainers for feedback. We kind of have one already in https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L70-L98, which was previously working with xitongsys/parquet-go; for a 1:1 comparison we could use only very basic types like int and string to keep it simple. If there's still a difference, we can report it upstream.
Closes #137

#### Benchmarks

Performed by the following command in the `parquet` dir:

```sh
go test \
  -test.run=BenchmarkWrite \
  -test.bench=BenchmarkWrite \
  -test.count 10 -test.benchmem -test.benchtime 10000x
```

<details><summary>Before this update</summary>

```
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10    10000    4628263 ns/op    5796480 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4480788 ns/op    5796474 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4591783 ns/op    5796471 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4610580 ns/op    5796477 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4524806 ns/op    5796473 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4557667 ns/op    5796466 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4869530 ns/op    5796476 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4894571 ns/op    5796474 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4700499 ns/op    5796468 B/op    44245 allocs/op
BenchmarkWrite-10    10000    4793868 ns/op    5796473 B/op    44245 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      539.889s
```

</details>

<details><summary>After this update</summary>

```
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10    10000    923740 ns/op    1146573 B/op    15695 allocs/op
BenchmarkWrite-10    10000    970047 ns/op    1146193 B/op    15695 allocs/op
BenchmarkWrite-10    10000    920979 ns/op    1146542 B/op    15695 allocs/op
BenchmarkWrite-10    10000    923738 ns/op    1146486 B/op    15695 allocs/op
BenchmarkWrite-10    10000    918581 ns/op    1146055 B/op    15694 allocs/op
BenchmarkWrite-10    10000    906547 ns/op    1146690 B/op    15695 allocs/op
BenchmarkWrite-10    10000    912946 ns/op    1146381 B/op    15695 allocs/op
BenchmarkWrite-10    10000    921024 ns/op    1146378 B/op    15695 allocs/op
BenchmarkWrite-10    10000    905637 ns/op    1146371 B/op    15695 allocs/op
BenchmarkWrite-10    10000    919410 ns/op    1146494 B/op    15695 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      158.831s
```

</details>
Recent benchmarks on v2 show that performance of the Parquet writer has degraded compared to v1 (which used a different Parquet library). We should investigate and see how performance can be improved; this may need to happen upstream in Arrow.
NB: this requires a lot of investigation work, so it should only be picked up when nothing of higher priority is pending.