Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate parquet performance improvements #137

Closed
hermanschaaf opened this issue Apr 20, 2023 · 5 comments · Fixed by #203
Closed

Investigate parquet performance improvements #137

hermanschaaf opened this issue Apr 20, 2023 · 5 comments · Fixed by #203
Assignees

Comments

@hermanschaaf
Copy link
Member

hermanschaaf commented Apr 20, 2023

Recent benchmarks on v2 show that performance of the Parquet writer has degraded when compared to v1 (which used a different Parquet library). We should investigate and see how performance can be improved--this may need to happen upstream in Arrow.

NB: this requires a lot of investigation work, so needs to be approached if there's nothing with higher priority.

@yevgenypats yevgenypats moved this to Ready in Cloud Governance May 8, 2023
@candiduslynx candiduslynx self-assigned this Jun 9, 2023
@candiduslynx candiduslynx moved this from Ready to Assigned in Cloud Governance Jun 9, 2023
@candiduslynx candiduslynx removed their assignment Jun 9, 2023
@candiduslynx candiduslynx moved this from Assigned to Ready in Cloud Governance Jun 9, 2023
@candiduslynx candiduslynx self-assigned this Jun 9, 2023
@candiduslynx candiduslynx moved this from Ready to Assigned in Cloud Governance Jun 9, 2023
@candiduslynx
Copy link
Contributor

Closing for now, as the following is currently true:

  1. The lib (xitongsys/parquet-go) we used previously is no longer actively supported, and it doesn't support all required Apache Arrow types
  2. Another lib (segmentio/parquet-go) that's out there for Parquet in Go requires concrete structs, so creating schema from arrow schema & reconstructing data into structs might be a rabbit hole on its own

@candiduslynx candiduslynx closed this as not planned Won't fix, can't repro, duplicate, stale Jun 14, 2023
@github-project-automation github-project-automation bot moved this from Assigned to ✅ Done in Cloud Governance Jun 14, 2023
@hermanschaaf
Copy link
Member Author

hermanschaaf commented Jun 14, 2023

@candiduslynx I don't agree that's really a reason to close this? Yes, those libraries are different, but this issue is not saying we should migrate away from Arrow Parquet; it's saying we observed a ~10x performance regression in direct benchmarks after migrating to the Arrow parquet library, compared to xitongsys/parquet-go, and we should look into why there is such a large performance gap and find out if there is anything that can be done to improve it in the upstream Arrow Parquet library.

@candiduslynx
Copy link
Contributor

@hermanschaaf I do agree that we might want to track this in more detail, but I propose doing this only after the reports from the users, as the rabbit hole of tracking why the write is slower will require a bit of hacking.

@disq
Copy link
Member

disq commented Jun 14, 2023

This PR has some findings about segmentio's parquet package (it wasn't looking great at the time)

@candiduslynx candiduslynx reopened this Jun 14, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to Need Engineering Input in Cloud Governance Jun 14, 2023
@candiduslynx candiduslynx removed their assignment Jun 14, 2023
@hermanschaaf
Copy link
Member Author

@candiduslynx Let's leave it open if you don't mind; we don't have to do it now, but I think it's worth doing a comparative benchmark at some point that we can share with the upstream Arrow maintainers for feedback. We kind-of have one already in https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L70-L98, which was previously working with xitongsys/parquet-go; for a 1:1 comparison we can maybe use only very basic types like int and string to keep it simple. If there's still a difference we can report it upstream

@candiduslynx candiduslynx self-assigned this Jun 14, 2023
@candiduslynx candiduslynx moved this from Need Engineering Input to 👀 In review in Cloud Governance Jun 14, 2023
@kodiakhq kodiakhq bot closed this as completed in #203 Jun 14, 2023
kodiakhq bot pushed a commit that referenced this issue Jun 14, 2023
Closes #137 

#### Benchmarks
Performed by the following command in `parquet` dir:
```sh
go test \
  -test.run=BenchmarkWrite \
  -test.bench=BenchmarkWrite \
-test.count 10 -test.benchmem -test.benchtime 10000x
```

<details><summary>Before this update</summary>

```
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10          10000           4628263 ns/op         5796480 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4480788 ns/op         5796474 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4591783 ns/op         5796471 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4610580 ns/op         5796477 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4524806 ns/op         5796473 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4557667 ns/op         5796466 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4869530 ns/op         5796476 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4894571 ns/op         5796474 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4700499 ns/op         5796468 B/op      44245 allocs/op
BenchmarkWrite-10          10000           4793868 ns/op         5796473 B/op      44245 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      539.889s
```

</details> 

<details><summary>After this update</summary>

```
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10          10000            923740 ns/op         1146573 B/op      15695 allocs/op
BenchmarkWrite-10          10000            970047 ns/op         1146193 B/op      15695 allocs/op
BenchmarkWrite-10          10000            920979 ns/op         1146542 B/op      15695 allocs/op
BenchmarkWrite-10          10000            923738 ns/op         1146486 B/op      15695 allocs/op
BenchmarkWrite-10          10000            918581 ns/op         1146055 B/op      15694 allocs/op
BenchmarkWrite-10          10000            906547 ns/op         1146690 B/op      15695 allocs/op
BenchmarkWrite-10          10000            912946 ns/op         1146381 B/op      15695 allocs/op
BenchmarkWrite-10          10000            921024 ns/op         1146378 B/op      15695 allocs/op
BenchmarkWrite-10          10000            905637 ns/op         1146371 B/op      15695 allocs/op
BenchmarkWrite-10          10000            919410 ns/op         1146494 B/op      15695 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      158.831s
```

</details>
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cloud Governance Jun 14, 2023
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this issue Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants