[DISCUSSION] Challenge: Make DataFusion the fastest engine in ClickBench with custom file format #13448

Open
alamb opened this issue Nov 16, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 16, 2024

Is your feature request related to a problem or challenge?

This is a crazy idea

Now that DataFusion is the fastest engine for Parquet in ClickBench, a natural follow-on question is “what would it take to make it the fastest overall engine?”

Describe the solution you'd like

TL;DR: I think it needs a special file format. A custom format is fine and consistent with other systems in ClickBench, which use various proprietary formats.

So, as a fascinating experiment / academic project, someone could design / hack up a “ClickBench” file format and DataFusion TableProvider, specifically designed for getting the fastest ClickBench results.

While I suspect this format would not be particularly general purpose, I think it would show how easy it is to make custom formats for particular use cases with DataFusion (you don’t have to worry about all the rest of the query engine machinery).

Describe alternatives you've considered

No response

Additional context

This was inspired by @pauldix talking about using DataFusion to innovate at “the edges” of database design https://twitter.com/pauldix/status/1855330035974160483

@alamb
Contributor Author

alamb commented Nov 17, 2024

BTW here is an example of how to create a custom file format in DataFusion: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/custom_file_format.rs

@alamb
Contributor Author

alamb commented Dec 8, 2024

It might be really interesting to use the Vortex file format: https://blog.spiraldb.com/trick-or-treating-with-vortex/

(which also integrates into DataFusion)

This file format uses a bunch of cutting-edge techniques like FSST, ALP, and FastLanes 🤔
