Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Move Ballista to new arrow-ballista repo #2502

Closed
5 tasks done
andygrove opened this issue May 10, 2022 · 16 comments
Closed
5 tasks done

[EPIC] Move Ballista to new arrow-ballista repo #2502

andygrove opened this issue May 10, 2022 · 16 comments
Assignees
Labels
development-process Related to development process of DataFusion enhancement New feature or request

Comments

@andygrove
Copy link
Member

andygrove commented May 10, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is a long-term goal and there are some steps to get there but I would like to discuss this with the community.

Rationale for this

  • Decouple release process for DataFusion and Ballista
  • Allow each project to have top-level documentation and user guides that are targeting the appropriate audience
  • Reduce issue tracking and PR review burden for DataFusion maintainers who are not as interested in Ballista
  • Help avoid accidental circular dependencies being introduced between the projects (such as datafusion-cli crate has circular dependency #2433)
  • Helps formalize the public API for DataFusion that other query engines should be using

Steps

  • Create new arrow-ballista repo
  • Push current arrow-datafusion code to new repo
  • Remove DataFusion code from new repo - Remove DataFusion crates datafusion-ballista#1
  • Set up CI in such a way that we are aware when changes are made to DataFusion that break compatibility with Ballista so that we can ask the contributor to avoid the breaking change, or prepare a PR against Ballista to accommodate the change
  • Update the contributing guide and GitHub PR templates

Design / Discussion Document

@andygrove andygrove added enhancement New feature or request ballista development-process Related to development process of DataFusion labels May 10, 2022
@andygrove
Copy link
Member Author

@alamb @tustvold Here is another proposal that I would like to get your opinion on

@yjshen
Copy link
Member

yjshen commented May 11, 2022

Cc @liukun4515 @yahoNanJing @mingmwang @thinkharderdev @gaojun2048 @realno as I could remember.

@realno
Copy link
Contributor

realno commented May 11, 2022

The proposal looks reasonable. +1. Looking forward to seeing the public API design.

@EricJoy2048
Copy link
Member

EricJoy2048 commented May 11, 2022

That's great!

@tustvold
Copy link
Contributor

No objections from me, but I'm possibly not the right person to ask... As you allude to, we will need to be more careful making breaking changes to DataFusion, but nothing insurmountable.

Reduce issue tracking and PR review burden for DataFusion maintainers who are not as interested in Ballista

We will need to ensure there is still sufficient review capacity for Ballista to thrive, I don't have a good feel for if this is a concern or not.

Allow each project to have top-level documentation and user guides that are targeting the appropriate audience

I like this a lot, the audiences for the projects are likely rather different 👍

@thinkharderdev
Copy link
Contributor

Sounds like a good idea. I'm happy to spend time helping review PRs for Ballista if review capacity is an issue.

@alamb
Copy link
Contributor

alamb commented May 11, 2022

I agree this would be a good step -- and help Ballista and DataFusion both to mature. I am fully supportive.

Thank you for the offer @thinkharderdev

@andygrove andygrove self-assigned this May 11, 2022
@andygrove
Copy link
Member Author

Thanks for the encouraging feedback.

I started a design doc where we can discuss the finer details. https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing

@andygrove
Copy link
Member Author

@liukun4515 @yahoNanJing @mingmwang Please take a look and let us know if you have feedback. We have started a vote on the mailing list to move forward with this proposal.

@yahoNanJing
Copy link
Contributor

Hi @andygrove, sorry for late response. I'm not opposed to move Ballista to a new repo. However, I still have some concerns.

  1. How to manage some features which needs changes in both datafusion and ballista, like configuration refactoring?
  2. For some features, like Morsel-driven parallel execution, it leverages the partition for morsel splitting. However, the partition is also used for task splitting in Ballista. The same concept may have conflicts between datafusion and ballista, if ballista is moved out and new features are introduced to datafusion. How should we avoid that?

@andygrove
Copy link
Member Author

Hi @yahoNanJing thanks for the input. The plan (detailed in the design document) is for the DataFusion CI checks to pull the Ballista repo and run the tests to prevent DataFusion making changes that break the Ballista tests.

@yahoNanJing
Copy link
Contributor

Thanks @andygrove and the elaborated design document. The proposal has covered many points to reduce the risk of broken changes.

One more suggestion is whether it's possible for us to define the public API before moving the Ballista to another top-level repository, or at least document the things may break ballista to let both datafusion and ballista developer to be aware if things change.

@andygrove
Copy link
Member Author

andygrove commented May 19, 2022

Hi @yahoNanJing

Well, we already have a documented public API - it is the one that shows up in docs.rs today for all of the DataFusion crates. However, we will continue to add new logical expressions and operators and those are often breaking changes. Also, there will likely be more changes to ExecutionPlan to support the new scheduler. I do think that we need to document this and I can do that. Also, we should ask DataFusion maintainers to create corresponding Ballista PRs for any breaking API changes. Adding the CI checks will alert us to which PRs would cause regressions in Ballista so we can be careful not to merge those ones without reviewing (and testing) the corresponding Ballsta PR.

What do you think?

@andygrove
Copy link
Member Author

The new repo is created: https://github.com/apache/arrow-ballista

I pushed the arrow-datafusion repo as of commit a08d26e

There is a PR up to remove the datafusion crates from the new repo: apache/datafusion-ballista#1

@andygrove
Copy link
Member Author

The next step is to review & merge #2582

@alamb
Copy link
Contributor

alamb commented May 24, 2022

I think it is merged -- I wonder if this epic is done?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of DataFusion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants