[Dataset] Improve str/repr of Dataset to include execution plan #31604
Signed-off-by: Cheng Su <scnju13@gmail.com>
Could we also update the rst docs?
@ericl - yeah, I plan to do that in this same PR, if reviewers have no objections to the string representation.
Just a suggestion, but I think it would be nicer to keep the plan history even after |
@stephanie-wang - I thought about it before. The only thing I am worried about is that the plan gets super long after multiple calls, assuming users only care about the latest Dataset. WDYT? @ericl, @clarkzinzow and @jianoaix. An alternative is to add |
It sounds good to me to have a separate API to display the plan. The repr is used quite often and I think it's too much detail for a simple |
Hmm, for the caching thing I think we should hide the plan if the Dataset is fully independent of the previous stages. If it still has a hidden reference, we should show those previous stages. This might matter since the serialization behavior of the two cases could be different.
I'm going to just merge this, since I think it's a reasonable first step. We can discuss further refinements on a longer timescale.
Signed-off-by: Cheng Su <scnju13@gmail.com>
Why are these changes needed?
This is a followup to #31286. We want to improve `Dataset.__repr__()` to provide more useful information to users, given that lazy execution is the default behavior.

The change is to include the execution plan (stages as a tree) in `Dataset.__repr__()`. Currently each stage only has its stage name printed out. We shall add more information per stage/operator in the future, which is orthogonal to this PR. This PR just prints out the existing information we have.

Example:
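The idea of printing stages as a tree can be illustrated with a minimal, self-contained sketch. This is not Ray's implementation and not the exact output produced by this PR: `ToyExecutionPlan`, `ToyDataset`, `with_stage()`, and the `+- ` indentation format are hypothetical names and choices made for illustration. Only the method name `get_plan_as_string()` and the idea of `__repr__` delegating to the plan are taken from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyExecutionPlan:
    """Hypothetical stand-in for an execution plan: a source plus stage names."""

    source: str
    stages: List[str] = field(default_factory=list)

    def with_stage(self, name: str) -> "ToyExecutionPlan":
        # Return a new plan so that earlier datasets keep their own, shorter plan.
        return ToyExecutionPlan(self.source, self.stages + [name])

    def get_plan_as_string(self) -> str:
        # Latest stage first; each earlier stage is indented one level deeper,
        # and the source dataset ends up as the innermost line.
        names = list(reversed(self.stages)) + [self.source]
        lines = [names[0]]
        for depth, name in enumerate(names[1:]):
            lines.append("   " * depth + "+- " + name)
        return "\n".join(lines)


class ToyDataset:
    """Hypothetical lazy dataset whose repr is just its plan rendered as a tree."""

    def __init__(self, plan: ToyExecutionPlan):
        self._plan = plan

    def apply(self, stage_name: str) -> "ToyDataset":
        # Lazy: only record the stage name, nothing is executed here.
        return ToyDataset(self._plan.with_stage(stage_name))

    def __repr__(self) -> str:
        # Delegate directly to the plan's string representation.
        return self._plan.get_plan_as_string()


ds = ToyDataset(ToyExecutionPlan("Dataset(num_rows=10)")).apply("Map").apply("Filter")
print(repr(ds))
```

In this sketch every additional stage adds one line to the repr, which is the growth in length that the review discussion is concerned about.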
The code change includes:
- `ExecutionPlan.get_plan_as_string()` to get the string representation above for the plan.
- `ExecutionPlan` - `_get_unified_blocks_schema()` and `_get_num_rows_from_blocks_metadata()`
- `Dataset.__repr__` to call `ExecutionPlan.get_plan_as_string()` directly.

Related issue number
Closes #31417
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.