-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make streaming opt-in #3170
Comments
@ion-elgreco why not make it the default? |
@rtyler non-streamed might be better in some cases since you can use min max statistics from the incoming source data to prune the target table. Haven't found a way to do this during streaming |
@ion-elgreco I am really not liking the parameter bloat that our Python functions have already, and the naming on this one is going to be very confusing. I know what is meant here, but for most people I believe If there is a heuristic which would determine negative performance like you mention, is it possible at the Python layer to determine whether the predicate or desired merge will result in poorer performance with Basically, I think the optimization is a really good one and should be on by default, but best would be to disable it in scenarios where there's known sub-optimal behavior |
I get what you mean, but as a counter argument to keep it as streaming or rename it to
I don't think we can. This is really use-case specific, that's why I want to leave it to the users to do the testing. With many concurrent writers that write to predicates and disjoint sets, it could be useful to disable streaming because min-max stats in this case might be useful to deduce smaller disjoint sets |
If a parameter must be had, then I agree that How would a user know when to enable or disable this flag? I have no idea how I would be be able to tell whether min/max stats are being used properly for the execution of my merge. I figure I would just say "hm, I don't like this performance" and then just toggle flags until I'm happy again? (I will be toggling for a long time then 😆 ) |
I think it makes sense to disable streamed_exec for the type of use case where your min max stats are used to do further partition pruning where you didn't provide an explicit partition predicate. In that case having streaming False can help reduce the amount of partitions read, and potential conflicts when you have parallel writers. |
Description
Use Case
Currently I've enabled streaming when we see a potentially streaming source, I think we should add streaming parameter in python to merge and let users configure it. This makes it then possible to disable it or set it for sources that are not streamed but might still benefit from the streamed execution.
We should also raise an experimental warning when it's true and say users should share positive or negative findings :)
Related Issue(s)
The text was updated successfully, but these errors were encountered: