Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make streaming opt-in #3170

Open
ion-elgreco opened this issue Jan 28, 2025 · 6 comments · May be fixed by #3178
Open

Make streaming opt-in #3170

ion-elgreco opened this issue Jan 28, 2025 · 6 comments · May be fixed by #3178
Labels
binding/python Issues for the Python package enhancement New feature or request good first issue Good for newcomers

Comments

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Jan 28, 2025

Description

Use Case
Currently I've enabled streaming when we see a potentially streaming source, I think we should add streaming parameter in python to merge and let users configure it. This makes it then possible to disable it or set it for sources that are not streamed but might still benefit from the streamed execution.

We should also raise an experimental warning when it's true and say users should share positive or negative findings :)

Related Issue(s)

@ion-elgreco ion-elgreco added enhancement New feature or request good first issue Good for newcomers binding/python Issues for the Python package labels Jan 28, 2025
@rtyler
Copy link
Member

rtyler commented Jan 28, 2025

@ion-elgreco why not make it the default?

@ion-elgreco
Copy link
Collaborator Author

@rtyler non-streamed might be better in some cases since you can use min max statistics from the incoming source data to prune the target table.

Haven't found a way to do this during streaming

@ion-elgreco ion-elgreco linked a pull request Feb 2, 2025 that will close this issue
@rtyler
Copy link
Member

rtyler commented Feb 3, 2025

@ion-elgreco I am really not liking the parameter bloat that our Python functions have already, and the naming on this one is going to be very confusing. I know what is meant here, but for most people I believe streaming is going to be misconstrued as something relating to data stream processing a la Flink or Spark Structured Streaming.

If there is a heuristic which would determine negative performance like you mention, is it possible at the Python layer to determine whether the predicate or desired merge will result in poorer performance with LazyMemoryExec and instead disable it automatically?

Basically, I think the optimization is a really good one and should be on by default, but best would be to disable it in scenarios where there's known sub-optimal behavior

@ion-elgreco
Copy link
Collaborator Author

I get what you mean, but as a counter argument to keep it as streaming or rename it to streamed_execution, Polars has a streaming engine, it doesn't take in streamed data but processes the data in a streamed fashion. So perhaps streamed_execution would be clearer here.

If there is a heuristic which would determine negative performance like you mention

I don't think we can. This is really use-case specific, that's why I want to leave it to the users to do the testing. With many concurrent writers that write to predicates and disjoint sets, it could be useful to disable streaming because min-max stats in this case might be useful to deduce smaller disjoint sets

@rtyler
Copy link
Member

rtyler commented Feb 3, 2025

If a parameter must be had, then I agree that streamed_exec (or some flavor thereof) is better.

How would a user know when to enable or disable this flag? I have no idea how I would be be able to tell whether min/max stats are being used properly for the execution of my merge. I figure I would just say "hm, I don't like this performance" and then just toggle flags until I'm happy again? (I will be toggling for a long time then 😆 )

@ion-elgreco
Copy link
Collaborator Author

If a parameter must be had, then I agree that streamed_exec (or some flavor thereof) is better.

How would a user know when to enable or disable this flag? I have no idea how I would be be able to tell whether min/max stats are being used properly for the execution of my merge. I figure I would just say "hm, I don't like this performance" and then just toggle flags until I'm happy again? (I will be toggling for a long time then 😆 )

I think it makes sense to disable streamed_exec for the type of use case where your min max stats are used to do further partition pruning where you didn't provide an explicit partition predicate. In that case having streaming False can help reduce the amount of partitions read, and potential conflicts when you have parallel writers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants