
Query with limit on s3 is not working optimally #21595

Open
AlonHarmon opened this issue Apr 17, 2024 · 3 comments
AlonHarmon commented Apr 17, 2024

When querying a Hive-partitioned table stored on S3 with a LIMIT clause, Trino loads all of the queried partitions and only then evaluates the LIMIT part of the query.
On huge partitioned tables this makes even the following simple query impossible:

select * from hive.schema.tablename limit 10;

Environment for reproduction (although it shouldn't matter):
Trino version - 436
Catalog - hive
Storage - s3 compatible ceph
Objects format - parquet

Do you think it would be possible to make the coordinator check every x seconds how many rows each task has retrieved, and then decide whether to abort the remaining tasks and return the combined results?

For queries with both filtering and a LIMIT, perhaps the same approach is possible, but applied only to the last query stage (where the LIMIT is evaluated).
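The polling-based early-abort idea above can be sketched as a toy simulation. This is a hypothetical illustration of the proposed mechanism, not Trino's actual coordinator or task API; all names (FakeTask, poll_until_limit) are invented for this sketch.

```python
# Hypothetical simulation of the proposal: a coordinator polls worker
# tasks for their retrieved row counts and cancels the remaining work
# once the combined count satisfies the LIMIT. Illustrative only --
# this is NOT Trino's actual implementation.

import time


class FakeTask:
    """Stand-in for a worker task scanning one partition."""

    def __init__(self, rows_per_tick: int):
        self.rows_per_tick = rows_per_tick
        self.rows_retrieved = 0
        self.cancelled = False

    def tick(self) -> None:
        # Simulate reading another batch of rows from storage.
        if not self.cancelled:
            self.rows_retrieved += self.rows_per_tick

    def cancel(self) -> None:
        self.cancelled = True


def poll_until_limit(tasks, limit, poll_interval=0.01):
    """Poll the tasks every poll_interval seconds; once the combined row
    count reaches the limit, cancel all tasks and return min(total, limit)."""
    while True:
        for t in tasks:
            t.tick()
        total = sum(t.rows_retrieved for t in tasks)
        if total >= limit:
            for t in tasks:
                t.cancel()
            return min(total, limit)
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Four simulated tasks, each producing 3 rows per poll cycle.
    tasks = [FakeTask(rows_per_tick=3) for _ in range(4)]
    print(poll_until_limit(tasks, limit=10))  # → 10
```

In a real engine the trade-off is the polling interval: a shorter interval aborts the scan sooner after the limit is met, at the cost of more coordinator-to-task traffic.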

hashhar (Member) commented Apr 25, 2024

Are you using FTE by any chance?
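For context, FTE here refers to fault-tolerant execution, which is controlled by the retry-policy property in the coordinator's config.properties. A minimal fragment, shown only to clarify what the question is asking about:

```properties
# config.properties (coordinator)
# retry-policy=NONE is the default (no fault-tolerant execution).
# QUERY or TASK enables FTE, which spools intermediate data through
# an exchange manager and can change how queries are scheduled.
retry-policy=TASK
```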

julienlau commented Jun 13, 2024

Are you suggesting that retry-policy=NONE can also impact performance?

I also observed large query-plan differences between Spark SQL and Trino on S3 for simple queries like select * from table limit 100;

julienlau commented:

I think you had this in mind:
#18862
