Presto Hive Connector creates too many small splits #21911
Comments
For most table formats, I'll bet approach 1 is the way to go. For example, Iceberg already stores column-level information on the size in bytes for each column per file, and it already knows what columns will be accessed from the file during planning time, so it should be possible to change Iceberg to avoid this edge case. I imagine it's the same for Delta and Hudi. For Hive, we'd need to build this inside of Presto. I wonder if the partition stats in the Hive metastore are sufficient for us to consistently make a good determination on the split size (i.e., are we losing anything by not knowing it per file).
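For illustration, here is a minimal sketch of that idea against Iceberg's metadata (not actual planner code; `projectedColumnIds`, `defaultSplitSize`, and the cap are assumed inputs). Iceberg's `DataFile#columnSizes()` exposes the on-disk size per column, which lets us compute the fraction of each file a query will actually read:

```java
import java.util.Map;
import java.util.Set;

import org.apache.iceberg.DataFile;

public final class ProjectedSplitSizing
{
    // Fraction of the file's bytes covered by the projected columns, based on
    // Iceberg's per-file column size metadata (DataFile#columnSizes()).
    static double projectedFraction(DataFile file, Set<Integer> projectedColumnIds)
    {
        Map<Integer, Long> columnSizes = file.columnSizes();
        if (columnSizes == null || columnSizes.isEmpty()) {
            return 1.0; // no stats available: fall back to whole-file sizing
        }
        long total = columnSizes.values().stream().mapToLong(Long::longValue).sum();
        long projected = columnSizes.entrySet().stream()
                .filter(entry -> projectedColumnIds.contains(entry.getKey()))
                .mapToLong(Map.Entry::getValue)
                .sum();
        return total == 0 ? 1.0 : (double) projected / total;
    }

    // Grow the split size in inverse proportion to the fraction of bytes read,
    // clamped so a near-zero fraction cannot produce absurdly large splits.
    static long effectiveSplitSize(long defaultSplitSize, double projectedFraction, long ceiling)
    {
        double fraction = Math.max(projectedFraction, 0.01);
        return Math.min((long) (defaultSplitSize / fraction), ceiling);
    }
}
```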
I think this solution can be effective and seems not too hard to implement.
@pranjalssh this will be a great issue to fix, as it helps current ad hoc queries as well as Prestissimo. So I say let's prioritize this.
A related problem: if the Hive table is badly laid out with a ton of tiny files, each with a few rows (happens quite a bit), we still produce a lot of splits. It would be interesting to see if making splits span across files (I believe Spark already has something like that) will fix that issue as well.
That would be nice. |
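For reference, a minimal sketch of the file-coalescing idea mentioned above (the `FileEntry` and `CombinedSplit` types are hypothetical stand-ins for the connector's own types; Spark's scheduler packs small files similarly, up to `spark.sql.files.maxPartitionBytes`):

```java
import java.util.ArrayList;
import java.util.List;

public final class FileCoalescer
{
    // Hypothetical stand-ins for the connector's file and split types.
    record FileEntry(String path, long sizeInBytes) {}
    record CombinedSplit(List<FileEntry> files) {}

    // Greedily pack files into combined splits of at most targetSplitBytes,
    // so a table with thousands of tiny files yields far fewer splits.
    static List<CombinedSplit> coalesce(List<FileEntry> files, long targetSplitBytes)
    {
        List<CombinedSplit> splits = new ArrayList<>();
        List<FileEntry> current = new ArrayList<>();
        long currentBytes = 0;
        for (FileEntry file : files) {
            if (!current.isEmpty() && currentBytes + file.sizeInBytes() > targetSplitBytes) {
                splits.add(new CombinedSplit(current));
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(file);
            currentBytes += file.sizeInBytes();
        }
        if (!current.isEmpty()) {
            splits.add(new CombinedSplit(current));
        }
        return splits;
    }
}
```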
Closed #21911 as completed via #22051.
We have a table with wide columns (maps), and a query over it takes a long time (3.6 min).
Meta internal query id: 20240212_231123_91020_7kizs
Using the session property hive.file_splittable=false makes the same query take 10s! Meta internal query id: 20240212_230955_07056_7jgdp
Setting this property to false makes Presto create exactly one split per file in the table; when it is true (the default), the Presto scheduler may create multiple splits from the same file.
This happens because the Presto scheduler sizes splits according to file size, without taking into account that we may read only selected columns from the file. The query above is an extreme case where we read no column data at all, just metadata. In general, we should be able to tune split sizes based on the amount of data we actually select from the files, so that we create fewer splits and Presto runs faster.
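To make the arithmetic concrete, a toy sketch of the file-size-based behavior described above (not the scheduler's actual code):

```java
public final class SplitMath
{
    // Current behavior (sketch): split count depends only on file size,
    // not on how much of the file the query will actually read.
    static long splitCount(long fileSizeBytes, long maxSplitBytes)
    {
        return (fileSizeBytes + maxSplitBytes - 1) / maxSplitBytes;
    }

    public static void main(String[] args)
    {
        // A 1 GB file with a 64 MB max split size yields 16 splits,
        // even if the query reads only metadata or a tiny column.
        System.out.println(splitCount(1L << 30, 64L << 20)); // 16
    }
}
```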
There is an orthogonal problem where Presto is slow to process a large number of splits; that needs to be looked at in the future.
This issue focuses on increasing split sizes when possible, so queries can run with fewer splits. There are two ways to fix this (sketches of both follow the list):
1. Use column stats from the metastore to estimate the relative portion of each file that we will actually read, and size splits according to that, not the whole file size.
2. Make it adaptive: if workers report that they didn't read much from the files when processing splits, increase split sizes.
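A minimal sketch of option 1, assuming we can derive per-column byte estimates from metastore stats (the `estimatedColumnBytes` map is a hypothetical input, e.g. average column length times row count from partition stats):

```java
import java.util.Map;
import java.util.Set;

public final class StatsBasedSplitSizing
{
    // estimatedColumnBytes: hypothetical per-column byte estimates for the
    // partition, e.g. derived from metastore stats (avg length x row count).
    static long effectiveMaxSplitSize(
            Map<String, Long> estimatedColumnBytes,
            Set<String> selectedColumns,
            long defaultMaxSplitSize,
            long maxSplitSizeCeiling)
    {
        long total = estimatedColumnBytes.values().stream().mapToLong(Long::longValue).sum();
        long selected = estimatedColumnBytes.entrySet().stream()
                .filter(entry -> selectedColumns.contains(entry.getKey()))
                .mapToLong(Map.Entry::getValue)
                .sum();
        if (total == 0 || selected == 0) {
            // Reading no column data (e.g. metadata-only): use the largest splits allowed.
            return maxSplitSizeCeiling;
        }
        double fraction = (double) selected / total;
        // Grow the split size in inverse proportion to the fraction of bytes read.
        long scaled = (long) (defaultMaxSplitSize / fraction);
        return Math.min(scaled, maxSplitSizeCeiling);
    }
}
```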
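And a sketch of option 2's feedback loop (all names hypothetical): workers report how much of each split they actually read, and the coordinator scales the target size for splits it hands out later:

```java
public final class AdaptiveSplitSizer
{
    private final long baseSplitSize;
    private final long minSplitSize;
    private final long maxSplitSize;
    // Exponential moving average of bytesRead / splitSize reported by workers.
    private double avgReadFraction = 1.0;

    AdaptiveSplitSizer(long baseSplitSize, long minSplitSize, long maxSplitSize)
    {
        this.baseSplitSize = baseSplitSize;
        this.minSplitSize = minSplitSize;
        this.maxSplitSize = maxSplitSize;
    }

    // Called with per-split runtime stats reported back from workers.
    synchronized void onSplitCompleted(long splitSizeBytes, long bytesActuallyRead)
    {
        double fraction = splitSizeBytes == 0 ? 1.0 : (double) bytesActuallyRead / splitSizeBytes;
        avgReadFraction = 0.8 * avgReadFraction + 0.2 * fraction;
    }

    // Target grows in inverse proportion to how much of each split is actually
    // read, clamped to [minSplitSize, maxSplitSize].
    synchronized long targetSplitSize()
    {
        long scaled = (long) (baseSplitSize / Math.max(avgReadFraction, 0.05));
        return Math.max(minSplitSize, Math.min(scaled, maxSplitSize));
    }
}
```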