LocalFileSystem::list_with_offset is very slow over network file system #7018
Labels: enhancement (any new improvement worthy of an entry in the changelog), object-store (Object Store Interface)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We've encountered an issue when using arrow-rs via DeltaLake. Many times, delta files will hold dozens of parquet files in a log directory. To select recent files, the function LocalFileSystem::list_with_offset is called. Its implementation is not efficient: the entire directory is scanned, which in our example results in >100,000 statx system calls, several for each file in the _delta_log subdirectory. This is terribly slow for our use case.
Describe the solution you'd like
Upgrade LocalFileSystem::list_with_offset to filter the files and cut the number of statx calls. We have a simple PR (to follow) which does this. For our use case, it cuts the time to open the delta table from 35 seconds to 4 seconds.
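To illustrate the idea of pre-filtering, here is a minimal sketch using only the Rust standard library. It is not the code from the PR and not the object_store implementation. The function name `list_after_offset` is hypothetical, and it makes simplifying assumptions: a single flat directory, the offset compared against bare file names rather than full object paths, and non-UTF-8 names skipped. The point is only that the name comparison happens before any metadata lookup, so entries at or before the offset never trigger a stat-like call.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of object_store): returns the regular files
/// directly under `dir` whose file name sorts strictly after `offset`,
/// together with their sizes in bytes.
pub fn list_after_offset(dir: &Path, offset: &str) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name();

        // Cheap pre-filter: compare the name first. Entries that sort at or
        // before the offset are dropped without any metadata lookup.
        // Assumption: names that are not valid UTF-8 are simply skipped.
        let keep = match name.to_str() {
            Some(n) => n > offset,
            None => false,
        };
        if !keep {
            continue;
        }

        // Only the surviving entries pay for a metadata (stat-like) call.
        let meta = entry.metadata()?;
        if meta.is_file() {
            out.push((entry.path(), meta.len()));
        }
    }
    // read_dir makes no ordering guarantee, so sort for a stable result.
    out.sort();
    Ok(out)
}
```

With a log directory of zero-padded commit files, calling `list_after_offset(dir, "<name of the last file already seen>")` would return only the newer files, and the number of metadata calls would scale with the number of matches rather than with the size of the directory.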
Describe alternatives you've considered
There are likely fancier ways to filter these files and cut the number of statx calls, but a simple pre-filter like what we have done in the associated PR is quite effective. In general, it would be nice to have a more optimized LocalFileSystem implementation.
Additional context