
Switch to object_store crate (#2489) #2677

Merged (16 commits, Jul 4, 2022)
Conversation

@tustvold (Contributor) commented Jun 1, 2022

Which issue does this PR close?

Closes #2489

Rationale for this change

See ticket

What changes are included in this PR?

Switches DataFusion to using the object_store crate in place of the existing datafusion-data-access abstraction

Are there any user-facing changes?

Yes, this moves to using the object_store crate.

Does this PR break compatibility with Ballista?

Possibly

@tustvold added the "api change" (Changes the API exposed to users of the crate) label Jun 1, 2022
@github-actions bot added the "core" (Core DataFusion crate) and "datafusion" (Changes in the datafusion crate) labels Jun 1, 2022
@andygrove removed the "datafusion" (Changes in the datafusion crate) label Jun 3, 2022
@codecov-commenter commented Jun 6, 2022

Codecov Report

Merging #2677 (54dd6d2) into master (88b88d4) will decrease coverage by 0.10%.
The diff coverage is 89.03%.

@@            Coverage Diff             @@
##           master    #2677      +/-   ##
==========================================
- Coverage   85.26%   85.15%   -0.11%     
==========================================
  Files         275      275              
  Lines       48830    48846      +16     
==========================================
- Hits        41633    41597      -36     
- Misses       7197     7249      +52     
Impacted Files Coverage Δ
datafusion/core/src/catalog/schema.rs 84.48% <ø> (-0.52%) ⬇️
...afusion/core/src/physical_plan/file_format/avro.rs 0.00% <0.00%> (ø)
datafusion/core/src/physical_plan/mod.rs 88.00% <ø> (ø)
datafusion/core/src/datasource/file_format/avro.rs 61.53% <50.00%> (-8.03%) ⬇️
datafusion/core/src/datasource/file_format/json.rs 93.75% <64.28%> (-5.13%) ⬇️
datafusion/common/src/error.rs 80.00% <66.66%> (-2.28%) ⬇️
datafusion/core/src/datasource/listing/mod.rs 55.55% <66.66%> (+10.10%) ⬆️
...afusion/core/src/physical_plan/file_format/json.rs 91.06% <77.77%> (-2.13%) ⬇️
datafusion/core/src/datasource/file_format/csv.rs 98.91% <80.00%> (-1.09%) ⬇️
datafusion/core/tests/path_partition.rs 85.86% <80.55%> (-0.69%) ⬇️
... and 26 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold (Contributor, author) commented Jun 8, 2022

I think this is now ready for review. I've created #2711, which uses currently unreleased functionality in arrow-rs to do byte-range fetches to object storage.

This PR does represent a 10-20% performance regression in the parquet SQL benchmarks when operating on local files. This largely results from moving away from spawn_blocking, with the corresponding scheduler implications documented in apache/arrow-rs#1473; a sketch contrasting the two IO patterns follows the list below.

However, I am inclined to think this is fine for a couple of reasons:

  • The new scheduler, which is currently blocked by this PR, was specifically created to address this scheduling disparity
  • The difference becomes inconsequential for any non-trivial queries
  • The ongoing work by @Ted-Jiang will help to reduce the IO costs of parquet
  • I think this lays a solid foundation on which we can iterate
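
To make the IO difference concrete, here is a minimal sketch (assumed types and helper names, not DataFusion's actual reader code) contrasting the spawn_blocking pattern being removed with an async fetch through the object_store crate:

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};

/// Old pattern: push the synchronous read onto tokio's blocking pool,
/// tying up a blocking thread for the duration of the IO.
async fn read_via_spawn_blocking(path: std::path::PathBuf) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || std::fs::read(path))
        .await
        .expect("blocking read task panicked")
}

/// New pattern: a single async fetch through the ObjectStore abstraction;
/// the worker thread stays free to run other tasks while the IO is pending.
async fn read_via_object_store(
    store: Arc<dyn ObjectStore>,
    location: &Path,
) -> object_store::Result<bytes::Bytes> {
    store.get(location).await?.bytes().await
}
```

The async version keeps tokio worker threads free while IO is pending, which is the scheduling property apache/arrow-rs#1473 discusses.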

@tustvold marked this pull request as ready for review June 8, 2022 16:25
);
}

#[cfg(target_os = "windows")]
@tustvold:

This test is removed as it no longer makes sense; paths are now normalized (see the sketch below).
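
A small illustration (assumed behavior of the object_store crate's Path type) of what that normalization means in practice:

```rust
use object_store::path::Path;

fn main() {
    // Leading delimiters and empty segments are stripped during parsing,
    // so equivalent paths converge on one canonical, OS-independent form.
    let path = Path::from("/foo//bar/baz");
    assert_eq!(path.as_ref(), "foo/bar/baz");
}
```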

let path = url.path().strip_prefix('/').unwrap();
replacements.push((path.to_string(), key.to_string()));
}
// Push URL representation of path
@tustvold:

Standardized paths 🎉

Err(_) => Ok(Box::pin(stream)),
}
let stream =
FileStream::new(&self.base_config, partition_index, context, opener)?;
@tustvold (Jun 8, 2022):

Parquet now uses the same FileStream interface as the other formats, which reduces code duplication (a sketch of the pattern follows).
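
A simplified sketch of the shared pattern (paraphrased; the real DataFusion trait and type names may differ): each format supplies an opener that asynchronously resolves to a record batch stream, and one FileStream implementation drives file iteration and limit handling for all formats.

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::future::BoxFuture;
use futures::stream::BoxStream;

/// The stream of decoded batches produced for one file
type BatchStream = BoxStream<'static, Result<RecordBatch, ArrowError>>;

/// Implemented once per format (csv, json, avro, and now parquet)
trait FormatOpener: Send + Sync {
    /// Begin opening the file at `location`; the returned future resolves
    /// once the (possibly remote) open completes
    fn open(&self, location: &str) -> BoxFuture<'static, Result<BatchStream, ArrowError>>;
}
```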

@andygrove (Member) commented:

@tustvold I am planning on creating the 9.0.0 RC on Friday. Would we want to hold off merging this until after the 9.0.0 release?

@tustvold (author) commented Jun 8, 2022

Would we want to hold off merging this until after the 9.0.0 release

That isn't really my call to make, especially since IOx consumes via a git pin and not a release. However, I would say:

  • Without "Use ParquetRecordBatchStream" #2711, which depends on the next arrow-rs release, this may represent a regression for people consuming data from remote object storage, although it's a bit of an apples-and-oranges comparison (fetching whole files vs futures::block_on range requests), and it is unclear which is necessarily better
  • The sooner we make the switch the less painful it will be
  • I'm not sure in what capacity people are using the current object store interface

My personal preference would be for 9.0.0 to include the switch so we can start to bring the ecosystem along, but I'm not sure if the timings will work out for that and I don't feel especially strongly. @alamb probably has a view on this.

@alamb (Contributor) commented Jun 9, 2022

My personal preference would be for 9.0.0 to include the switch so we can start to bring the ecosystem along, but I'm not sure if the timings will work out for that and I don't feel especially strongly. @alamb probably has a view on this.

I also recommend waiting until after the 9.0.0 release. Rationale:

  1. The DataFusion releases are already a substantial effort, so anything we can do to reduce the potential for issues requiring a second release is good
  2. I think given the wide ranging implications of the PR up/down the ecosystem, we should have some additional reviewers on it prior to merging
  3. I believe @tustvold will be out next week so waiting for his return is probably the wisest course of action

@alamb (Contributor) left a review:

I like the code; thank you very much @tustvold. I love to see the unification plan coming together. Really nice work.

Prior to merging this PR, I recommend the following steps:

  1. Make sure we can get Ballista to compile
  2. Run some basic parquet based benchmarks (e.g. the tpch ones)
  3. Send a note to the dev@arrow.apache.org mailing list with a link to this PR (and also maybe on slack)
  4. Get some other opinions (e.g. @yjshen @timvw @matthewmturner @kyotoYaho and @thinkharderdev perhaps) from people who use the existing object store abstraction. There are at least three crates on datafusion-contrib that would seem to use it: https://github.com/datafusion-contrib?q=objectstore&type=all&language=&sort=

chrono = { version = "0.4", default-features = false }
datafusion-common = { path = "../common", version = "8.0.0", features = ["parquet"] }
datafusion-data-access = { path = "../data-access", version = "8.0.0" }
A reviewer commented:

should we also perhaps remove the data-access directory as part of the same PR?

}))?;
};

let schema = match store.get(&object.location).await? {
A reviewer commented:

I like how this interface allows for specialized access of LocalFiles as well as streams 👍
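
A hedged sketch of the specialization being praised here, written against the 2022-era GetResult enum (the variant shapes are recalled from that version of the crate and may differ in later releases):

```rust
use futures::TryStreamExt;
use object_store::{path::Path, GetResult, ObjectStore};

async fn read_object(
    store: &dyn ObjectStore,
    location: &Path,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    match store.get(location).await? {
        // Local files hand back the underlying std::fs::File, allowing
        // direct (and cheap) reads without going through a byte stream
        GetResult::File(mut file, _path) => {
            use std::io::Read;
            let mut buf = Vec::new();
            file.read_to_end(&mut buf)?;
            Ok(buf)
        }
        // Remote stores yield an async stream of Bytes chunks instead
        GetResult::Stream(mut stream) => {
            let mut buf = Vec::new();
            while let Some(chunk) = stream.try_next().await? {
                buf.extend_from_slice(&chunk);
            }
            Ok(buf)
        }
    }
}
```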

@@ -580,7 +554,7 @@ mod tests {
let batches = collect(exec, task_ctx).await?;
assert_eq!(1, batches.len());
assert_eq!(11, batches[0].num_columns());
assert_eq!(8, batches[0].num_rows());
assert_eq!(1, batches[0].num_rows());
A reviewer asked:

why is this different?

@tustvold:

Because we now use FileStream, which slices the returned batches based on the provided limit.
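
As an illustration, a hypothetical helper (not DataFusion's actual code) showing how slicing enforces a limit; for example, a limit of 1 would turn the 8-row batch above into a single row:

```rust
use arrow::record_batch::RecordBatch;

/// Returns the batch, trimmed so no more than `remaining` rows are emitted
fn apply_limit(batch: RecordBatch, remaining: &mut usize) -> Option<RecordBatch> {
    if *remaining == 0 {
        return None;
    }
    if batch.num_rows() <= *remaining {
        *remaining -= batch.num_rows();
        Some(batch)
    } else {
        // Slice the batch down to exactly the rows still needed
        let sliced = batch.slice(0, *remaining);
        *remaining = 0;
        Some(sliced)
    }
}
```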

enum FileStreamState {
Idle,
Open {
future: ReaderFuture,
A reviewer commented:

Perhaps we can add some docstrings -- especially for what future represents
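
For instance, the requested docstrings might look like this (the wording is my reading of the surrounding code, and ReaderFuture's exact shape is an assumption):

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::future::BoxFuture;
use futures::stream::BoxStream;

/// Assumed shape: resolves to the batch stream of the file being opened
type ReaderFuture =
    BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>, ArrowError>>;

enum FileStreamState {
    /// No file is currently being opened or scanned
    Idle,
    /// A file open is in flight; `future` resolves to that file's batch
    /// stream once the (possibly remote) open completes
    Open { future: ReaderFuture },
}
```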


@matthewmturner (Contributor) commented:

This is great work - really excited to get this integrated. I hope to provide some comments / questions this weekend.

@thinkharderdev (Contributor) left a review:

Looks good to me. We had actually come to the same approach in our project: when fetching from an external object store, it turned out to be much more efficient to prefetch the entire file into memory than to try to do a lot of sequential range requests.

I wonder if there is more to gain (in a future iteration of course :)) by reading the metadata and then doing a buffered prefetch of only the projected columns and non-pruned row groups. If we can also crack the metadata caching then this should be a pure win.
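
A hedged sketch of that idea, using parquet crate metadata accessors (signatures from memory — treat them as assumptions) to compute the byte ranges worth prefetching; the pruning and projection inputs are assumed to come from the planner:

```rust
use std::ops::Range;

use parquet::file::metadata::ParquetMetaData;

/// Byte ranges covering only the projected columns of the surviving row groups
fn ranges_to_prefetch(
    metadata: &ParquetMetaData,
    projected_columns: &[usize],
    kept_row_groups: &[usize],
) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    for &rg in kept_row_groups {
        let row_group = metadata.row_group(rg);
        for &col in projected_columns {
            // byte_range() yields (start offset, total compressed length)
            let (start, len) = row_group.column(col).byte_range();
            ranges.push(start..start + len);
        }
    }
    ranges
}
```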

@alamb (Contributor) commented Jun 10, 2022

I wonder if there is more to gain (in a future iteration of course :)) by reading the metadata and then doing a buffered prefetch of only the projected columns and non-pruned row groups. If we can also crack the metadata caching then this should be a pure win.

I think this is precisely what @tustvold is working towards -- I am not sure we have a unified vision writeup / ticket anywhere but we are working on one...

@tustvold (author) commented:

I think this is precisely what @tustvold is working towards

Indeed, #2711 adds buffered prefetch of projected columns and non-pruned row groups, using the functionality added in apache/arrow-rs#1803. Further, with the work of @Ted-Jiang on ColumnIndex support (apache/arrow-rs#1705), we may in the not too distant future support page-level pushdown 🎉

will be out next week so waiting for his return is probably the wisest course of action

I am out for the next week and a bit, and am not sure how much time I will have to work on this, but please do leave feedback and I'll get to it on my return 😄

@alamb (Contributor) commented Jun 29, 2022

@tustvold -- given the work for #2226, is the eventual plan to interleave IO and CPU decoding?

I wonder if we can find some workaround so that @Ted-Jiang and his team don't lose performance while we continue to make progress (e.g. could we fetch to a local file? or put in some hack for people who want to decode using a blocking IO thread or something)

@alamb closed this Jun 29, 2022
@alamb reopened this Jun 29, 2022
@tustvold (author) commented:

is the eventual plan to interleave IO and CPU decoding

Yes, once we properly support reading and writing the column index structures (apache/arrow-rs#1705) we will have sufficient information to interleave IO at the page level. Currently ParquetRecordBatchStream does not have information on where the pages are actually located, which means it cannot interleave IO at a granularity lower than the column chunk. That being said, we could potentially use a heuristic and only fetch the first 1MB or something; I'll have an experiment 🤔

Full disclosure: the column in question is somewhat degenerate; it is 106MB over 100x 1MB pages across two row groups. Another obvious way to improve the performance would be to reduce the size of the row groups.

@tustvold (author) commented:

So here is where we stand with regards to this PR:

Pros

Cons

  • Slightly higher memory usage for some queries, as it buffers encoded column chunks instead of reading pages on demand
  • Queries to local files with column chunks containing large numbers of pages may be slower

Conclusion

I therefore think that, on balance, this PR represents a step forward, with the only regression mitigated by using smaller row groups.

@alamb (Contributor) commented Jun 29, 2022

Given the tradeoffs articulated by @tustvold in #2677 (comment) I think we should merge this PR.

@Ted-Jiang what do you think?

cc @thinkharderdev @matthewmturner @andygrove @liukun4515 @yjshen @wjones127 @houqp -- any other thoughts / concerns before doing so? It will cause churn downstream, but we have known that ever since #2489 was proposed

@alamb closed this Jun 29, 2022
@alamb reopened this Jun 29, 2022
@matthewmturner (Contributor) commented:

+1 for me. I will add a note to the README of the s3 object store repo to let users know of the new crate.

@matthewmturner (Contributor) commented:

I apologize if I missed it (the GitHub UI is being buggy for me right now), but it might be worth adding to the docs examples of how to use this with different object_store features enabled. This could be done as a follow-on, though.
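
As a starting point, a hypothetical docs example of the kind suggested (the registration method name and signature reflect my recollection of the 2022-era API and may differ between DataFusion versions; with the crate's optional features enabled, an S3, GCS, or Azure store could be registered the same way):

```rust
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use object_store::local::LocalFileSystem;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Any ObjectStore implementation works here; URLs whose scheme matches
    // the registered one will resolve to this store.
    let store = Arc::new(LocalFileSystem::new());
    ctx.runtime_env().register_object_store("file", store);

    Ok(())
}
```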

@Ted-Jiang (Member) commented Jun 30, 2022

Finally got profiles (by switching the VM to fedora), and it certainly fits with my hypothesis above

[profiling screenshot: master on the left, this branch on the right]

On the left we have master, and on the right this branch. The CPU activity under parquet_query_s demarcates each benchmark iteration; within this, two row groups are being read. We can clearly see that with this PR there is a noticeable delay as it fetches the bytes into memory before starting to decode the data, whereas master interleaves the IO and decoding. There is a trade-off here: the approach on master is faster for this particular benchmark, but comes at the cost of stalling out worker threads on IO that could have been doing other work during decode.

There are some ways we could potentially improve this, e.g. interleaving IO at the page instead of column chunk, but this is unlikely to help with object storage and may actually perform worse. I'm not sure if this is something worth optimising, but would appreciate other people's thoughts

@tustvold Thanks a lot for your sharing 👍.
I am not clear about "whereas master interleaves the IO and decoding" -- I think master uses blocking IO, so decode must wait for IO; this patch uses interleaving with async functions to reduce the blocked IO.

And about "the delay as it fetches the bytes into memory": is it because the IO unit is large, even when using the async reader?

If I missed something, please tell me 😂

@Ted-Jiang (Member) commented:

I wonder if we can find some workaround so that @Ted-Jiang and his team don't lose performance while we continue to make progress (e.g. could we fetch to a local file? or put in some hack for people who want to decode using a blocking IO thread or something)

@alamb Thanks for your kind attention ❤️ I think this change is reasonable!
We can keep both async and non-async; we will test in our env.

@andygrove (Member) commented:

I haven't reviewed the changes here yet but I have no objection to this being merged if the community supports it.

@rdettai (Contributor) commented Jun 30, 2022

That being said, I'm not really sure I agree that the object store abstraction is all that core to DataFusion. It is just an IO abstraction used at the edges of plans

That's quite a lot of files that got modified for switching an "IO abstraction used at the edge of plans" 😄. I also believe that reading data in from files is crucial to an analytics query engine. Indeed, it isn't core in the sense that you can do things with your engine without it (reading in-memory or streaming data...), but it is still one of its main use cases and, more importantly, a critical performance bottleneck. And as always with optimization, you sometimes need to bend the separation of concerns a bit to reach your goal, which means that you will need to tweak the abstraction to get the performance you want (as you can see with topics like prefetch strategies...). And this can be made more complicated if we refer to an external repository that is not owned by us.

TL;DR: I would also be more comfortable with this change if we first integrated the object store abstraction into the repository.

@rdettai (Contributor) commented Jun 30, 2022

I would be interested in @wesm's point of view on this governance question. Just to recap the question:
-> we are about to replace the file system abstraction (that we call object store here: https://github.com/apache/arrow-datafusion/tree/master/datafusion/data-access) with an external one that is currently owned by InfluxData (https://github.com/influxdata/object_store_rs/blob/main/src/lib.rs). There are some concerns about whether this is a wise decision or not.

@tustvold (author) commented Jun 30, 2022

I am not clear about "whereas master interleaves the IO and decoding" -- I think master uses blocking IO, so decode must wait for IO; this patch uses interleaving with async functions to reduce the blocked IO.

Master interleaves IO at the page level, reading individual pages as required and blocking the calling thread as it does so. This branch instead performs async IO, fetching column chunks into memory without blocking threads. This is significantly better for object stores, but will perform "worse" for certain workloads accessing local files, where the approach on master may be faster, with the obvious drawback of blocking threads.
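
A hedged sketch of the two strategies being contrasted (the ranges are illustrative, and the blocking variant is a simplification of the master-style reader):

```rust
use std::ops::Range;

use object_store::{path::Path, ObjectStore};

/// This branch (simplified): one non-blocking range request fetches the
/// whole column chunk into memory before decoding begins.
async fn fetch_column_chunk(
    store: &dyn ObjectStore,
    location: &Path,
    chunk: Range<usize>,
) -> object_store::Result<bytes::Bytes> {
    store.get_range(location, chunk).await
}

/// Master (simplified): each page is read with a blocking seek + read on
/// the calling thread, interleaving IO with decode but stalling the thread.
fn read_page_blocking(
    file: &mut std::fs::File,
    page: Range<u64>,
) -> std::io::Result<Vec<u8>> {
    use std::io::{Read, Seek, SeekFrom};
    file.seek(SeekFrom::Start(page.start))?;
    let mut buf = vec![0; (page.end - page.start) as usize];
    file.read_exact(&mut buf)?;
    Ok(buf)
}
```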

if we first integrated the object store abstraction into the repository.

I would be fine waiting until the donation to arrow-rs goes through (influxdata/object_store_rs#41), but I had hoped that, given this intent had been clearly broadcast, we could just get this in rather than waiting the 3 or so weeks it will take for that process to complete. What do you think?

@wesm (Member) commented Jun 30, 2022

I think it's fine to switch without the code donation and not wait, but if you think that other DataFusion contributors will want to participate in the maintenance and governance of the object store crate, then doing the code donation sounds like a good idea to me.

@rdettai (Contributor) commented Jul 1, 2022

Great! I missed that the donation was in progress. Obviously no need to wait then 😉

@alamb (Contributor) left a review:

I plan to give this a final review and merge tomorrow unless anyone objects. Thank you all

@alamb (Contributor) commented Jul 2, 2022

I took the liberty of merging up from master to resolve some conflicts in Cargo.toml

Labels: "api change" (Changes the API exposed to users of the crate), "core" (Core DataFusion crate)

Successfully merging this pull request may close these issues:

  • Consider adopting IOx ObjectStore abstraction

9 participants