Switch to object_store crate (#2489) #2677
Conversation
Codecov Report
@@            Coverage Diff             @@
##            master    #2677      +/-  ##
==========================================
- Coverage    85.26%   85.15%   -0.11%
==========================================
  Files          275      275
  Lines        48830    48846      +16
==========================================
- Hits         41633    41597      -36
- Misses        7197     7249      +52
Continue to review full report at Codecov.
a467083 to a2b89c0
I think this is now ready for review. I've created #2711, which uses currently unreleased functionality in arrow-rs to do byte range fetches to object storage. This PR does represent a 10-20% performance regression in the parquet SQL benchmarks when operating on local files. This largely results from moving away from spawn_blocking and the corresponding scheduler implications documented in apache/arrow-rs#1473. However, I am inclined to think this is fine for a couple of reasons:
);
}

#[cfg(target_os = "windows")]
This test is removed as it no longer makes sense; paths are now normalized
let path = url.path().strip_prefix('/').unwrap();
replacements.push((path.to_string(), key.to_string()));
}
// Push URL representation of path
Standardized paths 🎉
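For anyone following along, roughly what the standardization buys us (a minimal sketch assuming the object_store crate's Path semantics; not code from this PR):

use object_store::path::Path;

fn main() {
    // Different spellings of the same object key normalize to one canonical,
    // '/'-delimited form: leading slashes and empty segments are stripped.
    let a = Path::from("/data/file.parquet");
    let b = Path::from("data//file.parquet");
    assert_eq!(a, b);
    println!("{}", a); // data/file.parquet
}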
Err(_) => Ok(Box::pin(stream)),
}
let stream =
    FileStream::new(&self.base_config, partition_index, context, opener)?;
Parquet now uses the same FileStream interface as other formats, which reduces code duplication
@tustvold I am planning on creating the 9.0.0 RC on Friday. Would we want to hold off merging this until after the 9.0.0 release?
That isn't really my call to make, especially since IOx consumes via a git pin and not a release. However, I would say:
My personal preference would be for 9.0.0 to include the switch so we can start to bring the ecosystem along, but I'm not sure if the timings will work out for that and I don't feel especially strongly. @alamb probably has a view on this.
I also recommend waiting until after the 9.0.0 release. Rationale:
I like the code; thank you very much @tustvold. I love to see the unification plan coming together. Really nice work.
Prior to merging this PR, I recommend the following steps:
- Make sure we can get Ballista to compile
- Run some basic parquet based benchmarks (e.g. the tpch ones)
- Send a note to the dev@arrow.apache.org mailing list with a link to this PR (and also maybe on slack)
- Get some other opinions (e.g. @yjshen @timvw @matthewmturner @kyotoYaho and @thinkharderdev perhaps) from people who use the existing object store abstraction. There are at least three crates on datafusion-contrib that would seem to use it: https://github.com/datafusion-contrib?q=objectstore&type=all&language=&sort=
datafusion/core/Cargo.toml
chrono = { version = "0.4", default-features = false }
datafusion-common = { path = "../common", version = "8.0.0", features = ["parquet"] }
datafusion-data-access = { path = "../data-access", version = "8.0.0" }
should we also perhaps remove the data-access directory as part of the same PR?
}))?;
};

let schema = match store.get(&object.location).await? {
I like how this interface allows for specialized access of LocalFiles as well as streams 👍
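To make that concrete, a rough sketch of the specialization this enables (a hypothetical helper, assuming the GetResult::File / GetResult::Stream variants of the object_store release used here; not code from this PR):

use futures::TryStreamExt;
use object_store::{path::Path, GetResult, ObjectStore};

// Hypothetical helper: read a whole object, handling local files and
// remote streams differently.
async fn fetch(
    store: &dyn ObjectStore,
    location: &Path,
) -> Result<Vec<u8>, Box<dyn std::error::Error + Send + Sync>> {
    match store.get(location).await? {
        // LocalFileSystem hands back the file itself, so callers can seek,
        // memory-map, or read it on a blocking thread.
        GetResult::File(mut file, _path) => {
            use std::io::Read;
            let mut buf = Vec::new();
            file.read_to_end(&mut buf)?;
            Ok(buf)
        }
        // Remote stores return the body as an async stream of Bytes chunks.
        GetResult::Stream(mut stream) => {
            let mut buf = Vec::new();
            while let Some(chunk) = stream.try_next().await? {
                buf.extend_from_slice(&chunk);
            }
            Ok(buf)
        }
    }
}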
@@ -580,7 +554,7 @@ mod tests {
     let batches = collect(exec, task_ctx).await?;
     assert_eq!(1, batches.len());
     assert_eq!(11, batches[0].num_columns());
-    assert_eq!(8, batches[0].num_rows());
+    assert_eq!(1, batches[0].num_rows());
why is this different?
Because we now use FileStream, which slices the returned batches based on the provided limit
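In other words, something along these lines happens once the limit is nearly exhausted (an illustrative sketch, not the actual FileStream source):

use arrow::record_batch::RecordBatch;

/// Emit at most `remaining` rows from `batch`, returning the (possibly
/// sliced) batch together with the rows still allowed after it.
fn apply_limit(batch: RecordBatch, remaining: usize) -> (RecordBatch, usize) {
    if batch.num_rows() <= remaining {
        let left = remaining - batch.num_rows();
        (batch, left)
    } else {
        // Zero-copy slice of just the first `remaining` rows.
        (batch.slice(0, remaining), 0)
    }
}

So with a limit of 1 in the test above, the 8-row batch read from the file is truncated to a single row, which is why the assertion changed.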
enum FileStreamState {
    Idle,
    Open {
        future: ReaderFuture,
Perhaps we can add some docstrings -- especially for what `future` represents
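Perhaps something along these lines (a sketch of possible docstrings; the ReaderFuture stand-in below is an assumption, the real alias is defined alongside FileStream):

use arrow::{error::ArrowError, record_batch::RecordBatch};
use futures::future::BoxFuture;

/// Stand-in for the PR's ReaderFuture alias (assumed shape for this sketch):
/// a boxed future resolving to something record batches can be pulled from.
type ReaderFuture = BoxFuture<
    'static,
    Result<Box<dyn Iterator<Item = Result<RecordBatch, ArrowError>> + Send>, ArrowError>,
>;

/// Where the stream currently is in its open -> scan -> next-file cycle.
enum FileStreamState {
    /// No file is open; the next poll starts opening the next file in the list.
    Idle,
    /// A file open (possibly a remote fetch) is in flight.
    Open {
        /// Resolves to the reader for the file once the open completes.
        future: ReaderFuture,
    },
    // ...remaining states elided
}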
self.next_batch().transpose()
})
.transpose()
fn poll_inner(
cc @rdettai
This is great work - really excited to get this integrated. I hope to provide some comments / questions this weekend.
Looks good to me. We have actually come to the same approach in our project. When fetching from an external object store, it turned out to be much more efficient to prefetch the entire file into memory than to try and do a lot of sequential range requests.
I wonder if there is more to gain (in a future iteration of course :)) by reading the metadata and then doing buffered prefetch of only the projected columns and non-pruned row groups. If we can also crack the metadata caching then this should be a pure win.
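For illustration, the range computation that idea boils down to (a rough sketch against the parquet crate's metadata API, assuming the byte_range accessor on the column chunk metadata; the actual logic in #2711 may well differ):

use parquet::file::metadata::ParquetMetaData;
use std::ops::Range;

/// Hypothetical helper: byte ranges to prefetch for the projected columns
/// of the row groups that survived pruning.
fn prefetch_ranges(
    metadata: &ParquetMetaData,
    row_groups: &[usize], // indices of non-pruned row groups
    projection: &[usize], // indices of projected columns
) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    for &rg in row_groups {
        let row_group = metadata.row_group(rg);
        for &col in projection {
            let (start, len) = row_group.column(col).byte_range();
            ranges.push(start as usize..(start + len) as usize);
        }
    }
    // A real implementation would sort and coalesce adjacent ranges before
    // issuing requests, and could cache `metadata` to avoid re-reading the footer.
    ranges
}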
I think this is precisely what @tustvold is working towards -- I am not sure we have a unified vision writeup / ticket anywhere but we are working on one...
Indeed, #2711 adds buffered prefetch of projected columns and non-pruned row groups, using the functionality added in apache/arrow-rs#1803. Further, with the work of @Ted-Jiang on ColumnIndex support (apache/arrow-rs#1705), we may in the not too distant future support page-level pushdown 🎉
I am out for the next week and a bit, and am not sure how much time I will have to work on this, but please do leave feedback and I'll get to it on my return 😄
@tustvold -- given the work for #2226, is the eventual plan to interleave IO and CPU decoding? I wonder if we can find some workaround so that @Ted-Jiang and his team don't lose performance while we continue to make progress (e.g. could we fetch to a local file? or put in some hack for people who want to decode using a blocking IO thread or something)
Yes, once we properly support reading and writing the column index structures (apache/arrow-rs#1705) we will have sufficient information to interleave IO at the page level. Currently ParquetRecordBatchStream does not have information on where the pages are actually located, which means it cannot interleave IO at a granularity lower than the column chunk. That being said, we could potentially use a heuristic and only fetch the first 1MB or something; I'll have an experiment 🤔

Full disclosure: the column in question is somewhat degenerate, it is 106MB over 100x 1MB pages across two row groups. Another obvious way to improve the performance would be to reduce the size of the row groups.
So here is where we stand with regards to this PR:

Pros

Cons

Conclusion

I therefore think on balance this PR represents a step forward, with the only regression mitigated by using smaller row groups.
Given the tradeoffs articulated by @tustvold in #2677 (comment) I think we should merge this PR. @Ted-Jiang what do you think? cc @thinkharderdev @matthewmturner @andygrove @liukun4515 @yjshen @wjones127 @houqp -- any other thoughts / concerns before doing so? It will cause churn downstream but we have known that ever since #2489 was proposed.
+1 for me. I will add a note to the README of the s3 object store repo to let users know of the new crate.
I apologize if I missed it (the GitHub UI is being buggy for me right now) but it might be worth adding to the docs examples of how to use this with different object store implementations.
@tustvold Thanks a lot for sharing 👍. If I missed something, please tell me 😂
@alamb Thanks for your kind attention ❤️ I think this change is reasonable!
I haven't reviewed the changes here yet but I have no objection to this being merged if the community supports it.
That's quite a lot of files that got modified for switching an "IO abstraction used at the edge of plans" 😄. I also believe that reading data in from files is crucial to an analytics query engine. Indeed it isn't core in the sense that you can do things with your engine without it (reading in-memory or streaming data...), but it is still one of its main use cases and, more importantly, a critical performance bottleneck. And as always with optimization, you sometimes need to bend the separation of concerns a bit to reach your goal, which means that you will need to tweak the abstraction to get the performance you want (as you can see with topics like prefetch strategies...). And this can be made more complicated if we refer to an external repository that is not owned by us.

TL;DR: I would also be more comfortable with this change if we first integrated the object store abstraction into the repository.
I would be interested in @wesm's point of view on this governance question. Just to recap the question:
Master interleaves IO at the page level, reading individual pages as required and blocking the calling thread as it does so. This branch instead performs async IO, fetching column chunks into memory without blocking threads. This is significantly better for object stores, but will perform "worse" for certain workloads accessing local files, where the approach on master may be faster, albeit with the obvious drawback of blocking threads.
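To make the two strategies concrete, roughly (an illustrative sketch, not the actual reader code on either branch):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use std::ops::Range;

// Master-style: read a byte range from a local file, parking a worker
// thread for the duration of the blocking IO.
async fn read_blocking(path: std::path::PathBuf, range: Range<usize>) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || -> std::io::Result<Vec<u8>> {
        use std::io::{Read, Seek, SeekFrom};
        let mut file = std::fs::File::open(path)?;
        file.seek(SeekFrom::Start(range.start as u64))?;
        let mut buf = vec![0; range.end - range.start];
        file.read_exact(&mut buf)?;
        Ok(buf)
    })
    .await
    .expect("blocking read task panicked")
}

// This branch: an async range request that never blocks the executor and
// works the same way for local files and remote object stores.
async fn read_async(
    store: &dyn ObjectStore,
    location: &Path,
    range: Range<usize>,
) -> object_store::Result<Bytes> {
    store.get_range(location, range).await
}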
I would be fine waiting until the donation to arrow-rs goes through (influxdata/object_store_rs#41), but I had hoped that, given this intent has been clearly broadcast, rather than waiting the 3 or so weeks it will take to go through this process, we could just get this in. What do you think?
I think it's fine to switch without the code donation and not wait, but if you think that other DataFusion contributors will want to participate in the maintenance and governance of the object store crate, then doing the code donation sounds like a good idea to me.
Great! I missed that the donation was in progress. Obviously no need to wait then 😉
I plan to give this a final review and merge tomorrow unless anyone objects. Thank you all
I took the liberty of merging up from master to resolve some conflicts in
Which issue does this PR close?
Closes #2489
Rationale for this change
See ticket
What changes are included in this PR?
Switches DataFusion to using the object_store crate in place of the datafusion-data-access crate
Are there any user-facing changes?
Yes, this moves to using the object_store crate.
Does this PR break compatibility with Ballista?
Possibly