Supporting object store FFI #7075

Open
lisasgoh opened this issue Feb 4, 2025 · 15 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@lisasgoh

lisasgoh commented Feb 4, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The use case is dynamically loading libraries that provide custom object store implementations (beyond S3/Azure/GCP, etc.) into libraries like polars, which currently offer no way to register a new object store.

Describe the solution you'd like

An object store FFI would be necessary due to Rust’s unstable ABI.

Describe alternatives you've considered

Additional context

@lisasgoh lisasgoh added the enhancement Any new improvement worthy of a entry in the changelog label Feb 4, 2025
@tustvold
Contributor

tustvold commented Feb 4, 2025

Tagging @kylebarron.

I think there are a couple of questions here:

  • Is the major use case Python interop?
  • Does a Python-focused abstraction have more utility than a generic FFI abstraction?
  • Is there maintainer bandwidth for this within arrow-rs, which is primarily Rust-focused?
  • Is the project better served by incubating outside, driven by a group of motivated individuals (who may not be arrow committers)?
  • Do the various Python projects have sufficient commonality that sharing this is even useful (e.g. have they already built their own bespoke setups)?

I think there are many people interested in such functionality, however, I suspect there is a significant non-technical aspect to any such initiative.

@kylebarron
Contributor

It sounds like the OP here is interested in a custom implementation of ObjectStore to serve something other than AWS/GCP/Azure.

That's a bit different than what I've been focused on. The solution I've been working towards is to have reusable Python bindings that other Rust-Python developers can use in their own Python bindings. This doesn't require FFI (though by not using FFI it means you can't share ObjectStore instances across Python libraries). But it's also not a solution to a third-party extensible implementation of ObjectStore.

I don't think I have the bandwidth to try and implement stable object store FFI. I'll tag @timsaucer who wrote the DataFusion FFI support.

@lisasgoh
Author

lisasgoh commented Feb 4, 2025

The use case here is that, unlike DataFusion, which has a register_object_store method that allows custom object store implementations, Polars doesn't (pola-rs/polars#20568). I'm looking to contribute to polars to add support for it, but the main blocker is the lack of an existing ObjectStore FFI.

The idea I have is to have a Python API register_object_store(path: str, method: str, scheme: str) in Polars (similar to DF) that accepts the path to the library with the custom object store, a method name and a url scheme. This will call into Rust via PyO3 which would dynamically load that library via libloading and call the method name which would return an instance of the object store.

The main issue here is that there's no ObjectStore FFI at the moment AFAIK, which makes it tricky to implement.
Edit: There's this https://github.com/RelationalAI/object_store_ffi, but I'm not sure whether it would work for my use case.

@tustvold
Contributor

tustvold commented Feb 5, 2025

Polars has both a Python and a Rust API, and IMO trying to support both with one system using mechanisms like libloading is a bad UX for both. Managing shared libraries is a PIA, especially within the Python ecosystem.

Instead I'd suggest Rust polars code should be able to provide an ObjectStore directly.

Similarly for Python, polars could provide an object store shim that delegates to a user-provided Python impl (this may even already exist). This would then allow people to plug in Python-based object stores.
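
As a rough illustration of the shim idea, the sketch below shows only the Python side. The class names and the two-method surface (`put`/`get`) are assumptions made for brevity; the real ObjectStore trait is larger and async, and an actual shim would be Rust code implementing that trait and calling into the Python object.

```python
class InMemoryStore:
    """User-provided store; any object with put/get would do in this sketch."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, location: str, data: bytes) -> None:
        self._blobs[location] = bytes(data)

    def get(self, location: str) -> bytes:
        return self._blobs[location]


class PythonObjectStoreShim:
    """Hypothetical shim: validates the wrapped object, then forwards each call."""

    REQUIRED = ("put", "get")

    def __init__(self, inner: object) -> None:
        missing = [m for m in self.REQUIRED if not callable(getattr(inner, m, None))]
        if missing:
            raise TypeError(f"object store is missing methods: {missing}")
        self._inner = inner

    def put(self, location: str, data: bytes) -> None:
        self._inner.put(location, data)

    def get(self, location: str) -> bytes:
        return self._inner.get(location)


shim = PythonObjectStoreShim(InMemoryStore())
shim.put("data/part-0.parquet", b"abc")
print(shim.get("data/part-0.parquet"))
```

Checking the required methods up front means a misconfigured store fails at registration time rather than in the middle of a query.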

If libloading is important for other reasons, e.g. some GPL dance, then these Rust/python impls can orchestrate that.

I think this would satisfy your use-case, whilst not requiring a stable C FFI?

@lisasgoh
Author

lisasgoh commented Feb 5, 2025

@tustvold The Rust API to directly provide an object store makes sense, but I'll be interacting with polars via Python, so I was wondering what the best way is to register a new object store from Python. I might be misunderstanding, but does your idea for Python involve creating some sort of generic ObjectStore Python impl that wraps the user-provided Python impl and invokes its methods? Something like:
Python::with_gil(|py| -> PyResult<()> { obj.call_method1(py, "put", (location, bytes))?; Ok(()) }) // assuming obj: Py<PyAny>

My worry with this approach is that interactions with the object store would go rust -> python -> rust, since the user-provided Python impl would just be an object store wrapped with Python bindings. This would have some overhead, and I'm not sure whether there would be performance implications due to the GIL. It seems like libloading is already used in polars for plugins https://docs.pola.rs/api/python/stable/reference/plugins.html, which is where I got the idea from.

@tustvold
Contributor

tustvold commented Feb 5, 2025

I am worried that for this approach, doesn't this mean that the interactions with the object store will be -> rust -> python -> rust?

I was somewhat presuming that if you're using the Python API you want to author your extension in Python. I agree that if the implementation is in Rust, proxying via Python seems unnecessary, even if the performance impact is likely irrelevant when compared to network overheads. I'm not familiar with how polars has hooked up its Python bindings, but I wonder if you can set up a polars "context" in Rust and then invoke it from Python?

@lisasgoh
Author

lisasgoh commented Feb 5, 2025

Are you referring to something like the SessionContext that DataFusion has? It would be nice, but AFAIK no such concept exists in Polars, unfortunately. I did float the idea of using libloading to the Polars devs, but the main concern/blocker that was raised was the lack of a stable C FFI for ObjectStore. I agree that from a UX perspective it might be non-ideal, but given that a similar concept already exists in Polars (registering a Rust plugin via Python), this wouldn't be unprecedented. Is there any chance for an ObjectStore C FFI to be on the roadmap?

@tustvold
Contributor

tustvold commented Feb 5, 2025

Is there any chance for an ObjectStore C FFI to be on the roadmap?

I will be frank, ObjectStore has a fairly large, async API, and so defining and maintaining an FFI interface for it would be a fairly substantial undertaking. Given what I know of the interests of the various maintainers, I think such an initiative would likely stand the best chance of success incubating as a third-party project.

I did float the idea of using libloading to the Polars devs but the main concern/blocker that was raised was the lack of a stable C FFI for ObjectStore

IMO I would suggest adding support for custom ObjectStore within polars from the Rust API first, i.e. introducing something similar to ObjectStoreRegistry. Once such an abstraction exists, then it will be possible to devise ways to potentially orchestrate that from python code.
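
For readers unfamiliar with the registry idea, the following sketch shows the general shape: stores are keyed by URL scheme and authority and resolved from a full object URL. It is loosely modeled on the concept behind DataFusion's ObjectStoreRegistry, but the class below, its method names, and its keying rule are assumptions of this example written in Python for brevity, not the API of DataFusion or Polars.

```python
from urllib.parse import urlparse


class ObjectStoreRegistry:
    """Hypothetical registry mapping 'scheme://authority' to a store object."""

    def __init__(self) -> None:
        self._stores = {}

    @staticmethod
    def _key(url: str) -> str:
        parsed = urlparse(url)
        if not parsed.scheme:
            raise ValueError(f"URL has no scheme: {url!r}")
        # Ignore the path: every object under one authority shares a store.
        return f"{parsed.scheme}://{parsed.netloc}"

    def register_store(self, url: str, store: object):
        """Register `store` for `url`; return the store it replaced, if any."""
        key = self._key(url)
        previous = self._stores.get(key)
        self._stores[key] = store
        return previous

    def get_store(self, url: str) -> object:
        key = self._key(url)
        try:
            return self._stores[key]
        except KeyError:
            raise LookupError(f"no object store registered for {key}") from None


registry = ObjectStoreRegistry()
registry.register_store("mystore://bucket", "custom-store")
print(registry.get_store("mystore://bucket/path/to/file.parquet"))
```

Once an engine consults such a registry when it resolves a path, supporting a custom store reduces to registering it before the first query, regardless of whether the registration is driven from Rust or from Python.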

This wouldn't necessarily require a C FFI, for example, you could potentially build a shared library bundling both polars and your extension code, and then use that from python or something along those lines.

@lisasgoh
Author

lisasgoh commented Feb 5, 2025

I will be frank, ObjectStore has a fairly large, async API, and so defining and maintaining an FFI interface for it would be a fairly substantial undertaking. Given what I know of the interests of the various maintainers, I think such an initiative would likely stand the best chance of success incubating as a third-party project.

Ah that's unfortunate :( There's only https://github.com/RelationalAI/object_store_ffi I think.

This wouldn't necessarily require a C FFI, for example, you could potentially build a shared library bundling both polars and your extension code, and then use that from python or something along those lines.

Do you mind elaborating on this approach? I was thinking that a Rust API for this might not really be helpful since after registering a custom object store into the registry (which maybe can be represented as some sort of global map), how will python polars be aware of this state?

@tustvold
Contributor

tustvold commented Feb 5, 2025

how will python polars be aware of this state?

My understanding is that the polars python API really just acts as glue to orchestrate the underlying Rust execution engine. As such it should be possible to do the initial setup in Rust and then do further orchestration from Python.

Ultimately I'd be very surprised if adding support for this from Rust wasn't a necessary precondition to python support.

@lisasgoh
Author

lisasgoh commented Feb 6, 2025

Just to make sure I understand correctly, is the idea here along the lines of:

  1. Create a register API in Rust to store the custom object store in memory.
  2. Maybe create a Rust function in the external library with the custom impl that would invoke that register API
  3. Expose that function to python and call it at runtime before executing any query via python polars?

Or am I getting it wrong?

I’m not sure this would work, since wouldn’t the Rust binary for the Rust API be different from the underlying Rust binary for python polars which is separately compiled and published? So the state in Rust polars with the registered object store won’t be shared with python polars.

@tustvold
Contributor

tustvold commented Feb 6, 2025

separately compiled and published

You wouldn't use the standard polars distribution, just the custom one with your extension built into it.

I think this discussion is probably best moved to polars.

@kylebarron
Contributor

I think the confusion here might be that @tustvold is expecting Polars to act like DataFusion, where it's intended to be fully embedded into your own project. Whereas my understanding is that the Polars Python API isn't designed to be embedded. The intended extension point is via runtime-linked extensions.

@tustvold
Contributor

tustvold commented Feb 8, 2025

If this is the case, then polars will indeed require a stable C FFI for all such extensions. My point was that this is a very limiting methodology, especially given Rust's lack of a stable ABI, and it seems surprising that polars would not also support build-time extension.

@lisasgoh
Author

A potential temporary solution could be to create a fork of Polars to integrate my new object store, though it creates some maintenance overhead which isn't ideal. There seems to be some demand for ObjectStore FFI, e.g. here, so it would be immensely helpful if the Arrow team could consider adding this capability to their roadmap!
