GH-34971: [Format] Add non-CPU version of C Data Interface #34972
I'm not certain I follow. Isn't the `ArrowDeviceArray` passed to `get_next` an "out" parameter? Are you saying that the `ArrowDeviceArray` struct itself (not the buffers) needs to be allocated on the device?
no, I was referring to the `device_id` member and `device_type` member that get populated in the `ArrowDeviceArray` that is returned from `get_next`
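(For orientation, a minimal sketch of the struct being discussed. Only `device_id` and `device_type` come from this thread; the other field and the exact types are assumptions modeled on the existing C Data Interface, not the finalized layout.)

```c
/* Illustrative sketch only -- not the finalized specification. */
struct ArrowDeviceArray {
  struct ArrowArray array;   /* standard C Data Interface array; its buffers
                                live on the device identified below */
  int64_t device_id;         /* populated by the producer, e.g. in get_next */
  int32_t device_type;       /* assumed enum-like discriminator (CPU, CUDA, ...) */
};
```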
This is a weird API choice, all because you want the consumer to pass its CUDA stream of choice...
ultimately this is a consequence of the fact that the existing frameworks and APIs don't provide any good way to manage the stream's lifetime, which makes having the consumer pass the stream the safest route to take.
I'm absolutely open to suggestions to make this better, as long as the consumer is able to pass in the desired stream.
What do you mean by that? Would you care to give a more concrete example? For example, CUDA allows you to destroy a stream:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1gfda584f1788ca983cb21c5f4d2033a62
There's a lot of discussion in that issue related to internal stream handling under contexts and schedulers and similar terms. It all boils down to the same discussion of many producers being unable to release or share ownership of their streams.
I think this comment does a good job of summarizing the options that were considered: dmlc/dlpack#57 (comment)
And then this comment summarizes discussion of those options: dmlc/dlpack#57 (comment)
The lifetime management of streams as defined by the Numba documentation for `__cuda_array_interface__` (https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html#streams) requires that keeping the object (typically array) that produces the `__cuda_array_interface__` alive also keeps the stream alive. In most cases libraries don't associate a stream with the object, since it's valid to use multiple streams with a single object.

Here's the current state of things across a handful of projects:

- Numba handles `__cuda_array_interface__` since they store a stream object as part of their array: https://github.com/numba/numba/blob/008077553b558bd183668ecd581d4d0bc54bd32c/numba/cuda/cudadrv/devicearray.py#L119-L139
- Numba doesn't support `__dlpack__`.
- CuPy mishandles `__cuda_array_interface__` as far as I can tell, where they just get a stream ptr as an integer and, if it's not the default stream, someone could change the current stream ptr and end up having it be destroyed out from underneath it: https://github.com/cupy/cupy/blob/c92d5bc16293300297b843b4ebb364125697c131/cupy/_core/core.pyx#L258-L262 (cc @leofang)
- CuPy handles `__dlpack__` properly: https://github.com/cupy/cupy/blob/c92d5bc16293300297b843b4ebb364125697c131/cupy/_core/core.pyx#L285-L327
- PyTorch doesn't handle stream lifetimes for `__cuda_array_interface__` currently: https://github.com/pytorch/pytorch/blob/def50d253401540cfdc6c0fffa444d0ee643cc11/torch/_tensor.py#L1005-L1066
- PyTorch handles `__dlpack__` properly: https://github.com/pytorch/pytorch/blob/def50d253401540cfdc6c0fffa444d0ee643cc11/torch/_tensor.py#L1345
- TensorFlow doesn't handle stream lifetimes for `__cuda_array_interface__` currently: https://github.com/tensorflow/tensorflow/blob/ea6a0f282d2b7ce20891dfc24ec8fe107eeaf22d/tensorflow/compiler/xla/python/py_buffer.cc#L254-L300
- TensorFlow doesn't support `__dlpack__`, but has explicit `to` and `from` functions for using dlpack pycapsules that do not handle streams: https://github.com/tensorflow/tensorflow/blob/6a050b6c15ed2a545693bc171f5e95dacbe05839/tensorflow/compiler/xla/python/dlpack.cc#L285-L363
- cuDF doesn't handle stream lifetimes for `__cuda_array_interface__` currently: https://github.com/rapidsai/cudf/blob/50718e673ff53b18706cf66c6e02cda8e30681fe/python/cudf/cudf/core/column/numerical.py#L169-L191
- cuDF doesn't support `__dlpack__`, but has explicit `to` and `from` functions for using dlpack pycapsules that do not handle streams: https://github.com/rapidsai/cudf/blob/50718e673ff53b18706cf66c6e02cda8e30681fe/cpp/src/interop/dlpack.cpp#L218-L294
cc @rgommers @leofang @tqchen (please feel free to include anyone else): do you happen to know if there was any other discussion captured that could be linked here regarding the decision to have a consumer hand a stream to the producer?
Thanks for the pointers.
Yes, I read this. It looks like solution S1, which is also the one I'm proposing, is considered the most flexible (I don't understand the "harder for compilers" comment, though).
I read this too, but it doesn't actually mention S1, for reasons I wasn't able to understand.
But you have to actually synchronize on the right stream before being able to use the object, right? How does the user know which stream to synchronize on, if they didn't produce the data themselves?
From: dmlc/dlpack#57 (comment)
I believe the compilers being referred to here are deep learning compilers like XLA, which do things like kernel fusion and set up execution graphs of kernels, using streams internally to parallelize the execution of those graphs.
Something / someone needs to guarantee that there isn't a data race with regard to using multiple non-blocking streams, yes. That could be done with events, stream synchronization, or device synchronization.
If you're staying within your framework / library then the expectation is for the framework / library to handle things for the user. If crossing framework / library boundaries, then the expectation is to be reliant on things like interchange protocols to handle the synchronization semantics.
Sorry I wasn't able to respond promptly. Is the question still open?
In the case of CAI, it is required that someone handles the exporting stream's lifetime properly, and this was considered a burden when discussing the DLPack support. A few libraries like Numba, for example, had to hold a reference to the underlying stream. I believe this was the main concern for DLPack to place the requirement on the consumer instead of the producer.
I'm a little confused here. It sounds like I need to call `get_next_device_id` to determine which queue to use and then I need to pass that queue on to the call to `get_next`. But why? Why isn't the producer managing the queues?

If the producer controls which device id gets used (`get_next_device_id` seems to suggest this) then why does the consumer need to give it the queue? For example, if I were merging streams from two different devices it seems like I would do something like (apologies for the butchered pseudo-code)...
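(The pseudo-code from the original comment isn't preserved in this capture. Below is a hypothetical sketch of the consumer flow being described, using the `get_next_device_id` / `get_next` callbacks named above; the struct layout, signatures, and the `queues_by_device` lookup are all assumptions for illustration, not the proposed interface.)

```c
#include <cuda_runtime_api.h>

/* Hypothetical stream interface carrying the two callbacks under discussion. */
struct ArrowDeviceArrayStream {
  int64_t (*get_next_device_id)(struct ArrowDeviceArrayStream* self);
  int (*get_next)(struct ArrowDeviceArrayStream* self, void* queue,
                  struct ArrowDeviceArray* out);
  void (*release)(struct ArrowDeviceArrayStream* self);
  void* private_data;
};

/* Consumer asks which device the next batch lives on, picks a queue it owns
 * on that device, and hands that queue to the producer via get_next. Merging
 * two inputs would repeat this per input and coordinate the two queues. */
static int consume_next(struct ArrowDeviceArrayStream* in,
                        cudaStream_t* queues_by_device,   /* consumer-owned, per device */
                        struct ArrowDeviceArray* out) {
  int64_t device_id = in->get_next_device_id(in);
  cudaStream_t queue = queues_by_device[device_id];
  /* how the queue handle is represented behind void* is an assumption */
  int rc = in->get_next(in, (void*)&queue, out);   /* data now safe to use on `queue` */
  /* ... enqueue consumer kernels on `queue` ... */
  return rc;
}
```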
@westonpace The idea here is that the consumer of the interface provides a queue to the producer and the producer is responsible for ensuring that the data is safe to consume on the provided queue.
The reason for doing this instead of the producer returning a pointer to a queue that the data is safe to consume on is that frameworks generally manage these queues internally and don't have a mechanism to share a queue and control its lifetime over a boundary like this.
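(To illustrate that contract from the producer side, a sketch of a hypothetical `get_next` implementation reusing the struct sketched above. `producer_state`, `fill_device_array`, and the internal stream are invented names, and the event pattern is just one way to meet "safe to consume on the provided queue".)

```c
typedef struct {
  cudaStream_t internal_stream;   /* producer-managed, never exposed */
  /* ... whatever else the producer needs ... */
} producer_state;

/* Hypothetical helper that fills `out` with buffers produced on st->internal_stream. */
void fill_device_array(producer_state* st, struct ArrowDeviceArray* out);

static int producer_get_next(struct ArrowDeviceArrayStream* self,
                             void* consumer_queue,
                             struct ArrowDeviceArray* out) {
  producer_state* st = (producer_state*)self->private_data;
  cudaStream_t consumer_stream = *(cudaStream_t*)consumer_queue;

  fill_device_array(st, out);                      /* work enqueued on internal_stream */

  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, st->internal_stream);      /* capture the producer's work */
  cudaStreamWaitEvent(consumer_stream, done, 0);   /* consumer's queue waits on it */
  cudaEventDestroy(done);                          /* the already-enqueued wait stays valid */
  return 0;
}
```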
The standard "mechanism to share a queue and control its lifetime over a boundary like this" in the C Data Interface would be the release callback.
@pitrou the issue is that there isn't a mechanism that you could call in the release callback (to my knowledge) to cleanly control the lifetime. (@kkraus14 correct me if I'm wrong and this isn't what you meant)
That doesn't make sense, does it? How is the consumer supposed to manage the stream's lifetime if "there isn't a mechanism that you could call in the release callback to cleanly control the lifetime"?
Why wouldn't they? They can easily refcount the usage of their own CUDA streams.
I think that is making a lot of assumptions about how folks use and manage CUDA streams 😄. Again, some places use them similarly to thread pools and only control the lifetime of the pool.

I tried to dig through TensorFlow's code to figure out exactly how they're managing the lifetime of their streams, but I'm not confident; everything I say below may not be correct:

- They have `AllocateStream` and `DeallocateStream` (https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc#L759-L774) to create and destroy CUDA streams.
- There's a `Stream` class (https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/stream_executor/stream.cc#L262-L286) which has constructor, destructor, and init functions that do roughly what you'd expect.
- `Stream` objects are managed in a `StreamPool` (https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/service/stream_pool.h#L27-L59) which then allows "borrowing" the streams using unique ptrs.

I guess in theory, if they ultimately have `Stream` objects being used, one could be moved into the private data used by the release callback.
The fact that they're handling those lifetimes should be enough to plug in a refcounting mechanism (or interact with the GC, in the case of a managed language). This is already necessary to manage the lifetime of data exported through the C Data Interface.
I understand that they might not have a refcounting mechanism in place already, but that's basic engineering anyway.
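(A minimal sketch of that refcounting idea: the exported array's `private_data` holds a refcounted handle to the stream, and the standard C Data Interface `release` callback drops the reference. The `shared_stream` type and names are illustrative, not an existing API, and `struct ArrowArray` is assumed to be the definition from the Arrow C Data Interface.)

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

/* Producer-side holder shared by every export that depends on this stream. */
typedef struct {
  cudaStream_t stream;
  atomic_int   refcount;
} shared_stream;

static void shared_stream_unref(shared_stream* s) {
  if (atomic_fetch_sub(&s->refcount, 1) == 1) {
    cudaStreamDestroy(s->stream);   /* last user gone; now safe to destroy */
    free(s);
  }
}

/* Release callback wired into the exported ArrowArray. Simplified: a real
 * release callback also frees the array's buffers and releases its children. */
static void exported_array_release(struct ArrowArray* array) {
  shared_stream_unref((shared_stream*)array->private_data);
  array->release = NULL;            /* mark the array as released */
}
```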
Okay, regardless: if we take the producer-provided path then I think it makes a lot more sense for the producer to share an Event than a Stream.

An Event can be waited on via `cudaStreamWaitEvent`/`hipStreamWaitEvent`, which does a device-side wait that would have minimal overhead if it's the same stream, or via `cudaEventSynchronize`/`hipEventSynchronize` if blocking host code is desired.
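(For reference, a minimal sketch of the consumer's two options with a producer-shared event, assuming the producer has already recorded its work on the event with `cudaEventRecord`; the HIP calls named above mirror these one-for-one.)

```c
#include <cuda_runtime_api.h>

/* Device-side wait: work enqueued on the consumer's stream after this call
 * will not run until the producer's recorded work completes. Near-zero
 * overhead when producer and consumer are effectively on the same stream. */
static void wait_on_device(cudaStream_t consumer_stream, cudaEvent_t producer_event) {
  cudaStreamWaitEvent(consumer_stream, producer_event, 0);
}

/* Host-side wait: blocks the calling CPU thread until the event completes. */
static void wait_on_host(cudaEvent_t producer_event) {
  cudaEventSynchronize(producer_event);
}
```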
Since it seems we're going to take the Producer-providing-an-event path, there isn't really a need for the `get_next_device_id` callback anymore, correct? Or am I missing something?