[Data] Add option for parallelizing post-collation data batch operations in DataIterator.iter_batches()
#36842
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
…prefetch-batches-args
@@ -97,8 +98,13 @@ def iter_batches(
         process. If set to greater than 0, a separate thread will be used to fetch
         the specified amount of formatted batches from blocks. This improves
         performance for non-CPU bound UDFs, allowing batch fetching compute and
-        formatting to be overlapped with the UDF. Defaults to 0 (no prefetching
-        enabled).
+        formatting to be overlapped with the UDF. Defaults to 1.
Updated the docs from 0 to 1, based on the current default value of 1 in the param definition.
@@ -34,6 +34,7 @@ def iter_batches(
     shuffle_seed: Optional[int] = None,
     ensure_copy: bool = False,
     prefetch_batches: int = 1,
+    gpu_prefetch_batches: int = 1,
Shouldn't this be for iter_torch_batches() only?
`DataIterator.iter_torch_batches()` calls `DataIterator.iter_batches()`, which calls this `iter_batches` function in `block_batching/iter_batches.py`, so I believe we still need to expose this param here.
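For context, a minimal sketch of that call chain (simplified, hypothetical names and signatures; not the actual Ray source):

```python
# Hypothetical sketch of the call chain described above.

def block_batching_iter_batches(block_refs, *, prefetch_batches=1,
                                gpu_prefetch_batches=1):
    # Low-level implementation in block_batching/iter_batches.py;
    # this is where the parameter is ultimately consumed.
    ...

class DataIterator:
    def iter_batches(self, *, prefetch_batches=1, gpu_prefetch_batches=1):
        # Forwards the parameter down to the block-batching layer.
        return block_batching_iter_batches(
            self._block_refs,  # hypothetical attribute
            prefetch_batches=prefetch_batches,
            gpu_prefetch_batches=gpu_prefetch_batches,
        )

    def iter_torch_batches(self, **kwargs):
        # Built on top of iter_batches(), so any parameter the torch
        # path needs must be exposed on iter_batches() as well.
        return self.iter_batches(**kwargs)
```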
        threadpool will be used to format batches and apply the collate_fn.
        Defaults to 1. You can revert back to the old prefetching behavior
        that uses `prefetch_blocks` by setting `use_legacy_iter_batches` to
        True in the DataContext.
Given that it's unlikely someone would want this to be >1, I don't think the comment needs to mention the legacy behavior.
@@ -149,7 +154,7 @@ def _async_iter_batches(
         stats=stats,
         batch_format=batch_format,
         collate_fn=collate_fn,
-        num_threadpool_workers=prefetch_batches,
+        num_threadpool_workers=gpu_prefetch_batches,
We still want full prefetching for the format conversion right? Just not the final GPU loading step.
There are 2 changes we would need to make here:
- As @ericl mentions, we don't want to change the existing thread pool for formatting + collate_fn, but rather use a new thread pool for the host-to-device transfer.
- We would also need to change our collate_fn API. Currently, the API is such that if the user specifies a collate_fn, they are expected to do the host-to-device transfer inside it. This won't work if we want to parallelize collate_fn and the host-to-device transfer independently (see the sketch after this list).
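Roughly, the split being described, as a hedged sketch (the `make_async_gen` call shape follows its usage later in this PR; `block_iter`, `format_and_collate`, and `host_to_device_transfer` are invented names):

```python
# Hypothetical sketch of the proposed two-threadpool split. Each fn
# takes an iterator of batches and yields transformed batches.

cpu_iter = make_async_gen(
    base_iterator=block_iter,
    fn=format_and_collate,         # formatting + collate_fn, CPU-bound
    num_workers=prefetch_batches,  # keep full CPU-side prefetching
)
gpu_iter = make_async_gen(
    base_iterator=cpu_iter,
    fn=host_to_device_transfer,    # H2D transfer only, in its own pool
    num_workers=1,                 # one in-flight H2D copy is enough
)
```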
Signed-off-by: Scott Lee <sjl@anyscale.com>
@amogkam @ericl I tried separating the formatting and collate, as suggested by Amog in (1) above. For (2), changing the …
Sorry, I mean that the collate_fn is what we would want to run in the CPU-based threadpool, along with formatting. Only the host-to-device transfer should happen in the GPU-based threadpool.
How about we have a collate_fn and a finalize_fn (could be marked as an internal argument), and say the finalize_fn always has concurrency 1? We can put things like H2D in the finalize_fn. I don't think you'd ever need more than 1 prefetch for the H2D load.
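For illustration, a default finalize_fn along these lines might look like the following sketch (assumes PyTorch and CUDA; not the merged implementation):

```python
import torch

# Sketch of a default finalize_fn: collate_fn produces a CPU batch,
# and this single-threaded step does the host-to-device copy.
def default_finalize_fn(batch: torch.Tensor) -> torch.Tensor:
    # non_blocking=True overlaps the copy with compute when the
    # source tensor lives in pinned memory.
    return batch.to(device="cuda", non_blocking=True)
```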
in that case we don't need `gpu_prefetch_batches`
Signed-off-by: Scott Lee <sjl@anyscale.com>
LGTM, thanks! Can we add a couple of tests?
- Test that finalize_fn is not run in more than 1 thread. We can use a threading.Lock to test this (see the sketch after this list).
- Test the logic for the different combinations of collate_fn and finalize_fn, and when the defaults are used.
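A sketch of the first suggested test, using a non-blocking lock acquire to detect concurrent execution (the `_finalize_fn` parameter placement is assumed from this PR's description; exact signatures may differ):

```python
import threading

import ray

def test_finalize_fn_runs_in_one_thread():
    lock = threading.Lock()

    def finalize_fn(batch):
        # acquire(blocking=False) fails iff another thread is already
        # inside finalize_fn, i.e. it ran with concurrency > 1.
        assert lock.acquire(blocking=False), "finalize_fn ran concurrently"
        try:
            return batch
        finally:
            lock.release()

    ds = ray.data.range(100)
    for _ in ds.iterator().iter_batches(
        batch_size=10, prefetch_batches=4, _finalize_fn=finalize_fn
    ):
        pass
```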
@@ -180,6 +180,32 @@ def collate(
         yield CollatedBatch(batch.batch_idx, collated_batch)


 def finalize_batches(
Hmm you could call make_async_gen twice in the same function right?
def collate_and_finalize(...):
    collated_iter = make_async_gen(base_iter, fn=collate, num_workers=prefetch_batches)
    finalized_iter = make_async_gen(collated_iter, fn=finalize, num_workers=1)
    return finalized_iter
Signed-off-by: Scott Lee <sjl@anyscale.com>
nice work! just some final comments
python/ray/data/iterator.py
"appropriate dtype and device." | ||
"collate_fn cannot be used with dtypes and device." | ||
"You should manually move the output Torch tensors to the" | ||
"desired dtype and device, outside of collate_fn." |
"desired dtype and device, outside of collate_fn." | |
"desired dtype and device outside of collate_fn." |
@@ -193,7 +201,7 @@ def _format_in_threadpool(
         num_threadpool_workers: The number of threads to use in the threadpool.
Add finalize_fn to the docstring
    finalized_iter = make_async_gen(
        base_iterator=collated_iter,
        fn=threadpool_computations_finalize_fn,
        num_workers=1,
    )
Actually, I don't think we need this case at all. We should just be able to call threadpool_computations_finalize_fn(collated_iter) directly in all cases, since the whole thing is being run in a separate thread anyways. Otherwise we would be prefetching 1 extra batch beyond what's specified in prefetch_batches.
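In code, the suggested simplification would be something like this sketch:

```python
# Sketch of the suggested simplification: apply finalize inline rather
# than via another make_async_gen worker, which would otherwise keep
# one extra batch in flight beyond prefetch_batches.
finalized_iter = threadpool_computations_finalize_fn(collated_iter)
```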
Signed-off-by: Scott Lee <sjl@anyscale.com>
Why are these changes needed?
Currently, the `prefetch_batches` arg of `Dataset.iter_batches` is used to configure the number of preloaded batches on both the CPU and GPU; therefore, in the typical case where there is much more CPU than GPU, this constrains the number of batches to prefetch on the CPU.
This PR adds a separate parameter, `_finalize_fn`, which allows a user-defined function to be executed in a separate threadpool, so that these steps can be parallelized. For example, this could be useful for host-to-device transfers as the last step in getting a batch; this is the default `_finalize_fn` used when `_collate_fn` is not specified. Note that when `_collate_fn` is provided by the user, they should also handle the host-to-device transfer themselves, outside of `_collate_fn`, in order to maximize performance.
Related issue number
Closes #35305
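To make the resulting contract concrete, here is a hedged usage sketch (exact signatures may differ from the merged code): the collate_fn stays on the CPU, and the device move happens outside it.

```python
import numpy as np
import torch

import ray

# Usage sketch based on the description above. A user-provided
# collate_fn should stay on the CPU; the host-to-device move happens
# outside of it (here, explicitly in the consuming loop).
def collate_fn(batch: dict) -> torch.Tensor:
    # CPU-only collation: stack two columns into a single tensor.
    return torch.as_tensor(np.stack([batch["a"], batch["b"]], axis=1))

ds = ray.data.from_items([{"a": i, "b": 2 * i} for i in range(32)])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch in ds.iterator().iter_torch_batches(batch_size=8, collate_fn=collate_fn):
    batch = batch.to(device)  # H2D transfer, outside collate_fn
```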
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.