
[Data] Add option for parallelizing post-collation data batch operations in DataIterator.iter_batches() #36842

Merged
37 commits merged into ray-project:master on Jul 7, 2023

Conversation

@scottjlee scottjlee commented Jun 26, 2023

Why are these changes needed?

Currently, the prefetch_batches arg of Dataset.iter_batches is used to configure the number of preloaded batches on both the CPU and the GPU; in the typical case where there is far more CPU memory than GPU memory, this unnecessarily constrains the number of batches that can be prefetched on the CPU.

This PR adds a separate parameter, _finalize_fn, which accepts a user-defined function that is executed in its own threadpool, so that these final post-collation steps can be overlapped with batch formatting and collation. For example, this is useful when the last step in producing a batch is a host-to-device transfer; that transfer is the default _finalize_fn used when _collate_fn is not specified. Note that when _collate_fn is provided by the user, they should also handle the host-to-device transfer themselves, outside of _collate_fn, in order to maximize performance.
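As a rough usage sketch of the recommended split (hedged: the toy dataset, column name, and batch size are illustrative, and the collate_fn signature is assumed to match Ray's iter_torch_batches API around the time of this PR):

import numpy as np
import torch
import ray

# Toy dataset; the column name "x" is purely illustrative.
ds = ray.data.from_items([{"x": float(i)} for i in range(32)])

def cpu_collate_fn(batch):
    # Keep collation CPU-only: stack/convert here, but do not touch the GPU.
    return torch.as_tensor(np.asarray(batch["x"]))

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in ds.iter_torch_batches(batch_size=8, collate_fn=cpu_collate_fn):
    # Per the note above, the host-to-device transfer is handled by the caller,
    # outside of collate_fn.
    batch = batch.to(device)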

Related issue number

Closes #35305

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 2 commits June 26, 2023 16:26
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review June 27, 2023 01:09
Scott Lee added 3 commits June 26, 2023 20:18
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -97,8 +98,13 @@ def iter_batches(
process. If set to greater than 0, a separate thread will be used to fetch
the specified amount of formatted batches from blocks. This improves
performance for non-CPU bound UDFs, allowing batch fetching compute and
formatting to be overlapped with the UDF. Defaults to 0 (no prefetching
enabled).
formatting to be overlapped with the UDF. Defaults to 1.
Contributor Author

updated the docs from 0 to 1, based on the current default value of 1 in the param definition.

@@ -34,6 +34,7 @@ def iter_batches(
shuffle_seed: Optional[int] = None,
ensure_copy: bool = False,
prefetch_batches: int = 1,
gpu_prefetch_batches: int = 1,
Contributor

Shouldn't this be for iter_torch_batches() only?

@scottjlee scottjlee Jun 28, 2023

DataIterator.iter_torch_batches() calls DataIterator.iter_batches(), which calls this iter_batches function in block_batching/iter_batches.py, so I believe we still need to expose this param here.
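To illustrate the delegation being described here, a simplified, hypothetical sketch (stand-in classes and functions, not Ray's actual source):

from typing import Iterator, List


def _block_level_iter_batches(blocks: List[list], prefetch_batches: int = 1) -> Iterator[list]:
    # Stand-in for the function in block_batching/iter_batches.py, where the
    # new argument would actually be consumed.
    yield from blocks


class DataIteratorSketch:
    def __init__(self, blocks: List[list]) -> None:
        self._blocks = blocks

    def iter_batches(self, prefetch_batches: int = 1) -> Iterator[list]:
        # The public iterator only forwards its arguments downward...
        return _block_level_iter_batches(self._blocks, prefetch_batches=prefetch_batches)

    def iter_torch_batches(self, prefetch_batches: int = 1) -> Iterator[list]:
        # ...and iter_torch_batches() is built on iter_batches(), so any new
        # parameter has to be plumbed through every layer.
        return self.iter_batches(prefetch_batches=prefetch_batches)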

threadpool will be used to format batches and apply the collate_fn.
Defaults to 1. You can revert back to the old prefetching behavior
that uses `prefetch_blocks` by setting `use_legacy_iter_batches` to
True in the DataContext.
Contributor

Given that it's unlikely someone would want this to be >1, I don't think the comment needs to mention the legacy behavior.

@@ -149,7 +154,7 @@ def _async_iter_batches(
stats=stats,
batch_format=batch_format,
collate_fn=collate_fn,
num_threadpool_workers=prefetch_batches,
num_threadpool_workers=gpu_prefetch_batches,
Contributor

We still want full prefetching for the format conversion right? Just not the final GPU loading step.

@ericl ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Jun 27, 2023
@amogkam amogkam left a comment

There are 2 changes we would need to make here

  1. As @ericl mentions, we don’t want to change the existing thread pool for formatting + collate_fn, but rather use a new thread pool for the host-to-device transfer.
  2. We would also need to change our collate_fn API. Currently, the API is such that if the user specifies a collate_fn, they are expected to do the host-to-device transfer in the collate_fn. This won’t work if we want to parallelize collate_fn and the host-to-device transfer independently (a rough sketch of this split follows below).
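A generic sketch of that split, using plain threads rather than Ray's internal make_async_gen (all names below are placeholders; this is an assumption about the shape of the change, not the merged implementation):

from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def threaded_map(items: Iterable[T], fn: Callable[[T], U],
                 num_workers: int, max_in_flight: int) -> Iterator[U]:
    # Apply fn in a thread pool, keeping at most max_in_flight results pending,
    # and yield results in input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(fn, item))
            if len(pending) >= max_in_flight:
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()

def iter_batches_sketch(raw_batches, collate_fn, finalize_fn, prefetch_batches=4):
    # (1) Formatting + collate_fn: a multi-threaded CPU stage, sized by prefetch_batches.
    collated = threaded_map(raw_batches, collate_fn,
                            num_workers=prefetch_batches, max_in_flight=prefetch_batches)
    # (2) Host-to-device transfer (finalize_fn): its own single-threaded stage,
    # so it no longer caps how many batches the CPU side can prefetch.
    yield from threaded_map(collated, finalize_fn, num_workers=1, max_in_flight=1)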

Scott Lee added 2 commits June 28, 2023 14:37
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee commented Jun 28, 2023

@amogkam @ericl I tried separating the formatting and collate steps, as suggested by Amog in (1) above. For (2), changing the collate_fn API to allow parallelization independent of formatting, where would be the best place to make this change? Would we need to call torch.as_tensor(...) over each batch with the device specified, after we format but before we collate?
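For reference, a minimal sketch of what per-column device placement with torch.as_tensor could look like (the batch contents and helper name are hypothetical):

import numpy as np
import torch

def move_batch_to_device(batch: dict, device: torch.device) -> dict:
    # Convert each column of a formatted batch into a tensor on the target device;
    # torch.as_tensor avoids an extra copy when dtype and device already match.
    return {col: torch.as_tensor(arr, device=device) for col, arr in batch.items()}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {"image": np.zeros((8, 3, 32, 32), dtype=np.float32), "label": np.arange(8)}
gpu_batch = move_batch_to_device(batch, device)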

@amogkam amogkam left a comment

Sorry, I meant that the collate_fn should run in the CPU-based threadpool, along with formatting. Only the host-to-device transfer should happen in the GPU-based threadpool.


ericl commented Jun 28, 2023

How about we have a collate_fn and finalize_fn (could be marked internal argument), and say the finalize_fn has concurrency 1 always? We can put things like H2D in the finalize_fn.

I don't think you'd ever need more than 1 prefetch for the H2D load.


amogkam commented Jun 28, 2023

in that case we don't need gpu_prefetch_batches

Scott Lee added 4 commits June 28, 2023 17:53
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@amogkam amogkam left a comment

Lgtm, thanks! Can we add a couple tests?

  1. Test that finalize_fn is not run in more than 1 thread. We can use a threading.Lock to test this (see the sketch after this list).
  2. Test the logic for the different combinations of collate_fn and finalize_fn, and when the defaults are used.
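A minimal sketch of the Lock-based check in item 1 (hedged: passing the function via the internal _finalize_fn argument of DataIterator.iter_batches is an assumption based on this PR's description):

import threading
import ray

def test_finalize_fn_single_threaded():
    lock = threading.Lock()
    overlapping_calls = []

    def finalize_fn(batch):
        # Non-blocking acquire: if a second thread is ever inside finalize_fn
        # concurrently, the acquire fails and the overlap is recorded.
        if not lock.acquire(blocking=False):
            overlapping_calls.append(True)
            return batch
        try:
            return batch  # stand-in for the real host-to-device transfer
        finally:
            lock.release()

    it = ray.data.range(100).iterator()
    # _finalize_fn is the internal argument introduced in this PR (assumed signature).
    for _ in it.iter_batches(batch_size=10, _finalize_fn=finalize_fn):
        pass

    assert not overlapping_calls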

@@ -180,6 +180,32 @@ def collate(
yield CollatedBatch(batch.batch_idx, collated_batch)


def finalize_batches(
Contributor

Hmm, you could call make_async_gen twice in the same function, right?

def collate_and_finalize(...):
    collated_iter = make_async_gen(base_iter, num_workers=prefetch_batches)
    finalized_iter = make_async_gen(collated_iter, num_workers=1)
    return finalized_iter

Scott Lee added 5 commits July 3, 2023 13:25
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from amogkam July 4, 2023 00:27
@amogkam amogkam self-assigned this Jul 4, 2023
@amogkam amogkam left a comment

nice work! just some final comments

"appropriate dtype and device."
"collate_fn cannot be used with dtypes and device."
"You should manually move the output Torch tensors to the"
"desired dtype and device, outside of collate_fn."
Contributor

Suggested change
"desired dtype and device, outside of collate_fn."
"desired dtype and device outside of collate_fn."

@@ -193,7 +201,7 @@ def _format_in_threadpool(
num_threadpool_workers: The number of threads to use in the threadpool.
Contributor

Add finalize_fn to the docstring

finalized_iter = make_async_gen(
base_iterator=collated_iter,
fn=threadpool_computations_finalize_fn,
num_workers=1,
Contributor

Actually, I don't think we need this case at all. We should just be able to call threadpool_computations_finalize_fn(collated_iter) directly in all cases, since the whole thing is being run in a separate thread anyway.

Otherwise, we would be prefetching 1 extra batch beyond what's specified in prefetch_batches.

Scott Lee added 2 commits July 4, 2023 21:53
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from amogkam July 5, 2023 15:57
amogkam added 7 commits July 5, 2023 10:55
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@amogkam amogkam merged commit e55e1fe into ray-project:master Jul 7, 2023
scottjlee added a commit to scottjlee/ray that referenced this pull request Jul 10, 2023
…ons in `DataIterator.iter_batches()` (ray-project#36842)

@scottjlee scottjlee mentioned this pull request Jul 10, 2023
scottjlee added a commit to scottjlee/ray that referenced this pull request Jul 10, 2023
…ons in `DataIterator.iter_batches()` (ray-project#36842)

@scottjlee scottjlee mentioned this pull request Jul 10, 2023
bveeramani pushed a commit that referenced this pull request Jul 12, 2023
…ons in `DataIterator.iter_batches()` (#36842) (#37260)

arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ons in `DataIterator.iter_batches()` (ray-project#36842)

Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] Set the prefetch depth separately for GPU-preloading in iter_batches
5 participants