[Data] Add option for parallelizing post-collation data batch operations in DataIterator.iter_batches()
#36842
Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -34,6 +34,7 @@ def iter_batches(
     shuffle_seed: Optional[int] = None,
     ensure_copy: bool = False,
     prefetch_batches: int = 1,
+    gpu_prefetch_batches: int = 1,
 ) -> Iterator[DataBatch]:
     """Create formatted batches of data from an iterator of block object references and
     corresponding metadata.
@@ -97,8 +98,13 @@ def iter_batches(
             process. If set to greater than 0, a separate thread will be used to fetch
             the specified amount of formatted batches from blocks. This improves
             performance for non-CPU bound UDFs, allowing batch fetching compute and
-            formatting to be overlapped with the UDF. Defaults to 0 (no prefetching
-            enabled).
+            formatting to be overlapped with the UDF. Defaults to 1.
+        gpu_prefetch_batches: The number of batches to fetch ahead of the current
+            batch to fetch on the GPU. If set to greater than 0, a separate
+            threadpool will be used to format batches and apply the collate_fn.
+            Defaults to 1. You can revert back to the old prefetching behavior
+            that uses `prefetch_blocks` by setting `use_legacy_iter_batches` to
+            True in the DataContext.

     Returns:
         An iterator over record batches.

Review comment (on the prefetch_batches docstring): Updated the docs from 0 to 1, based on the current default value of `prefetch_batches`.

Review comment (on the gpu_prefetch_batches docstring): Given that it's unlikely someone would want this to be greater than 1, I don't think the comment needs to mention the legacy behavior.
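For orientation, here is a minimal usage sketch of the proposed knob from the consumer side. It is hypothetical: it assumes `gpu_prefetch_batches` is plumbed through to `DataIterator.iter_torch_batches()` (see the call-chain discussion at the end of this thread), and it assumes the `"id"` column name produced by `ray.data.range()`.

import ray
import torch

ds = ray.data.range(1024)
it = ds.iterator()

# Run on GPU if one is available so the sketch stays runnable on CPU-only hosts.
device = "cuda" if torch.cuda.is_available() else "cpu"

def collate_fn(batch):
    # Post-collation device transfer: the step whose prefetch depth the
    # proposed gpu_prefetch_batches parameter controls.
    return torch.as_tensor(batch["id"]).to(device)

for batch in it.iter_torch_batches(
    batch_size=128,
    prefetch_batches=4,      # overlap block fetching/formatting with the loop body
    gpu_prefetch_batches=1,  # hypothetical parameter from this PR: batches staged ahead on the GPU
    collate_fn=collate_fn,
):
    pass  # a training step would consume `batch` here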
@@ -119,7 +125,6 @@ def iter_batches(
     def _async_iter_batches(
         block_refs: Iterator[Tuple[ObjectRef[Block], BlockMetadata]],
     ) -> Iterator[DataBatch]:
-
         # Step 1: Prefetch logical batches locally.
         block_refs = prefetch_batches_locally(
             block_ref_iter=block_refs,
@@ -149,7 +154,7 @@ def _async_iter_batches(
             stats=stats,
             batch_format=batch_format,
             collate_fn=collate_fn,
-            num_threadpool_workers=prefetch_batches,
+            num_threadpool_workers=gpu_prefetch_batches,
         )

         # Step 5: Restore original order.

Review comment (on the num_threadpool_workers change): We still want full prefetching for the format conversion, right? Just not the final GPU loading step.
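Since the review question above hinges on which pipeline stage the threadpool parallelizes, here is a generic, simplified sketch (not Ray's actual implementation) of the underlying pattern: applying a format/collate function in a thread pool while bounding how many results are in flight ahead of the consumer.

import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def threadpool_map(
    items: Iterable[T],
    fn: Callable[[T], U],
    num_workers: int,  # must be >= 1; bounds how many results are in flight
) -> Iterator[U]:
    """Apply fn to items in a thread pool and yield results in the
    original order, keeping at most num_workers results staged ahead."""
    executor = ThreadPoolExecutor(max_workers=num_workers)
    futures: queue.Queue = queue.Queue(maxsize=num_workers)

    def submit_all():
        for item in items:
            futures.put(executor.submit(fn, item))  # blocks when the queue is full
        futures.put(None)  # sentinel: no more work

    threading.Thread(target=submit_all, daemon=True).start()
    while (fut := futures.get()) is not None:
        yield fut.result()
    executor.shutdown()

Under this pattern, setting num_threadpool_workers to gpu_prefetch_batches = 1 means at most one collated batch is staged ahead of the consumer, which bounds the GPU memory held by prefetched batches; the reviewer's point is that the earlier format-conversion stage could still be prefetched more aggressively.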
Review comment: Shouldn't this be for iter_torch_batches() only?

Reply: DataIterator.iter_torch_batches() calls DataIterator.iter_batches(), which calls this iter_batches function in block_batching/iter_batches.py, so I believe we still need to expose this param here.
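To make the call chain in that reply concrete, here is a schematic with illustrative stubs (not the actual Ray source) showing why the parameter has to exist at each layer:

from typing import Iterator

def _block_batching_iter_batches(*, gpu_prefetch_batches: int = 1, **kwargs) -> Iterator:
    """Stand-in for iter_batches in block_batching/iter_batches.py, the
    function this diff modifies (it sizes the collate threadpool)."""
    yield from ()

class DataIterator:
    def iter_batches(self, *, gpu_prefetch_batches: int = 1, **kwargs) -> Iterator:
        # Generic public entry point; forwards the new parameter down to
        # the block-batching implementation.
        return _block_batching_iter_batches(
            gpu_prefetch_batches=gpu_prefetch_batches, **kwargs
        )

    def iter_torch_batches(self, *, gpu_prefetch_batches: int = 1, **kwargs) -> Iterator:
        # The Torch path delegates to iter_batches(), so even if GPU
        # prefetching mainly matters for Torch users, the parameter must
        # be exposed on the generic path too.
        return self.iter_batches(
            gpu_prefetch_batches=gpu_prefetch_batches, **kwargs
        )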