[data] [streaming] Support a streaming_repartition() operator #36724
Comments
Suppose I have 20 large files and have implemented a specific datasource for them. I would like to load the dataset with read_datasource at a parallelism of, say, 200. I see that the SplitBlocks function splits each block into many smaller blocks. My question is: how does SplitBlocks work? Will it split the single big file in each row into many binary parts and then coalesce them somewhere downstream?
SplitBlocks works within the read task to split the read output into multiple smaller pieces. These will remain as smaller individual blocks for the remainder of the computation unless the dataset is explicitly repartitioned. Ray Data will automatically insert SplitBlocks to ensure the desired/autodetected parallelism is met after a read.
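To make the splitting step concrete, here is a minimal pure-Python sketch of cutting one read task's output into several smaller pieces. It is not Ray Data's actual SplitBlocks code; the function name `split_block` and the use of plain lists as blocks are assumptions made for illustration only.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_block(rows: Sequence[T], num_splits: int) -> List[List[T]]:
    """Split one block of rows into up to `num_splits` contiguous pieces.

    Hypothetical helper, not part of the Ray API. Pieces differ in size by
    at most one row, and row order is preserved.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    # Never produce more pieces than there are rows (but always at least one).
    num_splits = min(num_splits, max(len(rows), 1))
    base, extra = divmod(len(rows), num_splits)
    pieces: List[List[T]] = []
    start = 0
    for i in range(num_splits):
        # The first `extra` pieces take one additional row each.
        size = base + (1 if i < extra else 0)
        pieces.append(list(rows[start:start + size]))
        start += size
    return pieces


# With 20 read tasks and a requested parallelism of 200, each read output
# would be cut into 200 // 20 = 10 pieces under this sketch.
pieces = split_block(list(range(1000)), 200 // 20)
```

In this sketch the pieces are never merged back together, which mirrors the point above that the smaller blocks stay separate unless the dataset is explicitly repartitioned.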
I've also been looking for this feature. We have pipelines where processing a single row can take many minutes, so we need greater control over block size at various stages of our pipelines.
## Why are these changes needed? Add repartition by target number of rows per block Addresses #36724 --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
@srinathk10 Why is this closed? Does #50179 provide a true streaming split, that is, one that does not need to materialize the whole dataset? The documentation still says this operation requires all inputs to be materialized in the object store for it to execute. Could you please clarify what exactly "all inputs" refers to here: the entire dataset, or only the input of the current node? Can I understand that in the case of
https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.repartition.html Yes, #50179 does handle streaming repartition when
## Why are these changes needed? Add repartition by target number of rows per block Addresses ray-project#36724 --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com> Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com>
In several use cases, it is useful to change the block size of datasets in a streaming way. The current repartition() operator is an all-to-all operator and is incompatible with streaming.
We could implement a general purpose streaming_repartition() operator that supports repartitioning in a few streaming-compatible ways:
This could be implemented as a new PhysicalOperator that implements the online repartitioning. This could also replace the current SplitBlocks mechanism from #36352
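As a rough illustration of what "online" repartitioning could mean, the sketch below re-chunks a stream of blocks to a target number of rows per block while buffering at most one partial output block plus the current input block. It is a plain-Python generator, not a Ray PhysicalOperator; the `target_rows` parameter and the list-of-rows block representation are assumptions, and the function only borrows the proposed `streaming_repartition` name for readability.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def streaming_repartition(
    blocks: Iterable[List[T]], target_rows: int
) -> Iterator[List[T]]:
    """Yield blocks of exactly `target_rows` rows (the last one may be smaller).

    Input blocks are consumed one at a time, so the upstream never has to be
    fully materialized, unlike an all-to-all repartition.
    """
    if target_rows < 1:
        raise ValueError("target_rows must be at least 1")
    buffer: List[T] = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full output blocks as the buffer currently holds.
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        # Flush the leftover rows as a final, smaller block.
        yield buffer


# Usage: uneven input blocks are re-chunked into blocks of 4 rows.
uneven = [[1, 2, 3, 4, 5], [6], [7, 8, 9, 10, 11, 12]]
rechunked = list(streaming_repartition(uneven, target_rows=4))
```

Because the generator only pulls the next input block when it needs more rows, a downstream consumer can start working on the first output block before the rest of the input has been read. This preserves row order but does not balance data across nodes, which is one reason a real operator would need more than this sketch.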