
Flatten of Bounded and Unbounded repeats the union with the RDD for each micro-batch. #18144

Open
kennknowles opened this issue Jun 3, 2022 · 2 comments

@kennknowles
Member

Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same RDD to be unioned into every micro-batch, multiplying its content in the resulting stream by the number of batches.
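
To illustrate the problematic pattern, here is a minimal Scala sketch with hypothetical names (`boundedRDD` stands for the materialized BOUNDED side, `unboundedStream` for the UNBOUNDED side):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// The closure passed to transform() runs once per micro-batch, so the
// bounded RDD is unioned into EVERY batch instead of appearing once.
def flattenBounded(boundedRDD: RDD[String], unboundedStream: DStream[String]): DStream[String] =
  unboundedStream.transform { batchRDD =>
    batchRDD.sparkContext.union(batchRDD, boundedRDD)
  }
```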

Spark does not seem to provide an out-of-the-box implementation for this.

One approach I tried was to create the stream from a Queue (a single-RDD stream), but this is not an option since it fails checkpointing.
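
For reference, that rejected approach looked roughly like this (a sketch; Spark's queue streams are documented as unsupported under checkpointing):

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Wrap the bounded RDD in a one-shot queue stream and union it with the
// unbounded stream. The RDD is emitted in the first batch only, but
// queue streams cannot be checkpointed, so recovery breaks.
def flattenViaQueue[T: ClassTag](
    ssc: StreamingContext,
    boundedRDD: RDD[T],
    unbounded: DStream[T]): DStream[T] = {
  val oneShot = ssc.queueStream(mutable.Queue(boundedRDD), oneAtATime = true)
  unbounded.union(oneShot)
}
```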

Another approach would be to create a custom InputDStream that does this (see the sketch below).

An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
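
A minimal sketch of such a custom InputDStream (a hypothetical class, not existing Beam or Spark code) also shows where checkpoint recovery bites: the `emitted` flag and the captured RDD are plain in-memory state and are not restored after a driver restart:

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Emits the bounded RDD exactly once (in the first batch), then empty RDDs.
class SingleRDDInputDStream[T: ClassTag](_ssc: StreamingContext, boundedRDD: RDD[T])
    extends InputDStream[T](_ssc) {

  // Plain in-memory state: lost on driver failure, so after recovery the
  // bounded RDD would either be re-emitted or never emitted at all.
  private var emitted = false

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[T]] =
    if (!emitted) {
      emitted = true
      Some(boundedRDD)
    } else {
      Some(context.sparkContext.emptyRDD[T])
    }
}
```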

Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.

@twosom
Contributor

twosom commented Feb 23, 2025

.take-issue

@twosom
Contributor

twosom commented Feb 23, 2025

It seems to be a duplicate of issue #20426
