
Flatten of Bounded and Unbounded repeats the union with the RDD for each micro-batch. #18144

Open
kennknowles opened this issue Jun 3, 2022 · 2 comments

@kennknowles
Member

Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same RDD to be unioned into every micro-batch, multiplying its content in the resulting stream by the number of batches.
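
To illustrate the problematic pattern, here is a minimal Scala sketch with hypothetical names (`boundedRDD` stands for the materialized BOUNDED side, `unboundedStream` for the UNBOUNDED side):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// The closure passed to transform() runs once per micro-batch, so the
// bounded RDD is unioned into EVERY batch instead of appearing once.
def flattenBounded(boundedRDD: RDD[String], unboundedStream: DStream[String]): DStream[String] =
  unboundedStream.transform { batchRDD =>
    batchRDD.sparkContext.union(batchRDD, boundedRDD)
  }
```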

Spark does not seem to provide an out-of-the-box implementation for this.

One approach I tried was to create the stream from a Queue (a single-RDD stream), but this is not an option since it fails checkpointing.
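
For reference, that rejected approach looked roughly like this (a sketch; Spark's queue streams are documented as unsupported under checkpointing):

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Wrap the bounded RDD in a one-shot queue stream and union it with the
// unbounded stream. The RDD is emitted in the first batch only, but
// queue streams cannot be checkpointed, so recovery breaks.
def flattenViaQueue[T: ClassTag](
    ssc: StreamingContext,
    boundedRDD: RDD[T],
    unbounded: DStream[T]): DStream[T] = {
  val oneShot = ssc.queueStream(mutable.Queue(boundedRDD), oneAtATime = true)
  unbounded.union(oneShot)
}
```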

Another approach would be to create a custom InputDStream that does this (see the sketch below).

An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
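
A minimal sketch of such a custom InputDStream (a hypothetical class, not existing Beam or Spark code) also shows where checkpoint recovery bites: the `emitted` flag and the captured RDD are plain in-memory state and are not restored after a driver restart:

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Emits the bounded RDD exactly once (in the first batch), then empty RDDs.
class SingleRDDInputDStream[T: ClassTag](_ssc: StreamingContext, boundedRDD: RDD[T])
    extends InputDStream[T](_ssc) {

  // Plain in-memory state: lost on driver failure, so after recovery the
  // bounded RDD would either be re-emitted or never emitted at all.
  private var emitted = false

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[T]] =
    if (!emitted) {
      emitted = true
      Some(boundedRDD)
    } else {
      Some(context.sparkContext.emptyRDD[T])
    }
}
```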

Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.

@twosom
Contributor

twosom commented Feb 23, 2025

.take-issue

@twosom
Contributor

twosom commented Feb 23, 2025

It seems to be a duplicate of issue #20426
