[SPARK-17782][STREAMING][KAFKA] eliminate race condition of poll twice #15387
…g called twice and moving position
Test build #66477 has finished for PR 15387 at commit
Test build #66479 has finished for PR 15387 at commit
@@ -223,7 +223,7 @@ private[spark] class DirectKafkaInputDStream[K, V](

   override def start(): Unit = {
     val c = consumer
-    c.poll(0)
+    assert(c.poll(0).isEmpty, "Driver shouldn't consume messages; pause if you poll during setup")
Is this poll(0) guaranteed to not return any record if the previous poll(0) is paused immediately? Is there a race condition possible where the first poll(0) (inside consumer strategy) manages to actually fetch records internally before it is paused, which is then returned by this poll(0) (inside DStream)?
I'm not going to say anything is impossible, which is the point of the … The whole poll 0 / pause thing is a gross hack, but it's what was suggested …
Even if
I set auto commit to false, and still recreated the test failure. That makes sense to me, consumer position should still be getting updated … At any rate, there are valid (albeit ill-advised in my opinion) reasons to …
@koeninger If …
#15397 is the fix for structured streaming.
You don't want poll consuming messages, it's not just about offset …
Poll also isn't going to return you just messages for a single …
If the concern is TD's comment, … During the original implementation I had verified that calling pause kills the internal message buffer, which is one of the complications leading to a cached consumer per partition. I really don't think it's going to happen, but the assert is in there for paranoia, and to be explicit about the conditions.
Let me know if you guys like that alternative PR better
I observed the same behavior while debugging. I found that the first …
I think you have agreed that this is impossible via current KafkaConsumer APIs as well. However, the unknown thing to me is that if the first …
If you're worried about it then accept the alternative PR I linked.
…of poll twice ## What changes were proposed in this pull request? Alternative approach to apache#15387 Author: cody koeninger <cody@koeninger.org> Closes apache#15401 from koeninger/SPARK-17782-alt.
…of poll twice ## What changes were proposed in this pull request? Alternative approach to #15387 Author: cody koeninger <cody@koeninger.org> Closes #15401 from koeninger/SPARK-17782-alt. (cherry picked from commit f9a56a1) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
What changes were proposed in this pull request?
Kafka consumers can't subscribe or maintain heartbeat without polling, but polling ordinarily consumes messages and adjusts position. We don't want this on the driver, so we poll with a timeout of 0 and pause all topicpartitions.
Some consumer strategies that seek to particular positions have to poll first, but they weren't pausing immediately thereafter. Thus, there was a race condition where the second poll() in the DStream start method might actually adjust consumer position.
This change eliminates (or at least drastically reduces the chance of) the race condition by pausing in the relevant consumer strategies, and asserts on startup that no messages were consumed.
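The ordering problem described above can be sketched with a toy model. This is only an illustration: `ToyConsumer`, `PollTwiceSketch`, `setup`, `start`, and `pauseAfterSeek` are hypothetical names invented for this sketch, not part of the Kafka client or of this patch, and the toy is deterministic, whereas the real race depends on whether the first poll manages to fetch anything.

```scala
// Toy model only: ToyConsumer is a hypothetical stand-in, not the Kafka client API.
final class ToyConsumer(log: Vector[String]) {
  private var pos = 0         // next offset to read
  private var paused = false  // when true, poll returns nothing

  def position: Int = pos
  def seek(offset: Int): Unit = { pos = offset }
  def pause(): Unit = { paused = true }

  // Returns every record from the current position and advances past them,
  // unless the consumer is paused, in which case nothing is returned or moved.
  def poll(): Vector[String] =
    if (paused) Vector.empty
    else {
      val records = log.drop(pos)
      pos = log.length
      records
    }
}

object PollTwiceSketch {
  // Strategy setup: poll once, seek to the requested offset, and optionally
  // pause immediately afterwards (the behavior this change adds).
  def setup(c: ToyConsumer, startOffset: Int, pauseAfterSeek: Boolean): Unit = {
    c.poll()
    c.seek(startOffset)
    if (pauseAfterSeek) c.pause()
  }

  // Driver start: this second poll should return nothing and leave the position alone.
  def start(c: ToyConsumer): Vector[String] = c.poll()
}
```

In this model, skipping the pause lets the second poll return records and move the position past the offset the strategy asked for, which is the condition the new assert is meant to catch.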
How was this patch tested?
I reliably reproduced the intermittent test failure by inserting a Thread.sleep directly before returning from SubscribePattern. The suggested fix eliminated the failure.