-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17834][SQL]Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer #15397
Conversation
…g on KafkaConsumer
/cc @tdas @koeninger |
Test build #66543 has finished for PR 15397 at commit
|
How is this going to work with assign? It seems like it's just avoiding the problem, not fixing it. |
We can seek to the offsets provided by the user. |
Look at the poll/seek implementation in the DStream's subscribe and On Sun, Oct 9, 2016 at 11:41 PM, Shixiong Zhu notifications@github.com
|
@koeninger sorry for the delay. Right now KafkaSource doesn't support external group id, so we don't need to concern about how to fetching committed offsets. Any other cases that I'm missing? |
Test build #66849 has finished for PR 15397 at commit
|
My main point is that whoever implements SPARK-17812 is going to have to deal with the issue shown in SPARK-17782, which means much of this patch is going to need to be changed anyway. But It's not just about external group id. Committed offsets would actually make the issue in SPARK-17782 less of a problem, because they would take precedence over auto.offset.reset |
@koeninger I agreed that this patch will be changed. However, this PR does fix a known issue for the current supported features and there is not user facing changes. Considering 2.0.2 may come out soon and I don't think SPARK-17812 will be done soon, I would like to merge this to fix issues for 2.0.2. What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another once over, couple more minor things.
If the plan is to wait for SPARK-17812 to fix up the other stuff I was concerned about, that's ok with me, but I really hope it doesn't slip past another release. To reiterate, I'm fine with doing that work.
@@ -256,8 +269,6 @@ private[kafka010] case class KafkaSource( | |||
*/ | |||
private def fetchNewPartitionEarliestOffsets( | |||
newPartitions: Seq[TopicPartition]): Map[TopicPartition, Long] = withRetriesWithoutInterrupt { | |||
// Make sure `KafkaConsumer.poll` won't be interrupted (KAFKA-1894) | |||
assert(Thread.currentThread().isInstanceOf[StreamExecutionThread]) | |||
// Poll to get the latest assigned partitions | |||
consumer.poll(0) | |||
val partitions = consumer.assignment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to pause all partitions here as well?
@@ -270,7 +281,7 @@ private[kafka010] case class KafkaSource( | |||
// So we need to ignore them | |||
partitions.contains(p) | |||
}.map(p => p -> consumer.position(p)).toMap | |||
logDebug(s"Got offsets for new partitions: $partitionToOffsets") | |||
logDebug(s"Got earliest offsets for new partitions: $partitionToOffsets") | |||
partitionToOffsets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: different variable name partitionToOffsets vs partitionOffsets for what is essentially the same thing
Test build #66908 has finished for PR 15397 at commit
|
LGTM, thanks for talking it through |
Thanks! Merging to master and 2.0. |
… instead of counting on KafkaConsumer ## What changes were proposed in this pull request? Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15397 from zsxwing/SPARK-17834. (cherry picked from commit 08eac35) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
… instead of counting on KafkaConsumer ## What changes were proposed in this pull request? Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#15397 from zsxwing/SPARK-17834.
… instead of counting on KafkaConsumer ## What changes were proposed in this pull request? Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#15397 from zsxwing/SPARK-17834.
What changes were proposed in this pull request?
Because
KafkaConsumer.poll(0)
may update the partition offsets, this PR just callsseekToBeginning
to manually set the earliest offsets for the KafkaSource initial offsets.How was this patch tested?
Existing tests.