[SPARK-23623] [SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer #20767

tdas · 2018-03-08T02:37:22Z

What changes were proposed in this pull request?

CachedKafkaConsumer in the project kafka-0-10-sql is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built on the assumption that only one task will try to read a given Kafka TopicPartition at a time. Hence, the cache was keyed by the TopicPartition a consumer is supposed to read. For cases where this assumption may not hold, there is a SparkPlan flag to disable the use of the cache. So it was up to the planner to correctly identify when it was not safe to use the cache and set the flag accordingly.

Fundamentally, this is the wrong way to approach the problem. It is HARD for a high-level planner to reason about the low-level execution model and to know whether multiple tasks in the same query will try to read the same partition. Case in point: 2.3.0 introduced stream-stream joins, and you can build a streaming self-join query on Kafka. It is non-trivial to see that this leads to two tasks reading the same partition, possibly concurrently. Because of that, it is hard for the planner to detect the situation and set the flag to avoid the cache / consumer pool. This can inadvertently lead to ConcurrentModificationException, or worse, silent reading of incorrect data.

Here is a better way to design this. The planner shouldn't have to understand these low-level optimizations. Rather, the consumer pool should be smart enough to avoid concurrent use of a cached consumer. Currently, it tries to do so but incorrectly (the flag inuse is not checked when returning a cached consumer, see https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403). If there is another request for the same partition as a currently in-use consumer, the pool should automatically return a fresh consumer that is closed when the task is done. Then the planner does not need a flag to avoid reuse.

This PR is a step towards that goal. It does the following.

- There are effectively two kinds of consumer that may be generated:
  - Cached consumer - this should be returned to the pool at task end
  - Non-cached consumer - this should be closed at task end
- A trait called KafkaConsumer is introduced to hide this difference from the users of the consumer, so that client code does not have to reason about whether to stop or release. Callers simply call val consumer = KafkaConsumer.acquire and then consumer.release().
- If there is a request for a consumer that is in use, then a new consumer is generated.
- If there is a concurrent attempt of the same task, then a new consumer is generated, and the existing cached consumer is marked for close upon release.
- In addition, I renamed the classes because CachedKafkaConsumer is a misnomer given that what it returns may or may not be cached.
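
The acquire/release contract described in this list can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: InternalConsumer, DataConsumer, and ConsumerPool are hypothetical names, the cache is keyed by a plain string rather than a TopicPartition, no real Kafka consumer is created, and the task-reattempt case is left out.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the real Kafka consumer; it only tracks state.
final class InternalConsumer(val key: String) {
  var inUse: Boolean = false
  var closed: Boolean = false
  def close(): Unit = closed = true
}

// Callers only see this trait, so they never choose between
// "return to the pool" and "close".
sealed trait DataConsumer {
  def internal: InternalConsumer
  def release(): Unit
}

object ConsumerPool {
  private val cache = mutable.HashMap.empty[String, InternalConsumer]

  // Cached consumer: releasing it hands it back to the pool.
  private final case class Cached(internal: InternalConsumer) extends DataConsumer {
    def release(): Unit = ConsumerPool.synchronized {
      internal.inUse = false
    }
  }

  // Non-cached consumer: releasing it closes it.
  private final case class NonCached(internal: InternalConsumer) extends DataConsumer {
    def release(): Unit = internal.close()
  }

  def acquire(key: String): DataConsumer = synchronized {
    cache.get(key) match {
      case Some(existing) if existing.inUse =>
        // Someone else holds the cached consumer: hand out a one-off instance.
        NonCached(new InternalConsumer(key))
      case Some(existing) =>
        existing.inUse = true
        Cached(existing)
      case None =>
        val fresh = new InternalConsumer(key)
        fresh.inUse = true
        cache.put(key, fresh)
        Cached(fresh)
    }
  }
}
```

In this sketch, two overlapping acquire calls for the same key never share an instance: the second caller gets a fresh consumer that is closed on release, while the first caller's consumer goes back to the pool.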

To keep this patch safe enough to merge into branch-2.3, this PR does not remove the planner flag that disables reuse. That can be done later, in master only.

How was this patch tested?

A new stress test that verifies it is safe to concurrently get consumers for the same partition from the consumer pool.

tdas · 2018-03-08T02:40:55Z

@zsxwing @brkyvz PTAL.

SparkQA · 2018-03-08T02:44:47Z

Test build #88070 has finished for PR 20767 at commit 97510c6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2018-03-08T08:57:59Z

jenkins retest this please.

SparkQA · 2018-03-08T09:25:12Z

Test build #88081 has finished for PR 20767 at commit 9e771b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-08T09:32:24Z

Test build #88082 has finished for PR 20767 at commit 9e771b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing

Left some comments

zsxwing · 2018-03-09T00:01:18Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

+    } else if (existingInternalConsumer == null) {
+      newNonCachedConsumer.internalConsumer.inuse = true
+      cache.put(key, newNonCachedConsumer.internalConsumer)
+      newNonCachedConsumer


We should return a CachedKafkaDataConsumer in this case. Right?

oh yes. damn it. my bad.

zsxwing · 2018-03-09T00:05:10Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

+          cache.remove(key)
+        } else {
+          consumer.inuse = false
+        }
      } else {
        logWarning(s"Attempting to release consumer that does not exist")


This is the case that a consumer may be evicted because of the max capacity. In this case, we should close the internal consumer.

Aah. The warning was misleading. Will add comments to clarify that.

This should not be the case. We do not remove any consumer from the cache while it is being used. So the scenario that you mentioned should not happen.

zsxwing · 2018-03-09T00:09:55Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

-        cache.put(key, new CachedKafkaConsumer(topicPartition, kafkaParams))
+      // If this is reattempt at running the task, then invalidate cache and start with
+      // a new consumer
+      if (existingInternalConsumer != null) {


This logic here seems wrong. I think it should be something like this?

    if (existingInternalConsumer != null) {
      if (existingInternalConsumer.inuse) {
        existingInternalConsumer.markedForClose = true
        newNonCachedConsumer
      } else {
        existingInternalConsumer.close()
        cache.put(key, newNonCachedConsumer.internalConsumer)
        new CachedKafkaDataConsumer(newNonCachedConsumer.internalConsumer)
      }
    } else {
      cache.put(key, newNonCachedConsumer.internalConsumer)
      new CachedKafkaDataConsumer(newNonCachedConsumer.internalConsumer)
    }

This is indeed better. What I was doing was always deferring to a later point. But that would lead to it being used one more time before being closed.

zsxwing · 2018-03-09T00:11:25Z

...nal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDataConsumerSuite.scala

+    val numThreads = 50
+    val numConsumerUsages = 500
+
+    val threadpool = Executors.newFixedThreadPool(numThreads)


nit: threadpool should be shut down
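
One way to address this nit, assuming the test submits its work as Runnables, is to wrap the test body in try/finally so the pool is shut down even when a task or assertion fails. The sketch below is illustrative only: StressSketch and its counter are hypothetical stand-ins for the real consumer usage in KafkaDataConsumerSuite.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object StressSketch {
  // Runs numUsages trivial tasks on a fixed pool and returns how many completed.
  def run(numThreads: Int, numUsages: Int): Int = {
    val threadpool = Executors.newFixedThreadPool(numThreads)
    val completed = new AtomicInteger(0)
    try {
      val futures = (1 to numUsages).map { _ =>
        threadpool.submit(new Runnable {
          // Placeholder for "acquire a consumer, read, release".
          override def run(): Unit = completed.incrementAndGet()
        })
      }
      // Propagate any task failure and bound the wait.
      futures.foreach(_.get(1, TimeUnit.MINUTES))
      completed.get()
    } finally {
      // Always release the worker threads, even if a task failed.
      threadpool.shutdown()
    }
  }
}
```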

SparkQA · 2018-03-09T03:29:59Z

Test build #88109 has finished for PR 20767 at commit 0a838c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz

Love this. Left one comment.

brkyvz · 2018-03-09T19:23:36Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

      }
    }
  }

  /**
   * Get a cached consumer for groupId, assigned to topic and partition.
   * If matching consumer doesn't already exist, will be created using kafkaParams.
+   * This will make a best effort attempt to


I would love to see the rest of this sentence. Such a cliffhanger!

SparkQA · 2018-03-09T23:02:59Z

Test build #88140 has finished for PR 20767 at commit 37a9225.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2018-03-10T00:15:08Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala


  private val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]

-  private var consumer = createConsumer
+  @volatile private var consumer = createConsumer


I think these @volatiles are not necessary. I'm okay with them though.

yeah, i just added them to be safer. one less thing to worry about.

zsxwing · 2018-03-10T00:21:30Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

+
+    } else {
+      // If consumer is already cached and is currently not in use, then return that consumer
+      CachedKafkaDataConsumer(existingInternalConsumer)


we should set existingInternalConsumer.inuse = true

I wonder why this was not caught in the stress test.

i will run a longer stress test locally just to be sure.

zsxwing · 2018-03-10T00:23:01Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

-      val consumer = cache.get(key)
-      consumer.inuse = true
-      consumer
+    } else if (!useCache) {


this if should be moved before the above if.

why? I am saying that we should NOT reuse consumers for ANY task retries, independent of the desire to use the cache or not.

When useCache is false, i would expect newInternalConsumer should never be put into the cache. The above if may put newInternalConsumer into the cache. If we put a consumer used by a continuous processing query into the cache and assume it never ends, it will prevent other micro batch queries from putting a consumer reading the same topic partition into the cache.

Technically that won't happen, because the continuous query and the batch query will have different group IDs. But I agree that if useCache is false, then we should not put it in the cache in any way. In fact, we can simplify the task retry case further by never putting the new one in the cache and only invalidating the existing cached one. The only scenario where this will hurt a little is that the micro-batch immediately after the reattempt will create a new consumer. That's a tiny one-time cost in a scenario where the reattempt has already made things slightly costly.
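
The ordering discussed in this thread could look roughly like the sketch below. It is a hedged illustration, not the merged code: StubConsumer, Handle, and RetryAwarePool are hypothetical names, and attemptNumber is passed in explicitly here, whereas real code would have to obtain it from the running task.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the internal consumer; it only tracks state.
final class StubConsumer {
  var inUse: Boolean = false
  var markedForClose: Boolean = false
  var closed: Boolean = false
  def close(): Unit = closed = true
}

sealed trait Handle { def consumer: StubConsumer }
final case class CachedHandle(consumer: StubConsumer) extends Handle
final case class NonCachedHandle(consumer: StubConsumer) extends Handle

final class RetryAwarePool {
  private val cache = mutable.HashMap.empty[String, StubConsumer]

  def acquire(key: String, attemptNumber: Int, useCache: Boolean): Handle = synchronized {
    val existing = cache.get(key)
    if (attemptNumber >= 1) {
      // Task reattempt: invalidate whatever is cached, never cache the new one.
      existing.foreach { c =>
        if (c.inUse) {
          c.markedForClose = true // the current holder closes it on release
        } else {
          c.close()
          cache.remove(key)
        }
      }
      NonCachedHandle(new StubConsumer)
    } else if (!useCache) {
      // Caching disabled: the cache is never touched.
      NonCachedHandle(new StubConsumer)
    } else existing match {
      case None =>
        val fresh = new StubConsumer
        fresh.inUse = true
        cache.put(key, fresh)
        CachedHandle(fresh)
      case Some(c) if c.inUse =>
        NonCachedHandle(new StubConsumer)
      case Some(c) =>
        c.inUse = true
        CachedHandle(c)
    }
  }

  def cachedFor(key: String): Option[StubConsumer] = synchronized(cache.get(key))
}
```

Checking the reattempt and useCache cases before any cache write is what keeps a consumer requested with useCache = false out of the cache in this sketch.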

zsxwing · 2018-03-10T00:25:30Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

+      if (cachedIntConsumer != null) {
+        if (cachedIntConsumer.eq(intConsumer)) {
+          // The released consumer is indeed the cached one.
+          cache.remove(key)


We should remove it only when it's closed.

tedyu · 2018-03-10T21:47:53Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

-      if (removedConsumer != null) {
-        removedConsumer.close()
+      // If it has been marked for close, then do it any way
+      if (intConsumer.inuse && intConsumer.markedForClose) intConsumer.close()


Is it possible we have the following condition - should intConsumer.close() be called ?

!intConsumer.inuse && intConsumer.markedForClose

I rewrote the logic. Hopefully, it's simpler to reason about it now.
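
For readers following the release-side questions in these comments, here is one possible shape for a release method that covers every combination of cached/non-cached and markedForClose. It is a sketch under assumptions, not necessarily the rewritten logic: TrackedConsumer and ReleasingPool are hypothetical names, and the cache is keyed by a plain string.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the internal consumer; created "in use".
final class TrackedConsumer(val key: String) {
  var inUse: Boolean = true
  var markedForClose: Boolean = false
  var closed: Boolean = false
  def close(): Unit = closed = true
}

final class ReleasingPool {
  private val cache = mutable.HashMap.empty[String, TrackedConsumer]

  def put(c: TrackedConsumer): Unit = synchronized { cache.put(c.key, c) }
  def contains(key: String): Boolean = synchronized { cache.contains(key) }

  def release(c: TrackedConsumer): Unit = synchronized {
    if (cache.get(c.key).exists(_ eq c)) {
      // The released consumer is the very instance held in the cache.
      if (c.markedForClose) {
        // Remove it from the cache only at the moment it is closed.
        c.close()
        cache.remove(c.key)
      } else {
        c.inUse = false
      }
    } else {
      // Not the cached instance (or nothing cached for this key): just close it.
      c.close()
    }
  }
}
```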

tedyu · 2018-03-10T21:51:24Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

+        }
+      } else {
+        // Consumer is not cached, put the new one in the cache
+        cache.put(key, newInternalConsumer)


Should newInternalConsumer.inuse = true be called ?

yes. correct. thanks!

koeninger · 2018-03-11T02:21:31Z

Can you clarify why you want to allow only 1 cached consumer per topicpartition, closing any others at task end?

It seems like opening and closing consumers would be less efficient than allowing a pool of more than one consumer per topicpartition.

tdas · 2018-03-15T23:56:21Z

@koeninger good question Cody! I think we should fix this limitation eventually. The only reason I am not doing that in this PR is to keep the changes minimal for backporting to 2.3.x. Eventually, we should not do such cache management ourselves, and should rather use something like Apache Commons Pool.

tdas · 2018-03-16T00:51:43Z

@tedyu @zsxwing thank you very much for catching the bugs. I have simplified the logic quite a bit. Note that I removed the invariant that I had introduced earlier. Additionally, I locally ran the stress test with 100 threads and 10000 read attempts, which ran for 2 mins. It passed. Please review the logic once again.

SparkQA · 2018-03-16T01:21:38Z

Test build #88285 has finished for PR 20767 at commit 5363ea8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tedyu · 2018-03-16T02:29:36Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

      CachedKafkaDataConsumer(newInternalConsumer)

-    } else if (existingInternalConsumer.inuse) {
+    } else if (existingInternalConsumer.inUse) {
      // If consumer is already cached but is currently in use, then return a new consumer
      NonCachedKafkaDataConsumer(newInternalConsumer)


Maybe keep an internal counter for how many times the non cached consumer is created.
This would give us information on how effective the cache is
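
A counter like the one suggested here could be as small as the sketch below. CacheStats and its method names are hypothetical, and how the numbers would be exposed (logs, metrics) is deliberately left open.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical counters; thread-safe so concurrent tasks can update them.
object CacheStats {
  private val cachedHits = new AtomicLong(0)
  private val nonCachedCreated = new AtomicLong(0)

  def recordCachedHit(): Unit = cachedHits.incrementAndGet()
  def recordNonCachedCreated(): Unit = nonCachedCreated.incrementAndGet()

  // Fraction of acquisitions served by a cached consumer; 0.0 before any call.
  def hitRatio: Double = {
    val hits = cachedHits.get()
    val total = hits + nonCachedCreated.get()
    if (total == 0) 0.0 else hits.toDouble / total
  }
}
```

A low ratio would suggest that many requests arrive while the cached consumer is in use, which is the situation where one cached consumer per partition is too few.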

tdas · 2018-03-16T02:48:32Z

The idea is good. But how do you propose exposing that information? Periodic print out in the log? From a different angle, I would rather not do feature creep in this PR that is intended to be backported to 2.3.


zsxwing · 2018-03-16T18:10:37Z

@tdas this is much simpler!!! LGTM. Merging to master.

tedyu · 2018-03-16T18:17:56Z

@tdas
Do you think a follow-on JIRA can be logged for adding metrics for the cache operations?

Thanks

zsxwing · 2018-03-16T18:23:09Z

@tedyu that's a good idea. Could you create a ticket? Thanks!

tdas · 2018-03-16T21:47:23Z

@tedyu @zsxwing My thought on this is that we should consider migrating to something like Apache Commons Pool (assuming it does not require additional maven libraries), which might be less maintenance load. It could be that it already has metrics, etc. that we can leverage.

tedyu · 2018-03-16T21:58:01Z

I did a quick search for 'apache commons pool metrics', which didn't turn up any directly related links.

tdas · 2018-03-16T23:22:17Z

@tedyu It was indeed hard to find :) But apache commons pool does expose metrics on idle/active counts. See https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/BaseGenericObjectPool.html

tedyu · 2018-03-16T23:24:54Z

Interesting.

https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/BaseGenericObjectPool.html#getBorrowedCount()

tdas · 2018-03-16T23:25:10Z

@tedyu Just to be clear, I am not saying that we have to move to this pool stuff. I am just saying that if we want to make this more robust (as @koeninger suggested as well), then we should try to use existing tools (after careful evaluation), rather than trying to reinvent the wheel.

gaborgsomogyi · 2018-03-21T02:29:48Z

@tdas @zsxwing @koeninger @tedyu do you think it makes sense to take a similar step in the DStream area, and then follow up later with the mentioned Apache Commons Pool?

Fixed · 97510c6

tdas changed the title from "Fixed" to "[SPARK-23623] [SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer" · Mar 8, 2018

Removed println · 9e771b0

zsxwing requested changes · Mar 9, 2018

zsxwing reviewed · Mar 9, 2018

Fixed bugs · 0a838c1

brkyvz approved these changes · Mar 9, 2018

Updated docs · 37a9225

zsxwing requested changes · Mar 10, 2018

tedyu reviewed · Mar 10, 2018

Simplified logic · 5363ea8

tedyu reviewed · Mar 16, 2018

asfgit closed this in bd201bf · Mar 16, 2018

tdas mentioned this pull request · Mar 16, 2018: [SPARK-23623][SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer (branch-2.3) #20848 (Closed)

koeninger mentioned this pull request · Apr 10, 2018: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached consumers in CachedKafkaConsumer #20997 (Closed)

koeninger mentioned this pull request · Aug 20, 2018: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer #22138 (Closed)
