[SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer #22138
Conversation
Test build #94912 has finished for PR 22138 at commit
CachedKafkaDataConsumer(pool.borrowObject(key, kafkaParams))
} catch { case _: NoSuchElementException =>
// There's neither idle object to clean up nor available space in pool:
// fail back to create non-cached consumer
This approach introduces a behavior change: even though the cache had a capacity, it worked like a soft capacity and allowed adding an item to the cache when there was neither an idle object nor free space.
The new behavior of KafkaDataConsumer is to create non-cached consumers whenever the pool is exhausted and there's no idle object to free up.
I think this is not a big deal if we configure "spark.sql.kafkaConsumerCache.capacity" properly, and having a hard capacity makes it easier to determine what's going on.
However, we can still mimic the current behavior by configuring an infinite capacity, so we can go back to the current behavior if we feel it makes more sense.
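A minimal, self-contained sketch of the hard-capacity fallback described above, assuming a keyed pool whose borrowObject throws NoSuchElementException when it is exhausted. All type and method names here are illustrative stand-ins, not necessarily the ones in the patch:

```scala
import java.{util => ju}

object HardCapacityFallbackSketch {
  // Illustrative stand-ins; the real types in the patch differ.
  case class CacheKey(groupId: String, topic: String, partition: Int)
  class InternalKafkaConsumer(val key: CacheKey, val kafkaParams: ju.Map[String, Object])

  sealed trait KafkaDataConsumer
  case class CachedKafkaDataConsumer(c: InternalKafkaConsumer) extends KafkaDataConsumer
  case class NonCachedKafkaDataConsumer(c: InternalKafkaConsumer) extends KafkaDataConsumer

  trait ConsumerPool {
    // Assumed to throw NoSuchElementException when the pool is exhausted (hard capacity)
    // and there is no idle object that can be cleaned up to make room.
    def borrowObject(key: CacheKey, kafkaParams: ju.Map[String, Object]): InternalKafkaConsumer
  }

  def acquire(pool: ConsumerPool, key: CacheKey,
              kafkaParams: ju.Map[String, Object]): KafkaDataConsumer = {
    try {
      CachedKafkaDataConsumer(pool.borrowObject(key, kafkaParams))
    } catch {
      case _: NoSuchElementException =>
        // Fall back to a consumer that is closed after use rather than returned to the pool.
        NonCachedKafkaDataConsumer(new InternalKafkaConsumer(key, kafkaParams))
    }
  }
}
```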
Test build #94914 has finished for PR 22138 at commit
Test build #94913 has finished for PR 22138 at commit
If you have multiple consumers for a given key, and those consumers are at different offsets, isn't it likely that the client code will not get the right consumer, leading to extra seeking?
@koeninger
I'm not sure I got your point correctly. This patch is based on some assumptions, so please correct me if I'm missing something here. The assumptions are:
1. There are actually no multiple consumers for a given key working at the same time. The cache key contains the topic partition as well as the group id. Even if the query does a self-join and therefore reads the same topic in two different sources, I think the group id should be different.
2. In the normal case the offset will be continuous, and that's why the cache should help. In the retrying case this patch invalidates the cache the same as the current behavior, so it should start from scratch.
(Btw, I'm curious what's more expensive between leveraging a pooled object but resetting the Kafka consumer vs. invalidating pooled objects and starting from scratch. The latter feels safer, but if we just need an extra seek instead of reconnecting to Kafka, resetting could be improved and the former would be cheaper. I feel it is out of scope of my PR though.)
This patch keeps most of the current behaviors, except two spots I guess. I already commented on one spot explaining why I changed the behavior, and I'll comment on the other spot for the same reason.
} else if (existingInternalConsumer.inUse) {
// If consumer is already cached but is currently in use, then return a new consumer
NonCachedKafkaDataConsumer(newInternalConsumer)
// borrow a consumer from pool even in this case
This is another behavior change: if this attempt succeeds, we can use the pooled consumer from the next batch, so there is no reason to discard the consumer.
But I also see the cost of unnecessary pooling if failures occur continuously.
So it looks like a decision between the probability of success vs. the probability of failure (again). While I decided to cache it, it is pretty easy to go back to the current behavior, so please let me know if we think the current behavior makes more sense.
I thought the whole reason the caching was changed from the initial naive approach to the current approach in master was that people were running jobs that were scheduling multiple consumers for the same topicpartition and group.
If my understanding is right, it looks like the current approach has the same limitation. I guess you're busy, but could you refer to an issue number or point out the code lines that were based on that reason, if you remember any? It would help to determine whether this patch breaks more spots or not.
see e.g. #20767 for background
Even if this patch doesn't change behavior, if it doesn't really solve the problem, it may make it harder to solve it correctly.
I'm not sure, but are you saying that an executor handles multiple queries (multiple jobs) concurrently? I honestly didn't notice that. If that is going to be a problem, we should add something (could we get the query id at that time?) to the cache key to differentiate consumers. If we want to avoid extra seeking due to different offsets, consumers should not be reused among multiple queries, and that's just a matter of the cache key.

If you are thinking about co-use of consumers among multiple queries because of reusing the connection to Kafka, I think extra seeking is unavoidable (I guess fetched data would be the much more critical issue unless we never reuse a consumer after returning it to the pool). If seeking is a light operation, we may even go with only reusing the connection (not the position we already sought): always resetting the position (and data, maybe?) when borrowing from the pool or returning a consumer to the pool.

Btw, the rationale of this patch is not solving the issue you're referring to. This patch is also based on #20767 but deals with other improvements pointed out in the comments: adopt a pool library so we don't reinvent the wheel, and also enable metrics regarding the pool.

I'm not sure the issue you're referring to is a serious one (a show-stopper): if the issue were that serious, someone should have handled it once we became aware of it in March, or at least a relevant JIRA issue should have been filed with a detailed explanation. I'd like to ask you to handle (or file) the issue, since you may know it best.
Seeking means the pre-fetched data is wasted, so it's not a light operation. It shouldn't be unavoidable, e.g. if consumers were cached keyed by topicpartition, groupid, next offset to be processed. One concern there would be how to make sure you don't have lots of idle consumers.

The question of how serious an issue is could be solved by measurement, but I don't have production structured streaming jobs, much less ones that exhibit the kind of behavior tdas was talking about in the original ticket.
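For illustration only, the offset-aware key suggested above might look roughly like this (a hypothetical shape, not something this patch implements):

```scala
// Hypothetical cache key: consumers are looked up not only by group id and topic partition
// but also by the next offset they would read, so a borrower at offset N gets a consumer
// whose pre-fetched data is still valid.
case class OffsetAwareCacheKey(
    groupId: String,
    topic: String,
    partition: Int,
    nextOffsetToProcess: Long)
```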
@koeninger There's an evictor thread in Apache Commons Pool running in the background, and we could close consumers that have been idle for a long time (say 5 minutes or more). That's another benefit of adopting Apache Commons Pool (probably available in most general pool solutions): we could also eventually evict cached consumers whose topic or partition was removed while the query is running. A consumer is not only evicted because the cache exceeds its capacity, but also after a period of inactivity.
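As a rough illustration of how the Commons Pool 2 evictor and JMX metrics can be wired up (the intervals, the factory, and the pooled type here are illustrative, not the exact configuration in this patch):

```scala
import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool,
  GenericKeyedObjectPoolConfig}

// Illustrative placeholder for the pooled consumer type.
class PooledConsumer(val key: String)

class ConsumerObjectFactory extends BaseKeyedPooledObjectFactory[String, PooledConsumer] {
  override def create(key: String): PooledConsumer = new PooledConsumer(key)
  override def wrap(value: PooledConsumer): PooledObject[PooledConsumer] =
    new DefaultPooledObject(value)
}

object EvictorConfigSketch {
  def buildPool(): GenericKeyedObjectPool[String, PooledConsumer] = {
    val config = new GenericKeyedObjectPoolConfig[PooledConsumer]()
    // Run the background evictor every minute; close objects idle for 5 minutes or more.
    config.setTimeBetweenEvictionRunsMillis(60 * 1000L)
    config.setMinEvictableIdleTimeMillis(5 * 60 * 1000L)
    // Expose pool metrics (active/idle counts, borrow counts, ...) via JMX.
    config.setJmxEnabled(true)
    config.setJmxNamePrefix("kafka-consumer-pool")
    new GenericKeyedObjectPool[String, PooledConsumer](new ConsumerObjectFactory, config)
  }
}
```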
I just added eviction to the consumer pool, as well as a relevant test. This will help close invalid idle consumers whose topic or partition is no longer assigned to any task. I guess the current cache is not capable of closing such invalid consumers. I haven't found how to add a "query id" to the cache key, but IMHO the patch itself already provides enough value to be merged. It would be even better if someone could guide me on how to add the "query id" to the cache key. @koeninger @tdas @zsxwing Please take a look and comment. Thanks in advance!
Test build #95206 has finished for PR 22138 at commit
Force-pushed from 859a9d0 to ba576e8.
Rebased to resolve conflict.
Test build #95394 has finished for PR 22138 at commit
* Borrowed object must be returned by either calling returnObject or invalidateObject, otherwise
* the object will be kept in pool as active object.
*/
def borrowObject(key: CacheKey, kafkaParams: ju.Map[String, Object]): InternalKafkaConsumer = {
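For context, the borrow/return discipline described in the comment above can be illustrated against the underlying Commons Pool keyed-pool interface (a hedged caller-side sketch; the wrapper in this patch exposes its own API around it):

```scala
import org.apache.commons.pool2.KeyedObjectPool

object BorrowReturnSketch {
  // Borrowed objects must be handed back via returnObject on success or invalidateObject on
  // failure; otherwise the pool keeps counting them as active.
  def withPooled[K, V, T](pool: KeyedObjectPool[K, V], key: K)(body: V => T): T = {
    val obj = pool.borrowObject(key)
    try {
      val result = body(obj)
      pool.returnObject(key, obj)       // success: the object goes back to the pool as idle
      result
    } catch {
      case e: Throwable =>
        pool.invalidateObject(key, obj) // failure: destroy the object instead of reusing it
        throw e
    }
  }
}
```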
Why is kafkaParams passed as the second argument?
As I see it, CacheKey itself is constructed from kafkaParams, so wouldn't it be better to store kafkaParams in a val within CacheKey?
Then objectFactory.keyToKafkaParams could be deleted along with updateKafkaParamForKey, couldn't it?
It is to reduce unnecessary computation, and to avoid comparing the map while accessing the pool. You can see CacheKey is kept as it is, and I guess CacheKey was designed that way for the same reason.
Oh I see. What about changing CacheKey from a case class to a class where kafkaParams is a member but the equals and hashCode methods do not use kafkaParams?
As these values go together, I have the feeling that encapsulating them is better than keeping their relation in a separate map (keyToKafkaParams). It is just an idea.
That sounds good, but let's wait for the committers' opinions, since CacheKey was designed before and is not introduced in this patch.
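For reference, a rough sketch of what the suggested variant could look like (a hypothetical shape with illustrative fields, not the actual CacheKey in Spark):

```scala
import java.{util => ju}

// kafkaParams travels with the key, but equality and hashing ignore it, so pool lookups
// stay as cheap as with the original case class.
class CacheKeyWithParams(
    val groupId: String,
    val topic: String,
    val partition: Int,
    val kafkaParams: ju.Map[String, Object]) {

  override def equals(other: Any): Boolean = other match {
    case that: CacheKeyWithParams =>
      groupId == that.groupId && topic == that.topic && partition == that.partition
    case _ => false
  }

  override def hashCode(): Int =
    ju.Objects.hash(groupId, topic, Integer.valueOf(partition))
}
```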
// invalidate all idle consumers for the key
pool.clear(key)

pool.getNumActive()
Is this call (getNumActive) really necessary?
My bad. Will remove.
CONFIG_NAME_EVICTOR_THREAD_RUN_INTERVAL_MILLIS,
DEFAULT_VALUE_EVICTOR_THREAD_RUN_INTERVAL_MILLIS)

// NOTE: Below lines define the behavior of CachedInternalKafkaConsumerPool, so do not modify
I have not found the referenced CachedInternalKafkaConsumerPool. I guess you mean InternalKafkaConsumerPool, don't you?
Yeah, I changed the class name and missed replacing it with the new name. Will fix.
import PoolConfig._

val conf = SparkEnv.get.conf
val capacity = conf.getInt(CONFIG_NAME_CAPACITY, DEFAULT_VALUE_CAPACITY)
Is there a common place where we should document these configurations?
Hmm... I think spark.sql.kafkaConsumerCache.capacity wasn't documented anywhere, but that may be something worth fixing. The patch will provide more configurations for end users to tune, so maybe it's worth adding them to some docs? Not 100% sure where, though.
Thanks @attilapiros, I fixed the spots you left comments on.
Test build #95450 has finished for PR 22138 at commit
Test build #95472 has finished for PR 22138 at commit
I just applied a new approach: "separation of concerns". This approach does pooling for both Kafka consumers and fetched data. Both pools support eviction of idle objects, which will help close invalid idle objects whose topic or partition is no longer assigned to any task. It also enables applying different policies to each pool, which helps optimize pooling for each one.

We were concerned about multiple tasks pointing to the same topic partition with the same group id; the existing code can't handle this, hence excess seeks and fetches could happen. This approach properly handles that case. It also makes the code always safe to leverage the cache, so there is no need to maintain the reuseCache parameter anymore.

@koeninger @tdas @zsxwing @arunmahadevan
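A toy, self-contained sketch of that two-pool split (every name here is an illustrative stand-in, and the pool is a trivial map-backed stub rather than the Commons Pool based implementation in the patch):

```scala
import scala.collection.mutable

object TwoPoolSketch {
  case class Key(groupId: String, topic: String, partition: Int)

  // Trivial keyed pool: borrow an idle object for the key or create a new one.
  class SimpleKeyedPool[K, V](create: K => V) {
    private val idle = mutable.Map.empty[K, mutable.Queue[V]]
    def borrow(key: K): V = synchronized {
      idle.get(key).filter(_.nonEmpty).map(_.dequeue()).getOrElse(create(key))
    }
    def giveBack(key: K, value: V): Unit = synchronized {
      idle.getOrElseUpdate(key, mutable.Queue.empty) += value
    }
  }

  class FakeConsumer(val key: Key)                            // stand-in for a Kafka consumer
  class FakeFetchedData(val key: Key, val startOffset: Long)  // stand-in for prefetched records

  def main(args: Array[String]): Unit = {
    val consumerPool = new SimpleKeyedPool[Key, FakeConsumer]((k: Key) => new FakeConsumer(k))
    val fetchedDataPool = new SimpleKeyedPool[(Key, Long), FakeFetchedData](
      { case (k, off) => new FakeFetchedData(k, off) })

    // Two tasks reading the same topic partition with the same group id each borrow their
    // own consumer, and each picks up the fetched data matching its own offset, so they no
    // longer invalidate each other's position.
    val key = Key("group-1", "topic-A", 0)
    val c1 = consumerPool.borrow(key)
    val c2 = consumerPool.borrow(key)
    val d1 = fetchedDataPool.borrow((key, 100L))
    val d2 = fetchedDataPool.borrow((key, 5000L))

    consumerPool.giveBack(key, c1); consumerPool.giveBack(key, c2)
    // Fetched data is returned under the advanced offset: the second part of its key changes.
    fetchedDataPool.giveBack((key, 200L), d1); fetchedDataPool.giveBack((key, 5100L), d2)
  }
}
```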
Updated the description of the PR to reflect the new approach.
Test build #95474 has finished for PR 22138 at commit
From the stack trace of the test failure, it doesn't look relevant to the code change, I guess. Jenkins shows REGRESSION, but the test was added on Aug 25, 2018, so it has aged less than 7 days, which is not enough to consider it a regression, I think. The test itself could be flaky. The test succeeds 5 times in a row in local dev, and I'll try to run it more. @zsxwing Do you have any idea on this? I'm curious how checking the query exception after query.stop() is relevant to the referenced lines (Lines 270 to 274 in bb3e6ed).
If you think the additional check is not necessary after query.stop, I'm happy to provide a new minor patch for the fix. At least we may want to check the type of the exception and let the test fail only when the type of exception is relevant. Will retrigger the build for now. EDIT: From looking into the stack trace again, it doesn't seem to be due to checking the query exception. Still wondering why it happened, though.
retest this, please
Test build #95502 has finished for PR 22138 at commit
Two tests failed, and they are not relevant to this patch.
Will retrigger again.
Test build #109952 has finished for PR 22138 at commit
At first glance, it doesn't look related.
retest this please
Test build #109958 has finished for PR 22138 at commit
The last thing from my side is the doc update here: https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md#consumer-caching
Ah OK, I didn't get your previous comment. I guess the basic characteristics have been kept from the current consumer cache, but since some things changed slightly and a new feature is also available (JMX metrics), it sounds better to revise the doc. I'll try revising it. Thanks!
Test build #110025 has finished for PR 22138 at commit
Looks like an intermittent failure.
retest this, please
Test build #110026 has finished for PR 22138 at commit
The size of the pool is limited by <code>spark.kafka.consumer.cache.capacity</code>,
but it works as "soft-limit" to not block Spark tasks.

Idle eviction thread periodically removes some consumers which are not used. If this threshold is reached when borrowing,
Nit: s/removes some consumers/removes consumers
it tries to remove the least-used entry that is currently not in use.

If it cannot be removed, then the pool will keep growing. In the worst case, the pool will grow to
the max number of concurrent tasks that can run in the executor (that is, number of tasks slots).
Nit: s/tasks slots/task slots
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td>spark.kafka.consumer.fetchedData.cache.timeout</td>
Well, I'm not sure if I understand it. It's a timeout, but at the same time the description says "number of fetched data cached". So how should I interpret this?
My bad - just a copy-and-paste error. Will fix all missing things.
If it cannot be removed, then the cache will keep growing. In the worst case, the cache will grow to
the max number of concurrent tasks that can run in the executor (that is, number of tasks slots),
after which it will never reduce.

The following properties are available to configure the consumer pool:
Is there a reason why I don't see spark.kafka.consumer.cache.evictorThreadRunInterval here?
The same is missing. Will fix.
… <= 0 for fetched pool
Test build #110094 has finished for PR 22138 at commit
retest this please
LGTM (pending tests).
<td>64</td>
</tr>
<tr>
<td>spark.kafka.consumer.cache.timeout</td>
Super nit: the actual timeout + evictorThreadRunInterval combination may end up in ~6 minutes of total timeout. This shouldn't be a blocker, but maybe it would be better to use evictorThreadRunInterval=1m. The same applies in the other place.
Makes sense. The evictor thread doesn't check all idle objects per run (in Apache Commons Pool), so it's OK to have a shorter interval.
Test build #110113 has finished for PR 22138 at commit
Test build #110115 has finished for PR 22138 at commit
Merging to master.
Thanks all for patiently reviewing and finally merging! That was a long journey!
What changes were proposed in this pull request?
This patch does pooling for both Kafka consumers and fetched data. The overall benefits of the patch are the following:
Moreover, pooling Kafka consumers is implemented based on Apache Commons Pool, which also gives a couple of benefits:
FetchedData instances are pooled by a custom pool implementation instead of leveraging Apache Commons Pool, because they have CacheKey as the first key and the "desired offset" as the second key, and the "desired offset" keeps changing. I haven't found any general pool implementation supporting this.
This patch brings an additional dependency, Apache Commons Pool 2.6.0, into the spark-sql-kafka-0-10 module.

How was this patch tested?
Existing unit tests as well as new tests for the object pool.
Also did some experiments to exercise concurrent access of consumers for the same topic partition.
Ran the query with local[*] to maximize the chance to have concurrent access.
Checked the logs with:
grep "creating new Kafka consumer" logfile | wc -l
grep "fetching data from Kafka consumer" logfile | wc -l
Topic and data distribution is as follows:
The experiment only used the smallest 4 partitions (0, 1, 4, 6) from these partitions to finish the query earlier.
The result of the experiment is below: