
fix: fix the cache max bytes buffering check #7181

Merged

Conversation

rodesai
Contributor

@rodesai rodesai commented Mar 9, 2021

This patch fixes our validation of cache.max.bytes.buffering. We validate based on the set of live queries in the engine context. However, we weren't populating this set for sandboxes. This change fixes that by introducing a sandbox type for transient query metadata and filling in the live queries set with the sandboxed query metadata.
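
For illustration only, here is a minimal sketch of the kind of check this validation performs: summing cache.max.bytes.buffering across all live queries against a configured ceiling. The class and method names below (`CacheLimitValidator`, `totalCacheBytesLimit`, etc.) are hypothetical, not the actual ksqlDB identifiers; the point is just that the check can only be accurate if sandboxed transient queries are also registered in the live-query set, which is what this change ensures.

```java
import java.util.Collection;

// Hypothetical sketch: validate that the combined Streams cache size of all
// live queries (including sandboxed/transient ones) stays under a global limit.
public final class CacheLimitValidator {

  /** Minimal stand-in for a query's metadata; not the real ksqlDB type. */
  public interface QueryMetadata {
    long cacheMaxBytesBuffering();
  }

  private final long totalCacheBytesLimit;

  public CacheLimitValidator(final long totalCacheBytesLimit) {
    this.totalCacheBytesLimit = totalCacheBytesLimit;
  }

  /**
   * Throws if adding a new query with the requested cache size would exceed
   * the limit. The answer is only correct if {@code liveQueries} really
   * contains every running query -- which is exactly what the fix ensures
   * for sandboxed transient queries.
   */
  public void validateNewQuery(
      final Collection<QueryMetadata> liveQueries,
      final long requestedCacheBytes
  ) {
    final long inUse = liveQueries.stream()
        .mapToLong(QueryMetadata::cacheMaxBytesBuffering)
        .sum();
    if (inUse + requestedCacheBytes > totalCacheBytesLimit) {
      throw new IllegalStateException(
          "cache.max.bytes.buffering would exceed the configured limit: "
              + (inUse + requestedCacheBytes) + " > " + totalCacheBytesLimit);
    }
  }
}
```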

@rodesai rodesai requested a review from a team as a code owner March 9, 2021 08:22
Contributor

@guozhangwang guozhangwang left a comment

LGTM!

Contributor

@ableegoldman ableegoldman left a comment

This is totally orthogonal, but it made me think: since Walker's work on adding/removing threads, we now have the ability to resize the cache in Streams. It's not possible through a public API at the moment, but it could be with a pretty trivial KIP. I wonder if this would be useful to ksqlDB? That way we could give the initial queries as much memory as we have available, then give some of that memory up as new ones come in. Resizing the cache should be non-disruptive -- it doesn't require a rebalance or anything; the only effect would be to suddenly flush some portion of the LRU records.

I'm just thinking out loud here, I actually don't know what the current model is. How do we handle distributing cache memory in ksqlDB at the moment?

@guozhangwang
Contributor

That way we could give the initial queries as much memory as we have available, then give some of that memory up as new ones come in.

Not sure I understand: are you saying we give some threads within a single instance more memory than others?

@ableegoldman
Contributor

No, I mean we use some new KafkaStreams#resizeCacheMaxBuffering to make sure all instances have equal memory. To take CCloud as an example, we have one thread per query and up to 20 queries. And let's say we want to make 20GB of memory available for the Streams cache (probably a huge overestimate, just looking for round numbers). Initially the user only runs one persistent query, and we give it all 20GB. A week later they add 4 more persistent queries: now that we have 5 queries in total, each should get 4GB for the cache. We would first call KafkaStreams.resizeCacheMaxBuffering(4GB) to resize the original query, then launch the 4 new queries with 4GB. Similarly, if some queries are terminated, we resize the cache to give more memory to the remaining queries.

That way we're always utilizing the full available memory for caching without needing to change cache.max.bytes.buffering and restart
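
To make the arithmetic above concrete, here is a minimal sketch of evenly redistributing a fixed cache budget as queries come and go. KafkaStreams#resizeCacheMaxBuffering is not a public API today (that's the point of the proposed KIP), so the resize call goes through a hypothetical ResizableQuery interface rather than the real client; all names are illustrative.

```java
import java.util.List;

// Hypothetical sketch of evenly redistributing a fixed cache budget across
// queries whenever one is added or removed. The resize hook stands in for the
// not-yet-public Kafka Streams cache-resizing API discussed above.
public final class CacheBudget {

  /** Stand-in for a running query whose Streams cache can be resized. */
  public interface ResizableQuery {
    void resizeCacheMaxBuffering(long bytes);
  }

  private final long totalCacheBytes;

  public CacheBudget(final long totalCacheBytes) {
    this.totalCacheBytes = totalCacheBytes;
  }

  /** Cache bytes each query gets once {@code queryCount} queries are running. */
  public long perQueryBytes(final int queryCount) {
    return totalCacheBytes / Math.max(1, queryCount);
  }

  /**
   * Shrink (or grow) the caches of the already-running queries so that the
   * new total count shares the budget evenly, then return the per-query size
   * to use when launching the new queries.
   */
  public long rebalance(final List<ResizableQuery> running, final int newTotalCount) {
    final long share = perQueryBytes(newTotalCount);
    running.forEach(q -> q.resizeCacheMaxBuffering(share));
    return share;
  }
}
```

With a 20GB budget, `rebalance(running, 5)` resizes the existing queries to 4GB each and returns 4GB as the cache.max.bytes.buffering for the new ones, matching the example above.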

@rodesai
Contributor Author

rodesai commented Mar 10, 2021

Thanks for the idea @ableegoldman. Today we don't do anything special, and I imagine in most cases people just use the Streams default, or 0 if they don't want buffering. The intent with this change is to let you set a limit, so that if you have users who do set a high buffer size we don't blow up. I think your suggestion makes sense, though I'm not sure what the ROI is (e.g. how much adding buffer helps throughput in the typical case, and whether it's worth adding complexity to maximize the size).

That said, could we do something similar with the suppress buffer size? That's definitely worth exploring, because we need a way to bound the suppress buffer size, but it's generally better to have a bigger buffer: if you run out, either the query fails or you lose suppress semantics. If we go that route, we should also have a way to fail if reducing the suppress buffer size would force eviction.

Not to go off on another tangent, but as an alternative to tuning per-query sizes for these things, have we looked at letting you use a shared cache and suppress buffer across multiple streams apps?
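
For reference, a minimal Kafka Streams sketch of the two existing ways a suppress buffer can be bounded, which is the trade-off described above: shutDownWhenFull() gives the "query fails" behaviour, while emitEarlyWhenFull() keeps running but gives up strict suppression semantics. This is illustrative only and not part of this change; the 64MB figure and the class name are arbitrary.

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig;
import org.apache.kafka.streams.kstream.Windowed;

// Illustrative only: bounding a suppress buffer with the existing
// Kafka Streams API. windowedCounts is assumed to come from a windowed
// aggregation elsewhere in the topology.
public final class SuppressBufferExamples {

  // Strict suppression: if the 64MB buffer fills up, the application shuts
  // down rather than emitting early -- the "query fails" case.
  public static KTable<Windowed<String>, Long> strictlyBounded(
      final KTable<Windowed<String>, Long> windowedCounts) {
    return windowedCounts.suppress(Suppressed.untilWindowCloses(
        BufferConfig.maxBytes(64L * 1024 * 1024).shutDownWhenFull()));
  }

  // Eager suppression: when the buffer fills up, records are emitted early,
  // so the query keeps running but strict suppression semantics are lost.
  public static KTable<Windowed<String>, Long> eagerlyBounded(
      final KTable<Windowed<String>, Long> windowedCounts) {
    return windowedCounts.suppress(Suppressed.untilTimeLimit(
        Duration.ofMinutes(5),
        BufferConfig.maxBytes(64L * 1024 * 1024).emitEarlyWhenFull()));
  }
}
```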

@rodesai rodesai merged commit f383800 into confluentinc:master Mar 10, 2021
@guozhangwang
Contributor

That way we're always utilizing the full available memory for caching without needing to change cache.max.bytes.buffering and restart

Not to go off on another tangent, but as an alternative to tuning per-query sizes for these things have we looked at letting you use a shared cache and suppress buffer across multiple streams apps?

I think sharing memory space across instances within a single JVM is a plausible direction. Maybe we can file a ticket for it? :)

@ableegoldman
Contributor

FYI I filed https://issues.apache.org/jira/browse/KAFKA-12649 for the dynamic cache resizing. I think the LOE on the cache resizing is practically zero since we already implemented it as part of the add/removeStreamThread stuff. We would just need to do a trivial KIP and then a small PR to leverage it in ksqlDB.

We haven't already implemented this for suppress, since that buffer works a bit differently (i.e. it's not a single config that then gets distributed across threads), but we could obviously do something similar there. That said, we probably want/need spill-to-disk for suppression anyway, so this in-memory buffer resizing would just be a nice optimization, maybe as future work.

@ableegoldman
Contributor

Lol, in the 6 minutes since I filed that AK ticket, someone from the community has already picked it up. Depending on how quickly they move, we could get this into the next cloud release with maybe a day of work on our end.
