Why createBuffer is bad pt. 2 #7755
Labels: enhancement, P0, plugin-server
Another tale
A few weeks back, one of our customers reported that their plugin server was throwing a ton of errors and ingestion had stopped.
After a long time digging, I landed on the buffer. Again.
Previously we fixed an issue where Piscina would block the event loop and thus prevent us from handling callbacks from voided promises, causing timeouts on instances with relatively low volumes.
This time, the issue is with high volume instances.
We configure Piscina to handle up to 10 tasks per worker thread by default (the `TASKS_PER_WORKER` setting). This means that if your instance has one plugin running, and this is its code:
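(A sketch; `something()` stands in for whatever async work the plugin does, e.g. an HTTP call or a DB write.)

```js
async function something() {
    // stand-in for the plugin's async work (HTTP call, DB write, ...)
}

export async function processEvent(event) {
    // the promise is awaited, so the task isn't done until it settles;
    // with TASKS_PER_WORKER=10 that caps us at 10 in-flight promises per worker
    await something()
    return event
}
```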
You will only ever have 10 concurrent promises (from plugin code execution) that the worker needs to handle. Recap on the event loop: if you `await` something, the event loop will go around handling other errands (callbacks to close, mouths to feed, etc.) until it's time to process the callback.

However, say we made a slight change to our plugin:
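(For illustration: a change as small as dropping the `await`, so the promise from `something()` is voided.)

```js
export async function processEvent(event) {
    // something() is the same stand-in helper as above, but its promise is
    // no longer awaited ("voided"): the task finishes immediately while the
    // promise keeps running on the worker's event loop, and nothing caps
    // how many of these can pile up
    void something()
    return event
}
```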
The theoretical limit for concurrent promises is now ∞.
Now say you've also set up another plugin that does this:
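(Again a sketch; `somethingElse()` stands in for that plugin's own async work.)

```js
async function somethingElse() {
    // stand-in for this plugin's async work
}

export async function processEvent(event) {
    // this call *is* awaited, but its callback has to queue up behind all
    // the voided promises spawned by the plugin above
    await somethingElse()
    return event
}
```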
You're getting thousands of events per second, and `something()` has a few promises it itself triggers, taking a handful of seconds to execute. Now `somethingElse` gets called. The event loop says: "hey, I've got time to do other stuff" and goes around processing the callbacks triggered via the promises in `something`. The problem is, by the time it comes back to process the callback from `somethingElse`, 30 seconds have passed and asyncGuard times it out. Not only that, but memory usage will also grow unchecked.

The customer at hand has 19 projects, most running 2 buffer-based export plugins, plus every project has GeoIP and some other plugins here and there.
They have a decent amount of volume, meaning there are lots of voided promises being triggered every second. These also don't complete particularly fast anyway.
They managed to get temporarily healthy by reducing the number of active plugins, and the fix that finally got them stable was scaling up the number of cores on the machine running the plugin server.
Scaling the cores is particularly valuable because the number of threads we spawn by default is equal to the number of cores on the machine. With more threads, the work is better distributed and it's less likely that individual workers get clogged.
What now?
More investigation needs to be done into various aspects of the plugin server. One thing I want to revisit is historical exports and how they might also affect things here.
However, the first thing I want to do is move the buffer implementation used by `exportEvents` to a jobs-based mechanism. That way we can make sure we're processing these in a healthy way. It might be slower, but it will be safer. And as it pertains to `exportEvents`, it should be fine if we have a slight delay here - people don't need events in their warehouse ASAP.

We can still leave the buffer available in `plugin-contrib` if we like, but use it with more care.
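As a very rough illustration of the direction (a sketch only, not the actual implementation - `flushBatch`, `BATCH_LIMIT` and `sendBatchSomewhere` are made-up names), the same batching idea expressed on top of the existing jobs mechanism could look something like this, with the flush owned by the job queue instead of a voided promise:

```js
const BATCH_LIMIT = 100 // made-up flush threshold

export function setupPlugin({ global }) {
    global.batch = []
}

export async function onEvent(event, { global, jobs }) {
    global.batch.push(event)
    if (global.batch.length >= BATCH_LIMIT) {
        const batch = global.batch
        global.batch = []
        // the flush runs as a job, so the job queue owns its lifecycle,
        // retries and timeouts, rather than a voided promise on the event loop
        await jobs.flushBatch({ batch }).runNow()
    }
}

export const jobs = {
    flushBatch: async ({ batch }, _meta) => {
        await sendBatchSomewhere(batch)
    },
}

async function sendBatchSomewhere(batch) {
    // stand-in for the actual export target (S3, BigQuery, a warehouse, ...)
}
```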