Why createBuffer is bad pt. 2 #7755
Labels: enhancement, P0, plugin-server
Another tale
A few weeks back, one of our customers reported that their plugin server was throwing a ton of errors and ingestion had stopped.
After a long time digging, I landed on the buffer. Again.
Previously we fixed an issue where Piscina would block the event loop and thus prevent us from handling callbacks from voided promises, causing timeouts on instances with relatively low volumes.
This time, the issue is with high volume instances.
We configure Piscina to handle up to 10 tasks per worker thread by default (the `TASKS_PER_WORKER` setting). This means that if your instance has one plugin running, and this is its code:
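(A sketch; `something()` stands in for whatever async work the plugin does, e.g. an HTTP call or a DB write.)

```js
async function something() {
    // stand-in for the plugin's async work (HTTP call, DB write, ...)
}

export async function processEvent(event) {
    // the promise is awaited, so the task isn't done until it settles;
    // with TASKS_PER_WORKER=10 that caps us at 10 in-flight promises per worker
    await something()
    return event
}
```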
You will only ever have 10 concurrent promises (from plugin code execution) that the worker needs to handle. Recap on the event loop: if you `await` something, the event loop will go around handling other errands (callbacks to close, mouths to feed, etc.) until it's time to process the callback.

However, say we made a slight change to our plugin:
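(For illustration: a change as small as dropping the `await`, so the promise from `something()` is voided.)

```js
export async function processEvent(event) {
    // something() is the same stand-in helper as above, but its promise is
    // no longer awaited ("voided"): the task finishes immediately while the
    // promise keeps running on the worker's event loop, and nothing caps
    // how many of these can pile up
    void something()
    return event
}
```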
The theoretical limit for concurrent promises is now ∞.
Now say you've also set up another plugin that does this:
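(Again a sketch; `somethingElse()` stands in for that plugin's own async work.)

```js
async function somethingElse() {
    // stand-in for this plugin's async work
}

export async function processEvent(event) {
    // this call *is* awaited, but its callback has to queue up behind all
    // the voided promises spawned by the plugin above
    await somethingElse()
    return event
}
```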
You're getting thousands of events per second, and `something()` has a few promises it itself triggers, taking a handful of seconds to execute. Now `somethingElse` gets called. The event loop says: "hey, I've got time to do other stuff" and goes around processing the callbacks triggered via the promises in `something`. The problem is, by the time it comes back to process the callback from `somethingElse`, 30 seconds have passed and asyncGuard times it out. Not only that, but memory usage will also grow unchecked.

The customer at hand has 19 projects, most running 2 buffer-based export plugins, plus every project has GeoIP and some other plugins here and there.
They have a decent amount of volume, meaning there are lots of voided promises being triggered every second. These also don't complete particularly fast anyway.
They managed to get temporarily healthy by reducing the number of active plugins, and the fix that finally got them stable was scaling up the number of cores on the machine running the plugin server.
Scaling the cores is particularly valuable because the number of threads we spawn by default is equal to the number of cores on the machine. With more threads, the work is better distributed and it's less likely that individual workers get clogged.
What now?
More investigation needs to be done into various aspects of the plugin server. One thing I want to revisit is historical exports and how they might also affect things here.
However, the first thing I want to do is move the buffer implementation used by `exportEvents` to a jobs-based mechanism. That way we can make sure we're processing these in a healthy way. It might be slower, but it will be safer. And as it pertains to `exportEvents`, it should be fine if we have a slight delay here - people don't need events in their warehouse ASAP.

We can still leave the buffer available in `plugin-contrib` if we like, but use it with more care.
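As a very rough illustration of the direction (a sketch only, not the actual implementation - `flushBatch`, `BATCH_LIMIT` and `sendBatchSomewhere` are made-up names), the same batching idea expressed on top of the existing jobs mechanism could look something like this, with the flush owned by the job queue instead of a voided promise:

```js
const BATCH_LIMIT = 100 // made-up flush threshold

export function setupPlugin({ global }) {
    global.batch = []
}

export async function onEvent(event, { global, jobs }) {
    global.batch.push(event)
    if (global.batch.length >= BATCH_LIMIT) {
        const batch = global.batch
        global.batch = []
        // the flush runs as a job, so the job queue owns its lifecycle,
        // retries and timeouts, rather than a voided promise on the event loop
        await jobs.flushBatch({ batch }).runNow()
    }
}

export const jobs = {
    flushBatch: async ({ batch }, _meta) => {
        await sendBatchSomewhere(batch)
    },
}

async function sendBatchSomewhere(batch) {
    // stand-in for the actual export target (S3, BigQuery, a warehouse, ...)
}
```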