Why createBuffer is bad pt. 2 #7755

Closed
yakkomajuri opened this issue Dec 16, 2021 · 0 comments
Labels: enhancement (New feature or request), P0 (Critical, breaking issue: page crash, missing functionality), plugin-server

yakkomajuri commented Dec 16, 2021

Another tale

A few weeks back, one of our customers reported that their plugin server was throwing a ton of errors and ingestion had stopped.

After a long time digging, I landed on the buffer. Again.

Previously we fixed an issue where Piscina would block the event loop and thus prevent us from handling callbacks from voided promises, causing timeouts on instances with relatively low volumes.

This time, the issue is with high volume instances.

We configure Piscina to handle up to 10 tasks per worker thread by default (the TASKS_PER_WORKER setting).

This means, if your instance has one plugin running, and this is its code:

export async function onEvent() {
    await something()
}

With this code, a worker only ever has at most 10 concurrent promises (from plugin code execution) to handle. A quick recap on the event loop: if you await something, the event loop goes around handling other errands (callbacks to close, mouths to feed, etc.) until it's time to process the callback.

However, say we made a slight change to our plugin:

export function onEvent() {
    void something()
}

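The theoretical limit for concurrent promises is now ∞: onEvent returns immediately, so the task slot frees up before something() has settled, and the worker keeps accepting new tasks.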
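To make the difference concrete, here is a toy simulation of a single worker. It is a sketch under assumptions, not plugin-server code: time moves in discrete ticks, `simulate` and all of its parameters are made up for illustration, and the numbers (100 events per tick, `something()` taking 5 ticks) are arbitrary.

```typescript
// Toy model, not plugin-server code: time advances in discrete ticks.
type HandlerMode = 'await' | 'void'

interface SimulationResult {
    peakInFlight: number // most something() calls pending at the same time
    backlog: number // events still waiting for a task slot at the end
}

function simulate(
    mode: HandlerMode,
    eventsPerTick: number,
    ticks: number,
    tasksPerWorker: number,
    somethingDurationTicks: number
): SimulationResult {
    let backlog = 0
    let peakInFlight = 0
    // Each entry is the tick at which one pending something() call settles.
    let pending: number[] = []

    for (let tick = 0; tick < ticks; tick++) {
        backlog += eventsPerTick
        pending = pending.filter((settlesAt) => settlesAt > tick)

        // 'await': a task slot stays taken until its something() call settles.
        // 'void': onEvent returns at once, so a slot is never held.
        const freeSlots = mode === 'await' ? tasksPerWorker - pending.length : backlog
        const started = Math.min(backlog, freeSlots)
        for (let i = 0; i < started; i++) {
            pending.push(tick + somethingDurationTicks)
        }
        backlog -= started
        peakInFlight = Math.max(peakInFlight, pending.length)
    }
    return { peakInFlight, backlog }
}

const awaited = simulate('await', 100, 50, 10, 5)
const voided = simulate('void', 100, 50, 10, 5)
console.log(awaited) // { peakInFlight: 10, backlog: 4900 }
console.log(voided) // { peakInFlight: 500, backlog: 0 }
```

In this model the awaited version applies backpressure (events queue up outside the worker), while the voided version pushes all of the pending work into the worker itself, and that number scales with event volume rather than with the task limit.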

Now say you've also set up another plugin that does this:

export async function runEveryMinute() {
    await somethingElse()
}

You're getting thousands of events per second, and something() triggers a few promises of its own, taking a handful of seconds to complete.

Now somethingElse gets called. The event loop says: "hey, I've got time to do other stuff" and goes around processing the callbacks triggered via promises in something. Problem is, by the time it comes back to process the callback from somethingElse, 30 seconds have passed and asyncGuard times it out. Not only this, but memory usage will also grow unchecked.
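A rough back-of-envelope model shows how quickly this adds up. Everything here is an assumption for illustration: the function names are hypothetical, callbacks are treated as a simple FIFO queue (the real event loop has several phases), and the load numbers are invented rather than measured. Only the 30 second timeout comes from the description above.

```typescript
// Back-of-envelope model with made-up numbers, not measurements.
interface VoidedLoad {
    eventsPerSecond: number
    callbacksPerEvent: number // promise callbacks that one something() call triggers
    cpuMsPerCallback: number // synchronous work done by each callback
}

// Fraction of every wall-clock second spent running these callbacks.
// A value of 1 or more means the event loop can never catch up.
function eventLoopUtilization(load: VoidedLoad): number {
    return (load.eventsPerSecond * load.callbacksPerEvent * load.cpuMsPerCallback) / 1000
}

// Rough wait for a callback queued behind `pendingCallbacks` others,
// assuming they all run first (FIFO).
function queueDelayMs(pendingCallbacks: number, cpuMsPerCallback: number): number {
    return pendingCallbacks * cpuMsPerCallback
}

const TIMEOUT_MS = 30000 // the 30 seconds after which the task is timed out

const load: VoidedLoad = { eventsPerSecond: 2000, callbacksPerEvent: 3, cpuMsPerCallback: 0.25 }
console.log(eventLoopUtilization(load)) // 1.5
console.log(queueDelayMs(150000, 0.25) > TIMEOUT_MS) // true (37500 ms of queued work)
```

With a utilization above 1 the queue of pending callbacks only grows, which matches both symptoms: the somethingElse callback waits longer and longer, and the pending promises pile up in memory.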

The customer at hand has 19 projects, most running 2 buffer-based export plugins + every project has GeoIP + some other plugins here and there.

They have a decent amount of volume, meaning there are lots of voided promises being triggered every second. These also don't complete particularly fast anyway.

They managed to get temporarily healthy by reducing the number of active plugins, and finally the solution was:

"I’ve bumped the server to have 6 cores and 6 gigs of ram and its healthy again!"

Scaling the cores is particularly valuable because the number of threads we spawn by default is equal to the number of cores on the machine. This makes it so that the work is better distributed and it's less likely that workers get clogged.
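As a quick sanity check on why more cores help, here is the arithmetic as a sketch. It assumes one worker thread per core and perfectly even distribution of events across workers, which real scheduling will not guarantee; the helper names and the 3000 events per second figure are hypothetical.

```typescript
// Assumes one worker thread per core and an even split of work.
function maxConcurrentTasks(workerThreads: number, tasksPerWorker: number): number {
    return workerThreads * tasksPerWorker
}

function eventsPerSecondPerWorker(totalEventsPerSecond: number, workerThreads: number): number {
    return totalEventsPerSecond / workerThreads
}

console.log(maxConcurrentTasks(6, 10)) // 60 task slots with 6 cores and the default of 10
console.log(eventsPerSecondPerWorker(3000, 6)) // 500 events per second for each worker
```

Each extra thread gets its own event loop, so the voided promises from a given volume of events are spread over more loops instead of clogging one.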

What now?

More investigation needs to be done into various aspects of the plugin server. One thing I want to revisit is historical exports and how they might also affect things here.

However, the first thing I want to do is move the buffer implementation used by exportEvents to a jobs-based mechanism. That way we can make sure we're processing these in a healthy way. It might be slower, but it will be safer. And as it pertains to exportEvents, it should be fine if we have a slight delay here - people don't need events in their warehouse ASAP.
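To illustrate the idea of trading speed for safety, here is a minimal in-memory sketch of bounded processing. It is not the plugin server's jobs API and not the planned implementation: `BoundedJobRunner`, `Job`, and `sendBatch` are hypothetical names, and a real jobs-based mechanism would also need persistence and retries.

```typescript
type Job = () => Promise<void>

// Runs at most `concurrency` jobs at a time; the rest wait in a queue.
class BoundedJobRunner {
    private readonly concurrency: number
    private queue: Job[] = []
    private running = 0

    constructor(concurrency: number) {
        this.concurrency = concurrency
    }

    get inFlight(): number {
        return this.running
    }

    get queued(): number {
        return this.queue.length
    }

    enqueue(job: Job): void {
        this.queue.push(job)
        this.drain()
    }

    private drain(): void {
        while (this.running < this.concurrency && this.queue.length > 0) {
            const job = this.queue.shift() as Job
            this.running++
            // Free the slot and pick up the next job whether this one
            // succeeded or failed.
            const done = (): void => {
                this.running--
                this.drain()
            }
            job().then(done, (error: unknown) => {
                console.error('export job failed', error)
                done()
            })
        }
    }
}

// Hypothetical stand-in for whatever ships a batch to the warehouse.
const sendBatch = async (batch: string[]): Promise<void> => {
    console.log(`exported ${batch.length} events`)
}

const runner = new BoundedJobRunner(10)
runner.enqueue(() => sendBatch(['event-1', 'event-2']))
```

Unlike a voided promise, a queued job here does not start any work until a slot is free, so the number of in-flight exports stays capped no matter how many events arrive.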

We can still leave the buffer available in plugin-contrib if we like, but use it with more care.

yakkomajuri added the enhancement, P0, plugin-server, and team-platform labels on Dec 16, 2021 (the bug label was added and then removed).