Event ingestion slowdown under high load #8477
Comments
Any input is welcome @tiina303 @fuziontech @mariusandra
Ok, so I've been spending a lot of time investigating this, and I now have a better picture of what's going on.

What I've done

My conclusions

Ultimately I've reached the conclusion that we're going through "standard" scaling woes, where under high load the system starts to see issues and those end up trickling down across the different moving parts. Here are some findings:

Hourly spikes

One thing that initially caught my eye was how we had hourly spikes on a few metrics, particularly time spent on

This fixed the on-the-dot regular hourly spikes in GeoIP processing time, as well as correlated spikes in things like MMDB processing time, DB querying times, and Kafka batch processing times.

It's important to note that we must always be careful when pointing to GeoIP as the problem. Given it is a plugin that processes most of our events on Cloud, any slowdown in the workers or the main thread will very likely show up in the GeoIP metrics. That's not to say there aren't things around GeoIP to improve, like we saw with #8112.

VM setup

VMs take a long time to be set up:

This is expected, and the plugin server handles this reasonably well. However, when we're in a period of high load, say the daily 572 spike, setting up these VMs again can lead to us seeing timeouts in plugins, as well as backpressure increasing. This has a bit of a cascading effect, and it's a potential reason why we saw backpressure stay reasonably high for a while on Sunday and Monday but then come back down with little input from us. It would also explain why we may not see these every day around the spikes.

Remember that this was the issue with GeoIP that led to building #8112. But the issue doesn't go away with stateless plugins, as there are still many "stateful" plugins we need to set up. See here (deploy -> backpressure, deploy -> backpressure, deploy -> backpressure), and note that this gets worse the higher the load:

Atomics

Atomics might be another thing at play here. For those who didn't follow PostHog/plugin-server#487, atomics is a mechanism Piscina (our thread pool) uses to pick up tasks faster in the worker. It is faster than the alternative, but it blocks the event loop, deliberately. We decided a while back to add a timeout to how long we'd block the event loop for, in order to benefit from the speed of atomics while still allowing the background processing that Piscina does not recommend (we run a fork), but now it might be time to run without atomics. This is going to be attempted soon (see #8562). The suspicion is that we will be slower at picking up tasks, but we'll make up for that time by setting the event loop free to handle all the callbacks from voided promises in its backlog. (A rough sketch of this trade-off is included below.)

Buffer

Atomics or not, voided promises are still an inherent problem the plugin server faces. Beyond providing a buffer implementation in our

In a normal system, if that backlog grows significantly one might see slowdowns and even worse issues, but we see an additional thing: plugin timeouts. If a worker's backlog is too large, we might not handle all the callbacks from a buffer flush in time for the 30-second timeout we impose on plugins, which leads to plugin timeouts. (A sketch of this backlog/timeout interaction is also included below.) See #7755. Here are some graphs relating to this issue:

So what are we going to do about this?

Overall, the goal remains the same: improving robustness, reliability, and ease of management of the plugin server.

EDIT: The to-do list that was here now lives in the top comment.
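To make the atomics trade-off above a bit more concrete, here is a minimal sketch of a worker picking up tasks with a timed Atomics.wait and then yielding back to the event loop. This is illustrative only, not Piscina's or the plugin server's actual code, and the names taskFlags, TASK_FLAG_INDEX, and pollForTask are made up for the example:

```ts
// Minimal sketch of a timed Atomics.wait loop inside a worker thread.
// Illustrative only; Piscina's real implementation differs.
import { parentPort, workerData } from 'worker_threads'

// Assumed: an Int32Array over a SharedArrayBuffer shared with the main thread.
const taskFlags: Int32Array = workerData.taskFlags
const TASK_FLAG_INDEX = 0 // hypothetical slot the main thread bumps when a task is ready

function pollForTask(): void {
    // Block for at most 100ms waiting for the main thread to signal a new task.
    // A plain Atomics.wait with no timeout wakes up faster, but it freezes the
    // event loop, so voided-promise callbacks pile up unhandled.
    const result = Atomics.wait(taskFlags, TASK_FLAG_INDEX, 0, 100)

    if (result === 'ok' || result === 'not-equal') {
        // A task was signalled: reset the flag and report that we picked it up.
        Atomics.store(taskFlags, TASK_FLAG_INDEX, 0)
        parentPort?.postMessage({ type: 'task-picked-up' })
    }

    // Yield to the event loop before waiting again, so pending callbacks
    // (e.g. from voided promises in the backlog) get a chance to drain.
    setImmediate(pollForTask)
}

pollForTask()
```

Running without atomics (what #8562 is trying) would presumably replace the blocking wait with plain message passing, trading some task pick-up latency for an event loop that is never deliberately blocked.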
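Similarly, here is a hedged sketch of why a large backlog of voided promises interacts badly with a 30-second plugin timeout. None of these names (promiseBacklog, flushBacklog, runPluginWithTimeout, PLUGIN_TIMEOUT_MS) are the plugin server's real API; they just illustrate the mechanism described above:

```ts
// Sketch: tracking voided promises in a backlog and flushing them later.
// If the backlog is large, flushing it competes with plugin code for the event
// loop, and a plugin guarded by a 30s timeout can be declared "timed out" even
// though it was merely starved, not stuck.
const PLUGIN_TIMEOUT_MS = 30_000 // the 30-second plugin timeout mentioned above
const promiseBacklog = new Set<Promise<unknown>>()

// Instead of fully voiding a promise, keep a handle so it can be awaited later.
function trackVoidedPromise(promise: Promise<unknown>): void {
    promiseBacklog.add(promise)
    promise.finally(() => promiseBacklog.delete(promise)).catch(() => {})
}

// Flushing waits for everything still pending, e.g. before a buffer flush or shutdown.
async function flushBacklog(): Promise<void> {
    await Promise.allSettled([...promiseBacklog])
}

// A plugin call wrapped in a timeout: if draining the backlog (or anything else)
// hogs the event loop for too long, this race resolves to a timeout error.
async function runPluginWithTimeout<T>(pluginFn: () => Promise<T>): Promise<T> {
    let timer: NodeJS.Timeout | undefined
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('Plugin timed out')), PLUGIN_TIMEOUT_MS)
    })
    try {
        return await Promise.race([pluginFn(), timeout])
    } finally {
        if (timer) clearTimeout(timer)
    }
}
```

The point is that the timeout fires on wall-clock time, so a plugin that is merely starved by backlog processing looks identical to one that is genuinely stuck.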
Ok, so an update here is that the biggest culprit is indeed VM setups. Here's a look at our metrics during a period when @timgl updated one plugin 4 times (see around 13:00):

And as I was writing this I actually realized the issue. When we send the message via pubsub to reload plugins, we should only set up new VMs for plugins that changed (sketched below). However, I kept seeing in the metrics that we were always setting up all VMs again. As it turns out, we look for

Fixing this will solve the issue of reloading plugins while the server is up. But VM load times still affect fresh deploys. Things to do there are super lazy VMs and/or a healthcheck that checks for VMs being ready, so we don't kill pods/tasks before the new ones are fully ready.

cc @mariusandra as I had mentioned this to you on our call.
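To illustrate the "only set up VMs for plugins that changed" fix described above, here is a minimal sketch. This is not the plugin server's actual reload code; PluginConfig, setupPluginVM, and teardownPluginVM are assumptions made for the example, and updatedAt stands in for whatever change marker the real comparison would use:

```ts
// Sketch: on a pubsub "reload plugins" message, only tear down and rebuild VMs
// for plugin configs whose relevant fields actually changed.
interface PluginConfig {
    id: number
    pluginId: number
    updatedAt: string // assumed change marker; could equally be a hash of code + config
}

const loadedConfigs = new Map<number, PluginConfig>() // configs whose VMs are currently set up

async function reloadChangedPlugins(
    freshConfigs: PluginConfig[],
    setupPluginVM: (config: PluginConfig) => Promise<void>,
    teardownPluginVM: (configId: number) => Promise<void>
): Promise<void> {
    const freshById = new Map(freshConfigs.map((c) => [c.id, c] as const))

    // Tear down VMs for configs that no longer exist.
    for (const [id] of loadedConfigs) {
        if (!freshById.has(id)) {
            await teardownPluginVM(id)
            loadedConfigs.delete(id)
        }
    }

    // Set up VMs only for new or changed configs; untouched plugins keep their VMs,
    // so a single plugin update no longer rebuilds every VM under load.
    for (const config of freshConfigs) {
        const existing = loadedConfigs.get(config.id)
        if (!existing || existing.updatedAt !== config.updatedAt) {
            await setupPluginVM(config)
            loadedConfigs.set(config.id, config)
        }
    }
}
```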
A follow-up on the above is that things get extremely bad when reloading plugins, because workers might already be super busy. Thus, we get things like this:

cascading to this (it's a loop, the above feeds into the below and vice-versa):

and a lot of bad things come from lags like this during high-load periods.
Woop #8578

EDIT: Note to self - maybe we shouldn't even need to hit the pubsub channel on all updates. If only the tag columns have changed, let's just not do that on post_save. Maybe get rid of the post_save and do it manually? Will look at this next. ✅

EDIT 2: The load issues this approach brought are also likely to have led to
Not all tasks were covered, but we have split the plugin server and gotten to the bottom of the consumer issues - closing this!
Old context
Using this issue to document anything potentially related to the slowdowns we've seen under high load.
Some facts:
Superlist of potential issues
...as ridiculous as they may be!
onEvent from blocking the next batch of events?

EDIT: To do list
Managing load better
exportEvents instead of using the buffer directly and/or expose the new buffer to plugins

Monitoring
updated_at field but would be better to have a log, including who triggered it.