Event ingestion slowdown under high load #8477
Comments
Any input is welcome @tiina303 @fuziontech @mariusandra
Ok, so I've been spending a lot of time investigating this, and I now have a better picture of what's going on.

What I've done

My conclusions

Ultimately I've reached the conclusion that we're going through "standard" scaling woes, where under high load the system starts to see issues and those end up trickling down across the different moving parts. Here are some findings:

Hourly spikes

One thing that initially caught my eye was how we had hourly spikes on a few metrics, particularly time spent on

This fixed the on-the-dot regular hourly spikes in GeoIP processing time, as well as correlated spikes in things like MMDB processing time, DB querying times, and Kafka batch processing times.

It's important to note that we must always be careful when pointing to GeoIP as the problem. Given it is a plugin that processes most of our events on Cloud, any slowdown in the workers or the main thread will very likely show up in the GeoIP metrics. That's not to say there aren't things around GeoIP to improve, like we saw with #8112.

VM setup

VMs take a long time to be set up:

This is expected, and the plugin server handles this reasonably well. However, when we're in a period of high load, say the daily 572 spike, setting up these VMs again can lead to us seeing timeouts in plugins, as well as backpressure increasing. This has a bit of a cascading effect, and it's a potential reason why we saw backpressure stay reasonably high for a while on Sunday and Monday but then come back down with little input from us. It would also explain why we may not see these every day around the spikes.

Remember that this was the issue with GeoIP that led to building #8112. But the issue doesn't go away with stateless plugins, as there are still many "stateful" plugins we need to set up. See here (deploy -> backpressure, deploy -> backpressure, deploy -> backpressure), and note that this gets worse the higher the load:

Atomics

Atomics might be another thing at play here. For those who didn't follow PostHog/plugin-server#487, atomics is a mechanism Piscina (our thread pool) uses to pick up tasks faster in the worker. It is faster than the alternative, but it blocks the event loop, deliberately. We decided a while back to add a timeout to how long we'd block the event loop for, in order to benefit from the speed of atomics while still allowing the background processing that Piscina does not recommend (we run a fork), but now it might be time to run without atomics. This is going to be attempted soon (see #8562). The suspicion is that we will be slower at picking up tasks, but we'll make up for that time by setting the event loop free to handle all the callbacks from voided promises in its backlog. (A rough sketch of this trade-off is included below.)

Buffer

Atomics or not, voided promises are still an inherent problem the plugin server faces. Beyond providing a buffer implementation in our

In a normal system, if that backlog grows significantly one might see slowdowns and even worse issues, but we see an additional thing: plugin timeouts. If a worker's backlog is too large, we might not handle all the callbacks from a buffer flush in time for the 30-second timeout we impose on plugins, which leads to plugin timeouts. (A sketch of this backlog/timeout interaction is also included below.) See #7755. Here are some graphs relating to this issue:

So what are we going to do about this?

Overall, the goal remains the same: improving robustness, reliability, and ease of management of the plugin server.

EDIT: The to-do list that was here now lives in the top comment.
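To make the atomics trade-off above a bit more concrete, here is a minimal sketch of a worker picking up tasks with a timed Atomics.wait and then yielding back to the event loop. This is illustrative only, not Piscina's or the plugin server's actual code, and the names taskFlags, TASK_FLAG_INDEX, and pollForTask are made up for the example:

```ts
// Minimal sketch of a timed Atomics.wait loop inside a worker thread.
// Illustrative only; Piscina's real implementation differs.
import { parentPort, workerData } from 'worker_threads'

// Assumed: an Int32Array over a SharedArrayBuffer shared with the main thread.
const taskFlags: Int32Array = workerData.taskFlags
const TASK_FLAG_INDEX = 0 // hypothetical slot the main thread bumps when a task is ready

function pollForTask(): void {
    // Block for at most 100ms waiting for the main thread to signal a new task.
    // A plain Atomics.wait with no timeout wakes up faster, but it freezes the
    // event loop, so voided-promise callbacks pile up unhandled.
    const result = Atomics.wait(taskFlags, TASK_FLAG_INDEX, 0, 100)

    if (result === 'ok' || result === 'not-equal') {
        // A task was signalled: reset the flag and report that we picked it up.
        Atomics.store(taskFlags, TASK_FLAG_INDEX, 0)
        parentPort?.postMessage({ type: 'task-picked-up' })
    }

    // Yield to the event loop before waiting again, so pending callbacks
    // (e.g. from voided promises in the backlog) get a chance to drain.
    setImmediate(pollForTask)
}

pollForTask()
```

Running without atomics (what #8562 is trying) would presumably replace the blocking wait with plain message passing, trading some task pick-up latency for an event loop that is never deliberately blocked.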
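Similarly, here is a hedged sketch of why a large backlog of voided promises interacts badly with a 30-second plugin timeout. None of these names (promiseBacklog, flushBacklog, runPluginWithTimeout, PLUGIN_TIMEOUT_MS) are the plugin server's real API; they just illustrate the mechanism described above:

```ts
// Sketch: tracking voided promises in a backlog and flushing them later.
// If the backlog is large, flushing it competes with plugin code for the event
// loop, and a plugin guarded by a 30s timeout can be declared "timed out" even
// though it was merely starved, not stuck.
const PLUGIN_TIMEOUT_MS = 30_000 // the 30-second plugin timeout mentioned above
const promiseBacklog = new Set<Promise<unknown>>()

// Instead of fully voiding a promise, keep a handle so it can be awaited later.
function trackVoidedPromise(promise: Promise<unknown>): void {
    promiseBacklog.add(promise)
    promise.finally(() => promiseBacklog.delete(promise)).catch(() => {})
}

// Flushing waits for everything still pending, e.g. before a buffer flush or shutdown.
async function flushBacklog(): Promise<void> {
    await Promise.allSettled([...promiseBacklog])
}

// A plugin call wrapped in a timeout: if draining the backlog (or anything else)
// hogs the event loop for too long, this race resolves to a timeout error.
async function runPluginWithTimeout<T>(pluginFn: () => Promise<T>): Promise<T> {
    let timer: NodeJS.Timeout | undefined
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('Plugin timed out')), PLUGIN_TIMEOUT_MS)
    })
    try {
        return await Promise.race([pluginFn(), timeout])
    } finally {
        if (timer) clearTimeout(timer)
    }
}
```

The point is that the timeout fires on wall-clock time, so a plugin that is merely starved by backlog processing looks identical to one that is genuinely stuck.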
Ok, so an update here is that the biggest culprit is indeed VM setups. Here's a look at our metrics during a period when @timgl updated one plugin 4 times (see around 13:00):

And as I was writing this I actually realized the issue. When we send the message via pubsub to reload plugins, we should only set up new VMs for plugins that changed (sketched below). However, I kept seeing in the metrics that we were always setting up all VMs again. As it turns out, we look for

Fixing this will solve the issue of reloading plugins while the server is up. But VM load times still affect fresh deploys. Things to do there are super lazy VMs and/or a healthcheck that checks for VMs being ready, so we don't kill pods/tasks before the new ones are fully ready.

cc @mariusandra as I had mentioned this to you on our call.
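To illustrate the "only set up VMs for plugins that changed" fix described above, here is a minimal sketch. This is not the plugin server's actual reload code; PluginConfig, setupPluginVM, and teardownPluginVM are assumptions made for the example, and updatedAt stands in for whatever change marker the real comparison would use:

```ts
// Sketch: on a pubsub "reload plugins" message, only tear down and rebuild VMs
// for plugin configs whose relevant fields actually changed.
interface PluginConfig {
    id: number
    pluginId: number
    updatedAt: string // assumed change marker; could equally be a hash of code + config
}

const loadedConfigs = new Map<number, PluginConfig>() // configs whose VMs are currently set up

async function reloadChangedPlugins(
    freshConfigs: PluginConfig[],
    setupPluginVM: (config: PluginConfig) => Promise<void>,
    teardownPluginVM: (configId: number) => Promise<void>
): Promise<void> {
    const freshById = new Map(freshConfigs.map((c) => [c.id, c] as const))

    // Tear down VMs for configs that no longer exist.
    for (const [id] of loadedConfigs) {
        if (!freshById.has(id)) {
            await teardownPluginVM(id)
            loadedConfigs.delete(id)
        }
    }

    // Set up VMs only for new or changed configs; untouched plugins keep their VMs,
    // so a single plugin update no longer rebuilds every VM under load.
    for (const config of freshConfigs) {
        const existing = loadedConfigs.get(config.id)
        if (!existing || existing.updatedAt !== config.updatedAt) {
            await setupPluginVM(config)
            loadedConfigs.set(config.id, config)
        }
    }
}
```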
A follow-up on the above is that things get extremely bad when reloading plugins, because workers might already be super busy. Thus, we get things like this:

cascading to this (it's a loop, the above feeds into the below and vice-versa):

and a lot of bad things come from lags like this during high-load periods.
Woop #8578

EDIT: Note to self - maybe we shouldn't even need to hit the pubsub channel on all updates. If only the tag columns have changed, let's just not do that on post_save. Maybe get rid of the post_save and do it manually? Will look at this next. ✅

EDIT 2: The load issues this approach brought are also likely to have led to
Not all tasks were covered, but we have split the plugin server and gotten to the bottom of the consumer issues - closing this!
Old context
Using this issue to document anything potentially related to the slowdowns we've seen under high load.
Some facts:
Superlist of potential issues
...as ridiculous as they may be!
onEvent from blocking the next batch of events?

EDIT: To do list
Managing load better
exportEvents instead of using the buffer directly and/or expose the new buffer to plugins

Monitoring
updated_at field but would be better to have a log, including who triggered it.