Users stop syncing, postgres queries at high CPU #7618
Comments
After around 15 minutes, these postgres queries finish or get killed (not sure how to check which), and the server returns to normal.
I got a report that sync broke for at least one other user as well, whereas messages could still be sent by both of our accounts, and bridged by a local appservice.
I think this could very well be forward extremity related (#1760), as …
Well, that is suboptimal. Can you do me a favour and run … I'm wondering if this is related to #5064.
I will rerun it when everything fails, to see if there's a difference |
It just happened again. The highest extremity count was 4, and executing the query above still took only ~500ms, but again, that query was stuck being executed multiple times according to …
Hmm, that looks a bit odd. I would expect the query to be faster than that, but the query plan looks correct. Is this on a hard drive or an SSD? #7567 was recently merged, which might help performance a little there, but I wouldn't expect it to make a massive difference here.
Postgres is on a LUKS-encrypted RAID1 over two SSDs.
Thanks! For context: this is the query that fetches state for an event from the DB, which obviously happens really quite a lot throughout Synapse. Now, these queries are a bit heavyweight, but if you're seeing a lot of them stacking up then that is a problem. Some potential causes could be: a) your postgres instance has poorer performance than most (either due to hardware or the various perf tuning config options not being set correctly), b) something is causing Synapse to request more state than usual, c) your server is busy and the state caches are way too small, or d) these queries are actually backing up behind another resource-intensive query and are only showing up in the logs because they're relatively common. It might also be worth turning on slow query logging on your postgres instance to see if there is anything else going on, and maybe set up a postgres exporter for prometheus as well? I'm afraid at this point I don't have any suggestions beyond that :/
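For reference, slow query logging can be enabled from a psql session without editing postgresql.conf by hand; a minimal sketch, assuming superuser access and an illustrative 1-second threshold:

```sql
-- Log every statement that runs longer than 1000 ms (threshold is an
-- illustrative value; tune it to your workload):
ALTER SYSTEM SET log_min_duration_statement = 1000;

-- Reload the configuration so the change takes effect without a restart:
SELECT pg_reload_conf();
```

Matching entries then appear in the postgres log with their total duration, which should show whether the state-group queries are genuinely slow or merely queued behind something else.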
It seems that this issue can be reliably reproduced by doing an initial sync, for either @f0x52 or me. The moment I attempt one, Synapse falls over for 5+ minutes, also from the perspective of my Riot client which is doing normal incremental syncs.
I can second this. For a few months now initial syncs have been unbearably slow. With or without lazy-loading, it always seems to take more than 5 minutes, which causes my reverse proxy to fail the request and clients to retry until the sync completes. Postgres is basically spinning at 100% for the whole time, while Synapse doesn't even use that much CPU. I've also heard of others being affected by this, so it doesn't seem to be just me. For comparison, non-lazy-loading initial syncs used to take less than a minute before this regression, so perf degraded at least 5x. I didn't join that many rooms in that timeframe...
This has recently started affecting me too. |
Related to the follow-on discussion in #9182.
@anoadragon453 can you be clear about what info is needed? I'm not seeing it at a quick glance. I think this issue might be a duplicate of #5064 (which is hopefully now fixed)?
I think this is a duplicate of #5064, so hopefully this will be fixed in v1.39.0. If you don't want to wait, you can safely run the following on your DB manually: ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_distinct = -0.02);
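If applying that statistics override manually, a short sketch of the follow-up steps (the ANALYZE and the verification query are my additions, using standard PostgreSQL catalogs, not something from this thread):

```sql
-- The suggested per-column statistics override from the comment above:
ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_distinct = -0.02);

-- Recompute table statistics so the planner picks up the new estimate:
ANALYZE state_groups_state;

-- Confirm the override was stored on the column:
SELECT attoptions
FROM pg_attribute
WHERE attrelid = 'state_groups_state'::regclass
  AND attname = 'state_group';
```

The last query should show the override (e.g. an attoptions entry containing n_distinct) once it has been applied.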
Description
At seemingly random intervals, sync totally cuts out for my account, while other users on my homeserver are able to send and receive just fine.
During this, multiple postgres queries related to my account are pegged at very high CPU usage. Running SELECT datname,query FROM pg_stat_activity WHERE datname='synapse'; shows that most of these are related to state groups; the same query is being run 8 times.
Synapse logging does not seem to report anything on this.
A while after Riot drops my sync completely ("Connection to server lost"), performance seems to stall for other users too, most likely due to postgres being too busy.
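The pg_stat_activity query above can also be extended to show how long each of those queries has been running, which helps distinguish genuinely stuck queries from fast ones that merely recur; a sketch using standard pg_stat_activity columns:

```sql
-- Longest-running queries against the synapse database first,
-- with how long each has been executing:
SELECT pid,
       state,
       now() - query_start AS duration,
       query
FROM pg_stat_activity
WHERE datname = 'synapse'
ORDER BY duration DESC NULLS LAST;
```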
Version information
If not matrix.org:
Version: 1.14.0
Install method: pip