
batch outgoing RRs better #4730

Closed
richvdh opened this issue Feb 24, 2019 · 29 comments

richvdh commented Feb 24, 2019

I made a heinous bodge which delays all outgoing EDUs (including typing notifs) by 5 seconds. We need to do something better.

richvdh added a commit that referenced this issue Mar 1, 2019
In order to reduce the number of outgoing federation transactions, we want to
aggregate read-receipts, since there is little harm if outgoing RRs are delayed
by a few seconds, and we can dramatically reduce the number of transactions by
doing so.

This change introduces the concept of EDU 'buckets'; currently we have the
'Instant' bucket and the 'Delayed' bucket.

Fixes #4730, #3951.

richvdh commented Mar 6, 2019

I've been doing some analysis on RRs to determine how we can best handle them. The following analysis is based on a period of a few hours today, which is a fairly average time of day, but still with enough RR peaks to let the federation sender get a couple of minutes behind at times.

Firstly: the number of RR EDUs increases as O(N^2) with the number of members of the room (the number of people reading messages on matrix.org, and hence sending RRs, increases linearly; the number of servers in the room to which we have to send each RR also increases linearly). That means that the vast majority of RR EDUs are attributable to a minority of rooms. In the sample period, 36% of RR EDUs were for Matrix HQ, another 20% were for 'Synapse admins', and 80% of all RR EDUs were for just 11 rooms.
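As a back-of-envelope illustration of that fan-out (a minimal sketch using the Matrix HQ figures quoted at the end of this comment; the snippet itself is purely illustrative):

```python
# Every local read receipt has to be copied to every remote server in the room,
# which is where the O(N^2) comes from. Figures are the Matrix HQ numbers from
# this comment: 399 RRs in the sample period, ~667 destination servers.
local_rrs = 399
remote_servers = 667
print(local_rrs * remote_servers)  # ~266,000 individual RR EDU copies if unbatched
```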


Some datapoints about RR age: 16% arrive within 1 second of their corresponding event, and 37% within 5 seconds.


RR EDUs tend to be somewhat older on average: only 7% happen within 1 second of their event, and 28% within 5 seconds.


Presumably this means that RRs happen later in large rooms like Matrix HQ. Probably makes sense: people are more likely to read messages in one-to-one rooms immediately. It does suggest that we might want to consider using a different bucketing scheme for larger rooms.

In the sample period, there were 399 RRs for Matrix HQ, each of which would have been sent to around 667 servers. (This will include "offline" servers, and in practice some EDUs will be grouped into transactions if the remote server did not respond immediately).


richvdh commented Mar 6, 2019

Some analysis on different bucketing techniques as they would affect Matrix HQ:

  • No bucketing: 399 transmissions
  • Delay each RR by a fixed period to see if any others turn up:
  • Delay each RR by a factor of the event's age, with a cap of 30 seconds:
    • age*0.5 (ie, if the RR is 10 seconds after the event, wait 5 seconds to see if there are any more RRs): 228 transmissions
    • age*1: 191 transmissions
  • Fixed delays of 0, 5 seconds and 30 seconds, for RRs received <5s, 5-30s, >30s after their events: 265 transmissions.
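For concreteness, here's a minimal sketch of the age-factor scheme (the function name and defaults are illustrative, not what Synapse actually does):

```python
# Hold an outgoing RR for a fraction of its event's age, capped at 30s, to see
# whether more RRs for the room turn up in the meantime. (Sketch only; the
# factor/cap values correspond to the variants measured above.)
def rr_flush_delay(event_age_seconds, factor=0.5, cap=30.0):
    return min(event_age_seconds * factor, cap)

assert rr_flush_delay(10) == 5.0    # an RR arriving 10s after its event waits 5s
assert rr_flush_delay(120) == 30.0  # RRs for old events are capped at 30s
```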

Any other suggestions?

@erikjohnston

Basing it on a factor of the age feels like a good one, and half feels about right. I might be tempted to squish the low-age RRs a bit more so that things feel a bit more instant (e.g. (x - 1 + 1/(x + 1))/2)


ara4n commented Mar 6, 2019

399->244 and 399->228 seem to be relatively modest improvements, although I much prefer the idea of only aggressively batching RRs for older events.

However, I'm finding it hard to visualise the circumstances when RRs for older events even happen. Are they ones generated by clients as they catch up on scrollback? I'm a bit confused as to why we'd be sending these, as all clients I know of load the room at the bottom of the page, where they'd send an RR for the most recent event... and so RRs for historical events are irrelevant (other than for updating the RM, which is specific to the local user).


richvdh commented Mar 6, 2019

erik's suggestion of (x - 1s + 1/(x + 1s))/2, with a cap of 30s: 245 transmissions.


richvdh commented Mar 6, 2019

However, I'm finding it hard to visualise the circumstances when RRs for older events even happen. Are they ones generated by clients as they catch up on scrollback? I'm a bit confused as to why we'd be sending these, as all clients I know of load the room at the bottom of the page, where they'd send an RR for the most recent event... and so RRs for historical events are irrelevant (other than for updating the RM, which is specific to the local user).

Unless somebody happens to be watching the current timeline of a given room at the point that an event is sent, there will be a delay before they open the app, switch to the room, etc.


ara4n commented Mar 6, 2019

oh, of course. so the "old event" could still be the most recent one in the room... it's just that we don't rapidly show RRs as people catch up on the message if that message happens to be old? I'm a bit worried that this could still feel weird (especially if the message is an hour old, and it takes 30 minutes for the RR to be sent?!), given that from a user's POV the expectation is that idle users coming back online should immediately send an RR as they catch up (especially given we lack presence).


richvdh commented Mar 6, 2019

it's just that we don't rapidly show RRs as people catch up on the message if that message happens to be old?

I don't quite understand the question. I think you're describing my proposed behaviour.

(especially if the message is an hour old, and it takes 30 minutes for the RR to be sent?!)

This is why I'm suggesting capping at 30 seconds.


ara4n commented Mar 6, 2019

I don't quite understand the question.

I was repeating my understanding back to check i had it right, which it seems like i do.

Would an alternative approach be to simply throttle the rate at which we send RRs in a room? So if things are slow (e.g. idle users returning and catching up on scrollback), the RRs come in rapidly, acting as a nice proxy for presence. But if a conversation is in full swing and RRs are flying around everywhere, we do something like "don't send RRs more rapidly than 1 batch every N seconds, per room"? This would then give us "only throttle in a crisis" behaviour. Or is that what the original PR did? :S

@erikjohnston

erik's suggestion of (x - 1s + 1/(x + 1s))/2, with a cap of 30s: 245 transmissions.

Ooh, nice. FTR that basically shakes out to adding a 50% delay for older events (e.g. if it's taken you 15s to send an RR then we delay it by ~7s), but much less for recent events (e.g. if we send an RR within 2s then we'll only delay it by ~600ms, and one at 1s by ~250ms).
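For reference, a quick sketch of that curve, with x being the age of the event in seconds and the 30s cap being the variant measured above (illustrative only, not Synapse's actual code):

```python
# Erik's delay curve: roughly half the event's age for old events, but squished
# towards zero for very recent ones.
def erik_delay(x, cap=30.0):
    return min((x - 1 + 1 / (x + 1)) / 2, cap)

for age in (1, 2, 15, 120):
    print(age, round(erik_delay(age), 2))  # -> 0.25, 0.67, 7.03, 30.0
```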


ara4n commented Mar 6, 2019

presumably if we are delaying by 7s, in practice this means that it'll get sent sooner anyway if a less-delayed RR needs to get sent in the interim?

@erikjohnston

presumably if we are delaying by 7s, in practice this means that it'll get sent sooner anyway if a less-delayed RR needs to get sent in the interim?

Yup, or something else needs to get sent like a typing notification, events, etc.


richvdh commented Mar 6, 2019

Or is that what the original PR did? :S

It's not quite, because (a) the throttling in #4777 was independent of rooms; (b) it doesn't send anything out immediately.

So: Matthew's suggestion of throttling RRs per room to one batch every N seconds: 190 transmissions with N=30.


ara4n commented Mar 6, 2019

i was thinking a much lower value of N ftr - e.g. 3-5s


richvdh commented Mar 6, 2019

erik's suggestion of (x - 1s + 1/(x + 1s))/2, with a cap of 30s: 245 transmissions.

Ooh, nice.

well, not that nice imho. 245 is a bit of a marginal reduction.

@erikjohnston

Or is that what the original PR did? :S

It's not quite, because (a) the throttling in #4777 was independent of rooms; (b) it doesn't send anything out immediately.

So: Matthew's suggestion of throttling RRs per room to one batch every N seconds: 190 transmissions with N=30.

Doesn't that mean a second receipt sent in a room from a given server will potentially be delayed by up to 30s?


richvdh commented Mar 6, 2019

i was thinking a much lower value of N ftr - e.g. 3-5s

At N=3: 339 transmissions
At N=5: 313 transmissions


richvdh commented Mar 6, 2019

Doesn't that mean a second receipt sent in a room from a given server will potentially be delayed by up to 30s?

well, yes.


richvdh commented Mar 6, 2019

At N=3: 339 transmissions
At N=5: 313 transmissions

The point being that currently it's unusual for two RRs to land within 3s of each other, so this doesn't change much.


ara4n commented Mar 6, 2019

hum. what are we even trying to optimise for here? i thought the problem was that CPU was spiking/saturating when there's a lot of traffic, and we end up sending RRs too frequently because we don't batch them. Are we trying to reduce the total number of RRs, or just smooth things out to not muller the CPU? In which case perhaps 399->339 is okay, as long as they're spread out nicely no more rapidly than every 3s?

@erikjohnston

We're trying to reduce the frequency of transactions, hence the idea of waiting a bit to see if any more RRs come in before sending the transaction.

as long as they're spread out nicely no more rapidly than every 3s?

The problem is that for large rooms we need to send a transaction to each host in the room, so every 3s could easily mean 100Hz of outbound transactions.


ara4n commented Mar 6, 2019

okay, could we pick the "send RRs no more rapidly than every N seconds" value such that N ensures less than X Hz of outbound transactions? So small rooms don't get crippled, but big ones do?


richvdh commented Mar 6, 2019

In which case perhaps 399->339 is okay, as long as they're spread out nicely no more rapidly than every 3s?

Having thought about this a bit harder: yes, you're right that this does make a difference, because although the reduction is only ~15%, it happens at the most beneficial time. However, as erik says, it's not enough, because empirically we top out around 150Hz of transactions, and one transaction every 3 seconds to each of 600 servers is 200Hz.

okay, could we pick the "send RRs no more rapidly than every N seconds" value such that N ensures less than X Hz of outbound transactions? So small rooms don't get crippled, but big ones do?

yes, this definitely sounds like a possible strategy. Let's say you're aiming for 100Hz of transactions: when you get an RR in HQ, you realise it needs to go to 667 servers, so you say that you won't forward any more RRs for 667/100Hz = 6.67s, and any RRs which arrive in the meantime get queued up.

We might need to be careful about what happens if there are two rooms with activity at once, but this is currently rare.
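To make the idea concrete, a minimal sketch of that per-room rate limit (all names and the 100Hz target are illustrative, not the eventual implementation):

```python
import time
from collections import defaultdict

TARGET_HZ = 100.0  # illustrative budget for outbound transactions per second
_next_flush = defaultdict(float)  # room_id -> earliest monotonic time we may flush again

def can_flush_room(room_id, num_destinations, now=None):
    """Return True if this room's queued RRs may be flushed now."""
    now = time.monotonic() if now is None else now
    if now < _next_flush[room_id]:
        return False  # hold the RR; it will ride along with the next flush
    # e.g. Matrix HQ with ~667 destinations at a 100Hz budget => hold further RRs for ~6.7s
    _next_flush[room_id] = now + num_destinations / TARGET_HZ
    return True
```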


erikjohnston commented Mar 7, 2019

This does mean that for large rooms RRs will be significantly delayed, as the first RR will go out to all servers, and then it waits 7s before sending any more. That sounds like it's gonna be quite noticeable in a conversation tbh.

I'm also slightly wary that instead of defining this in terms of "this is an acceptable delay" which is fairly static, we now have a magic parameter that we'd need to keep tweaking as our load/usage patterns change (e.g. how many large rooms have concurrent conversations).


richvdh commented Mar 7, 2019

This does mean that for large rooms RRs will be significantly delayed, as the first RR will go out to all servers, and then it waits 7s before sending any more. That sounds like it's gonna be quite noticeable in a conversation tbh.

well right, but empirically going any faster than that means we start getting behind.

I'm also slightly wary that instead of defining this in terms of "this is an acceptable delay" which is fairly static, we now have a magic parameter that we'd need to keep tweaking as our load/usage patterns change (e.g. how many large rooms have concurrent conversations).

I suspect we'd end up tweaking the "acceptable delay" just as much.


ara4n commented Mar 8, 2019

We might need to be careful about what happens if there are two rooms with activity at once, but this is currently rare.

It might just be sufficient to maintain a queue and stack the batches of RRs to go out in series?


richvdh commented Mar 8, 2019

It might just be sufficient to maintain a queue and stack the batches of RRs to go out in series?

well, that's what happens at the moment. The problem is that everything else gets caught up in it. It's difficult to come up with a mechanism where RRs are queued separately to events without RRs getting completely starved out.


richvdh commented Mar 12, 2019

I think we need to try something here rather than continue to theorise about what might or might not feel ok.

I'm going to try Matthew's suggestion of rate-limiting RRs by room size. As Erik points out, that might make RRs feel unacceptably laggy, so we might need to consider tweaking it in the future.

richvdh added a commit that referenced this issue Mar 13, 2019
This is mostly a prerequisite for #4730, but also fits with the general theme
of "move everything off the master that we possibly can".
richvdh added a commit that referenced this issue Mar 19, 2019
richvdh added a commit that referenced this issue Mar 20, 2019

richvdh commented Mar 25, 2019

this has been deployed on matrix.org since 2019-03-21 14:14:49, ftr.
