Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster remote room joins: Support partial join re-syncing on workers other than the master #14544

Open
reivilibre opened this issue Nov 24, 2022 · 4 comments
Labels
A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@reivilibre
Copy link
Contributor

An enhancement of: #12994 (worker-mode support for Faster Remote Room Joins).

Instead of relying on the master to perform the re-syncing of the rooms, we should allow other workers to be involved.
Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.
We should be mindful that in a hypothetical deployment, workers can be taken out of service — a room shouldn't be locked to one worker forever in case this happens, as that would mean the re-sync would never progress.

Aside: in future we should consider moving the /send_join request out of the master process. The obvious candidate is the "client reader" that receives the client-side /join request (and hence currently makes the request to ReplicationRemoteJoinRestServlet). The main thing to worry about then is locking (to ensure that we don't have multiple workers all trying to do the remote-join dance at once). For prior art in that department, we should look at the code that handles incoming events received over federation (https://github.com/matrix-org/synapse/blob/v1.69.0rc2/synapse/federation/federation_server.py#L1108-L1116), which uses a database row to hold a lock: we can simply call try_acquire_lock before starting a resync operation.

That still leaves us with the problem of making sure we resume the partial-state resync if the client reader that is currently processing it gets restarted (or, worse, turned off, never to return). Again following the example of incoming events: in that case, we kick off a processing job as soon as a worker discovers itself to be a "federation inbound" worker by receiving a /send request. Probably we could do the same here on a /_matrix/client/v3/rooms/.*/(send|join|invite|leave|ban|unban|kick) request?
#12994 (comment)

@reivilibre reivilibre added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Nov 24, 2022
@DMRobertson
Copy link
Contributor

Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.

Can we piggyback off the sharding logic used for event persisters? (Is that sharded by room id?)

@reivilibre
Copy link
Contributor Author

To some extent, but that means having a definitive list of workers which are nominated for the job. That's very possibly fine! (But just noting a consideration.)

@squahtx squahtx added the A-Federated-Join joins over federation generally suck label Nov 30, 2022
@erikjohnston
Copy link
Member

We can use the cross-worker locking stuff that we implemented for handling inbound federation:

async def try_acquire_lock(self, lock_name: str, lock_key: str) -> Optional["Lock"]:

@erikjohnston
Copy link
Member

I think sharding the partial join stuff isn't something we need to worry about now TBH. We have a bunch of much busier streams that aren't sharded?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

No branches or pull requests

4 participants