Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync race cache invalidation fixes part 2 #14156

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/14156.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix race conditions with room membership in sync when using synapse workers. Contributed by Nick @ Beeper (@fizzadar).
39 changes: 25 additions & 14 deletions synapse/handlers/sync.py
Original file line number Diff line number Diff line change
Expand Up @@ -1323,9 +1323,13 @@ async def generate_sync_result(
# See https://github.com/matrix-org/matrix-doc/issues/1144
raise NotImplementedError()

# If we have no since token (init sync), ensure any cached rooms for the user
# is first invalidated to avoid race conditions with invalidation-over-replication.
if not since_token:
self.store.get_rooms_for_user.invalidate((user_id,))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really hesitant to do this without fully understanding where the inconsistencies are coming in, as per #14154 (comment). In particular, it's not obvious to me that invalidating just this cache won't lead to really weird inconsistencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this specific change wasn't based on an issue we saw more expected races similar to the other changes here. Probably makes sense to undo this commit, although the invalidating of get_users_in_room probably presents similar inconsistencies. Let's discuss in the issue and come back to this.


# Note: we get the users room list *before* we get the current token, this
# avoids checking back in history if rooms are joined after the token is fetched.
token_before_rooms = self.event_sources.get_current_token()
mutable_joined_room_ids = set(await self.store.get_rooms_for_user(user_id))

# NB: The now_token gets changed by some of the generate_sync_* methods,
Expand All @@ -1339,6 +1343,10 @@ async def generate_sync_result(
# during which membership events may have been persisted, so we fetch these now
# and modify the joined room list for any changes between the get_rooms_for_user
# call and the get_current_token call.

# NB: due to race conditions in cache invalidation-over-replication we cannot
# assume that the get_rooms_for_user is up to date, looking at the membership
# events allows us to correct for this.
membership_change_events = []
if since_token:
membership_change_events = await self.store.get_membership_changes_for_user(
Expand All @@ -1354,28 +1362,31 @@ async def generate_sync_result(
# latest change is JOIN.

for room_id, event in mem_last_change_by_room_id.items():
assert event.internal_metadata.stream_ordering
if (
event.internal_metadata.stream_ordering
< token_before_rooms.room_key.stream
):
# The user left the room, or left and was re-invited but not joined yet
if event.membership != Membership.JOIN:
mutable_joined_room_ids.discard(room_id)
continue

# Joined but we already know about it? Nothing to do here, this will bypass
# most membership events in any gappy syncs as get_rooms_for_user will already
# be up to date or close to it.
if room_id in mutable_joined_room_ids:
continue

logger.info(
"User membership change between getting rooms and current token: %s %s %s",
"Checking membership change between since token and current token: %s %s %s",
user_id,
event.membership,
room_id,
)
# User joined a room - we have to then check the room state to ensure we
# respect any bans if there's a race between the join and ban events.
if event.membership == Membership.JOIN:
user_ids_in_room = await self.store.get_users_in_room(room_id)
if user_id in user_ids_in_room:
mutable_joined_room_ids.add(room_id)
# The user left the room, or left and was re-invited but not joined yet
else:
mutable_joined_room_ids.discard(room_id)
# NB: we invalidate the cache here to avoid a race condition between
# cache invalidation-over-replication and sync requests.
self.store.get_users_in_room.invalidate((room_id,))
user_ids_in_room = await self.store.get_users_in_room(room_id)
if user_id in user_ids_in_room:
mutable_joined_room_ids.add(room_id)

# Now we have our list of joined room IDs, exclude as configured and freeze
joined_room_ids = frozenset(
Expand Down