Fix join being denied after being invited over federation #18075
Reproduction test for element-hq/synapse#18075
This reverts commit a32c1ba.
This is just normal for how someone finds out about an invite over federation. See #18075 (comment)
synapse/events/builder.py
Outdated
def is_mine_id(self, string: str) -> bool:
    """Determines whether a user ID or room alias originates from this homeserver.

    Returns:
        `True` if the hostname part of the user ID or room alias matches this
        homeserver.
        `False` otherwise, or if the user ID or room alias is malformed.
    """
    localpart_hostname = string.split(":", 1)
    if len(localpart_hostname) < 2:
        return False
    return localpart_hostname[1] == self._hostname
Normally, we would access hs.is_mine_id(...) but we don't have easy access to hs here. Better way than to duplicate this?
Ideally we'd have a read-only reference to hs passed to this class - but I'm not sure how easy that is.
The duplication here isn't the end of the world - given it's a small function.
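For readers who want to see what the duplicated check does in isolation, here is a minimal sketch. The `HostnameMatcher` class and its constructor are hypothetical stand-ins (the real method lives on a Synapse class that already has `self._hostname`); only the body of `is_mine_id` is taken from the diff above.

```python
class HostnameMatcher:
    """Hypothetical holder for the homeserver name, for illustration only."""

    def __init__(self, hostname: str) -> None:
        self._hostname = hostname

    def is_mine_id(self, string: str) -> bool:
        # Split on the first ":" only, so a port in the server name
        # (e.g. "hs2:8448") stays part of the hostname half.
        localpart_hostname = string.split(":", 1)
        if len(localpart_hostname) < 2:
            # No ":" at all means the ID is malformed.
            return False
        return localpart_hostname[1] == self._hostname


matcher = HostnameMatcher("hs2")
local_user = matcher.is_mine_id("@charlie1:hs2")
remote_user = matcher.is_mine_id("@alice:hs1")
malformed = matcher.is_mine_id("not-an-id")
```

Because the comparison is a plain string match on everything after the first colon, the sketch treats `hs2` and `hs2:8448` as different servers.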
# After persistence, we always need to notify replication there may be new
# data (backfilled or not) because TODO.
self._notifier.notify_replication()
For non-backfilled events, we already call _notify_persisted_event (just below) -> on_new_room_events -> notify_new_room_events -> notify_replication
Essentially, I want to fill in the context here: We never notify clients about backfilled events but it's important to let all the workers know about any new event (backfilled or not) because TODO
"...they may need to act on that event type"?
One example is facilitating the Synapse Module API; where a module could be loaded on to any worker. A module may want to act on certain types of backfilled events arriving.
One example is facilitating the Synapse Module API; where a module could be loaded on to any worker. A module may want to act on certain types of backfilled events arriving.
@anoadragon453 Are you sure about this? I'm pretty sure the Synapse module on_new_event hook doesn't get called for backfilled events. At least that's my assumption in the Synapse module I've been working on and I even have asserts for it that don't get triggered (which are stressed by some Complement tests doing federation things).
But now I'm no longer confident in that assumption.
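The notification flow discussed in this thread can be sketched as a toy model. `ToyNotifier` and `after_persist` are hypothetical names, not Synapse's classes; the sketch only mirrors the call chain described above, where replication is always notified and clients are only notified for non-backfilled events.

```python
from typing import List


class ToyNotifier:
    """Hypothetical stand-in that records which notifications fire."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def notify_replication(self) -> None:
        self.calls.append("notify_replication")

    def notify_new_room_events(self) -> None:
        # Client-facing notification; per the thread, this path also ends
        # in notify_replication.
        self.calls.append("notify_new_room_events")
        self.notify_replication()


def after_persist(notifier: ToyNotifier, backfilled: bool) -> None:
    # Always tell the other workers there may be new data.
    notifier.notify_replication()
    # Clients are never notified about backfilled events.
    if not backfilled:
        notifier.notify_new_room_events()


backfill_notifier = ToyNotifier()
after_persist(backfill_notifier, backfilled=True)

live_notifier = ToyNotifier()
after_persist(live_notifier, backfilled=False)
```

In this model a non-backfilled event triggers `notify_replication` twice, which matches the observation that the non-backfilled path already reached it before this change.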
logger = logging.getLogger(__name__)


class DeviceListResyncTestCase(unittest.HomeserverTestCase):
I've moved these tests from tests/test_federation.py
But seems plausible
    state_map: StateMap[EventBase]


class OutOfBandMembershipTests(unittest.FederatingHomeserverTestCase):
These are the main tests that stress the scenarios that this PR is trying to fix. I decided to add these trial tests (in addition to the Complement tests) because we don't seem to have any real coverage for the remote invite scenarios within the Synapse codebase.
These trial tests are also much faster to run and allow us to exactly control how the remote server is interacting with us (and timing).
We're now better at rejecting this
from synapse.util.retryutils import NotRetryingDestination

from tests import unittest
Moved all of these tests to something under tests/federation/...
Great work on all the testing here, and for discovering the root cause in the first place. I filed a spec issue off the back of it.
Some small comments below, but otherwise this looks good.
def make_homeserver(self, reactor: MemoryReactor, clock: Clock) -> HomeServer:
    self.federation_http_client = Mock(
        # spec=MatrixFederationHttpClient
Unintentional comment?
I've updated it with this comment:
# The problem with using `spec=MatrixFederationHttpClient` here is that it
# requires everything to be mocked which is a lot of work that I don't want
# to do when the code only uses a few methods (`get_json` and `put_json`).
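As a rough illustration of the trade-off in that comment, the sketch below builds an un-specced `Mock` and stubs only the two methods the code under test is said to use. The destination, path and response bodies are made-up values, and the call signature shown is an assumption for illustration rather than the real `MatrixFederationHttpClient` API.

```python
import asyncio
from unittest.mock import AsyncMock, Mock

# An un-specced Mock: only the methods we care about get real behaviour.
federation_http_client = Mock()
federation_http_client.get_json = AsyncMock(return_value={"ok": True})
federation_http_client.put_json = AsyncMock(return_value={})


async def fetch_something() -> dict:
    # Hypothetical caller; destination and path are illustrative only.
    return await federation_http_client.get_json(
        "hs1", "/_matrix/federation/v1/version"
    )


result = asyncio.run(fetch_something())
```

The cost of skipping `spec=` is that a typo in a method name silently returns another `Mock` instead of raising `AttributeError`, so assertions on the stubbed methods carry more of the weight.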
Fix join being denied after being invited over federation. This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an outlier and then rely on federation filling us in to de-outlier it.
This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like auto_accept_invites that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging.
I initially thought this might be a Synapse consistency issue (see issues labeled with Z-Read-After-Write) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag).
Fix #15012 (probably fixes matrix-org/synapse#15012 (#15012))
Related to matrix-org/matrix-spec#2062
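The mismatch described above can be pictured with a deliberately simplified model. Everything here is hypothetical (`ToyRoomView` and its methods are not Synapse code, and the real event auth rules are far richer); it only assumes that `/sync` reads a local membership table that out-of-band invites populate immediately, while the join check consults room state that is only updated once the invite is de-outliered.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyRoomView:
    """Toy model of what one homeserver knows about a single room."""

    # Updated as soon as an out-of-band invite arrives; read by /sync.
    local_current_membership: Dict[str, str] = field(default_factory=dict)
    # Updated only once the invite event has been de-outliered.
    state_membership: Dict[str, str] = field(default_factory=dict)

    def receive_out_of_band_invite(self, user_id: str) -> None:
        self.local_current_membership[user_id] = "invite"

    def de_outlier_invite(self, user_id: str) -> None:
        self.state_membership[user_id] = "invite"

    def sync_shows_invite(self, user_id: str) -> bool:
        return self.local_current_membership.get(user_id) == "invite"

    def local_join_allowed(self, user_id: str) -> bool:
        # Simplified invite-only rule that looks at room state alone.
        return self.state_membership.get(user_id) in ("invite", "join")


view = ToyRoomView()
view.receive_out_of_band_invite("@charlie2:hs2")
seen_in_sync = view.sync_shows_invite("@charlie2:hs2")
join_before = view.local_join_allowed("@charlie2:hs2")

view.de_outlier_invite("@charlie2:hs2")
join_after = view.local_join_allowed("@charlie2:hs2")
```

In the window between the first and second step, the user can see the invite but a state-only check would refuse the join, which is the race a fast client or `auto_accept_invites` can hit.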
Problems:
event_auth logic even though we expose them in /sync.
What happened before?
I wrote a Complement test that stresses this exact scenario and reproduces the problem: matrix-org/complement#757
We have hs1 and hs2 running in monolith mode (no workers):
@charlie1:hs2 is invited and joins the room:
- hs1 invites @charlie1:hs2 to a room which we receive on hs2 as PUT /_matrix/federation/v1/invite/{roomId}/{eventId} (on_invite_request(...)) and the invite membership is persisted as an outlier. The room_memberships and local_current_membership database tables are also updated which means they are visible down /sync at this point.
- @charlie1:hs2 decides to join because it saw the invite down /sync. Because hs2 is not yet in the room, this happens as a remote join make_join/send_join which comes back with all of the auth events needed to auth successfully and now @charlie1:hs2 is successfully joined to the room.
@charlie2:hs2 is invited and tries to join the room:
- hs1 invites @charlie2:hs2 to the room which we receive on hs2 as PUT /_matrix/federation/v1/invite/{roomId}/{eventId} (on_invite_request(...)) and the invite membership is persisted as an outlier. The room_memberships and local_current_membership database tables are also updated which means they are visible down /sync at this point.
- hs2 is already participating in the room, so we also see the invite come over federation in a transaction and we start processing it (not done yet, see below)
- @charlie2:hs2 decides to join because it saw the invite down /sync. Because hs2 is already in the room, this happens as a local join but we deny the event because our event_auth logic thinks that we have no membership in the room ❌ (expected to be able to join because we saw the invite down /sync)
- @charlie2:hs2 invite event from and de-outlier it.
Logs for hs2:
Dev notes
Other unrelated but semi-related races:
- send_join races with local users sending messages #17720
Running tests