Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Track when the pulled event signature fails #13815

Merged
1 change: 1 addition & 0 deletions changelog.d/13815.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Keep track when an event pulled over federation fails its signature check so we can intelligently back-off in the future.
25 changes: 22 additions & 3 deletions synapse/federation/federation_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import TYPE_CHECKING
from typing import TYPE_CHECKING, Awaitable, Callable, Optional

from synapse.api.constants import MAX_DEPTH, EventContentFields, EventTypes, Membership
from synapse.api.errors import Codes, SynapseError
Expand Down Expand Up @@ -58,7 +58,12 @@ def __init__(self, hs: "HomeServer"):

@trace
async def _check_sigs_and_hash(
self, room_version: RoomVersion, pdu: EventBase
self,
room_version: RoomVersion,
pdu: EventBase,
record_failure_callback: Optional[
Callable[[EventBase, str], Awaitable[None]]
] = None,
Comment on lines +64 to +66
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, see #13815 (comment) for why we're using a callback pattern here.

) -> EventBase:
"""Checks that event is correctly signed by the sending server.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to see a description of at least record_failure_callback — I think the other 2 are fairly self-evident so not fussed about those particularly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done in a later commit; in future please consider collapsing those into the same commit so that commit by commit review is possible for mere mortals like me who can't track the full thing ;-))

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR's are definitely not commit by commit review friendly. I find working that way in the first place is impossible.

And rebasing and splitting multiple changes in the same file is a super chore. I'd rather split stuff by PR which feel free to call out if you see something to split out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can make them a bit more CbC friendly, I would really appreciate it to make reviews less daunting if anything. (Maybe I'm the only one though? Definitely not a requirement, but just a preference from me especially when I'm not familiar with the area of code)

Since you seem to have the idea that it's difficult to do: My technique for things like this is to use lazygit (a TUI git frontend but with good mouse support too), which lets you stage changes line-by-line and easily amend (shift+A) them into earlier commits. It also makes it easy to reorder commits etc. I think some other tools support this kind of thing as well, but lazygit is the best I've found for easily crafting the commits into a meaningful arrangement.
I think you may be referring to git's interactive rebasing and interactive staging; you can definitely do this a lot more efficiently than that :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I find splitting by commits especially useful to avoid conflating 'style' changes (e.g. moving methods, renaming methods) and semantic changes. I think asking you to put these in separate PRs would be too much, though and it would certainly hurt the latency of review.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the team thoughts around rebasing? Anyone annoyed/worried about references changing after they already pulled the branch?

If you're just looking for some of the low hanging fruit of pulling out some style changes or merging some subsequent changes back to the original chunk, that seems a lot more do-able. If we're talking about building up iteratively commit by commit, I have a lot more apprehension to put effort into doing that.

I assume you don't want me to rebase things together from different review cycles so you can easily review the things that have changed.

What about merge commits? I find it a lot harder to rebase around merge commits and would prefer to just rebase my branch than merge develop if I started doing this. But then the rebase would rewrite all of the history including stuff you already reviewed. Do you have a good way to see what's new even though I rebased on the latest develop? (does that even matter to you)


A note around rebased history: I really like the GitLab diff versioning so even if you rebase, you can still reference old diff. I used to use this all the time over there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note around rebased history: I really like the GitLab diff versioning so even if you rebase, you can still reference old diff. I used to use this all the time over there.

I have similar experiences with GL and reviewboard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the team thoughts around rebasing? Anyone annoyed/worried about references changing after they already pulled the branch?

I believe the rough consensus is: "don't rebase after someone has started to review the changes", because doing so resets the "I have viewed this file" in Github's review UI.


Expand All @@ -70,6 +75,11 @@ async def _check_sigs_and_hash(
Args:
room_version: The room version of the PDU
pdu: the event to be checked
record_failure_callback: A callback to run whenever the given event
fails signature or hash checks. This includes exceptions
that would be normally be thrown/raised but also things like
checking for event tampering where we just return the redacted
event.

Returns:
* the original event if the checks pass
Expand All @@ -80,7 +90,12 @@ async def _check_sigs_and_hash(
InvalidEventSignatureError if the signature check failed. Nothing
will be logged in this case.
"""
await _check_sigs_on_pdu(self.keyring, room_version, pdu)
try:
await _check_sigs_on_pdu(self.keyring, room_version, pdu)
except Exception as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This except clause seems too broad; I'd worry it might swallow errors arising from bugs rather than solely invalid signatures.

Can we narrow down the interesting types of Exceptions that we care about here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. this will catch CancelledErrors which sounds dodgy (I don't know if cancellation is enabled for backfill, but it doesn't sound like a good reason to record a failed pull attempt)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we can just worry about InvalidEventSignatureError here 👍

But all of the downstream places in _check_sigs_on_pdu do the same except Exception as e: 🤷

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we decide here, we should also apply to #13814

Is there a way we can make the errors we care about opt-out instead of opt-in? It seems like it's going to suck to keep some of these error lists up to date whenever the downstream functions change and raise different errors.

Maybe it's better to be opt-in to handle just what you know but as an example at the following spot, we probably need to look at FederationError, RuntimeError, InvalidResponseError from a quick glance at some of the downstream code and this list seems like it will morph under us since there is so much going on in that block. Is there a good way to get a general sense of all exception types that a code path can raise (doesn't need to be comprehensive, maybe info just from docstrings)?

except Exception as exc:
await self._store.record_event_failed_pull_attempt(
event.room_id, event_id, str(exc)
)
raise exc

(created a draft PR to track this point, #13969, and will update with what we think about here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine opt-out is easier — I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Tracking the set of exceptions that some code can throw is difficult in Python, unfortunately.

I would suggest trying some cases of where you want to track a failed pull attempt, perhaps even adding tests that reproduce these cases. I can think of at least these classes of problems:

  • DNS errors
  • TCP-level Connection errors
  • TLS problems such as untrusted self-signed certs
  • Homeserver not on federation whitelist (? what should we do here?)
  • Bad response code
  • Response isn't JSON
  • JSON response isn't in an expected format (e.g. doesn't include the right keys)... bearing in mind that our validation is probably rubbish and this might currently manifest itself as a KeyError (IMO this should be fixed so it sounds more like a validation error)

I realise this sounds a bit painful but we ought to figure out what we should do for these cases. Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

Perhaps, rather than accepting that the code underneath is always going to be messy, try and improve the downstream situation — can we define superclasses for some of these interesting groups of exceptions a e.g. TemporaryNetworkError exception class and make relevant exceptions derived from those or get re-raised as those?
We can then add to the docstrings of the downstream functions what exceptions they're allowed to raise for these kinds of circumstances.

Just some thoughts, realise it may not necessarily be easy..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Fair enough on RuntimeError and friends, I saw this case and during my quick scan and seemed relevant (it's not) although if it was, would be a good candidate to make it throw the proper error type 👍

I would suggest trying some cases of where you want to track a failed pull attempt, perhaps even adding tests that reproduce these cases

Having these tests would be great. We'd want to apply them to /event/{eventId}, /event_auth/{roomId}/{eventId}, /state/{roomId}?event_id=$xxx, /state_ids/{roomId}?event_id=$xxx (although related and probably covered by the existing backoff and NotRetryingDestination in the next point below). For /backfill and /get_missing_events, we can't track server/network problems because there is no specific event we're requesting but we can track whether the events in their responses are rejected, etc.

Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

In general, if anything goes wrong with trying to pull an event from another server, we should backoff and do it in the background next time (#13623).

The existing backoff underlying all of the federation requests (matrixfederationclient.py and NotRetryingDestination) seems like it could be sufficient for network related errors. I haven't experienced network problems being the cause of /messages being slow so I haven't put as much thought in this area.

Homeserver not on federation whitelist (? what should we do here?)

Seems like we shouldn't be asking homeservers if they aren't on our whitelist in the first place. I guess we just let FederationDeniedError handle this and continue to the next destination which accomplishes this though.


Thanks for the thoughtful response on this @reivilibre! Really helping me form the right thoughts around all of this!

if record_failure_callback:
await record_failure_callback(pdu, str(exc))
raise exc

if not check_event_content_hash(pdu):
# let's try to distinguish between failures because the event was
Expand Down Expand Up @@ -116,6 +131,10 @@ async def _check_sigs_and_hash(
"event_id": pdu.event_id,
}
)
if record_failure_callback:
await record_failure_callback(
pdu, "Event content has been tampered with"
)
return redacted_event

spam_check = await self.spam_checker.check_event_for_spam(pdu)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we will potentially record a failed attempt when the spam checker soft-fails an event. Part of #13700

Expand Down
50 changes: 40 additions & 10 deletions synapse/federation/federation_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -278,7 +278,7 @@ async def backfill(
pdus = [event_from_pdu_json(p, room_version) for p in transaction_data_pdus]

# Check signatures and hash of pdus, removing any from the list that fail checks
pdus[:] = await self._check_sigs_and_hash_and_fetch(
pdus[:] = await self._check_sigs_and_hash_for_pulled_events_and_fetch(
dest, pdus, room_version=room_version
)

Expand Down Expand Up @@ -328,7 +328,17 @@ async def get_pdu_from_destination_raw(

# Check signatures are correct.
try:
signed_pdu = await self._check_sigs_and_hash(room_version, pdu)

async def _record_failure_callback(
event: EventBase, cause: str
) -> None:
await self.store.record_event_failed_pull_attempt(
event.room_id, event.event_id, cause
)

signed_pdu = await self._check_sigs_and_hash(
room_version, pdu, _record_failure_callback
)
except InvalidEventSignatureError as e:
errmsg = f"event id {pdu.event_id}: {e}"
logger.warning("%s", errmsg)
Expand Down Expand Up @@ -547,24 +557,28 @@ async def get_room_state(
len(auth_event_map),
)

valid_auth_events = await self._check_sigs_and_hash_and_fetch(
valid_auth_events = await self._check_sigs_and_hash_for_pulled_events_and_fetch(
destination, auth_event_map.values(), room_version
)

valid_state_events = await self._check_sigs_and_hash_and_fetch(
destination, state_event_map.values(), room_version
valid_state_events = (
await self._check_sigs_and_hash_for_pulled_events_and_fetch(
destination, state_event_map.values(), room_version
)
)

return valid_state_events, valid_auth_events

@trace
async def _check_sigs_and_hash_and_fetch(
async def _check_sigs_and_hash_for_pulled_events_and_fetch(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this so we stick to using this only for pulled event scenarios

self,
origin: str,
pdus: Collection[EventBase],
room_version: RoomVersion,
) -> List[EventBase]:
"""Checks the signatures and hashes of a list of events.
"""
Checks the signatures and hashes of a list of pulled events we got from
federation and records any signature failures as failed pull attempts.
If a PDU fails its signature check then we check if we have it in
the database, and if not then request it from the sender's server (if that
Expand Down Expand Up @@ -597,11 +611,17 @@ async def _check_sigs_and_hash_and_fetch(

valid_pdus: List[EventBase] = []

async def _record_failure_callback(event: EventBase, cause: str) -> None:
await self.store.record_event_failed_pull_attempt(
event.room_id, event.event_id, cause
)

async def _execute(pdu: EventBase) -> None:
valid_pdu = await self._check_sigs_and_hash_and_fetch_one(
pdu=pdu,
origin=origin,
room_version=room_version,
record_failure_callback=_record_failure_callback,
)

if valid_pdu:
Expand All @@ -618,6 +638,9 @@ async def _check_sigs_and_hash_and_fetch_one(
pdu: EventBase,
origin: str,
room_version: RoomVersion,
record_failure_callback: Optional[
Callable[[EventBase, str], Awaitable[None]]
] = None,
) -> Optional[EventBase]:
"""Takes a PDU and checks its signatures and hashes.
Expand All @@ -634,14 +657,21 @@ async def _check_sigs_and_hash_and_fetch_one(
origin
pdu
room_version
record_failure_callback: A callback to run whenever the given event
fails signature or hash checks. This includes exceptions
that would be normally be thrown/raised but also things like
checking for event tampering where we just return the redacted
event.
Returns:
The PDU (possibly redacted) if it has valid signatures and hashes.
None if no valid copy could be found.
"""

try:
return await self._check_sigs_and_hash(room_version, pdu)
return await self._check_sigs_and_hash(
room_version, pdu, record_failure_callback
)
except InvalidEventSignatureError as e:
logger.warning(
"Signature on retrieved event %s was invalid (%s). "
Expand Down Expand Up @@ -694,7 +724,7 @@ async def get_event_auth(

auth_chain = [event_from_pdu_json(p, room_version) for p in res["auth_chain"]]

signed_auth = await self._check_sigs_and_hash_and_fetch(
signed_auth = await self._check_sigs_and_hash_for_pulled_events_and_fetch(
destination, auth_chain, room_version=room_version
)

Expand Down Expand Up @@ -1401,7 +1431,7 @@ async def get_missing_events(
event_from_pdu_json(e, room_version) for e in content.get("events", [])
]

signed_events = await self._check_sigs_and_hash_and_fetch(
signed_events = await self._check_sigs_and_hash_for_pulled_events_and_fetch(
destination, events, room_version=room_version
)
except HttpResponseException as e:
Expand Down
75 changes: 75 additions & 0 deletions tests/federation/test_federation_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,14 +23,23 @@

from synapse.api.room_versions import RoomVersions
from synapse.events import EventBase
from synapse.rest import admin
from synapse.rest.client import login, room
from synapse.server import HomeServer
from synapse.types import JsonDict
from synapse.util import Clock

from tests.test_utils import event_injection
from tests.unittest import FederatingHomeserverTestCase


class FederationClientTest(FederatingHomeserverTestCase):
servlets = [
admin.register_servlets,
room.register_servlets,
login.register_servlets,
]

def prepare(self, reactor: MemoryReactor, clock: Clock, homeserver: HomeServer):
super().prepare(reactor, clock, homeserver)

Expand Down Expand Up @@ -231,6 +240,72 @@ def _get_pdu_once(self) -> EventBase:

return remote_pdu

def test_backfill_invalid_signature_records_failed_pull_attempts(
self,
) -> None:
"""
Test to make sure that events from /backfill with invalid signatures get
recorded as failed pull attempts.
"""
OTHER_USER = f"@user:{self.OTHER_SERVER_NAME}"
main_store = self.hs.get_datastores().main

# Create the room
user_id = self.register_user("kermit", "test")
tok = self.login("kermit", "test")
room_id = self.helper.create_room_as(room_creator=user_id, tok=tok)

# We purposely don't run `add_hashes_and_signatures_from_other_server`
# over this because we want the signature check to fail.
pulled_event, _ = self.get_success(
event_injection.create_event(
self.hs,
room_id=room_id,
sender=OTHER_USER,
type="test_event_type",
content={"body": "garply"},
)
)

# We expect an outbound request to /backfill, so stub that out
self._mock_agent.request.side_effect = lambda *args, **kwargs: defer.succeed(
_mock_response(
{
"origin": "yet.another.server",
"origin_server_ts": 900,
# Mimic the other server returning our new `pulled_event`
"pdus": [pulled_event.get_pdu_json()],
}
)
)

self.get_success(
self.hs.get_federation_client().backfill(
# We use "yet.another.server" instead of
# `self.OTHER_SERVER_NAME` because we want to see the behavior
# from `_check_sigs_and_hash_and_fetch_one` where it tries to
# fetch the PDU again from the origin server if the signature
# fails. Just want to make sure that the failure is counted from
# both code paths.
dest="yet.another.server",
room_id=room_id,
limit=1,
extremities=[pulled_event.event_id],
),
)

# Make sure our failed pull attempt was recorded
backfill_num_attempts = self.get_success(
main_store.db_pool.simple_select_one_onecol(
table="event_failed_pull_attempts",
keyvalues={"event_id": pulled_event.event_id},
retcol="num_attempts",
)
)
# This is 2 because it failed once from `self.OTHER_SERVER_NAME` and the
# other from "yet.another.server"
self.assertEqual(backfill_num_attempts, 2)


def _mock_response(resp: JsonDict):
body = json.dumps(resp).encode("utf-8")
Expand Down
4 changes: 2 additions & 2 deletions tests/test_federation.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,8 +86,8 @@ async def _check_event_auth(origin, event, context):

federation_event_handler._check_event_auth = _check_event_auth
self.client = self.homeserver.get_federation_client()
self.client._check_sigs_and_hash_and_fetch = lambda dest, pdus, **k: succeed(
pdus
self.client._check_sigs_and_hash_for_pulled_events_and_fetch = (
lambda dest, pdus, **k: succeed(pdus)
)

# Send the join, it should return None (which is not an error)
Expand Down