-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Refactor MSC3030 /timestamp_to_event
to move away from our snowflake pull from destination
pattern
#14096
Refactor MSC3030 /timestamp_to_event
to move away from our snowflake pull from destination
pattern
#14096
Conversation
e755f41
to
82535ff
Compare
can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
test flakeoutlier
outlier
outlier
63489a9
to
af3fecc
Compare
except NotRetryingDestination as e: | ||
logger.info(f"get_pdu(event_id={event_id}): {e}") | ||
continue | ||
except FederationDeniedError: | ||
logger.info( | ||
"Failed to get PDU %s from %s because %s", | ||
event_id, | ||
destination, | ||
e, | ||
f"get_pdu(event_id={event_id}): Not attempting to fetch PDU from {destination} because the homeserver is not on our federation whitelist" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistently handling these two errors across get_pdu
, backfill, and the generic _try_destination_list
which is used for many places we use this pattern.
# Loop through each homeserver candidate until we get a succesful response | ||
timestamp_to_event_response = await self._try_destination_list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this logic from synapse/handlers/room.py
and refactored to use the generic _try_destination_list
which handles this pattern of looping over destinations
af3fecc
to
7ea5136
Compare
@@ -352,11 +366,11 @@ async def _record_failure_callback( | |||
@tag_args | |||
async def get_pdu( | |||
self, | |||
destinations: Iterable[str], | |||
destinations: Collection[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed that we iterate over destinations
twice so if it was actually use an Iterable
anywhere it would get exhausted the first time we iterate in the log before we actually went over each destination.
@@ -51,7 +51,7 @@ def __init__(self, retry_last_ts: int, retry_interval: int, destination: str): | |||
destination: the domain in question | |||
""" | |||
|
|||
msg = "Not retrying server %s." % (destination,) | |||
msg = f"Not retrying server {destination} because we tried it recently retry_last_ts={retry_last_ts} and we won't check for another retry_interval={retry_interval}ms." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a more verbose error message to spell things out
…timestamp_to_event_endpoint` test flake. Make sure we backfill the closest local event if it's an `outlier` so we can use it with `/context` and paginate `/messages` from this point. Previously we only did this for a remote event. Fix #13944 Why did this fail before? Why was it flakey? ==================================================== Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.` This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes. ```py from twisted.internet import task as twisted_task d = twisted_task.deferLater(self.hs.get_reactor(), 3.5) await d ``` In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`. Future changes ========================== In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848
7ea5136
to
da87def
Compare
Is this an actual bug user visible bug? It seems odd to have to have Synapse side changes to fix a flakey complement test, unless Synapse is actually hitting a bug. |
Possibly related: matrix-org/complement#492 (review) |
I guess its not clear to me that we would get anywhere near the same answer. In particular, if the query returns an outlier than (pretty much by definition) we don't actually have that part of the DAG locally, and so we don't know if the outlier is the "closest" event to the timestamp. As a concrete example: let's take a really busy room that has many messages a day, but relative infrequent joins (say one a week). A new server joins the room (and so persists all joins as outliers) and their user wants to jump to a point several months ago. The query as it stands will return the nearest join event, which is, say, several weeks before the requested point. I think what this PR will do is cause the client to render a section of the timeline around that join, which is not what the user requested. Instead, if we didn't pick that join and instead hit over federation, then we're at least more likely to get the right answer back? |
Keep in mind, it also has checks to make sure the local event we found is not next to a forward or backward gap. So chances are the But I think the code/my assumptions are wrong. Re-reading We probably need to add some assertions to those functions that the event is not an And we can make the shortcut to not check for I'll make these changes in a separate PR and leave these refactor changes until after we merge that. Thanks for chatting through this 🙂 |
outlier
/timestamp_to_event
to move away from our snowflake pull from destination
pattern
…ate-complement-backfill-test-flake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than a couple of nits like using f-strings in log lines I think this looks good
…ate-complement-backfill-test-flake
…n when not logging at that severity See #14096 (comment)
Thanks for the review @erikjohnston 🐈 |
Synapse 1.71.0 (2022-11-08) =========================== Please note that, as announced in the release notes for Synapse 1.69.0, legacy Prometheus metric names are now disabled by default. They will be removed altogether in Synapse 1.73.0. If not already done, server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names. See the [upgrade notes](https://matrix-org.github.io/synapse/v1.71/upgrade.html#upgrading-to-v1710) for more details. **Note:** in line with our [deprecation policy](https://matrix-org.github.io/synapse/latest/deprecation_policy.html) for platform dependencies, this will be the last release to support PostgreSQL 10, which reaches upstream end-of-life on November 10th, 2022. Future releases of Synapse will require PostgreSQL 11+. No significant changes since 1.71.0rc2. Synapse 1.71.0rc2 (2022-11-04) ============================== Improved Documentation ---------------------- - Document the changes to monthly active user metrics due to deprecation of legacy Prometheus metric names. ([\matrix-org#14358](matrix-org#14358), [\matrix-org#14360](matrix-org#14360)) Deprecations and Removals ------------------------- - Disable legacy Prometheus metric names by default. They can still be re-enabled for now, but they will be removed altogether in Synapse 1.73.0. ([\matrix-org#14353](matrix-org#14353)) Internal Changes ---------------- - Run unit tests against Python 3.11. ([\matrix-org#13812](matrix-org#13812)) Synapse 1.71.0rc1 (2022-11-01) ============================== Features -------- - Support back-channel logouts from OpenID Connect providers. ([\matrix-org#11414](matrix-org#11414)) - Allow use of Postgres and SQLlite full-text search operators in search queries. ([\matrix-org#11635](matrix-org#11635), [\matrix-org#14310](matrix-org#14310), [\matrix-org#14311](matrix-org#14311)) - Implement [MSC3664](matrix-org/matrix-spec-proposals#3664), Pushrules for relations. Contributed by Nico. ([\matrix-org#11804](matrix-org#11804)) - Improve aesthetics of HTML templates. Note that these changes do not retroactively apply to templates which have been [customised](https://matrix-org.github.io/synapse/latest/templates.html#templates) by server admins. ([\matrix-org#13652](matrix-org#13652)) - Enable write-ahead logging for SQLite installations. Contributed by [@asymmetric](https://github.com/asymmetric). ([\matrix-org#13897](matrix-org#13897)) - Show erasure status when [listing users](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html#query-user-account) in the Admin API. ([\matrix-org#14205](matrix-org#14205)) - Provide a specific error code when a `/sync` request provides a filter which doesn't represent a JSON object. ([\matrix-org#14262](matrix-org#14262)) Bugfixes -------- - Fix a long-standing bug where the `update_synapse_database` script could not be run with multiple databases. Contributed by @thefinn93 @ Beeper. ([\matrix-org#13422](matrix-org#13422)) - Fix a bug which prevented setting an avatar on homeservers which have an explicit port in their `server_name` and have `max_avatar_size` and/or `allowed_avatar_mimetypes` configuration. Contributed by @ashfame. ([\matrix-org#13927](matrix-org#13927)) - Check appservice user interest against the local users instead of all users in the room to align with [MSC3905](matrix-org/matrix-spec-proposals#3905). ([\matrix-org#13958](matrix-org#13958)) - Fix a long-standing bug where Synapse would accidentally include extra information in the response to [`PUT /_matrix/federation/v2/invite/{roomId}/{eventId}`](https://spec.matrix.org/v1.4/server-server-api/#put_matrixfederationv2inviteroomideventid). ([\matrix-org#14064](matrix-org#14064)) - Fix a bug introduced in Synapse 1.64.0 where presence updates could be missing from `/sync` responses. ([\matrix-org#14243](matrix-org#14243)) - Fix a bug introduced in Synapse 1.60.0 which caused an error to be logged when Synapse received a SIGHUP signal if debug logging was enabled. ([\matrix-org#14258](matrix-org#14258)) - Prevent history insertion ([MSC2716](matrix-org/matrix-spec-proposals#2716)) during an partial join ([MSC3706](matrix-org/matrix-spec-proposals#3706)). ([\matrix-org#14291](matrix-org#14291)) - Fix a bug introduced in Synapse 1.34.0 where device names would be returned via a federation user key query request when `allow_device_name_lookup_over_federation` was set to `false`. ([\matrix-org#14304](matrix-org#14304)) - Fix a bug introduced in Synapse 0.34.0 where logs could include error spam when background processes are measured as taking a negative amount of time. ([\matrix-org#14323](matrix-org#14323)) - Fix a bug introduced in Synapse 1.70.0 where clients were unable to PUT new [dehydrated devices](matrix-org/matrix-spec-proposals#2697). ([\matrix-org#14336](matrix-org#14336)) Improved Documentation ---------------------- - Explain how to disable the use of [`trusted_key_servers`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#trusted_key_servers). ([\matrix-org#13999](matrix-org#13999)) - Add workers settings to [configuration manual](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#individual-worker-configuration). ([\matrix-org#14086](matrix-org#14086)) - Correct the name of the config option [`encryption_enabled_by_default_for_room_type`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#encryption_enabled_by_default_for_room_type). ([\matrix-org#14110](matrix-org#14110)) - Update docstrings of `SynapseError` and `FederationError` to bettter describe what they are used for and the effects of using them are. ([\matrix-org#14191](matrix-org#14191)) Internal Changes ---------------- - Remove unused `@lru_cache` decorator. ([\matrix-org#13595](matrix-org#13595)) - Save login tokens in database and prevent login token reuse. ([\matrix-org#13844](matrix-org#13844)) - Refactor OIDC tests to better mimic an actual OIDC provider. ([\matrix-org#13910](matrix-org#13910)) - Fix type annotation causing import time error in the Complement forking launcher. ([\matrix-org#14084](matrix-org#14084)) - Refactor [MSC3030](matrix-org/matrix-spec-proposals#3030) `/timestamp_to_event` endpoint to loop over federation destinations with standard pattern and error handling. ([\matrix-org#14096](matrix-org#14096)) - Add initial power level event to batch of bulk persisted events when creating a new room. ([\matrix-org#14228](matrix-org#14228)) - Refactor `/key/` endpoints to use `RestServlet` classes. ([\matrix-org#14229](matrix-org#14229)) - Switch to using the `matrix-org/backend-meta` version of `triage-incoming` for new issues in CI. ([\matrix-org#14230](matrix-org#14230)) - Build wheels on macos 11, not 10.15. ([\matrix-org#14249](matrix-org#14249)) - Add debugging to help diagnose lost device list updates. ([\matrix-org#14268](matrix-org#14268)) - Add Rust cache to CI for `trial` runs. ([\matrix-org#14287](matrix-org#14287)) - Improve type hinting of `RawHeaders`. ([\matrix-org#14303](matrix-org#14303)) - Use Poetry 1.2.0 in the Twisted Trunk CI job. ([\matrix-org#14305](matrix-org#14305)) <details> <summary>Dependency updates</summary> Runtime: - Bump anyhow from 1.0.65 to 1.0.66. ([\matrix-org#14278](matrix-org#14278)) - Bump jinja2 from 3.0.3 to 3.1.2. ([\matrix-org#14271](matrix-org#14271)) - Bump prometheus-client from 0.14.0 to 0.15.0. ([\matrix-org#14274](matrix-org#14274)) - Bump psycopg2 from 2.9.4 to 2.9.5. ([\matrix-org#14331](matrix-org#14331)) - Bump pysaml2 from 7.1.2 to 7.2.1. ([\matrix-org#14270](matrix-org#14270)) - Bump sentry-sdk from 1.5.11 to 1.10.1. ([\matrix-org#14330](matrix-org#14330)) - Bump serde from 1.0.145 to 1.0.147. ([\matrix-org#14277](matrix-org#14277)) - Bump serde_json from 1.0.86 to 1.0.87. ([\matrix-org#14279](matrix-org#14279)) Tooling and CI: - Bump black from 22.3.0 to 22.10.0. ([\matrix-org#14328](matrix-org#14328)) - Bump flake8-bugbear from 21.3.2 to 22.9.23. ([\matrix-org#14042](matrix-org#14042)) - Bump peaceiris/actions-gh-pages from 3.8.0 to 3.9.0. ([\matrix-org#14276](matrix-org#14276)) - Bump peaceiris/actions-mdbook from 1.1.14 to 1.2.0. ([\matrix-org#14275](matrix-org#14275)) - Bump setuptools-rust from 1.5.1 to 1.5.2. ([\matrix-org#14273](matrix-org#14273)) - Bump twine from 3.8.0 to 4.0.1. ([\matrix-org#14332](matrix-org#14332)) - Bump types-opentracing from 2.4.7 to 2.4.10. ([\matrix-org#14133](matrix-org#14133)) - Bump types-requests from 2.28.11 to 2.28.11.2. ([\matrix-org#14272](matrix-org#14272)) </details>
Refactor MSC3030
/timestamp_to_event
:federation_client.timestamp_to_event(...)
now handles alldestination
looping and uses our generic_try_destination_list(...)
helper.NotRetryingDestination
andFederationDeniedError
acrossget_pdu
, backfill, and the generic_try_destination_list
which is used for many places we use this pattern.get_pdu(...)
now returnsPulledPduInfo
so we know whichdestination
we ended up pulling the PDU fromTodo
/timestamp_to_event
returningoutliers
that it has no idea whether are near a gap or not #14215 to mergeDev notes
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.Pull request includes a sign off(run the linters)