Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reduce serialization errors in MultiWriterIdGen #8456

Merged
merged 8 commits into from
Oct 7, 2020

Conversation

erikjohnston
Copy link
Member

We call _update_stream_positions_table_txn a lot, which is an UPSERT
that can conflict in REPEATABLE READ isolation level. Instead of doing
a transaction consisting of a single query we may as well run it outside
of a transaction.

@erikjohnston erikjohnston force-pushed the erikj/fix_serialization_errors branch from 3a69dfc to 8ffd707 Compare October 2, 2020 15:48
@erikjohnston erikjohnston changed the base branch from release-v1.21.0 to develop October 2, 2020 15:48
with conn.cursor(txn_name="MultiWriterIdGenerator._update_table") as cur:
self._update_stream_positions_table_txn(cur)
finally:
conn.conn.autocommit = False # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set this back to what it was instead of always False?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it really should always be False, but can do if you think its clearer?

def _update_stream_positions_table_conn(self, conn: LoggingDatabaseConnection):
# We use autocommit here so that we don't have to go through a
# transaction dance, which a) adds latency and b) runs the risk of
# serialization errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we still risk serialization errors, but now they won't be handled?

autocommit is just syntactic sugar for a transaction with explicit commit afaik?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it'll behave (at worst) like it was in a transaction with a READ COMMITTED isolation level, which doesn't have concurrent update errors. (Plus the fact that the command gets committed immediately means the likelihood of concurrent updates is dramatically smaller.

... but now they won't be handled?

Ah, I forgot that we handled those exceptions in the new_transaction function, rather than in runWithConnection. I do think that we'll still get errors even if we're not in a transaction though?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course having said all that the CI is failing with:

psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update

@erikjohnston erikjohnston force-pushed the erikj/fix_serialization_errors branch 3 times, most recently from 6c05c2d to 2242bec Compare October 5, 2020 11:23
This allows running queries outside of transactions, which is useful to
avoid the overhead of transaction management (in terms of RTT and
isolation levels).
We call `_update_stream_positions_table_txn` a lot, which is an UPSERT
that can conflict in `REPEATABLE READ` isolation level. Instead of doing
a transaction consisting of a single query we may as well run it outside
of a transaction.
@erikjohnston erikjohnston force-pushed the erikj/fix_serialization_errors branch from 8bd3ae7 to 9be577c Compare October 5, 2020 14:22
@erikjohnston
Copy link
Member Author

Ok, after quite a bit of faff I think this is ready again. I changed it lots, which is why I rebased it into nicer commits.

The reason everything went horribly wrong is because we put newly connected connections into the DB pool with uncommitted transactions in them (whoops), which then meant nothing of the fiddling of isolation levels/auto commit modes worked. (Plus if you tried to do conn.autocommit = True the twisted DB connection wrapper doesn't proxy that to the underlying connection; it has implemented __getattr_ proxy but not __setattr__ proxy: https://twistedmatrix.com/trac/ticket/9998#ticket.)

The current changes makes it possible to run queries outside of transactions, which should be safe for this use case (and I believe more generally for other single query transactions). Errors are handled, though we don't retry the queries again, which is fine here as we'll right to the table again soon (and the start up code handles it being behind).

I've manually tested concurrently executing these queries with a READ COMMITTED isolation level, and that works correctly (the second query blocks until the other transaction commits/rolls back).

The reason I have added the db_autocommit to runInteraction/runWithConncetion as I think we have a bunch of upserts that it should be safe to change to using this, which should hopefully reduce the amount of serialization errors we see.

@erikjohnston erikjohnston requested a review from a team October 5, 2020 15:48
Comment on lines 476 to 477
db_retry: Whether to retry the transaction by calling `func` again.
This should be disabled if connection is in autocommit mode.
Copy link
Member

@richvdh richvdh Oct 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
db_retry: Whether to retry the transaction by calling `func` again.
This should be disabled if connection is in autocommit mode.
db_retry: Whether to retry the transaction after an OperationalError
or DatabaseError by calling `func` again.
This should be disabled if connection is in autocommit mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with disabling this is that it's the thing that deals with postgres connections dropping (which can happen for annoying TCP reasons). I'm not sure it's safe to do: why is it required?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike functions that are run in transactions its not necessarily true its safe to just re-run transaction functions, since they might have done half the work already. I guess we can mandate that such functions should be safe to re-run? (All the usages are for example)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done that.

synapse/storage/engines/postgres.py Outdated Show resolved Hide resolved
Comment on lines +114 to +115
# Twisted doesn't let us set attributes on the connections, so we can't
# set the connection to autocommit mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's Twisted to do with it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connections we get are Twisted connections that wrap the underlying native connections, the wrapper implements __getattr__ but not __setattr__ so we can't set the autocommit flag

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right.

I suppose you could delve into conn._connection, but I guess it doesn't really matter? It might be worth commenting somewhere (at the call site, maybe), that this is will only attempt to enable autocommit, so the transaction must do an explicit commit anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could go and pull out _connection, but then I'd want to handle the case where they rename that variable in a future release.

Comment on lines 476 to 477
db_retry: Whether to retry the transaction by calling `func` again.
This should be disabled if connection is in autocommit mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with disabling this is that it's the thing that deals with postgres connections dropping (which can happen for annoying TCP reasons). I'm not sure it's safe to do: why is it required?

synapse/storage/util/id_generators.py Outdated Show resolved Hide resolved
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm otherwise

Comment on lines +114 to +115
# Twisted doesn't let us set attributes on the connections, so we can't
# set the connection to autocommit mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right.

I suppose you could delve into conn._connection, but I guess it doesn't really matter? It might be worth commenting somewhere (at the call site, maybe), that this is will only attempt to enable autocommit, so the transaction must do an explicit commit anyway?

synapse/storage/database.py Outdated Show resolved Hide resolved
synapse/storage/database.py Outdated Show resolved Hide resolved
erikjohnston and others added 2 commits October 7, 2020 14:36
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
@erikjohnston erikjohnston merged commit ae5b2a7 into develop Oct 7, 2020
@erikjohnston erikjohnston deleted the erikj/fix_serialization_errors branch October 7, 2020 14:15
erikjohnston added a commit that referenced this pull request Oct 7, 2020
We call `_update_stream_positions_table_txn` a lot, which is an UPSERT
that can conflict in `REPEATABLE READ` isolation level. Instead of doing
a transaction consisting of a single query we may as well run it outside
of a transaction.
erikjohnston added a commit that referenced this pull request Oct 8, 2020
Synapse 1.21.0rc3 (2020-10-08)
==============================

Bugfixes
--------

- Fix duplication of events on high traffic servers, caused by PostgreSQL `could not serialize access due to concurrent update` errors. ([\#8456](#8456))

Internal Changes
----------------

- Add Groovy Gorilla to the list of distributions we build `.deb`s for. ([\#8475](#8475))
erikjohnston added a commit that referenced this pull request Oct 14, 2020
Autocommit means that we don't wrap the functions in transactions, and instead get executed directly. Introduced in #8456. This will help:

1. reduce the number of `could not serialize access due to concurrent delete` errors that we see (though there are a few functions that often cause serialization errors that we don't fix here);
2. improve the DB performance, as it no longer needs to deal with the overhead of `REPEATABLE READ` isolation levels; and
3. improve wall clock speed of these functions, as we no longer need to send `BEGIN` and `COMMIT` to the DB.

Some notes about the differences between autocommit mode and our default `REPEATABLE READ` transactions:

1. Currently `autocommit` only applies when using PostgreSQL, and is ignored when using SQLite (due to silliness with [Twisted DB classes](https://twistedmatrix.com/trac/ticket/9998)).
2. Autocommit functions may get retried on error, which means they can get applied *twice* (or more) to the DB (since they are not in a transaction the previous call would not get rolled back). This means that the functions need to be idempotent (or otherwise not care about being called multiple times). Read queries, simple deletes, and updates/upserts that replace rows (rather than generating new values from existing rows) are all idempotent.
3. Autocommit functions no longer get executed in [`REPEATABLE READ`](https://www.postgresql.org/docs/current/transaction-iso.html) isolation level, and so data can change queries, which is fine for single statement queries.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 17, 2020
Synapse 1.21.2 (2020-10-15)
===========================

Debian packages and Docker images have been rebuilt using the latest versions of dependency libraries, including authlib 0.15.1. Please see bugfixes below.

Security advisory
-----------------

* HTML pages served via Synapse were vulnerable to cross-site scripting (XSS)
  attacks. All server administrators are encouraged to upgrade.
  ([\#8444](matrix-org/synapse#8444))
  ([CVE-2020-26891](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26891))

  This fix was originally included in v1.21.0 but was missing a security advisory.

  This was reported by [Denis Kasak](https://github.com/dkasak).

Bugfixes
--------

- Fix rare bug where sending an event would fail due to a racey assertion. ([\#8530](matrix-org/synapse#8530))
- An updated version of the authlib dependency is included in the Docker and Debian images to fix an issue using OpenID Connect. See [\#8534](matrix-org/synapse#8534) for details.


Synapse 1.21.1 (2020-10-13)
===========================

This release fixes a regression in v1.21.0 that prevented debian packages from being built.
It is otherwise identical to v1.21.0.

Synapse 1.21.0 (2020-10-12)
===========================

No significant changes since v1.21.0rc3.

As [noted in
v1.20.0](https://github.com/matrix-org/synapse/blob/release-v1.21.0/CHANGES.md#synapse-1200-2020-09-22),
a future release will drop support for accessing Synapse's
[Admin API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api) under the
`/_matrix/client/*` endpoint prefixes. At that point, the Admin API will only
be accessible under `/_synapse/admin`.


Synapse 1.21.0rc3 (2020-10-08)
==============================

Bugfixes
--------

- Fix duplication of events on high traffic servers, caused by PostgreSQL `could not serialize access due to concurrent update` errors. ([\#8456](matrix-org/synapse#8456))


Internal Changes
----------------

- Add Groovy Gorilla to the list of distributions we build `.deb`s for. ([\#8475](matrix-org/synapse#8475))


Synapse 1.21.0rc2 (2020-10-02)
==============================

Features
--------

- Convert additional templates from inline HTML to Jinja2 templates. ([\#8444](matrix-org/synapse#8444))

Bugfixes
--------

- Fix a regression in v1.21.0rc1 which broke thumbnails of remote media. ([\#8438](matrix-org/synapse#8438))
- Do not expose the experimental `uk.half-shot.msc2778.login.application_service` flow in the login API, which caused a compatibility problem with Element iOS. ([\#8440](matrix-org/synapse#8440))
- Fix malformed log line in new federation "catch up" logic. ([\#8442](matrix-org/synapse#8442))
- Fix DB query on startup for negative streams which caused long start up times. Introduced in [\#8374](matrix-org/synapse#8374). ([\#8447](matrix-org/synapse#8447))


Synapse 1.21.0rc1 (2020-10-01)
==============================

Features
--------

- Require the user to confirm that their password should be reset after clicking the email confirmation link. ([\#8004](matrix-org/synapse#8004))
- Add an admin API `GET /_synapse/admin/v1/event_reports` to read entries of table `event_reports`. Contributed by @dklimpel. ([\#8217](matrix-org/synapse#8217))
- Consolidate the SSO error template across all configuration. ([\#8248](matrix-org/synapse#8248), [\#8405](matrix-org/synapse#8405))
- Add a configuration option to specify a whitelist of domains that a user can be redirected to after validating their email or phone number. ([\#8275](matrix-org/synapse#8275), [\#8417](matrix-org/synapse#8417))
- Add experimental support for sharding event persister. ([\#8294](matrix-org/synapse#8294), [\#8387](matrix-org/synapse#8387), [\#8396](matrix-org/synapse#8396), [\#8419](matrix-org/synapse#8419))
- Add the room topic and avatar to the room details admin API. ([\#8305](matrix-org/synapse#8305))
- Add an admin API for querying rooms where a user is a member. Contributed by @dklimpel. ([\#8306](matrix-org/synapse#8306))
- Add `uk.half-shot.msc2778.login.application_service` login type to allow appservices to login. ([\#8320](matrix-org/synapse#8320))
- Add a configuration option that allows existing users to log in with OpenID Connect. Contributed by @BBBSnowball and @OmmyZhang. ([\#8345](matrix-org/synapse#8345))
- Add prometheus metrics for replication requests. ([\#8406](matrix-org/synapse#8406))
- Support passing additional single sign-on parameters to the client. ([\#8413](matrix-org/synapse#8413))
- Add experimental reporting of metrics on expensive rooms for state-resolution. ([\#8420](matrix-org/synapse#8420))
- Add experimental prometheus metric to track numbers of "large" rooms for state resolutiom. ([\#8425](matrix-org/synapse#8425))
- Add prometheus metrics to track federation delays. ([\#8430](matrix-org/synapse#8430))


Bugfixes
--------

- Fix a bug in the media repository where remote thumbnails with the same size but different crop methods would overwrite each other. Contributed by @deepbluev7. ([\#7124](matrix-org/synapse#7124))
- Fix inconsistent handling of non-existent push rules, and stop tracking the `enabled` state of removed push rules. ([\#7796](matrix-org/synapse#7796))
- Fix a longstanding bug when storing a media file with an empty `upload_name`. ([\#7905](matrix-org/synapse#7905))
- Fix messages not being sent over federation until an event is sent into the same room. ([\#8230](matrix-org/synapse#8230), [\#8247](matrix-org/synapse#8247), [\#8258](matrix-org/synapse#8258), [\#8272](matrix-org/synapse#8272), [\#8322](matrix-org/synapse#8322))
- Fix a longstanding bug where files that could not be thumbnailed would result in an Internal Server Error. ([\#8236](matrix-org/synapse#8236), [\#8435](matrix-org/synapse#8435))
- Upgrade minimum version of `canonicaljson` to version 1.4.0, to fix an unicode encoding issue. ([\#8262](matrix-org/synapse#8262))
- Fix longstanding bug which could lead to incomplete database upgrades on SQLite. ([\#8265](matrix-org/synapse#8265))
- Fix stack overflow when stderr is redirected to the logging system, and the logging system encounters an error. ([\#8268](matrix-org/synapse#8268))
- Fix a bug which cause the logging system to report errors, if `DEBUG` was enabled and no `context` filter was applied. ([\#8278](matrix-org/synapse#8278))
- Fix edge case where push could get delayed for a user until a later event was pushed. ([\#8287](matrix-org/synapse#8287))
- Fix fetching malformed events from remote servers. ([\#8324](matrix-org/synapse#8324))
- Fix `UnboundLocalError` from occuring when appservices send a malformed register request. ([\#8329](matrix-org/synapse#8329))
- Don't send push notifications to expired user accounts. ([\#8353](matrix-org/synapse#8353))
- Fix a regression in v1.19.0 with reactivating users through the admin API. ([\#8362](matrix-org/synapse#8362))
- Fix a bug where during device registration the length of the device name wasn't limited. ([\#8364](matrix-org/synapse#8364))
- Include `guest_access` in the fields that are checked for null bytes when updating `room_stats_state`. Broke in v1.7.2. ([\#8373](matrix-org/synapse#8373))
- Fix theoretical race condition where events are not sent down `/sync` if the synchrotron worker is restarted without restarting other workers. ([\#8374](matrix-org/synapse#8374))
- Fix a bug which could cause errors in rooms with malformed membership events, on servers using sqlite. ([\#8385](matrix-org/synapse#8385))
- Fix "Re-starting finished log context" warning when receiving an event we already had over federation. ([\#8398](matrix-org/synapse#8398))
- Fix incorrect handling of timeouts on outgoing HTTP requests. ([\#8400](matrix-org/synapse#8400))
- Fix a regression in v1.20.0 in the `synapse_port_db` script regarding the `ui_auth_sessions_ips` table. ([\#8410](matrix-org/synapse#8410))
- Remove unnecessary 3PID registration check when resetting password via an email address. Bug introduced in v0.34.0rc2. ([\#8414](matrix-org/synapse#8414))


Improved Documentation
----------------------

- Add `/_synapse/client` to the reverse proxy documentation. ([\#8227](matrix-org/synapse#8227))
- Add note to the reverse proxy settings documentation about disabling Apache's mod_security2. Contributed by Julian Fietkau (@jfietkau). ([\#8375](matrix-org/synapse#8375))
- Improve description of `server_name` config option in `homserver.yaml`. ([\#8415](matrix-org/synapse#8415))


Deprecations and Removals
-------------------------

- Drop support for `prometheus_client` older than 0.4.0. ([\#8426](matrix-org/synapse#8426))


Internal Changes
----------------

- Fix tests on distros which disable TLSv1.0. Contributed by @danc86. ([\#8208](matrix-org/synapse#8208))
- Simplify the distributor code to avoid unnecessary work. ([\#8216](matrix-org/synapse#8216))
- Remove the `populate_stats_process_rooms_2` background job and restore functionality to `populate_stats_process_rooms`. ([\#8243](matrix-org/synapse#8243))
- Clean up type hints for `PaginationConfig`. ([\#8250](matrix-org/synapse#8250), [\#8282](matrix-org/synapse#8282))
- Track the latest event for every destination and room for catch-up after federation outage. ([\#8256](matrix-org/synapse#8256))
- Fix non-user visible bug in implementation of `MultiWriterIdGenerator.get_current_token_for_writer`. ([\#8257](matrix-org/synapse#8257))
- Switch to the JSON implementation from the standard library. ([\#8259](matrix-org/synapse#8259))
- Add type hints to `synapse.util.async_helpers`. ([\#8260](matrix-org/synapse#8260))
- Simplify tests that mock asynchronous functions. ([\#8261](matrix-org/synapse#8261))
- Add type hints to `StreamToken` and `RoomStreamToken` classes. ([\#8279](matrix-org/synapse#8279))
- Change `StreamToken.room_key` to be a `RoomStreamToken` instance. ([\#8281](matrix-org/synapse#8281))
- Refactor notifier code to correctly use the max event stream position. ([\#8288](matrix-org/synapse#8288))
- Use slotted classes where possible. ([\#8296](matrix-org/synapse#8296))
- Support testing the local Synapse checkout against the [Complement homeserver test suite](https://github.com/matrix-org/complement/). ([\#8317](matrix-org/synapse#8317))
- Update outdated usages of `metaclass` to python 3 syntax. ([\#8326](matrix-org/synapse#8326))
- Move lint-related dependencies to package-extra field, update CONTRIBUTING.md to utilise this. ([\#8330](matrix-org/synapse#8330), [\#8377](matrix-org/synapse#8377))
- Use the `admin_patterns` helper in additional locations. ([\#8331](matrix-org/synapse#8331))
- Fix test logging to allow braces in log output. ([\#8335](matrix-org/synapse#8335))
- Remove `__future__` imports related to Python 2 compatibility. ([\#8337](matrix-org/synapse#8337))
- Simplify `super()` calls to Python 3 syntax. ([\#8344](matrix-org/synapse#8344))
- Fix bad merge from `release-v1.20.0` branch to `develop`. ([\#8354](matrix-org/synapse#8354))
- Factor out a `_send_dummy_event_for_room` method. ([\#8370](matrix-org/synapse#8370))
- Improve logging of state resolution. ([\#8371](matrix-org/synapse#8371))
- Add type annotations to `SimpleHttpClient`. ([\#8372](matrix-org/synapse#8372))
- Refactor ID generators to use `async with` syntax. ([\#8383](matrix-org/synapse#8383))
- Add `EventStreamPosition` type. ([\#8388](matrix-org/synapse#8388))
- Create a mechanism for marking tests "logcontext clean". ([\#8399](matrix-org/synapse#8399))
- A pair of tiny cleanups in the federation request code. ([\#8401](matrix-org/synapse#8401))
- Add checks on startup that PostgreSQL sequences are consistent with their associated tables. ([\#8402](matrix-org/synapse#8402))
- Do not include appservice users when calculating the total MAU for a server. ([\#8404](matrix-org/synapse#8404))
- Typing fixes for `synapse.handlers.federation`. ([\#8422](matrix-org/synapse#8422))
- Various refactors to simplify stream token handling. ([\#8423](matrix-org/synapse#8423))
- Make stream token serializing/deserializing async. ([\#8427](matrix-org/synapse#8427))
babolivier pushed a commit that referenced this pull request Sep 1, 2021
* commit 'f76194a02':
  1.21.0
  Update change log
  1.21.0rc3
  Reduce serialization errors in MultiWriterIdGen (#8456)
  Add Ubuntu 20.10 (Groovy Gorilla) to build scripts. (#8475)
  move #8444 to 'feature'
  linkify changelog
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants