fix: drop replication slot when db deletes wal segment #154

Merged (2 commits) on May 22, 2021
7 changes: 7 additions & 0 deletions README.md
@@ -62,6 +62,13 @@ A few reasons:
2. Decoupling. For example, if you want to send a new slack message every time someone makes a new purchase you might build that functionality directly into your API. This allows you to decouple your async functionality from your API.
3. This is built with Phoenix, an [extremely scalable Elixir framework](https://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections).

### Does this server guarantee delivery of every data change?

Not yet, because of the following limitations:

1. The Postgres database can run out of disk space because of Write-Ahead Log (WAL) buildup, which can crash the database and prevent the Realtime server from streaming replication and broadcasting changes.
2. The Realtime server can crash when the replication lag grows larger than the available memory, which forces the creation of a new replication slot and resets streaming replication to read from the latest WAL data.
3. When the Realtime server falls too far behind for any reason (for example, it disconnects from the database while WAL continues to build up), the database can delete WAL segments that the server still needs to read after it reconnects.

## Quick start

43 changes: 43 additions & 0 deletions server/lib/adapters/postgres/epgsql_server.ex
@@ -170,6 +170,49 @@ defmodule Realtime.Adapters.Postgres.EpgsqlServer do
    {:stop, msg, state}
  end

  @doc """

  Removes the existing replication slot when the epgsql replication process crashes
  because the database deleted a WAL segment after the Realtime server fell too far behind.

  ## Example process exit message

      {:EXIT, #PID<0.2324.0>,
       {:error,
        {:error, :error, "58P01", :undefined_file,
         "requested WAL segment 00000001000000000000007F has already been removed",
         [file: "walsender.c", line: "2447", routine: "XLogRead", severity: "ERROR"]}}}

  """
  @impl true
  def handle_info(
        {:EXIT, _pid,
         {:error,
          {:error, :error, "58P01", :undefined_file, error_msg,
           [file: "walsender.c", line: _line, routine: "XLogRead", severity: "ERROR"]}}} = msg,
        %{
          replication_epgsql_pid: replication_epgsql_pid,
          select_epgsql_pid: select_epgsql_pid
        } = state
      )
      when is_binary(error_msg) do
    :ok = :epgsql.close(replication_epgsql_pid)

    stop_msg =
      case String.split(error_msg) do
        ["requested", "WAL", "segment", _, "has", "already", "been", "removed"] ->
          :ok = maybe_drop_replication_slot(state)
          {:error, {error_msg, :replication_slot_dropped}}

        _ ->
          msg
      end

    :ok = :epgsql.close(select_epgsql_pid)

    {:stop, stop_msg, state}
  end

  @impl true
  def handle_info(
        msg,
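The new `handle_info/2` clause decides whether to drop the slot by splitting the error message on whitespace and matching the resulting word list. A minimal standalone sketch of that check may make it easier to reason about in isolation. The module name `WalErrorSketch` and the function `classify/1` are hypothetical and not part of this PR; only `String.split/1` and the message text come from the diff.

```elixir
defmodule WalErrorSketch do
  # Hypothetical helper, not part of the PR. It mirrors the String.split/1
  # match used in handle_info/2 to recognise a "WAL segment removed" error.
  @spec classify(String.t()) :: {:wal_segment_removed, String.t()} | :other
  def classify(error_msg) when is_binary(error_msg) do
    # String.split/1 splits on whitespace and ignores leading/trailing spaces.
    case String.split(error_msg) do
      ["requested", "WAL", "segment", segment, "has", "already", "been", "removed"] ->
        # The fourth word is the name of the segment Postgres no longer has.
        {:wal_segment_removed, segment}

      _ ->
        # Any other message is left for the caller to handle unchanged.
        :other
    end
  end
end
```

Under these assumptions, the exit message from the docstring would be classified as a removed segment, while any other `58P01` message from `XLogRead` would fall through to `:other`, which corresponds to the clause stopping with the original `msg`.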