Merge pull request #154 from supabase/db-deletes-wal

fix: drop replication slot when db deletes wal segment
supabase · May 22, 2021 · 48edd9e
2 parents c293c76 + df39135
commit 48edd9e

Showing 2 changed files with 50 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -62,6 +62,13 @@ A few reasons:
 2. Decoupling. For example, if you want to send a new slack message every time someone makes a new purchase you might build that functionality directly into your API. This allows you to decouple your async functionality from your API.
 3. This is built with Phoenix, an [extremely scalable Elixir framework](https://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections).
 
+### Does this server guarantee delivery of every data change?
+
+Not yet! Due to the following limitations:
+
+1. The Postgres database can run out of disk space due to Write-Ahead Log (WAL) buildup, which can crash the database and prevent the Realtime server from streaming replication and broadcasting changes.
+2. The Realtime server can crash when replication lag grows larger than available memory, forcing the creation of a new replication slot and resetting streaming replication to read from the latest WAL data.
+3. When the Realtime server falls too far behind for any reason (for example, by disconnecting from the database while WAL continues to build up), the database can delete WAL segments that the server still needs to read after it reconnects.
 
 ## Quick start
 

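The `handle_info/2` clause added to `epgsql_server.ex` in this pull request decides whether to drop the replication slot by splitting the Postgres error message on whitespace and matching the resulting word list. A minimal standalone sketch of that matching step follows. The module `WalErrorSketch` and the function `classify/1` are hypothetical names used only for illustration and are not part of the Realtime codebase.

```elixir
defmodule WalErrorSketch do
  # Hypothetical helper for illustration only; not part of Realtime.
  # Mirrors the String.split/1 pattern match used in the diff.

  @doc """
  Returns `{:wal_segment_removed, segment}` when the message has the shape of
  the "requested WAL segment ... has already been removed" error, else `:other`.
  """
  def classify(error_msg) when is_binary(error_msg) do
    # String.split/1 splits on whitespace, so the segment name is one token.
    case String.split(error_msg) do
      ["requested", "WAL", "segment", segment, "has", "already", "been", "removed"] ->
        {:wal_segment_removed, segment}

      _ ->
        :other
    end
  end
end

msg = "requested WAL segment 00000001000000000000007F has already been removed"
IO.inspect(WalErrorSketch.classify(msg))
IO.inspect(WalErrorSketch.classify("connection closed"))
```

Matching on the exact word list keeps the check independent of the segment name, but it depends on the wording of the server's error message, so a differently worded or localized message would fall through to the catch-all branch.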
diff --git a/server/lib/adapters/postgres/epgsql_server.ex b/server/lib/adapters/postgres/epgsql_server.ex
@@ -170,6 +170,49 @@ defmodule Realtime.Adapters.Postgres.EpgsqlServer do
     {:stop, msg, state}
   end
 
+  @doc """
+
+  Drops the existing replication slot when the epgsql replication process crashes because
+  the database deleted a WAL segment that the Realtime server, having fallen too far behind, still needed.
+
+  ## Example process exit message
+
+    {:EXIT, #PID<0.2324.0>,
+     {:error,
+      {:error, :error, "58P01", :undefined_file,
+       "requested WAL segment 00000001000000000000007F has already been removed",
+       [file: "walsender.c", line: "2447", routine: "XLogRead", severity: "ERROR"]}}}
+
+  """
+  @impl true
+  def handle_info(
+        {:EXIT, _pid,
+         {:error,
+          {:error, :error, "58P01", :undefined_file, error_msg,
+           [file: "walsender.c", line: _line, routine: "XLogRead", severity: "ERROR"]}}} = msg,
+        %{
+          replication_epgsql_pid: replication_epgsql_pid,
+          select_epgsql_pid: select_epgsql_pid
+        } = state
+      )
+      when is_binary(error_msg) do
+    :ok = :epgsql.close(replication_epgsql_pid)
+
+    stop_msg =
+      case String.split(error_msg) do
+        ["requested", "WAL", "segment", _, "has", "already", "been", "removed"] ->
+          :ok = maybe_drop_replication_slot(state)
+          {:error, {error_msg, :replication_slot_dropped}}
+
+        _ ->
+          msg
+      end
+
+    :ok = :epgsql.close(select_epgsql_pid)
+
+    {:stop, stop_msg, state}
+  end
+
   @impl true
   def handle_info(
         msg,