From 9e96fb164aeeb96649f5c7f96903a8ed8e1115a4 Mon Sep 17 00:00:00 2001
From: Ashleigh Carr
Date: Thu, 9 Jan 2025 12:43:54 +0000
Subject: [PATCH] Document migration process

---
 docs/1-migration/database.md | 144 +++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 docs/1-migration/database.md

diff --git a/docs/1-migration/database.md b/docs/1-migration/database.md
new file mode 100644
index 0000000..d2fa1f1
--- /dev/null
+++ b/docs/1-migration/database.md
@@ -0,0 +1,144 @@
# Identity Database Migration

## Objectives

 - The **security** and **privacy** of our readers' data remains our top priority.
 - There should be **zero downtime** when switching from the legacy database to the new database. All data should be eventually consistent.
 - We should be able to quickly switch back to the legacy database in case of any issues.
 - Data is no longer stored in vast JSONB columns. Our schemas should be well defined and normalized.

## Identity API Migration

Our plan is to migrate our databases first, before thinking about how we rebuild **Identity API**. This means that **Identity API** will have to read from and write to both the legacy database and the new database until we've completed our database migration.

1. **Apply `UPDATE`s/`INSERT`s/`DELETE`s to both databases at the same time.**

   We may want to do each operation in its own PR to reduce complexity and avoid outages. In that case we should implement `UPDATE` and `DELETE` operations before we implement `INSERT` operations, to avoid having stale data in the new database.

   At this point we'd expect that, if a user exists in the new database, their data is up to date and identical to the data in the legacy database.

2. **Read data from the new database and verify it matches data from the legacy database.**

   Whenever **Identity API** receives an API request which requires it to fetch data from the database, it will request data from both databases and record (for example via a metric) whether the results match. This comparison gives us the data validity ratio referred to below.

   At this point **Identity API** should still be exclusively serving data from the legacy database, even if it has data from the new database.

3. **Migrate data from the legacy database to the new database.**

   See [Data Migration](#data-migration).

4. **Switch to serving data from the new database.**
   > [!NOTE]
   > We should be able to control which database is the "primary" via a feature switch, in case we need to quickly revert to serving traffic from the legacy database.

   Once all of the data is migrated from the legacy database to the new database we should see our data validity ratio increase, because the new database now holds data for all of our users.

   Once the data validity ratio reaches 100% we can start serving data from the new database instead. Note that at this point we should still be writing data to both databases, in case we need to quickly revert back to serving traffic from the legacy database.

5. **Stop reading from and writing to the legacy database.**
   > [!NOTE]
   > The moment we stop writing data to the legacy database it will no longer be possible to easily switch back to using it, as we'll no longer be able to rely on its data being accurate.

   Once we're happy that the new database has been working correctly for a sufficiently long period, we can remove the code which reads from and writes to the legacy database.

6. **Snapshot and remove legacy database.**

   TBD. What else do we need to migrate after identifiers and consents?

## Database Schema Proposal

```sql
-- Take a private_uuid, append the specified salt, and hash it to generate a unique external ID
CREATE FUNCTION gu_generate_identifier(private_uuid UUID, salt varchar) RETURNS varchar
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT
    RETURN encode(sha256(convert_to(private_uuid::text || salt, 'UTF8')), 'hex');

-- Manually create a sequence to be able to set the starting value.
-- The starting value should be considerably higher than the current highest ID,
-- so that we don't accidentally assign two users the same ID when we switch to using the new sequence.
CREATE SEQUENCE users_identity_id_seq AS integer START 100000;

CREATE TABLE users(
    identity_id INTEGER PRIMARY KEY DEFAULT nextval('users_identity_id_seq'),
    okta_id varchar(100) UNIQUE NOT NULL,
    username varchar(20) UNIQUE,
    braze_id UUID UNIQUE NOT NULL DEFAULT gen_random_uuid(),
    private_id UUID UNIQUE NOT NULL DEFAULT gen_random_uuid(),
    puzzle_id varchar UNIQUE NOT NULL GENERATED ALWAYS AS (gu_generate_identifier(private_id, '8e833eab546c44a8a441ab052604ff2a')) STORED,
    google_tag_id varchar UNIQUE NOT NULL GENERATED ALWAYS AS (gu_generate_identifier(private_id, 'c16a3672d5404771baa2e10668cc1285')) STORED
);

ALTER SEQUENCE users_identity_id_seq OWNED BY users.identity_id;
```

## Data Migration

We have two approaches to migrating data from the legacy DB to the new DB. Both approaches require that new data, updates, and deletions are already being applied to the new database in order to maintain consistency. Both approaches can also be done either in batches or as a one-shot job.

### Migration using aws_s3

```mermaid
flowchart LR
    subgraph Identity DB
        db1_users[(users)]
    end
    S3
    subgraph Gatehouse DB
        db2_temp[(users_temp)]
        db2_users[(users)]
    end
    db1_users-->S3
    S3-->db2_temp
    db2_temp-->db2_users
    db1_users<-->idapi
    db2_users<-->idapi

    idapi[Identity API]
```

AWS have built a Postgres extension called `aws_s3` which adds Postgres functions for uploading data to, and downloading data from, S3. Using this extension we could in theory export all our user data from the legacy database to S3 and then import it into our new database.

The extension does have one notable limitation: it does not support [`ON CONFLICT` clauses](https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT) when importing data into a database. Ideally we'd like to use `ON CONFLICT` to skip rows that already exist in the new database, for example users that were created after starting the migration.

To work around the `ON CONFLICT` limitation we'll likely have to use a staging table where we initially import the data before copying it to the live table. This way we can apply `ON CONFLICT` when we copy the data from staging to live.

1. Provision a new **encrypted** S3 bucket and grant both databases access to it.
2. Enable the S3 extension.

   Run the following SQL in both databases to enable the S3 extension:

   ```sql
   CREATE EXTENSION aws_s3 CASCADE;
   ```

3. Export users from the legacy database to S3.

   Run the following SQL in the legacy database, replacing `destination-bucket` and `destination-file` as required:

   ```sql
   SELECT * FROM aws_s3.query_export_to_s3(
       'SELECT id AS identity_id, okta_id, braze_uuid AS braze_id, private_uuid AS private_id, jdoc->>''publicFields.userName'' AS username FROM "user"',
       aws_commons.create_s3_uri('destination-bucket', 'destination-file', 'eu-west-1')
   );
   ```
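
   If we take the batched approach mentioned above, the same export can be restricted to a range of IDs and written to one file per batch. This is a minimal sketch rather than part of the plan itself: it assumes `id` is an integer we can range over, and the batch boundaries and file name are illustrative. The `rows_uploaded` value returned by `aws_s3.query_export_to_s3` can be compared against a count of the same range in the legacy table before moving on.

   ```sql
   -- Export a single batch of users (IDs below 100,000) to its own file in S3.
   SELECT * FROM aws_s3.query_export_to_s3(
       'SELECT id AS identity_id, okta_id, braze_uuid AS braze_id, private_uuid AS private_id, jdoc->>''publicFields.userName'' AS username FROM "user" WHERE id >= 0 AND id < 100000',
       aws_commons.create_s3_uri('destination-bucket', 'users-batch-0', 'eu-west-1')
   );
   ```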

4. Create a staging table and import users from S3.

   Run the following SQL in the new database. The staging table mirrors the columns we exported above and matches the `users_temp` table shown in the diagram:

   ```sql
   CREATE TABLE users_temp (
       identity_id INTEGER PRIMARY KEY,
       okta_id VARCHAR(100) UNIQUE NOT NULL,
       braze_id UUID UNIQUE NOT NULL,
       private_id UUID UNIQUE NOT NULL,
       username VARCHAR(20)
   );

   -- Import the exported file from S3 into the staging table (the options string is passed to COPY and should match the export format).
   SELECT aws_s3.table_import_from_s3(
       'users_temp', '', '(format text)',
       aws_commons.create_s3_uri('destination-bucket', 'destination-file', 'eu-west-1')
   );
   ```

5. Copy data from the staging table to the live table, using `ON CONFLICT` to skip users that already exist (a sketch of this copy is included at the end of this document).

### Migration using postgres_fdw

1. Create a migration user in the legacy database.
2. Allow traffic between the legacy and new databases.
3. Enable the `postgres_fdw` extension in the new database and connect to the legacy database.
4. Copy data from the legacy database to the live table (see the sketch below).
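
The `postgres_fdw` steps above could look roughly like the following. This is a sketch only: the server name, host, database name and credentials are placeholders, the legacy table's schema and column types are assumed from the export query above, and in practice the migration user's password would come from our secrets management rather than being written inline.

```sql
-- Run in the new database.
CREATE EXTENSION postgres_fdw;

-- Connect to the legacy database (placeholder host and database name).
CREATE SERVER legacy_identity
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'legacy-db.internal.example', dbname 'identity', port '5432');

-- Map the local role running the migration to the migration user created in step 1 (placeholder credentials).
CREATE USER MAPPING FOR CURRENT_USER
    SERVER legacy_identity
    OPTIONS (user 'migration_user', password 'changeme');

-- Expose the legacy user table locally, using the same columns as the export query above.
CREATE FOREIGN TABLE legacy_user (
    id INTEGER,
    okta_id VARCHAR(100),
    braze_uuid UUID,
    private_uuid UUID,
    jdoc JSONB
) SERVER legacy_identity OPTIONS (schema_name 'public', table_name 'user');

-- Copy the data straight into the live table, skipping users that already exist.
INSERT INTO users (identity_id, okta_id, braze_id, private_id, username)
SELECT id, okta_id, braze_uuid, private_uuid, jdoc->>'publicFields.userName'
FROM legacy_user
ON CONFLICT (identity_id) DO NOTHING;
```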
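
As a reference for step 5 of the `aws_s3` approach above, here is a minimal sketch of the copy from the staging table to the live table. It assumes the staging table is named `users_temp` (as in the diagram and step 4) and uses `identity_id` as the conflict target; the column list and conflict target would need to match whatever schema we finally settle on.

```sql
-- Copy users from the staging table into the live table, skipping any identity_id
-- that already exists there (for example users created after the migration started).
INSERT INTO users (identity_id, okta_id, braze_id, private_id, username)
SELECT identity_id, okta_id, braze_id, private_id, username
FROM users_temp
ON CONFLICT (identity_id) DO NOTHING;

-- Once the copy has been verified, the staging table can be dropped.
DROP TABLE users_temp;
```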