
Determine how we migrate large instances from CouchDB 1.x to CouchDB 2.0 #3257

Closed
SCdF opened this issue Mar 16, 2017 · 18 comments
Labels
Priority: 2 - Medium (Normal priority) · Type: Technical issue (Improve something that users won't notice) · Upgrading (Affects the upgrading of the app)

Comments

@SCdF
Contributor

SCdF commented Mar 16, 2017

We have some large instances, and it's going to be a pain to migrate them to CouchDB 2.0, primarily because it will force all long-term sessions to be logged out (i.e. all CHWs will have to log back in).

We should look into how we can get around this and, if we definitely can't, into strategies for migrating people over slowly (i.e. running both versions at the same time).

@SCdF SCdF added this to the March 14th - March 28th milestone Mar 16, 2017
@garethbowen garethbowen added the 1 - Scheduled and Upgrading labels Mar 28, 2017
@SCdF
Contributor Author

SCdF commented Apr 6, 2017

In terms of logging people out, CouchDB signs session cookies with an HMAC keyed with the server secret, so, while we need to test this, I think that if we migrate the secret then 1.x sessions will still be valid on 2.0.

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

Cool! If you set the same CouchDB secret on both instances, it looks like you can share cookies:

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ echo '"foo"' | http PUT 'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/couch_httpd_
auth/secret'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 35
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:43:58 GMT
Server: CouchDB/2.0.0 (Erlang OTP/19)
X-Couch-Request-ID: 67eb5a4ec5
X-CouchDB-Body-Time: 0

"1f3563713d875264c1bc6700e0af76eb"
scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ echo '"foo"' | http PUT http://admin:pass@localhost:5985/_config/couch_httpd_auth/secret
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 35
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:45:02 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)

"1f3563713d875264c1bc6700e0af76eb"

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http -f http://demo:medic@localhost:5985/_session "name=demo" "password=medic"
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 98
Content-Type: text/plain; charset=utf-8
Date: Thu, 20 Apr 2017 13:45:14 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)
Set-Cookie: AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly

{
    "name": "demo",
    "ok": true,
    "roles": [
        "district-manager",
        "kujua_user",
        "data_entry",
        "district_admin"
    ]
}

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http http://localhost:5985/medic 'Cookie:AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 239
Content-Type: text/plain; charset=utf-8
Date: Thu, 20 Apr 2017 13:45:45 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)

{
    "committed_update_seq": 96,
    "compact_running": false,
    "data_size": 7685985,
    "db_name": "medic",
    "disk_format_version": 6,
    "disk_size": 23683182,
    "doc_count": 29,
    "doc_del_count": 1,
    "instance_start_time": "1492695345616773",
    "purge_seq": 0,
    "update_seq": 96
}

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http http://localhost:5984/medic 'Cookie:AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 415
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:45:52 GMT
Server: CouchDB/2.0.0 (Erlang OTP/19)
X-Couch-Request-ID: fd319d8c56
X-CouchDB-Body-Time: 0

{
    "compact_running": false,
    "data_size": 265318814,
    "db_name": "medic",
    "disk_format_version": 6,
    "disk_size": 6383088035,
    "doc_count": 54332,
    "doc_del_count": 13602,
    "instance_start_time": "0",
    "other": {
        "data_size": 186070962
    },
    "purge_seq": 0,
    "sizes": {
        "active": 265318814,
        "external": 186070962,
        "file": 6383088035
    },
    "update_seq": "175201-g1AAAABdeJzLYWBgYMpgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUklMiTV____PyuJgWHHNDzq8liAJEMDkPoPUc74-XQWAFwQH3s"
}

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

@browndav so what do you think about this rough plan:

  • Partner X has the old server running the old api, old sentinel, and CouchDB 1.6
  • We create a new server that contains the new api, the new sentinel, and CouchDB 2.0
  • Only CouchDB 2.0 should be running at this point; we don't want the new api or new sentinel running yet
  • Get the secret from 1.6 and put it into 2.0 (see above)
  • Set up continuous replication between CouchDB 1.6 and CouchDB 2.0 directly, with no api in between (see the sketch at the end of this comment)
  • Wait for 2.0's replication to "catch up" with 1.6 (new data is always coming in, so "catch up" means getting rid of the backlog)
  • Warm up the views on CouchDB 2.0
  • Wait again for "catch up"
  • Kill the old api and old sentinel. This stops new data coming in, stops web users accessing the site, and stops restricted users replicating
  • Wait for a full catch-up this time (since we've been catching up along the way it shouldn't take long) until we've got everything
  • Boot up the new api and new sentinel, and wait for them to do the migration dance etc.
  • Do the AWS DNS switcheroo so the new server is now where the old server was, then delete the old server

The main issue here is that we would still be offline while migrations ran. Unfortunately, to solve this we would need to solve the continuous migrations problem and forward-port any migrations that need it into that system: #2012
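
For reference, a rough sketch of the secret-copy, replication, and view-warming steps, assuming the same local setup as the transcript above (1.6 on port 5985, 2.0 on port 5984, node name couchdb@localhost, illustrative admin credentials); the design doc and view names are placeholders:

  # Copy the couch_httpd_auth secret from 1.6 to 2.0 so existing AuthSession cookies stay valid
  SECRET=$(http GET 'http://admin:pass@localhost:5985/_config/couch_httpd_auth/secret')
  echo "$SECRET" | http PUT 'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/couch_httpd_auth/secret'

  # Continuous replication from 1.6 straight into 2.0, no api in between
  # (a doc in the _replicator db would do the same job and survive restarts)
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/medic' \
      target='http://admin:pass@localhost:5984/medic' \
      continuous:=true

  # Warm the views by querying one view from each design doc
  http 'http://admin:pass@localhost:5984/medic/_design/medic/_view/some_view?limit=1'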

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

Oh, to be clear about where this will leave different users:

  • Analytics: would face downtime during the migration, and would have to have their stored seq dropped and started again
  • Internet users: would face downtime during the migration, but would not have to log in again
  • Restricted users: would not be able to replicate up or down while the migration was happening, but would not have to log in again

And of course, while I've tested the simplest version of this, we'd want to actually test this with a real user and our app.

@ghost

ghost commented Apr 20, 2017

This sounds right to me, thanks for putting it together. Just to make sure I'm understanding this correctly: does the safety of this approach depend upon the fact that neither API nor Sentinel are running on the destination CouchDB 2.x instance (i.e. no writes other than replication are occurring)?

@ghost

ghost commented Apr 20, 2017

One other note (and we can build a checklist for this, so no big deal): we can actually swap Elastic IPs to avoid any DNS TTL issues. Not a huge win, but still decreased downtime.

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

does the safety of this approach depend upon the fact that neither API nor Sentinel are running on the destination CouchDB 2.x instance (i.e. no writes other than replication are occurring)

I honestly don't know. At least for API you'd need to wait until you had all the data before upgrading to the newer api and running all the migrations.

Apart from that, there is simply no need for them to be running at this stage, so better safe than sorry.

@sglangevin

Thanks @SCdF for putting this together! Partners will be VERY thankful if we can make this work.

@SCdF
Contributor Author

SCdF commented Apr 27, 2017

TODO:

  • Test this from start to finish with a restricted user to make sure they keep their session

@SCdF
Contributor Author

SCdF commented Apr 28, 2017

Tested this locally by:

  • Setting up a CouchDB 1.6 system with a restricted user on a phone
  • Killing api (I started getting connection errors in the log, as expected)
  • Rebooting CouchDB 1.6 on a different port, and booting CouchDB 2.0 on the normal port
  • Confirming that 2.0 and 1.6 had the same entry for this user in _users (in a prod scenario I imagine we'd replicate the _users db; see the sketch at the end of this comment)
  • Deleting the medic db in 2.0, and then replicating the 1.6 medic db into 2.0
  • Once that completed, warming the caches and booting up api
  • Once api had booted, confirming in the network log that the phone reconnected its replication successfully
  • Testing a change to a person on the server, and seeing that change successfully replicate back down to the phone

@browndav what should the next steps here be? Do you want to test this (I can help) in an AWS environment with multiple servers? Presumably this would also be a good point to upgrade their MedicOS? Would you like me to write up a document about how to do this, or is proving it enough?
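
For reference, a minimal sketch of the _users check and the prod-style _users replication mentioned above, reusing the illustrative ports, credentials and demo user from the earlier transcript:

  # Confirm the same user doc (including salt and derived_key) exists on both sides
  http 'http://admin:pass@localhost:5985/_users/org.couchdb.user:demo'
  http 'http://admin:pass@localhost:5984/_users/org.couchdb.user:demo'

  # In a prod scenario, replicate the whole _users db from 1.6 into 2.0 instead
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/_users' \
      target='http://admin:pass@localhost:5984/_users'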

@SCdF SCdF assigned ghost Apr 28, 2017
@SCdF
Contributor Author

SCdF commented May 5, 2017

Some additional information.

Fauxton shows a warning if more than 50% of the documents in your DB are deleted, because in large DBs this can cause performance problems.

Since our migration will force clients to re-replicate from the _changes feed, depending on this ratio we might want to consider doing a filtered replication to CouchDB 2.0 for our larger partners, so that deleted stubs are not replicated.

We could work this ratio out manually for larger clients by walking their changes feed and counting how many entries have _deleted: true (see the sketch below).
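
A sketch of both ideas, assuming httpie and jq are available and using an illustrative design doc called filters; note that deletions skipped by the filter will never reach 2.0, which is the point of doing it for the initial copy:

  # Count how many entries in the 1.6 changes feed are deletions
  http 'http://admin:pass@localhost:5985/medic/_changes' | \
      jq '[.results[] | select(.deleted == true)] | length'

  # A replication filter on the source that drops deleted docs...
  http PUT http://admin:pass@localhost:5985/medic/_design/filters \
      filters:='{"no_deleted": "function(doc, req) { return !doc._deleted; }"}'

  # ...and a replication to 2.0 that uses it
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/medic' \
      target='http://admin:pass@localhost:5984/medic' \
      filter=filters/no_deleted \
      continuous:=true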

@sglangevin

Since we did a lot of deleting of training data in the early stages of the LG project, this seems like something we should consider.

@garethbowen
Contributor

CouchDB2 is not prioritised for this milestone - removing.

@garethbowen garethbowen removed this from the May 23 - June 6 milestone Jun 6, 2017
@garethbowen garethbowen added the Type: Technical issue and Priority: 2 - Medium labels Feb 21, 2018
@garethbowen
Contributor

garethbowen commented Apr 30, 2018

NB: I think we'll have to increase the max_http_request_size value. In CouchDB 1 the limit was quite small - the default in CouchDB 2 is 4GB. The v3.0 exceeds the old value, so publishing fails with the error:

{
  "error": "too_large",
  "reason": "the request entity is too large",
  "name": "too_large",
  "status": 413,
  "message": "the request entity is too large"
}

@garethbowen
Contributor

It turns out the CouchDB default value is still only 64MB. I've raised an issue with CouchDB to clarify, and @browndav has increased our default in medic-os to 128MiB.
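
A sketch of bumping that limit on a 2.x node, reusing the illustrative node name and credentials from above; note the config section for this setting is httpd in CouchDB 2.x and moved to chttpd in 3.x, so check the docs for the version in use:

  # 134217728 bytes = 128 MiB; config values are JSON strings
  echo '"134217728"' | http PUT \
      'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/httpd/max_http_request_size'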

SCdF added a commit to medic/horticulturalist that referenced this issue May 23, 2018
You can now install, stage and complete a staged install from the
command line. Running horti daemonless now actually makes sense, with it
performing the given action and then stopping.

This change allows for easier migrations from 2.x to 3.0 because once
you've replicated data over to the new instance you can run horti with:

  horti --stage=3.0.0 --no-daemon

To pre-prepare the instance as much as possible for the deploy. Once
you're ready to make the switch you can run:

  horti --complete-install --no-daemon

And once that is done run horti as you would normally do via
supervisor.d (or just run horti --complete-install and have the daemon
run from there)

medic/cht-core#3257
@SCdF
Contributor Author

SCdF commented May 29, 2018

Apart from the horti PR linked above, I don't think there is any more work to do here. @browndav, can you confirm so we can close this ticket?

@garethbowen
Contributor

This looks to be in good shape. Ready for release and testing with an actual project.
