
Determine how we migrate large instances from CouchDB 1.x to CouchDB 2.0 #3257

Closed
SCdF opened this issue Mar 16, 2017 · 18 comments
Labels
Priority: 2 - Medium (Normal priority) · Type: Technical issue (Improve something that users won't notice) · Upgrading (Affects the upgrading of the app)

Comments

@SCdF
Contributor

SCdF commented Mar 16, 2017

We have some large instances, and it's going to be a pain to migrate them to CouchDB 2.0, primarily because it will force all long-term sessions to be logged out (i.e. all CHWs will have to log back in).

We should look into how we can get around this and, if we definitely can't, into strategies for migrating people over slowly (i.e. running both versions at the same time).

@SCdF SCdF added this to the March 14th - March 28th milestone Mar 16, 2017
@garethbowen garethbowen added the 1 - Scheduled and Upgrading labels Mar 28, 2017
@SCdF
Contributor Author

SCdF commented Apr 6, 2017

In terms of logging people out, CouchDB signs session cookies with an HMAC keyed with the server secret, so, while we need to test this, I think that if we migrate the secret then 1.x sessions will still be valid on 2.0.

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

Cool! If you set the same CouchDB secret on both instances, it looks like you can share cookies:

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ echo '"foo"' | http PUT 'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/couch_httpd_
auth/secret'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 35
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:43:58 GMT
Server: CouchDB/2.0.0 (Erlang OTP/19)
X-Couch-Request-ID: 67eb5a4ec5
X-CouchDB-Body-Time: 0

"1f3563713d875264c1bc6700e0af76eb"
scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ echo '"foo"' | http PUT http://admin:pass@localhost:5985/_config/couch_httpd_auth/secret
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 35
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:45:02 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)

"1f3563713d875264c1bc6700e0af76eb"

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http -f http://demo:medic@localhost:5985/_session "name=demo" "password=medic"
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 98
Content-Type: text/plain; charset=utf-8
Date: Thu, 20 Apr 2017 13:45:14 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)
Set-Cookie: AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly

{
    "name": "demo",
    "ok": true,
    "roles": [
        "district-manager",
        "kujua_user",
        "data_entry",
        "district_admin"
    ]
}

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http http://localhost:5985/medic 'Cookie:AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 239
Content-Type: text/plain; charset=utf-8
Date: Thu, 20 Apr 2017 13:45:45 GMT
Server: CouchDB/1.6.1 (Erlang OTP/19)

{
    "committed_update_seq": 96,
    "compact_running": false,
    "data_size": 7685985,
    "db_name": "medic",
    "disk_format_version": 6,
    "disk_size": 23683182,
    "doc_count": 29,
    "doc_del_count": 1,
    "instance_start_time": "1492695345616773",
    "purge_seq": 0,
    "update_seq": 96
}

scdf at SCdF in ~/Code/Medic/medic-webapp on master
$ http http://localhost:5984/medic 'Cookie:AuthSession=ZGVtbzo1OEY4QkI2QTotL2j7KssOEvEpaKNJ8Kj2LO3c_A; Version=1; Path=/; HttpOnly'
HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Length: 415
Content-Type: application/json
Date: Thu, 20 Apr 2017 13:45:52 GMT
Server: CouchDB/2.0.0 (Erlang OTP/19)
X-Couch-Request-ID: fd319d8c56
X-CouchDB-Body-Time: 0

{
    "compact_running": false,
    "data_size": 265318814,
    "db_name": "medic",
    "disk_format_version": 6,
    "disk_size": 6383088035,
    "doc_count": 54332,
    "doc_del_count": 13602,
    "instance_start_time": "0",
    "other": {
        "data_size": 186070962
    },
    "purge_seq": 0,
    "sizes": {
        "active": 265318814,
        "external": 186070962,
        "file": 6383088035
    },
    "update_seq": "175201-g1AAAABdeJzLYWBgYMpgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUklMiTV____PyuJgWHHNDzq8liAJEMDkPoPUc74-XQWAFwQH3s"
}

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

@browndav so what do you think about this rough plan:

  • Partner X has the old server running the old api, old sentinel, and CouchDB 1.6
  • We create a new server that contains the new api, the new sentinel, and CouchDB 2.0
  • Only CouchDB 2.0 should be running at this point; we don't want the new api or new sentinel running yet
  • Get the secret from 1.6 and put it into 2.0 (see above)
  • Set up continuous replication between CouchDB 1.6 and CouchDB 2.0 directly, with no api in between (see the sketch at the end of this comment)
  • Wait for 2.0's replication to "catch up" with 1.6 (new data is always coming in, so "catch up" means getting rid of the backlog)
  • Warm up the views on CouchDB 2.0
  • Wait again for "catch up"
  • Kill the old api and old sentinel. This stops new data coming in, stops web users accessing the site, and stops restricted users replicating
  • Wait for a full catch-up this time (since we've been catching up along the way it shouldn't take long) until we've got everything
  • Boot up the new api and new sentinel, and wait for them to do the migration dance etc.
  • Do the AWS DNS switcheroo so the new server is now where the old server was, then delete the old server

The main issue here is that we would still be offline while migrations ran. Unfortunately, to solve this we would need to solve the continuous migrations problem and forward-port any migrations that need it into that system: #2012
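
For reference, a rough sketch of the secret-copy, replication, and view-warming steps, assuming the same local setup as the transcript above (1.6 on port 5985, 2.0 on port 5984, node name couchdb@localhost, illustrative admin credentials); the design doc and view names are placeholders:

  # Copy the couch_httpd_auth secret from 1.6 to 2.0 so existing AuthSession cookies stay valid
  SECRET=$(http GET 'http://admin:pass@localhost:5985/_config/couch_httpd_auth/secret')
  echo "$SECRET" | http PUT 'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/couch_httpd_auth/secret'

  # Continuous replication from 1.6 straight into 2.0, no api in between
  # (a doc in the _replicator db would do the same job and survive restarts)
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/medic' \
      target='http://admin:pass@localhost:5984/medic' \
      continuous:=true

  # Warm the views by querying one view from each design doc
  http 'http://admin:pass@localhost:5984/medic/_design/medic/_view/some_view?limit=1'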

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

Oh, to be clear about where this will leave different users:

  • Analytics: would face downtime during the migration, and would have to have their stored seq dropped and started again
  • Internet users: would face downtime during the migration, but would not have to log in again
  • Restricted users: would not be able to replicate up or down while the migration was happening, but would not have to log in again

And of course, while I've tested the simplest version of this, we'd want to actually test this with a real user and our app.

@ghost

ghost commented Apr 20, 2017

This sounds right to me, thanks for putting it together. Just to make sure I'm understanding this correctly: does the safety of this approach depend upon the fact that neither API nor Sentinel are running on the destination CouchDB 2.x instance (i.e. no writes other than replication are occurring)?

@ghost

ghost commented Apr 20, 2017

One other note (and we can build a checklist for this, so no big deal): we can actually swap Elastic IPs to avoid any DNS TTL issues. Not a huge win, but still decreased downtime.

@SCdF
Contributor Author

SCdF commented Apr 20, 2017

does the safety of this approach depend upon the fact that neither API nor Sentinel are running on the destination CouchDB 2.x instance (i.e. no writes other than replication are occurring)

I honestly don't know. At least for API you'd need to wait until you had all the data before upgrading to the newer api and running all the migrations.

Apart from that, there is simply no need for them to be running at this stage, so better safe than sorry.

@sglangevin

Thanks @SCdF for putting this together! Partners will be VERY thankful if we can make this work.

@SCdF
Contributor Author

SCdF commented Apr 27, 2017

TODO:

  • Test this from start to finish with a restricted user to make sure they keep their session

@SCdF
Contributor Author

SCdF commented Apr 28, 2017

Tested this locally by:

  • Setting up a CouchDB 1.6 system with a restricted user on a phone
  • Killing api (I started getting connection errors in the log, as expected)
  • Rebooting CouchDB 1.6 on a different port, and booting CouchDB 2.0 on the normal port
  • Confirming that 2.0 and 1.6 had the same entry for this user in _users (in a prod scenario I imagine we'd replicate the _users db; see the sketch at the end of this comment)
  • Deleting the medic db in 2.0, and then replicating the 1.6 medic db into 2.0
  • Once that completed, warming the caches and booting up api
  • Once api had booted, confirming in the network log that the phone reconnected its replication successfully
  • Testing a change to a person on the server, and seeing that change successfully replicate back down to the phone

@browndav what should the next steps here be? Do you want to test this (I can help) in an AWS environment with multiple servers? Presumably this would also be a good point to upgrade their MedicOS? Would you like me to write up a document about how to do this, or is proving it enough?
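
For reference, a minimal sketch of the _users check and the prod-style _users replication mentioned above, reusing the illustrative ports, credentials and demo user from the earlier transcript:

  # Confirm the same user doc (including salt and derived_key) exists on both sides
  http 'http://admin:pass@localhost:5985/_users/org.couchdb.user:demo'
  http 'http://admin:pass@localhost:5984/_users/org.couchdb.user:demo'

  # In a prod scenario, replicate the whole _users db from 1.6 into 2.0 instead
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/_users' \
      target='http://admin:pass@localhost:5984/_users'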

@SCdF SCdF assigned ghost Apr 28, 2017
@SCdF
Contributor Author

SCdF commented May 5, 2017

Some additional information.

Fauxton shows a warning if more than 50% of the documents in your DB are deleted, because in large DBs this can cause performance problems.

Since our migration will force clients to re-replicate from the _changes feed, depending on this ratio we might want to consider doing a filtered replication to CouchDB 2.0 for our larger partners, so that deleted stubs are not replicated.

We could work this ratio out manually for larger clients by walking their changes feed and counting how many entries have _deleted: true (see the sketch below).
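
A sketch of both ideas, assuming httpie and jq are available and using an illustrative design doc called filters; note that deletions skipped by the filter will never reach 2.0, which is the point of doing it for the initial copy:

  # Count how many entries in the 1.6 changes feed are deletions
  http 'http://admin:pass@localhost:5985/medic/_changes' | \
      jq '[.results[] | select(.deleted == true)] | length'

  # A replication filter on the source that drops deleted docs...
  http PUT http://admin:pass@localhost:5985/medic/_design/filters \
      filters:='{"no_deleted": "function(doc, req) { return !doc._deleted; }"}'

  # ...and a replication to 2.0 that uses it
  http POST http://admin:pass@localhost:5984/_replicate \
      source='http://admin:pass@localhost:5985/medic' \
      target='http://admin:pass@localhost:5984/medic' \
      filter=filters/no_deleted \
      continuous:=true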

@sglangevin

Since we did a lot of deleting of training data in the early stages of the LG project, this seems like something we should consider.

@garethbowen
Contributor

CouchDB2 is not prioritised for this milestone - removing.

@garethbowen garethbowen removed this from the May 23 - June 6 milestone Jun 6, 2017
@garethbowen garethbowen added the Type: Technical issue and Priority: 2 - Medium labels Feb 21, 2018
@garethbowen
Contributor

garethbowen commented Apr 30, 2018

NB: I think we'll have to increase the max_http_request_size value. In CouchDB 1 the limit was quite small - the default in CouchDB 2 is 4GB. The v3.0 exceeds the old value, so publishing fails with the error:

{
  "error": "too_large",
  "reason": "the request entity is too large",
  "name": "too_large",
  "status": 413,
  "message": "the request entity is too large"
}

@garethbowen
Contributor

It turns out the CouchDB default value is still only 64MB. I've raised an issue with CouchDB to clarify, and @browndav has increased our default in medic-os to 128MiB.
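
A sketch of bumping that limit on a 2.x node, reusing the illustrative node name and credentials from above; note the config section for this setting is httpd in CouchDB 2.x and moved to chttpd in 3.x, so check the docs for the version in use:

  # 134217728 bytes = 128 MiB; config values are JSON strings
  echo '"134217728"' | http PUT \
      'http://admin:pass@localhost:5984/_node/couchdb@localhost/_config/httpd/max_http_request_size'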

SCdF added a commit to medic/horticulturalist that referenced this issue May 23, 2018
You can now install, stage and complete a staged install from the
command line. Running horti daemonless now actually makes sense, with it
performing the given action and then stopping.

This change allows for easier migrations from 2.x to 3.0 because once
you've replicated data over to the new instance you can run horti with:

  horti --stage=3.0.0 --no-daemon

To pre-prepare the instance as much as possible for the deploy. Once
you're ready to make the switch you can run:

  horti --complete-install --no-daemon

And once that is done run horti as you would normally do via
supervisor.d (or just run horti --complete-install and have the daemon
run from there)

medic/cht-core#3257
@SCdF
Contributor Author

SCdF commented May 29, 2018

Apart from the horti PR linked above, I don't think there is any more work to do here. @browndav, can you confirm so we can close this ticket?

@garethbowen
Contributor

This looks to be in good shape. Ready for release and testing with an actual project.
