Various Backup and restore improvements #174
Conversation
@@ -108,38 +108,46 @@ func StartDatabaseTemplate(t *testing.T, namespaceName string, adminPod string,

	installationStep(t, options, helmChartReleaseName)

	AwaitNrReplicasScheduled(t, namespaceName, tePodNameTemplate, opt.NrTePods)
	AwaitNrReplicasScheduled(t, namespaceName, smPodName, opt.NrSmPods)
	if awaitDatabase {
Sorry for the diff mess. The change done here just wraps the database await logic in if awaitDatabase { ... }.
The change starts here.
		AwaitDatabaseUp(t, namespaceName, adminPod, opt.DbName, opt.NrSmPods+opt.NrTePods)
	}
The change ends here.
stable/database/files/nuosm
Outdated
fi
}

function releaseLockedArchive() {
This is just resurrecting the archive object that is still in the domain state as removed. I don't understand why this is called releaseLockedArchive. Can we call this resurrectRemovedArchive and update the output message accordingly?
function wrapLogfile() {
  logsize=$( du -sb $LOGFILE | grep -o '^ *[0-9]\+' )
  maxlog=5000000
  log "logsize=$logsize; maxlog=$maxlog"
  if [ ${logsize:=0} -gt $maxlog ]; then
    lines=$(wc -l $LOGFILE)
    tail -n $(( lines / 2 )) $LOGFILE > ${LOGFILE}-new
    rm -rf $LOGFILE
    mv $LOGFILE-new $LOGFILE
    log "(nuosm) log file wrapped around"
  fi
}
How much logging do we expect this to generate that we are not just emitting to standard output? And do we want to support log persistence for this sort of output?
I see the value of having a persistent nuosm log which is available between container restarts. All restore/import logic is executed only once, and it's easy to lose the logging from it, especially if the user tried to resolve a restore problem on their own. Of course, all this only works if log persistence is enabled for SMs.
Example restore log:
===========================================
logsize=1487; maxlog=5000000
Directory /var/opt/nuodb/archive/nuodb/demo exists
myArchive=0; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=20201229T225653; restore_requested=20201229T225653; path=/var/opt/nuodb/archive/nuodb/demo; atoms=78; catalogs=213
First-in = sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
I am first-in: sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0 == sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
Deleting archiveId=1
Restoring 20201229T225653; existing archive directores: total 4
drwxr-xr-x 37 nuodb root 4096 Dec 29 22:57 demo
(restore) recreated /var/opt/nuodb/archive/nuodb/demo; atoms=0
Calling nuodocker to restore 20201229T225653 into /var/opt/nuodb/archive/nuodb/demo
Finished restoring /var/opt/nuodb/backup/20201229T225653 to /var/opt/nuodb/archive/nuodb/demo. Created archive with archive ID 2
Deleting my archive metadata: 0
Restored from 20201229T225653
Clearing restore credential request from raft
[2] <NO VALUE> : /var/opt/nuodb/archive/nuodb/demo @ demo [journal_path = ] [snapshot_archive_path = ] NOT_RUNNING
Example normal startup:
===========================================
logsize=44; maxlog=5000000
Created new dir /var/opt/nuodb/archive/nuodb/demo
myArchive=-1; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=; restore_requested=; path=/var/opt/nuodb/archive/nuodb/demo; atoms=0; catalogs=0
EDIT: I've also added a timestamp for all log messages.
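As a side note on the wrapLogfile snippet quoted above: lines=$(wc -l $LOGFILE) captures both the count and the filename, which breaks the halving arithmetic. A corrected sketch is below; the LOGFILE default, maxlog override, and log helper are illustrative stand-ins, not the real nuosm code.

```shell
LOGFILE="${LOGFILE:-/tmp/nuosm-demo.log}"   # illustrative path, not the real one
maxlog="${maxlog:-5000000}"                 # wrap threshold in bytes

log() { echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $*" >> "$LOGFILE"; }

function wrapLogfile() {
  local logsize lines
  logsize=$( du -sb "$LOGFILE" | grep -o '^[0-9]\+' )
  log "logsize=${logsize:=0}; maxlog=$maxlog"
  if [ "$logsize" -gt "$maxlog" ]; then
    # reading via stdin makes wc print only the count, not "count filename"
    lines=$( wc -l < "$LOGFILE" )
    tail -n $(( lines / 2 )) "$LOGFILE" > "${LOGFILE}-new"
    mv "${LOGFILE}-new" "$LOGFILE"          # mv replaces the target; no rm needed
    log "(nuosm) log file wrapped around"
  fi
}
```

The mv alone is enough to replace the old file, so the rm -rf in the original is redundant.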
stable/database/files/nuosm
Outdated
}

function isUrl() {
  if echo $1 | grep -q '^[a-z]\+:/[^ ]\+'; then
We should quote the argument, i.e. "$1". Also, do we actually need the if; then; else; fi? Isn't it the same to just invoke this command? I guess we would get whatever non-0 exit code the command returns instead of 1, but I'm not sure if that matters.
I'm fine removing the if/else but I prefer to keep the return $? so that it's clear what the function returns.
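For illustration, the trimmed-down version being discussed might look like this (a sketch, not the final nuosm code):

```shell
function isUrl() {
  # quoted argument; matches a scheme followed by ":/", e.g. http://host/path
  echo "$1" | grep -q '^[a-z]\+:/[^ ]\+'
  return $?
}
```

The return $? is technically redundant, since a function returns the status of its last command anyway, but keeping it documents the intent.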
stable/database/files/nuosm
Outdated
}

function isRestoreSourceAvailable() {
  if isUrl "$restore_source" || [ -d $NUODB_BACKUPDIR/$restore_source ]; then
Same comment as above about if; then; else; fi. Also, this has to be called in a scope that defines restore_source. I think it would be better to make that an explicit argument (i.e. "$1").
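A sketch of that suggestion, with the restore source passed as an explicit argument; the NUODB_BACKUPDIR default below is only for illustration:

```shell
NUODB_BACKUPDIR="${NUODB_BACKUPDIR:-/var/opt/nuodb/backup}"

function isUrl() {
  echo "$1" | grep -q '^[a-z]\+:/[^ ]\+'
}

function isRestoreSourceAvailable() {
  # $1 is the restore source: either a URL or a backupset directory name
  isUrl "$1" || [ -d "$NUODB_BACKUPDIR/$1" ]
}
```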
stable/database/files/nuosm
Outdated
function releaseLockedArchive() {
  trace "releasing my locked archive metadata"
  locked_archive=$( nuocmd show archives --db-name $DB_NAME --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
I'm not familiar with the sed syntax used here, but this seems to assume that there is only one archive object associated with the current host (the SM pod hostname/FQDN) and that it has been started at least once. Can you include a comment describing what this doozy of a one-liner does (maybe @NikTJ777 can explain)?
This is my understanding as well. It expects the archiveId on the first line followed by the exited process on the next line. AFAIK this is always the case in docker deployments, as the process is started right after the archive is created by nuodocker start sm.
I will add a comment.
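To illustrate what the one-liner does, here it is run against a fabricated two-archive sample (the sample text is made up for this demo; the real "archive-id:" lines come from nuocmd show archives with --removed-archive-format):

```shell
DB_NAME=demo
HOSTNAME=sm-database-nuodb-cluster0-demo-hotcopy-0

# Fabricated sample: each "archive-id:" line (produced by
# --removed-archive-format) is followed by the exited process line.
sample='archive-id: 0
[0] sm-database-nuodb-cluster0-demo-hotcopy-0 ... EXITED
archive-id: 1
[1] sm-database-nuodb-cluster0-demo-hotcopy-1 ... EXITED'

# For each "archive-id:" line, N appends the following process line; if the
# two-line pair mentions this host, s/.../\1/ keeps only the numeric id;
# T skips pairs where no substitution happened; p prints the surviving id.
locked_archive=$( echo "$sample" |
  sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" |
  head -n 1 )
echo "locked_archive=$locked_archive"
```

The head -n 1 encodes the assumption discussed above: at most one removed archive per host is expected.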
stable/database/files/nuosm
Outdated
    --db-name $DB_NAME \
    --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
  [ -z "$myArchive" ] && myArchive="-1"
  log "myArchive=$myArchive; DB=$DB_NAME; hostname=$HOSTNAME"
Can we log archiveID instead of myArchive, which reads weirdly?
# take ownership of the SM startup semaphore
$NUOCMD set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
trace "Take ownership of SM startup semaphore"
nuocmd set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
What is this KV wrangling for? This is probably a question for @NikTJ777.
As far as I can tell, there are two main KV keys used for synchronization:
- $NUODB_RESTORE_REQUEST_PREFIX/$DB_NAME/first - this is held by the SM which will perform the restore. Any other SM will wait for the "first" SM to go into RUNNING state before it starts.
- $startup_key/$DB_NAME - this is held by every SM when it starts, so that SMs start in sequential order and we don't have more than one SYNCing SM at a time. This probably addresses an old issue where we had problems when more than two SMs started SYNCing.
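To make the "first" semaphore concrete, here is a toy simulation. kv_cas is a hypothetical in-memory stand-in for a conditional set in the admin KV store; only the key names mirror the description above, the rest is illustrative.

```shell
# Toy in-memory KV store standing in for the admin layer's key-value API.
declare -A KV

kv_cas() {  # kv_cas <key> <expected> <new>: set only if current value == expected
  [ "${KV[$1]:-}" = "$2" ] && KV["$1"]="$3"
}

NUODB_RESTORE_REQUEST_PREFIX="/nuosm/restore"
DB_NAME="demo"
first_key="$NUODB_RESTORE_REQUEST_PREFIX/$DB_NAME/first"

# The first SM to claim the key performs the restore...
if kv_cas "$first_key" "" "sm-hotcopy-0"; then
  echo "sm-hotcopy-0 performs the restore"
fi
# ...any later SM fails the conditional set and instead waits for the
# first SM to reach RUNNING state.
if ! kv_cas "$first_key" "" "sm-hotcopy-1"; then
  echo "sm-hotcopy-1 waits for the first SM to reach RUNNING"
fi
```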
Looks good, and it is good that we have test coverage of this now.
We should definitely move a lot of this functionality into nuodocker. The behavior of blocking process startup based on some key in the KV store is useful and something that we could generalize, because it also has other use cases like diagnostics (e.g. someone wants to figure out why an SM keeps crashing and wants to access its archive directory, but that is only possible in K8s if we inhibit startup so that it does not crash again).
Thanks for getting this over the line!
I only just got round to looking at this since I was on hols. I was mostly interested in whether all the issues found during welab testing were addressed. I thought I would leave this comment here just to confirm that, as far as I can see, the changes do resolve them all.
@acabrele Thanks for checking it! The three issues in the PR description should have been fixed by this PR (two of them were found during the initial B&R testing by the QA team). If you find any major problems, please open issues so that we can fix them as well.
It also fixes some other issues not listed :-)
The primary work was done by @NikTJ777 and posted in #80.
This PR contains his work and adds some extra fixes, improvements, and test automation. The main issues addressed by this change are described below.
Issue
Running DB restore to a specific backupset fails if a non-HC SM is started first when autoRestart is enabled. As non-HC SMs don't have persistent storage for backups, they can only restore/import a database from a URL. If the restore source is not found or is not a URL, they should fail so that an HC SM that has access to the backupset can restore the database from it.

Issue
An invalid restore source URL or credentials are not handled well, and the SM continues to start even if the restore/import has failed. The SM ends up SYNCing when nuosm should exit with an error.

Issue
If restore.autoRestart is set to false, this doesn't disable database auto restart.

Changes
- The nuosm script now causes the SM startup to exit in the event of fatal errors during restore. Any user errors like an invalid restore source or credentials cause the startup to fail fast, signaling to the user that there is invalid input.
- Improved messages in the nuosm and nuorestore scripts to better guide the user's expectations.
- Improved autoRestore, autoImport and in-place restore source URL handling.
- Test automation for the database and restore charts.

Testing
- autoRestore
- autoImport and in-place restore using multiple SMs