Various Backup and restore improvements #174
Conversation
@@ -108,38 +108,46 @@ func StartDatabaseTemplate(t *testing.T, namespaceName string, adminPod string,

	installationStep(t, options, helmChartReleaseName)

	AwaitNrReplicasScheduled(t, namespaceName, tePodNameTemplate, opt.NrTePods)
	AwaitNrReplicasScheduled(t, namespaceName, smPodName, opt.NrSmPods)
	if awaitDatabase {
Sorry for the diff mess. The change done here just wraps the database await logic in if awaitDatabase { ... }.
The change starts here.
		AwaitDatabaseUp(t, namespaceName, adminPod, opt.DbName, opt.NrSmPods+opt.NrTePods)
	}
The change ends here.
stable/database/files/nuosm
Outdated
fi
}

function releaseLockedArchive() {
This is just resurrecting the archive object that is still in the domain state as removed. I don't understand why this is called releaseLockedArchive. Can we call this resurrectRemovedArchive and update the output message accordingly?
function wrapLogfile() {
  logsize=$( du -sb $LOGFILE | grep -o '^ *[0-9]\+' )
  maxlog=5000000
  log "logsize=$logsize; maxlog=$maxlog"
  if [ ${logsize:=0} -gt $maxlog ]; then
    lines=$(wc -l $LOGFILE)
    tail -n $(( lines / 2 )) $LOGFILE > ${LOGFILE}-new
    rm -rf $LOGFILE
    mv $LOGFILE-new $LOGFILE
    log "(nuosm) log file wrapped around"
  fi
}
How much logging do we expect this to generate that we are not just emitting to standard output? And do we want to support log persistence for this sort of output?
I see the value of having a persistent nuosm log which is available between container restarts. All restore/import logic is executed only once, and it's easy to lose the logging from it, especially if the user tried to resolve a restore problem on their own. Of course, all this only works if log persistence is enabled for SMs.
Example restore log:
===========================================
logsize=1487; maxlog=5000000
Directory /var/opt/nuodb/archive/nuodb/demo exists
myArchive=0; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=20201229T225653; restore_requested=20201229T225653; path=/var/opt/nuodb/archive/nuodb/demo; atoms=78; catalogs=213
First-in = sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
I am first-in: sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0 == sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
Deleting archiveId=1
Restoring 20201229T225653; existing archive directores: total 4
drwxr-xr-x 37 nuodb root 4096 Dec 29 22:57 demo
(restore) recreated /var/opt/nuodb/archive/nuodb/demo; atoms=0
Calling nuodocker to restore 20201229T225653 into /var/opt/nuodb/archive/nuodb/demo
Finished restoring /var/opt/nuodb/backup/20201229T225653 to /var/opt/nuodb/archive/nuodb/demo. Created archive with archive ID 2
Deleting my archive metadata: 0
Restored from 20201229T225653
Clearing restore credential request from raft
[2] <NO VALUE> : /var/opt/nuodb/archive/nuodb/demo @ demo [journal_path = ] [snapshot_archive_path = ] NOT_RUNNING
Example normal startup:
===========================================
logsize=44; maxlog=5000000
Created new dir /var/opt/nuodb/archive/nuodb/demo
myArchive=-1; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=; restore_requested=; path=/var/opt/nuodb/archive/nuodb/demo; atoms=0; catalogs=0
EDIT: I've also added a timestamp for all log messages.
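As a side note on the wrapLogfile snippet quoted above: lines=$(wc -l $LOGFILE) captures both the count and the filename, which breaks the halving arithmetic. A corrected sketch is below; the LOGFILE default, maxlog override, and log helper are illustrative stand-ins, not the real nuosm code.

```shell
LOGFILE="${LOGFILE:-/tmp/nuosm-demo.log}"   # illustrative path, not the real one
maxlog="${maxlog:-5000000}"                 # wrap threshold in bytes

log() { echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $*" >> "$LOGFILE"; }

function wrapLogfile() {
  local logsize lines
  logsize=$( du -sb "$LOGFILE" | grep -o '^[0-9]\+' )
  log "logsize=${logsize:=0}; maxlog=$maxlog"
  if [ "$logsize" -gt "$maxlog" ]; then
    # reading via stdin makes wc print only the count, not "count filename"
    lines=$( wc -l < "$LOGFILE" )
    tail -n $(( lines / 2 )) "$LOGFILE" > "${LOGFILE}-new"
    mv "${LOGFILE}-new" "$LOGFILE"          # mv replaces the target; no rm needed
    log "(nuosm) log file wrapped around"
  fi
}
```

The mv alone is enough to replace the old file, so the rm -rf in the original is redundant.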
stable/database/files/nuosm
Outdated
}

function isUrl() {
  if echo $1 | grep -q '^[a-z]\+:/[^ ]\+'; then
We should quote the argument, i.e. "$1". Also, do we actually need the if; then; else; fi? Isn't it the same to just invoke this command? I guess we would get whatever non-0 exit code the command returns instead of 1, but I'm not sure if that matters.
I'm fine removing the if/else but I prefer to keep the return $? so that it's clear what the function returns.
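For illustration, the trimmed-down version being discussed might look like this (a sketch, not the final nuosm code):

```shell
function isUrl() {
  # quoted argument; matches a scheme followed by ":/", e.g. http://host/path
  echo "$1" | grep -q '^[a-z]\+:/[^ ]\+'
  return $?
}
```

The return $? is technically redundant, since a function returns the status of its last command anyway, but keeping it documents the intent.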
stable/database/files/nuosm
Outdated
}

function isRestoreSourceAvailable() {
  if isUrl "$restore_source" || [ -d $NUODB_BACKUPDIR/$restore_source ]; then
Same comment as above about if; then; else; fi. Also, this has to be called in a scope that defines restore_source. I think it would be better to make that an explicit argument (i.e. "$1").
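A sketch of that suggestion, with the restore source passed as an explicit argument; the NUODB_BACKUPDIR default below is only for illustration:

```shell
NUODB_BACKUPDIR="${NUODB_BACKUPDIR:-/var/opt/nuodb/backup}"

function isUrl() {
  echo "$1" | grep -q '^[a-z]\+:/[^ ]\+'
}

function isRestoreSourceAvailable() {
  # $1 is the restore source: either a URL or a backupset directory name
  isUrl "$1" || [ -d "$NUODB_BACKUPDIR/$1" ]
}
```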
stable/database/files/nuosm
Outdated
function releaseLockedArchive() {
  trace "releasing my locked archive metadata"
  locked_archive=$( nuocmd show archives --db-name $DB_NAME --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
I'm not familiar with the sed syntax used here, but this seems to assume that there is only one archive object associated with the current host (the SM pod hostname/FQDN) and that it has been started at least once. Can you include a comment describing what this doozy of a one-liner does (maybe @NikTJ777 can explain)?
This is my understanding as well. It expects the archiveId on the first line followed by the exited process on the next line. AFAIK this is always the case in docker deployments, as the process is started right after the archive is created by nuodocker start sm.
I will add a comment.
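To illustrate what the one-liner does, here it is run against a fabricated two-archive sample (the sample text is made up for this demo; the real "archive-id:" lines come from nuocmd show archives with --removed-archive-format):

```shell
DB_NAME=demo
HOSTNAME=sm-database-nuodb-cluster0-demo-hotcopy-0

# Fabricated sample: each "archive-id:" line (produced by
# --removed-archive-format) is followed by the exited process line.
sample='archive-id: 0
[0] sm-database-nuodb-cluster0-demo-hotcopy-0 ... EXITED
archive-id: 1
[1] sm-database-nuodb-cluster0-demo-hotcopy-1 ... EXITED'

# For each "archive-id:" line, N appends the following process line; if the
# two-line pair mentions this host, s/.../\1/ keeps only the numeric id;
# T skips pairs where no substitution happened; p prints the surviving id.
locked_archive=$( echo "$sample" |
  sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" |
  head -n 1 )
echo "locked_archive=$locked_archive"
```

The head -n 1 encodes the assumption discussed above: at most one removed archive per host is expected.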
stable/database/files/nuosm
Outdated
    --db-name $DB_NAME \
    --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
  [ -z "$myArchive" ] && myArchive="-1"
  log "myArchive=$myArchive; DB=$DB_NAME; hostname=$HOSTNAME"
Can we log archiveID instead of myArchive, which reads weirdly?
# take ownership of the SM startup semaphore
$NUOCMD set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
trace "Take ownership of SM startup semaphore"
nuocmd set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
What is this KV wrangling for? This is probably a question for @NikTJ777.
As far as I can tell, there are two main KV keys used for synchronization:
- $NUODB_RESTORE_REQUEST_PREFIX/$DB_NAME/first - this is held by the SM which will perform the restore. Any other SM will wait for the "first" SM to go into RUNNING state before it starts.
- $startup_key/$DB_NAME - this is held by every SM when it starts, so that SMs start in sequential order and we don't have more than one SYNCing SM at a time. This probably addresses an old issue where we had problems when more than two SMs started SYNCing.
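To make the "first" semaphore concrete, here is a toy simulation. kv_cas is a hypothetical in-memory stand-in for a conditional set in the admin KV store; only the key names mirror the description above, the rest is illustrative.

```shell
# Toy in-memory KV store standing in for the admin layer's key-value API.
declare -A KV

kv_cas() {  # kv_cas <key> <expected> <new>: set only if current value == expected
  [ "${KV[$1]:-}" = "$2" ] && KV["$1"]="$3"
}

NUODB_RESTORE_REQUEST_PREFIX="/nuosm/restore"
DB_NAME="demo"
first_key="$NUODB_RESTORE_REQUEST_PREFIX/$DB_NAME/first"

# The first SM to claim the key performs the restore...
if kv_cas "$first_key" "" "sm-hotcopy-0"; then
  echo "sm-hotcopy-0 performs the restore"
fi
# ...any later SM fails the conditional set and instead waits for the
# first SM to reach RUNNING state.
if ! kv_cas "$first_key" "" "sm-hotcopy-1"; then
  echo "sm-hotcopy-1 waits for the first SM to reach RUNNING"
fi
```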
Looks good, and it is good that we have test coverage of this now.
We should definitely move a lot of this functionality into nuodocker. The behavior of blocking process startup based on some key in the KV store is useful and something that we could generalize, because it also has other use cases like diagnostics (e.g. someone wants to figure out why an SM keeps crashing and wants to access its archive directory, but that is only possible in K8s if we inhibit startup so that it does not crash again).
Thanks for getting this over the line!
I only just got round to looking at this since I was on hols. I was mostly interested in whether all the issues found during welab testing were addressed. I thought I would leave this comment here just to confirm that, as far as I can see, the changes do resolve them all.
@acabrele Thanks for checking it! The three issues in the PR description should have been fixed by this PR (two of them were found during the initial B&R testing by the QA team). If you find any major problems, please open issues so that we can fix them as well.
It also fixes some other issues not listed :-)
The primary work was done by @NikTJ777 and posted in #80.
This PR contains his work and adds some extra fixes, improvements, and test automation. The main issues addressed by this change are described below.
Issue
Running DB restore to a specific backupset fails if a non-HC SM is started first when autoRestart is enabled. As non-HC SMs don't have persistent storage for backups, they can only restore/import a database from a URL. If the restore source is not found or is not a URL, they should fail so that an HC SM that has access to the backupset can restore the database from it.

Issue
An invalid restore source URL or credentials are not handled well, and the SM continues to start even if the restore/import has failed. The SM ends up SYNCing when nuosm should exit with an error.

Issue
If restore.autoRestart is set to false, this doesn't disable database auto restart.

Changes
- The nuosm script now causes the SM startup to exit in the event of fatal errors during restore. Any user errors like an invalid restore source or credentials cause the startup to fail fast, signaling to the user that there is invalid input.
- Improved messages in the nuosm and nuorestore scripts to better guide the user's expectations.
- Improved autoRestore, autoImport and in-place restore source URL handling.
- Test automation for the database and restore charts.

Testing
- autoRestore
- autoImport and in-place restore using multiple SMs