Various Backup and restore improvements #174

Merged
merged 32 commits into master from ntj/restore-improvements-merged on Jan 5, 2021

Conversation

sivanov-nuodb
Collaborator

The primary work was done by @NikTJ777 and posted in #80.
This PR contains his work plus some extra fixes, improvements, and test automation. The main issues addressed by this change are described below.

Issue
Running a DB restore to a specific backupset fails if a non-HC SM is started first when autoRestart is enabled.
As non-HC SMs don't have persistent storage for backups, they can only restore/import a database from a URL. If the restore source is not found or it isn't a URL, they should fail so that an HC SM that has access to the backupset can restore the database from it.

Issue
An invalid restore source URL or invalid credentials are not handled well: the SM continues to start even though the restore/import has failed, and ends up SYNCing when nuosm should exit with an error.

Issue
Setting restore.autoRestart to false doesn't actually disable database auto restart.

Changes

  • The nuosm script now causes the SM startup to exit in the event of fatal errors during restore. User errors such as an invalid restore source or invalid credentials cause the startup to fail fast, signaling to the user that there is invalid input (see the sketch after this list).
  • Logging has been added to the nuosm and nuorestore scripts to better guide the user's expectations.
  • Unified the code path for autoRestore, autoImport, and in-place restore source URL handling.
  • Removed any dependence of the restore chart on values in the database chart.
  • Documentation enhancements for the restore values in the database and restore charts.
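
A minimal sketch of the fail-fast behavior (illustrative only — checkRestoreSource is a hypothetical name, but $restore_source, $NUODB_BACKUPDIR, isUrl, and log are from the scripts in this PR):

function checkRestoreSource() {
    # a URL source can be handled by any SM; a backupset requires access
    # to the backup volume, which only hotcopy SMs have
    if ! isUrl "$restore_source" && [ ! -d "$NUODB_BACKUPDIR/$restore_source" ]; then
        log "Restore source $restore_source is not a URL and no backupset is accessible"
        exit 1   # fail fast so that an HC SM with access to the backupset performs the restore
    fi
}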

Testing

  • Manual testing for autoRestore
  • Test automation for autoImport and in-place restore using multiple SMs
  • Shell script linting integration test
  • Restore job integration tests

@@ -108,38 +108,46 @@ func StartDatabaseTemplate(t *testing.T, namespaceName string, adminPod string,

installationStep(t, options, helmChartReleaseName)

AwaitNrReplicasScheduled(t, namespaceName, tePodNameTemplate, opt.NrTePods)
AwaitNrReplicasScheduled(t, namespaceName, smPodName, opt.NrSmPods)
if awaitDatabase {
Collaborator Author
@sivanov-nuodb Dec 30, 2020

Sorry for the diff mess.
The change made here just wraps the database await logic in if awaitDatabase { ... }.
The change starts here.


AwaitDatabaseUp(t, namespaceName, adminPod, opt.DbName, opt.NrSmPods+opt.NrTePods)
}
Collaborator Author

The change ends here.

fi
}

function releaseLockedArchive() {
Collaborator
@adriansuarez Dec 30, 2020

This is just resurrecting the archive object that is still in the domain state as removed. I don't understand why this is called releaseLockedArchive. Can we call this resurrectRemovedArchive and update the output message accordingly?

Comment on lines +83 to +94
function wrapLogfile() {
    logsize=$( du -sb "$LOGFILE" | grep -o '^ *[0-9]\+' )
    maxlog=5000000
    log "logsize=$logsize; maxlog=$maxlog"
    if [ ${logsize:=0} -gt $maxlog ]; then
        # keep only the newest half of the file
        lines=$( wc -l < "$LOGFILE" )
        tail -n $(( lines / 2 )) "$LOGFILE" > "${LOGFILE}-new"
        mv "${LOGFILE}-new" "$LOGFILE"
        log "(nuosm) log file wrapped around"
    fi
}
Collaborator
@adriansuarez Dec 30, 2020

How much logging do we expect this to generate that we are not just emitting to standard output? And do we want to support log persistence for this sort of output?

Collaborator Author
@sivanov-nuodb Dec 31, 2020

I see the value of having a persistent nuosm log which is available between container restarts. All restore/import logic is executed only once, and it's easy to lose its logging, especially if the user has tried to resolve a restore problem on their own. Of course, all this only works if log persistence is enabled for the SMs.

Example restore log:

===========================================
logsize=1487; maxlog=5000000
Directory /var/opt/nuodb/archive/nuodb/demo exists
myArchive=0; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=20201229T225653; restore_requested=20201229T225653; path=/var/opt/nuodb/archive/nuodb/demo; atoms=78; catalogs=213
First-in = sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
I am first-in: sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0 == sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
Deleting archiveId=1
Restoring 20201229T225653; existing archive directores: total 4
drwxr-xr-x 37 nuodb root 4096 Dec 29 22:57 demo
(restore) recreated /var/opt/nuodb/archive/nuodb/demo; atoms=0
Calling nuodocker to restore 20201229T225653 into /var/opt/nuodb/archive/nuodb/demo
Finished restoring /var/opt/nuodb/backup/20201229T225653 to /var/opt/nuodb/archive/nuodb/demo. Created archive with archive ID 2
Deleting my archive metadata: 0
Restored from 20201229T225653
Clearing restore credential request from raft
[2] <NO VALUE> : /var/opt/nuodb/archive/nuodb/demo @ demo [journal_path = ] [snapshot_archive_path = ] NOT_RUNNING

Example normal startup:

===========================================
logsize=44; maxlog=5000000
Created new dir /var/opt/nuodb/archive/nuodb/demo
myArchive=-1; DB=demo; hostname=sm-database-aqochm-nuodb-cluster0-demo-hotcopy-0
restore_source=; restore_requested=; path=/var/opt/nuodb/archive/nuodb/demo; atoms=0; catalogs=0

EDIT: I've also added a timestamp for all log messages.
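
For illustration, a timestamped log helper could look like this (a minimal sketch, not necessarily the exact nuosm implementation):

function log() {
    # prefix every message with a UTC timestamp; tee also echoes to stdout
    echo "$( date -u +'%Y-%m-%dT%H:%M:%SZ' ) $*" | tee -a "$LOGFILE"
}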

}

function isUrl() {
if echo $1 | grep -q '^[a-z]\+:/[^ ]\+'; then
Collaborator

We should quote the argument, i.e. "$1". Also, do we actually need the if; then; else; fi? Isn't it the same to just invoke this command? I guess we would get whatever non-0 exit code the command returns instead of 1, but I'm not sure if that matters.

Collaborator Author

I'm fine removing the if/else but I prefer to keep the return $? so that it's clear what the function returns.
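
For example, the simplified function could read (a sketch of the suggestion above, keeping the explicit return):

function isUrl() {
    # grep's non-zero exit status propagates as "not a URL"
    echo "$1" | grep -q '^[a-z]\+:/[^ ]\+'
    return $?
}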

}

function isRestoreSourceAvailable() {
if isUrl "$restore_source" || [ -d $NUODB_BACKUPDIR/$restore_source ]; then
Collaborator

Same comment as above about if; then; else; fi. Also, this has to be called in a scope that defines restore_source. I think it would be better to make that an explicit argument (i.e. "$1").
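
A sketch of the suggested refactoring with an explicit argument (illustrative; the final form is up to the author):

function isRestoreSourceAvailable() {
    # the source is available if it is a URL or an existing local backupset
    isUrl "$1" || [ -d "$NUODB_BACKUPDIR/$1" ]
}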


function releaseLockedArchive() {
trace "releasing my locked archive metadata"
locked_archive=$( nuocmd show archives --db-name $DB_NAME --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
Collaborator
@adriansuarez Dec 30, 2020

I'm not familiar with the sed syntax used here, but this seems to assume that there is only one archive object associated with the current host (the SM pod hostname/FQDN) and that it has been started at least once. Can you include a comment describing what this doozy of a one-liner does (maybe @NikTJ777 can explain)?

Collaborator Author

This is my understanding as well. It expects the archive ID on the first line, followed by the exited process on the next line. AFAIK this is always the case in Docker deployments, as the process is started right after the archive is created by nuodocker start sm.
I will add a comment; see the sketch below.
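
For example, the pipeline could be commented roughly like this (my reading of it; the exact wording is up to the author):

# List archives removed from the domain, formatting each as two lines:
#   archive-id: <id>
#   <description of the archive's last (exited) SM process>
# For each "archive-id:" line, sed appends the next line (N); if the
# two-line pair mentions this host, the s command strips everything but
# the numeric ID; T skips printing when no substitution happened; p
# prints the ID; head keeps only the first match.
locked_archive=$( nuocmd show archives --db-name $DB_NAME \
    --removed --removed-archive-format "archive-id: {id}" \
    | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" \
    | head -n 1 )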

myArchive=$( nuocmd show archives \
    --db-name $DB_NAME \
    --removed --removed-archive-format "archive-id: {id}" | sed -En "/^archive-id: / {N; /$HOSTNAME/ s/^archive-id: ([0-9]+).*$/\1/; T; p}" | head -n 1 )
[ -z "$myArchive" ] && myArchive="-1"
log "myArchive=$myArchive; DB=$DB_NAME; hostname=$HOSTNAME"
Collaborator

Can we log archiveID instead of myArchive, which reads weirdly?


# take ownership of the SM startup semaphore
$NUOCMD set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
trace "Take ownership of SM startup semaphore"
nuocmd set value --key $startup_key/$DB_NAME --value $HOSTNAME --unconditional
Collaborator

What is this KV wrangling for? This is probably a question for @NikTJ777.

Collaborator Author

As far as I can tell, there are two main KV keys used for synchronization (see the sketch after this list):

  • $NUODB_RESTORE_REQUEST_PREFIX/$DB_NAME/first - this is held by the SM which will perform the restore. Any other SM waits for the "first" SM to go into the RUNNING state before it starts.
  • $startup_key/$DB_NAME - this is held by every SM when it starts, so that SMs start in sequential order and we never have more than one SYNCing SM at a time. This is probably to address an old issue where we had problems when more than two SMs were SYNCing at once.
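
A rough sketch of the startup-semaphore pattern (illustrative only; the real nuosm logic handles more cases, and nuocmd get value is assumed here to print the key's current value):

# wait until no other SM holds the startup semaphore
while true; do
    owner=$( nuocmd get value --key "$startup_key/$DB_NAME" )
    if [ -z "$owner" ] || [ "$owner" = "$HOSTNAME" ]; then
        break
    fi
    sleep 5
done
# claim the semaphore so that subsequent SMs wait for us
nuocmd set value --key "$startup_key/$DB_NAME" --value "$HOSTNAME" --unconditional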

Collaborator
@adriansuarez left a comment

Looks good, and it is good that we have test coverage of this now.

We should definitely move a lot of this functionality into nuodocker. The behavior of blocking process startup based on some key in the KV store is useful and something that we could generalize, because it also has other use cases, like diagnostics (e.g. someone wants to figure out why an SM keeps crashing and wants to access its archive directory, but that is only possible in K8s if we inhibit startup so that it does not crash again). A hypothetical sketch of that idea follows.
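
A hypothetical generalization (the diagnostic-hold key name is invented for illustration):

# block startup while a diagnostic "hold" key is set for this pod,
# keeping the container alive so its archive directory can be inspected
while [ -n "$( nuocmd get value --key "diagnostic-hold/$DB_NAME/$HOSTNAME" )" ]; do
    sleep 10
done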

@mkysel self-requested a review January 4, 2021 17:07
Contributor
@mkysel left a comment

Thanks for getting this over the line!

@mkysel mentioned this pull request Jan 4, 2021
@sivanov-nuodb merged commit 6ead867 into master Jan 5, 2021
@sivanov-nuodb deleted the ntj/restore-improvements-merged branch January 5, 2021 12:17
@acabrele
Contributor

acabrele commented Jan 5, 2021

I only just got round to looking at this since I was on hols. I was mostly interested in whether all the issues found during welab testing were addressed. I thought I would leave this comment here just to confirm that, as far as I can see, the changes do resolve them all.

@sivanov-nuodb
Collaborator Author

@acabrele Thanks for checking it! The three issues in the PR description should be fixed by this PR (two of them were found during the initial B&R testing by the QA team). If you find any major problems, please open issues so that we can fix those as well.

@acabrele
Contributor

acabrele commented Jan 5, 2021

> @acabrele Thanks for checking it! The three issues in the PR description should be fixed by this PR (two of them were found during the initial B&R testing by the QA team). If you find any major problems, please open issues so that we can fix those as well.

It also fixes some other issues not listed :-)

@sivanov-nuodb added the bug and enhancement labels and removed the bug label Jan 8, 2021
@mkysel changed the title from "Backup and restore improvements" to "Various Backup and restore improvements" Jan 21, 2021
@mkysel added the meta (Exclude from Changelog) and enhancement labels and removed the enhancement label Feb 8, 2021