Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for gap between snapshot preparation and archive creation #361

Merged
merged 4 commits into from
Mar 25, 2024

Conversation

adriansuarez
Copy link
Collaborator

For SMs created from snapshots, we currently we rely on the absence of the info.json file and presence of the backup.txt file to detect that the archive has to be prepared from a snapshot. But there is a window of time until the info.json file for the newly-prepared archive is created in which any failure would leave the nuosm script unable to detect if it should perform a normal restart.

This change adds a restore.txt file which is created at the same time the old info.json and backup.txt file are deleted and is used to detect that the SM already performed snapshot preparation and should continue with normal startup.

For SMs created from snapshots, we currently we rely on the absence of
the info.json file and presence of the backup.txt file to detect that
the archive has to be prepared from a snapshot. But there is a window of
time until the info.json file for the newly-prepared archive is created
in which any failure would leave the `nuosm` script unable to detect if
it should perform a normal restart.

This change adds a restore.txt file which is created at the same time
the old info.json and backup.txt file are deleted and is used to detect
that the SM already performed snapshot preparation and should continue
with normal startup.
Also, to guarantee that database processes are not running, add 'nuocmd
check database' invocation before cleaning up domain state for database.
Copy link
Collaborator

@sivanov-nuodb sivanov-nuodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Copy link
Contributor

@kontaras kontaras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to delete restored.txt at the end. It will break an SM being restored, snapshot, and restored.

stable/database/files/nuosm Show resolved Hide resolved
@adriansuarez
Copy link
Collaborator Author

adriansuarez commented Mar 22, 2024

Don't forget to delete restored.txt at the end. It will break an SM being restored, snapshot, and restored.

Great point. The info.json file is created by nuodocker start sm, so the nuosm script cannot delete restored.txt after info.json is created, because we have handed control of the startup lifecycle to nuodocker start sm by that point. Since nuodocker start sm is implemented in the container image, putting that behavior there would create a dependency on a particular image version for backup to work.

I will put the deletion inside the backup_hooks.py script in the pre-backup hook, before the new backup.txt is created.

@sivanov-nuodb
Copy link
Collaborator

Don't forget to delete restored.txt at the end. It will break an SM being restored, snapshot, and restored.

Great point. The info.json file is created by nuodocker start sm, so the nuosm script cannot delete restored.txt after info.json is created, because we have handed control of the startup lifecycle to nuodocker start sm by that point. Since nuodocker start sm is implemented in the container image, putting that behavior there would create a dependency on a particular image version for backup to work.

I will put the deletion inside the backup_hooks.py script in the pre-backup hook, before the new backup.txt is created.

What is the issue by not deleting the restore.txt file? I see that only the absence of the file is checked in combination with the absence of info.json.

@adriansuarez
Copy link
Collaborator Author

adriansuarez commented Mar 22, 2024

What is the issue by not deleting the restore.txt file? I see that only the absence of the file is checked in combination with the absence of info.json.

After thinking a little bit more about it, I actually do not think it is an issue. A stale restore.txt can only be found in the case where there is a stale info.json, i.e. restore with same domain and database. The presence of the backup.txt would still prevent us from skipping the deletion of these files.

Either way, I think it makes sense to remove the stale restore.txt, since it is irrelevant which backup the database that was backed up came from.

@adriansuarez adriansuarez merged commit 19d85e9 into master Mar 25, 2024
7 checks passed
@adriansuarez adriansuarez deleted the asuarez/snapshot-restore-timing-hazard branch March 25, 2024 15:03
adriansuarez added a commit that referenced this pull request Mar 25, 2024
For SMs created from snapshots, we currently we rely on the absence of
the info.json file and presence of the backup.txt file to detect that
the archive has to be prepared from a snapshot. But there is a window of
time until the info.json file for the newly-prepared archive is created
in which any failure would leave the `nuosm` script unable to detect if
it should perform a normal restart.

This change adds a restore.txt file which is created at the same time
the old info.json and backup.txt file are deleted and is used to detect
that the SM already performed snapshot preparation and should continue
with normal startup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants