cylc.rundb: replace state dump #1827
Conversation
How do we roll back to a previous suite state in your new system?

It cannot be done at the moment. I want to discuss the main usages of this before putting this functionality back in.

The current rolling archive of state dumps is probably of little use - by default it's very short, and it isn't easy to figure out which one to use if not the most recent (however, I'll ask our operations team if they ever use it). But I think a more deliberate suite check-pointing system might be really useful, e.g. during suite development, which typically involves a lot of re-running - see #1735.

The main use to date (to my knowledge) has been in operations, where items of graphing have gone missing following a restart or reload. The state dumps allow us to 1) easily diff the contents of the task pool before and after the reload/restart, and 2) roll back when something has gone wrong, rather than sitting there manually inserting and resetting states for a not insignificant number of tasks.
Both of these are better solved by #1735 (thanks to @hjoliver for raising this) instead of the rolling archive:
I think it is easy to extend the
(force-pushed a9174ac to 984169a)
So long as @hjoliver's proposal and/or point 1 are in the same release then that's fine. I don't want to end up in a position where we've lost the state dumps and have no way of rolling back/checkpointing for a given cylc version. (point 2 is rarely the case in my experience of our operational problem suites - it's usually pretty clear something's gone horrendously wrong fairly quickly and the task becomes determining what's happened and how to recover)
Sounds good to me - as a basic starting point I'd snapshot at initialisation of the suite/restart/warmstart and on (planned) shutdowns, otherwise just using wherever the suite was last up to.
(force-pushed 17fe934 to 3bdb067)
(force-pushed 984169a to 161d65d)
Operations here reports they only ever (and infrequently at that) use the "most recent non-corrupted state dump", i.e. (presumably) the most-recent-but-one file, in the unlikely event that the suite was killed in the middle of writing the most recent one. So that's fine - the atomic nature of the DB means it should never be corrupted!

(The logic since #825 should mean that the default state file (which is a symbolic link to the latest state file that has been fully written and fsync'ed) is never corrupted.) I think the only (useful?) functionality missing from this change is the ability to easily inspect the last 10 state dumps (or however many are retained) and to checkpoint the task pool. The former is probably not useful. For the latter, I'll add a table to snapshot the task pool and a table to snapshot broadcast states on restart (and possibly before/after reload).
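As an aside, the "never corrupted" guarantee described above rests on the standard write-then-atomically-rename pattern: write the new state to a temporary file, flush and fsync it, then atomically replace (or re-point a symlink at) the final name. A minimal sketch of that pattern in Python - file names and content here are illustrative, not cylc's actual paths:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so a reader never sees a partial file.

    The temporary file is fully written and fsync'ed before an atomic
    os.replace() makes it visible under the final name, so a crash
    mid-write leaves the old file intact.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # atomic on POSIX: old content or new, never partial
    except BaseException:
        os.unlink(tmp)
        raise

# hypothetical usage with made-up state content
atomic_write("state.dump", "task1 running\ntask2 waiting\n")
```

SQLite provides the same property internally via its journal/WAL, which is why the public DB should never be seen half-written.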
(force-pushed cef549e to 495b2cd)
(force-pushed 90e1fec to 465f141)
(force-pushed 5bac391 to ad6bd30)
    self.pub_dao.add_delete_item(table_name, where_args)
    # Record suite parameters and tasks in pool
    # Record any broadcast settings to be dumped out
    for obj in self, BroadcastServer.get_inst():
Can you (or do you want me to) do some profiling with a busy suite for this routine? It was a bit of a bottleneck before.
Running my 100-task busy suite for 30 cycles in my environment, process_queued_db_ops has a cumulative run time of 67.2s. That compares well with master, where the equivalent work takes 63.6s plus 10.8s for state-file dumping (74.4s total).
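A cumulative-time figure like the one quoted above can be obtained with the standard `cProfile`/`pstats` modules. A sketch of that measurement, with the busy-suite workload stubbed out by a placeholder function (the real routine and its 30-cycle driver loop live inside cylc):

```python
import cProfile
import io
import pstats

def process_queued_db_ops():
    # Placeholder for the real cylc routine; here it just burns a
    # little CPU so there is something to measure.
    total = 0
    for i in range(100000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
for _ in range(30):  # stand-in for 30 suite cycles
    process_queued_db_ops()
profiler.disable()

# Report cumulative time, filtered to the routine of interest.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats("process_queued_db_ops")
report = stream.getvalue()
print(report)
```

The "cumtime" column of the filtered report is the number being compared between branches.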
Lots of comments, but I really like the change.

I think the most important thing is to profile it against a busy suite.
(force-pushed 1feeb7b to 00badf1)
@benfitzpatrick most comments addressed or fixed. I'll do some profiling.
(force-pushed ec2c162 to 2ae9a80)
Branch squashed and re-based. Profiling using my 100-task busy suite suggests little or no change in performance.
New tables for:
* Suite parameters, e.g. run mode, initial/final cycle points.
* Current tasks in task pool, with their "spawned" status.
* Snapshots of the above and broadcast states.

Restart loads previous states from the run DB.
* Will upgrade DB with old state dump file automatically.

Remove duplicated columns from task_states and task_events tables.
Remove state dump functionality.

cylc ls-checkpoints: new command to list the task pool checkpoints.

Rename cylc-suite.db:
* cylc-suite-private.db <- state/cylc-suite.db
* cylc-suite-public.db <- cylc-suite.db
* cylc-suite.db now a symbolic link to cylc-suite-public.db
(force-pushed 2ae9a80 to bf903bd)
Branch re-based again.
…cylc-1827 Fix t: rose-bush/00, broken since cylc/cylc-flow#1827
A missing `break` means that we were still querying the `task_events` table instead of the `task_jobs` table for a set of unique job hosts.
This change replaces state dump functionality with the runtime DB.
Runtime DB changes:
* Remove duplicated columns from the `task_states` and `task_events` tables. The information can all be found under `task_jobs`.
* Restart now loads previous states and other information from the runtime DB only.
cylc ls-checkpoints: new command to list the task pool checkpoints.
Close #421.
See also:
This change breaks backward compatibility on restart (as it no longer reads/writes state dumps), so I'd like to start a conversation about which old features and/or backward compatibility we want to retain (or not).
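To make the checkpoint-in-DB idea concrete, the kind of data a checkpoint listing works over can be pictured as a small SQLite schema: one table identifying each checkpoint and one snapshotting the task pool at that point. This is a hypothetical sketch - table and column names are illustrative, not the actual cylc-suite.db schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checkpoint_id (
        id INTEGER PRIMARY KEY,   -- checkpoint number
        time TEXT,                -- when the snapshot was taken
        event TEXT                -- e.g. 'restart', 'reload', 'shutdown'
    );
    CREATE TABLE task_pool_checkpoints (
        id INTEGER,               -- references checkpoint_id.id
        name TEXT,                -- task name
        cycle TEXT,               -- cycle point
        status TEXT,              -- task state, e.g. 'running'
        spawned INTEGER           -- whether the task has spawned
    );
""")
conn.execute(
    "INSERT INTO checkpoint_id VALUES (1, '2016-06-01T00:00Z', 'restart')")
conn.execute(
    "INSERT INTO task_pool_checkpoints VALUES (1, 'foo', '2016', 'running', 1)")

# The kind of query a checkpoint-listing command could run.
rows = conn.execute(
    "SELECT id, time, event FROM checkpoint_id ORDER BY id").fetchall()
print(rows)
```

Listing checkpoints is then a plain `SELECT` against the run DB, and rolling back amounts to reloading the task pool rows for a chosen checkpoint id.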