cylc.rundb: improve commit/rollback logic #1788

matthewrmshin · 2016-04-12T09:18:21Z

~~Commit as soon as statements are executed correctly - should reset roll back point.~~
~~Roll back as soon as statements are not executed correctly - should prevent roll back of more statements than intended.~~

Commit and delete all queued items only on success. Roll back if anything dies, and retry all queued items next time. This ensures statements are never missed on retry.

Print statement diagnostics on DB connect error for write.

@hjoliver please take an initial look.

hjoliver · 2016-04-13T02:54:54Z

This looks good, although my level of sqlite wizardry is sub-optimal. Could you (or did you) test your theory about what's wrong with current master, and that this will fix it, by messing with the code to cause a deliberate failure mid-way through some statements?

matthewrmshin · 2016-04-13T08:32:50Z

Branch re-based. Added test to prove that DB statement diagnostic is printed on STDERR.

matthewrmshin · 2016-04-13T09:44:03Z

Yes, I have simulated the problem by adding to CylcSuiteDAO._execute_stmt a statement like this:

if self.is_public and self.n_tries == 0 and stmt.startswith("UPDATE task_states"):
    raise sqlite3.Error

Running a single-task suite, we'll end up with no entry in the task_states table on master (!), but recovers correctly on this branch.

matthewrmshin · 2016-04-13T09:55:56Z

@hjoliver @arjclark please review. This change should now be good.

hjoliver · 2016-04-14T01:16:14Z

There might be a performance issue here.

This randomly chosen small test suite takes almost twice as long to run on this branch, and the extra time is all spent in the db code:

[scheduling]
    [[dependencies]]
        graph = broken-task:fail => fixer-task
[runtime]
    [[broken-task]]
        script = /bin/false
    [[fixer-task]]
        command scripting = """
cylc broadcast -s 'script = /bin/true' -p 1 -n broken-task $CYLC_SUITE_NAME
cylc trigger $CYLC_SUITE_NAME broken-task.1"""

Profiling on master:

[oliverh@niwa-34403 cylc.git]$ statshow stats.master | head -n 20
Thu Apr 14 13:06:25 2016    stats.master

         23360 function calls (23037 primitive calls) in 7.488 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.488    7.488 <string>:1(<module>)
        1    0.001    0.001    7.488    7.488 scheduler.py:974(run)
        6    6.006    1.001    6.006    1.001 {time.sleep}
        6    0.001    0.000    1.392    0.232 task_pool.py:1228(process_queued_db_ops)
       12    0.001    0.000    1.389    0.116 rundb.py:316(execute_queued_items)
       12    1.374    0.115    1.374    0.115 {method 'commit' of 'sqlite3.Connection' objects}
        6    0.000    0.000    0.048    0.008 suite_state_dumping.py:53(dump)
        6    0.045    0.007    0.045    0.007 {posix.fsync}
        6    0.000    0.000    0.021    0.004 task_pool.py:463(submit_tasks)

Profiling on this branch:

[oliverh@niwa-34403 cylc.git]$ statshow stats.matt | head -n 20
Thu Apr 14 13:06:47 2016    stats.matt

         22934 function calls (22612 primitive calls) in 12.904 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.904   12.904 <string>:1(<module>)
        1    0.001    0.001   12.904   12.904 scheduler.py:974(run)
        5    0.001    0.000    7.760    1.552 task_pool.py:1228(process_queued_db_ops)
       10    0.001    0.000    7.756    0.776 rundb.py:316(execute_queued_items)
       70    0.001    0.000    7.754    0.111 rundb.py:360(_execute_stmt)
       70    7.736    0.111    7.736    0.111 {method 'commit' of 'sqlite3.Connection' objects}
        5    5.005    1.001    5.005    1.001 {time.sleep}
        5    0.000    0.000    0.098    0.020 suite_state_dumping.py:53(dump)
        5    0.095    0.019    0.095    0.019 {posix.fsync}
        5    0.000    0.000    0.020    0.004 task_pool.py:463(submit_tasks)

8 seconds does seem like a long time given the small amount of data being written here.

(I have not had time to profile any more realistic suites)

matthewrmshin · 2016-04-14T05:49:44Z

To reduce the number of commits, the alternative is to commit all statements together like before, but also allow rollback and retry all statements together, hence don't delete any statements until all successful.

matthewrmshin · 2016-04-14T10:09:26Z

I have updated the logic to reduce the number of commits. It will only call commit on success of the whole set of queued items, and will only delete items in the queue if everything is OK. This will allow it to retry the whole set of queued items on failure. The change continues to work correctly with my simulated failure described above.

hjoliver · 2016-04-15T01:58:42Z

Performance is similar to master now. Test battery passes here.

Review 1 - good.

hjoliver · 2016-04-15T02:35:40Z

Actually (post review 1!) I decided to run the database tests a few times, and in two of eight runs tests/database/04-lock-recover.t failed:

[oliverh@niwa-34403 cylc.git]$ cylc test-b tests/database/04-lock-recover.t 
tests/database/04-lock-recover.t .. ok   
All tests successful.
Files=1, Tests=4, 11 wallclock secs ( 0.03 usr  0.01 sys +  1.87 cusr  1.08 csys =  2.99 CPU)
Result: PASS

[oliverh@niwa-34403 cylc.git]$ cylc test-b tests/database/04-lock-recover.t 
tests/database/04-lock-recover.t .. 2/4 --- 04-lock-recover-run.stderr.grep 2016-04-15 14:33:09.526980421 +1200
+++ -   2016-04-15 14:33:09.635571781 +1200
@@ -1,4 +1,4 @@
 WARNING: cannot execute database statement:
    file=/home/oliverh/cylc-run/20160415T023259Z_cylc_test_database_04-lock-recover/cylc-suite.db:
-   stmt=INSERT OR REPLACE INTO task_jobs VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-   stmt_args[0]=['1', 'locker', 1, False, 1, '2016-04-15T02:33:00Z', None, None, None, None, None, None, 'localhost', 'background', None]
+   stmt=UPDATE task_jobs SET time_submit_exit=?, submit_status=?, batch_sys_job_id=? WHERE cycle==? AND name==? AND submit_num==?
+   stmt_args[0]=['?', 0, '30402', '1', 'locker', 1]

    stdout and stderr stored in: /tmp/cylc-tests-oliverh/database/04-lock-recover-20160415T023259Z
tests/database/04-lock-recover.t .. Failed 1/4 subtests 

Test Summary Report
-------------------
tests/database/04-lock-recover.t (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=1, Tests=4, 10 wallclock secs ( 0.02 usr  0.00 sys +  1.30 cusr  0.34 csys =  1.66 CPU)
Result: FAIL

arjclark · 2016-04-15T08:05:54Z

@matthewrmshin - can you look into the issue reported by @hjoliver above?

matthewrmshin · 2016-04-15T08:42:22Z

Test should now be robust.

matthewrmshin · 2016-04-15T08:53:09Z

Branch re-based.

Commit and delete all queued items only on success. Roll back if anything dies, and retry all queued items next time. This ensures statements are never missed on retry. Print statement diagnostics on DB connect error for write.

A sleep now guarantees that when the database is locked, the next statement would be an INSERT INTO task_events statement.

hjoliver · 2016-04-17T23:12:09Z

All good now; back to @arjclark

arjclark · 2016-04-18T14:36:59Z

Looks OK to me. No problems from the test-battery in my environment.

matthewrmshin assigned hjoliver Apr 12, 2016

matthewrmshin added the bug Something is wrong :( label Apr 12, 2016

matthewrmshin added this to the next-release milestone Apr 12, 2016

matthewrmshin force-pushed the rundb-logic-improve branch 2 times, most recently from 04c18b3 to 93adf74 Compare April 13, 2016 08:30

matthewrmshin force-pushed the rundb-logic-improve branch from 93adf74 to 0b7ca25 Compare April 13, 2016 08:46

matthewrmshin force-pushed the rundb-logic-improve branch from 0b7ca25 to 8010205 Compare April 13, 2016 09:55

matthewrmshin force-pushed the rundb-logic-improve branch from 8010205 to adcabba Compare April 14, 2016 10:05

hjoliver assigned arjclark and unassigned hjoliver Apr 15, 2016

arjclark assigned matthewrmshin and unassigned arjclark Apr 15, 2016

matthewrmshin assigned hjoliver and unassigned matthewrmshin Apr 15, 2016

matthewrmshin force-pushed the rundb-logic-improve branch from 8341cdd to d7be7b8 Compare April 15, 2016 08:52

matthewrmshin added 2 commits April 15, 2016 09:55

cylc.rundb: improve commit/rollback logic

699337a

Commit and delete all queued items only on success. Roll back if anything dies, and retry all queued items next time. This ensures statements are never missed on retry. Print statement diagnostics on DB connect error for write.

t: data-base 04: improve test robustness

5113af1

A sleep now guarantees that when the database is locked, the next statement would be an INSERT INTO task_events statement.

matthewrmshin force-pushed the rundb-logic-improve branch from d7be7b8 to 5113af1 Compare April 15, 2016 08:55

hjoliver assigned arjclark and unassigned hjoliver Apr 17, 2016

arjclark merged commit 07d9fd6 into cylc:master Apr 18, 2016

matthewrmshin deleted the rundb-logic-improve branch April 18, 2016 14:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cylc.rundb: improve commit/rollback logic #1788

cylc.rundb: improve commit/rollback logic #1788

matthewrmshin commented Apr 12, 2016

hjoliver commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

hjoliver commented Apr 14, 2016

matthewrmshin commented Apr 14, 2016

matthewrmshin commented Apr 14, 2016

hjoliver commented Apr 15, 2016

hjoliver commented Apr 15, 2016

arjclark commented Apr 15, 2016

matthewrmshin commented Apr 15, 2016

matthewrmshin commented Apr 15, 2016

hjoliver commented Apr 17, 2016

arjclark commented Apr 18, 2016

cylc.rundb: improve commit/rollback logic #1788

cylc.rundb: improve commit/rollback logic #1788

Conversation

matthewrmshin commented Apr 12, 2016

hjoliver commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

matthewrmshin commented Apr 13, 2016

hjoliver commented Apr 14, 2016

matthewrmshin commented Apr 14, 2016

matthewrmshin commented Apr 14, 2016

hjoliver commented Apr 15, 2016

hjoliver commented Apr 15, 2016

arjclark commented Apr 15, 2016

matthewrmshin commented Apr 15, 2016

matthewrmshin commented Apr 15, 2016

hjoliver commented Apr 17, 2016

arjclark commented Apr 18, 2016