Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cylc.rundb: improve commit/rollback logic #1788

Merged
merged 2 commits into from
Apr 18, 2016

Conversation

matthewrmshin
Copy link
Contributor

Commit as soon as statements are executed correctly - should reset roll back point.
Roll back as soon as statements are not executed correctly - should prevent roll back of more statements than intended.

Commit and delete all queued items only on success. Roll back if anything dies, and retry all queued items next time. This ensures statements are never missed on retry.

Print statement diagnostics on DB connect error for write.

@hjoliver please take an initial look.

@matthewrmshin matthewrmshin added the bug Something is wrong :( label Apr 12, 2016
@matthewrmshin matthewrmshin added this to the next-release milestone Apr 12, 2016
@hjoliver
Copy link
Member

This looks good, although my level of sqlite wizardry is sub-optimal. Could you (or did you) test your theory about what's wrong with current master, and that this will fix it, by messing with the code to cause a deliberate failure mid-way through some statements?

@matthewrmshin matthewrmshin force-pushed the rundb-logic-improve branch 2 times, most recently from 04c18b3 to 93adf74 Compare April 13, 2016 08:30
@matthewrmshin
Copy link
Contributor Author

Branch re-based. Added test to prove that DB statement diagnostic is printed on STDERR.

@matthewrmshin
Copy link
Contributor Author

Yes, I have simulated the problem by adding to CylcSuiteDAO._execute_stmt a statement like this:

if self.is_public and self.n_tries == 0 and stmt.startswith("UPDATE task_states"):
    raise sqlite3.Error

Running a single-task suite, we'll end up with no entry in the task_states table on master (!), but recovers correctly on this branch.

@matthewrmshin
Copy link
Contributor Author

@hjoliver @arjclark please review. This change should now be good.

@hjoliver
Copy link
Member

There might be a performance issue here.

This randomly chosen small test suite takes almost twice as long to run on this branch, and the extra time is all spent in the db code:

[scheduling]
    [[dependencies]]
        graph = broken-task:fail => fixer-task
[runtime]
    [[broken-task]]
        script = /bin/false
    [[fixer-task]]
        command scripting = """
cylc broadcast -s 'script = /bin/true' -p 1 -n broken-task $CYLC_SUITE_NAME
cylc trigger $CYLC_SUITE_NAME broken-task.1"""

Profiling on master:

[oliverh@niwa-34403 cylc.git]$ statshow stats.master | head -n 20
Thu Apr 14 13:06:25 2016    stats.master

         23360 function calls (23037 primitive calls) in 7.488 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.488    7.488 <string>:1(<module>)
        1    0.001    0.001    7.488    7.488 scheduler.py:974(run)
        6    6.006    1.001    6.006    1.001 {time.sleep}
        6    0.001    0.000    1.392    0.232 task_pool.py:1228(process_queued_db_ops)
       12    0.001    0.000    1.389    0.116 rundb.py:316(execute_queued_items)
       12    1.374    0.115    1.374    0.115 {method 'commit' of 'sqlite3.Connection' objects}
        6    0.000    0.000    0.048    0.008 suite_state_dumping.py:53(dump)
        6    0.045    0.007    0.045    0.007 {posix.fsync}
        6    0.000    0.000    0.021    0.004 task_pool.py:463(submit_tasks)

Profiling on this branch:

[oliverh@niwa-34403 cylc.git]$ statshow stats.matt | head -n 20
Thu Apr 14 13:06:47 2016    stats.matt

         22934 function calls (22612 primitive calls) in 12.904 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.904   12.904 <string>:1(<module>)
        1    0.001    0.001   12.904   12.904 scheduler.py:974(run)
        5    0.001    0.000    7.760    1.552 task_pool.py:1228(process_queued_db_ops)
       10    0.001    0.000    7.756    0.776 rundb.py:316(execute_queued_items)
       70    0.001    0.000    7.754    0.111 rundb.py:360(_execute_stmt)
       70    7.736    0.111    7.736    0.111 {method 'commit' of 'sqlite3.Connection' objects}
        5    5.005    1.001    5.005    1.001 {time.sleep}
        5    0.000    0.000    0.098    0.020 suite_state_dumping.py:53(dump)
        5    0.095    0.019    0.095    0.019 {posix.fsync}
        5    0.000    0.000    0.020    0.004 task_pool.py:463(submit_tasks)

8 seconds does seem like a long time given the small amount of data being written here.

(I have not had time to profile any more realistic suites)

@matthewrmshin
Copy link
Contributor Author

To reduce the number of commits, the alternative is to commit all statements together like before, but also allow rollback and retry all statements together, hence don't delete any statements until all successful.

@matthewrmshin
Copy link
Contributor Author

I have updated the logic to reduce the number of commits. It will only call commit on success of the whole set of queued items, and will only delete items in the queue if everything is OK. This will allow it to retry the whole set of queued items on failure. The change continues to work correctly with my simulated failure described above.

@hjoliver
Copy link
Member

Performance is similar to master now. Test battery passes here.

Review 1 - good.

@hjoliver hjoliver assigned arjclark and unassigned hjoliver Apr 15, 2016
@hjoliver
Copy link
Member

Actually (post review 1!) I decided to run the database tests a few times, and in two of eight runs tests/database/04-lock-recover.t failed:

[oliverh@niwa-34403 cylc.git]$ cylc test-b tests/database/04-lock-recover.t 
tests/database/04-lock-recover.t .. ok   
All tests successful.
Files=1, Tests=4, 11 wallclock secs ( 0.03 usr  0.01 sys +  1.87 cusr  1.08 csys =  2.99 CPU)
Result: PASS

[oliverh@niwa-34403 cylc.git]$ cylc test-b tests/database/04-lock-recover.t 
tests/database/04-lock-recover.t .. 2/4 --- 04-lock-recover-run.stderr.grep 2016-04-15 14:33:09.526980421 +1200
+++ -   2016-04-15 14:33:09.635571781 +1200
@@ -1,4 +1,4 @@
 WARNING: cannot execute database statement:
    file=/home/oliverh/cylc-run/20160415T023259Z_cylc_test_database_04-lock-recover/cylc-suite.db:
-   stmt=INSERT OR REPLACE INTO task_jobs VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-   stmt_args[0]=['1', 'locker', 1, False, 1, '2016-04-15T02:33:00Z', None, None, None, None, None, None, 'localhost', 'background', None]
+   stmt=UPDATE task_jobs SET time_submit_exit=?, submit_status=?, batch_sys_job_id=? WHERE cycle==? AND name==? AND submit_num==?
+   stmt_args[0]=['?', 0, '30402', '1', 'locker', 1]

    stdout and stderr stored in: /tmp/cylc-tests-oliverh/database/04-lock-recover-20160415T023259Z
tests/database/04-lock-recover.t .. Failed 1/4 subtests 

Test Summary Report
-------------------
tests/database/04-lock-recover.t (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=1, Tests=4, 10 wallclock secs ( 0.02 usr  0.00 sys +  1.30 cusr  0.34 csys =  1.66 CPU)
Result: FAIL

@arjclark
Copy link
Contributor

@matthewrmshin - can you look into the issue reported by @hjoliver above?

@arjclark arjclark assigned matthewrmshin and unassigned arjclark Apr 15, 2016
@matthewrmshin
Copy link
Contributor Author

Test should now be robust.

@matthewrmshin
Copy link
Contributor Author

Branch re-based.

Commit and delete all queued items only on success. Roll back if
anything dies, and retry all queued items next time. This ensures
statements are never missed on retry.
Print statement diagnostics on DB connect error for write.
A sleep now guarantees that when the database is locked, the next
statement would be an INSERT INTO task_events statement.
@hjoliver
Copy link
Member

All good now; back to @arjclark

@hjoliver hjoliver assigned arjclark and unassigned hjoliver Apr 17, 2016
@arjclark
Copy link
Contributor

Looks OK to me. No problems from the test-battery in my environment.

@arjclark arjclark merged commit 07d9fd6 into cylc:master Apr 18, 2016
@matthewrmshin matthewrmshin deleted the rundb-logic-improve branch April 18, 2016 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants