-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new cylc command to show task timings. #2286
Conversation
Add a new query to the run database and a top-level driver script that will allow interrogation of task times to facilitate generation of statistics about wallclock time required for tasks, including information necessary to generate batch queue wait times, task run time (which was already avialable through the GUI), and total wallclock time for the task.
@trwhitcomb - thanks for this, looks like it will be useful. Partially addresses #1683 Note the Python PEP8 style check failed on Travis CI:
If @cylc/core agree this is a good thing, some tests will be needed to ensure future changes don't break it - let us know if you want some advice on how to do this. |
@hjoliver Thanks. This came out of our desire to keep tabs on how long certain jobs were waiting in the queue. I have an additional script that reads this output, does some grouping, and then makes basic plots like the attached but haven't included that here as it introduces additional dependencies (such as Pandas and Matplotlib). I'll have a look at the PEP8 errors (aaargh - those are PEP8 rules I don't like!) but will definitely need assistance for the tests. |
@trwhitcomb - the plots look nice, BoM would be interested in this for sure. I think we could consider including this sort of thing in Cylc so long as we have clear documentation of requirements and clear error messages for those who don't have Pandas and Matplotlib installed. Re PEP8 - initially I did not particularly agree with all the style rules, but I've come to love them 😀 If nothing else, enforced consistency has proved to be a very good thing. |
@trwhitcomb @hjoliver Sounds good to me and certainly the sort of thing various users are interested in. Would be nice if the command could natively support printing summary timings (i.e. queued time and execution time) via a cli switch - this could be extended to displaying average timings as well. |
@arjclark actually the summary timing print was where this whole thing started. The original version of the script that we wrote performed those computations and the table that was printed out was based on the summary statistics rather than the individual listing you see here. The reason that this PR doesn't include that functionality is that computation used Pandas for cleanly handling a lot of the data alignment issues and summary statistics computations, and I didn't want to introduce an additional package dependency to the cylc system. That said, if there's interest in this, then perhaps it could be handled similar to the way that graphviz is now, i.e. try to import it and if it doesn't work, disable the features that depend on it. But that's not a call I wanted to make before getting input from the core development team! |
Fix errors complained about PEP8 test in Travis CI.
@trwhitcomb - I guess it's just alignment that's the issue though right? As far as summary stats are concerned you should be able to take your db results and calculate the stats in python from those? An ancient bit of scripting I wrote for someone gets all the useful data out of the database into a python structure and then does these lines:
As far as alignment goes you can just go with the "crude" solution of padding each result entry with spaces until it's length is equal to the longest length in its column and then printing that (I did something to that effect in the rosie command line utils). I think Pandas makes a lot of this easier but its not necessary. |
@arjclark Alignment in this case is less about text alignment (i.e. column width) and more about alignment of indexed data while performing arithmetic operations - namely that with a large dataset being able to do something like:
and ensuring that the index of (name, cycle, submit_num) is consistent and you're subtracting matching data together even if one of them is (potentially) missing is very convenient. You're right that it's not strictly necessary, but it does get around some nasty corner cases (especially where data isn't present). |
@trwhitcomb @arjclark - IMO it is OK to add a new dependency on Pandas and Matplotlib for this kind of functionality. We should use the best tools for the job unless there's reason to believe that installation of these packages is very difficult. |
Sounds good - I'll work on incorporating those features in. For handling the case where packages are not installed, is it better to not show the options that require it, or always show everything but raise an error if the user uses a switch that requires a missing package? |
@trwhitcomb - I vote for show all options and raise an error - otherwise users may not notice that the extra functionality is available. |
Relegate the previous output format to an optional raw output and instead default to a much more useful summary output that shows computed queue times, run times, and total times based on the run database. Add a dependency to the Pandas library to compute summary statistics.
Here's an example of the summary timings available - because the distributions can be skewed and decidedly non-Gaussian, I've opted to go for a few more parameters than the regular "mean/standard deviation" pairs.
|
Looks good. |
Update the timing summary to include generated HTML showing both boxplots of the timing statistics (per task) as well as a tabular summary of the statistics. Refactor the summary generation to remove duplication between the text summary and HTML summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good to me, tests as working, just a few comments to think about.
Noticed I get a repeated error if pandas is not installed:
Cannot import pandas - summary unavailable.Cannot import pandas - summary unavailable.
bin/cylc
Outdated
@@ -237,6 +237,7 @@ utility_commands['cycle-point'] = [ | |||
utility_commands['scp-transfer'] = ['scp-transfer'] | |||
utility_commands['suite-state'] = ['suite-state'] | |||
utility_commands['ls-checkpoints'] = ['ls-checkpoints'] | |||
utility_commands['show-times'] = ['show-times'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like the name cylc show-times
. What about cylc report-timings
. Or maybe just cylc report
with special option for a timing report, leaving room for other kinds of report in the future without adding many more cylc sub-commands? (This needs input from @cylc/core I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW my vote would be to rename to cylc report-timings
and then modify into a report
subcommand once we have something else ready to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's fine with me.
bin/cylc-show-times
Outdated
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
"""cylc [util] show-times [OPTIONS] REG | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll suggest a few changes below to reduce help text verbosity a bit...
bin/cylc-show-times
Outdated
|
||
"""cylc [util] show-times [OPTIONS] REG | ||
|
||
Retrieve timing information for the given cylc suite that can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrieve suite timing information for wait and run time performance analysis.
bin/cylc-show-times
Outdated
|
||
Retrieve timing information for the given cylc suite that can be | ||
used for fine-grain diagnostics for wait time and run time performance | ||
analysis. Two main categories of output are available: summary output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"raw output, and text or HTML summary output, is available"
bin/cylc-show-times
Outdated
analysis. Two main categories of output are available: summary output | ||
(in text or web format) and raw output. If an output filename is | ||
supplied, output is redirected there otherwise both summary and raw | ||
outputs are printed to standard output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor point, but given that all the output including SVG plots is text-based, an output filename option is not really necessary - just advise to redirect with > filename
? (hmmm, or maybe we want an MS Windows port one day...)
bin/cylc-show-times
Outdated
) | ||
parser.add_option( | ||
"-r", "--raw", | ||
help="Show raw timing output for downstream processing.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you, "for downstream processing"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"downstream processing" meaning user-specific postprocessing that happens with the data. This is less necessary now that we're including some sort of summary capability within the command, but I can still see people wanting to do something extra and using the raw output as data input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, understood! I was confused as we commonly use "downstream processing" to describe bits of in-suite workflow downstream of some important task.
Maybe something like "raw timing output suitable for automated text parsing"?
bin/cylc-show-times
Outdated
if output_options.count(True) > 1: | ||
parser.error('Cannot combine output formats (choose one)') | ||
if output_options.count(True) == 0: | ||
# No output specified - choose summary by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not any(output_options)
bin/cylc-show-times
Outdated
|
||
def write_category(self, buf, category, df_reshape, df_describe): | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these use @abstractmethod
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can, but one of the reasons I implemented it this way is so I don't need to define every single routine in the inheriting classes. If they're @abstractmethod
, I'd need to add subroutines with pass
in the child classes if it's not used (i.e. TextTimingSummary
doesn't use a summary header or footer) If you'd like, though, I can modify to make it more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's fine - it's not a major issue in this case.
lib/cylc/rundb.py
Outdated
GROUP BY | ||
e.name, e.cycle, e.submit_num | ||
""" % { | ||
'task_events': self.TABLE_TASK_EVENTS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is the gnarliest db query I've seen to date ... not that I see many of these!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be easier to use:
task_states.submit_num
for the latest submit number.task_jobs.time_submit
for submit time.task_jobs.time_run
for start time.task_jobs.time_exit
for the exit time.
(The task_jobs
table is designed to store this sort of information for Rose Bush, so should be a bit more efficient.)
bin/cylc-show-times
Outdated
border-bottom: 1px solid grey; | ||
} | ||
svg { | ||
width: 65%; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the plots look too big, to me, although perhaps it depends on how much data is being displayed. Could be configurable?
PEP8 checks failed, btw. |
@matthewrmshin - can you review too, or reassign? |
@hjoliver spotted the pep8 - this time I actually ran it before committing and it caught a bunch - don't know why those last few slipped through. Will fix when I commit the response to your review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @trwhitcomb I have added some initial thoughts that you may want to consider, or perhaps open new issues for after this PR.
bin/cylc-show-times
Outdated
} | ||
""" | ||
|
||
buf.write('<html><head><style>%s</style></head><body>' % css) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if you'd consider moving the display presentation logic into Jinja2 templates?
lib/cylc/rundb.py
Outdated
GROUP BY | ||
e.name, e.cycle, e.submit_num | ||
""" % { | ||
'task_events': self.TABLE_TASK_EVENTS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be easier to use:
task_states.submit_num
for the latest submit number.task_jobs.time_submit
for submit time.task_jobs.time_run
for start time.task_jobs.time_exit
for the exit time.
(The task_jobs
table is designed to store this sort of information for Rose Bush, so should be a bit more efficient.)
Update several items based on review and revision comments. Some highlights: - command is now cylc report-timings - db query updated to mostly use the same table as Rose Bush, avoiding a pivot and triple join. The new query performs much better than the old query and is easier to read. - documentation updated for less verbosity - PEP8 updates that weren't caught by local pep8 - missing library error should only show up once
Thanks, guys! New revision should address most of these things. @hjoliver I can see what you mean about the image size, but it does resize with the window. We could parameterize but I'm open to suggestions for the easiest way to handle that (I'm trying not to overcomplicate things intially). @matthewrmshin Using |
Also, it looks like my local |
Timing information requires obtaining the most recent submission number of a succeeded task. When this is done with the task events table, it requires a subquery but using the task states table (where only the most recent submit number is stored) allows a much simpler join operation to be used instead.
Catch one PEP8 violation caught by Travis CI but missed by my local pep8 run.
@hjoliver also, I'm having a mental block for the best way to handle configuring the plot width - suggest deferring that to Matt's suggestion of basing output on a Jinja2 template later. |
lib/cylc/rundb.py
Outdated
WHERE | ||
s.event='%(succeeded)s' | ||
event = '%(succeeded)s' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should also have mentioned that task_jobs.run_status == 0
is probably a better test for succeeded.
@trwhitcomb The |
Rather than computing the latest submit number for a succeeded task, the return status is also stored in the task_jobs table so we can get rid of joins altogether and use only task_jobs for better efficiency as well as reduced SQL complexity.
@matthewrmshin Thanks! That simplifies things even more. |
@trwhitcomb - yes I'm happy to punt further improvements, such as plot customization and Jinja2 templating to future PRs. |
add7f5b
to
7162266
Compare
Add a new query to the run database and a top-level driver
script that will allow interrogation of task times to facilitate
generation of statistics about wallclock time required for tasks,
including information necessary to generate batch queue wait times,
task run time (which was already available through the GUI), and
total wallclock time for the task.