
Implement additional transcription metrics (#180) #189

Merged
merged 14 commits into master from transcriptionext on May 8, 2016

Conversation

justinsalamon
Collaborator

justinsalamon commented Apr 12, 2016

This PR will include implementations of:

  • P, R & F measures for onsets only
  • P, R & F measures for offsets only
  • Overlap ratio metric (normal and _no_offset version)

match_onsets behaves exactly like match_notes, except that it only takes note onsets into account.
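A minimal sketch of what "onsets only" means in practice (the helper name and tolerance below are assumptions, not the PR's actual code): the hit matrix that feeds the matching is built from onset times alone, ignoring offsets and pitches.

```python
import numpy as np

def onset_hit_matrix(ref_intervals, est_intervals, onset_tolerance=0.05):
    # hit[i, j] is True when ref note i and est note j have onsets within the
    # tolerance window; offsets and pitches are ignored entirely.
    onset_diffs = np.abs(np.subtract.outer(ref_intervals[:, 0], est_intervals[:, 0]))
    return onset_diffs <= onset_tolerance
```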
@coveralls

Coverage Status

Coverage decreased (-0.8%) to 97.28% when pulling 39d39e1 on transcriptionext into 53c16af on master.

This function behaves exactly like match_notes, except that it only takes offsets into account.
@coveralls

Coverage Status

Coverage decreased (-1.6%) to 96.459% when pulling 4a5d4ba on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 98.097% when pulling 3662134 on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.1%) to 97.954% when pulling 6245fb2 on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.2%) to 97.911% when pulling 0c92e5d on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage increased (+0.06%) to 98.12% when pulling 5e5ca2b on transcriptionext into 53c16af on master.

@justinsalamon
Collaborator Author

@craffel the onset/offset-only metrics are pretty much done, but I have not added them to the evaluator since they are not officially part of MIREX. Any preference?

I'm still missing the overlap ratio (and maybe chroma alternatives) to complete the PR, but this metric is computed in MIREX so no doubts there.

@craffel
Collaborator

craffel commented Apr 13, 2016

the onset/offset-only metrics are pretty much done, but I have not added them to the evaluator since they are not officially part of MIREX. Any preference?

They should be put in the evaluator even if they're not in MIREX.

I'm still missing the overlap ratio (and maybe chroma alternatives) to complete the PR, but this metric is computed in MIREX so no doubts there.

I won't have a chance to review this until mid-May. Feel free to entice someone else to do a code review in the meanwhile.

@justinsalamon
Collaborator Author

They should be put in the evaluator even if they're not in MIREX.

ok

I won't have a chance to review this until mid-May. Feel free to entice someone else to do a code review in the meanwhile.

no worries, I'll ping the rest of the team once this is ready for review

@justinsalamon
Collaborator Author

@craffel just one quick question: to compute the overlap ratio you need to compute the note matching, and the metric is then derived from the matched notes, similar to precision, recall and F1. Computationally, the most efficient thing to do would be to add this metric to the existing precision_recall_f1 function, since it already computes the note matching. The alternative would be to implement the metric in a new function that also computes the note matching, but that means exactly the same note matching would be computed twice when the evaluator is called (and this is where most of the computational load is).

So... add the metric to precision_recall_f1 (and if so, rename the function?), or implement it in a separate function?
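For concreteness, a rough sketch of the single-pass option (not the PR's final code; it assumes match_notes returns (ref_index, est_index) pairs and uses its default tolerances): the expensive matching runs once, and P, R, F and the average overlap ratio are all derived from the same matched pairs.

```python
import numpy as np
from mir_eval.transcription import match_notes  # the matching function discussed above

def prf_overlap_sketch(ref_intervals, ref_pitches, est_intervals, est_pitches):
    if len(ref_intervals) == 0 or len(est_intervals) == 0:
        return 0.0, 0.0, 0.0, 0.0
    # Run the expensive note matching exactly once.
    matching = match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches)
    precision = float(len(matching)) / len(est_intervals)
    recall = float(len(matching)) / len(ref_intervals)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    # Derive the per-pair overlap ratios from the same matched pairs.
    ratios = []
    for i, j in matching:
        overlap = (min(ref_intervals[i, 1], est_intervals[j, 1]) -
                   max(ref_intervals[i, 0], est_intervals[j, 0]))
        span = (max(ref_intervals[i, 1], est_intervals[j, 1]) -
                min(ref_intervals[i, 0], est_intervals[j, 0]))
        ratios.append(overlap / span)
    avg_overlap_ratio = np.mean(ratios) if ratios else 0.0
    return precision, recall, f_measure, avg_overlap_ratio
```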

@craffel
Collaborator

craffel commented Apr 13, 2016

We shouldn't duplicate computation. I would need to look at your code to decide if it makes sense to me to lump this all into one function, because I'm not clear on exactly what computation is being duplicated.

@justinsalamon
Collaborator Author

The computation being duplicated would be this call to match_notes which involves computing the 3 hit matrices (for onset/offset/pitch) and then computing the bi-partite graph matching.
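A structural sketch of that duplicated work (not mir_eval's actual code; scipy's linear_sum_assignment is only a stand-in for the bipartite matching step, and the tolerance handling is simplified):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # stand-in for the bipartite matcher

def match_notes_sketch(ref_intervals, ref_pitches, est_intervals, est_pitches,
                       onset_tol=0.05, offset_ratio=0.2, pitch_tol_cents=50.0):
    # Hit matrix 1: onsets within the onset tolerance.
    onset_hit = (np.abs(np.subtract.outer(ref_intervals[:, 0], est_intervals[:, 0]))
                 <= onset_tol)
    # Hit matrix 2: offsets within a tolerance proportional to each ref note's duration.
    offset_tol = np.maximum(offset_ratio * (ref_intervals[:, 1] - ref_intervals[:, 0]), 0.05)
    offset_hit = (np.abs(np.subtract.outer(ref_intervals[:, 1], est_intervals[:, 1]))
                  <= offset_tol[:, None])
    # Hit matrix 3: pitches within the pitch tolerance (in cents).
    pitch_hit = (np.abs(1200 * np.log2(np.divide.outer(ref_pitches, est_pitches)))
                 <= pitch_tol_cents)
    hits = onset_hit & offset_hit & pitch_hit
    # Matching: pick a one-to-one pairing that uses as many hits as possible.
    rows, cols = linear_sum_assignment(-hits.astype(float))
    return [(r, c) for r, c in zip(rows, cols) if hits[r, c]]
```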

@craffel
Collaborator

craffel commented Apr 13, 2016

Ok. I am fine with adding overlap ratio computation to the PRF function. Why don't you code it up that way, and if there are any undesirable side-effects we can discuss them later.

@justinsalamon
Collaborator Author

ok sounds good. shall we rename it to precision_recall_f1_overlap then?

@craffel
Collaborator

craffel commented Apr 13, 2016

Sounds good.

@coveralls

Coverage Status

Coverage decreased (-0.3%) to 97.765% when pulling d03d01c on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.4%) to 97.664% when pulling be3259a on transcriptionext into 53c16af on master.

@justinsalamon
Collaborator Author

@craffel I've finished implementing the AOR metric, and so I've compared it to some MIREX results: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889

In the first plot you can see the mir_eval results are pretty different from the MIREX results. The reason is that this metric is highly dependent on the note matching, and since we use a different matching algorithm, the matching is different and consequently the AOR too. In the second plot I've plotted the scores we'd get if mir_eval used the same greedy note matching as in MIREX, and you can see the scores now match just fine.

The reason AOR is so sensitive to note matching is that it's the average (over all matched notes) of the ratio between the overlapping segment of two matched notes and the maximum time duration spanned by the two notes. So, for example, let's say greedy matching matches ref note A to est note B, with start/end times [1, 3] and [0.8, 2.8] respectively. Their overlap ratio will be (2.8 - 1)/(3 - 0.8) = 1.8/2.2 = 0.82. Now, let's say graph-based note matching instead matches ref note A to est note C with start/end times [1.2, 2]. Now the overlap ratio will be (2 - 1.2)/(3 - 1) = 0.8/2 = 0.4, a dramatic drop from 0.82.
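The arithmetic in that example as a tiny, self-contained check (hypothetical helper name, not the PR's code):

```python
def pair_overlap_ratio(ref_note, est_note):
    # Overlap ratio for one matched pair of [onset, offset] times:
    # overlapping segment divided by the total time span of the two notes.
    overlap = min(ref_note[1], est_note[1]) - max(ref_note[0], est_note[0])
    span = max(ref_note[1], est_note[1]) - min(ref_note[0], est_note[0])
    return overlap / span

print(pair_overlap_ratio([1, 3], [0.8, 2.8]))  # ~0.82, the greedy match above
print(pair_overlap_ratio([1, 3], [1.2, 2.0]))  # 0.4, the graph-based match above
```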

All this to say, I'm very confident my implementation of the metric is correct; it's just going to give different values than MIREX due to the different note matching algorithm used. To get a rough notion of how the metric changes, I computed the differences for all algorithms evaluated on the Su dataset in 2015, displayed in the third plot of the notebook. You can see the change in score actually depends more on the specific track being evaluated and less on the algorithm (which is not surprising). For this dataset, changes range from an increase of 0.02 to a drop of almost 0.2. The last (box)plot in the notebook gives a good feel for the spread of differences. The median difference is about -0.02. Results are likely to vary for other datasets.

@craffel
Collaborator

craffel commented Apr 14, 2016

Thanks for this analysis. I'll take a closer look when I have more cycles. It's interesting that the greedy vs. optimal matching makes such a big difference here. Since this keeps popping up and affects a wide variety of metrics, I think it would be valuable to publish something about this (beyond what we discuss in the mir_eval paper -- i.e. discussing all of these other metrics and how it affects them), maybe as an LBD this year or something.

@justinsalamon
Collaborator Author

Cool, let me know when you've had a chance to give it a look. As far as implementation goes, I just need to generate new regression data and add back the regression test, and then I think this PR is ready for a CR (I'll ping when it is). We're still missing the chroma metrics to reach full coverage of the MIREX metrics, but I'm out of cycles. I might add them in a later PR.

If you think it's worth writing something up as an LBD, I'm happy to help where I can.

@justinsalamon
Collaborator Author

mir_eval's convention in this case, which seems sane to me, is that the submodule's own validate function handles warnings for empty annotations and then passes the rest off to the util validate function. See https://github.com/craffel/mir_eval/blob/master/mir_eval/beat.py#L77 for example. I don't think it's totally insane to think that some metric which takes in interval annotations would be OK with empty annotations (i.e. not warranting a warning).

ok, in which case I'll have to keep my validate_intervals, but will add a call to util.validate_intervals inside this function.
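Concretely, that would look something like the following sketch (not the PR's final code; it assumes util.validate_intervals handles the shape and value checks):

```python
import warnings
from mir_eval import util

def validate_intervals(ref_intervals, est_intervals):
    # Module-level convention: warn (rather than raise) on empty annotations...
    if ref_intervals.size == 0:
        warnings.warn("Reference notes are empty.")
    if est_intervals.size == 0:
        warnings.warn("Estimated notes are empty.")
    # ...and delegate the remaining checks to the shared util validator.
    util.validate_intervals(ref_intervals)
    util.validate_intervals(est_intervals)
```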

If the annotation specification is that the shape is (n_intervals, 2) and an algorithm outputs np.array([]) or something, then that algorithm would have a bug. I would argue it's not mir_eval's job to handle that.

ok

I'm not supposed to be looking at this now

Too tempting? 🍰

@craffel
Collaborator

craffel commented Apr 25, 2016

ok, in which case I'll have to keep my validate_intervals, but will add a call to util.validate_intervals inside this function.

👍

Too tempting? 🍰

💀

@justinsalamon
Collaborator Author

💀

👻 @bmcfee I think this means the last remaining discussion point is:

Adding an optional strict=False parameter to util.match_events such that if strict==True then the function uses < instead of <=, and adding another optional parameter decimals=None such that if it's not None it's an integer indicating how many decimals to round event time differences down to for matching, would make the function more useful; for example, it could probably replace match_note_onsets, which basically implements the aforementioned functionality.

Appending relevant earlier comment:

The reason I'm rounding down is that I had a corner case in my unit tests where the distance between two events was exactly equal to the window, so when <= is used (as in util.match_events) the correct behavior would be to consider the events a match. However, without rounding down first, the two events in my test were not being matched, and it turned out the reason was floating-point precision. So you could say this is an issue with util.match_events, since it uses <= but doesn't guarantee correct behavior in corner cases.

1. Call validate_intervals() in validate(), call util.validate_intervals() in validate_intervals()
2. Rename match_offsets and match_onsets to match_note_offsets and match_note_onsets
3. Document difference between match_notes, match_note_onsets and match_note_offsets
4. Spellcheck sequences
5. Remove old reference to with_offset parameter from docstring
6. Make OR formula easier to read in docstring
7. Add note about validation in AOR docstring
8. Add beta optional parameter to all functions that compute f1
9. Generate empty interval arrays with correct shape in unit tests
@bmcfee
Collaborator

bmcfee commented Apr 25, 2016

I think this means the last remaining discussion point is:

Adding an optional strict=False parameter to util.match_events such that if strict==True then the function uses < instead of <=, and adding another optional parameter decimals=None such that if it's not None it's an integer indicating how many decimals to round event time differences down to for matching, would make the function more useful; for example, it could probably replace match_note_onsets, which basically implements the aforementioned functionality.

I think it makes sense to adopt the strict parameter functionality in general, since there's no other way to do this.

The decimals case is less obvious to me. Can't this be accomplished by pulling the rounding logic up to the calling function, rather than building it into match_events?

I'd rather keep match_events (or similar functions) as simple as possible, and not bloat them unnecessarily.

@justinsalamon
Collaborator Author

justinsalamon commented Apr 25, 2016

I think it makes sense to adopt the strict parameter functionality in general, since there's no other way to do this.

ok, depending on what our final decision for match_events is I can add it as part of this PR or it can be done separately.

The decimals case is less obvious to me. Can't this be accomplished by pulling the rounding logic up to the calling function, rather than building it into match_events?
I'd rather keep match_events (or similar functions) as simple as possible, and not bloat them unnecessarily.

I agree that keeping it simple is preferable, but if you have a look at match_onsets it's not particularly bloated. More importantly though, right now if you have 2 events at times 1.0 and 1.5, and you call match_events with a window of 0.5, then you'd expect the two events to match, whereas in practice it's possible for them not to match with the current implementation. So without rounding, match_events can actually behave incorrectly, which I think is decent motivation to include rounding. In transcription.py by default we round to 4 decimals, which is probably still more temporal resolution than any of the eval metrics require, but it guarantees that when you use <= and the time difference between events is identical (or very close) to the tolerance window, the behavior will stay correct.
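A sketch of how the proposed strict and decimals options could behave (hypothetical helper covering only the hit test; util.match_events itself also performs the matching):

```python
import numpy as np

def within_window(ref_times, est_times, window, strict=False, decimals=None):
    # Pairwise absolute time differences between reference and estimated events.
    distances = np.abs(np.subtract.outer(np.asarray(ref_times), np.asarray(est_times)))
    if decimals is not None:
        # Round the differences so that a distance equal to the window up to
        # floating-point noise still counts as a hit when strict is False.
        distances = np.round(distances, decimals=decimals)
    return distances < window if strict else distances <= window

# Events at 1.0 and 1.5 with a 0.5 s window: matched with <=, not with strict <.
print(within_window([1.0], [1.5], 0.5, strict=False, decimals=4))  # [[ True]]
print(within_window([1.0], [1.5], 0.5, strict=True, decimals=4))   # [[False]]
```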

@justinsalamon
Collaborator Author

@bmcfee if you still prefer to keep match_events as is then I think this PR should be ready to merge.

@bmcfee
Collaborator

bmcfee commented Apr 28, 2016

@bmcfee if you still prefer to keep match_events as is then I think this PR should be ready to merge.

I'm fine with it as is. If we decide to merge that logic into match_events, we can do so in another PR.

@justinsalamon
Collaborator Author

I'm fine with it as is. If we decide to merge that logic into match_events, we can do so in another PR.

ok cool.

@craffel I think we're good to go on this one

@craffel
Collaborator

craffel commented Apr 28, 2016

Can we compare to MIREX's implementation first, like last time?

@justinsalamon
Collaborator Author

justinsalamon commented Apr 28, 2016

Can we compare to MIREX's implementation first, like last time?

You mean beyond the comparison I already shared here: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889 ?

The changes I made based on @bmcfee's CR were mainly about naming conventions and validation; I haven't changed anything in the implementation of the metrics themselves since I performed the comparison in the notebook, so it still holds.

@craffel
Collaborator

craffel commented Apr 28, 2016

Oh, I totally forgot about that, sorry, too much other stuff going on. Thanks for reminding me. I echo my original sentiment:

Since this keeps popping up and affects a wide variety of metrics, I think it would be valuable to publish something about this (beyond what we discuss in the mir_eval paper -- i.e. discussing all of these other metrics and how it affects them), maybe as an LBD this year or something.

Anyways, I am OK with merging in terms of the implementation/API being correct, although it's always helpful to get another pair of eyes on the docstrings etc. and I could do a code review this weekend/next week sometime. Any big rush to merge?

@justinsalamon
Collaborator Author

Anyways, I am OK with merging in terms of the implementation/API being correct, although it's always helpful to get another pair of eyes on the docstrings etc. and I could do a code review this weekend/next week sometime. Any big rush to merge?

No, we can definitely wait a week for a final CR if that works for you. No rush, I just have a few PR's pending in different places and want to make sure they all get merged (eventually). Thanks!

@craffel
Collaborator

craffel commented Apr 28, 2016

Bug me if I have not done it by May 6th.

if ref_intervals.size == 0:
    warnings.warn("Reference notes are empty.")
if est_intervals.size == 0:
    warnings.warn("Estimate notes are empty.")
Collaborator

*Estimated

@craffel
Collaborator

craffel commented May 3, 2016

Looks good to me. I made two minor docs nitpicks. I'd also like it if you standardized how you refer to reference and estimated notes and offsets in the docs - in some places it's "ref note", in some places it's "reference", and you say "estimate note" instead of "estimated" in some places. So, all nitpicks.

@justinsalamon
Collaborator Author

@craffel happy to make these changes, but before I do I just want to be sure - I think in most places I use "estimate", treating it as a noun and a counterpart to "reference". It's true that in some places I use "estimated notes", but that's just for the sake of flow. For example, in one of the two comments you made, you note that I use "Estimate note", but it's not a typo; it means "a note belonging to the estimate".

To make a long story short, I'm happy to make terminology consistent (as it should be), but I actually have a preference for "Reference notes" and "Estimate notes", so I'd change occurrences of "estimated notes" to "estimate notes". Would that work? How is this handled in other modules in mir_eval? If everyone uses "estimated" everywhere and I'm the odd one out I can follow suit.

@craffel
Collaborator

craffel commented May 4, 2016

If everyone uses "estimated" everywhere and I'm the odd one out I can follow suit.

This is the case. The convention is that you refer to the annotation as, e.g. "estimated beats", or generically as "the estimate", not as "estimate beats" (try a grep to see).

@justinsalamon
Collaborator Author

@craffel nitpicks have been nitpicked.

@craffel
Collaborator

craffel commented May 8, 2016

Merged! Thank you very much!

@justinsalamon
Collaborator Author

ehm, this still seems to be around :) shall I?

@craffel
Collaborator

craffel commented May 8, 2016

Please do! (Note - you need to close, not merge; the PR was already merged: https://github.com/craffel/mir_eval/commits/master )

justinsalamon merged commit f858df3 into master May 8, 2016
@craffel
Collaborator

craffel commented May 8, 2016

Oh no! You merged instead of closing... it looks like history is not too messed up, but d3b8df8 got committed twice. I guess leaving that in there is better than rebasing and corrupting everyone's local copy. In the future, don't hit that merge button (or merge otherwise) after the PR has already been merged manually; you need to close it instead.

@justinsalamon
Collaborator Author

My bad - I didn't realize I'd still have the option to merge if the PR had already been merged, so I thought it hadn't been. You live, you learn.

bmcfee deleted the transcriptionext branch February 7, 2025 20:46