
Implement additional transcription metrics (#180) #189

Merged
merged 14 commits into master from transcriptionext on May 8, 2016

Conversation

justinsalamon
Collaborator

justinsalamon commented Apr 12, 2016

This PR will include implementations of:

  • P, R & F measures for onsets only
  • P, R & F measures for offsets only
  • Overlap ratio metric (normal and _no_offset version)

match_onsets behaves exactly like match_notes, except that it only takes note onsets into account.
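A minimal sketch of what "onsets only" means in practice (the helper name and tolerance below are assumptions, not the PR's actual code): the hit matrix that feeds the matching is built from onset times alone, ignoring offsets and pitches.

```python
import numpy as np

def onset_hit_matrix(ref_intervals, est_intervals, onset_tolerance=0.05):
    # hit[i, j] is True when ref note i and est note j have onsets within the
    # tolerance window; offsets and pitches are ignored entirely.
    onset_diffs = np.abs(np.subtract.outer(ref_intervals[:, 0], est_intervals[:, 0]))
    return onset_diffs <= onset_tolerance
```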
@coveralls

Coverage Status

Coverage decreased (-0.8%) to 97.28% when pulling 39d39e1 on transcriptionext into 53c16af on master.

This function behaves exactly like match_notes, except that it only takes offsets into account.
@coveralls

Coverage Status

Coverage decreased (-1.6%) to 96.459% when pulling 4a5d4ba on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 98.097% when pulling 3662134 on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.1%) to 97.954% when pulling 6245fb2 on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.2%) to 97.911% when pulling 0c92e5d on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage increased (+0.06%) to 98.12% when pulling 5e5ca2b on transcriptionext into 53c16af on master.

@justinsalamon
Collaborator Author

@craffel the onset/offset-only metrics are pretty much done, but I have not added them to the evaluator since they are not officially part of MIREX. Any preference?

I'm still missing the overlap ratio (and maybe chroma alternatives) to complete the PR, but this metric is computed in MIREX so no doubts there.

@craffel
Collaborator

craffel commented Apr 13, 2016

the onset/offset-only metrics are pretty much done, but I have not added them to the evaluator since they are not officially part of MIREX. Any preference?

They should be put in the evaluator even if they're not in MIREX.

I'm still missing the overlap ratio (and maybe chroma alternatives) to complete the PR, but this metric is computed in MIREX so no doubts there.

I won't have a chance to review this until mid-May. Feel free to entice someone else to do a code review in the meanwhile.

@justinsalamon
Collaborator Author

They should be put in the evaluator even if they're not in MIREX.

ok

I won't have a chance to review this until mid-May. Feel free to entice someone else to do a code review in the meanwhile.

no worries, I'll ping the rest of the team once this is ready for review

@justinsalamon
Collaborator Author

@craffel just one quick question: to compute the overlap ratio you need to compute the note matching, and the metric is then derived from the matched notes, similar to precision, recall and F1. Computationally, the most efficient thing to do would be to add this metric to the existing precision_recall_f1 function, since it already computes the note matching. The alternative would be to implement the metric in a new function that also computes the note matching, but that means exactly the same note matching would be computed twice when the evaluator is called (and this is where most of the computational load is).

So... add the metric to precision_recall_f1 (and if so, rename the function?), or implement it in a separate function?
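For concreteness, a rough sketch of the single-pass option (not the PR's final code; it assumes match_notes returns (ref_index, est_index) pairs and uses its default tolerances): the expensive matching runs once, and P, R, F and the average overlap ratio are all derived from the same matched pairs.

```python
import numpy as np
from mir_eval.transcription import match_notes  # the matching function discussed above

def prf_overlap_sketch(ref_intervals, ref_pitches, est_intervals, est_pitches):
    if len(ref_intervals) == 0 or len(est_intervals) == 0:
        return 0.0, 0.0, 0.0, 0.0
    # Run the expensive note matching exactly once.
    matching = match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches)
    precision = float(len(matching)) / len(est_intervals)
    recall = float(len(matching)) / len(ref_intervals)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    # Derive the per-pair overlap ratios from the same matched pairs.
    ratios = []
    for i, j in matching:
        overlap = (min(ref_intervals[i, 1], est_intervals[j, 1]) -
                   max(ref_intervals[i, 0], est_intervals[j, 0]))
        span = (max(ref_intervals[i, 1], est_intervals[j, 1]) -
                min(ref_intervals[i, 0], est_intervals[j, 0]))
        ratios.append(overlap / span)
    avg_overlap_ratio = np.mean(ratios) if ratios else 0.0
    return precision, recall, f_measure, avg_overlap_ratio
```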

@craffel
Collaborator

craffel commented Apr 13, 2016

We shouldn't duplicate computation. I would need to look at your code to decide if it makes sense to me to lump this all into one function, because I'm not clear on exactly what computation is being duplicated.

@justinsalamon
Collaborator Author

The computation being duplicated would be this call to match_notes which involves computing the 3 hit matrices (for onset/offset/pitch) and then computing the bi-partite graph matching.
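A structural sketch of that duplicated work (not mir_eval's actual code; scipy's linear_sum_assignment is only a stand-in for the bipartite matching step, and the tolerance handling is simplified):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # stand-in for the bipartite matcher

def match_notes_sketch(ref_intervals, ref_pitches, est_intervals, est_pitches,
                       onset_tol=0.05, offset_ratio=0.2, pitch_tol_cents=50.0):
    # Hit matrix 1: onsets within the onset tolerance.
    onset_hit = (np.abs(np.subtract.outer(ref_intervals[:, 0], est_intervals[:, 0]))
                 <= onset_tol)
    # Hit matrix 2: offsets within a tolerance proportional to each ref note's duration.
    offset_tol = np.maximum(offset_ratio * (ref_intervals[:, 1] - ref_intervals[:, 0]), 0.05)
    offset_hit = (np.abs(np.subtract.outer(ref_intervals[:, 1], est_intervals[:, 1]))
                  <= offset_tol[:, None])
    # Hit matrix 3: pitches within the pitch tolerance (in cents).
    pitch_hit = (np.abs(1200 * np.log2(np.divide.outer(ref_pitches, est_pitches)))
                 <= pitch_tol_cents)
    hits = onset_hit & offset_hit & pitch_hit
    # Matching: pick a one-to-one pairing that uses as many hits as possible.
    rows, cols = linear_sum_assignment(-hits.astype(float))
    return [(r, c) for r, c in zip(rows, cols) if hits[r, c]]
```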

@craffel
Collaborator

craffel commented Apr 13, 2016

Ok. I am fine with adding overlap ratio computation to the PRF function. Why don't you code it up that way, and if there are any undesirable side-effects we can discuss them later.

@justinsalamon
Collaborator Author

ok sounds good. shall we rename it to precision_recall_f1_overlap then?

@craffel
Collaborator

craffel commented Apr 13, 2016

Sounds good.

@coveralls

Coverage Status

Coverage decreased (-0.3%) to 97.765% when pulling d03d01c on transcriptionext into 53c16af on master.

@coveralls

Coverage Status

Coverage decreased (-0.4%) to 97.664% when pulling be3259a on transcriptionext into 53c16af on master.

@justinsalamon
Collaborator Author

@craffel I've finished implementing the AOR metric, and so I've compared it to some MIREX results: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889

In the first plot you can see the mir_eval results are pretty different from the MIREX results. The reason is that this metric is highly dependent on the note matching, and since we use a different matching algorithm, the matching is different and consequently the AOR too. In the second plot I've plotted the scores we'd get if mir_eval used the same greedy note matching as in MIREX, and you can see the scores now match just fine.

The reason AOR is so sensitive to note matching is that it's the average (over all matched notes) of the ratio between the overlapping segment of two matched notes and the maximum time duration spanned by the two notes. So, for example, let's say greedy matching matches ref note A to est note B, with start/end times [1, 3] and [0.8, 2.8] respectively. Their overlap ratio will be (2.8 - 1)/(3 - 0.8) = 1.8/2.2 = 0.82. Now, let's say graph-based note matching instead matches ref note A to est note C with start/end times [1.2, 2]. Now the overlap ratio will be (2 - 1.2)/(3 - 1) = 0.8/2 = 0.4, a dramatic drop from 0.82.
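The arithmetic in that example as a tiny, self-contained check (hypothetical helper name, not the PR's code):

```python
def pair_overlap_ratio(ref_note, est_note):
    # Overlap ratio for one matched pair of [onset, offset] times:
    # overlapping segment divided by the total time span of the two notes.
    overlap = min(ref_note[1], est_note[1]) - max(ref_note[0], est_note[0])
    span = max(ref_note[1], est_note[1]) - min(ref_note[0], est_note[0])
    return overlap / span

print(pair_overlap_ratio([1, 3], [0.8, 2.8]))  # ~0.82, the greedy match above
print(pair_overlap_ratio([1, 3], [1.2, 2.0]))  # 0.4, the graph-based match above
```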

All this to say, I'm very confident my implementation of the metric is correct; it's just going to give different values than MIREX due to the different note matching algorithm used. To get a rough notion of how the metric changes, I computed the differences for all algorithms evaluated on the Su dataset in 2015, displayed in the third plot of the notebook. You can see the change in score actually depends more on the specific track being evaluated and less on the algorithm (which is not surprising). For this dataset, changes range from an increase of 0.02 to a drop of almost 0.2. The last (box)plot in the notebook gives a good feel for the spread of differences. The median difference is about -0.02. Results are likely to vary for other datasets.

@craffel
Collaborator

craffel commented Apr 14, 2016

Thanks for this analysis. I'll take a closer look when I have more cycles. It's interesting that the greedy vs. optimal matching makes such a big difference here. Since this keeps popping up and affects a wide variety of metrics, I think it would be valuable to publish something about this (beyond what we discuss in the mir_eval paper -- i.e. discussing all of these other metrics and how it affects them), maybe as an LBD this year or something.

@justinsalamon
Collaborator Author

Cool, let me know when you've had a chance to give it a look. As far as implementation goes, I just need to generate new regression data and add back the regression test, and then I think this PR is ready for a CR (I'll ping when it is). We're still missing the chroma metrics to reach full coverage of the MIREX metrics, but I'm out of cycles. I might add them in a later PR.

If you think it's worth writing something up as an LBD, I'm happy to help where I can.

@justinsalamon
Collaborator Author

mir_eval's convention in this case, which seems sane to me, is that the submodule's own validate function handles warnings for empty annotations and then passes the rest off to the util validate function. See https://github.com/craffel/mir_eval/blob/master/mir_eval/beat.py#L77 for example. I don't think it's totally insane to think that some metric which takes in interval annotations would be OK with empty annotations (i.e. not warranting a warning).

ok, in which case I'll have to keep my validate_intervals, but will add a call to util.validate_intervals inside this function.
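Concretely, that would look something like the following sketch (not the PR's final code; it assumes util.validate_intervals handles the shape and value checks):

```python
import warnings
from mir_eval import util

def validate_intervals(ref_intervals, est_intervals):
    # Module-level convention: warn (rather than raise) on empty annotations...
    if ref_intervals.size == 0:
        warnings.warn("Reference notes are empty.")
    if est_intervals.size == 0:
        warnings.warn("Estimated notes are empty.")
    # ...and delegate the remaining checks to the shared util validator.
    util.validate_intervals(ref_intervals)
    util.validate_intervals(est_intervals)
```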

If the annotation specification is that the shape is (n_intervals, 2) and an algorithm outputs np.array([]) or something, then that algorithm would have a bug. I would argue it's not mir_eval's job to handle that.

ok

I'm not supposed to be looking at this now

Too tempting? 🍰

@craffel
Collaborator

craffel commented Apr 25, 2016

ok, in which case I'll have to keep my validate_intervals, but will add a call to util.validate_intervals inside this function.

👍

Too tempting? 🍰

💀

@justinsalamon
Collaborator Author

💀

👻 @bmcfee I think this means the last remaining discussion point is:

Adding an optional strict=False parameter to util.match_events such that if strict==True then the function uses < instead of <=, and adding another optional parameter decimals=None such that if it's not None it's an integer indicating how many decimals to round event time differences down to for matching, would make the function more useful; for example, it could probably replace match_note_onsets, which basically implements the aforementioned functionality.

Appending relevant earlier comment:

The reason I'm rounding down is that I had a corner case in my unit tests where the distance between two events was exactly equal to the window, so when <= is used (as in util.match_events) the correct behavior would be to consider the events a match. However, without rounding down first, the two events in my test were not being matched, and it turned out the reason was floating-point precision. So you could say this is an issue with util.match_events, since it uses <= but doesn't guarantee correct behavior in corner cases.

1. Call validate_intervals() in validate(), call util.validate_intervals() in validate_intervals()
2. Rename match_offsets and match_onsets to match_note_offsets and match_note_onsets
3. Document difference between match_notes, match_note_onsets and match_note_offsets
4. Spellcheck sequences
5. Remove old reference to with_offset parameter from docstring
6. Make OR formula easier to read in docstring
7. Add note about validation in AOR docstring
8. Add beta optional parameter to all functions that compute f1
9. Generate empty interval arrays with correct shape in unit tests
@bmcfee
Collaborator

bmcfee commented Apr 25, 2016

I think this means the last remaining discussion point is:

Adding an optional strict=False parameter to util.match_events such that if strict==True then the function uses < instead of <=, and adding another optional parameter decimals=None such that if it's not None it's an integer indicating how many decimals to round event time differences down to for matching, would make the function more useful; for example, it could probably replace match_note_onsets, which basically implements the aforementioned functionality.

I think it makes sense to adopt the strict parameter functionality in general, since there's no other way to do this.

The decimals case is less obvious to me. Can't this be accomplished by pulling the rounding logic up to the calling function, rather than building it into match_events?

I'd rather keep match_events (or similar functions) as simple as possible, and not bloat them unnecessarily.

@justinsalamon
Collaborator Author

justinsalamon commented Apr 25, 2016

I think it makes sense to adopt the strict parameter functionality in general, since there's no other way to do this.

ok, depending on what our final decision for match_events is I can add it as part of this PR or it can be done separately.

The decimals case is less obvious to me. Can't this be accomplished by pulling the rounding logic up to the calling function, rather than building it into match_events?
I'd rather keep match_events (or similar functions) as simple as possible, and not bloat them unnecessarily.

I agree that keeping it simple is preferable, but if you have a look at match_onsets it's not particularly bloated. More importantly though, right now if you have 2 events at times 1.0 and 1.5, and you call match_events with a window of 0.5, then you'd expect the two events to match, whereas in practice it's possible for them not to match with the current implementation. So without rounding, match_events can actually behave incorrectly, which I think is decent motivation to include rounding. In transcription.py by default we round to 4 decimals, which is probably still more temporal resolution than any of the eval metrics require, but it guarantees that when you use <= and the time difference between events is identical (or very close) to the tolerance window, the behavior will stay correct.
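A sketch of how the proposed strict and decimals options could behave (hypothetical helper covering only the hit test; util.match_events itself also performs the matching):

```python
import numpy as np

def within_window(ref_times, est_times, window, strict=False, decimals=None):
    # Pairwise absolute time differences between reference and estimated events.
    distances = np.abs(np.subtract.outer(np.asarray(ref_times), np.asarray(est_times)))
    if decimals is not None:
        # Round the differences so that a distance equal to the window up to
        # floating-point noise still counts as a hit when strict is False.
        distances = np.round(distances, decimals=decimals)
    return distances < window if strict else distances <= window

# Events at 1.0 and 1.5 with a 0.5 s window: matched with <=, not with strict <.
print(within_window([1.0], [1.5], 0.5, strict=False, decimals=4))  # [[ True]]
print(within_window([1.0], [1.5], 0.5, strict=True, decimals=4))   # [[False]]
```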

@justinsalamon
Collaborator Author

@bmcfee if you still prefer to keep match_events as is then I think this PR should be ready to merge.

@bmcfee
Collaborator

bmcfee commented Apr 28, 2016

@bmcfee if you still prefer to keep match_events as is then I think this PR should be ready to merge.

I'm fine with it as is. If we decide to merge that logic into match_events, we can do so in another PR.

@justinsalamon
Collaborator Author

I'm fine with it as is. If we decide to merge that logic into match_events, we can do so in another PR.

ok cool.

@craffel I think we're good to go on this one

@craffel
Collaborator

craffel commented Apr 28, 2016

Can we compare to MIREX's implementation first, like last time?

@justinsalamon
Collaborator Author

justinsalamon commented Apr 28, 2016

Can we compare to MIREX's implementation first, like last time?

You mean beyond the comparison I already shared here: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889 ?

The changes I made based on @bmcfee's CR were mainly about naming conventions and validation; I haven't changed anything in the implementation of the metrics themselves since I performed the comparison in the notebook, so it still holds.

@craffel
Collaborator

craffel commented Apr 28, 2016

Oh, I totally forgot about that, sorry, too much other stuff going on. Thanks for reminding me. I echo my original sentiment:

Since this keeps popping up and affects a wide variety of metrics, I think it would be valuable to publish something about this (beyond what we discuss in the mir_eval paper -- i.e. discussing all of these other metrics and how it affects them), maybe as an LBD this year or something.

Anyways, I am OK with merging in terms of the implementation/API being correct, although it's always helpful to get another pair of eyes on the docstrings etc. and I could do a code review this weekend/next week sometime. Any big rush to merge?

@justinsalamon
Collaborator Author

Anyways, I am OK with merging in terms of the implementation/API being correct, although it's always helpful to get another pair of eyes on the docstrings etc. and I could do a code review this weekend/next week sometime. Any big rush to merge?

No, we can definitely wait a week for a final CR if that works for you. No rush, I just have a few PR's pending in different places and want to make sure they all get merged (eventually). Thanks!

@craffel
Collaborator

craffel commented Apr 28, 2016

Bug me if I have not done it by May 6th.

if ref_intervals.size == 0:
    warnings.warn("Reference notes are empty.")
if est_intervals.size == 0:
    warnings.warn("Estimate notes are empty.")
Collaborator

*Estimated

@craffel
Collaborator

craffel commented May 3, 2016

Looks good to me. I made two minor docs nitpicks. I'd also like it if you standardized how you refer to reference and estimated notes and offsets in the docs - in some places it's "ref note", in some places it's "reference", and you say "estimate note" instead of "estimated" in some places. So, all nitpicks.

@justinsalamon
Collaborator Author

@craffel happy to make these changes, but before I do I just want to be sure - I think in most places I use "estimate", treating it as a noun and a counterpart to "reference". It's true that in some places I use "estimated notes", but that's just for the sake of flow. For example, in one of the two comments you made, you note that I use "Estimate note", but it's not a typo; it means "a note belonging to the estimate".

To make a long story short, I'm happy to make terminology consistent (as it should be), but I actually have a preference for "Reference notes" and "Estimate notes", so I'd change occurrences of "estimated notes" to "estimate notes". Would that work? How is this handled in other modules in mir_eval? If everyone uses "estimated" everywhere and I'm the odd one out I can follow suit.

@craffel
Collaborator

craffel commented May 4, 2016

If everyone uses "estimated" everywhere and I'm the odd one out I can follow suit.

This is the case. The convention is that you refer to the annotation as, e.g. "estimated beats", or generically as "the estimate", not as "estimate beats" (try a grep to see).

@justinsalamon
Collaborator Author

@craffel nitpicks have been nitpicked.

@craffel
Collaborator

craffel commented May 8, 2016

Merged! Thank you very much!

@justinsalamon
Collaborator Author

ehm, this still seems to be around :) shall I?

@craffel
Collaborator

craffel commented May 8, 2016

Please do! (Note - you need to close, not merge; the PR was already merged: https://github.com/craffel/mir_eval/commits/master )

justinsalamon merged commit f858df3 into master May 8, 2016
@craffel
Collaborator

craffel commented May 8, 2016

Oh no! You merged instead of closing... it looks like history is not too messed up, but d3b8df8 got committed twice. I guess leaving that in there is better than rebasing and corrupting everyone's local copy. In the future, don't hit that merge button (or merge otherwise) after the PR has already been merged manually; you need to close it instead.

@justinsalamon
Collaborator Author

My bad - I didn't realize I'd still have the option to merge if the PR had already been merged, so I thought it hadn't been. You live, you learn.

bmcfee deleted the transcriptionext branch February 7, 2025 20:46