Implement additional transcription metrics (#180) #189
match_onsets behaves exactly like match_notes, except that it only takes note onsets into account.
match_offsets behaves exactly like match_notes, except that it only takes note offsets into account.
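As a rough illustration of the difference between the three matching criteria (a sketch only, not the actual mir_eval code; the tolerances are the typical MIREX-style defaults, and the real functions additionally resolve one-to-one assignments rather than just testing candidate pairs):

```python
import numpy as np

ONSET_TOL = 0.05        # seconds
OFFSET_RATIO = 0.2      # fraction of the reference note's duration
OFFSET_MIN_TOL = 0.05   # seconds
PITCH_TOL = 50.0        # cents

def onsets_close(ref, est):
    # Onset-only criterion (match_onsets): onsets within a fixed window.
    return abs(ref[0] - est[0]) <= ONSET_TOL

def offsets_close(ref, est):
    # Offset-only criterion (match_offsets): the tolerance scales with the
    # reference note's duration, with a lower bound.
    tol = max(OFFSET_MIN_TOL, OFFSET_RATIO * (ref[1] - ref[0]))
    return abs(ref[1] - est[1]) <= tol

def notes_close(ref, est, ref_pitch, est_pitch):
    # Full criterion (match_notes): onset, offset and pitch must all agree.
    pitch_ok = abs(1200 * np.log2(est_pitch / ref_pitch)) <= PITCH_TOL
    return onsets_close(ref, est) and offsets_close(ref, est) and pitch_ok
```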
@craffel the onset/offset-only metrics are pretty much done, but I have not added them to the evaluator since they are not officially part of MIREX. Any preference? I'm still missing the overlap ratio (and maybe chroma alternatives) to complete the PR, but this metric is computed in MIREX so no doubts there.
They should be put in the evaluator even if they're not in MIREX.
I won't have a chance to review this until mid-May. Feel free to entice someone else to do a code review in the meantime.
ok
no worries, I'll ping the rest of the team once this is ready for review
@craffel just one quick question: to compute the overlap ratio you need to compute the note matching first, and then the metric is derived from the matched notes, similar to precision, recall and f1. Computationally, the most efficient thing to do would be to add this metric to the existing PRF function. So... add the metric there, or implement it as a separate function that repeats the matching?
We shouldn't duplicate computation. I would need to look at your code to decide if it makes sense to me to lump this all into one function because I'm not clear on what exactly the computation being duplicated is.
The computation being duplicated would be the call to the note matching function.
Ok. I am fine with adding overlap ratio computation to the PRF function. Why don't you code it up that way, and if there are any undesirable side-effects we can discuss them later.
ok sounds good. shall we rename it to precision_recall_f1_overlap then?
Sounds good. |
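A rough sketch of the shape being agreed on here: run the (expensive) note matching once and derive precision, recall, F-measure and the average overlap ratio from the same list of matched index pairs. The names and signature below are illustrative, not the final mir_eval API, and match_notes is assumed to return a list of (ref_index, est_index) pairs.

```python
def precision_recall_f1_overlap(ref_intervals, ref_pitches,
                                est_intervals, est_pitches, beta=1.0):
    # Single call to the note matching -- the computation that would
    # otherwise be duplicated across separate metric functions.
    matching = match_notes(ref_intervals, ref_pitches,
                           est_intervals, est_pitches)

    precision = len(matching) / max(len(est_pitches), 1)
    recall = len(matching) / max(len(ref_pitches), 1)
    if precision + recall > 0:
        f_measure = ((1 + beta**2) * precision * recall /
                     (beta**2 * precision + recall))
    else:
        f_measure = 0.0

    # Average Overlap Ratio over the matched pairs.
    ratios = []
    for i, j in matching:
        ref_on, ref_off = ref_intervals[i]
        est_on, est_off = est_intervals[j]
        ratios.append((min(ref_off, est_off) - max(ref_on, est_on)) /
                      (max(ref_off, est_off) - min(ref_on, est_on)))
    avg_overlap_ratio = sum(ratios) / len(ratios) if ratios else 0.0

    return precision, recall, f_measure, avg_overlap_ratio
```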
@craffel I've finished implementing the AOR metric, so I've compared it to some MIREX results: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889. In the first plot you can see the comparison between the two sets of scores.

The reason AOR is so sensitive to note matching is that it's the average (over all matched notes) of the ratio between the overlapping segment of two matched notes and the maximum time duration spanned by the two notes. So for example, let's say greedy matching matches ref note A to est note B, with start/end times [1, 3] and [0.8, 2.8] respectively. Their overlap ratio will be (2.8 - 1)/(3 - 0.8) = 1.8/2.2 = 0.82. Now, let's say graph-based note matching instead matches ref note A to est note C with start/end times [1.2, 2]. Now the overlap ratio will be (2 - 1.2)/(3 - 1) = 0.8/2 = 0.4, a dramatic drop from 0.82.

All this to say, I'm very confident my implementation of the metric is correct; it's just going to differ from MIREX due to the different note matching algorithm used. To get a rough notion of how the metric changes, I computed the differences for all algorithms evaluated on the Su dataset in 2015, displayed in the third plot of the notebook. You can see the change in score actually depends more on the specific track being evaluated and less on the algorithm (which is not surprising). For this dataset, changes range from an increase of 0.02 to a drop of almost 0.2. The last (box)plot in the notebook gives a good feel for the spread of differences; the median difference is about -0.02. Results are likely to vary for other datasets.
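For reference, here is the overlap-ratio arithmetic from the example above as a minimal sketch (the interval values are just the toy numbers from the comment, not part of mir_eval):

```python
def overlap_ratio(ref_interval, est_interval):
    # Ratio of the overlapping segment of two matched notes to the
    # maximum time duration spanned by the pair.
    ref_on, ref_off = ref_interval
    est_on, est_off = est_interval
    overlap = min(ref_off, est_off) - max(ref_on, est_on)
    max_span = max(ref_off, est_off) - min(ref_on, est_on)
    return overlap / max_span

ref_a = (1.0, 3.0)
est_b = (0.8, 2.8)   # greedy match
est_c = (1.2, 2.0)   # graph-based match

print(round(overlap_ratio(ref_a, est_b), 2))  # 0.82
print(round(overlap_ratio(ref_a, est_c), 2))  # 0.4
# AOR is the mean of this ratio over all matched note pairs.
```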
Thanks for this analysis. I'll take a closer look when I have more cycles. It's interesting that the greedy vs. optimal matching here makes such a big difference. Since this keeps popping up and affects a wide variety of metrics, I think it would be valuable to publish something about this (beyond what we discuss in the mir_eval paper).
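To make the greedy vs. optimal difference concrete, here is a toy illustration (not mir_eval's matching code, which also matches on pitch and offsets and uses a maximum bipartite matching): with onsets only and a 50 ms tolerance, a greedy nearest-neighbour strategy can leave a matchable note unmatched.

```python
from itertools import permutations

REF = [0.10, 0.12]   # reference onsets (seconds)
EST = [0.11, 0.06]   # estimated onsets
TOL = 0.05           # onset tolerance (50 ms)

def feasible(r, e):
    return abs(r - e) <= TOL

def greedy_match(ref, est):
    # Each reference note grabs its closest unmatched estimate within tolerance.
    used, matches = set(), []
    for i, r in enumerate(ref):
        candidates = [(abs(r - e), j) for j, e in enumerate(est)
                      if j not in used and feasible(r, e)]
        if candidates:
            _, j = min(candidates)
            used.add(j)
            matches.append((i, j))
    return matches

def optimal_match(ref, est):
    # Maximum matching by brute force -- fine for a toy example.
    best = []
    for perm in permutations(range(len(est))):
        matches = [(i, j) for i, j in enumerate(perm)
                   if i < len(ref) and feasible(ref[i], est[j])]
        if len(matches) > len(best):
            best = matches
    return best

print(len(greedy_match(REF, EST)))   # 1: ref onset 0.12 is left unmatched
print(len(optimal_match(REF, EST)))  # 2: both reference onsets can be matched
```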
Cool, let me know when you've had a chance to give it a look. As far as implementation goes, I just need to generate new regression data and add back the regression test, and then I think this PR is ready for a CR (I'll ping when it is). We're still missing the chroma metrics to reach full coverage of the MIREX metrics, but I'm out of cycles; might add them in a later PR. If you think it's worth writing something up as an LBD I'm happy to help where I can.
ok, in which case I'll have to keep my
ok
Too tempting? 🍰
👍
💀
👻 @bmcfee I think this means the last remaining discussion point is whether the note matching logic should be folded into match_events.
Appending relevant earlier comment:
1. Call validate_intervals() in validate(), and call util.validate_intervals() in validate_intervals()
2. Rename match_offsets and match_onsets to match_note_offsets and match_note_onsets
3. Document the difference between match_notes, match_note_onsets and match_note_offsets
4. Spellcheck sequences
5. Remove the old reference to the with_offset parameter from the docstring
6. Make the OR formula easier to read in the docstring
7. Add a note about validation to the AOR docstring
8. Add an optional beta parameter to all functions that compute f1
9. Generate empty interval arrays with the correct shape in the unit tests (items 8 and 9 are sketched below)
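A quick sketch of items 8 and 9 (assuming mir_eval.util.f_measure's beta argument is used under the hood, and that the interval validators expect an (n, 2) array):

```python
import numpy as np
import mir_eval

# Item 8: expose an optional beta so callers can compute F_beta, not just F1
# (mir_eval.util.f_measure already supports this).
precision, recall = 0.8, 0.5
f_half = mir_eval.util.f_measure(precision, recall, beta=0.5)

# Item 9: an empty note list still needs intervals of shape (0, 2) --
# a flat np.array([]) has shape (0,) and fails interval validation.
empty_intervals = np.zeros((0, 2))
empty_pitches = np.array([])
```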
I think it makes sense to adopt the proposed renaming. I'd rather keep match_events (or similar functions) as simple as possible, and not bloat them unnecessarily.
ok, that'll depend on what our final decision for match_events is
I agree that keeping it simple is preferable, but if you have a look at the note matching code, the logic is more involved than a simple event match.
@bmcfee if you still prefer to keep match_events as is, just let me know.
I'm fine with it as is. If we decide to merge that logic into match_events, we can do so in another PR.
ok cool. @craffel I think we're good to go on this one |
Can we compare to MIREX's implementation first, like last time? |
You mean beyond the comparison I already shared here: https://gist.github.com/justinsalamon/b2a837a889a61a69b76a42a4ff38d889 ? The changes I made based on @bmcfee's CR were mainly about naming conventions and validation; I haven't changed anything in the implementation of the metrics themselves since I performed the comparison in the notebook, so it still holds.
Oh, I totally forgot about that, sorry, too much other stuff going on. Thanks for reminding me. I echo my original sentiment.
Anyways, I am OK with merging in terms of the implementation/API being correct, although it's always helpful to get another pair of eyes on the docstrings etc., and I could do a code review this weekend/next week sometime. Any big rush to merge?
No, we can definitely wait a week for a final CR if that works for you. No rush, I just have a few PRs pending in different places and want to make sure they all get merged (eventually). Thanks!
Bug me if I have not done it by May 6th. |
```python
if ref_intervals.size == 0:
    warnings.warn("Reference notes are empty.")
if est_intervals.size == 0:
    warnings.warn("Estimate notes are empty.")
```
*Estimated
Looks good to me. I made two minor docs nitpicks. I'd also like it if you standardized how you refer to reference and estimated notes and offsets in the docs - some places it's "ref note", some places it's "reference", and you say "estimate note" instead of "estimated" in some places. So, all nitpicks.
@craffel happy to make these changes, but before I do I just want to be sure - I think in most places I use "estimate", treating it as a noun and a counterpart to "reference". It's true that in some places I use "estimated notes", but that's just for the sake of flow. For example, in one of the two comments you made you note I use "Estimate note", but it's not a typo - it means "a note belonging to the estimate". To make a long story short, I'm happy to make the terminology consistent (as it should be), but I actually have a preference for "Reference notes" and "Estimate notes", so I'd change occurrences of "estimated notes" to "estimate notes". Would that work? How is this handled in other modules in mir_eval?
This is the case. The convention is that you refer to the annotation as, e.g. "estimated beats", or generically as "the estimate", not as "estimate beats" (try a grep to see).
@craffel nitpicks have been nitpicked.
Merged! Thank you very much! |
ehm, this still seems to be around :) shall I? |
Please do! (Note - you need to close, not merge; the PR was already merged: https://github.com/craffel/mir_eval/commits/master ) |
Oh no! You merged instead of closing... it looks like history is not too messed up, but d3b8df8 got committed twice. I guess leaving that in there is better than rebasing and corrupting everyone's local copy. In the future, don't hit that merge button (or merge otherwise) after the PR has already been merged manually - you need to close it instead.
My bad - didn't realize I'd still have the option to merge if the PR had already been merged and thought it hadn't been. You live, you learn. |
This PR will include implementations of:
- onset-only and offset-only note matching, and the corresponding precision/recall/F-measure metrics
- the average overlap ratio, AOR (plus a _no_offset version)