Measures for over- and under-segmentation of chord labels #263
mir_eval/chord.py (outdated):

    seg = 0.
    for start, end in reference_intervals:
        dur = end - start
        between_start_end = est_ts[(est_ts > start) & (est_ts < end)]
Should only one of these comparisons be strict?
mir_eval/chord.py (outdated):

        directional hamming distance between reference intervals and
        estimated intervals.
        """
        est_ts = np.unique(estimated_intervals.flatten())
Seems like this should have input validation.
Something like if estimated_intervals.shape[1] != 2? Or should the check go deeper and ensure, e.g., that there are no overlaps in the intervals?
Calling util.validate_intervals would probably suffice. It doesn't check for disjoint/complete segmentation (and maybe it should), but it does cover the basics of shape matching and valid interval timings.
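For illustration only, here is a minimal self-contained sketch of what that kind of check catches (not the PR's actual code):

    import numpy as np
    from mir_eval import util

    # Passes: n-by-2 array with valid interval timings.
    util.validate_intervals(np.array([[0.0, 1.5], [1.5, 3.0]]))

    # Raises ValueError: wrong shape (1-by-3), caught by the basic shape check.
    try:
        util.validate_intervals(np.array([[0.0, 1.5, 3.0]]))
    except ValueError as exc:
        print(exc)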
The implementation looks good, thanks! A few minor comments about validation and edge cases.
Also, it looks like there are no unit tests for the new functions?
Right, I missed that - somehow I thought having numbers in the …
Thanks for reviewing @bmcfee, I am slammed right now, so I can give it a final look-over once things are squared away.
I think I addressed all the concerns @bmcfee pointed out, so from my side this PR is ready to merge. Let me know if there is anything else I can improve!
        between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
        seg_ts = np.hstack([start, between_start_end, end])
        seg += dur - np.diff(seg_ts).max()
    return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])
possible div-by-0 here, since validate_intervals doesn't check for non-emptiness. Maybe add an explicit check here, and document the behavior if reference intervals are empty.
I think it does implicitly in the following check (util.py:770):

    if intervals.ndim != 2 or intervals.shape[1] != 2:
        raise ValueError('Intervals should be n-by-2 numpy ndarray, '
                         'but shape={}'.format(intervals.shape))

Here, np.atleast_2d(np.array([])).shape[1] == 0, and thus the ValueError will be raised. So, at least one interval must be present, and it must have a positive duration, and thus reference_intervals[-1, 1] - reference_intervals[0, 0] must be positive.
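For concreteness, a quick interpreter check of that claim (illustration only):

    import numpy as np

    empty = np.atleast_2d(np.array([]))
    print(empty.shape)     # (1, 0)
    print(empty.shape[1])  # 0, so the shape check above raises ValueError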
Ah, quite correct, thanks. Sorry I missed that before.
LGTM!
Great. I am about to travel to ISMIR but should get a chance to do a final pass over it in the next few days. Thanks!
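For readers following along, here is a rough sketch of the full function assembled from the diff fragments and the validation discussion above; it is not the exact merged code, just an illustration:

    import numpy as np
    from mir_eval import util


    def directional_hamming_distance(reference_intervals, estimated_intervals):
        # Basic shape/timing validation; rejects empty interval arrays but does
        # not enforce disjoint or complete segmentations (see discussion above).
        util.validate_intervals(reference_intervals)
        util.validate_intervals(estimated_intervals)

        # All estimated boundary times, used to split each reference interval.
        est_ts = np.unique(estimated_intervals.flatten())

        seg = 0.
        for start, end in reference_intervals:
            dur = end - start
            # Estimated boundaries falling inside this reference interval.
            between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
            # The largest un-cut piece of the reference interval is kept;
            # everything else counts toward the distance.
            seg_ts = np.hstack([start, between_start_end, end])
            seg += dur - np.diff(seg_ts).max()

        # validate_intervals rejects empty arrays, so (per the discussion
        # above) the denominator is assumed to be positive.
        return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])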
Minor documentation question, otherwise looks good.
mir_eval/chord.py (outdated):

    ...     ref_intervals, est_intervals)
    >>> underseg = 1 - mir_eval.chord.directional_hamming_distance(
    ...     est_intervals, ref_intervals)
    >>> meanseg = min(overseg, underseg)
I'm confused by this line --
- It's called "meanseg" but is computed as the min.
- Is this a thing people measure? If so, should we have a separate metric function for it?
I agree that the naming is weird, but I wanted to be consistent with MIREX (see http://www.music-ir.org/mirex/wiki/2017:Audio_Chord_Estimation_Results).
I'm not sure if it is meaningful to have this as a separate metric for each song (it is, after all, just min(overseg, underseg)). However, it might then be easier to recreate the metrics as used in the MIREX task, where "MeanSeg" means the mean of min(os, us) over all songs. In this case, I would prefer to call it just something like "Segmentation".
What do you think?
So they are calling it "mean" because it's the mean of the min across all songs? I guess that bothers me because all of the metrics are means across songs, correct? In terms of what I think, I honestly don't know in this case! But I will say it's typically ok to have an extra metric function if it's something that people measure, even if it's one line. The idea is that evaluate returns everything you might want to measure to report your results.
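For reference, the corpus-level MIREX "MeanSeg" figure described above is just the dataset mean of the per-song minimum; a toy illustration with made-up numbers:

    import numpy as np

    # Hypothetical per-song scores.
    overseg_per_song = np.array([0.91, 0.84, 0.78])
    underseg_per_song = np.array([0.88, 0.90, 0.71])

    # MIREX "MeanSeg": mean over songs of min(overseg, underseg) per song.
    mean_seg = np.mean(np.minimum(overseg_per_song, underseg_per_song))
    print(mean_seg)  # 0.81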
Sure, let's add a function for it, then! The only downside now is that underseg() and overseg() get computed twice when calling evaluate(). If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).
> If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).
This seems like the best option. (and sorry for the delay)
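To make that concrete, the relevant part of evaluate() could look roughly like the sketch below; _segmentation_scores is a hypothetical stand-in for the corresponding lines inside mir_eval.chord.evaluate(), and the underseg/overseg signatures are assumed from this PR:

    from mir_eval.chord import overseg, underseg

    def _segmentation_scores(ref_intervals, est_intervals):
        # Compute each score once and derive 'seg' from the two, instead of
        # calling seg(), which would recompute both directional distances.
        scores = {}
        scores['underseg'] = underseg(ref_intervals, est_intervals)
        scores['overseg'] = overseg(ref_intervals, est_intervals)
        scores['seg'] = min(scores['underseg'], scores['overseg'])
        return scores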
No worries! Should be done now. If there is anything I can do to further improve this PR, let me know.
    estimated intervals as defined by [#harte2010towards]_ and used for MIREX
    'OverSeg', 'UnderSeg' and 'MeanSeg' measures.

    Examples
Technically, since this isn't a metric function, you don't need an "Examples" section, but it certainly doesn't hurt :)
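For completeness, a fuller version of that Examples snippet as a standalone script; the .lab file names are hypothetical, and the call pattern follows the docstring above:

    import mir_eval

    # Hypothetical chord annotation files with (start, end, label) rows.
    ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('ref.lab')
    est_intervals, est_labels = mir_eval.io.load_labeled_intervals('est.lab')

    overseg = 1 - mir_eval.chord.directional_hamming_distance(
        ref_intervals, est_intervals)
    underseg = 1 - mir_eval.chord.directional_hamming_distance(
        est_intervals, ref_intervals)
    meanseg = min(overseg, underseg)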
fixed chord interval overlap check; added tests for chord interval overlap check.
Merged! Thanks so much!
* implemented over- and under-segmentation measures
* added function to merge intervals; added tests
* added more tests and input validation for directional hamming distance
* added function for "segmentation" measure
* fixed computation of 'segmentation' measure for chord evaluation; fixed chord interval overlap check; added tests for chord interval overlap check
An attempt at addressing #262.

The MeanSeg metric is not implemented, because it is the dataset average of min(overseg, underseg), and I don't think it's meaningful to add another field to the scores that just returns the minimum of the two. Let me know what you think.