
Measures for over- and under-segmentation of chord labels #263

Merged: 5 commits merged into mir-evaluation:master on Nov 30, 2017

Conversation

@fdlm (Contributor) commented Oct 4, 2017

An attempt at addressing #262.

  • The MIREX MeanSeg metric is not implemented, because it is the dataset average of min(overseg, underseg). I don't think it is meaningful to add another field to the scores that just returns the minimum of the two.
  • The results can differ slightly from those used in MIREX due to a philosophical difference: this implementation uses the adjusted intervals (i.e., it assumes an "N" detection if the estimation does not output anything), whereas in MIREX, positions without an explicit estimation are ignored.

Let me know what you think.
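
For context, the dataset-level MIREX MeanSeg described above can be recovered from the per-song scores in a couple of lines. A minimal sketch; the score lists and their values below are illustrative, not part of this PR:

    import numpy as np

    # Hypothetical per-song scores; values are illustrative only.
    overseg_scores = [0.92, 0.85, 0.78]
    underseg_scores = [0.88, 0.90, 0.70]

    # MIREX "MeanSeg": the dataset average of the per-song minimum of the
    # over- and under-segmentation scores.
    mean_seg = np.mean([min(o, u)
                        for o, u in zip(overseg_scores, underseg_scores)])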

    seg = 0.
    for start, end in reference_intervals:
        dur = end - start
        between_start_end = est_ts[(est_ts > start) & (est_ts < end)]
Collaborator:

Should only one of these comparisons be strict?

    directional hamming distance between reference intervals and
    estimated intervals.
    """
    est_ts = np.unique(estimated_intervals.flatten())
Collaborator:

Seems like this should have input validation.

Contributor (author):

Something like if estimated_intervals.shape[1] != 2, or should the check go deeper and ensure, e.g., that there are no overlaps in the intervals?

Collaborator:

Calling util.validate_intervals would probably suffice. It doesn't check for disjoint/complete segmentation (and maybe it should), but it does cover the basics of shape matching and valid interval timings.
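
A minimal sketch of how that call would slot into the new function, assuming util.validate_intervals behaves as described here (raising ValueError on malformed shapes or invalid interval timings):

    from mir_eval import util

    def directional_hamming_distance(reference_intervals, estimated_intervals):
        # Basic shape and timing checks; raises ValueError on failure.
        # Does not enforce a disjoint/complete segmentation.
        util.validate_intervals(reference_intervals)
        util.validate_intervals(estimated_intervals)
        ...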

@bmcfee (Collaborator) left a comment:

The implementation looks good, thanks! A few minor comments about validation and edge cases.

Also, it looks like there are no unit tests for the new functions?

@fdlm (Contributor, author) commented Oct 5, 2017

Right, I missed that; somehow I thought having numbers in the outputXX.json was enough. I will write tests that cover some edge cases.

@craffel (Collaborator) commented Oct 5, 2017

Thanks for reviewing, @bmcfee. I am slammed right now, but I can give it a final look-over once things are squared away.

@fdlm (Contributor, author) commented Oct 11, 2017

I think I have addressed all the concerns @bmcfee pointed out, so from my side this PR is ready to merge. Let me know if there is anything else I can improve!

        between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
        seg_ts = np.hstack([start, between_start_end, end])
        seg += dur - np.diff(seg_ts).max()
    return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])
Collaborator:

Possible division by zero here, since validate_intervals doesn't check for non-emptiness. Maybe add an explicit check, and document the behavior if the reference intervals are empty.

Contributor (author):

I think it does, implicitly, via the following check (util.py:770):

    if intervals.ndim != 2 or intervals.shape[1] != 2:
        raise ValueError('Intervals should be n-by-2 numpy ndarray, '
                         'but shape={}'.format(intervals.shape))

Here, np.atleast_2d(np.array([])).shape[1] == 0, so the ValueError will be raised. Hence at least one interval must be present, and it must have a positive duration, so reference_intervals[-1, 1] - reference_intervals[0, 0] must be positive.
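
For concreteness, the numpy behavior this argument relies on (a quick interactive check, not part of the PR):

    >>> import numpy as np
    >>> np.atleast_2d(np.array([])).shape
    (1, 0)

Since shape[1] is 0 rather than 2, the check above raises ValueError for empty input.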

Collaborator:

Ah, quite correct, thanks. Sorry I missed that before.

LGTM!
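
Putting the hunks quoted in this thread together, the function under review reads roughly as follows (a reconstruction from the fragments above, not the merged source verbatim):

    import numpy as np
    from mir_eval import util

    def directional_hamming_distance(reference_intervals, estimated_intervals):
        """Compute the directional hamming distance between reference
        intervals and estimated intervals."""
        util.validate_intervals(reference_intervals)
        util.validate_intervals(estimated_intervals)

        # All distinct boundary times of the estimated segmentation.
        est_ts = np.unique(estimated_intervals.flatten())
        seg = 0.
        for start, end in reference_intervals:
            dur = end - start
            # Estimated boundaries falling inside this reference interval;
            # the lower bound is inclusive after the review fix above.
            between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
            seg_ts = np.hstack([start, between_start_end, end])
            # Accumulate the part of the reference interval not covered by
            # its largest overlapping estimated segment.
            seg += dur - np.diff(seg_ts).max()
        # Non-empty, positive-duration intervals guarantee a non-zero
        # denominator (see the discussion above).
        return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])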

@craffel (Collaborator) commented Oct 19, 2017

Great. I am about to travel to ISMIR, but I should get a chance to do a final pass over it in the next few days. Thanks!

@craffel (Collaborator) left a comment:

Minor documentation question, otherwise looks good.

    >>> overseg = 1 - mir_eval.chord.directional_hamming_distance(
    ...     ref_intervals, est_intervals)
    >>> underseg = 1 - mir_eval.chord.directional_hamming_distance(
    ...     est_intervals, ref_intervals)
    >>> meanseg = min(overseg, underseg)
Collaborator:

I'm confused by this line:

  • It's called "meanseg" but is computed as the min.
  • Is this a thing people measure? If so, should we have a separate metric function for it?

Contributor (author):

I agree that the naming is weird, but I wanted to be consistent with MIREX (see http://www.music-ir.org/mirex/wiki/2017:Audio_Chord_Estimation_Results).

I'm not sure it is meaningful to have this as a separate metric for each song (it is, after all, just min(overseg, underseg)). However, it would then be easier to recreate the metrics as used in the MIREX task, where "MeanSeg" means the mean of min(os, us) over all songs. In that case, I would prefer to call it just something like "Segmentation".

What do you think?

Collaborator:

So they are calling it "mean" because it's the mean of the min across all songs? I guess that bothers me, because all of the metrics are means across songs, correct? As for what I think: I honestly don't know in this case! But typically it's okay to have an extra metric function if it's something people measure, even if it's a one-liner. The idea is that evaluate() returns everything you might want to report with your results.

Contributor (author):

Sure, let's add a function for it, then! The only downside is that underseg() and overseg() get computed twice when calling evaluate(). If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).

Collaborator:

> If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).

This seems like the best option. (and sorry for the delay)
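
A minimal sketch of that option inside evaluate(); the overseg/underseg formulas come from the PR's doctest above, while the surrounding evaluate() code and variable names are assumed:

    # One-liners matching the doctest: the over-/under-segmentation scores.
    def overseg(ref_intervals, est_intervals):
        return 1 - directional_hamming_distance(ref_intervals, est_intervals)

    def underseg(ref_intervals, est_intervals):
        return 1 - directional_hamming_distance(est_intervals, ref_intervals)

    # Inside evaluate() (sketch): reuse the two scores instead of calling
    # seg(), which would recompute both directional hamming distances.
    scores = {}
    scores['overseg'] = overseg(ref_intervals, est_intervals)
    scores['underseg'] = underseg(ref_intervals, est_intervals)
    scores['seg'] = min(scores['underseg'], scores['overseg'])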

Contributor (author):

No worries! Should be done now. If there is anything I can do to further improve this PR, let me know.

    estimated intervals as defined by [#harte2010towards]_ and used for MIREX
    'OverSeg', 'UnderSeg' and 'MeanSeg' measures.

    Examples
Collaborator:

Technically, since this isn't a metric function, you don't need an "Examples" section, but it certainly doesn't hurt :)

fdlm added 2 commits on October 25, 2017 (fixed chord interval overlap check; added tests for chord interval overlap check).
@craffel merged commit 58ad872 into mir-evaluation:master on Nov 30, 2017
@craffel (Collaborator) commented Nov 30, 2017

Merged! Thanks so much!

craffel pushed a commit that referenced this pull request on Jun 22, 2018:
* implemented over- and under-segmentation measures

* added function to merge intervals; added tests

* added more tests and input validation for directional hamming distance

* added function for "segmentation" measure

* fixed computation of 'segmentation' measure for chord evaluation;
fixed chord interval overlap check; added tests for chord interval overlap check.