Sorting keys api #3902

DeNeutoy · 2020-03-04T21:08:50Z

Radical idea for sorting instances - we don't call get_padding_lengths at all.
Instead, we just specify what fields we want to sort by, and implement __len__ for all fields.

I'm pretty sure this has zero downsides, and many upsides:

1. Bucketing no longer requires the instances to be indexed, which is a potentially expensive operation.
2. Configs are simpler, because you just pass the name of the field you want to sort by.
3. There are cases in which len(field) doesn't correspond directly to what will be padded, e.g. wordpiece tokenizers may make the two slightly different. We should be willing to gamble that sentence length and wordpieced sentence length are highly correlated in the general case, so bucketing like this is actually fine.

The only edge case I can think of is ListFields of TextFields, but it's unclear to me how "efficient" you can be in that case even if you bucket things perfectly.
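
For illustration, here is a minimal sketch of the idea described above: sort by len() of the named fields, with no indexing and no call to get_padding_lengths. It is not the code in this PR. ToyTextField, bucket_by_length, and the sorting_keys argument are hypothetical names, and instances are assumed to be plain dicts from field name to field.

```python
import random
from typing import Dict, List, Sequence


class ToyTextField:
    """Hypothetical stand-in for a text field; only __len__ matters here."""

    def __init__(self, tokens: Sequence[str]) -> None:
        self.tokens = list(tokens)

    def __len__(self) -> int:
        # Length in tokens, available before any indexing happens.
        return len(self.tokens)


def bucket_by_length(
    instances: Sequence[Dict[str, ToyTextField]],
    sorting_keys: Sequence[str],
    batch_size: int,
    seed: int = 0,
) -> List[List[int]]:
    """Group instance indices into batches of similar field length."""

    def sort_key(index: int):
        # Compare instances by len(field) for each named field, in order.
        return tuple(len(instances[index][name]) for name in sorting_keys)

    order = sorted(range(len(instances)), key=sort_key)
    batches = [order[i : i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches so training does not always see short inputs first.
    random.Random(seed).shuffle(batches)
    return batches


sentences = ["a b c d e", "a", "a b c d", "a b", "a b c", "a b c d e f"]
toy_instances = [{"tokens": ToyTextField(s.split())} for s in sentences]
toy_batches = bucket_by_length(toy_instances, sorting_keys=["tokens"], batch_size=2)
```

Each batch holds instances of neighbouring lengths, so padding stays small even though the lengths are only a proxy for the final padded size.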

viking-sudo-rm · 2020-03-04T21:57:55Z

This makes sense to me. It's definitely simpler than the proposed patch in my PR where we just elevate one field in each indexer to "default sort" status.

As to point #3, I agree. If it's the case that the word lengths and word piece lengths are not correlated (imagining a case with complex morphology in Finnish), then you're probably doing something wrong anyway, and should consider using a better word-level tokenizer. But maybe there are other cases where this assumption breaks down.

DeNeutoy · 2020-03-04T22:36:43Z

@matt-gardner I know you're traveling, but what do you think of this idea? I can finish it up quite easily, but I thought you might have some objections.

matt-gardner · 2020-03-04T23:39:27Z

Haven't looked at code, but this seems like a good idea to me. The main question I have is what you also listed - when there are multiple potential sorting options, what should you sort by? You used to be able to control this, now you can't. Maybe it's worth having some constructor argument for ambiguous cases?

DeNeutoy · 2020-03-05T00:02:00Z

I guess my point is that we have increasingly few scenarios where you actually do need to control this, and that often in the case that you do need that control, sorting by something else is a good approximation anyway?

dirkgr · 2020-03-05T18:31:44Z

What's the escape hatch for people who really need to batch a certain way? Implement their own Sampler?

DeNeutoy · 2020-03-05T18:37:52Z

Yeah, I think so.
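
A rough sketch of what such an escape hatch might look like, assuming the only requirement is an iterable that yields lists of indices (the shape PyTorch's DataLoader accepts through its batch_sampler argument). CustomKeyBatchSampler and its key argument are hypothetical and not part of this PR.

```python
from typing import Any, Callable, Iterator, List, Sequence


class CustomKeyBatchSampler:
    """Hypothetical batch sampler that groups items by a user-supplied key."""

    def __init__(
        self, data: Sequence[Any], key: Callable[[Any], Any], batch_size: int
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.data = data
        self.key = key
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[int]]:
        # Sort indices by whatever notion of "size" the user cares about.
        order = sorted(range(len(self.data)), key=lambda i: self.key(self.data[i]))
        for start in range(0, len(order), self.batch_size):
            yield order[start : start + self.batch_size]

    def __len__(self) -> int:
        # Number of batches, rounding up for a final partial batch.
        return (len(self.data) + self.batch_size - 1) // self.batch_size


toy_data = ["aaa", "a", "aaaa", "aa", "aaaaa"]
toy_sampler = CustomKeyBatchSampler(toy_data, key=len, batch_size=2)
```

Here the key is plain len, but it could be any function, for example one that measures the wordpiece length of a field.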

DeNeutoy added 2 commits March 4, 2020 13:01

new idea for sorting

3a89cf4

add len to all fields

add2abf

update references to sorting keys

1e76936

DeNeutoy requested a review from dirkgr March 5, 2020 01:46

viking-sudo-rm mentioned this pull request Mar 5, 2020

Default sort key for token indexers #3876

Closed

dirkgr approved these changes Mar 5, 2020

DeNeutoy merged commit 644ef22 into allenai:master Mar 5, 2020

DeNeutoy deleted the sorting-keys-api branch March 5, 2020 18:38
