Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Sorting keys api #3902

Merged
merged 3 commits into from
Mar 5, 2020
Merged

Sorting keys api #3902

merged 3 commits into from
Mar 5, 2020

Conversation

DeNeutoy
Copy link
Contributor

@DeNeutoy DeNeutoy commented Mar 4, 2020

Fixes #3664

Radical idea for sorting instances - we don't call get_padding_lengths at all.
Instead, we just specify what fields we want to sort by, and implement __len__ for all fields.

I'm pretty sure this has zero downsides, and many upsides:

  1. Bucketing now no-longer requires the instances to be indexed, which is a potentially expensive operation.

  2. Configs are simpler, because you just pass the name of the field you want to sort by.

  3. There are cases for which len(field) doesn't correspond directly to what will be padded - e.g wordpiece tokenizers may cause this to be slightly different. We should be willing to gamble that sentence length and wordpieced sentence length are highly correlated in the general case, and so bucketing like this is actually fine.

The only edge case I can think of is listfields of textfields, but it's unclear to me how "efficient" you can be even if you bucket things perfectly in that case.

@viking-sudo-rm
Copy link
Contributor

viking-sudo-rm commented Mar 4, 2020

This makes sense to me. It's definitely simpler than the proposed patch in my PR where we just elevate one field in each indexer to "default sort" status.

As to point #3, I agree. If it's the case that the word lengths and word piece lengths are not correlated (imagining a case with complex morphology in Finnish), then you're probably doing something wrong anyway, and should consider using a better word-level tokenizer. But maybe there are other cases where this assumption breaks down.

@DeNeutoy
Copy link
Contributor Author

DeNeutoy commented Mar 4, 2020

@matt-gardner I know you're traveling, but what do you think of this idea? I can finish it up quite easily, but I thought you might have some objections.

@matt-gardner
Copy link
Contributor

Haven't looked at code, but this seems like a good idea to me. The main question I have is what you also listed - when there are multiple potential sorting options, what should you sort by? You used to be able to control this, now you can't. Maybe it's worth having some constructor argument for ambiguous cases?

@DeNeutoy
Copy link
Contributor Author

DeNeutoy commented Mar 5, 2020

I guess my point is that we have increasingly few scenarios where you actually do need to control this, and that often in the case that you do need that control, sorting by something else is a good approximation anyway?

@DeNeutoy DeNeutoy requested a review from dirkgr March 5, 2020 01:46
@dirkgr
Copy link
Member

dirkgr commented Mar 5, 2020

What's the escape hatch for people who really need to batch a certain way? Implement their own Sampler?

@DeNeutoy
Copy link
Contributor Author

DeNeutoy commented Mar 5, 2020

Yeah, I think so.

@DeNeutoy DeNeutoy merged commit 644ef22 into allenai:master Mar 5, 2020
@DeNeutoy DeNeutoy deleted the sorting-keys-api branch March 5, 2020 18:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unintuitive sorting_keys scoping in new token indexers
4 participants