Default sort key for token indexers #3876

viking-sudo-rm · 2020-02-28T22:45:39Z

Gave each TokenIndexer a default sorting key, so that if the user wants to manually specify sorting keys for a BucketBatchSampler, they do not need to know the internal key names used by the TokenIndexer.

For example, say you have a TextField called "text" with a PretrainedTransformerIndexer called "roberta". The sorting keys would now be (text, roberta) instead of (text, roberta___token_ids).

I gave each Field an expand_sort_key method, which rewrites a user-specified shorthand sort key into the underlying internal format. Currently TextField is the only field that uses it, although it would be straightforward to extend the same pattern to other fields.
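To make the expansion concrete, here is a minimal sketch of the idea. It does not use the real AllenNLP classes: SketchIndexer and SketchTextField are hypothetical stand-ins, and the pass-through rule for old-style keys is an assumption about how the rewrite could behave, not a copy of the code in this PR.

```python
from typing import Dict


class SketchIndexer:
    """Hypothetical stand-in for a TokenIndexer (not the AllenNLP class)."""

    def __init__(self, default_key: str) -> None:
        self._default_key = default_key

    def get_default_sort_key(self) -> str:
        # The internal key this indexer would prefer to be sorted on.
        return self._default_key


class SketchTextField:
    """Hypothetical stand-in for a TextField holding named indexers."""

    def __init__(self, token_indexers: Dict[str, SketchIndexer]) -> None:
        self.token_indexers = token_indexers

    def expand_sort_key(self, key: str) -> str:
        # Assumption: old-style keys already carry the internal name,
        # so they are returned unchanged.
        if "___" in key:
            return key
        if key not in self.token_indexers:
            raise KeyError(f"Unknown indexer name: {key}")
        default_key = self.token_indexers[key].get_default_sort_key()
        return f"{key}___{default_key}"


field = SketchTextField({"roberta": SketchIndexer("token_ids")})
short = field.expand_sort_key("roberta")
legacy = field.expand_sort_key("roberta___token_ids")
```

In this sketch both the shorthand key and the old-style key resolve to the same internal name, which is the behavior the description aims for.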

Fixes #3664

viking-sudo-rm · 2020-02-28T23:25:38Z

Also worth mentioning that this should be fully backwards compatible: old-style roberta___token_ids sort keys should still work.

DeNeutoy

@viking-sudo-rm - unfortunately this isn't quite right, as the key-expansion logic has leaked into the sampler, which should be completely independent of it.

Right now, we have something that looks like:

sorting_keys: ["tokens", "{indexer_name}___{name_of_some_returned_key_from_indexer}"]

The idea of this PR was to replace the underscore notation entirely with just the indexer_name, since we can then call the get_default_sort_key method you added to retrieve the actual key name. So the new config could look like:

sorting_keys: ["tokens", "{indexer_name}"]

And then we could replace this line with something like padding_key = self.token_indexers[indexer_name].get_default_sort_key().
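A rough sketch of that lookup, under stated assumptions: SketchIndexer, SketchTextField, sort_length, and the padding_lengths dictionary are hypothetical stand-ins rather than the real AllenNLP classes, and only the get_default_sort_key call mirrors the line suggested above. The point is that the sorting code only ever sees the field name and the indexer name.

```python
from typing import Dict


class SketchIndexer:
    """Hypothetical stand-in for a TokenIndexer (not the AllenNLP class)."""

    def __init__(self, default_key: str) -> None:
        self._default_key = default_key

    def get_default_sort_key(self) -> str:
        return self._default_key


class SketchTextField:
    """Hypothetical field that resolves the internal padding key itself."""

    def __init__(
        self,
        token_indexers: Dict[str, SketchIndexer],
        padding_lengths: Dict[str, int],
    ) -> None:
        self.token_indexers = token_indexers
        self.padding_lengths = padding_lengths

    def sort_length(self, indexer_name: str) -> int:
        # The field, not the sampler, maps the indexer name to the internal key.
        padding_key = self.token_indexers[indexer_name].get_default_sort_key()
        return self.padding_lengths[f"{indexer_name}___{padding_key}"]


indexers = {"roberta": SketchIndexer("token_ids")}
instances = [
    {"tokens": SketchTextField(indexers, {"roberta___token_ids": 7})},
    {"tokens": SketchTextField(indexers, {"roberta___token_ids": 3})},
]

# New-style sorting key: field name plus indexer name only.
field_name, indexer_name = ["tokens", "roberta"]
ordered = sorted(
    instances, key=lambda inst: inst[field_name].sort_length(indexer_name)
)
lengths = [inst[field_name].sort_length(indexer_name) for inst in ordered]
```

With this split the sampler needs no expansion step of its own, which is the independence being asked for here.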

Does that make sense? We can chat today about this if you have some Qs

DeNeutoy · 2020-03-02T18:46:23Z

allennlp/data/samplers/bucket_batch_sampler.py

@@ -144,5 +153,22 @@ def _guess_sorting_keys(self, instances: Iterable[Instance], num_instances: int
            )
        self.sorting_keys = [longest_padding_key]

+    def _expand_sorting_keys(self, schema: Instance) -> None:


The fact that we need this to exist on the Sampler class means this design needs to be revisited.

viking-sudo-rm · 2020-03-05T18:36:22Z

Closing since #3902 is potentially a better fix.

Will Merrill added 4 commits February 28, 2020 13:55

Updated TokenIndexer API to support default sort key.

d3c28bc

Deleted weird copy file.

f953a44

Added test coverage for expand sorting keys.

d5d0a90

Added more error checking.

8333532

viking-sudo-rm changed the title ~~Default sort key for each kind of token indexer~~ Default sort key for token indexers Feb 28, 2020

Will Merrill added 2 commits February 28, 2020 14:48

ValueError -> ConfigurationError.

7f15695

Linting.

5197993

Fixed trailing whitespace.

b9b441c

DeNeutoy suggested changes Mar 2, 2020

viking-sudo-rm closed this Mar 5, 2020
