Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Default sort key for token indexers #3876

Closed
wants to merge 7 commits into from
Closed

Default sort key for token indexers #3876

wants to merge 7 commits into from

Conversation

viking-sudo-rm
Copy link
Contributor

@viking-sudo-rm viking-sudo-rm commented Feb 28, 2020

Gave each TokenIndexer a default sorting key, so that if the user wants to manually specify sorting keys for a BucketBatchSampler, they do not need to know the internal key names used by the TokenIndexer.

For example, say you have a TextField called "text" with a PretrainedTransformerIndexer called "roberta". The sorting keys would now be (text, roberta) instead of (text, roberta___token_ids).

I gave each Field an expand_sort_key method, which can be used to rewrite default user-specified sort keys into the underlying internal format. Currently, the only field that utilizes this is TextField, although it would be straightforward to extend the same pattern to other fields.

Fixes #3664

@viking-sudo-rm viking-sudo-rm changed the title Default sort key for each kind of token indexer Default sort key for token indexers Feb 28, 2020
@viking-sudo-rm
Copy link
Contributor Author

Also worth mentioning that this should be fully backwards compatible. In other words, old-style roberta___tokens_ids sort keys should still work.

Copy link
Contributor

@DeNeutoy DeNeutoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viking-sudo-rm - unfortunately this isn't quite right, as it's leaked into the sampler which it should be completely independent from.

Right now, we have something that looks like:

sorting_keys: ["tokens", "{indexer_name}___{name_of_some_returned_key_from_indexer}"]

The idea of this PR was to actually replace entirely the underscore notation with the indexer_name only, as then we can call the method that you have added get_default_sort_key to retrieve the actual name. So the idea was that the new config could look like:

sorting_keys: ["tokens", "{indexer_name}"]

And then we could replace this line with something like padding_key = self.token_indexers[indexer_name].get_default_sort_key().

Does that make sense? We can chat today about this if you have some Qs

@@ -144,5 +153,22 @@ def _guess_sorting_keys(self, instances: Iterable[Instance], num_instances: int
)
self.sorting_keys = [longest_padding_key]

def _expand_sorting_keys(self, schema: Instance) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we need this to exist on the Sampler class means this design needs to be revisited.

@viking-sudo-rm
Copy link
Contributor Author

Closing since #3902 is potentially a better fix.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unintuitive sorting_keys scoping in new token indexers
2 participants