Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

unintuitive sorting_keys scoping in new token indexers #3664

Closed
DeNeutoy opened this issue Jan 21, 2020 · 10 comments · Fixed by #3902
Closed

unintuitive sorting_keys scoping in new token indexers #3664

DeNeutoy opened this issue Jan 21, 2020 · 10 comments · Fixed by #3902
Assignees
Milestone

Comments

@DeNeutoy
Copy link
Contributor

@viking-sudo-rm and I ran into some poor config design resulting from #3597

sorting_keys: ["tokens", "num_tokens"] in most configs is now replaced by:

sorting_keys: ["tokens", "{indexer_name}___{name_of_some_returned_key_from_indexer}"]

In Will's case, this looked like:
sorting_keys: ["tokens", "{roberta}___{token_ids}"]

This is pretty obscure and hard to use.

Possible solutions:

I think the best solution to this problem is for each TokenIndexer to have a method output_field_to_sort_by() -> str or something like this. By default, this would just be e.g "tokens". This is much better, because it means that sorting_keys now could be "["tokens", "roberta"]", both of which are available in a user's config. We should almost certainly do this.

@matt-gardner it would be good to get your opinion here, what do you think of my suggestion?

@DeNeutoy DeNeutoy changed the title sorting_keys scoping in new token indexers unintuitive sorting_keys scoping in new token indexers Jan 21, 2020
@matt-gardner
Copy link
Contributor

No objections from me, that seems reasonable. I do think that almost all use cases can just delete that line entirely, though, as the auto-detection should do the right thing in basically all cases.

@DeNeutoy
Copy link
Contributor Author

Apart from it iterates over the data, which in some cases (like if your data is infinite) is extremely bad, but good that you agree 👍

@DeNeutoy
Copy link
Contributor Author

Possibly instead/as well as this, we should augment #3603 to only iterate over a small amount of data, because in the general case this will work

@DeNeutoy DeNeutoy added this to the 1.0.0 milestone Jan 24, 2020
@brendan-ai2
Copy link
Contributor

Probably everybody else realized this already, but just in case, given how _memory_sized_lists works, #3603 will only consider the first max_instances_in_memory or -- if the reader is lazy with no max -- the first batch_size instances. So this seems like pretty good behavior as is.

@viking-sudo-rm
Copy link
Contributor

If people still think it's worth making the fix suggested by @DeNeutoy, I'd be happy to submit a PR addressing it.

@matt-gardner
Copy link
Contributor

#3812 (which will get integrated into #3700) handles the biggest issue here, but actually changing the key would definitely also be an improvement. I think a big problem will be how any change here interacts with how the bucket sampler / iterator actually does sorting. If there's a good way to fix the key and make sure that this still works, then yes, a PR would be great.

@dirkgr
Copy link
Member

dirkgr commented Feb 24, 2020

That sounds like @viking-sudo-rm should not pick this up right now, because too much other stuff will change around this soon.

@matt-gardner
Copy link
Contributor

Yes, I think that waiting for #3700 to get merged is a good idea, but after that, I don't think anything else around this is changing.

@viking-sudo-rm
Copy link
Contributor

Got it. I'll wait for #3700, and assuming nothing else blocking comes up, will start work on this.

@DeNeutoy
Copy link
Contributor Author

@viking-sudo-rm - you can start this now, as #3700 has been merged 👍

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants