unintuitive sorting_keys scoping in new token indexers #3664
Comments
No objections from me, that seems reasonable. I do think that almost all use cases can just delete that line entirely, though, as the auto-detection should do the right thing in basically all cases.
Apart from the fact that it iterates over the data, which in some cases (like if your data is infinite) is extremely bad, but good that you agree 👍
Possibly instead of this, or as well as it, we should augment #3603 to only iterate over a small amount of data, because in the general case this will work.
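To make the "only iterate over a small amount of data" idea concrete, here is a minimal sketch in plain Python. It does not use the library's classes: `PaddingLengths`, `guess_sorting_key`, `infinite_instances`, and the `max_instances` parameter are all hypothetical names, and picking the key with the largest observed length is just one plausible heuristic, not necessarily what #3603 does.

```python
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in: each instance maps (field_name, padding_key) -> length.
PaddingLengths = Dict[Tuple[str, str], int]


def guess_sorting_key(
    instances: Iterable[PaddingLengths], max_instances: int = 10
) -> Tuple[str, str]:
    """Pick the (field, padding key) with the largest length in a small prefix."""
    best_key: Optional[Tuple[str, str]] = None
    best_length = -1
    # islice stops after max_instances, so an infinite stream is never exhausted.
    for lengths in islice(instances, max_instances):
        for key, length in lengths.items():
            if length > best_length:
                best_key, best_length = key, length
    if best_key is None:
        raise ValueError("no instances available to guess sorting keys from")
    return best_key


def infinite_instances() -> Iterator[PaddingLengths]:
    """An endless stream of fake instances, to show that the guess terminates."""
    for i in count():
        yield {
            ("tokens", "roberta___token_ids"): 5 + i % 3,
            ("label", "num_labels"): 1,
        }


guessed = guess_sorting_key(infinite_instances(), max_instances=10)
```

Because only a bounded prefix is read, the cost of guessing no longer depends on the size of the dataset.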
Probably everybody else realized this already, but just in case, given how
If people still think it's worth making the fix suggested by @DeNeutoy, I'd be happy to submit a PR addressing it.
#3812 (which will get integrated into #3700) handles the biggest issue here, but actually changing the key would definitely also be an improvement. I think a big problem will be how any change here interacts with how the bucket sampler / iterator actually does sorting. If there's a good way to fix the key and make sure that this still works, then yes, a PR would be great.
That sounds like @viking-sudo-rm should not pick this up right now, because too much other stuff will change around this soon.
Yes, I think that waiting for #3700 to get merged is a good idea, but after that, I don't think anything else around this is changing.
Got it. I'll wait for #3700, and assuming nothing else blocking comes up, will start work on this.
@viking-sudo-rm - you can start this now, as #3700 has been merged 👍
@viking-sudo-rm and I ran into some poor config design resulting from #3597:
sorting_keys: ["tokens", "num_tokens"]
in most configs is now replaced by:
sorting_keys: ["tokens", "{indexer_name}___{name_of_some_returned_key_from_indexer}"]
In Will's case, this looked like:
sorting_keys: ["tokens", "{roberta}___{token_ids}"]
This is pretty obscure and hard to use.
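For readers wondering where a key of that shape comes from, the following is a rough sketch, assuming the braces above are placeholders and that each indexer's padding keys are flattened under its name with a triple underscore. The function `namespaced_padding_lengths` and the example lengths are hypothetical illustrations, not the library's code.

```python
from typing import Dict


def namespaced_padding_lengths(
    per_indexer_lengths: Dict[str, Dict[str, int]]
) -> Dict[str, int]:
    """Flatten {indexer_name: {key: length}} into {"indexer_name___key": length}."""
    flattened: Dict[str, int] = {}
    for indexer_name, lengths in per_indexer_lengths.items():
        for key, length in lengths.items():
            # The user has to know both parts to write a sorting key by hand.
            flattened[f"{indexer_name}___{key}"] = length
    return flattened


# Hypothetical indexer named "roberta" that returns "token_ids" and "mask".
lengths = namespaced_padding_lengths({"roberta": {"token_ids": 12, "mask": 12}})
```

The second half of each flattened key is an internal detail of the indexer, which is why it is awkward to have to spell it out in a config file.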
Possible solutions:
We don't care, as #3603 (Guess sorting keys when none are given to BucketIterator) means most config files can just delete this line (at the cost of iterating over the data to guess them before you start training)
We change the sorting keys to be a triple (field_name, indexer_name, return_name_you_have_to_know)
I think the best solution to this problem is for each TokenIndexer to have a method
output_field_to_sort_by() -> str
or something like this. By default, this would just be e.g. "tokens". This is much better, because it means that sorting_keys could now be ["tokens", "roberta"], both of which are available in a user's config. We should almost certainly do this.
@matt-gardner it would be good to get your opinion here, what do you think of my suggestion?
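A rough sketch of how this proposal could work, in plain Python rather than the library's own classes. `SketchTokenIndexer`, `SketchRobertaIndexer`, and `resolve_sorting_key` are hypothetical names, and expanding to the `___` form internally is an assumption about how the result would be consumed.

```python
from typing import Dict, Tuple


class SketchTokenIndexer:
    """Hypothetical stand-in for a TokenIndexer; not the library's class."""

    def output_field_to_sort_by(self) -> str:
        # Default suggested in the issue: sort by the "tokens" output.
        return "tokens"


class SketchRobertaIndexer(SketchTokenIndexer):
    """Hypothetical indexer whose sortable output is called "token_ids"."""

    def output_field_to_sort_by(self) -> str:
        return "token_ids"


def resolve_sorting_key(
    field_name: str, indexer_name: str, indexers: Dict[str, SketchTokenIndexer]
) -> Tuple[str, str]:
    """Expand a user-facing (field, indexer) pair into the internal key."""
    indexer = indexers[indexer_name]
    return field_name, f"{indexer_name}___{indexer.output_field_to_sort_by()}"


indexers = {"roberta": SketchRobertaIndexer(), "plain": SketchTokenIndexer()}
# The user only writes names that already appear in their config.
resolved = resolve_sorting_key("tokens", "roberta", indexers)
```

With this shape, the config names only the field and the indexer, and each indexer is responsible for knowing which of its outputs is the one to sort by.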