Access to pre_tokenizer for PreTrainedTokenizer #26254
Comments
Hey! The equivalent of …
@ArthurZucker Thanks for your reply! That's unfortunate. One would expect that the two classes derive from the same base class and that that base class offers pretokenisation (and postprocessing, while we're at it). I did see the … wherein I assume the …
Usually the …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request

Give access to setting a `pre_tokenizer` for a `transformers.PreTrainedTokenizer`, similar to how this works for `PreTrainedTokenizerFast`.

Motivation
As far as I understand from these docs, there are two interfaces for interacting with tokenizers in the HuggingFace ecosystem: `PreTrainedTokenizerFast` is a wrapper around Rust code, and `PreTrainedTokenizer` is supposed to be the slow Python equivalent.

`PreTrainedTokenizerFast` has a property `backend_tokenizer`, which is a `tokenizers.Tokenizer` object; that object has a `pre_tokenizer` property and is built from a `tokenizers.models.Model` subclass (the thing that does the tokenization). You can instantiate a `PreTrainedTokenizerFast` from such a `Tokenizer` object with the constructor argument `tokenizer_object`. Meanwhile, none of this is accessible for a `PreTrainedTokenizer`.
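For contrast, on the fast side all of this is exposed; something along these lines works (a rough sketch for illustration only, with a toy `WordLevel` vocab and a `Whitespace` pre-tokenizer that I picked arbitrarily):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Build a Rust-backed tokenizer: a Model (here a toy WordLevel vocab)
# plus a pre_tokenizer that splits text into words first.
backend = Tokenizer(WordLevel(vocab={"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# Wrap it in the fast tokenizer class via the `tokenizer_object` argument.
fast_tok = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

print(fast_tok.backend_tokenizer.pre_tokenizer)  # the Whitespace pre-tokenizer set above
print(fast_tok.tokenize("hello world"))          # pre-tokenized into words, then looked up
```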
Here is my use-case: I have a function `tokenizeWord(w: str)`, implemented entirely in Python, that segments a single word into subwords. I would now like to (1) build a `PreTrainedTokenizer` from this function, and (2) give that tokenizer a pre-tokenizer so it can handle more than single words. I can do the first as follows (at least I think this is how it's supposed to be done; see the sketch below):
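Roughly this (a minimal sketch; the exact set of methods a slow tokenizer has to override may differ, and `MyWordTokenizer`, the toy vocab, and the stand-in body of `tokenizeWord` are just placeholders):

```python
from transformers import PreTrainedTokenizer

def tokenizeWord(w: str) -> list[str]:
    # stand-in for my real pure-Python word segmenter
    return [w[i:i + 3] for i in range(0, len(w), 3)]

class MyWordTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab: dict[str, int], unk_token: str = "[UNK]", **kwargs):
        # set up the vocab before calling super().__init__(), which inspects it
        self._token_to_id = dict(vocab)
        self._id_to_token = {i: t for t, i in vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self) -> int:
        return len(self._token_to_id)

    def get_vocab(self) -> dict[str, int]:
        return dict(self._token_to_id)

    def _tokenize(self, text: str) -> list[str]:
        # this is where my word-level function plugs in, but it only makes
        # sense if `text` is already a single word
        return tokenizeWord(text)

    def _convert_token_to_id(self, token: str) -> int:
        return self._token_to_id.get(token, self._token_to_id[self.unk_token])

    def _convert_id_to_token(self, index: int) -> str:
        return self._id_to_token.get(index, self.unk_token)
```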
But where does the pre-tokenizer come in? It doesn't even seem feasible to manually use the pre-tokenizers provided by `tokenizers.pre_tokenizers` (e.g. `Whitespace`, to name one), because those all provide Rust interfaces, and hence the objects they output don't work with a simple string segmentation function.
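To make the mismatch concrete, this is roughly what the Rust-side interface looks like (a sketch; I may be misreading the `tokenizers` API):

```python
from tokenizers import PreTokenizedString
from tokenizers.pre_tokenizers import Whitespace

pre = Whitespace()

# The in-place pre_tokenize() method operates on the Rust-backed
# PreTokenizedString wrapper, not on a plain str, so it cannot simply be
# chained with a Python word-segmentation function.
s = PreTokenizedString("The quick brown fox")
pre.pre_tokenize(s)

# There is also pre_tokenize_str(), which returns (piece, (start, end)) tuples,
# but there is no hook on the slow PreTrainedTokenizer where a pre-tokenizer
# like this could be installed.
print(pre.pre_tokenize_str("The quick brown fox"))
# -> something like [('The', (0, 3)), ('quick', (4, 9)), ...]
```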
Your contribution

None.