
[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID #29489

Closed
4 tasks
Ki-Seki opened this issue Mar 6, 2024 · 7 comments · Fixed by #32564
Labels
  • Core: Tokenization (Internals of the library; Tokenization)
  • Good Second Issue (Issues that are more difficult to do than "Good First" issues - give it a try if you want!)

Comments

@Ki-Seki
Contributor

Ki-Seki commented Mar 6, 2024

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
  • Python version: 3.8.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: no need
  • Using distributed or parallel set-up in script?: no need

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Rough estimate: around 15 models have this issue.

Output

<<<<# # ~>>>>
<<<<##~>>>>
<<<<# # ~>>>>
<<<<##~>>>>

Expected behavior

Consistent behavior. For example, when decoding the single ID, the output should also be ##~.

Suspected cause: in src/transformers/tokenization_utils.py, the _decode function incorrectly applies spaces_between_special_tokens and ends up inserting spaces between the sub-tokens.
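
To make the suspected mechanism concrete, here is a minimal, self-contained sketch (sketch_decode is a hypothetical stand-in for illustration, not the actual _decode code): for a single int ID, convert_ids_to_tokens returns a bare str instead of a list of str, and iterating over a str yields its characters.

from typing import List, Union

def sketch_decode(filtered_tokens: Union[str, List[str]]) -> str:
    # Iterating a bare str (single-ID case) yields characters: "##~" -> ["#", "#", "~"].
    sub_texts = [token for token in filtered_tokens]
    # With spaces_between_special_tokens=True, the pieces are joined with spaces.
    return " ".join(sub_texts)

print(sketch_decode("##~"))    # '# # ~'  (single ID: the bare str is iterated char by char)
print(sketch_decode(["##~"]))  # '##~'    (list of IDs: the token survives intact)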

@Ki-Seki Ki-Seki changed the title [Tokenizer] Inconsistent behavior when decoding the single id and the list of the single id [Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID Mar 6, 2024
@ArthurZucker
Collaborator

That's very interesting, and I can confirm we have this issue.
Gemma just errors out if you pass an int instead of a list, with no proper warning, while the fast tokenizer works.
I think adding a test in test_tokenization_common will help identify which models fail and which we have to update.

@ArthurZucker ArthurZucker added the Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want! label Mar 7, 2024
@Ki-Seki
Contributor Author

Ki-Seki commented Mar 7, 2024

Yes, you're right. I added this test case to test_tokenization_common:

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        rust_tokenizer = self.get_rust_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
        self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
        self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

The test results are below (scroll to the bottom for the full list of failing tests):

Details

>       self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
E       AssertionError: 'l o w e s t' != 'lowest'
E       - l o w e s t
E       + lowest

tests/test_tokenization_common.py:4208: AssertionError
__________________ SqueezeBertTokenizationTest.test_single_id __________________

self = <tests.models.squeezebert.test_tokenization_squeezebert.SqueezeBertTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        rust_tokenizer = self.get_rust_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
>       self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
E       AssertionError: 'l o w e s t' != 'lowest'
E       - l o w e s t
E       + lowest

tests/test_tokenization_common.py:4208: AssertionError
_____________________ TapasTokenizationTest.test_single_id _____________________

self = <tests.models.tapas.test_tokenization_tapas.TapasTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.tapas.test_tokenization_tapas.TapasTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
_______________________ VitsTokenizerTest.test_single_id _______________________

self = <tests.models.vits.test_tokenization_vits.VitsTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.vits.test_tokenization_vits.VitsTokenizerTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
___________________ Wav2Vec2CTCTokenizerTest.test_single_id ____________________

self = <tests.models.wav2vec2.test_tokenization_wav2vec2.Wav2Vec2CTCTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.wav2vec2.test_tokenization_wav2vec2.Wav2Vec2CTCTokenizerTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
________________ Wav2Vec2PhonemeCTCTokenizerTest.test_single_id ________________

self = <tests.models.wav2vec2_phoneme.test_tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
>       tokenizer = self.get_tokenizer()

tests/test_tokenization_common.py:4203: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py:87: in get_tokenizer
    return Wav2Vec2PhonemeCTCTokenizer.from_pretrained(self.tmpdirname, **kwargs)
src/transformers/tokenization_utils_base.py:2055: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:2294: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py:153: in __init__
    self.init_backend(self.phonemizer_lang)
src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py:202: in init_backend
    self.backend = BACKENDS[self.phonemizer_backend](phonemizer_lang, language_switch="remove-flags")
/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/espeak/espeak.py:45: in __init__
    super().__init__(
/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/espeak/base.py:39: in __init__
    super().__init__(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <phonemizer.backend.espeak.espeak.EspeakBackend object at 0x7fc839333160>
language = 'en-us', punctuation_marks = ';:,.!?¡¿—…"«»“”'
preserve_punctuation = False, logger = <Logger phonemizer (WARNING)>

    def __init__(self, language: str,
                 punctuation_marks: Optional[Union[str, Pattern]] = None,
                 preserve_punctuation: bool = False,
                 logger: Optional[Logger] = None):
    
        if punctuation_marks is None:
            punctuation_marks = Punctuation.default_marks()
    
        if logger is None:
            logger = get_logger()
    
        # ensure the backend is installed on the system
        if not self.is_available():
>           raise RuntimeError(  # pragma: nocover
                '{} not installed on your system'.format(self.name()))
E           RuntimeError: espeak not installed on your system

/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/base.py:77: RuntimeError
_____________________ WhisperTokenizerTest.test_single_id ______________________
tests/models/whisper/test_tokenization_whisper.py:42: in setUp
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'transformers.models.whisper.tokenization_whisper.WhisperTokenizer'>
pretrained_model_name_or_path = 'openai/whisper-tiny', cache_dir = None
force_download = False, local_files_only = False, token = None
revision = 'main', trust_remote_code = False, init_inputs = (), kwargs = {}
resume_download = False, proxies = None, use_auth_token = None, subfolder = None

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        *init_inputs,
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        local_files_only: bool = False,
        token: Optional[Union[str, bool]] = None,
        revision: str = "main",
        trust_remote_code=False,
        **kwargs,
    ):
        r"""
        Instantiate a [`~tokenization_utils_base.PreTrainedTokenizerBase`] (or a derived class) from a predefined
        tokenizer.
    
        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                Can be either:
    
                - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
                - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
                  using the [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`] method, e.g.,
                  `./my_model_directory/`.
                - (**Deprecated**, not applicable to all derived classes) A path or url to a single saved vocabulary
                  file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g.,
                  `./my_model_directory/vocab.txt`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the
                standard cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force the (re-)download the vocabulary files and override the cached versions if they
                exist.
            resume_download (`bool`, *optional*, defaults to `False`):
                Whether or not to delete incompletely received files. Attempt to resume the download if such a file
                exists.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            local_files_only (`bool`, *optional*, defaults to `False`):
                Whether or not to only rely on local files and not to attempt to download any files.
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            subfolder (`str`, *optional*):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for
                facebook/rag-token-base), specify it here.
            inputs (additional positional arguments, *optional*):
                Will be passed along to the Tokenizer `__init__` method.
            trust_remote_code (`bool`, *optional*, defaults to `False`):
                Whether or not to allow for custom models defined on the Hub in their own modeling files. This option
                should only be set to `True` for repositories you trust and in which you have read the code, as it will
                execute code present on the Hub on your local machine.
            kwargs (additional keyword arguments, *optional*):
                Will be passed to the Tokenizer `__init__` method. Can be used to set special tokens like `bos_token`,
                `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`,
                `additional_special_tokens`. See parameters in the `__init__` for more details.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Examples:
    
        ```python
        # We can't instantiate directly the base class *PreTrainedTokenizerBase* so let's show our examples on a derived class: BertTokenizer
        # Download vocabulary from huggingface.co and cache.
        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    
        # Download vocabulary from huggingface.co (user-uploaded) and cache.
        tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
    
        # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
        tokenizer = BertTokenizer.from_pretrained("./test/saved_model/")
    
        # If the tokenizer uses a single vocabulary file, you can point directly to this file
        tokenizer = BertTokenizer.from_pretrained("./test/saved_model/my_vocab.txt")
    
        # You can link tokens to special vocabulary when instantiating
        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased", unk_token="<unk>")
        # You should be sure '<unk>' is in the vocabulary when doing that.
        # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
        assert tokenizer.unk_token == "<unk>"
        ```"""
        resume_download = kwargs.pop("resume_download", False)
        proxies = kwargs.pop("proxies", None)
        use_auth_token = kwargs.pop("use_auth_token", None)
        subfolder = kwargs.pop("subfolder", None)
        from_pipeline = kwargs.pop("_from_pipeline", None)
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)
    
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token
    
        user_agent = {"file_type": "tokenizer", "from_auto_class": from_auto_class, "is_fast": "Fast" in cls.__name__}
        if from_pipeline is not None:
            user_agent["using_pipeline"] = from_pipeline
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
    
        pretrained_model_name_or_path = str(pretrained_model_name_or_path)
        vocab_files = {}
        init_configuration = {}
    
        is_local = os.path.isdir(pretrained_model_name_or_path)
        single_file_id = None
        if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):
            if len(cls.vocab_files_names) > 1:
                raise ValueError(
                    f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not "
                    "supported for this tokenizer. Use a model identifier or the path to a directory instead."
                )
            warnings.warn(
                f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated and "
                "won't be possible anymore in v5. Use a model identifier or the path to a directory instead.",
                FutureWarning,
            )
            file_id = list(cls.vocab_files_names.keys())[0]
    
            vocab_files[file_id] = pretrained_model_name_or_path
            single_file_id = file_id
        else:
            # At this point pretrained_model_name_or_path is either a directory or a model identifier name
            additional_files_names = {
                "added_tokens_file": ADDED_TOKENS_FILE,  # kept only for legacy
                "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,  # kept only for legacy
                "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
                # tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
                "tokenizer_file": FULL_TOKENIZER_FILE,
            }
            vocab_files = {**cls.vocab_files_names, **additional_files_names}
            if "tokenizer_file" in vocab_files:
                # Try to get the tokenizer config to see if there are versioned tokenizer files.
                fast_tokenizer_file = FULL_TOKENIZER_FILE
                resolved_config_file = cached_file(
                    pretrained_model_name_or_path,
                    TOKENIZER_CONFIG_FILE,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    resume_download=resume_download,
                    proxies=proxies,
                    token=token,
                    revision=revision,
                    local_files_only=local_files_only,
                    subfolder=subfolder,
                    user_agent=user_agent,
                    _raise_exceptions_for_gated_repo=False,
                    _raise_exceptions_for_missing_entries=False,
                    _raise_exceptions_for_connection_errors=False,
                    _commit_hash=commit_hash,
                )
                commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
                if resolved_config_file is not None:
                    with open(resolved_config_file, encoding="utf-8") as reader:
                        tokenizer_config = json.load(reader)
                        if "fast_tokenizer_files" in tokenizer_config:
                            fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])
                vocab_files["tokenizer_file"] = fast_tokenizer_file
    
        # Get files from url, cache, or disk depending on the case
        resolved_vocab_files = {}
        unresolved_files = []
        for file_id, file_path in vocab_files.items():
            if file_path is None:
                resolved_vocab_files[file_id] = None
            elif single_file_id == file_id:
                if os.path.isfile(file_path):
                    resolved_vocab_files[file_id] = file_path
                elif is_remote_url(file_path):
                    resolved_vocab_files[file_id] = download_url(file_path, proxies=proxies)
            else:
                resolved_vocab_files[file_id] = cached_file(
                    pretrained_model_name_or_path,
                    file_path,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    local_files_only=local_files_only,
                    token=token,
                    user_agent=user_agent,
                    revision=revision,
                    subfolder=subfolder,
                    _raise_exceptions_for_gated_repo=False,
                    _raise_exceptions_for_missing_entries=False,
                    _raise_exceptions_for_connection_errors=False,
                    _commit_hash=commit_hash,
                )
                commit_hash = extract_commit_hash(resolved_vocab_files[file_id], commit_hash)
    
        if len(unresolved_files) > 0:
            logger.info(
                f"Can't load following files from cache: {unresolved_files} and cannot check if these "
                "files are necessary for the tokenizer to operate."
            )
    
        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
>           raise EnvironmentError(
                f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
                "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
                f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
                f"containing all relevant files for a {cls.__name__} tokenizer."
            )
E           OSError: Can't load tokenizer for 'openai/whisper-tiny'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'openai/whisper-tiny' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.

src/transformers/tokenization_utils_base.py:2039: OSError
______________________ XLMTokenizationTest.test_single_id ______________________

self = <tests.models.xlm.test_tokenization_xlm.XLMTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.xlm.test_tokenization_xlm.XLMTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
_________________ XLMProphetNetTokenizationTest.test_single_id _________________

self = <tests.models.xlm_prophetnet.test_tokenization_xlm_prophetnet.XLMProphetNetTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.xlm_prophetnet.test_tokenization_xlm_prophetnet.XLMProphetNetTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
________________ PreTrainedTokenizationFastTest.test_single_id _________________
tests/tokenization/test_tokenization_fast.py:47: in setUp
    tokenizer = PreTrainedTokenizerFast.from_pretrained(model_paths[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
pretrained_model_name_or_path = 'robot-test/dummy-tokenizer-fast'
cache_dir = None, force_download = False, local_files_only = False, token = None
revision = 'main', trust_remote_code = False, init_inputs = (), kwargs = {}
resume_download = False, proxies = None, use_auth_token = None, subfolder = None

    ...
E           OSError: Can't load tokenizer for 'robot-test/dummy-tokenizer-fast'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'robot-test/dummy-tokenizer-fast' is the correct path to a directory containing all relevant files for a PreTrainedTokenizerFast tokenizer.

src/transformers/tokenization_utils_base.py:2039: OSError
=============================== warnings summary ===============================
../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/_pytest/config/__init__.py:1373
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/utils.py:22
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/utils.py:22: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/pkg_resources/__init__.py:2846
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/pkg_resources/__init__.py:2846: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_single_id
FAILED tests/models/bert/test_tokenization_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/bert_generation/test_tokenization_bert_generation.py::BertGenerationTokenizationTest::test_single_id
FAILED tests/models/bertweet/test_tokenization_bertweet.py::BertweetTokenizationTest::test_single_id
FAILED tests/models/biogpt/test_tokenization_biogpt.py::BioGptTokenizationTest::test_single_id
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_single_id
FAILED tests/models/bloom/test_tokenization_bloom.py::BloomTokenizationTest::test_single_id
FAILED tests/models/byt5/test_tokenization_byt5.py::ByT5TokenizationTest::test_single_id
FAILED tests/models/canine/test_tokenization_canine.py::CanineTokenizationTest::test_single_id
FAILED tests/models/clvp/test_tokenization_clvp.py::ClvpTokenizationTest::test_single_id
FAILED tests/models/code_llama/test_tokenization_code_llama.py::CodeLlamaTokenizationTest::test_single_id
FAILED tests/models/ctrl/test_tokenization_ctrl.py::CTRLTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::BertTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::DistilBertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::BertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRContextEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRQuestionEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRReaderTokenizationTest::test_single_id
FAILED tests/models/electra/test_tokenization_electra.py::ElectraTokenizationTest::test_single_id
FAILED tests/models/ernie_m/test_tokenization_ernie_m.py::ErnieMTokenizationTest::test_single_id
FAILED tests/models/fsmt/test_tokenization_fsmt.py::FSMTTokenizationTest::test_single_id
FAILED tests/models/funnel/test_tokenization_funnel.py::FunnelTokenizationTest::test_single_id
FAILED tests/models/gemma/test_tokenization_gemma.py::GemmaTokenizationTest::test_single_id
FAILED tests/models/gpt_neox_japanese/test_tokenization_gpt_neox_japanese.py::GPTNeoXJapaneseTokenizationTest::test_single_id
FAILED tests/models/gpt_sw3/test_tokenization_gpt_sw3.py::GPTSw3TokenizationTest::test_single_id
FAILED tests/models/gptsan_japanese/test_tokenization_gptsan_japanese.py::GPTSanJapaneseTokenizationTest::test_single_id
FAILED tests/models/layoutlm/test_tokenization_layoutlm.py::LayoutLMTokenizationTest::test_single_id
FAILED tests/models/layoutlmv2/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_single_id
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_single_id
FAILED tests/models/lxmert/test_tokenization_lxmert.py::LxmertTokenizationTest::test_single_id
FAILED tests/models/m2m_100/test_tokenization_m2m_100.py::M2M100TokenizationTest::test_single_id
FAILED tests/models/marian/test_tokenization_marian.py::MarianTokenizationTest::test_single_id
FAILED tests/models/mgp_str/test_tokenization_mgp_str.py::MgpstrTokenizationTest::test_single_id
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_single_id
FAILED tests/models/mobilebert/test_tokenization_mobilebert.py::MobileBERTTokenizationTest::test_single_id
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_single_id
FAILED tests/models/nougat/test_tokenization_nougat.py::NougatTokenizationTest::test_single_id
FAILED tests/models/perceiver/test_tokenization_perceiver.py::PerceiverTokenizationTest::test_single_id
FAILED tests/models/phobert/test_tokenization_phobert.py::PhobertTokenizationTest::test_single_id
FAILED tests/models/plbart/test_tokenization_plbart.py::PLBartTokenizationTest::test_single_id
FAILED tests/models/prophetnet/test_tokenization_prophetnet.py::ProphetNetTokenizationTest::test_single_id
FAILED tests/models/realm/test_tokenization_realm.py::RealmTokenizationTest::test_single_id
FAILED tests/models/roc_bert/test_tokenization_roc_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/roformer/test_tokenization_roformer.py::RoFormerTokenizationTest::test_single_id
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_single_id
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::BertTokenizationTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::SqueezeBertTokenizationTest::test_single_id
FAILED tests/models/tapas/test_tokenization_tapas.py::TapasTokenizationTest::test_single_id
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_single_id
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_single_id
FAILED tests/models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest::test_single_id
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_single_id
FAILED tests/models/xlm/test_tokenization_xlm.py::XLMTokenizationTest::test_single_id
FAILED tests/models/xlm_prophetnet/test_tokenization_xlm_prophetnet.py::XLMProphetNetTokenizationTest::test_single_id
FAILED tests/tokenization/test_tokenization_fast.py::PreTrainedTokenizationFastTest::test_single_id
============ 58 failed, 33 passed, 6 skipped, 3 warnings in 13.87s =============

@ArthurZucker
Collaborator

Feel free to open a PR for a fix. IMO we should not add spaces in this case.

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 10, 2024

No problem, I will try to do this, but I have some other research work to push forward at the moment, so I may get to it later.

@MariaHei
Contributor

MariaHei commented Jun 24, 2024

Hi :)
I'm pretty sure the issue is not how spaces_between_special_tokens is used, but that single tokens are split into letters here. To fix it, I'd suggest adding the following before iterating over the tokens:

if isinstance(filtered_tokens, str):
    filtered_tokens = [filtered_tokens]

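For context, a minimal sketch of where such a guard would sit relative to the token loop (guarded_decode is a hypothetical illustration of the shape of the slow _decode, not the actual library code):

def guarded_decode(filtered_tokens, spaces_between_special_tokens=True):
    # Wrap a bare str (produced when a single int ID was passed) so the loop
    # below iterates whole tokens instead of individual characters.
    if isinstance(filtered_tokens, str):
        filtered_tokens = [filtered_tokens]
    sub_texts = [token for token in filtered_tokens]
    separator = " " if spaces_between_special_tokens else ""
    return separator.join(sub_texts)

print(guarded_decode("##~"))    # '##~' -- now matches the list case
print(guarded_decode(["##~"]))  # '##~'
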
I ran a couple of the test cases that were reported as failing above with a slightly modified version of the test function proposed by @Ki-Seki, and they pass now:

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
        self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
        if self.test_rust_tokenizer:
            rust_tokenizer = self.get_rust_tokenizer()
            self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

Unfortunately, I can't run all of the test cases (I keep running into weird Python segmentation faults that occur even without having changed the library at all). Does anyone know a trick for running the test cases anyway, or is it OK if I create a pull request and wait for the CI tests?

@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Jun 25, 2024
@ArthurZucker
Collaborator

You can create a PR and rely on the CIs for sure! 🤗

@DuyguA
Contributor

DuyguA commented Aug 9, 2024

Hello @ArthurZucker and all,
I don't think this is an issue related to specific IDs, but rather a general problem. I tested a bit locally, but to make sure my local setup isn't the cause, I also tested on Colab:

[Screenshot: colab_ids]

It looks to me like the problem is that (i) there is a signature mismatch between the _decode methods of the PreTrainedTokenizerBase and PreTrainedTokenizer classes:

# PreTrainedTokenizerBase
def _decode(
    self,
    token_ids: Union[int, List[int]],

# PreTrainedTokenizer (slow)
def _decode(
    self,
    token_ids: List[int],

The fast tokenizer has the correct signature:

def _decode(
    self,
    token_ids: Union[int, List[int]],

Consequently, the slow tokenizer's _decode handles only a list of IDs, not a single ID. And (ii) if filtered_tokens is a single string rather than a list of strings, the loop iterates over its characters and processes them one by one, so @MariaHei is totally right:

for token in filtered_tokens:

Also, there are not many decoding tests, though there are lots of encoding tests 😊 In my PR, I added the quick signature fix and return statements, and also added some decode tests.
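
As a quick sanity check for such a fix, the original reproduction should produce matching outputs (a sketch assuming the fix, e.g. the one from #32564, is applied):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
single_id = tokenizer.vocab_size - 1
# With the fix applied, both calls should return the same string, e.g. '##~'.
assert tokenizer.decode(single_id) == tokenizer.decode([single_id])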
