
[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID #29489

Closed
4 tasks
Ki-Seki opened this issue Mar 6, 2024 · 7 comments · Fixed by #32564
Labels
  • Core: Tokenization (Internals of the library; Tokenization)
  • Good Second Issue (Issues that are more difficult to do than "Good First" issues - give it a try if you want!)

Comments

@Ki-Seki
Contributor

Ki-Seki commented Mar 6, 2024

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
  • Python version: 3.8.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: no need
  • Using distributed or parallel set-up in script?: no need

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Rough estimate: around 15 models have this issue.

Output

<<<<# # ~>>>>
<<<<##~>>>>
<<<<# # ~>>>>
<<<<##~>>>>

Expected behavior

Consistent behavior. For example, when decoding the single ID, the output should also be ##~.

Suspected cause: in src/transformers/tokenization_utils.py, the _decode function incorrectly applies spaces_between_special_tokens and ends up inserting spaces between the sub-tokens.
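
To make the suspected mechanism concrete, here is a minimal, self-contained sketch (sketch_decode is a hypothetical stand-in for illustration, not the actual _decode code): for a single int ID, convert_ids_to_tokens returns a bare str instead of a list of str, and iterating over a str yields its characters.

from typing import List, Union

def sketch_decode(filtered_tokens: Union[str, List[str]]) -> str:
    # Iterating a bare str (single-ID case) yields characters: "##~" -> ["#", "#", "~"].
    sub_texts = [token for token in filtered_tokens]
    # With spaces_between_special_tokens=True, the pieces are joined with spaces.
    return " ".join(sub_texts)

print(sketch_decode("##~"))    # '# # ~'  (single ID: the bare str is iterated char by char)
print(sketch_decode(["##~"]))  # '##~'    (list of IDs: the token survives intact)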

@Ki-Seki Ki-Seki changed the title [Tokenizer] Inconsistent behavior when decoding the single id and the list of the single id [Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID Mar 6, 2024
@ArthurZucker
Collaborator

That's very interesting, and I can confirm we have this issue.
Gemma just errors out if you pass an int instead of a list, with no proper warning, while the fast tokenizer works.
I think adding a test in test_tokenization_common will help identify which models fail and which we have to update.

@ArthurZucker ArthurZucker added the Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want! label Mar 7, 2024
@Ki-Seki
Contributor Author

Ki-Seki commented Mar 7, 2024

Yes, you're right. I added this test case to test_tokenization_common:

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        rust_tokenizer = self.get_rust_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
        self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
        self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

The test results are below (scroll to the bottom for the full list of failing tests):

Details

>       self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
E       AssertionError: 'l o w e s t' != 'lowest'
E       - l o w e s t
E       + lowest

tests/test_tokenization_common.py:4208: AssertionError
__________________ SqueezeBertTokenizationTest.test_single_id __________________

self = <tests.models.squeezebert.test_tokenization_squeezebert.SqueezeBertTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        rust_tokenizer = self.get_rust_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
>       self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
E       AssertionError: 'l o w e s t' != 'lowest'
E       - l o w e s t
E       + lowest

tests/test_tokenization_common.py:4208: AssertionError
_____________________ TapasTokenizationTest.test_single_id _____________________

self = <tests.models.tapas.test_tokenization_tapas.TapasTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.tapas.test_tokenization_tapas.TapasTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
_______________________ VitsTokenizerTest.test_single_id _______________________

self = <tests.models.vits.test_tokenization_vits.VitsTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.vits.test_tokenization_vits.VitsTokenizerTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
___________________ Wav2Vec2CTCTokenizerTest.test_single_id ____________________

self = <tests.models.wav2vec2.test_tokenization_wav2vec2.Wav2Vec2CTCTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.wav2vec2.test_tokenization_wav2vec2.Wav2Vec2CTCTokenizerTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
________________ Wav2Vec2PhonemeCTCTokenizerTest.test_single_id ________________

self = <tests.models.wav2vec2_phoneme.test_tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerTest testMethod=test_single_id>

    def test_single_id(self):
>       tokenizer = self.get_tokenizer()

tests/test_tokenization_common.py:4203: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py:87: in get_tokenizer
    return Wav2Vec2PhonemeCTCTokenizer.from_pretrained(self.tmpdirname, **kwargs)
src/transformers/tokenization_utils_base.py:2055: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:2294: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py:153: in __init__
    self.init_backend(self.phonemizer_lang)
src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py:202: in init_backend
    self.backend = BACKENDS[self.phonemizer_backend](phonemizer_lang, language_switch="remove-flags")
/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/espeak/espeak.py:45: in __init__
    super().__init__(
/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/espeak/base.py:39: in __init__
    super().__init__(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <phonemizer.backend.espeak.espeak.EspeakBackend object at 0x7fc839333160>
language = 'en-us', punctuation_marks = ';:,.!?¡¿—…"«»“”'
preserve_punctuation = False, logger = <Logger phonemizer (WARNING)>

    def __init__(self, language: str,
                 punctuation_marks: Optional[Union[str, Pattern]] = None,
                 preserve_punctuation: bool = False,
                 logger: Optional[Logger] = None):
    
        if punctuation_marks is None:
            punctuation_marks = Punctuation.default_marks()
    
        if logger is None:
            logger = get_logger()
    
        # ensure the backend is installed on the system
        if not self.is_available():
>           raise RuntimeError(  # pragma: nocover
                '{} not installed on your system'.format(self.name()))
E           RuntimeError: espeak not installed on your system

/home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/backend/base.py:77: RuntimeError
_____________________ WhisperTokenizerTest.test_single_id ______________________
tests/models/whisper/test_tokenization_whisper.py:42: in setUp
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'transformers.models.whisper.tokenization_whisper.WhisperTokenizer'>
pretrained_model_name_or_path = 'openai/whisper-tiny', cache_dir = None
force_download = False, local_files_only = False, token = None
revision = 'main', trust_remote_code = False, init_inputs = (), kwargs = {}
resume_download = False, proxies = None, use_auth_token = None, subfolder = None

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        *init_inputs,
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        local_files_only: bool = False,
        token: Optional[Union[str, bool]] = None,
        revision: str = "main",
        trust_remote_code=False,
        **kwargs,
    ):
        r"""
        Instantiate a [`~tokenization_utils_base.PreTrainedTokenizerBase`] (or a derived class) from a predefined
        tokenizer.
    
        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                Can be either:
    
                - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
                - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
                  using the [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`] method, e.g.,
                  `./my_model_directory/`.
                - (**Deprecated**, not applicable to all derived classes) A path or url to a single saved vocabulary
                  file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g.,
                  `./my_model_directory/vocab.txt`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the
                standard cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force the (re-)download the vocabulary files and override the cached versions if they
                exist.
            resume_download (`bool`, *optional*, defaults to `False`):
                Whether or not to delete incompletely received files. Attempt to resume the download if such a file
                exists.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            local_files_only (`bool`, *optional*, defaults to `False`):
                Whether or not to only rely on local files and not to attempt to download any files.
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            subfolder (`str`, *optional*):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for
                facebook/rag-token-base), specify it here.
            inputs (additional positional arguments, *optional*):
                Will be passed along to the Tokenizer `__init__` method.
            trust_remote_code (`bool`, *optional*, defaults to `False`):
                Whether or not to allow for custom models defined on the Hub in their own modeling files. This option
                should only be set to `True` for repositories you trust and in which you have read the code, as it will
                execute code present on the Hub on your local machine.
            kwargs (additional keyword arguments, *optional*):
                Will be passed to the Tokenizer `__init__` method. Can be used to set special tokens like `bos_token`,
                `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`,
                `additional_special_tokens`. See parameters in the `__init__` for more details.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Examples:
    
        ```python
        # We can't instantiate directly the base class *PreTrainedTokenizerBase* so let's show our examples on a derived class: BertTokenizer
        # Download vocabulary from huggingface.co and cache.
        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    
        # Download vocabulary from huggingface.co (user-uploaded) and cache.
        tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
    
        # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
        tokenizer = BertTokenizer.from_pretrained("./test/saved_model/")
    
        # If the tokenizer uses a single vocabulary file, you can point directly to this file
        tokenizer = BertTokenizer.from_pretrained("./test/saved_model/my_vocab.txt")
    
        # You can link tokens to special vocabulary when instantiating
        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased", unk_token="<unk>")
        # You should be sure '<unk>' is in the vocabulary when doing that.
        # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
        assert tokenizer.unk_token == "<unk>"
        ```"""
        resume_download = kwargs.pop("resume_download", False)
        proxies = kwargs.pop("proxies", None)
        use_auth_token = kwargs.pop("use_auth_token", None)
        subfolder = kwargs.pop("subfolder", None)
        from_pipeline = kwargs.pop("_from_pipeline", None)
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)
    
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token
    
        user_agent = {"file_type": "tokenizer", "from_auto_class": from_auto_class, "is_fast": "Fast" in cls.__name__}
        if from_pipeline is not None:
            user_agent["using_pipeline"] = from_pipeline
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
    
        pretrained_model_name_or_path = str(pretrained_model_name_or_path)
        vocab_files = {}
        init_configuration = {}
    
        is_local = os.path.isdir(pretrained_model_name_or_path)
        single_file_id = None
        if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):
            if len(cls.vocab_files_names) > 1:
                raise ValueError(
                    f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not "
                    "supported for this tokenizer. Use a model identifier or the path to a directory instead."
                )
            warnings.warn(
                f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated and "
                "won't be possible anymore in v5. Use a model identifier or the path to a directory instead.",
                FutureWarning,
            )
            file_id = list(cls.vocab_files_names.keys())[0]
    
            vocab_files[file_id] = pretrained_model_name_or_path
            single_file_id = file_id
        else:
            # At this point pretrained_model_name_or_path is either a directory or a model identifier name
            additional_files_names = {
                "added_tokens_file": ADDED_TOKENS_FILE,  # kept only for legacy
                "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,  # kept only for legacy
                "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
                # tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
                "tokenizer_file": FULL_TOKENIZER_FILE,
            }
            vocab_files = {**cls.vocab_files_names, **additional_files_names}
            if "tokenizer_file" in vocab_files:
                # Try to get the tokenizer config to see if there are versioned tokenizer files.
                fast_tokenizer_file = FULL_TOKENIZER_FILE
                resolved_config_file = cached_file(
                    pretrained_model_name_or_path,
                    TOKENIZER_CONFIG_FILE,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    resume_download=resume_download,
                    proxies=proxies,
                    token=token,
                    revision=revision,
                    local_files_only=local_files_only,
                    subfolder=subfolder,
                    user_agent=user_agent,
                    _raise_exceptions_for_gated_repo=False,
                    _raise_exceptions_for_missing_entries=False,
                    _raise_exceptions_for_connection_errors=False,
                    _commit_hash=commit_hash,
                )
                commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
                if resolved_config_file is not None:
                    with open(resolved_config_file, encoding="utf-8") as reader:
                        tokenizer_config = json.load(reader)
                        if "fast_tokenizer_files" in tokenizer_config:
                            fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])
                vocab_files["tokenizer_file"] = fast_tokenizer_file
    
        # Get files from url, cache, or disk depending on the case
        resolved_vocab_files = {}
        unresolved_files = []
        for file_id, file_path in vocab_files.items():
            if file_path is None:
                resolved_vocab_files[file_id] = None
            elif single_file_id == file_id:
                if os.path.isfile(file_path):
                    resolved_vocab_files[file_id] = file_path
                elif is_remote_url(file_path):
                    resolved_vocab_files[file_id] = download_url(file_path, proxies=proxies)
            else:
                resolved_vocab_files[file_id] = cached_file(
                    pretrained_model_name_or_path,
                    file_path,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    local_files_only=local_files_only,
                    token=token,
                    user_agent=user_agent,
                    revision=revision,
                    subfolder=subfolder,
                    _raise_exceptions_for_gated_repo=False,
                    _raise_exceptions_for_missing_entries=False,
                    _raise_exceptions_for_connection_errors=False,
                    _commit_hash=commit_hash,
                )
                commit_hash = extract_commit_hash(resolved_vocab_files[file_id], commit_hash)
    
        if len(unresolved_files) > 0:
            logger.info(
                f"Can't load following files from cache: {unresolved_files} and cannot check if these "
                "files are necessary for the tokenizer to operate."
            )
    
        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
>           raise EnvironmentError(
                f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
                "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
                f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
                f"containing all relevant files for a {cls.__name__} tokenizer."
            )
E           OSError: Can't load tokenizer for 'openai/whisper-tiny'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'openai/whisper-tiny' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.

src/transformers/tokenization_utils_base.py:2039: OSError
______________________ XLMTokenizationTest.test_single_id ______________________

self = <tests.models.xlm.test_tokenization_xlm.XLMTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.xlm.test_tokenization_xlm.XLMTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
_________________ XLMProphetNetTokenizationTest.test_single_id _________________

self = <tests.models.xlm_prophetnet.test_tokenization_xlm_prophetnet.XLMProphetNetTokenizationTest testMethod=test_single_id>

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
>       rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py:4204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.models.xlm_prophetnet.test_tokenization_xlm_prophetnet.XLMProphetNetTokenizationTest testMethod=test_single_id>
kwargs = {}

    def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
>       return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
E       AttributeError: 'NoneType' object has no attribute 'from_pretrained'

tests/test_tokenization_common.py:272: AttributeError
________________ PreTrainedTokenizationFastTest.test_single_id _________________
tests/tokenization/test_tokenization_fast.py:47: in setUp
    tokenizer = PreTrainedTokenizerFast.from_pretrained(model_paths[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
pretrained_model_name_or_path = 'robot-test/dummy-tokenizer-fast'
cache_dir = None, force_download = False, local_files_only = False, token = None
revision = 'main', trust_remote_code = False, init_inputs = (), kwargs = {}
resume_download = False, proxies = None, use_auth_token = None, subfolder = None

    ...
E           OSError: Can't load tokenizer for 'robot-test/dummy-tokenizer-fast'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'robot-test/dummy-tokenizer-fast' is the correct path to a directory containing all relevant files for a PreTrainedTokenizerFast tokenizer.

src/transformers/tokenization_utils_base.py:2039: OSError
=============================== warnings summary ===============================
../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/_pytest/config/__init__.py:1373
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/utils.py:22
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/phonemizer/utils.py:22: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../../../../home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/pkg_resources/__init__.py:2846
  /home/llm/anaconda3/usr/shichao/envs/fix/lib/python3.8/site-packages/pkg_resources/__init__.py:2846: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_single_id
FAILED tests/models/bert/test_tokenization_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/bert_generation/test_tokenization_bert_generation.py::BertGenerationTokenizationTest::test_single_id
FAILED tests/models/bertweet/test_tokenization_bertweet.py::BertweetTokenizationTest::test_single_id
FAILED tests/models/biogpt/test_tokenization_biogpt.py::BioGptTokenizationTest::test_single_id
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_single_id
FAILED tests/models/bloom/test_tokenization_bloom.py::BloomTokenizationTest::test_single_id
FAILED tests/models/byt5/test_tokenization_byt5.py::ByT5TokenizationTest::test_single_id
FAILED tests/models/canine/test_tokenization_canine.py::CanineTokenizationTest::test_single_id
FAILED tests/models/clvp/test_tokenization_clvp.py::ClvpTokenizationTest::test_single_id
FAILED tests/models/code_llama/test_tokenization_code_llama.py::CodeLlamaTokenizationTest::test_single_id
FAILED tests/models/ctrl/test_tokenization_ctrl.py::CTRLTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::BertTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::DistilBertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::BertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRContextEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRQuestionEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRReaderTokenizationTest::test_single_id
FAILED tests/models/electra/test_tokenization_electra.py::ElectraTokenizationTest::test_single_id
FAILED tests/models/ernie_m/test_tokenization_ernie_m.py::ErnieMTokenizationTest::test_single_id
FAILED tests/models/fsmt/test_tokenization_fsmt.py::FSMTTokenizationTest::test_single_id
FAILED tests/models/funnel/test_tokenization_funnel.py::FunnelTokenizationTest::test_single_id
FAILED tests/models/gemma/test_tokenization_gemma.py::GemmaTokenizationTest::test_single_id
FAILED tests/models/gpt_neox_japanese/test_tokenization_gpt_neox_japanese.py::GPTNeoXJapaneseTokenizationTest::test_single_id
FAILED tests/models/gpt_sw3/test_tokenization_gpt_sw3.py::GPTSw3TokenizationTest::test_single_id
FAILED tests/models/gptsan_japanese/test_tokenization_gptsan_japanese.py::GPTSanJapaneseTokenizationTest::test_single_id
FAILED tests/models/layoutlm/test_tokenization_layoutlm.py::LayoutLMTokenizationTest::test_single_id
FAILED tests/models/layoutlmv2/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_single_id
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_single_id
FAILED tests/models/lxmert/test_tokenization_lxmert.py::LxmertTokenizationTest::test_single_id
FAILED tests/models/m2m_100/test_tokenization_m2m_100.py::M2M100TokenizationTest::test_single_id
FAILED tests/models/marian/test_tokenization_marian.py::MarianTokenizationTest::test_single_id
FAILED tests/models/mgp_str/test_tokenization_mgp_str.py::MgpstrTokenizationTest::test_single_id
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_single_id
FAILED tests/models/mobilebert/test_tokenization_mobilebert.py::MobileBERTTokenizationTest::test_single_id
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_single_id
FAILED tests/models/nougat/test_tokenization_nougat.py::NougatTokenizationTest::test_single_id
FAILED tests/models/perceiver/test_tokenization_perceiver.py::PerceiverTokenizationTest::test_single_id
FAILED tests/models/phobert/test_tokenization_phobert.py::PhobertTokenizationTest::test_single_id
FAILED tests/models/plbart/test_tokenization_plbart.py::PLBartTokenizationTest::test_single_id
FAILED tests/models/prophetnet/test_tokenization_prophetnet.py::ProphetNetTokenizationTest::test_single_id
FAILED tests/models/realm/test_tokenization_realm.py::RealmTokenizationTest::test_single_id
FAILED tests/models/roc_bert/test_tokenization_roc_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/roformer/test_tokenization_roformer.py::RoFormerTokenizationTest::test_single_id
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_single_id
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::BertTokenizationTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::SqueezeBertTokenizationTest::test_single_id
FAILED tests/models/tapas/test_tokenization_tapas.py::TapasTokenizationTest::test_single_id
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_single_id
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_single_id
FAILED tests/models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest::test_single_id
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_single_id
FAILED tests/models/xlm/test_tokenization_xlm.py::XLMTokenizationTest::test_single_id
FAILED tests/models/xlm_prophetnet/test_tokenization_xlm_prophetnet.py::XLMProphetNetTokenizationTest::test_single_id
FAILED tests/tokenization/test_tokenization_fast.py::PreTrainedTokenizationFastTest::test_single_id
============ 58 failed, 33 passed, 6 skipped, 3 warnings in 13.87s =============

@ArthurZucker
Collaborator

Feel free to open a PR for a fix. IMO we should not add spaces in this case.

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 10, 2024

No problem, I will try to do this, but I have some other research work to push forward at the moment, so I may get to it later.

@MariaHei
Contributor

MariaHei commented Jun 24, 2024

Hi :)
I'm pretty sure the issue is not how spaces_between_special_tokens is used, but that single tokens are split into letters here. To fix it, I'd suggest adding the following before iterating over the tokens:

if isinstance(filtered_tokens, str):
    filtered_tokens = [filtered_tokens]

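For context, a minimal sketch of where such a guard would sit relative to the token loop (guarded_decode is a hypothetical illustration of the shape of the slow _decode, not the actual library code):

def guarded_decode(filtered_tokens, spaces_between_special_tokens=True):
    # Wrap a bare str (produced when a single int ID was passed) so the loop
    # below iterates whole tokens instead of individual characters.
    if isinstance(filtered_tokens, str):
        filtered_tokens = [filtered_tokens]
    sub_texts = [token for token in filtered_tokens]
    separator = " " if spaces_between_special_tokens else ""
    return separator.join(sub_texts)

print(guarded_decode("##~"))    # '##~' -- now matches the list case
print(guarded_decode(["##~"]))  # '##~'
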
I ran a couple of the test cases that were reported as failing above with a slightly modified version of the test function proposed by @Ki-Seki, and they pass now:

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
        self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
        if self.test_rust_tokenizer:
            rust_tokenizer = self.get_rust_tokenizer()
            self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

Unfortunately, I can't run all of the test cases (I keep running into weird Python segmentation faults that occur even without having changed the library at all). Does anyone know a trick for running the test cases anyway, or is it OK if I create a pull request and wait for the CI tests?

@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Jun 25, 2024
@ArthurZucker
Collaborator

You can create a PR and rely on the CIs for sure! 🤗

@DuyguA
Contributor

DuyguA commented Aug 9, 2024

Hello @ArthurZucker and all,
I don't think this is an issue related to specific IDs, but rather a general problem. I tested a bit locally, but to make sure my local setup isn't the cause, I also tested on Colab:

[Screenshot: colab_ids]

It looks to me like the problem is that (i) there is a signature mismatch between the _decode methods of the PreTrainedTokenizerBase and PreTrainedTokenizer classes:

# PreTrainedTokenizerBase
def _decode(
    self,
    token_ids: Union[int, List[int]],

# PreTrainedTokenizer (slow)
def _decode(
    self,
    token_ids: List[int],

The fast tokenizer has the correct signature:

def _decode(
    self,
    token_ids: Union[int, List[int]],

Consequently, the slow tokenizer's _decode handles only a list of IDs, not a single ID. And (ii) if filtered_tokens is a single string rather than a list of strings, the loop iterates over its characters and processes them one by one, so @MariaHei is totally right:

for token in filtered_tokens:

Also, there are not many decoding tests, though there are lots of encoding tests 😊 In my PR, I added the quick signature fix and return statements, and also added some decode tests.
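
As a quick sanity check for such a fix, the original reproduction should produce matching outputs (a sketch assuming the fix, e.g. the one from #32564, is applied):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
single_id = tokenizer.vocab_size - 1
# With the fix applied, both calls should return the same string, e.g. '##~'.
assert tokenizer.decode(single_id) == tokenizer.decode([single_id])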
