OSError due to huggingface-hub FutureWarning about resume_download #31002

Closed
albertvillanova opened this issue May 24, 2024 · 1 comment · Fixed by #31007
albertvillanova commented May 24, 2024

The huggingface-hub FutureWarning about resume_download makes transformers raise an OSError.

See: https://github.com/huggingface/datasets/actions/runs/8973676702/job/25330013267

____________________ TokenizersHashTest.test_hash_tokenizer ____________________
[gw1] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

cls = <class 'transformers.configuration_utils.PretrainedConfig'>
pretrained_model_name_or_path = 'bert-base-uncased'
kwargs = {'name_or_path': 'bert-base-uncased'}, cache_dir = None
force_download = False, resume_download = False, proxies = None, token = None
local_files_only = False, revision = None

    @classmethod
    def _get_config_dict(
        cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        cache_dir = kwargs.pop("cache_dir", None)
        force_download = kwargs.pop("force_download", False)
        resume_download = kwargs.pop("resume_download", False)
        proxies = kwargs.pop("proxies", None)
        token = kwargs.pop("token", None)
        local_files_only = kwargs.pop("local_files_only", False)
        revision = kwargs.pop("revision", None)
        trust_remote_code = kwargs.pop("trust_remote_code", None)
        subfolder = kwargs.pop("subfolder", "")
        from_pipeline = kwargs.pop("_from_pipeline", None)
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)
    
        gguf_file = kwargs.get("gguf_file", None)
    
        if trust_remote_code is True:
            logger.warning(
                "The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is"
                " ignored."
            )
    
        user_agent = {"file_type": "config", "from_auto_class": from_auto_class}
        if from_pipeline is not None:
            user_agent["using_pipeline"] = from_pipeline
    
        pretrained_model_name_or_path = str(pretrained_model_name_or_path)
    
        is_local = os.path.isdir(pretrained_model_name_or_path)
        if os.path.isfile(os.path.join(subfolder, pretrained_model_name_or_path)):
            # Special case when pretrained_model_name_or_path is a local file
            resolved_config_file = pretrained_model_name_or_path
            is_local = True
        elif is_remote_url(pretrained_model_name_or_path):
            configuration_file = pretrained_model_name_or_path if gguf_file is None else gguf_file
            resolved_config_file = download_url(pretrained_model_name_or_path)
        else:
            configuration_file = kwargs.pop("_configuration_file", CONFIG_NAME) if gguf_file is None else gguf_file
    
            try:
                # Load from local folder or from cache or download from model Hub and cache
>               resolved_config_file = cached_file(
                    pretrained_model_name_or_path,
                    configuration_file,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    local_files_only=local_files_only,
                    token=token,
                    user_agent=user_agent,
                    revision=revision,
                    subfolder=subfolder,
                    _commit_hash=commit_hash,
                )

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/configuration_utils.py:689: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/utils/hub.py:399: in cached_file
    resolved_file = hf_hub_download(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repo_id = 'bert-base-uncased', filename = 'config.json'

    @validate_hf_hub_args
    def hf_hub_download(
        repo_id: str,
        filename: str,
        *,
        subfolder: Optional[str] = None,
        repo_type: Optional[str] = None,
        revision: Optional[str] = None,
        library_name: Optional[str] = None,
        library_version: Optional[str] = None,
        cache_dir: Union[str, Path, None] = None,
        local_dir: Union[str, Path, None] = None,
        user_agent: Union[Dict, str, None] = None,
        force_download: bool = False,
        proxies: Optional[Dict] = None,
        etag_timeout: float = DEFAULT_ETAG_TIMEOUT,
        token: Union[bool, str, None] = None,
        local_files_only: bool = False,
        headers: Optional[Dict[str, str]] = None,
        endpoint: Optional[str] = None,
        # Deprecated args
        legacy_cache_layout: bool = False,
        resume_download: Optional[bool] = None,
        force_filename: Optional[str] = None,
        local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
    ) -> str:
        """Download a given file if it's not already present in the local cache.
        ...

        """
        if HF_HUB_ETAG_TIMEOUT != DEFAULT_ETAG_TIMEOUT:
            # Respect environment variable above user value
            etag_timeout = HF_HUB_ETAG_TIMEOUT
    
        if force_filename is not None:
            warnings.warn(
                "The `force_filename` parameter is deprecated as a new caching system, "
                "which keeps the filenames as they are on the Hub, is now in place.",
                FutureWarning,
            )
            legacy_cache_layout = True
        if resume_download is not None:
>           warnings.warn(
                "`resume_download` is deprecated and will be removed in version 1.0.0. "
                "Downloads always resume when possible. "
                "If you want to force a new download, use `force_download=True`.",
                FutureWarning,
            )
E           FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning

During handling of the above exception, another exception occurred:

self = <tests.test_fingerprint.TokenizersHashTest testMethod=test_hash_tokenizer>

    @require_transformers
    @pytest.mark.integration
    def test_hash_tokenizer(self):
        from transformers import AutoTokenizer
    
        def encode(x):
            return tokenizer(x)
    
        # TODO: add hash consistency tests across sessions
>       tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tests/test_fingerprint.py:90: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:837: in from_pretrained
    config = AutoConfig.from_pretrained(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:934: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/configuration_utils.py:632: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'transformers.configuration_utils.PretrainedConfig'>
pretrained_model_name_or_path = 'bert-base-uncased'
kwargs = {'name_or_path': 'bert-base-uncased'}, cache_dir = None
force_download = False, resume_download = False, proxies = None, token = None
local_files_only = False, revision = None

    @classmethod
    def _get_config_dict(
        cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        cache_dir = kwargs.pop("cache_dir", None)
        force_download = kwargs.pop("force_download", False)
        resume_download = kwargs.pop("resume_download", False)
        proxies = kwargs.pop("proxies", None)
        token = kwargs.pop("token", None)
        local_files_only = kwargs.pop("local_files_only", False)
        revision = kwargs.pop("revision", None)
        trust_remote_code = kwargs.pop("trust_remote_code", None)
        subfolder = kwargs.pop("subfolder", "")
        from_pipeline = kwargs.pop("_from_pipeline", None)
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)
    
        gguf_file = kwargs.get("gguf_file", None)
    
        if trust_remote_code is True:
            logger.warning(
                "The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is"
                " ignored."
            )
    
        user_agent = {"file_type": "config", "from_auto_class": from_auto_class}
        if from_pipeline is not None:
            user_agent["using_pipeline"] = from_pipeline
    
        pretrained_model_name_or_path = str(pretrained_model_name_or_path)
    
        is_local = os.path.isdir(pretrained_model_name_or_path)
        if os.path.isfile(os.path.join(subfolder, pretrained_model_name_or_path)):
            # Special case when pretrained_model_name_or_path is a local file
            resolved_config_file = pretrained_model_name_or_path
            is_local = True
        elif is_remote_url(pretrained_model_name_or_path):
            configuration_file = pretrained_model_name_or_path if gguf_file is None else gguf_file
            resolved_config_file = download_url(pretrained_model_name_or_path)
        else:
            configuration_file = kwargs.pop("_configuration_file", CONFIG_NAME) if gguf_file is None else gguf_file
    
            try:
                # Load from local folder or from cache or download from model Hub and cache
                resolved_config_file = cached_file(
                    pretrained_model_name_or_path,
                    configuration_file,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    local_files_only=local_files_only,
                    token=token,
                    user_agent=user_agent,
                    revision=revision,
                    subfolder=subfolder,
                    _commit_hash=commit_hash,
                )
                commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
            except EnvironmentError:
                # Raise any environment error raise by `cached_file`. It will have a helpful error message adapted to
                # the original exception.
                raise
            except Exception:
                # For any other exception, we throw a generic error.
>               raise EnvironmentError(
                    f"Can't load the configuration of '{pretrained_model_name_or_path}'. If you were trying to load it"
                    " from 'https://huggingface.co/models', make sure you don't have a local directory with the same"
                    f" name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory"
                    f" containing a {configuration_file} file"
                )
E               OSError: Can't load the configuration of 'bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'bert-base-uncased' is the correct path to a directory containing a config.json file

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/configuration_utils.py:710: OSError
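The chain above can be modeled in a few lines: when warnings are configured to be raised as errors (as in this CI setup), the FutureWarning escapes `cached_file`, is not an `EnvironmentError`, falls through to the broad `except Exception` branch in `_get_config_dict`, and resurfaces as a generic OSError. A minimal sketch with standalone stand-ins (the `cached_file` below is hypothetical, not the real transformers helper):

```python
import warnings

def cached_file():
    # Stand-in for the real helper: emits the deprecation warning (hypothetical).
    warnings.warn("`resume_download` is deprecated ...", FutureWarning)
    return "config.json"

# Mimic the CI's warnings-as-errors configuration: FutureWarning is now raised
# as an exception instead of being printed.
warnings.simplefilter("error", FutureWarning)

try:
    try:
        resolved = cached_file()
    except EnvironmentError:
        # A genuine environment error would pass through unchanged.
        raise
    except Exception:
        # The raised FutureWarning (a subclass of Exception) lands here and is
        # re-raised as a generic OSError, masking the real cause.
        raise OSError("Can't load the configuration of 'bert-base-uncased'.")
except OSError as err:
    caught = type(err).__name__

print(caught)  # OSError
```

This is why the final traceback shows an OSError even though nothing actually failed to download.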

The error can be avoided by filtering the warning out. See: huggingface/datasets@889a48d#diff-6d4f609754165d567fd0edfc980788464067245d6220097b4c76a5120238dd2cR91-R93

import warnings

from transformers import AutoTokenizer

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
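Note that `simplefilter("ignore")` silences every warning inside the block. A narrower workaround is to filter only this specific FutureWarning; a sketch using a stand-in for the hub call (`fake_hub_download` is hypothetical, and the message pattern assumes the warning text shown in the traceback):

```python
import warnings

def fake_hub_download(resume_download=None):
    # Hypothetical stand-in for hf_hub_download: warns whenever the
    # deprecated argument is passed at all.
    if resume_download is not None:
        warnings.warn("`resume_download` is deprecated", FutureWarning)
    return "config.json"

with warnings.catch_warnings():
    # Silence only this deprecation, leaving other warnings visible.
    warnings.filterwarnings(
        "ignore", message=".*resume_download.*", category=FutureWarning
    )
    path = fake_hub_download(resume_download=False)

print(path)  # config.json
```

The `message` argument is a regex matched against the warning text, so unrelated warnings in the same block still surface.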

Reproduction

To reproduce the bug locally, first remove the model from the Hub cache:

rm -fr ~/.cache/huggingface/hub/models--bert-base-uncased

Then run:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
CC: @Wauplin

Wauplin commented May 24, 2024

Opened #31007 to fix this. Looks like an oversight on my side, sorry about that.
Note that the OSError is only triggered because the CI's pytest configuration turns warnings into errors; it does not affect all users.
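A common way to resolve this kind of deprecation on the caller side is to default the deprecated flag to None and forward it only when a caller explicitly set it, so the downstream FutureWarning fires only for actual users of the flag. A hedged sketch of that pattern (function name hypothetical; this is not the literal #31007 diff):

```python
def build_hub_kwargs(resume_download=None, force_download=False):
    # Default the deprecated flag to None instead of False, and forward it
    # only when a caller explicitly set it, so huggingface_hub's
    # FutureWarning fires only for explicit users of the flag.
    hub_kwargs = {"force_download": force_download}
    if resume_download is not None:
        hub_kwargs["resume_download"] = resume_download
    return hub_kwargs

print(build_hub_kwargs())                      # deprecated kwarg not forwarded
print(build_hub_kwargs(resume_download=True))  # explicit opt-in still forwarded
```

With this pattern, the default code path (as exercised by `AutoTokenizer.from_pretrained`) never passes `resume_download` and the warning disappears.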
