Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script for finding candidate models for deprecation #29686

Merged
merged 8 commits into from
Apr 25, 2024

Conversation

amyeroberts
Copy link
Collaborator

@amyeroberts amyeroberts commented Mar 15, 2024

What does this PR do?

Adds a script to find a list of candidate models to deprecate. By default, the list of models are those which are over a year old and have had fewer than 5,000 downloads over the last month.

Things to note:

  • As the script iterates over all of the checkpoints on the hub, it can be fairly slow to run (a few mins). The option to save the retrieved information --save_model_info is added, which will save all of the models, alongside their date of first commit and number of downloads. This is useful when changing the threshold date of number of downloads to select models as you can run --use_cache to iterate quickly
  • There are assumptions made about the model file structures and names. Therefore some "models" will be found for deprecation, even if they many downloads e.g. "data2vec" as there's many separate data2vec modeling files e.g. modeling_data2vec_vision.py

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts amyeroberts changed the title Add utility for finding candidate models for deprecation Script for finding candidate models for deprecation Apr 11, 2024
@amyeroberts amyeroberts requested a review from ydshieh April 11, 2024 13:26
@amyeroberts
Copy link
Collaborator Author

Gentle ping @ydshieh

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 23, 2024

Sorry, didn't notice this PR. Will review asap.

@ydshieh ydshieh self-assigned this Apr 23, 2024
Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @amyeroberts 👍

A first round of (quick) review comments 🤗

utils/models_to_deprecate.py Outdated Show resolved Hide resolved
utils/models_to_deprecate.py Outdated Show resolved Hide resolved
utils/models_to_deprecate.py Outdated Show resolved Hide resolved
utils/models_to_deprecate.py Outdated Show resolved Hide resolved
utils/models_to_deprecate.py Outdated Show resolved Hide resolved
utils/models_to_deprecate.py Outdated Show resolved Hide resolved
@amyeroberts amyeroberts force-pushed the find-models-to-deprecate branch from 7964b2f to 98183b9 Compare April 24, 2024 10:51
@amyeroberts
Copy link
Collaborator Author

Thanks for the review @ydshieh! I've addressed all your comments

model = model_path.split("/")[-2]
if model in models_info:
continue
commits = repo.git.log("--diff-filter=A", "--", model_path).split("\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each call takes 0.5 seconds, and we have ~270 models in the repository. So it takes 2 mins to get this info.

Since this is static (other than the new added models), we could probably build this info, save to somewhere and load it.

But probably 2 mins is nothing compared to the Hub connections. So it's merely a possible improvements which don't need to be in this PR.

Comment on lines 96 to 106
for i, hub_model in enumerate(api.list_models()):
if i % 100 == 0:
print(f"Processing model {i}")
if max_num_models != -1 and i > max_num_models:
break
if hub_model.private:
continue
for tag in hub_model.tags:
tag = tag.lower().replace("-", "_")
if tag in models_info:
models_info[tag]["downloads"] += hub_model.downloads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how long this will take. The PR description says the whole script runs in a few minutes, but the above commit info extraction already takes ~2 minutes I believe.

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 24, 2024

There are assumptions made about the model file structures and names. Therefore some "models" will be found for deprecation, even if they many downloads e.g. "data2vec" as there's many separate data2vec modeling files e.g. modeling_data2vec_vision.py

If IIRC, the dictionary model_info use the model directory (so it is data2vec) but the tags of models on the Hub could be data2vec-vision etc, so it won't be added to the dictionary.

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Curious about the time required to run this script, but PR is good!

for tag in hub_model.tags:
tag = tag.lower().replace("-", "_")
if tag in models_info:
models_info[tag]["downloads"] += hub_model.downloads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin Is hub_model.downloads the number of downloads over the last month?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point 👀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over the last 30 days to be more precise.

@amyeroberts
Copy link
Collaborator Author

@ydshieh I updated a little bit, so that we filter the hub requests based on model tags in da62f7e. Running this way, it ran in 161 seconds. Previously it was 240s. Not ideal - not sure of a faster way to get the downloads info - but as we're running sporadically I don't think too bad either

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 24, 2024

OK, I was wondering why you pushed the last commit but now I understand it is to make things a bit fast. Nice!

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 24, 2024

My only left question is just about if the download number is over the last month. I don't seem to find this is mentioned in the HfApi. But since it is mentioned in your PR, let's wait @Wauplin to confirm. (or if it's urgent, you can merge and we check with him later)

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 24, 2024

Ah, sorry.

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 24, 2024

Would be nice to have something like

# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@amyeroberts
Copy link
Collaborator Author

let's wait @Wauplin to confirm. (or if it's urgent, you can merge and we check with him later)

Yep, let's wait 👍 It's not urgent - just a few other PRs lined up after this one to handle automated deprecation and one which will deprecate a bunch of models that are flagged from this script

@Wauplin
Copy link
Contributor

Wauplin commented Apr 25, 2024

Didn't want to make you wait longer to I replied 😄 And it's all good 👍

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 25, 2024

OK, thanks for confirmation. BTW, for class ModelInfo:, my local file shows

        downloads (`int`):
            Number of downloads of the dataset.

Might be better to make it more precise (if not done yet on the latest version)

@Wauplin
Copy link
Contributor

Wauplin commented Apr 25, 2024

Opened huggingface/huggingface_hub#2250 to fix that 🤗

@amyeroberts amyeroberts merged commit 30ee508 into huggingface:main Apr 25, 2024
8 checks passed
@amyeroberts amyeroberts deleted the find-models-to-deprecate branch April 25, 2024 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants