Script for finding candidate models for deprecation #29686
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Gentle ping @ydshieh |
Sorry, didn't notice this PR. Will review asap. |
Thank you for working on this @amyeroberts 👍
A first round of (quick) review comments 🤗
7964b2f to 98183b9
Thanks for the review @ydshieh! I've addressed all your comments |
model = model_path.split("/")[-2]
if model in models_info:
    continue
commits = repo.git.log("--diff-filter=A", "--", model_path).split("\n")
Each call takes about 0.5 seconds, and we have ~270 models in the repository, so it takes about 2 minutes to get this info.
Since this is static (apart from newly added models), we could probably build this info once, save it somewhere, and load it.
But 2 minutes is probably nothing compared to the Hub connections, so this is merely a possible improvement that doesn't need to be in this PR.
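The caching idea in the comment above could look roughly like the sketch below. It is not part of the PR: the function names, the JSON cache file, and the use of `subprocess` instead of the `repo.git.log` call from the diff are all assumptions made for illustration.

```python
import json
import subprocess
from pathlib import Path


def first_commit_date(model_path, repo_dir="."):
    """Return the ISO date of the commit that added `model_path` ("" if none found)."""
    out = subprocess.run(
        ["git", "log", "--diff-filter=A", "--format=%cI", "--", model_path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout.split()
    # git log lists newest commits first, so the last entry is the earliest addition.
    return out[-1] if out else ""


def load_first_commits(model_paths, cache_file, lookup=first_commit_date):
    """Look up each path once and reuse dates stored in `cache_file` on later runs."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for path in model_paths:
        if path not in cache:  # only newly added models trigger the slow git call
            cache[path] = lookup(path)
    cache_path.write_text(json.dumps(cache, indent=2))
    return cache
```

With a cache like this, a second run would only pay the git cost for models added since the previous run.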
utils/models_to_deprecate.py
Outdated
for i, hub_model in enumerate(api.list_models()):
    if i % 100 == 0:
        print(f"Processing model {i}")
    if max_num_models != -1 and i > max_num_models:
        break
    if hub_model.private:
        continue
    for tag in hub_model.tags:
        tag = tag.lower().replace("-", "_")
        if tag in models_info:
            models_info[tag]["downloads"] += hub_model.downloads
Do you know how long this will take? The PR description says the whole script runs in a few minutes, but the commit info extraction above already takes ~2 minutes, I believe.
If IIRC, the dictionary |
Great! Curious about the time required to run this script, but PR is good!
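For readers who want to try the download aggregation quoted above without hitting the Hub, here is a minimal offline sketch. `FakeHubModel` is a hypothetical stand-in for the objects yielded by `api.list_models()`, and `add_downloads` is an illustrative name, not a function from the PR. The sketch also uses `>=` for the model limit so that exactly `max_num_models` entries are processed, whereas the quoted diff uses `>`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHubModel:
    """Hypothetical stand-in for a Hub model entry (tags, downloads, private)."""

    tags: list = field(default_factory=list)
    downloads: int = 0
    private: bool = False


def add_downloads(models_info, hub_models, max_num_models=-1):
    """Add each public model's downloads to every known architecture tag it carries."""
    for i, hub_model in enumerate(hub_models):
        if max_num_models != -1 and i >= max_num_models:
            break
        if hub_model.private:
            continue
        for tag in hub_model.tags:
            # Normalise e.g. "Data2Vec-Vision" to the "data2vec_vision" folder name.
            tag = tag.lower().replace("-", "_")
            if tag in models_info:
                models_info[tag]["downloads"] += hub_model.downloads
    return models_info


info = {"bert": {"downloads": 0}, "data2vec_vision": {"downloads": 0}}
hub = [FakeHubModel(["BERT", "pytorch"], 10), FakeHubModel(["data2vec-vision"], 5)]
add_downloads(info, hub)
```

Tags that do not match a model folder name, such as "pytorch" here, are simply ignored.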
utils/models_to_deprecate.py
Outdated
for tag in hub_model.tags:
    tag = tag.lower().replace("-", "_")
    if tag in models_info:
        models_info[tag]["downloads"] += hub_model.downloads
@Wauplin Is hub_model.downloads the number of downloads over the last month?
Good point 👀
Yes it is 👍
Over the last 30 days to be more precise.
OK, I was wondering why you pushed the last commit, but now I understand it is to make things a bit faster. Nice!
My only remaining question is whether the download number is over the last month. I can't seem to find this mentioned in the |
Ah, sorry. |
Would be nice to have something like
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Yep, let's wait 👍 It's not urgent - just a few other PRs lined up after this one to handle automated deprecation and one which will deprecate a bunch of models that are flagged from this script |
Didn't want to make you wait longer, so I replied 😄 And it's all good 👍
OK, thanks for confirmation. BTW, for
Might be better to make it more precise (if not done yet on the latest version) |
Opened huggingface/huggingface_hub#2250 to fix that 🤗 |
What does this PR do?
Adds a script to find candidate models to deprecate. By default, the list contains models that are over a year old and have had fewer than 5,000 downloads over the last month.
Things to note:
--save_model_info is added, which will save all of the models, alongside their date of first commit and number of downloads. This is useful when changing the threshold date or number of downloads used to select models, as you can run --use_cache to iterate quickly
modeling_data2vec_vision.py
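The default selection rule described above (over a year old, fewer than 5,000 downloads in the last month) can be sketched as a small filter over cached model info. This is an illustration only: the function name, the dictionary keys `first_commit_datetime` and `downloads`, and the parameter names are assumptions, not the script's actual interface.

```python
from datetime import datetime, timedelta


def find_candidates(models_info, now, max_downloads=5_000, min_age_days=365):
    """Return the sorted names of models older than the cutoff with few downloads."""
    cutoff = now - timedelta(days=min_age_days)
    return sorted(
        name
        for name, info in models_info.items()
        if info["first_commit_datetime"] < cutoff and info["downloads"] < max_downloads
    )


now = datetime(2024, 3, 15)
models_info = {
    "old_rare": {"first_commit_datetime": datetime(2020, 1, 1), "downloads": 100},
    "old_popular": {"first_commit_datetime": datetime(2020, 1, 1), "downloads": 1_000_000},
    "new_rare": {"first_commit_datetime": datetime(2024, 1, 1), "downloads": 10},
}
candidates = find_candidates(models_info, now)
```

Because the filter only reads saved dates and download counts, changing either threshold and re-running does not need another pass over git history or the Hub, which is the kind of quick iteration the cached mode is meant to allow.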