Map speedup #6745
Conversation
@sayakpaul could you please merge it?
Can you explain what exactly you are doing here?
@sayakpaul sure. Instead of running the second map over the already-mapped dataset, I assign each map result to its own variable and then merge those datasets together. Speeds things up a lot.
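For reference, a minimal sketch of the pattern (function and variable names follow the training script where visible; the fingerprint for the VAE map and the exact columns removed before merging are assumptions):

    from datasets import concatenate_datasets

    # Before: the second map() runs on the output of the first, so it has
    # to read and re-write the embedding columns it did not itself create.
    #   train_dataset = train_dataset.map(compute_embeddings_fn, batched=True)
    #   train_dataset = train_dataset.map(compute_vae_encodings_fn, batched=True)

    # After: run each map() on the original dataset, then merge the columns.
    train_dataset_with_embeddings = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
    )
    train_dataset_with_vae = train_dataset.map(
        compute_vae_encodings_fn, batched=True, new_fingerprint=new_fingerprint_for_vae
    )
    # axis=1 concatenates column-wise; both datasets must have the same
    # number of rows, and column names must not collide after the removal.
    train_dataset = concatenate_datasets(
        [train_dataset_with_embeddings, train_dataset_with_vae.remove_columns(["image", "text"])],
        axis=1,
    )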
@lhoestq could you comment on this change?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I don't see how it can be faster since the same two map() calls are still there. Can you elaborate?
Also, I'm not sure why with_transform is called twice (once in your code addition, and once on line 866).
@lhoestq I have no idea how it can be faster, but it is. You can run a test training on the Pokémon dataset and see that it's faster. I don't know how to remove the second .with_transform (or the first); if I do, it breaks the entire training due to the absence of required columns. Maybe you can help me do that? Also, I can't understand why it's with_transform instead of just a plain map.
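For context, map() and with_transform() behave quite differently: map() runs eagerly and caches its output as new columns, while with_transform() registers a function that runs on the fly every time examples are accessed. A toy illustration (not the script's code):

    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["a", "bb"]})

    def add_len(batch):
        return {"length": [len(t) for t in batch["text"]]}

    # map(): runs once, writes the result to the on-disk cache, adds a column.
    ds_mapped = ds.map(add_len, batched=True)
    print(ds_mapped[0])  # {'text': 'a', 'length': 1}

    # with_transform(): nothing is precomputed; the function runs on every
    # access, and only its return value is what you get back.
    ds_lazy = ds.with_transform(add_len)
    print(ds_lazy[0])  # {'length': 1}, computed on access

This would explain why the script's with_transform, which presumably applies the random per-step preprocessing, cannot simply become a map: its output should differ on every access.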
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@lhoestq a gentle ping.
I think it's faster because the second map call has to read and write the previously computed embeddings even though they're already saved. Good find! I would not add
@lhoestq done. Please merge ❤️
@lhoestq I have more very cool things I could add to the training script, like saving the pre-computed dataset so it does not recompute when the same script is run again with different params. I just need to find some time to do it, because I also need to do some testing as well...
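One possible shape for that follow-up (the cache path and the build_precomputed_dataset() helper are hypothetical stand-ins):

    import os

    from datasets import load_from_disk

    cache_dir = "precomputed_dataset"  # hypothetical cache location

    if os.path.isdir(cache_dir):
        # Reuse the text embeddings / VAE encodings from a previous run.
        precomputed_dataset = load_from_disk(cache_dir)
    else:
        # Stand-in for the map + concatenate_datasets step above.
        precomputed_dataset = build_precomputed_dataset()
        precomputed_dataset.save_to_disk(cache_dir)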
@lhoestq merged the current main branch into this one.
LGTM! I'll let you merge, @sayakpaul, if it's good for you.
PS: I also opened #7171 to fix an issue with the fingerprint of train_dataset_with_vae.
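For context, new_fingerprint overrides the cache key that datasets computes for a map() call, so each call needs a fingerprint derived from the inputs that actually affect its output. A sketch of the general idea (an assumption, not the actual fix in #7171; args and vae come from the script):

    from datasets.fingerprint import Hasher

    # Distinct fingerprints per map() call, so the cache can tell the
    # embedding computation apart from the VAE encoding computation.
    new_fingerprint = Hasher.hash(args)
    new_fingerprint_for_vae = Hasher.hash((vae.config, args))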
    train_dataset_with_embeddings = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
    )
(nit) This can be a single line.
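That is, presumably:

    train_dataset_with_embeddings = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)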
Looking great! Just one small change.
Let's also resolve the merge conflicts.
There's a merge conflict that we need to resolve before we can ship this.
@sayakpaul of course, done. Please check this.
Thanks! Ping me once the CI run is complete :)
@sayakpaul seems to be complete now.
But now we have a code quality problem :/
What problem exactly? My change did not introduce any bugs.
@sayakpaul I see the logs, but there is no info on what exactly is wrong. Could you please tag the person who is responsible for the tests?
You need to run
@sayakpaul thanks. Now all the checks seem to be complete.
Thanks a lot for your hard work.
@sayakpaul you too, it was a pleasure ❤️. Be ready to review more cool stuff from me in the near future.
Speed up the 2nd mapping in examples/text_to_image/train_text_to_image_sdxl.py (computing the VAE encodings).
Testing on 833 samples of this dataset: lambdalabs/pokemon-blip-captions
Mine: 1m 48s
Current implementation: 2m 25s