
Map speedup #6745

Merged (11 commits, Mar 1, 2024)
Conversation

@kopyl (Contributor) commented Jan 28, 2024

Speed up the 2nd mapping (computing the VAE embeddings) in examples/text_to_image/train_text_to_image_sdxl.py.

Tested on 833 samples of the lambdalabs/pokemon-blip-captions dataset:

This PR: 1m 48s
Current implementation: 2m 25s

@kopyl (Contributor, author) commented Jan 29, 2024

@sayakpaul could you please merge it?

@sayakpaul (Member)

Can you explain what exactly you are doing here?

@kopyl (Contributor, author) commented Jan 29, 2024

@sayakpaul sure. Instead of running the second map over the already-mapped dataset, I assign the result to another variable and then merge the two datasets together.

Speeds things up a lot.

@sayakpaul (Member)

@lhoestq could you comment on this change?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) left a comment:

I don't see how it can be faster since the same two map() calls are still there. Can you elaborate?

Also, I'm not sure why with_transform is called twice (once in your code addition, and once on line 866).

@kopyl (Contributor, author) commented Jan 30, 2024

@lhoestq I have no idea how it can be faster, but it is. You can run a test training on the Pokémon dataset and see that it's faster.

I don't know how to remove the second .with_transform (or the first). If I do, the entire training breaks due to the absence of required columns. Maybe you can help me do that? Also, I can't understand why it's with_transform instead of just a plain map.

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 28, 2024
@sayakpaul (Member)

@lhoestq a gentle ping.

@lhoestq (Member) commented Feb 28, 2024

I think it's faster because the second map call has to read and write the previously computed embeddings even though they're already saved. Good find!

I would not add train_dataset.cleanup_cache_files() though, because it would prevent the cache from being reloaded properly if you run the code a second time.

@github-actions github-actions bot removed the stale Issues that haven't received updates label Feb 29, 2024
@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq done. Please merge ❤️

@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq I have more very cool things I could add to the training script, like saving the pre-computed dataset so it isn't recomputed when the same script is run again with different params. I just need to find some time to do it, because I also need to do some testing as well...

@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq I merged the current main branch into this one.

@lhoestq (Member) left a comment:

LGTM! I'll let you merge, @sayakpaul, if it's good for you.

PS: I also opened #7171 to fix an issue with the fingerprint of train_dataset_with_vae

Comment on lines +899 to +901:

    train_dataset_with_embeddings = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
    )

(nit) This can be a single line.

@sayakpaul (Member) left a comment:

Looking great! Just one small change.

@sayakpaul (Member)

Let's also resolve the merge conflicts.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul

#6745 (comment)

Done.

@sayakpaul (Member)

There's a merge conflict that we need to resolve before we can ship this.

@kopyl (Contributor, author) commented Mar 1, 2024

There's a merge conflict that we need to resolve before we can ship this.

@sayakpaul Of course. Done, please check.

@sayakpaul (Member)

Thanks! Ping me once the CI run is complete :)

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul it seems to be complete now.

@sayakpaul (Member)

But now we have a code quality problem :/

@kopyl (Contributor, author) commented Mar 1, 2024

But now we have a code quality problem :/

What problem exactly? My change did not introduce any bugs.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul I see the logs, but there is no info on what exactly is wrong.

Could you please tag the person who is responsible for the tests?

@sayakpaul (Member)

You need to run make style && make quality to ensure the code is properly formatted.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul thanks. Now all the checks seem to be completed

sayakpaul merged commit 9a2600e into huggingface:main on Mar 1, 2024 (8 checks passed)
@sayakpaul (Member)

Thanks a lot for your hard work.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul You too, it was a pleasure ❤️. Be ready to review more cool stuff from me in the near future.
