
Map speedup #6745

Merged (11 commits, Mar 1, 2024)
Conversation

@kopyl (Contributor) commented Jan 28, 2024

Speed up the 2nd mapping (computing the VAE embeddings) in examples/text_to_image/train_text_to_image_sdxl.py.

Tested on 833 samples of the lambdalabs/pokemon-blip-captions dataset:

This PR: 1m 48s
Current implementation: 2m 25s

@kopyl (Contributor, author) commented Jan 29, 2024

@sayakpaul could you please merge it?

@sayakpaul (Member)

Can you explain what exactly you are doing here?

@kopyl (Contributor, author) commented Jan 29, 2024

@sayakpaul sure. Instead of running the second map over the already-mapped dataset, I assign the result to another variable and then merge the two datasets together.

Speeds things up a lot.

@sayakpaul (Member)

@lhoestq could you comment on this change?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) left a comment:

I don't see how it can be faster since the same two map() calls are still there. Can you elaborate?

Also, I'm not sure why with_transform is called twice (once in your code addition, and once on line 866).

@kopyl (Contributor, author) commented Jan 30, 2024

@lhoestq I have no idea how it can be faster, but it is. You can run a test training on the Pokémon dataset and see that it's faster.

I don't know how to remove the second .with_transform (or the first). If I do, the entire training breaks due to the absence of required columns. Maybe you can help me do that? Also, I can't understand why it's with_transform instead of just a plain map.

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 28, 2024
@sayakpaul (Member)

@lhoestq a gentle ping.

@lhoestq (Member) commented Feb 28, 2024

I think it's faster because the second map call has to read and write the previously computed embeddings even though they're already saved. Good find!

I would not add train_dataset.cleanup_cache_files() though, because it would prevent the cache from being reloaded properly if you run the code a second time.

@github-actions github-actions bot removed the stale Issues that haven't received updates label Feb 29, 2024
@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq done. Please merge ❤️

@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq I have more very cool things I could add to the training script, like saving the pre-computed dataset so it isn't recomputed when the same script is run again with different params. I just need to find some time to do it, because I also need to do some testing as well...

@kopyl (Contributor, author) commented Mar 1, 2024

@lhoestq I merged the current main branch into this one.

@lhoestq (Member) left a comment:

LGTM! I'll let you merge, @sayakpaul, if it's good for you.

PS: I also opened #7171 to fix an issue with the fingerprint of train_dataset_with_vae

Comment on lines +899 to +901:

    train_dataset_with_embeddings = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
    )

(nit) This can be a single line.

@sayakpaul (Member) left a comment:

Looking great! Just one small change.

@sayakpaul (Member)

Let's also resolve the merge conflicts.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul

#6745 (comment)

Done.

@sayakpaul (Member)

There's a merge conflict that we need to resolve before we can ship this.

@kopyl (Contributor, author) commented Mar 1, 2024

There's a merge conflict that we need to resolve before we can ship this.

@sayakpaul Of course. Done, please check.

@sayakpaul (Member)

Thanks! Ping me once the CI run is complete :)

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul it seems to be complete now.

@sayakpaul (Member)

But now we have a code quality problem :/

@kopyl (Contributor, author) commented Mar 1, 2024

But now we have a code quality problem :/

What problem exactly? My change did not introduce any bugs.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul I see the logs, but there is no info on what exactly is wrong.

Could you please tag the person who is responsible for the tests?

@sayakpaul (Member)

You need to run make style && make quality to ensure the code is properly formatted.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul thanks. Now all the checks seem to be completed

sayakpaul merged commit 9a2600e into huggingface:main on Mar 1, 2024 (8 checks passed)
@sayakpaul (Member)

Thanks a lot for your hard work.

@kopyl (Contributor, author) commented Mar 1, 2024

@sayakpaul You too, it was a pleasure ❤️. Be ready to review more cool stuff from me in the near future.
