
Streaming vision datasets #284

Merged: 63 commits merged into dev on Mar 8, 2022
Conversation

@knighton (Contributor)

No description provided.

@abhi-mosaic (Contributor) left a comment
Looks good so far! Left a few comments; looking forward to some end-to-end testing with streaming the newly-created datasets from S3 before approving this PR. Quick reminder that for the big datasets we probably don't want to test more than 1 or 2 epochs, because it will cost $$$ in egress (until we have caching).

We can also postpone support for caching to a future PR, since it is currently still broken in webdataset. Context: webdataset/webdataset#131

@ravi-mosaicml (Contributor)

@knighton curious about the status of this PR?

@knighton (Contributor, Author)

@ravi-mosaicml webdataset is giving weird issues.

@knighton (Contributor, Author) commented Mar 4, 2022

Item 1 is fixed in commit b6ead1a.

For item 3, some minor differences are expected behavior: these sharded streaming datasets are new datasets.

@abhi-mosaic (Contributor)

I started to test this and talked to James about these comments, but I thought I should add them here:

  1. `pip install --user -e .[all]` failed due to conflicting dependencies: `ERROR: Cannot install awscli==1.22.67, mosaicml and mosaicml[all]==0.4.0 because these package versions have conflicting dependencies.`
  2. If a download fails and is then retried, the retry reuses the folder from the failed attempt and fails again. James might have adjusted this one? (A possible cleanup-before-retry sketch follows the figure below.)
  3. The calculation for the number of iterations appears to be off: training ends in ~900 fewer iterations (~1%), and the LR schedule is also adjusted:

[Figure: training curves comparing the webdataset run (pink) and the original run (blue)]

The pink line is webdataset and the blue line is my original runs. The performance seems to be a bit worse, but it is possible that it is within noise. I would need to do more runs, but I can try ResNet-50 on ImageNet first.
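For item 2, one possible mitigation (an illustrative sketch only, not what this PR implements; `download_fn`, `shard_dir`, and the retry count are hypothetical) is to wipe the leftover folder from the failed attempt before retrying:

```python
import shutil
from pathlib import Path
from typing import Callable

def download_with_retry(download_fn: Callable[[Path], None],
                        shard_dir: Path,
                        retries: int = 3) -> None:
    """Retry a download, clearing any partial output left by a failed attempt."""
    for attempt in range(retries):
        try:
            download_fn(shard_dir)
            return
        except Exception:
            # Remove the partially-downloaded folder so the retry starts clean.
            shutil.rmtree(shard_dir, ignore_errors=True)
            if attempt == retries - 1:
                raise
```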

@Landanjs Just want to confirm that the lower number of samples is expected; it comes from the constraints of sharded datasets. Basically, we need an equal number of samples per shard, so `n_samples % n_shards == 0`, and to accomplish this we truncate the dataset. We have therefore lost a few examples for each vision dataset, but I think @knighton was careful to keep the loss small, around ~1%.
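Roughly, the truncation drops the `n_samples % n_shards` samples that don't divide evenly across shards; a minimal sketch with made-up numbers (not the PR's actual shard counts):

```python
def truncated_size(n_samples: int, n_shards: int) -> int:
    """Largest count <= n_samples that splits evenly across n_shards."""
    return (n_samples // n_shards) * n_shards

# Hypothetical numbers, just to show the arithmetic:
n_samples, n_shards = 50_000, 128
kept = truncated_size(n_samples, n_shards)  # 49_920
dropped = n_samples - kept                  # 80 samples (~0.16%)
print(kept, dropped)
```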

This probably means that baselines need to be re-run for all vision tasks. I would love to know if the results are meaningfully different or within the noise! I'm hoping the latter :)

@hanlint merged commit bb33b1a into dev on Mar 8, 2022
@hanlint deleted the james/wds branch on Mar 8, 2022 at 00:25