
Streaming vision datasets #284

Merged: 63 commits merged into dev on Mar 8, 2022
Conversation

@knighton (Contributor)

No description provided.

@abhi-mosaic (Contributor) left a comment
Looks good so far! Left a few comments; looking forward to some end-to-end testing with streaming the newly-created datasets from S3 before approving this PR. Quick reminder that for the big datasets we probably don't want to test more than 1 or 2 epochs, because it will cost $$$ in egress (until we have caching).

We can also postpone support for caching to a future PR, since it is currently still broken in webdataset. Context: webdataset/webdataset#131

@ravi-mosaicml (Contributor)

@knighton curious about the status of this PR?

@knighton (Contributor, Author)

@ravi-mosaicml webdataset is giving weird issues.

@knighton (Contributor, Author) commented Mar 4, 2022

Item 1 is fixed in commit b6ead1a.

For item 3, some minor differences are expected behavior: these sharded streaming datasets are new datasets.

@abhi-mosaic (Contributor)

I started to test this and talked to James about these comments, but I thought I should add them here:

  1. `pip install --user -e .[all]` failed due to conflicting dependencies: `ERROR: Cannot install awscli==1.22.67, mosaicml and mosaicml[all]==0.4.0 because these package versions have conflicting dependencies.`
  2. If a download fails and is then retried, the retry reuses the folder from the failed attempt and fails again. James might have adjusted this one? (A possible cleanup-before-retry sketch follows the figure below.)
  3. The calculation for the number of iterations appears to be off: training ends in ~900 fewer iterations (~1%), and the LR schedule is also adjusted:

[Figure: training curves comparing the webdataset run (pink) and the original run (blue)]

The pink line is webdataset and the blue line is my original runs. The performance seems to be a bit worse, but it is possible that it is within noise. I would need to do more runs, but I can try ResNet-50 on ImageNet first.
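For item 2, one possible mitigation (an illustrative sketch only, not what this PR implements; `download_fn`, `shard_dir`, and the retry count are hypothetical) is to wipe the leftover folder from the failed attempt before retrying:

```python
import shutil
from pathlib import Path
from typing import Callable

def download_with_retry(download_fn: Callable[[Path], None],
                        shard_dir: Path,
                        retries: int = 3) -> None:
    """Retry a download, clearing any partial output left by a failed attempt."""
    for attempt in range(retries):
        try:
            download_fn(shard_dir)
            return
        except Exception:
            # Remove the partially-downloaded folder so the retry starts clean.
            shutil.rmtree(shard_dir, ignore_errors=True)
            if attempt == retries - 1:
                raise
```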

@Landanjs Just want to confirm that the lower number of samples is expected; it comes from the constraints of sharded datasets. Basically, we need an equal number of samples per shard, so `n_samples % n_shards == 0`, and to accomplish this we truncate the dataset. We have therefore lost a few examples for each vision dataset, but I think @knighton was careful to keep the loss small, around ~1%.
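Roughly, the truncation drops the `n_samples % n_shards` samples that don't divide evenly across shards; a minimal sketch with made-up numbers (not the PR's actual shard counts):

```python
def truncated_size(n_samples: int, n_shards: int) -> int:
    """Largest count <= n_samples that splits evenly across n_shards."""
    return (n_samples // n_shards) * n_shards

# Hypothetical numbers, just to show the arithmetic:
n_samples, n_shards = 50_000, 128
kept = truncated_size(n_samples, n_shards)  # 49_920
dropped = n_samples - kept                  # 80 samples (~0.16%)
print(kept, dropped)
```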

This probably means that baselines need to be re-run for all vision tasks. I would love to know if the results are meaningfully different or within the noise! I'm hoping the latter :)

@hanlint merged commit bb33b1a into dev on Mar 8, 2022
@hanlint deleted the james/wds branch on Mar 8, 2022 at 00:25