Streaming vision datasets #284
Looks good so far! Left a few comments. Looking forward to some end-to-end testing with streaming the newly created datasets from S3 before approving this PR. Quick reminder that for the big datasets, we probably don't want to test more than 1 or 2 epochs because it will cost $$$ in egress (until we have caching).
We can also postpone support for caching to a future PR, since it is currently still broken in webdataset. Context: webdataset/webdataset#131
@knighton curious about the status of this PR?
@ravi-mosaicml webdataset is giving weird issues
Item 1 is fixed in commit b6ead1a. For item 3, some minor differences are expected behavior: these sharded streaming datasets are new datasets.
@Landanjs Just want to confirm that the number of samples is expected to be lower; this is due to the constraints of sharded datasets. Basically, we need an equal number of samples per shard. This probably means that baselines need to be re-run for all vision tasks. I would love to know if the results are meaningfully different or within the noise! I'm hoping the latter :)
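To make the equal-samples-per-shard constraint concrete, here is a minimal sketch of the arithmetic. It assumes that samples which do not divide evenly across shards are simply dropped; that is an assumption for illustration, not something confirmed from the code in this PR. The function name `equal_shard_counts` and the numbers are hypothetical.

```python
from typing import Tuple


def equal_shard_counts(num_samples: int, num_shards: int) -> Tuple[int, int, int]:
    """Return (samples_per_shard, samples_kept, samples_dropped).

    Assumption: every shard must hold the same number of samples, and any
    remainder that cannot be spread evenly is discarded.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")

    # Integer division gives the largest per-shard count that all shards can share.
    per_shard = num_samples // num_shards
    kept = per_shard * num_shards
    dropped = num_samples - kept
    return per_shard, kept, dropped


# Hypothetical example: 1000 samples split across 48 shards.
per_shard, kept, dropped = equal_shard_counts(1000, 48)
print(per_shard, kept, dropped)
```

Under this assumption, the number of dropped samples is always smaller than the number of shards, so the shortfall should be small relative to the full dataset.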