Short description
When I split the tf_flowers dataset into subsplits with weights 10, 15 and 75, I actually get datasets of size 400, 600, and 2670. This translates to 10.9%, 16.3%, 72.8%, which is pretty different from what I requested.
Moreover, apart from iterating through the whole datasets, there does not seem to be a way to know the size of the splits.
Expected behavior
I expected split sizes with the requested ratios (rounded up or down to the nearest integer): in this example, the correct sizes should have been 367, 550 and 2753 (or 551 and 2752).
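As a quick check of the numbers above, here is a minimal sketch that computes the proportional sizes with a largest-remainder rounding rule. The helper name `expected_split_sizes` and the rounding rule are my own choices for illustration; nothing here is part of the TFDS API.

```python
def expected_split_sizes(total, weights):
    """Split `total` items proportionally to `weights` (largest-remainder rounding)."""
    weight_sum = sum(weights)
    # Floor of each exact share, plus the leftover numerator used for ranking.
    sizes = [total * w // weight_sum for w in weights]
    remainders = [total * w % weight_sum for w in weights]
    leftover = total - sum(sizes)
    # Give the leftover items to the splits with the largest remainders.
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


observed = [400, 600, 2670]
total = sum(observed)  # 3670 examples in tf_flowers
expected = expected_split_sizes(total, [10, 15, 75])
observed_pct = [round(100 * n / total, 1) for n in observed]

print(expected)      # [367, 551, 2752]
print(observed_pct)  # [10.9, 16.3, 72.8]
```

The exact shares are 367, 550.5 and 2752.5, so either of the two roundings mentioned above is acceptable; this rule happens to pick 551 and 2752.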
I also expect to be able to know the subsplit sizes without iterating through the datasets.
Additional context
TFDS is cool.
Yes, this is due to the current implementation. Because each shard does not contain exactly x*100 examples, the remaining examples of each shard are arbitrarily assigned to one of the splits. As this error accumulates across shards, it may lead to an incorrect number of samples per subsplit.
We're going to try to improve the subsplit implementation to reduce the issue. Note that because the total number of samples is not a multiple of 100 (2752 % 100 == 52 != 0), the last 52 examples of the dataset will still be arbitrarily assigned.
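To illustrate how per-shard remainders can accumulate, here is a toy model, not the actual TFDS code. It assumes each shard is sliced independently in cycles of 100 positions (the first 10 go to the first subsplit, the next 15 to the second, the rest to the third), and it assumes a hypothetical layout of 20 shards. Both the function `toy_subsplit_sizes` and the shard layout are assumptions made for this sketch.

```python
def toy_subsplit_sizes(shard_sizes, weights):
    """Count examples per subsplit when every shard is sliced on its own."""
    # Cumulative boundaries inside a 100-slot cycle, e.g. [10, 25, 100].
    bounds = []
    running = 0
    for w in weights:
        running += w
        bounds.append(running)
    assert running == 100, "weights are assumed to sum to 100"

    counts = [0] * len(weights)
    for size in shard_sizes:
        for position in range(size):
            slot = position % 100
            # First subsplit whose boundary lies above this slot.
            split = next(i for i, b in enumerate(bounds) if slot < b)
            counts[split] += 1
    return counts


# Hypothetical layout: 3670 examples spread over 20 shards of 183 or 184 examples.
shard_sizes = [184] * 10 + [183] * 10
sharded = toy_subsplit_sizes(shard_sizes, [10, 15, 75])
single = toy_subsplit_sizes([3670], [10, 15, 75])

print(sharded)  # [400, 600, 2670]
print(single)   # [370, 555, 2745]
```

Under these assumptions, the incomplete cycle at the end of each shard always fills the first two subsplits before the third, so the skew grows with the number of shards. With a single shard, only the final 70 examples are affected.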
Environment information
tfds-nightly version: tfds-nightly-1.0.1.dev201903180105
tf-nightly-2.0-preview version: tf-nightly-2.0-preview-2.0.0.dev20190319
Reproduction instructions