Dataset splits do not have exactly the requested weights #292

Closed
ageron opened this issue Mar 20, 2019 · 2 comments
Labels
bug Something isn't working

Comments

@ageron
Contributor

ageron commented Mar 20, 2019

Short description
When I split the tf_flowers dataset into subsplits with weights 10, 15, and 75, I actually get datasets of sizes 400, 600, and 2670. This translates to 10.9%, 16.3%, and 72.8%, which is quite different from what I requested.
Moreover, apart from iterating through each dataset, there does not seem to be a way to know the size of the splits.

Environment information

  • Operating System: MacOSX 10.13.6
  • Python version: 3.6.8
  • tfds-nightly version: tfds-nightly-1.0.1.dev201903180105
  • tf-nightly-2.0-preview version: tf-nightly-2.0-preview-2.0.0.dev20190319

Reproduction instructions

import tensorflow_datasets as tfds

test_split, valid_split, train_split = tfds.Split.TRAIN.subsplit([10, 15, 75])

test_set = tfds.load("tf_flowers", split=test_split, as_supervised=True)
valid_set = tfds.load("tf_flowers", split=valid_split, as_supervised=True)
train_set = tfds.load("tf_flowers", split=train_split, as_supervised=True)

def dataset_length(dataset):
    count = 0
    for example in dataset:  # each element is an (image, label) pair because as_supervised=True
        count += 1
    return count

print(dataset_length(test_set)) # 400
print(dataset_length(valid_set)) # 600
print(dataset_length(train_set)) # 2670

Expected behavior
I expected split sizes with the requested ratios (rounded up or down to the nearest integer): in this example, the correct sizes should have been 367, 550 and 2753 (or 551 and 2752).
I also expect to be able to know the subsplit sizes without iterating through the datasets.
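
Those expected sizes follow directly from the weights; as a quick illustrative calculation (an editor's sketch, not part of the original report):

total = 3670                  # number of examples in tf_flowers
weights = [10, 15, 75]
ideal = [total * w / sum(weights) for w in weights]
print(ideal)  # [367.0, 550.5, 2752.5] -> 367, 550, 2753 (or 367, 551, 2752) after rounding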

Additional context
TFDS is cool.

@ageron ageron added the bug Something isn't working label Mar 20, 2019
@Conchylicultor
Member

Yes, this is due to the current implementation. Because each shard does not contain an exact multiple of 100 examples (100 being the sum of the requested weights), the remaining examples of each shard are arbitrarily assigned to one of the splits, and as this error accumulates across shards, it can lead to an incorrect number of examples per subsplit.

We're going to try to improve the subsplit implementation to reduce the issue. Note that because the total number of examples gives 3670 % 100 == 70 != 0, the last 70 examples of the dataset will still be arbitrarily assigned.
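
To make the shard-rounding effect concrete, here is a small illustrative sketch; the shard sizes are hypothetical and do not reflect the actual tf_flowers shard layout:

shard_sizes = [918, 917, 918, 917]   # hypothetical: 3670 examples spread over 4 shards
weights = [10, 15, 75]

totals = [0, 0, 0]
for shard in shard_sizes:
    per_unit = shard // 100           # whole "weight units" of examples in this shard
    for i, w in enumerate(weights):
        totals[i] += per_unit * w
    totals[-1] += shard % 100         # each shard's remainder assigned arbitrarily (here: to the last split)

print(totals)  # [360, 540, 2770] -- already off the ideal 367 / 550.5 / 2752.5

The real assignment policy differs (hence the observed 400 / 600 / 2670 above), but the mechanism is the same: per-shard rounding plus arbitrary assignment of each shard's remainder skews the final sizes.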

@Conchylicultor
Copy link
Member

With the new split reading, which is independent of the number of shards, this should be fixed.
#458

For tf_flowers, because the number of examples % 100 isn't zero, the last examples will still create a small imbalance.

For your use case, this should give:

test_split, valid_split, train_split = tfds.Split.TRAIN.subsplit([10, 15, 75])

total_num_examples = 3670 = 3600 + 70
test  = 3600 * 0.10 + 10 = 370
valid = 3600 * 0.15 + 15 = 555
train = 3600 * 0.75 + 45 = 2745

Thanks again for reporting this. I'm closing this but do not hesitate to re-open if this doesn't solve your issue.
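
For reference, the percent-slicing syntax introduced with the new split reading also exposes the split sizes through the dataset info, so no iteration is needed. This is a sketch assuming a tfds release where the #458 API is available; the slice boundaries mirror the 10/15/75 weights:

import tensorflow_datasets as tfds

(test_set, valid_set, train_set), info = tfds.load(
    "tf_flowers",
    split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
    as_supervised=True,
    with_info=True)

print(info.splits["train[:10%]"].num_examples)      # test split size
print(info.splits["train[10%:25%]"].num_examples)   # validation split size
print(info.splits["train[25%:]"].num_examples)      # train split size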
