-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: data split issue #404
Fix: data split issue #404
Conversation
What's wrong with setting the seed to a value other than 0? If we make this change, we will have to ensure that lines like if config.project.seed != 0:
seed_everything(config.project.seed) are changed in other files - |
If seed is non-zero, there is no problem. Does seed 0 means that the seed is not fixed in this project? It seems that anomalib/anomalib/data/folder.py Lines 163 to 166 in 8364577
Now, the PR code solves the data contamination problem, but each traininig data and test data are created identically for the seed 0. |
@jeongHwarr not sure if I quite understand the contamination problem. Currently we treat seed 0 as having no seed. However, even with seed 0 I get no overlap between train split and the test split. Here are the sample lines of code I am using for testing this. This does not raise Assertion Error for multiple runs. from anomalib.data.folder import make_dataset
from pathlib import Path
dataset = make_dataset(normal_dir=Path("../datasets/bottle/good/"), abnormal_dir=Path('../datasets/bottle/broken_large'), split_ratio=0.33, seed=0)
train_split = dataset[dataset.split == "train"]
test_split = dataset[dataset.split == "test"]
assert train_split["image_path"].values.all() != test_split["image_path"].values.all() Here is another sample using Folder DataModule folder_dataset = Folder(root=Path("../datasets/bottle"),normal_dir="good", abnormal_dir='broken_large', split_ratio=0.1, seed=0, image_size=256)
folder_dataset.setup()
assert folder_dataset.test_data.samples["image_path"].values.all() != folder_dataset.train_data.samples["image_path"].values.all() |
There is no overlap in your test code because you called However, in the code of this project, anomalib/anomalib/data/folder.py Lines 163 to 166 in 8364577
To reproduce this situation, set normal_dir to None in yaml, let the above part be called, and then compare the training data and test data. Then you can find the overlap there.
anomalib/anomalib/data/folder.py Lines 476 to 523 in 8364577
anomalib/anomalib/data/folder.py Lines 246 to 256 in 8364577
To solve this problem, anomalib/anomalib/data/folder.py Lines 172 to 175 in 8364577
|
@jeongHwarr I see what you mean. Thanks for spotting this. I'll look into the issue. But does it make sense to keep seed 0 as no seed or should we refactor the code to make |
@jeongHwarr Thanks again for spotting this issue and creating a PR for addressing it. Is it possible for you to make a few changes to this PR (https://github.com/openvinotoolkit/anomalib/pull/437/files) so that we can merge this and you can be a contributor to this repo 😀. Other option is to always merge this and I can create a new PR to address this. |
I think that I've reviewed your PR and nothing seems to change! |
Description
Provide a summary of the modification as well as the issue that has been resolved. List any dependencies that this modification necessitates.
Fixes When test_split is applied to a custom dataset, test data contamination occurs #403
Changes
Checklist