[FIX] Fix label mismatch of evaluation and validation with large dataset in semantic segmentation #1851
Conversation
@ashwinvaidya17, could you double check the effect of
@sungchul2, maybe you also have knowledge about the segmentation issue; do you have any ideas or comments?
Thanks for the nice work. BTW, a unit-test check is needed.
LGTM.
But it would be better to handle it in mask_from_annotation, if there is a way to solve it there.
@kprokofi Hmm.. My VOC dataset is working well. Could you share your model checkpoint and validation dataset? FYI, this will be completely solved when my follow-up PR is merged (please see (3) in this PR description -> the bg label is still not included and the soft threshold is 0.5 by default).
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## develop #1851 +/- ##
===========================================
- Coverage 80.52% 80.51% -0.02%
===========================================
Files 477 477
Lines 32813 32834 +21
===========================================
+ Hits 26423 26435 +12
- Misses 6390 6399 +9
@supersoob thank you for this update. Testing Cityscapes, I found a mismatch in labels caused by:
Currently, the Cityscapes dataset is not supported due to the background label issue.
Thank you for testing with various datasets. Unfortunately, the background label insertion throughout otx is the root cause of the Cityscapes failure. Even if I shift the labels, Cityscapes won't work as long as the bg label is not removed everywhere in otx, which is a very big job, and that change would also affect VOC and other datasets with a bg label. I plan to stabilize VOC first and then look at Cityscapes. I hope to merge this PR if you don't see performance problems with VOC and custom segmentation datasets with >10 labels (including background).
Let's merge it now! Thank you for solving the label mismatch problem.
Issue
(1) 'otx eval' (inference step & final evaluation score after training) shows a very low score for models trained on datasets with many classes (>10)
(2) The order of classes written in the validation log is wrong for datasets with many classes (>10)
(3) [TODO] Semantic segmentation always produces a lower evaluation score than the best validation score -> (found a solution, but it needs some discussion)
Root Causes
(1)(2): When converting the segmentation mask to an otx 2D numpy array (mapping to class indices), the labels are sorted as strings by their id key.
training_extensions/otx/api/utils/segmentation_utils.py
Line 70 in 6cba8c9
Due to this, when num_classes is above 10 the labels are reordered (see the sketch below). That causes a mismatch with the unsorted label schema.json, which is saved in the order of the initial dataset_meta.json, and also a mismatch with the actual prediction order from the model head output.
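For illustration only, a minimal Python sketch (not OTX code) of how sorting ids as strings changes the order once there are more than 10 classes:

```python
# Minimal sketch (not OTX code): sorting label ids as strings reorders them
# once there are more than 10 classes, which breaks the index mapping.
ids = [str(i) for i in range(12)]   # '0', '1', ..., '11'
print(sorted(ids))                  # ['0', '1', '10', '11', '2', '3', ..., '9']
print(sorted(ids, key=int))         # ['0', '1', '2', ..., '11'] (numeric order)
```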
(3): Two factors contribute to this issue: the ignored background and soft_threshold. The final evaluation score does not include the background label score, so it can never match the best validation score. soft_threshold is set to 0.5 by default and ignores the max prediction score when it is below the threshold, but the eval hook (validation) does not apply soft_threshold at all, which is equivalent to a threshold of 0.0.
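To make the divergence concrete, here is a hedged sketch of the two code paths; the variable names and the fallback label are assumptions, not the actual OTX hook implementation:

```python
import numpy as np

# Hypothetical illustration of the threshold mismatch (names are not from OTX).
scores = np.array([0.45, 0.35, 0.20])   # per-class scores for one pixel
soft_threshold = 0.5

# Eval hook (validation): plain argmax, i.e. effectively a threshold of 0.0.
val_pred = int(np.argmax(scores))        # -> class 0

# Final evaluation: when the max score is below soft_threshold, the prediction
# is discarded and the pixel falls back to an ignore/background label.
eval_pred = int(np.argmax(scores)) if scores.max() >= soft_threshold else -1

print(val_pred, eval_pred)               # 0 -1 -> validation and evaluation disagree
```

The same pixel therefore counts as a correct prediction during validation but not during the final evaluation, which pushes the evaluation score below the best validation score.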
Solution
(1): The label dictionary is sorted before evaluation in inference.py, rather than removing the sorted(labels) call in mask_from_annotation, because anomalib also uses that function and Geti, which generates random byte ids, could be affected.
(2): project_labels are sorted before converting the GT mask to the otx mask, and self.CLASSES is realigned to the sorted order (see the sketch after this list).
(3): To be updated in another PR (needs discussion).
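A rough Python sketch of the idea behind (1) and (2); the names below are hypothetical and the real changes live in inference.py and the GT-mask conversion path:

```python
# Hypothetical sketch of the fix: establish one label order and reuse it
# everywhere, instead of relying on a string sort of the ids downstream.
labels = {"building": 11, "road": 0, "car": 5}   # label name -> model index (example)

# (1)/(2) Sort the label dictionary / project_labels once, by model index.
sorted_items = sorted(labels.items(), key=lambda kv: kv[1])
label_dictionary = dict(sorted_items)

# Realign the class list used for logging and metrics with the same order,
# so the converted GT masks and the model head outputs agree.
CLASSES = [name for name, _ in sorted_items]
print(CLASSES)   # ['road', 'car', 'building']
```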
Checklists