diff --git a/docs/source/guide/explanation/algorithms/segmentation/semantic_segmentation.rst b/docs/source/guide/explanation/algorithms/segmentation/semantic_segmentation.rst
index e631301702d..a41013ef3ea 100644
--- a/docs/source/guide/explanation/algorithms/segmentation/semantic_segmentation.rst
+++ b/docs/source/guide/explanation/algorithms/segmentation/semantic_segmentation.rst
@@ -14,16 +14,17 @@ The output of semantic segmentation is typically an image where each pixel is co
 |
-We solve this task by utilizing `FCN Head `_ with implementation from `MMSegmentation `_ on the multi-level image features obtained by the feature extractor backbone (`Lite-HRNet `_).
+We solve this task by utilizing segmentation decoder heads on the multi-level image features obtained by the feature extractor backbone.
 For the supervised training we use the following algorithms components:
 .. _semantic_segmentation_supervised_pipeline:
 - ``Augmentations``: Besides basic augmentations like random flip, random rotate and random crop, we use mixing images technique with different `photometric distortions `_.
-- ``Optimizer``: We use `Adam `_ optimizer with weight decay set to zero and gradient clipping with maximum quadratic norm equals to 40.
+- ``Optimizer``: We use `Adam `_ and `AdamW `_ optimizers.
-- ``Learning rate schedule``: For scheduling training process we use **ReduceLROnPlateau** with linear learning rate warmup for 100 iterations. This method monitors a target metric (in our case we use metric on the validation set) and if no improvement is seen for a ``patience`` number of epochs, the learning rate is reduced.
+- ``Learning rate schedule``: For scheduling training process we use **ReduceLROnPlateau** with linear learning rate warmup for 100 iterations for `Lite-HRNet `_ family. This method monitors a target metric (in our case we use metric on the validation set) and if no improvement is seen for a ``patience`` number of epochs, the learning rate is reduced.
+ For `SegNext `_ and `DinoV2 `_ models we use `PolynomialLR `_ scheduler. - ``Loss function``: We use standard `Cross Entropy Loss `_ to train a model. @@ -39,14 +40,6 @@ For the dataset handling inside OpenVINO™ Training Extensions, we use `Dataset At this end we support `Common Semantic Segmentation `_ data format. If you organized supported dataset format, starting training will be very simple. We just need to pass a path to the root folder and desired model recipe to start training: -.. note:: - - Due to some internal limitations, the dataset should always consist of a "background" label. If your dataset doesn't have a background label, rename the first label to "background" in the ``meta.json`` file. - - -.. note:: - - Currently, metrics with models trained with our OTX dataset adapter can differ from popular benchmarks. To avoid this and train the model on exactly the same segmentation masks as intended by the authors, please, set the parameter ``use_otx_adapter`` to ``False``. ****** Models @@ -55,43 +48,46 @@ Models We support the following ready-to-use model recipes: -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| Recipe ID | Name | Complexity (GFLOPs) | Model size (MB) | -+======================================================================================================================================================================================+========================+=====================+=================+ -| `Custom_Semantic_Segmentation_Lite-HRNet-s-mod2_OCR `_ | Lite-HRNet-s-mod2 | 1.44 | 3.2 | 
-+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| `Custom_Semantic_Segmentation_Lite-HRNet-18-mod2_OCR `_ | Lite-HRNet-18-mod2 | 2.82 | 4.3 | -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| `Custom_Semantic_Segmentation_Lite-HRNet-x-mod3_OCR `_ | Lite-HRNet-x-mod3 | 9.20 | 5.7 | -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| `Custom_Semantic_Segmentation_SegNext_T `_ | SegNext-t | 6.07 | 4.23 | -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| `Custom_Semantic_Segmentation_SegNext_S `_ | SegNext-s | 15.35 | 13.9 | -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ -| `Custom_Semantic_Segmentation_SegNext_B `_ | SegNext-b | 32.08 | 27.56 | -+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+---------------------+-----------------+ - -All of these models are members of the same `Lite-HRNet `_ backbones family. 
They differ in the trade-off between accuracy and inference/training speed. ``Lite-HRNet-x-mod3`` is the recipe with heavy-size architecture for accurate predictions but it requires long training. -Whereas the ``Lite-HRNet-s-mod2`` is the lightweight architecture for fast inference and training. It is the best choice for the scenario of a limited amount of data. The ``Lite-HRNet-18-mod2`` model is the middle-sized architecture for the balance between fast inference and training time. - -Use `SegNext `_ model which can achieve superior perfomance while preserving fast inference and fast training. - -In the table below the `Dice score `_ on some academic datasets using our :ref:`supervised pipeline ` is presented. We use 512x512 image crop resolution, for other hyperparameters, please, refer to the related recipe. We trained each model with single Nvidia GeForce RTX3090. ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| Recipe Path | Complexity (GFLOPs) | Model size (M) | FPS (GPU) | iter time (sec) | ++======================================================================================================================================================================================+=====================+=================+=================+=================+ +| `Lite-HRNet-s-mod2 `_ | 1.44 | 0.82 | 37.68 | 0.151 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `Lite-HRNet-18-mod2 `_ | 2.63 | 1.10 | 31.17 | 0.176 | 
++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `Lite-HRNet-x-mod3 `_ | 9.20 | 1.50 | 15.07 | 0.347 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `SegNext_T `_ | 12.44 | 4.23 | 104.90 | 0.126 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `SegNext_S `_ | 30.93 | 13.90 | 85.67 | 0.134 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `SegNext_B `_ | 64.65 | 27.56 | 61.91 | 0.215 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ +| `DinoV2 `_ | 124.01 | 24.40 | 3.52 | 0.116 | ++--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+-----------------+-----------------+-----------------+ + +All of these models differ in the trade-off between accuracy and inference/training speed. 
For example, ``SegNext_B`` is the recipe with heavy-size architecture for more accurate predictions, but it requires longer training.
+Whereas the ``Lite-HRNet-s-mod2`` is the lightweight architecture for fast inference and training. It is the best choice for the scenario of a limited amount of data. The ``Lite-HRNet-18-mod2`` and ``SegNext_S`` models are the middle-sized architectures for the balance between fast inference and training time.
+``DinoV2`` is the state-of-the-art model producing universal features suitable for all image-level and pixel-level visual tasks. This model doesn't require fine-tuning of the whole backbone, but only of the segmentation decode head. Because of that, it provides faster training while preserving high accuracy.
+
+In the table below the `Dice score `_ on some academic datasets using our :ref:`supervised pipeline ` is presented. We use 512x512 (560x560 for DinoV2) image crop resolution; for other hyperparameters, please refer to the related recipe. We trained each model with a single Nvidia GeForce RTX3090.
+-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| Model name | `DIS5K `_ | `Cityscapes `_ | `Pascal-VOC 2012 `_ | `KITTI full `_ | Mean | +| Model name | `DIS5K `_ | `Cityscapes `_ | `Pascal-VOC 2012 `_ | `KITTI `_ | Mean | +=======================+==============================================================+=====================================================+======================================================================+=================================================================+========+ -| Lite-HRNet-s-mod2 | 79.95 | 62.38 | 58.26 | 36.06 | 59.16 | +| Lite-HRNet-s-mod2 | 78.73 | 69.25 | 63.26 | 41.73 | 63.24 | ++-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ +| Lite-HRNet-18-mod2 | 81.43 | 72.66 | 62.10 | 46.73 | 65.73 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| Lite-HRNet-18-mod2 | 81.12 | 65.04 | 63.48 | 39.14 | 62.20 | +| Lite-HRNet-x-mod3 | 82.36 | 74.57 | 59.55 | 49.97 | 66.61 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| Lite-HRNet-x-mod3 | 79.98 | 59.97 | 61.9 | 41.55 | 60.85 
| +| SegNext-t | 83.99 | 77.09 | 84.05 | 48.99 | 73.53 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| SegNext-t | 85.05 | 70.67 | 80.73 | 51.25 | 68.99 | +| SegNext-s | 85.54 | 79.45 | 86.00 | 52.19 | 75.80 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| SegNext-s | 85.62 | 70.91 | 82.31 | 52.94 | 69.82 | +| SegNext-b | 86.76 | 76.14 | 87.92 | 57.73 | 77.14 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ -| SegNext-b | 87.92 | 76.94 | 85.01 | 55.49 | 73.45 | +| DinoV2 | 84.87 | 73.58 | 88.15 | 65.91 | 78.13 | +-----------------------+--------------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------+--------+ .. note:: diff --git a/src/otx/algo/segmentation/base_model.py b/src/otx/algo/segmentation/base_model.py index 76de6df2047..057c9b3c1b4 100644 --- a/src/otx/algo/segmentation/base_model.py +++ b/src/otx/algo/segmentation/base_model.py @@ -67,7 +67,7 @@ def forward( - Otherwise, returns the model outputs after interpolation. 
""" enc_feats = self.backbone(inputs) - outputs = self.decode_head(enc_feats) + outputs = self.decode_head(inputs=enc_feats) if mode == "tensor": return outputs diff --git a/src/otx/core/model/segmentation.py b/src/otx/core/model/segmentation.py index 80578929aca..6d3c63cad7b 100644 --- a/src/otx/core/model/segmentation.py +++ b/src/otx/core/model/segmentation.py @@ -164,7 +164,7 @@ def __init__( def _customize_inputs(self, entity: SegBatchDataEntity) -> dict[str, Any]: mode = "loss" if self.training else "predict" - masks = torch.stack(entity.masks).long() + masks = torch.stack(entity.masks).long() if mode == "loss" else None return {"inputs": entity.images, "img_metas": entity.imgs_info, "masks": masks, "mode": mode}