[Feature] Add Maskfeat Support (#485)

* [Feature]: Add MaskfeatMaskGenerator Pipeline
* [Feature]: Add HogLayerC for MaskFeat
* [Feature]: Add Backbone of MaskFeat
* [Feature]: Add Head of MaskFeat
* [Feature]: Add Algorithms of MaskFeat
* [Feature]: Add Config of MaskFeat
* [Doc] Update Readme of MaskFeat
* [Fix] fix ut and hog_layer.
* [fix] Add and correct docstring
* [Fix] Refine the docstring of MaskFeat
* [fix] fix value of trunc_normal_
* [fix] rename the finetune config of maskfeat
* [fix] rename the fine-tuning config of maskfeat
* [fix] rename the fine-tuning config of maskfeat
* [fix] add new paramwise_options in fine-tuning config
* [fix] update the top-1 accuracy of maskfeat
* [fix] update the top-1 accuracy of maskfeat in model_zoo
* [fix] rename MaskfeatMaskGenerator
open-mmlab · Sep 30, 2022 · af7eb03
1 parent 1c57f3f
commit af7eb03
Showing 23 changed files with 925 additions and 17 deletions.
diff --git a/configs/benchmarks/classification/imagenet/vit-base-p16_ft-8xb256-coslr-100e_in1k.py b/configs/benchmarks/classification/imagenet/vit-base-p16_ft-8xb256-coslr-100e_in1k.py
@@ -0,0 +1,76 @@
+_base_ = [
+    '../_base_/models/vit-base-p16_ft.py',
+    '../_base_/datasets/imagenet.py',
+    '../_base_/schedules/adamw_coslr-100e_in1k.py',
+    '../_base_/default_runtime.py',
+]
+# maskfeat fine-tuning setting
+
+# dataset
+img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+train_pipeline = [
+    dict(
+        type='RandomAug',
+        input_size=224,
+        color_jitter=0.4,
+        auto_augment='rand-m9-mstd0.5-inc1',
+        interpolation='bicubic',
+        re_prob=0.25,
+        re_mode='pixel',
+        re_count=1,
+        mean=(0.485, 0.456, 0.406),
+        std=(0.229, 0.224, 0.225))
+]
+test_pipeline = [
+    dict(type='Resize', size=256, interpolation=3),
+    dict(type='CenterCrop', size=224),
+    dict(type='ToTensor'),
+    dict(type='Normalize', **img_norm_cfg)
+]
+data = dict(
+    samples_per_gpu=256,
+    drop_last=False,
+    workers_per_gpu=32,
+    train=dict(pipeline=train_pipeline),
+    val=dict(pipeline=test_pipeline))
+
+# model
+model = dict(
+    backbone=dict(init_cfg=dict()),
+    head=dict(
+        type='MaskFeatFinetuneHead',
+        num_classes=1000,
+        embed_dim=768,
+        label_smooth_val=0.1))
+
+# optimizer
+optimizer = dict(
+    lr=0.002 * 8 / 2,
+    betas=(0.9, 0.999),
+    weight_decay=0.05,
+    paramwise_options={
+        'ln': dict(weight_decay=0.),
+        'bias': dict(weight_decay=0.),
+        'pos_embed': dict(weight_decay=0.),
+        'cls_token': dict(weight_decay=0.),
+    },
+    constructor='TransformerFinetuneConstructor',
+    model_type='vit',
+    layer_decay=0.65)
+
+# learning policy
+lr_config = dict(
+    policy='CosineAnnealing',
+    min_lr=1e-6,
+    warmup='linear',
+    warmup_iters=20,
+    warmup_ratio=1e-08,
+    warmup_by_epoch=True)
+
+# runtime
+checkpoint_config = dict(interval=1, max_keep_ckpts=3, out_dir='')
+persistent_workers = True
+log_config = dict(
+    interval=100, hooks=[
+        dict(type='TextLoggerHook'),
+    ])
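Review note on `layer_decay=0.65`: the body of `TransformerFinetuneConstructor` is not part of this excerpt, so the following is only a minimal sketch of the layer-wise learning-rate decay scheme commonly used for ViT fine-tuning. The function name `layer_lr_scales` and the layer-id convention (id 0 for the embeddings, the last id for the head) are assumptions made for illustration.

```python
from typing import List


def layer_lr_scales(num_layers: int, layer_decay: float) -> List[float]:
    """Return one LR multiplier per layer id.

    Assumed convention: id 0 is the patch/position embedding, ids
    1..num_layers are the transformer blocks, and the last id is the head.
    """
    total = num_layers + 2
    # The head keeps the base LR; each earlier layer is scaled down once more.
    return [layer_decay ** (total - 1 - i) for i in range(total)]


# Base learning rate exactly as written in the config above.
base_lr = 0.002 * 8 / 2

# ViT-B has 12 transformer blocks.
scales = layer_lr_scales(num_layers=12, layer_decay=0.65)
head_lr = base_lr * scales[-1]
embed_lr = base_lr * scales[0]
```

Under this assumption the head trains at the full base rate while the embeddings receive `0.65 ** 13` of it, which is why a fairly large base learning rate can be used without disturbing the pre-trained lower layers too much.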
diff --git a/configs/selfsup/_base_/datasets/imagenet_maskfeat.py b/configs/selfsup/_base_/datasets/imagenet_maskfeat.py
@@ -0,0 +1,35 @@
+# dataset settings
+data_source = 'ImageNet'
+dataset_type = 'SingleViewDataset'
+img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+train_pipeline = [
+    dict(
+        type='RandomResizedCropAndInterpolationWithTwoPic',
+        size=224,
+        scale=(0.5, 1.0),
+        ratio=(0.75, 1.3333),
+        interpolation='bicubic'),
+    dict(type='RandomHorizontalFlip')
+]
+
+# prefetch
+prefetch = False
+if not prefetch:
+    train_pipeline.extend(
+        [dict(type='ToTensor'),
+         dict(type='Normalize', **img_norm_cfg)])
+
+train_pipeline.append(dict(type='MaskFeatMaskGenerator', mask_ratio=0.4))
+
+# dataset summary
+data = dict(
+    samples_per_gpu=256,
+    workers_per_gpu=8,
+    train=dict(
+        type=dataset_type,
+        data_source=dict(
+            type=data_source,
+            data_prefix='data/imagenet/train',
+            ann_file='data/imagenet/meta/train.txt'),
+        pipeline=train_pipeline,
+        prefetch=prefetch))
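Review note on `mask_ratio=0.4`: with 224x224 inputs and 16x16 patches there are 14x14 = 196 patches, so roughly 78 of them are masked per image. The sketch below illustrates that arithmetic with uniform random sampling. It is not the `MaskFeatMaskGenerator` added by this commit (its code is not shown in this excerpt and it may sample masks differently); `random_patch_mask` is a hypothetical helper.

```python
import random
from typing import List


def random_patch_mask(grid: int = 14,
                      mask_ratio: float = 0.4,
                      seed: int = 0) -> List[List[int]]:
    """Return a grid x grid 0/1 mask with int(grid*grid*mask_ratio) ones."""
    num_patches = grid * grid
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    # Sample patch indices without replacement (illustrative assumption).
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if row * grid + col in masked else 0 for col in range(grid)]
            for row in range(grid)]


mask = random_patch_mask()
num_masked_patches = sum(sum(row) for row in mask)
```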
diff --git a/configs/selfsup/_base_/models/maskfeat_vit-base-p16.py b/configs/selfsup/_base_/models/maskfeat_vit-base-p16.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+    type='MaskFeat',
+    backbone=dict(
+        type='MaskFeatViT',
+        arch='b',
+        patch_size=16,
+        drop_path_rate=0,
+    ),
+    head=dict(type='MaskFeatPretrainHead', hog_dim=108),
+    hog_para=dict(
+        nbins=9,  # Number of orientation bins. Defaults to 9.
+        pool=8,  # Cell (pooling) size in pixels. Defaults to 8.
+        gaussian_window=16  # Size of gaussian kernel. Defaults to 16.
+    ))
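Review note on `hog_dim=108`: the value follows from the HOG settings above, assuming (as the MaskFeat paper describes) that a histogram is computed per RGB channel and that the cell histograms inside one patch are concatenated. The helper name `hog_feature_dim` is made up for this check.

```python
def hog_feature_dim(patch_size: int = 16,
                    pool: int = 8,
                    nbins: int = 9,
                    channels: int = 3) -> int:
    """Length of the HOG target vector for one image patch."""
    # A 16x16 patch pooled with 8x8 cells contains 2x2 cells.
    cells_per_side = patch_size // pool
    return channels * nbins * cells_per_side ** 2


# 3 channels * 9 bins * 4 cells
hog_dim = hog_feature_dim()
```

If `pool`, `nbins`, or `patch_size` is changed in the model config, `hog_dim` in the head would need to change accordingly.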
diff --git a/configs/selfsup/maskfeat/README.md b/configs/selfsup/maskfeat/README.md
@@ -0,0 +1,34 @@
+# MaskFeat
+
+> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1)
+
+<!-- [ALGORITHM] -->
+
+## Abstract
+
+We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
+
+<div align="center">
+<img src="https://user-images.githubusercontent.com/48178838/190090285-428f07c0-0887-4ce8-b94f-f719cfd25622.png" width="60%"/>
+</div>
+
+## Models and Benchmarks
+
+Here, we report the results of the model pre-trained on ImageNet-1k
+for 300 epochs. The details are below:
+
+| Backbone | Pre-train epoch | Fine-tuning Top-1 |                                                            Pre-train Config                                                            |                                                                     Fine-tuning Config                                                                      |                                                                                                                            Download                                                                                                                             |
+| :------: | :-------------: | :---------------: | :------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| ViT-B/16 |       300       |       83.5        | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py) |     [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/benchmarks/classification/imagenet/vit-base-p16_ft-8xb256-coslr-100e_in1k.py)      | [model](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220913-591d4c4b.pth) \| [log](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220829_225552.log.json) |
+
+## Citation
+
+```bibtex
+@InProceedings{wei2022masked,
+  title={Masked Feature Prediction for Self-Supervised Visual Pre-Training},
+  author={Chen Wei and Haoqi Fan and Saining Xie and Chao-Yuan Wu and
+  Alan Yuille and Christoph Feichtenhofer},
+  booktitle={CVPR},
+  year={2022}
+}
+```
diff --git a/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py b/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,40 @@
+_base_ = [
+    '../_base_/models/maskfeat_vit-base-p16.py',
+    '../_base_/datasets/imagenet_maskfeat.py',
+    '../_base_/schedules/adamw_coslr-300e_in1k.py',
+    '../_base_/default_runtime.py',
+]
+
+# dataset
+data = dict(samples_per_gpu=256, workers_per_gpu=32)
+
+# optimizer
+optimizer = dict(
+    lr=2e-4 * 8,
+    betas=(0.9, 0.999),
+    weight_decay=0.05,
+    paramwise_options={
+        'ln': dict(weight_decay=0.),
+        'bias': dict(weight_decay=0.),
+    })
+optimizer_config = dict(grad_clip=dict(max_norm=0.02))
+
+# learning policy
+lr_config = dict(
+    policy='CosineAnnealing',
+    min_lr=1e-6,
+    warmup='linear',
+    warmup_iters=30,
+    warmup_ratio=1e-06,
+    warmup_by_epoch=True)
+
+# schedule
+runner = dict(max_epochs=300)
+
+# runtime
+checkpoint_config = dict(interval=1, max_keep_ckpts=3, out_dir='')
+persistent_workers = True
+log_config = dict(
+    interval=100, hooks=[
+        dict(type='TextLoggerHook'),
+    ])
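Review note on the schedule: the effective batch size is 8 GPUs x 256 = 2048 and the peak learning rate is `2e-4 * 8 = 1.6e-3`. The sketch below shows how cosine annealing combines with a 30-epoch linear warmup. It is a per-epoch simplification written from the config values above; the actual LR hooks update per iteration, and the warmup formula is an assumption modelled on the usual linear-warmup rule, not code taken from this commit.

```python
import math


def lr_at_epoch(epoch: float,
                base_lr: float = 2e-4 * 8,
                min_lr: float = 1e-6,
                max_epochs: int = 300,
                warmup_epochs: int = 30,
                warmup_ratio: float = 1e-6) -> float:
    """Approximate learning rate at a (possibly fractional) epoch."""
    # Cosine annealing from base_lr down to min_lr over the whole run.
    cosine = min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / max_epochs))
    if epoch < warmup_epochs:
        # Linear warmup: scale grows from warmup_ratio to 1 (assumed rule).
        k = (1 - epoch / warmup_epochs) * (1 - warmup_ratio)
        return cosine * (1 - k)
    return cosine


effective_batch_size = 8 * 256
lr_start = lr_at_epoch(0)
lr_after_warmup = lr_at_epoch(30)
lr_end = lr_at_epoch(300)
```

With these settings the run starts near zero, reaches close to the peak rate once warmup ends, and decays to `min_lr` at epoch 300.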
diff --git a/configs/selfsup/maskfeat/metafile.yaml b/configs/selfsup/maskfeat/metafile.yaml
@@ -0,0 +1,27 @@
+Collections:
+  - Name: MaskFeat
+    Metadata:
+      Training Data: ImageNet-1k
+      Training Techniques:
+        - AdamW
+      Training Resources: 8x A100-80G GPUs
+      Architecture:
+        - ViT
+    Paper:
+        URL: https://arxiv.org/abs/2112.09133v1
+        Title: "Masked Feature Prediction for Self-Supervised Visual Pre-Training"
+    README: configs/selfsup/maskfeat/README.md
+
+Models:
+  - Name: maskfeat_vit-base-p16_8xb256-coslr-300e_in1k
+    In Collection: MaskFeat
+    Metadata:
+      Epochs: 300
+      Batch Size: 2048
+    Results:
+      - Task: Self-Supervised Image Classification
+        Dataset: ImageNet-1k
+        Metrics:
+          Top 1 Accuracy: 83.5
+    Config: configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py
+    Weights: https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220913-591d4c4b.pth
diff --git a/docs/en/algorithms/maskfeat.md b/docs/en/algorithms/maskfeat.md
@@ -0,0 +1,34 @@
+# MaskFeat
+
+> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1)
+
+<!-- [ALGORITHM] -->
+
+## Abstract
+
+We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
+
+<div align="center">
+<img src="https://user-images.githubusercontent.com/48178838/190090285-428f07c0-0887-4ce8-b94f-f719cfd25622.png" width="60%"/>
+</div>
+
+## Models and Benchmarks
+
+Here, we report the results of the model pre-trained on ImageNet-1k
+for 300 epochs. The details are below:
+
+| Backbone | Pre-train epoch | Fine-tuning Top-1 |                                                            Pre-train Config                                                            |                                                                     Fine-tuning Config                                                                      |                                                                                                                            Download                                                                                                                             |
+| :------: | :-------------: | :---------------: | :------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| ViT-B/16 |       300       |       83.5        | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py) |     [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/benchmarks/classification/imagenet/vit-base-p16_ft-8xb256-coslr-100e_in1k.py)      | [model](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220913-591d4c4b.pth) \| [log](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220829_225552.log.json) |
+
+## Citation
+
+```bibtex
+@InProceedings{wei2022masked,
+  title={Masked Feature Prediction for Self-Supervised Visual Pre-Training},
+  author={Chen Wei and Haoqi Fan and Saining Xie and Chao-Yuan Wu and
+  Alan Yuille and Christoph Feichtenhofer},
+  booktitle={CVPR},
+  year={2022}
+}
+```
diff --git a/docs/en/model_zoo.md b/docs/en/model_zoo.md
@@ -26,6 +26,7 @@ All models and part of benchmark results are recorded below.
 | [MAE](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/README.md)                           | [mae_vit-base-p16_8xb512-coslr-400e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k.py)                                | [model](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k-224_20220223-85be947b.pth) \| [log](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-300e_in1k-224_20220210_140925.log.json)                       |
 | [SimMIM](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simmim/README.md)                     | [simmim_swin-base_16xb128-coslr-100e_in1k-192](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simmim/simmim_swin-base_16xb128-coslr-100e_in1k-192.py)                   | [model](https://download.openmmlab.com/mmselfsup/simmim/simmim_swin-base_16xb128-coslr-100e_in1k-192_20220316-1d090125.pth) \| [log](https://download.openmmlab.com/mmselfsup/simmim/simmim_swin-base_16xb128-coslr-100e_in1k-192_20220316-1d090125.log.json)             |
 | [CAE](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simmim/README.md)                        | [cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k.py)                      | [model](https://download.openmmlab.com/mmselfsup/cae/cae_vit-base-p16_16xb256-coslr-300e_in1k-224_20220427-4c786349.pth) \| [log](https://download.openmmlab.com/mmselfsup/cae/cae_vit-base-p16_16xb256-coslr-300e_in1k-224_20220427-4c786349.log.json)                   |
+| [MaskFeat](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/maskfeat/README.md)                 | [maskfeat_vit-base-p16_8xb256-coslr-300e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py)                 | [model](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220913-591d4c4b.pth) \| [log](https://download.openmmlab.com/mmselfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k_20220829_225552.log.json)           |
 
 Remarks:
 
@@ -63,11 +64,12 @@ If not specified, we use linear evaluation setting from [MoCo](http://openaccess
 
 ### ImageNet Fine-tuning
 
-| Algorithm | Config                                                                                                                                                                     | Remarks | Top-1 (%) |
-| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | --------- |
-| MAE       | [mae_vit-base-p16_8xb512-coslr-400e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k.py)              |         | 83.1      |
-| SimMIM    | [simmim_swin-base_16xb128-coslr-100e_in1k-192](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simmim/simmim_swin-base_16xb128-coslr-100e_in1k-192.py) |         | 82.9      |
-| CAE       | [cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k.py)    |         | 83.2      |
+| Algorithm | Config                                                                                                                                                                                                 | Remarks | Top-1 (%) |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------- | --------- |
+| MAE       | [mae_vit-base-p16_8xb512-coslr-400e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k.py)                                          |         | 83.1      |
+| SimMIM    | [simmim_swin-base_16xb128-coslr-100e_in1k-192](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simmim/simmim_swin-base_16xb128-coslr-100e_in1k-192.py)                             |         | 82.9      |
+| CAE       | [cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k.py)                                |         | 83.2      |
+| MaskFeat  | [maskfeat_vit-base-p16_8xb256-coslr-300e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/maskfeat/maskfeat_vit-base-p16_8xb256-coslr-300e_in1k.py)                           |         | 83.5      |
 
 ### COCO17 Object Detection and Instance Segmentation