Method | Backbone | Teacher | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
MVD | ViT-S | ViT-B | 400 | 16x5x3 | script/checkpoint | script | 80.6 | 94.7 |
MVD | ViT-S | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 81.0 | 94.8 |
MVD | ViT-B | ViT-B | 400 | 16x5x3 | script/checkpoint | script | 82.7 | 95.4 |
MVD | ViT-B | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 83.4 | 95.8 |
MVD | ViT-L | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 86.0 | 96.9 |
MVD | ViT-L | ViT-L | 800 | 16x5x3 | script | script | 86.4 | 97.0 |
MVD | ViT-H | ViT-H | 800 | 16x5x3 | script | script | 87.3 | 97.4 |

Method | Backbone | Teacher | Epoch | #Frame | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|
MVD | ViT-S | ViT-B | 400 | 16x2x3 | script | 70.7 | 92.6 |
MVD | ViT-S | ViT-L | 400 | 16x2x3 | script | 70.9 | 92.8 |
MVD | ViT-B | ViT-B | 400 | 16x2x3 | script | 72.5 | 93.6 |
MVD | ViT-B | ViT-L | 400 | 16x2x3 | script | 73.7 | 94.0 |
MVD | ViT-L | ViT-L | 400 | 16x2x3 | script | 76.1 | 95.4 |
MVD | ViT-L | ViT-L | 800 | 16x2x3 | script | 76.7 | 95.5 |
MVD | ViT-H | ViT-H | 800 | 16x2x3 | script | 77.3 | 95.7 |
- We report the results of MVD finetuned with I3D dense sampling on Kinetics400 and TSN uniform sampling on Something-Something V2, respectively.
- #Frame = #input_frame x #clip x #crop.
- #input_frame is the number of frames fed to the model during the test phase.
- #crop is the number of spatial crops (e.g., 3 for left/right/center crop).
- #clip is the number of temporal clips (e.g., 5 means sampling five temporal clips with different start indices).
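
As a quick illustration of the #Frame notation above, the sketch below splits a value such as `16x5x3` into its three factors. The names `FrameSpec` and `parse_frame_spec` are hypothetical helpers written for this note and are not part of the MVD scripts. It also assumes, as is common for multi-view testing, that each clip/crop pair is evaluated as one separate view.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for a '#Frame' table entry."""
    input_frames: int  # frames fed to the model per view
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Assumption: one view per (clip, crop) pair.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # #Frame = #input_frame x #clip x #crop
        return self.input_frames * self.clips * self.crops


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string like '16x5x3' into (input_frames, clips, crops)."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(frames, clips, crops)


k400 = parse_frame_spec("16x5x3")   # setting used in the first table
ssv2 = parse_frame_spec("16x2x3")   # setting used in the second table
print(k400.views, k400.total_frames)
print(ssv2.views, ssv2.total_frames)
```

Under these assumptions, `16x5x3` corresponds to 15 views of 16 frames each, and `16x2x3` to 6 views of 16 frames each.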