MVD Model Zoo

Kinetics-400

| Method | Backbone | Teacher | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MVD | ViT-S | ViT-B | 400 | 16x5x3 | script/checkpoint | script | 80.6 | 94.7 |
| MVD | ViT-S | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 81.0 | 94.8 |
| MVD | ViT-B | ViT-B | 400 | 16x5x3 | script/checkpoint | script | 82.7 | 95.4 |
| MVD | ViT-B | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 83.4 | 95.8 |
| MVD | ViT-L | ViT-L | 400 | 16x5x3 | script/checkpoint | script | 86.0 | 96.9 |
| MVD | ViT-L | ViT-L | 800 | 16x5x3 | script | script | 86.4 | 97.0 |
| MVD | ViT-H | ViT-H | 800 | 16x5x3 | script | script | 87.3 | 97.4 |

Something-Something V2

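| Method | Backbone | Teacher | Epoch | #Frame | Fine-tune | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MVD | ViT-S | ViT-B | 400 | 16x2x3 | script | 70.7 | 92.6 |
| MVD | ViT-S | ViT-L | 400 | 16x2x3 | script | 70.9 | 92.8 |
| MVD | ViT-B | ViT-B | 400 | 16x2x3 | script | 72.5 | 93.6 |
| MVD | ViT-B | ViT-L | 400 | 16x2x3 | script | 73.7 | 94.0 |
| MVD | ViT-L | ViT-L | 400 | 16x2x3 | script | 76.1 | 95.4 |
| MVD | ViT-L | ViT-L | 800 | 16x2x3 | script | 76.7 | 95.5 |
| MVD | ViT-H | ViT-H | 800 | 16x2x3 | script | 77.3 | 95.7 |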

Note:

  • We report results for MVD fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
  • #Frame = #input_frame x #clip x #crop.
  • #input_frame is the number of frames fed to the model at test time.
  • #crop means spatial crops (e.g., 3 for left/right/center crop).
  • #clip means temporal clips (e.g., 5 means sampling five temporal clips with different start indices).
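
As a quick illustration of the #Frame convention in the notes above, the sketch below splits a value such as `16x5x3` into its three factors and derives the number of test views (clips times crops) and the total frames processed per video. The names `FrameSpec` and `parse_frame_spec` are hypothetical helpers written for this example; they are not part of the MVD codebase.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for a '#input_frame x #clip x #crop' setting."""

    input_frames: int  # frames fed to the model at test time
    clips: int         # temporal clips with different start indices
    crops: int         # spatial crops, e.g. left/center/right

    @property
    def views(self) -> int:
        # Each (clip, crop) pair is one forward pass over the video.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Total frames processed for one video across all views.
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string like '16x5x3' into its three integer factors."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


k400 = parse_frame_spec("16x5x3")  # Kinetics-400 setting from the table
ssv2 = parse_frame_spec("16x2x3")  # Something-Something V2 setting
print(k400.views, k400.total_frames)  # 15 240
print(ssv2.views, ssv2.total_frames)  # 6 96
```

Under this reading, the Kinetics-400 entries use 15 views per video and the Something-Something V2 entries use 6.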