MVD Model Zoo

Kinetics-400

Method	Backbone	Teacher	Epoch	#Frame	Pre-train	Fine-tune	Top-1	Top-5
MVD	ViT-S	ViT-B	400	16x5x3	script/checkpoint	script	80.6	94.7
MVD	ViT-S	ViT-L	400	16x5x3	script/checkpoint	script	81.0	94.8
MVD	ViT-B	ViT-B	400	16x5x3	script/checkpoint	script	82.7	95.4
MVD	ViT-B	ViT-L	400	16x5x3	script/checkpoint	script	83.4	95.8
MVD	ViT-L	ViT-L	400	16x5x3	script/checkpoint	script	86.0	96.9
MVD	ViT-L	ViT-L	800	16x5x3	script	script	86.4	97.0
MVD	ViT-H	ViT-H	800	16x5x3	script	script	87.3	97.4

Something-Something V2

Method	Backbone	Teacher	Epoch	#Frame	Fine-tune	Top-1	Top-5
MVD	ViT-S	ViT-B	400	16x2x3	script	70.7	92.6
MVD	ViT-S	ViT-L	400	16x2x3	script	70.9	92.8
MVD	ViT-B	ViT-B	400	16x2x3	script	72.5	93.6
MVD	ViT-B	ViT-L	400	16x2x3	script	73.7	94.0
MVD	ViT-L	ViT-L	400	16x2x3	script	76.1	95.4
MVD	ViT-L	ViT-L	800	16x2x3	script	76.7	95.5
MVD	ViT-H	ViT-H	800	16x2x3	script	77.3	95.7

We report results for MVD fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
#Frame = #input_frame x #clip x #crop.
#input_frame is the number of frames fed to the model at test time.
#crop is the number of spatial crops (e.g., 3 for left/right/center crops).
#clip is the number of temporal clips (e.g., 5 means five clips sampled with different start indices).
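
To make the #Frame notation concrete, here is a minimal Python sketch that parses a value such as 16x5x3 and derives how many views (clip/crop combinations) and how many frames in total are evaluated per video. The names `FrameSpec` and `parse_frame_spec` are hypothetical helpers for illustration and are not part of the MVD code base; "views" is shorthand introduced here for #clip x #crop.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for a '#input_frame x #clip x #crop' entry."""
    input_frames: int  # frames fed to the model for one view
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Each (clip, crop) pair is one forward pass at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Total frames processed for one video across all views.
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string such as '16x5x3' into its three factors."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    return FrameSpec(*(int(p) for p in parts))


k400 = parse_frame_spec("16x5x3")  # setting listed in the first table
ssv2 = parse_frame_spec("16x2x3")  # setting listed in the second table
print(k400.views, k400.total_frames)  # 15 240
print(ssv2.views, ssv2.total_frames)  # 6 96
```

Under this reading, the 16x5x3 setting costs 15 forward passes per video and the 16x2x3 setting costs 6. How the per-view predictions are combined into a final score is not stated here.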
