Image Captioning and Attention Visualization Image captioning with pretrained DeiT v3 as encoder on a subset of MSCOCO dataset CIDEr score: 0.9413 CLIP score: 0.7310 Attention map visualization for image captioning: See problem 2 & 3 in Report.pdf and Spec.pdf more details.