This repository contains the source code of CLIP-KD: An Empirical Study of CLIP Model Distillation (CVPR 2024).
Install the training and testing dependencies:

```bash
pip install -r requirements-training.txt
pip install -r requirements-test.txt
```
OpenCLIP reads a CSV file with two columns: a path to an image and a text caption. The column names are passed as arguments to main.py.
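For illustration, a training launch might name the columns as shown below. This is a minimal sketch: `--train-data`, `--csv-img-key`, and `--csv-caption-key` are standard open_clip arguments, while the script path, batch size, and data path are placeholders for your setup.

```bash
# Illustrative only: tell main.py which CSV columns hold image paths and captions.
# Adjust --csv-separator as well if your generated CSV uses a different separator.
python src/training/main.py \
    --train-data /path/to/cc3m_train.csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256
```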
The script `src/data/gather_cc.py` will collect the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs, and then run the script from our repository. For easy notation, we rename `Train_GCC-training` as `cc3m_train` and `Validation_GCC-1.1.0-Validation` as `cc3m_val`.

```bash
python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]
```
Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.
The generated `cc3m_train.csv` looks like:

```
title     filepath
XXXXXX    train/X/X.jpg
...       ...
```

The generated `cc3m_val.csv` looks like:

```
title     filepath
XXXXXX    val/X/X.jpg
...       ...
```
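As an optional sanity check (not part of the original pipeline), you can verify the generated CSVs before training:

```bash
# Optional: confirm the CSVs are non-empty and have the expected header.
wc -l cc3m_train.csv cc3m_val.csv
head -n 3 cc3m_train.csv
```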
The script `src/data/gather_cc12m.py` will collect the Conceptual 12M images. First, download the Conceptual 12M URLs, and then run the script from our repository:

```bash
python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]
```

The generated `cc12m.csv` looks like:

```
title     filepath
XXXXXX    train/X/X.jpg
...       ...
```
Our downloaded CC12M training set contains 9.97M images.
The teacher is pretrained on CC3M+12M, and students are distilled on CC3M+12M. The table below compares individual distillation objectives added to the baseline student:

Role | Network | Method | ImageNet Acc (%) | Train script |
---|---|---|---|---|
Teacher | ViT-B/16 | - | 36.99 | sh |
Student | ViT-T/16 | Baseline | 30.55 | sh |
Student | ViT-T/16 | +CRD | 31.94 | sh |
Student | ViT-T/16 | +FD | 34.23 | sh |
Student | ViT-T/16 | +MFD | 34.09 | sh |
Student | ViT-T/16 | +GD | 31.54 | sh |
Student | ViT-T/16 | +ICL | 33.11 | sh |
Student | ViT-T/16 | +AFD | 31.42 | sh |
The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.
Role | Network | Method | ImageNet Acc (%) | Train script | Download |
---|---|---|---|---|---|
Teacher | ViT-B/16 | - | 36.99 | sh | model \| log |
Student | ViT-T/16 | Baseline | 30.55 | sh | model \| log |
Student | ViT-T/16 | CLIP-KD | 34.90 | sh | model \| log |
Student | MobileViT-S | Baseline | 32.60 | sh | model \| log |
Student | MobileViT-S | CLIP-KD | 35.96 | sh | model \| log |
Student | Swin-T | Baseline | 36.38 | sh | model \| log |
Student | Swin-T | CLIP-KD | 40.18 | sh | model \| log |
Student | MobileNetV3 | Baseline | 25.11 | sh | model \| log |
Student | MobileNetV3 | CLIP-KD | 26.95 | sh | model \| log |
Student | EfficientNet-B0 | Baseline | 32.55 | sh | model \| log |
Student | EfficientNet-B0 | CLIP-KD | 35.44 | sh | model \| log |
Student | ResNet-18 | Baseline | 28.55 | sh | model \| log |
Student | ResNet-18 | CLIP-KD | 31.36 | sh | model \| log |
The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.
Role | Network | Method | ImageNet Acc (%) | Train script | Download |
---|---|---|---|---|---|
Teacher | ResNet-101 | - | 36.76 | sh | model \| log |
Student | MobileViT-S | Baseline | 32.60 | sh | model \| log |
Student | MobileViT-S | CLIP-KD | 34.97 | sh | model \| log |
Student | Swin-T | Baseline | 36.38 | sh | model \| log |
Student | Swin-T | CLIP-KD | 39.51 | sh | model \| log |
Student | MobileNetV3 | Baseline | 25.11 | sh | model \| log |
Student | MobileNetV3 | CLIP-KD | 26.15 | sh | model \| log |
Student | EfficientNet-B0 | Baseline | 32.55 | sh | model \| log |
Student | EfficientNet-B0 | CLIP-KD | 34.64 | sh | model \| log |
Student | ResNet-18 | Baseline | 28.55 | sh | model \| log |
Student | ResNet-18 | CLIP-KD | 30.88 | sh | model \| log |
The teacher is pretrained on LAION-400M. Students are distilled on CC3M+12M.
Role | Network | Method | ImageNet Acc (%) | Train script | Download |
---|---|---|---|---|---|
Teacher | ViT-L/14 | - | 72.8 | - | model |
Student | ViT-B/16 | Baseline | 37.0 | sh | model \| log |
Student | ViT-B/16 | CLIP-KD | 57.5 | sh | model \| log |
Student | ViT-T/16 | Baseline | 30.6 | sh | model \| log |
Student | ViT-T/16 | CLIP-KD | 40.9 | sh | model \| log |
The teacher is pretrained on LAION-400M. Students are distilled on CC3M+12M.

Role | Network | Method | ImageNet Acc (%) | Train script | Download |
---|---|---|---|---|---|
Teacher | ViT-B/16 | - | 67.1 | - | model |
Student | ViT-T/16 | Baseline | 30.6 | sh | model \| log |
Student | ViT-T/16 | CLIP-KD | 42.6 | sh | model \| log |
Student | ResNet-50 | Baseline | 35.3 | sh | model \| log |
Student | ResNet-50 | CLIP-KD | 55.4 | sh | model \| log |
Evaluate a pretrained model on MSCOCO and Flickr cross-modal retrieval, and on classification over ImageNet variants (ImageNet-V2, ImageNet-Rendition, and ImageNet-Sketch). Please refer to eval_coco.sh and eval_flickr.sh.
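For orientation, zero-shot ImageNet evaluation in open_clip-based codebases is typically launched as sketched below. This is illustrative only: `--resume` and `--imagenet-val` are standard open_clip flags, while the script path, model name, and data paths are placeholders; eval_coco.sh and eval_flickr.sh remain the authoritative entry points for retrieval.

```bash
# Illustrative zero-shot ImageNet evaluation using open_clip-style flags.
# The model name and paths are placeholders for your setup.
python src/training/main.py \
    --model ViT-B-16 \
    --resume /path/to/distilled_checkpoint.pt \
    --imagenet-val /path/to/imagenet/val
```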
Our codebase is built on open_clip, an open-source codebase for running CLIP models.
If you find our paper and repo helpful, please consider citing our work!
```bibtex
@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```