BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities
This is the official PyTorch code for the paper:
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities
Shaozhe Hao¹, Xuantong Liu², Xianbiao Qi³*, Shihao Zhao¹, Bojia Zi⁴, Rong Xiao³, Kai Han¹†, Kwan-Yee K. Wong¹†
¹The University of Hong Kong  ²Hong Kong University of Science and Technology  ³Intellifusion  ⁴The Chinese University of Hong Kong
(*: Project lead; †: Corresponding authors)
ICLR 2025
[Project page] [arXiv] [Colab]
TL;DR: We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities.
You can set up the environment with the provided environment.yml file:
conda env create -f environment.yml
conda activate BiGR
Please first download the pretrained weights for tokenizers and BiGR models to run our tests.
We train the Binary Autoencoder (B-AE) by adapting the official code of Binary Latent Diffusion. We provide pretrained weights for different configurations.
256x256 resolution

| B-AE | Size | Checkpoint |
|---|---|---|
| d24 | 332M | download |
| d32 | 332M | download |
512x512 resolution

| B-AE | Size | Checkpoint |
|---|---|---|
| d32-512 | 315M | download |
We provide pretrained weights for BiGR models in various sizes.
256x256 resolution

| Model | B-AE | Size | Checkpoint |
|---|---|---|---|
| BiGR-L-d24 | d24 | 1.35G | download |
| BiGR-XL-d24 | d24 | 3.20G | download |
| BiGR-XXL-d24 | d24 | 5.92G | download |
| BiGR-XXL-d32 | d32 | 5.92G | download |
512x512 resolution

| Model | B-AE | Size | Checkpoint |
|---|---|---|---|
| BiGR-L-d32-res512 | d32-512 | 1.49G | download |
We provide the sample script for 256x256 image generation in script/sample.sh.
bash script/sample.sh
Please specify the code dimension $CODE, your B-AE checkpoint path $CKPT_BAE, and your BiGR checkpoint path $CKPT_BIGR.
You may also want to try different settings of the CFG scale $CFG, the number of sampling iterations $ITER, and the Gumbel temperature $GUMBEL. We recommend using a small Gumbel temperature for better visual quality (e.g., GUMBEL=0); you can increase the Gumbel temperature to enhance generation diversity.
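For example, a run might look like the following. The checkpoint paths and hyperparameter values here are illustrative assumptions, and we assume the script picks these up as environment variables; if script/sample.sh defines them internally, edit the script instead.

# Illustrative settings only; adjust paths to your downloaded checkpoints.
CODE=24 CKPT_BAE=./ckpts/bae_d24.pth CKPT_BIGR=./ckpts/bigr_L_d24.pth \
CFG=2.0 ITER=20 GUMBEL=0 bash script/sample.sh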
You can generate 512x512 images using script/sample_512.sh. Note that you need to specify the corresponding 512x512 tokenizers and models.
bash script/sample_512.sh
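For instance, with the d32-512 tokenizer and the BiGR-L-d32-res512 model (paths are illustrative, under the same environment-variable assumption as above):

# Illustrative paths for the 512x512 checkpoints.
CODE=32 CKPT_BAE=./ckpts/bae_d32_512.pth CKPT_BIGR=./ckpts/bigr_L_d32_512.pth bash script/sample_512.sh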
BiGR supports a range of zero-shot generalization applications, without task-specific structural changes or parameter fine-tuning.
You can easily download testing images and run our scripts to get started. Feel free to play with your own images.
bash script/app_inpaint.sh
bash script/app_outpaint.sh
You need to save the source image and the mask in the same folder, with the image as a *.JPEG file and the mask as a *.png file. You can then specify the source image path $IMG.
You can customize masks using this Gradio demo.
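For example, with a folder laid out as below (the folder and filenames are illustrative), and assuming the script reads $IMG from the environment:

# data/inpaint/dog.JPEG  <- source image
# data/inpaint/dog.png   <- mask
IMG=data/inpaint/dog.JPEG bash script/app_inpaint.sh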
bash script/app_edit.sh
In addition to the source image path $IMG, you also need to give a class index $CLS for editing.
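For instance (illustrative path; class index 207 is golden retriever in ImageNet-1K, under the same environment-variable assumption as above):

IMG=data/edit/dog.JPEG CLS=207 bash script/app_edit.sh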
bash script/app_interpolate.sh
You need to specify two class indices $CLS1 and $CLS2.
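For example (the class indices are illustrative, same environment-variable assumption):

CLS1=207 CLS2=270 bash script/app_interpolate.sh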
bash script/app_enrich.sh
You need to specify the source image path $IMG.
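For example (illustrative path):

IMG=data/enrich/cat.JPEG bash script/app_enrich.sh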
You can train BiGR yourself by running:
bash script/train.sh
You need to specify the ImageNet-1K dataset path via --data-path.
We train L/XL-sized models using 8 A800 GPUs and XXL-sized models using 32 A800 GPUs on 4 nodes.
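As a sketch, and assuming script/train.sh forwards extra arguments to the underlying training launcher (check the script for its actual interface), a run could look like:

# Illustrative dataset location; point this at your ImageNet-1K folder.
bash script/train.sh --data-path /path/to/imagenet/train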
This project builds on Diffusion Transformer, Binary Latent Diffusion, and LlamaGen. We thank these great works!
If you use this code in your research, please consider citing our paper:
@misc{hao2024bigr,
title={Bi{GR}: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities},
author={Shaozhe Hao and Xuantong Liu and Xianbiao Qi and Shihao Zhao and Bojia Zi and Rong Xiao and Kai Han and Kwan-Yee~K. Wong},
year={2024},
}