This repository contains the code for the CVPR 2024 paper Doubly Abductive Counterfactual Inference for Text-based Image Editing.
First, clone the repository:
git clone https://github.com/xuesong39/DAC
Then, install the dependencies in a new virtual environment:
cd DAC
git clone https://github.com/huggingface/diffusers -b v0.24.0
cd diffusers
pip install -e .
Finally, return to the main DAC folder (cd .. from DAC/diffusers) and run:
pip install -r requirements.txt
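Since the steps above pin Diffusers to the v0.24.0 tag, it can help to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version_matches is hypothetical, and it assumes the editable install reports its version as "0.24.0".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_matches(package, expected):
    """Return True only if `package` is installed and reports `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed in the active environment.
        return False


if __name__ == "__main__":
    # Assumption: the v0.24.0 tag of diffusers reports version "0.24.0".
    if installed_version_matches("diffusers", "0.24.0"):
        print("diffusers 0.24.0 found")
    else:
        print("diffusers 0.24.0 not found; re-run the install steps")
```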
The images and annotations we use in the paper can be found here.
For the format of the data used in the experiments, we provide some examples in the folder DAC/data. For example, for the image DAC/data/cat/train/cat.jpeg, the folder containing the source prompt is DAC/data/cat/, while the folder containing the target prompt is DAC/data/cat-cap/.
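The training scripts take --train_data_dir together with --caption_column="text", which matches the Hugging Face imagefolder convention of a metadata.jsonl file with file_name and text fields stored next to the image. Assuming the example folders follow that convention (please check DAC/data for the actual files), a minimal sketch for preparing a folder for your own image might look like this. The helper name write_caption_metadata is hypothetical.

```python
import json
from pathlib import Path


def write_caption_metadata(data_dir, image_name, prompt, split="train"):
    """Write a one-line metadata.jsonl into <data_dir>/<split>/.

    Assumption: captions are read from a metadata.jsonl file whose
    "text" field is the prompt (the imagefolder convention).
    """
    split_dir = Path(data_dir) / split
    split_dir.mkdir(parents=True, exist_ok=True)
    # file_name is relative to the folder that holds metadata.jsonl.
    record = {"file_name": image_name, "text": prompt}
    metadata_path = split_dir / "metadata.jsonl"
    metadata_path.write_text(json.dumps(record) + "\n", encoding="utf-8")
    return metadata_path
```

For instance, write_caption_metadata("data/cat", "cat.jpeg", "A cat.") would describe the source prompt, and the same call with "data/cat-cap" and "A cat wearing a wool cap." would describe the target prompt. The image file itself still has to be copied into the train subfolder.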
The fine-tuning script for abduction on U is train_text_to_image_lora.sh, as follows:
export MODEL_NAME="stabilityai/stable-diffusion-2-1-base"
export TRAIN_DIR="ORIGIN_DATA_PATH"
CUDA_VISIBLE_DEVICES=0 accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$TRAIN_DIR --caption_column="text" \
--resolution=512 \
--train_batch_size=1 \
--num_train_epochs=1000 --checkpointing_steps=1000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--rank=512 \
--output_dir="U_PATH" \
--validation_prompt="xxx" \
--report_to="wandb" \
--validation_epochs=500
Please specify TRAIN_DIR (e.g., "./data/cat/"), --output_dir (e.g., "ckpt/cat"), and --validation_prompt (e.g., "A cat.").
The fine-tuning script for abduction on Δ is train_text_to_image_lora_t.sh, as follows:
export MODEL_NAME="stabilityai/stable-diffusion-2-1-base"
export TRAIN_DIR="TARGET_DATA_PATH"
CUDA_VISIBLE_DEVICES=0 accelerate launch train_text_to_image_lora_t.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--unet_lora_path="U_PATH" \
--train_data_dir=$TRAIN_DIR --caption_column="text" \
--resolution=512 --train_text_encoder \
--train_batch_size=1 \
--num_train_epochs=1000 --checkpointing_steps=1000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--annealing=0.8 \
--output_dir="DELTA_PATH" \
--report_to="wandb" \
--validation_epochs=500
Please specify TRAIN_DIR (e.g., "./data/cat-cap/"), --unet_lora_path (e.g., "ckpt/cat"), and --output_dir (e.g., "ckpt/cat-cap-annealing0.8"). You can also adjust the --annealing hyperparameter (0.8 in this example).
The inference script is inference_t.sh, as follows:
CUDA_VISIBLE_DEVICES=0 python inference_t.py \
--annealing=0.8 \
--unet_path="U_PATH" \
--text_path="DELTA_PATH" \
--target_prompt="xxx" \
--save_path="./"
Please specify --unet_path (e.g., "ckpt/cat"), --text_path (e.g., "ckpt/cat-cap-annealing0.8"), and --target_prompt (e.g., "A cat wearing a wool cap.").
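The three steps above (abduction on U, abduction on Δ, and inference) pass paths from one stage to the next: the --output_dir of the first step becomes --unet_lora_path and --unet_path later, and the --output_dir of the second step becomes --text_path. The sketch below only assembles the three command lines from the flags shown in the scripts above so that the paths stay consistent; it does not run anything. The function name build_commands and the constant SHARED_TRAIN_FLAGS are hypothetical and not part of the repository.

```python
import shlex

MODEL_NAME = "stabilityai/stable-diffusion-2-1-base"

# Flags that appear in both fine-tuning scripts, copied from the README.
SHARED_TRAIN_FLAGS = [
    "--caption_column=text",
    "--resolution=512",
    "--train_batch_size=1",
    "--num_train_epochs=1000",
    "--checkpointing_steps=1000",
    "--learning_rate=1e-04",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--seed=42",
    "--report_to=wandb",
    "--validation_epochs=500",
]


def build_commands(source_dir, target_dir, u_path, delta_path,
                   source_prompt, target_prompt, annealing=0.8):
    """Return the argument lists for the three stages, in order."""
    launch = ["accelerate", "launch"]
    model = f"--pretrained_model_name_or_path={MODEL_NAME}"

    # Stage 1: abduction on U, trained on the source prompt.
    abduct_u = launch + [
        "train_text_to_image_lora.py", model,
        f"--train_data_dir={source_dir}",
        *SHARED_TRAIN_FLAGS,
        "--rank=512",
        f"--output_dir={u_path}",
        f"--validation_prompt={source_prompt}",
    ]
    # Stage 2: abduction on Δ, trained on the target prompt and
    # loading the U checkpoint written by stage 1.
    abduct_delta = launch + [
        "train_text_to_image_lora_t.py", model,
        f"--unet_lora_path={u_path}",
        f"--train_data_dir={target_dir}",
        "--train_text_encoder",
        *SHARED_TRAIN_FLAGS,
        f"--annealing={annealing}",
        f"--output_dir={delta_path}",
    ]
    # Stage 3: inference with both checkpoints.
    inference = [
        "python", "inference_t.py",
        f"--annealing={annealing}",
        f"--unet_path={u_path}",
        f"--text_path={delta_path}",
        f"--target_prompt={target_prompt}",
        "--save_path=./",
    ]
    return [abduct_u, abduct_delta, inference]


if __name__ == "__main__":
    commands = build_commands(
        "./data/cat/", "./data/cat-cap/",
        "ckpt/cat", "ckpt/cat-cap-annealing0.8",
        "A cat.", "A cat wearing a wool cap.",
    )
    for cmd in commands:
        # Print a copy-pasteable shell line; prefix it with
        # CUDA_VISIBLE_DEVICES=0 as in the scripts if needed.
        print(shlex.join(cmd))
```

Each returned list could also be passed to subprocess.run, but printing the commands first makes it easier to compare them against the provided .sh files.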
This part contains the implementation described in the ablation analysis section of the paper, i.e., the ablation on Abduction-1. Another exogenous variable, T, can be incorporated into Abduction-1 to further improve fidelity.
The fine-tuning script for abduction on U is the same as above.
The fine-tuning script for abduction on T is train_text_to_image_lora_t.sh, as follows:
export MODEL_NAME="stabilityai/stable-diffusion-2-1-base"
export TRAIN_DIR="ORIGIN_DATA_PATH"
CUDA_VISIBLE_DEVICES=0 accelerate launch train_text_to_image_lora_t.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--unet_lora_path="U_PATH" \
--train_data_dir=$TRAIN_DIR --caption_column="text" \
--resolution=512 --train_text_encoder \
--train_batch_size=1 \
--num_train_epochs=1000 --checkpointing_steps=1000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--annealing=0.8 \
--output_dir="T_PATH" \
--report_to="wandb" \
--validation_epochs=500
Please specify TRAIN_DIR (e.g., "./data/cat/"), --unet_lora_path (e.g., "ckpt/cat"), and --output_dir (e.g., "ckpt/cat-annealing0.8").
The fine-tuning script for abduction on Δ is train_text_to_image_lora_t2.sh, as follows:
export MODEL_NAME="stabilityai/stable-diffusion-2-1-base"
export TRAIN_DIR="TARGET_DATA_PATH"
CUDA_VISIBLE_DEVICES=0 accelerate launch train_text_to_image_lora_t2.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--unet_lora_path="U_PATH" \
--text_lora1_path="T_PATH" \
--train_data_dir=$TRAIN_DIR --caption_column="text" \
--resolution=512 --train_text_encoder \
--train_batch_size=1 \
--num_train_epochs=1000 --checkpointing_steps=1000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--annealing=0.8 \
--output_dir="DELTA_PATH" \
--report_to="wandb" \
--validation_epochs=500
Please specify TRAIN_DIR (e.g., "./data/cat-cap/"), --unet_lora_path (e.g., "ckpt/cat"), --text_lora1_path (e.g., "ckpt/cat-annealing0.8"), and --output_dir (e.g., "ckpt/cat-cap-annealing0.8-t2").
The inference script is inference_t2.sh, as follows:
CUDA_VISIBLE_DEVICES=0 python inference_t2.py \
--annealing=0.8 \
--unet_path="U_PATH" \
--text1_path="T_PATH" \
--text2_path="DELTA_PATH" \
--target_prompt="xxx" \
--save_path="./"
Please specify --unet_path (e.g., "ckpt/cat"), --text1_path (e.g., "ckpt/cat-annealing0.8"), --text2_path (e.g., "ckpt/cat-cap-annealing0.8-t2"), and --target_prompt (e.g., "A cat wearing a wool cap.").
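In the ablation variant, three checkpoints are involved: U, T, and Δ. The example paths above follow a simple naming pattern based on the data folder name and the annealing value. The sketch below reproduces that pattern and maps the three paths onto the flags of inference_t2.py. The function names and the dictionary keys are hypothetical, and the naming pattern is just the one used in the examples, not a requirement of the scripts.

```python
from pathlib import PurePosixPath


def ablation_checkpoint_paths(source_name, target_name, annealing, root="ckpt"):
    """Derive U, T, and Δ checkpoint folders following the README examples."""
    root = PurePosixPath(root)
    tag = f"annealing{annealing}"
    return {
        "U_PATH": str(root / source_name),
        "T_PATH": str(root / f"{source_name}-{tag}"),
        "DELTA_PATH": str(root / f"{target_name}-{tag}-t2"),
    }


def inference_t2_args(paths, target_prompt, annealing, save_path="./"):
    """Assemble the inference_t2.py command line from the three paths."""
    return [
        "python", "inference_t2.py",
        f"--annealing={annealing}",
        f"--unet_path={paths['U_PATH']}",
        f"--text1_path={paths['T_PATH']}",
        f"--text2_path={paths['DELTA_PATH']}",
        f"--target_prompt={target_prompt}",
        f"--save_path={save_path}",
    ]


if __name__ == "__main__":
    paths = ablation_checkpoint_paths("cat", "cat-cap", 0.8)
    print(paths)
    print(inference_t2_args(paths, "A cat wearing a wool cap.", 0.8))
```

Keeping the annealing value in the folder name, as the examples do, makes it easier to pass the same value to both fine-tuning and inference.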
We provide some checkpoints below:
Image | Abduction-1 | Abduction-2 |
---|---|---|
DAC/data/cat | U | Δ |
DAC/data/glass | U | Δ |
DAC/data/black | U | Δ |
DAC/data/cat | U, T | Δ |
DAC/data/glass | U, T | Δ |
DAC/data/black | U, T | Δ |
This code builds on the following codebases: Diffusers and PEFT. Many thanks to their authors!