
Issue running Stable Diffusion DreamBooth on Mac M3 Max (Apple silicon) #7498

Closed
sagargulabani opened this issue Mar 27, 2024 · 35 comments
Labels
bug (Something isn't working) · stale (Issues that haven't received updates)

Comments

@sagargulabani

Describe the bug

I am trying to run DreamBooth Stable Diffusion training on an M3 Max.
However, whenever I try to generate the class images for the concepts, it fails.

Reproduction

To reproduce the error, set up the Dreambooth extension on an M3 Max (Apple silicon).
Then try to generate class images. It will fail.

As per this issue, someone suggested that we open an issue in this repository.

Please help us. Thank you.

Logs

400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 735, in start_training
    result = main(class_gen_method=class_gen_method)
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 2003, in main
    return inner_loop()
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 126, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 380, in inner_loop
    count, instance_prompts, class_prompts = generate_classifiers(
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/utils/gen_utils.py", line 211, in generate_classifiers
    new_images = builder.generate_images(prompts, pbar)
  File "/Users/sagargulabani/dev/automatic1111/stable-diffusion-webui/extensions/sd_dreambooth_extension/helpers/image_builder.py", line 235, in generate_images
    with self.accelerator.autocast(), torch.inference_mode():
  File "/opt/anaconda3/envs/automatic1111/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/anaconda3/envs/automatic1111/lib/python3.10/site-packages/accelerate/accelerator.py", line 2907, in autocast
    autocast_context = get_mixed_precision_context_manager(self.native_amp, cache_enabled=cache_enabled)
  File "/opt/anaconda3/envs/automatic1111/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1372, in get_mixed_precision_context_manager
    return torch.autocast(device_type=state.device.type, dtype=torch.float16, cache_enabled=cache_enabled)
  File "/opt/anaconda3/envs/automatic1111/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'
Generating class images 0/1400::   0%|

System Info

Apple M3 Max 30 CPU 40 GPU, 16 inch, 48 GB of RAM.
Python version - 3.10.14
diffusers - 0.27.2
transformers - 4.30.2
torch - 2.1.0

Who can help?

@sayakpaul

@sagargulabani sagargulabani added the bug Something isn't working label Mar 27, 2024
@tolgacangoz
Contributor

tolgacangoz commented Mar 27, 2024

Hi @sagargulabani,
Isn't this issue related to Stable Diffusion web UI's sd_dreambooth_extension?
Did/Could you try diffusers' DreamBooth? Also, see the mps-related page.
But I guess autocast is not supported yet on mps. They started a PR, but unfortunately it seems to have been abandoned 😞. Nevertheless, I guess there is an ongoing PR here that may be a solution.
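A quick way to confirm this limitation on your reported torch 2.1.0 is a tiny check like the one below; it should raise the same RuntimeError you see in the log:

    import torch

    print(torch.backends.mps.is_available())  # True on Apple silicon builds of torch
    # On torch 2.1, autocast rejects the 'mps' device type at construction time.
    try:
        with torch.autocast(device_type="mps", dtype=torch.float16):
            pass
    except RuntimeError as err:
        print(err)  # User specified an unsupported autocast device_type 'mps'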

@sagargulabani
Author

Yes, that is true. The issue is related to the web UI.
I tried running DreamBooth SDXL training locally and I am running into the following error:

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'rescale_betas_zero_snr', 'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'dropout', 'attention_type'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/Users/sagargulabani/.cache/huggingface/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1964, in <module>
    main(args)
  File "/Users/sagargulabani/.cache/huggingface/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1167, in main
    unet_lora_config = LoraConfig(
TypeError: LoraConfig.__init__() got an unexpected keyword argument 'use_dora'
Traceback (most recent call last):
  File "/opt/anaconda3/envs/hf/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/envs/hf/bin/python', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=dog', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--output_dir=lora-trained-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=500', '--validation_prompt=A photo of sks dog in a bucket', '--validation_epochs=25', '--seed=0', '--push_to_hub']' returned non-zero exit status 1.

My peft version is 0.7.0

and this is the command I run:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"


accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

@sagargulabani
Author

I did run it by removing the use_dora flag from the script (here) @linoytsaban.

After that I ran into the following issue:

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'dynamic_thresholding_ratio', 'thresholding', 'clip_sample_range', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/Users/sagargulabani/.cache/huggingface/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1963, in <module>
    main(args)
  File "/Users/sagargulabani/.cache/huggingface/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1503, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1263, in prepare
    result = tuple(
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1264, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1140, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
    autocast_context = get_mixed_precision_context_manager(self.native_amp, self.autocast_handler)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1745, in get_mixed_precision_context_manager
    return torch.autocast(device_type=device_type, dtype=torch.float16, **autocast_kwargs)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'
Traceback (most recent call last):
  File "/opt/anaconda3/envs/hf/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/envs/hf/bin/python', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=dog', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--output_dir=lora-trained-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=500', '--validation_prompt=A photo of sks dog in a bucket', '--validation_epochs=25', '--seed=0', '--push_to_hub']' returned non-zero exit status 1.

@sayakpaul
Member

You should remove --mixed_precision="fp16" when using M3. Cc: @bghira

@sayakpaul
Member

And yes #7447 should be helpful.

@sagargulabani
Author

Hi @sayakpaul,

I did remove that and ran it, but it looks like the code gets stuck:

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'dynamic_thresholding_ratio', 'thresholding', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
03/28/2024 20:06:58 - INFO - __main__ - ***** Running training *****
03/28/2024 20:06:58 - INFO - __main__ -   Num examples = 5
03/28/2024 20:06:58 - INFO - __main__ -   Num batches each epoch = 5
03/28/2024 20:06:58 - INFO - __main__ -   Num Epochs = 250
03/28/2024 20:06:58 - INFO - __main__ -   Instantaneous batch size per device = 1
03/28/2024 20:06:58 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
03/28/2024 20:06:58 - INFO - __main__ -   Gradient Accumulation steps = 4
03/28/2024 20:06:58 - INFO - __main__ -   Total optimization steps = 500
Steps:   0%|                                                                                                                            | 0/500 [00:00<?, ?it/s]

It's not progressing beyond this.
I am using an M3 Max with 48 GB of RAM.

Also, I had to remove the use_dora flag from here to run the script.

@sagargulabani
Author

So I figured out that it is moving, but it is extremely slow.

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
03/28/2024 20:06:39 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: mps

Mixed precision type: no

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'dynamic_thresholding_ratio', 'thresholding', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
03/28/2024 20:06:58 - INFO - __main__ - ***** Running training *****
03/28/2024 20:06:58 - INFO - __main__ -   Num examples = 5
03/28/2024 20:06:58 - INFO - __main__ -   Num batches each epoch = 5
03/28/2024 20:06:58 - INFO - __main__ -   Num Epochs = 250
03/28/2024 20:06:58 - INFO - __main__ -   Instantaneous batch size per device = 1
03/28/2024 20:06:58 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
03/28/2024 20:06:58 - INFO - __main__ -   Gradient Accumulation steps = 4
03/28/2024 20:06:58 - INFO - __main__ -   Total optimization steps = 500
Steps:   0%|▏                                                                                       | 1/500 [10:04<83:43:30, 604.03s/it, loss=0.0871, lr=0.0001]

Any suggestions to make it faster?

@bghira
Contributor

bghira commented Mar 28, 2024

are you on 14.4? i've been using pytorch 2.2 and i get about 10 seconds per step with 1 megapixel images on an M3 Max 128G. do you observe any memory / swap pressure?

@bghira
Contributor

bghira commented Mar 28, 2024

also, in my environment, i've been running with --mixed_precision=fp16 but i'm not sure why that's erroring out for you the way it is.

the code only returns an error to the user when mixed_precision="bf16", informing them to use fp16 instead. the default is actually fp32, which seems to be in use here hence the extreme slowdown.

the goal should be to ensure that mixed_precision=fp16 works on mps.

the relevant section from the linked PR:

    # Some configurations require autocast to be disabled.
    enable_autocast = True
    if torch.backends.mps.is_available() or (
        accelerator.mixed_precision == "fp16" or accelerator.mixed_precision == "bf16"
    ):
        enable_autocast = False

disables autocast on MPS.
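in the image generation path that flag then picks between a real autocast context and a no-op, roughly like this (just a sketch of the pattern; pipeline, prompts and accelerator are assumed from the surrounding script):

    import contextlib
    import torch

    def inference_context(enable_autocast: bool, device_type: str):
        # real autocast where it is supported, a plain no-op context on mps
        if enable_autocast:
            return torch.autocast(device_type)
        return contextlib.nullcontext()

    # usage in the validation loop (pipeline/prompts assumed from the script):
    # with inference_context(enable_autocast, accelerator.device.type):
    #     images = pipeline(prompts, num_inference_steps=25).images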

wasn't sure whether the initial report included that PR or not. if it didn't, could you re-attempt with --mixed_precision=fp16?

@sagargulabani
Author

so this is the error I see when I run it with mixed precision fp16:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision=fp16 \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
/opt/anaconda3/envs/hf/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
03/29/2024 09:18:34 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: mps

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/Users/sagargulabani/dev/huggingface-transformers/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1985, in <module>
    main(args)
  File "/Users/sagargulabani/dev/huggingface-transformers/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1525, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1263, in prepare
    result = tuple(
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1264, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1140, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
    autocast_context = get_mixed_precision_context_manager(self.native_amp, self.autocast_handler)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1745, in get_mixed_precision_context_manager
    return torch.autocast(device_type=device_type, dtype=torch.float16, **autocast_kwargs)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'
Traceback (most recent call last):
  File "/opt/anaconda3/envs/hf/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/opt/anaconda3/envs/hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/envs/hf/bin/python', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=dog', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--output_dir=lora-trained-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=500', '--validation_prompt=A photo of sks dog in a bucket', '--validation_epochs=25', '--seed=0', '--push_to_hub']' returned non-zero exit status 1.

Yes, I am on macOS Sonoma 14.4, just upgraded it.

@sagargulabani
Author

When I run the code without mixed precision fp16, these are the screenshots of what I see in Activity Monitor, htop, and asitop. I see that the GPU is not being utilized much.

[Screenshots: Activity Monitor / htop / asitop during training, 2024-03-28 and 2024-03-29]

@bghira
Contributor

bghira commented Mar 29, 2024

are you running the latest main branch?

@sagargulabani
Author

Yes, I took a pull yesterday.

I also took a pull right now - 34c90db (this is the commit) -

and ran pip install -e .

and after that I am still getting the same error.

This is what my pip list command shows for diffusers:
diffusers 0.28.0.dev0 /Users/sagargulabani/dev/huggingface-transformers/diffusers

@bghira
Contributor

bghira commented Mar 31, 2024

#7530 might fix this one @sagargulabani

@sagargulabani
Author

Hi @bghira,
I checked out this commit - bghira@ad3eb80 -

and tried to run the same command above with the same script - train_dreambooth_lora_sdxl.py -
but I am still running into the same issue:

RuntimeError: User specified an unsupported autocast device_type 'mps'

@bghira
Contributor

bghira commented Apr 1, 2024

@sagargulabani i've updated that script in particular for that PR. it now uses native_amp = False in the Accelerator config.

can you please re-run with that change? i will apply it to the rest of the scripts after.
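roughly, that change boils down to the following right after the Accelerator is constructed (a sketch, with the args names assumed from the script):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
    )
    # mps has no usable autocast here, so keep the requested precision for the
    # weights but skip accelerate's native AMP context entirely.
    if torch.backends.mps.is_available():
        accelerator.native_amp = False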

@akospalfi

akospalfi commented Apr 1, 2024

@bghira I've been having the same problem as @sagargulabani, and your new change of explicitly disabling native amp leads to a different type of error:

loc("mps_add"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<2x1280xf16>' and 'tensor<1280xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
Traceback (most recent call last):
  File "/Users/palfia/jax-metal/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
    simple_launcher(args)
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

edit: script parameters (pytorch 2.2.2)

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a sks dog" \
  --class_prompt="a dog" \
  --mixed_precision=fp16 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --max_train_steps=800

@bghira
Contributor

bghira commented Apr 1, 2024

was there more to the traceback before that one? that's the traceback from Accelerate, but the one from the trainer is needed to know where this error originated. i believe it's in log_validations where the dtypes change. this is something i saw when also updating to pytorch 2.2 latest.

i'm really hoping we don't have to run .to() on all of the embeds.

@bghira
Contributor

bghira commented Apr 1, 2024

@sayakpaul i think i'm in a bit of a need of rescuing on this issue. do you have any ideas on how to proceed? maybe a dummycast wrapper in train utils as i mentioned last week? the dtypes have to be the same everywhere for MPS.
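for illustration, the dummycast idea would be a small helper along these lines (hypothetical name, just a sketch):

    import contextlib

    def dummy_or_real_autocast(accelerator):
        # hypothetical train_utils helper: a no-op context on mps, the usual
        # accelerator-managed autocast everywhere else
        if accelerator.device.type == "mps":
            return contextlib.nullcontext()
        return accelerator.autocast()

the training and validation code would then wrap its forward passes in dummy_or_real_autocast(accelerator) instead of calling accelerator.autocast() directly.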

@akospalfi

was there more to the traceback before that one? that's the traceback from Accelerate, but the one from the trainer is needed to know where this error originated. i believe it's in log_validations where the dtypes change. this is something i saw when also updating to pytorch 2.2 latest.

i'm really hoping we don't have to run .to() on all of the embeds.

This is the full log, I can't see anything more useful:

/Users/palfia/jax-metal/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
/Users/palfia/jax-metal/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
/Users/palfia/jax-metal/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
04/01/2024 17:50:18 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: mps

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'rescale_betas_zero_snr', 'sample_max_value', 'thresholding'} was not found in config. Values will be initialized to default values.
04/01/2024 17:50:20 - INFO - __main__ - ***** Running training *****
04/01/2024 17:50:20 - INFO - __main__ -   Num examples = 100
04/01/2024 17:50:20 - INFO - __main__ -   Num batches each epoch = 100
04/01/2024 17:50:20 - INFO - __main__ -   Num Epochs = 8
04/01/2024 17:50:20 - INFO - __main__ -   Instantaneous batch size per device = 1
04/01/2024 17:50:20 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
04/01/2024 17:50:20 - INFO - __main__ -   Gradient Accumulation steps = 1
04/01/2024 17:50:20 - INFO - __main__ -   Total optimization steps = 800
Steps:   0%|                                                                                                                                                                                  | 0/800 [00:00<?, ?it/s]loc("mps_add"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<2x1280xf16>' and 'tensor<1280xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
Traceback (most recent call last):
  File "/Users/palfia/jax-metal/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
    simple_launcher(args)
  File "/Users/palfia/jax-metal/lib/python3.9/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/Users/palfia/jax-metal/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=/Users/palfia/fun/converted_dreamshaper_v8', '--instance_data_dir=/Users/palfia/fun/train_db/J/instance_images/prepared', '--class_data_dir=/Users/palfia/fun/train_db/J/class_images', '--output_dir=/Users/palfia/fun/dreambooth_models', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=a sks dog', '--class_prompt=a dog', '--mixed_precision=fp16', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=2e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--max_train_steps=800']' died with <Signals.SIGABRT: 6>.

@sagargulabani
Author

Hi @bghira, I also see the same error as @akospalfi

@bghira
Contributor

bghira commented Apr 1, 2024

i'm able to reproduce this one locally, but it's not clear why it's happening. the text encoder hidden states are fp16, the noisy inputs are fp16.

i can train locally on SimpleTuner, which handles dtypes differently, but it's not clear which difference is causing this problem.
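the blunt workaround would be to pin the inputs to the unet's parameter dtype right before the forward pass, something like this (a sketch with assumed variable names, not the script's exact code):

    import torch

    def match_unet_dtype(unet, *tensors):
        # hypothetical helper: cast input/conditioning tensors to the unet's
        # parameter dtype so mps never adds fp16 activations to fp32 weights
        unet_dtype = next(unet.parameters()).dtype
        return tuple(t.to(dtype=unet_dtype) for t in tensors)

    # usage inside the training loop (variable names assumed from the script):
    # noisy_model_input, encoder_hidden_states = match_unet_dtype(
    #     unet, noisy_model_input, encoder_hidden_states
    # )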

@sagargulabani
Author

Hi @bghira @sayakpaul,
Just following up on this one to see how we could go about it.

@bghira
Contributor

bghira commented Apr 8, 2024

it's been complicated to do in a non-invasive way for the diffusers project.

for now, i've been running dreambooth via simpletuner for the last few days successfully, introducing single subjects via these config values on pytorch 2.4 nightly.


github-actions bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 3, 2024
@bghira
Contributor

bghira commented May 3, 2024

not stale, just waiting on some pytorch improvements

@yiyixuxu yiyixuxu removed the stale Issues that haven't received updates label May 3, 2024
@sagargulabani
Author

Can we close this now that PyTorch supports autocast on mps?

@sayakpaul
Member

Have you verified if it runs successfully?

@sagargulabani
Author

No, I haven't verified it. Will verify and let you know.

@bghira
Contributor

bghira commented Jul 2, 2024

well, no. it's not even in a release yet :-)

@bghira
Contributor

bghira commented Jul 2, 2024

and it has now been reverted out of pytorch/main due to regressions :[


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@sayakpaul
Member

@sagargulabani does this work now?

@github-actions github-actions bot removed the stale Issues that haven't received updates label Sep 26, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 21, 2024
@sayakpaul
Member

Closing due to inactivity.

Labels
bug (Something isn't working) · stale (Issues that haven't received updates)

6 participants