diff --git a/examples/sentence-transformers-training/nli/README.md b/examples/sentence-transformers-training/nli/README.md
index 189b7e2f81..4d21543da6 100644
--- a/examples/sentence-transformers-training/nli/README.md
+++ b/examples/sentence-transformers-training/nli/README.md
@@ -67,16 +67,16 @@ Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 1
 python training_nli.py intfloat/e5-mistral-7b-instruct --peft --lora_target_module "q_proj" "k_proj" "v_proj" --learning_rate 1e-5
 ```
 
-## Multi-card Training with Deepspeed Zero2/3
+## Multi-card Training with Deepspeed Zero3
 
-Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed (model parallelism) to reduce the memory requirements.
+Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we will use the Zero3 stage of DeepSpeed (model parallelism) to reduce the memory requirements.
 
-Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
+Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero3.
 
 ```bash
 python ../../gaudi_spawn.py --world_size 4 --use_deepspeed training_nli.py intfloat/e5-mistral-7b-instruct --deepspeed ds_config.json --bf16 --no-use_hpu_graphs_for_training --learning_rate 1e-7
 ```
 
-In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
+In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file.
 
 # Dataset
diff --git a/examples/sentence-transformers-training/nli/ds_config.json b/examples/sentence-transformers-training/nli/ds_config.json
index 5d5b80af99..565d31b6d1 100644
--- a/examples/sentence-transformers-training/nli/ds_config.json
+++ b/examples/sentence-transformers-training/nli/ds_config.json
@@ -8,7 +8,7 @@
     },
     "gradient_clipping": 1.0,
     "zero_optimization": {
-        "stage": 2,
+        "stage": 3,
         "overlap_comm": false,
         "reduce_scatter": false,
         "contiguous_gradients": false
diff --git a/examples/sentence-transformers-training/sts/README.md b/examples/sentence-transformers-training/sts/README.md
index 0fcd44e1a7..61e5af90f4 100644
--- a/examples/sentence-transformers-training/sts/README.md
+++ b/examples/sentence-transformers-training/sts/README.md
@@ -54,17 +54,17 @@ Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 1
 python training_stsbenchmark.py intfloat/e5-mistral-7b-instruct --peft --lora_target_modules "q_proj" "k_proj" "v_proj"
 ```
 
-## Multi-card Training with Deepspeed Zero2/3
+## Multi-card Training with Deepspeed Zero3
 
-Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed (model parallelism) to reduce the memory requirements.
+Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we will use the Zero3 stage of DeepSpeed (model parallelism) to reduce the memory requirements.
 
-Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
+Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero3.
 
 ```bash
 python ../../gaudi_spawn.py --world_size 4 --use_deepspeed training_stsbenchmark.py intfloat/e5-mistral-7b-instruct --deepspeed ds_config.json --bf16 --no-use_hpu_graphs_for_training --learning_rate 1e-7
 ```
 
-In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
+In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file.
 
 # Training data
 
diff --git a/examples/sentence-transformers-training/sts/ds_config.json b/examples/sentence-transformers-training/sts/ds_config.json
index 5d5b80af99..565d31b6d1 100644
--- a/examples/sentence-transformers-training/sts/ds_config.json
+++ b/examples/sentence-transformers-training/sts/ds_config.json
@@ -8,7 +8,7 @@
     },
     "gradient_clipping": 1.0,
     "zero_optimization": {
-        "stage": 2,
+        "stage": 3,
         "overlap_comm": false,
         "reduce_scatter": false,
         "contiguous_gradients": false
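As a quick sanity check after applying this patch, here is a minimal sketch (an assumption, not part of the patch) that confirms `ds_config.json` now requests ZeRO stage 3 before launching `gaudi_spawn.py`. It only reads the fields visible in the hunks above and assumes it is run from one of the example directories (nli or sts) containing the updated config.

```python
# Minimal sanity-check sketch (assumption: executed from a directory that
# contains the updated ds_config.json, e.g. the nli or sts example folder).
import json

with open("ds_config.json") as f:
    cfg = json.load(f)

# Only fields that appear in the diff hunks above are inspected here.
stage = cfg["zero_optimization"]["stage"]
print(f"DeepSpeed ZeRO stage: {stage}")
print(f"gradient_clipping:    {cfg['gradient_clipping']}")

assert stage == 3, "expected ZeRO stage 3 after this change"
```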