From 9deeeafd2b9026cdc2fa9aaea505f14a74676db3 Mon Sep 17 00:00:00 2001
From: ZhengHongming888
Date: Mon, 3 Mar 2025 15:29:17 -0800
Subject: [PATCH 1/2] fix Sentence Transformer restart issue

---
 examples/sentence-transformers-training/nli/README.md | 6 +++---
 .../sentence-transformers-training/nli/ds_config.json | 2 +-
 examples/sentence-transformers-training/sts/README.md | 8 ++++----
 .../sentence-transformers-training/sts/ds_config.json | 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/examples/sentence-transformers-training/nli/README.md b/examples/sentence-transformers-training/nli/README.md
index 189b7e2f81..25d1583cc6 100644
--- a/examples/sentence-transformers-training/nli/README.md
+++ b/examples/sentence-transformers-training/nli/README.md
@@ -67,16 +67,16 @@ Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 1
 python training_nli.py intfloat/e5-mistral-7b-instruct --peft --lora_target_module "q_proj" "k_proj" "v_proj" --learning_rate 1e-5
 ```
 
-## Multi-card Training with Deepspeed Zero2/3
+## Multi-card Training with Deepspeed Zero3
 
-Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed (model parallelism) to reduce the memory requirements.
+Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we will use the Zero3 stage of DeepSpeed (model parallelism) to reduce the memory requirements.
 
 Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
 
 ```bash
 python ../../gaudi_spawn.py --world_size 4 --use_deepspeed training_nli.py intfloat/e5-mistral-7b-instruct --deepspeed ds_config.json --bf16 --no-use_hpu_graphs_for_training --learning_rate 1e-7
 ```
 
-In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
+In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file.
 
 # Dataset
diff --git a/examples/sentence-transformers-training/nli/ds_config.json b/examples/sentence-transformers-training/nli/ds_config.json
index 5d5b80af99..565d31b6d1 100644
--- a/examples/sentence-transformers-training/nli/ds_config.json
+++ b/examples/sentence-transformers-training/nli/ds_config.json
@@ -8,7 +8,7 @@
     },
     "gradient_clipping": 1.0,
     "zero_optimization": {
-        "stage": 2,
+        "stage": 3,
         "overlap_comm": false,
         "reduce_scatter": false,
         "contiguous_gradients": false
diff --git a/examples/sentence-transformers-training/sts/README.md b/examples/sentence-transformers-training/sts/README.md
index 0fcd44e1a7..61e5af90f4 100644
--- a/examples/sentence-transformers-training/sts/README.md
+++ b/examples/sentence-transformers-training/sts/README.md
@@ -54,17 +54,17 @@ Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 1
 python training_stsbenchmark.py intfloat/e5-mistral-7b-instruct --peft --lora_target_modules "q_proj" "k_proj" "v_proj"
 ```
 
-## Multi-card Training with Deepspeed Zero2/3
+## Multi-card Training with Deepspeed Zero3
 
-Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed (model parallelism) to reduce the memory requirements.
+Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we will use the Zero3 stage of DeepSpeed (model parallelism) to reduce the memory requirements.
 
-Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
+Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero3.
 
 ```bash
 python ../../gaudi_spawn.py --world_size 4 --use_deepspeed training_stsbenchmark.py intfloat/e5-mistral-7b-instruct --deepspeed ds_config.json --bf16 --no-use_hpu_graphs_for_training --learning_rate 1e-7
 ```
 
-In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
+In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file.
 
 # Training data
 
diff --git a/examples/sentence-transformers-training/sts/ds_config.json b/examples/sentence-transformers-training/sts/ds_config.json
index 5d5b80af99..565d31b6d1 100644
--- a/examples/sentence-transformers-training/sts/ds_config.json
+++ b/examples/sentence-transformers-training/sts/ds_config.json
@@ -8,7 +8,7 @@
     },
     "gradient_clipping": 1.0,
     "zero_optimization": {
-        "stage": 2,
+        "stage": 3,
         "overlap_comm": false,
         "reduce_scatter": false,
         "contiguous_gradients": false

From b2be69dbc03ce15305c8dad23ee987eb1e1a865f Mon Sep 17 00:00:00 2001
From: ZhengHongming888
Date: Mon, 3 Mar 2025 15:50:31 -0800
Subject: [PATCH 2/2] minor

---
 examples/sentence-transformers-training/nli/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/sentence-transformers-training/nli/README.md b/examples/sentence-transformers-training/nli/README.md
index 25d1583cc6..4d21543da6 100644
--- a/examples/sentence-transformers-training/nli/README.md
+++ b/examples/sentence-transformers-training/nli/README.md
@@ -71,7 +71,7 @@ python training_nli.py intfloat/e5-mistral-7b-instruct --peft --lora_target_modu
 
 Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we will use the Zero3 stage of DeepSpeed (model parallelism) to reduce the memory requirements.
 
-Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
+Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero3.
 
 ```bash
 python ../../gaudi_spawn.py --world_size 4 --use_deepspeed training_nli.py intfloat/e5-mistral-7b-instruct --deepspeed ds_config.json --bf16 --no-use_hpu_graphs_for_training --learning_rate 1e-7
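
For reference, this is a minimal sketch of the `zero_optimization` section that both `ds_config.json` files end up with after PATCH 1/2. It is reconstructed only from the hunk context shown above; keys the diff neither touches nor shows (for example the precision settings implied by the `--bf16` flag) are omitted rather than guessed.

```json
{
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}
```

Every key and value in this fragment appears verbatim in the hunks above; the enclosing braces are added only to keep the snippet valid JSON.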