Commit 5e4f119

Update README commands for more models to use --sdp_on_bf16
1 parent fdc79d4 commit 5e4f119
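
For context, `--sdp_on_bf16` allows the scaled dot-product attention (SDPA) computation in these example scripts to run in reduced (bfloat16) precision, which is why it is appended to the `--bf16` inference commands in the READMEs below. The sketch that follows is only an illustration of what SDPA in bf16 means in plain PyTorch; it is not the optimum-habana implementation of the flag, and it assumes a PyTorch version that provides `torch.nn.functional.scaled_dot_product_attention`.

```python
# Illustrative sketch only: scaled dot-product attention evaluated on bfloat16
# tensors. The example scripts expose this choice through the --sdp_on_bf16
# flag instead of casting tensors by hand; the shapes here are arbitrary.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v)  # attention computed in bf16
print(out.dtype)  # torch.bfloat16
```

Running attention in bf16 reduces memory traffic and speeds up the attention block relative to fp32, at some cost in numerical precision; that trade-off is the usual motivation for a switch like this.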

File tree

6 files changed: +63 -43 lines changed

examples/image-to-text/README.md

+30 -15
@@ -44,63 +44,71 @@ python3 run_pipeline.py \
     --model_name_or_path Salesforce/blip-image-captioning-large \
     --image_path "https://ankur3107.github.io/assets/images/image-captioning-example.png" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-1.5-7b inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llava-1.5-7b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-1.5-13b inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llava-1.5-13b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-v1.6-mistral-7b inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-v1.6-vicuna-13b inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-hf/llava-v1.6-34b-hf inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-34b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run google/paligemma-3b-mix-224 inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path google/paligemma-3b-mix-224 \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run Llava-hf/llama3-llava-next-8b-hf inference, use the following command:
 ```bash
 python3 run_pipeline.py \
     --model_name_or_path llava-hf/llama3-llava-next-8b-hf \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run idefics2 inference, use the following command:
@@ -109,7 +117,8 @@ To run idefics2 inference, use the following command:
 python3 run_pipeline.py \
     --model_name_or_path HuggingFaceM4/idefics2-8b \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 To run mllama inference using reduced precision in the SDPA, use the following command:
@@ -134,7 +143,8 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
     --model_name_or_path llava-hf/llava-1.5-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 Here is an example to quantize the model based on previous measurements for Llava-1.5-7b:
@@ -143,7 +153,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
     --model_name_or_path llava-hf/llava-1.5-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```


@@ -153,7 +164,8 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 Here is an example to quantize the model based on previous measurements for Llava-v1.6-mistral-7b:
@@ -162,7 +174,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
     --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b:
@@ -171,7 +184,8 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b:
@@ -180,7 +194,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
     --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 ### Inference with FusedSDPA

examples/question-answering/README.md

+1 -2
@@ -224,8 +224,7 @@ python ../gaudi_spawn.py \
     --use_hpu_graphs_for_inference \
     --throughput_warmup_steps 3 \
     --max_train_samples 45080 \
-    --deepspeed ../../tests/configs/deepspeed_zero_2.json \
-    --sdp_on_bf16
+    --deepspeed ../../tests/configs/deepspeed_zero_2.json
 ```

examples/speech-recognition/README.md

+8 -10
@@ -87,8 +87,7 @@ python run_speech_recognition_ctc.py \
     --throughput_warmup_steps="3" \
     --bf16 \
     --use_hpu_graphs_for_training \
-    --use_hpu_graphs_for_inference \
-    --sdp_on_bf16
+    --use_hpu_graphs_for_inference
 ```

 On a single HPU, this script should run in *ca.* 6 hours and yield a CTC loss of **0.059** and a word error rate of **0.0423**.
@@ -129,8 +128,7 @@ python ../gaudi_spawn.py \
     --throughput_warmup_steps 3 \
     --bf16 \
     --use_hpu_graphs_for_training \
-    --use_hpu_graphs_for_inference \
-    --sdp_on_bf16
+    --use_hpu_graphs_for_inference
 ```

 On 8 HPUs, this script should run in *ca.* 49 minutes and yield a CTC loss of **0.0613** and a word error rate of **0.0458**.
@@ -178,8 +176,7 @@ python ../gaudi_spawn.py \
     --use_lazy_mode \
     --gaudi_config_name Habana/wav2vec2 \
     --throughput_warmup_steps 3 \
-    --deepspeed ../../tests/configs/deepspeed_zero_2.json \
-    --sdp_on_bf16
+    --deepspeed ../../tests/configs/deepspeed_zero_2.json
 ```

 [The documentation](https://huggingface.co/docs/optimum/habana/usage_guides/deepspeed) provides more information about how to use DeepSpeed within Optimum Habana.
@@ -211,8 +208,7 @@ python run_speech_recognition_ctc.py \
     --use_lazy_mode \
     --gaudi_config_name="Habana/wav2vec2" \
     --bf16 \
-    --use_hpu_graphs_for_inference \
-    --sdp_on_bf16
+    --use_hpu_graphs_for_inference
 ```
 ## Sequence to Sequence

@@ -259,7 +255,8 @@ python run_speech_recognition_seq2seq.py \
     --use_hpu_graphs_for_inference \
     --label_features_max_length 128 \
     --dataloader_num_workers 8 \
-    --throughput_warmup_steps 3
+    --throughput_warmup_steps 3 \
+    --sdp_on_bf16
 ```

 If training on a different language, you should be sure to change the `language` argument. The `language` and `task` arguments should be omitted for English speech recognition.
@@ -329,5 +326,6 @@ python run_speech_recognition_seq2seq.py \
     --use_habana \
     --use_hpu_graphs_for_inference \
     --label_features_max_length 128 \
-    --dataloader_num_workers 8
+    --dataloader_num_workers 8 \
+    --sdp_on_bf16
 ```

examples/text-classification/README.md

+2 -4
@@ -194,8 +194,7 @@ python ../gaudi_spawn.py \
     --use_lazy_mode \
     --use_hpu_graphs_for_inference \
     --throughput_warmup_steps 3 \
-    --deepspeed ../../tests/configs/deepspeed_zero_2.json \
-    --sdp_on_bf16
+    --deepspeed ../../tests/configs/deepspeed_zero_2.json
 ```

 You can look at the [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/deepspeed) for more information about how to use DeepSpeed in Optimum Habana.
@@ -221,6 +220,5 @@ python run_glue.py \
     --use_lazy_mode \
     --use_hpu_graphs_for_inference \
     --throughput_warmup_steps 2 \
-    --bf16 \
-    --sdp_on_bf16
+    --bf16
 ```

examples/text-generation/README.md

+22 -11
@@ -79,7 +79,8 @@ python run_generation.py \
     --use_kv_cache \
     --max_new_tokens 100 \
     --do_sample \
-    --prompt "Here is my prompt"
+    --prompt "Here is my prompt" \
+    --sdp_on_bf16
 ```

 If you want to provide several prompts as inputs, here is how to do it:
@@ -91,7 +92,8 @@ python run_generation.py \
     --max_new_tokens 100 \
     --do_sample \
     --batch_size 2 \
-    --prompt "Hello world" "How are you?"
+    --prompt "Hello world" "How are you?" \
+    --sdp_on_bf16
 ```

 > The batch size should be larger than or equal to the number of prompts. Otherwise, only the first N prompts are kept with N being equal to the batch size.
@@ -110,7 +112,8 @@ python run_generation.py \
     --use_kv_cache \
     --num_return_sequences 1 \
     --temperature 0 \
-    --prompt "Alice and Bob"
+    --prompt "Alice and Bob" \
+    --sdp_on_bf16
 ```

 ### Benchmark
@@ -137,7 +140,8 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
     --batch_size 1 \
     --use_hpu_graphs \
     --use_kv_cache \
-    --max_new_tokens 100
+    --max_new_tokens 100 \
+    --sdp_on_bf16
 ```

 You can also run Llama2-70B on Gaudi2 with all optimizations enabled using the following command:
@@ -152,7 +156,8 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
     --attn_softmax_bf16 \
     --limit_hpu_graphs \
     --reuse_cache \
-    --trim_logits
+    --trim_logits \
+    --sdp_on_bf16
 ```

 To run Falcon-7B inference, use the following command:
@@ -164,7 +169,8 @@ python run_generation.py \
     --use_kv_cache \
     --batch_size 1 \
     --max_new_tokens 128 \
-    --do_sample
+    --do_sample \
+    --sdp_on_bf16
 ```

 To run Falcon-40B inference on 8 Gaudi2 cards, use the following command:
@@ -195,7 +201,8 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 > --use_hpu_graphs \
 > --use_kv_cache \
 > --max_new_tokens 100 \
-> --bf16
+> --bf16 \
+> --sdp_on_bf16
 > ```

 ### Use any dataset from the Hugging Face Hub
@@ -214,7 +221,8 @@ python run_generation.py \
     --use_kv_cache \
     --dataset_name JulesBelveze/tldr_news \
     --column_name content \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 > The prompt length is limited to 16 tokens. Prompts longer than this will be truncated.
@@ -233,7 +241,8 @@ python run_generation.py \
     --bf16 \
     --max_new_tokens 100 \
     --prompt "Here is my prompt" \
-    --peft_model yard1/llama-2-7b-sql-lora-test
+    --peft_model yard1/llama-2-7b-sql-lora-test \
+    --sdp_on_bf16
 ```

 ### Using growing bucket optimization
@@ -490,7 +499,8 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py
     --max_new_tokens 100 \
     --batch_size 1 \
     --reuse_cache \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

 Here is an example to quantize the model based on previous measurements for gemma with 1 card:
@@ -502,7 +512,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_gemma.json python run_generation
     --max_new_tokens 100 \
     --batch_size 1 \
     --reuse_cache \
-    --bf16
+    --bf16 \
+    --sdp_on_bf16
 ```

tests/test_text_generation_example.py

-1
@@ -221,7 +221,6 @@ def _test_text_generation(
 
     if "gemma" in model_name.lower():
         command += ["--use_flash_attention"]
-        command += ["--sdp_on_bf16"]
 
     if "decilm" in model_name.lower():
         command += ["--sdp_on_bf16"]
