Commit 81f33ed

Revert placing llama on cpu (#1827)

ugolowic authored and regisss committed Mar 6, 2025
1 parent 6d575e8 commit 81f33ed

Showing 3 changed files with 2 additions and 16 deletions.
3 changes: 0 additions & 3 deletions examples/text-generation/README.md

@@ -190,7 +190,6 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 
 To run Llama3-405B inference on 8 Gaudi3 cards use the following command:
 ```bash
-DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=true \
 ENABLE_LB_BUNDLE_ALL_COMPUTE_MME=0 ENABLE_EXPERIMENTAL_FLAGS=1 \
 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
@@ -497,7 +496,6 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
 Here is an example to measure the tensor quantization statistics on Llama3-405B with 8 cards:
 > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3.
 ```bash
-DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=true \
 QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_lm_eval.py \
 -o acc_llama3_405b_bs1_quant.txt \
@@ -516,7 +514,6 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \
 Here is an example to quantize the model based on previous measurements for Llama3-405B with 8 cards:
 > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3.
 ```bash
-DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=true \
 QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
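All three README commands simply drop the DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=true prefix; the rest of each invocation is unchanged. As an illustration only (not part of this commit), the same flag could be injected from a Python driver instead of exporting it in the shell, e.g. in a test harness:

```python
import os
import subprocess

# Illustration, not from the repository: pass the removed deterministic-API
# flag to a single child process instead of setting it shell-wide.
env = dict(os.environ, DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API="true")
subprocess.run(
    [
        "python", "../gaudi_spawn.py", "--use_deepspeed", "--world_size", "8",
        "run_generation.py",
        "--model_name_or_path", "meta-llama/Llama-3.1-405B-Instruct",
    ],
    env=env,
    check=True,
)
```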
7 changes: 1 addition & 6 deletions examples/text-generation/quantization_config/unit_scale_quant.json

@@ -3,10 +3,5 @@
     "mode": "QUANTIZE",
     "observer": "maxabs",
     "scale_method": "unit_scale",
-    "whitelist": {"types": [], "names": []},
-    "blacklist": {"types": [], "names": []},
-    "quantize_weight": false,
-    "dump_stats_path": "./results/hk",
-    "ignore_modules_wo_measures": "True",
-    "dump_stats_xlsx_path": "./run_outputs/fp8stats.xlsx"
+    "dump_stats_path": "./hqt_output/measure"
 }
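The config change prunes the stale allow/deny lists and Excel dump options and points dump_stats_path at ./hqt_output/measure. A small sanity-check sketch (mine, not from the commit; the config path below is an assumption based on the keys shown in this diff):

```python
import json
from pathlib import Path

# Sketch: load the quantization config and pre-create the dump directory
# so a measurement run does not fail on a missing path.
cfg_path = Path("quantization_config/unit_scale_quant.json")  # assumed location
cfg = json.loads(cfg_path.read_text())

assert cfg["mode"] == "QUANTIZE" and cfg["scale_method"] == "unit_scale"
Path(cfg["dump_stats_path"]).parent.mkdir(parents=True, exist_ok=True)  # ./hqt_output
print(f"stats will be dumped under {cfg['dump_stats_path']}")
```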
8 changes: 1 addition & 7 deletions examples/text-generation/utils.py

@@ -439,12 +439,7 @@ def setup_distributed_model(args, model_dtype, model_kwargs, logger):
     logger.info("DeepSpeed is enabled.")
     deepspeed.init_distributed(dist_backend="hccl")
     config = AutoConfig.from_pretrained(args.model_name_or_path, torch_dtype=model_dtype, **model_kwargs)
-
-    keep_module_on_host = False
-    if "Llama-3.1-405B" in args.model_name_or_path:
-        keep_module_on_host = True
-
-    load_to_meta = False if keep_module_on_host else model_on_meta(config)
+    load_to_meta = model_on_meta(config)
 
     if args.assistant_model is None:
         assistant_model = None
@@ -499,7 +494,6 @@ def setup_distributed_model(args, model_dtype, model_kwargs, logger):
 
     # Initialize the model
     ds_inference_kwargs = {"dtype": model_dtype}
-    ds_inference_kwargs["keep_module_on_host"] = keep_module_on_host
     ds_inference_kwargs["tensor_parallel"] = {"tp_size": args.world_size}
     ds_inference_kwargs["enable_cuda_graph"] = args.use_hpu_graphs
     ds_inference_kwargs["injection_policy"] = get_ds_injection_policy(config)
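With the revert, load_to_meta again comes straight from model_on_meta(config): the checkpoint skeleton is created on PyTorch's meta device and no host RAM is committed to weights before DeepSpeed shards them, instead of pinning Llama-3.1-405B to host memory via keep_module_on_host. A minimal sketch of the meta-device behavior (my illustration with a stand-in module, not this repository's code):

```python
import torch
import torch.nn as nn

# Sketch: a module built under the "meta" device has shapes and dtypes but no
# storage, which is what load_to_meta buys when instantiating a huge checkpoint.
with torch.device("meta"):
    skeleton = nn.Linear(4096, 4096)  # stand-in for the real transformer

print(skeleton.weight.device)  # -> meta
# Bytes the weights *would* occupy, computed without ever allocating them:
nbytes = skeleton.weight.nelement() * skeleton.weight.element_size()
print(f"{nbytes / 2**20:.1f} MiB deferred")
```

After the revert, deepspeed.init_inference receives only dtype, tensor_parallel, enable_cuda_graph, and injection_policy, exactly as the second hunk shows.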
