🌐 [i18n-KO] Translated fsdp.md to Korean (#32261)
* docs: ko: fsdp.md

* feat: nmt draft

* fix: manual edits

* Apply suggestions from code review

Co-authored-by: κΉ€μ€€μž¬ <55151385+junejae@users.noreply.github.com>
Co-authored-by: Minki Kim <100768622+1kmmk1@users.noreply.github.com>

* fix: resolve suggestions

* Update docs/source/ko/fsdp.md

Co-authored-by: κΉ€μ€€μž¬ <55151385+junejae@users.noreply.github.com>

* Update docs/source/ko/fsdp.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: κΉ€μ€€μž¬ <55151385+junejae@users.noreply.github.com>
Co-authored-by: Minki Kim <100768622+1kmmk1@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
4 people authored Aug 8, 2024
1 parent e0396bd commit 496207a
Showing 2 changed files with 140 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/source/ko/_toctree.yml
@@ -170,8 +170,8 @@
title: (in translation) Methods and tools for efficient training on a single GPU
- local: perf_train_gpu_many
title: Training on multiple GPUs
- local: in_translation
title: (in translation) Fully Sharded Data Parallel
- local: fsdp
title: Fully Sharded Data Parallelism
- local: in_translation
title: (in translation) DeepSpeed
- local: perf_train_cpu
138 changes: 138 additions & 0 deletions docs/source/ko/fsdp.md
@@ -0,0 +1,138 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Fully Sharded Data Parallel (FSDP) [[fully-sharded-data-parallel]]

[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data parallelism method that shards a model's parameters, gradients, and optimizer states across the available GPUs (also called workers or *ranks*). Unlike [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), which replicates the full model on every GPU, FSDP shards the model and therefore reduces memory usage. This improves GPU memory efficiency and lets you train much larger models on fewer GPUs. FSDP is integrated with Accelerate, a library that makes it easy to manage training in distributed environments, so it can be used from the [`Trainer`] class.

Before you begin, make sure Accelerate is installed and that you are running at least PyTorch 2.1.0 or newer.

```bash
pip install accelerate
```

## FSDP configuration [[fsdp-configuration]]

To get started, run the [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) command to create a configuration file for your training environment. Accelerate uses this configuration file to automatically set up the correct training environment based on the training options you select in `accelerate config`.

```bash
accelerate config
```

Running `accelerate config` prompts you with a series of options to configure your training environment. This section covers some of the most important FSDP options. To learn more about the other available FSDP options, take a look at the [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.
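
The answers you give are saved to a YAML file that `accelerate launch` reads later (the complete example in the [Launch training](#launch-training) section shows every field). As a rough sketch, and assuming the default save location used by recent Accelerate releases, the FSDP-related options end up nested under an `fsdp_config` key:

```yaml
# typically written to ~/.cache/huggingface/accelerate/default_config.yaml;
# a different file can be supplied with `accelerate launch --config_file <path>`
distributed_type: FSDP
fsdp_config: {}  # the options described in the sections below are collected here
```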

### Sharding strategy [[sharding-strategy]]

FSDP offers several sharding strategies to choose from:

* `FULL_SHARD` - shards model parameters, gradients, and optimizer states across workers; choose `1` for this option
* `SHARD_GRAD_OP` - shards gradients and optimizer states across workers; choose `2` for this option
* `NO_SHARD` - shards nothing (equivalent to DDP); choose `3` for this option
* `HYBRID_SHARD` - shards model parameters, gradients, and optimizer states within each worker, while each worker also keeps a full copy; choose `4` for this option
* `HYBRID_SHARD_ZERO2` - shards gradients and optimizer states within each worker, while each worker also keeps a full copy; choose `5` for this option

This is enabled with the `fsdp_sharding_strategy` flag.
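
In the generated configuration file, the chosen strategy appears as a numeric value, for example:

```yaml
fsdp_config:
  # 1 = FULL_SHARD, 2 = SHARD_GRAD_OP, 3 = NO_SHARD, 4 = HYBRID_SHARD, 5 = HYBRID_SHARD_ZERO2
  fsdp_sharding_strategy: 1
```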

### CPU offload [[cpu-offload]]

You can offload parameters and gradients that are not in use to the CPU to save even more GPU memory and fit large models for which even FSDP alone may not be enough. This is enabled by setting `fsdp_offload_params: true` when running `accelerate config`.
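
In the configuration file this corresponds to a single entry under `fsdp_config`:

```yaml
fsdp_config:
  fsdp_offload_params: true  # offload parameters and gradients to the CPU when not in use
```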

### Wrapping policy [[wrapping-policy]]

FSDPλŠ” λ„€νŠΈμ›Œν¬μ˜ 각 λ ˆμ΄μ–΄λ₯Ό λž˜ν•‘ν•˜μ—¬ μ μš©λ©λ‹ˆλ‹€. λž˜ν•‘μ€ 일반적으둜 쀑첩 λ°©μ‹μœΌλ‘œ 적용되며 각각 순방ν–₯으둜 μ§€λ‚˜κ°„ ν›„ 전체 κ°€μ€‘μΉ˜λ₯Ό μ‚­μ œν•˜μ—¬ λ‹€μŒ λ ˆμ΄μ–΄μ—μ„œ μ‚¬μš©ν•  λ©”λͺ¨λ¦¬λ₯Ό μ ˆμ•½ν•©λ‹ˆλ‹€. *μžλ™ λž˜ν•‘* 정책은 이λ₯Ό κ΅¬ν˜„ν•˜λŠ” κ°€μž₯ κ°„λ‹¨ν•œ 방법이며 μ½”λ“œλ₯Ό λ³€κ²½ν•  ν•„μš”κ°€ μ—†μŠ΅λ‹ˆλ‹€. Transformer λ ˆμ΄μ–΄λ₯Ό λž˜ν•‘ν•˜λ €λ©΄ `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP`λ₯Ό μ„ νƒν•˜κ³  λž˜ν•‘ν•  λ ˆμ΄μ–΄λ₯Ό μ§€μ •ν•˜λ €λ©΄ `fsdp_transformer_layer_cls_to_wrap`λ₯Ό μ„ νƒν•˜μ„Έμš” (예: `BertLayer`).

λ˜λŠ” νŠΉμ • λ§€κ°œλ³€μˆ˜ 수λ₯Ό μ΄ˆκ³Όν•  경우 FSDPκ°€ λ ˆμ΄μ–΄μ— μ μš©λ˜λŠ” 크기 기반 λž˜ν•‘ 정책을 선택할 수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŠ” `fsdp_wrap_policy: SIZE_BASED_WRAP` 및 `min_num_param`을 μ›ν•˜λŠ” 크기의 μž„κ³„κ°’μœΌλ‘œ μ„€μ •ν•˜μ—¬ ν™œμ„±ν™”λ©λ‹ˆλ‹€.

### Checkpointing [[checkpointing]]

쀑간 μ²΄ν¬ν¬μΈνŠΈλŠ” `fsdp_state_dict_type: SHARDED_STATE_DICT`둜 μ €μž₯ν•΄μ•Ό ν•©λ‹ˆλ‹€. CPU μ˜€ν”„λ‘œλ“œκ°€ ν™œμ„±ν™”λœ 랭크 0μ—μ„œ 전체 μƒνƒœ λ”•μ…”λ„ˆλ¦¬λ₯Ό μ €μž₯ν•˜λŠ” 데 μ‹œκ°„μ΄ 많이 걸리고, λΈŒλ‘œλ“œμΊμŠ€νŒ… 쀑 λ¬΄κΈ°ν•œ λŒ€κΈ°ν•˜μ—¬ `NCCL Timeout` 였λ₯˜κ°€ λ°œμƒν•  수 있기 λ•Œλ¬Έμž…λ‹ˆλ‹€. [`~accelerate.Accelerator.load_state`] λ©”μ„œλ“œλ₯Ό μ‚¬μš©ν•˜μ—¬ λΆ„ν• λœ μƒνƒœ λ”•μ…”λ„ˆλ¦¬λ‘œ ν›ˆλ ¨μ„ μž¬κ°œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

```py
# directory containing checkpoints
accelerator.load_state("ckpt")
```

However, once training is finished, you should save the full state dictionary, because the sharded state dictionary is only compatible with FSDP.

```py
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(script_args.output_dir)
```

### TPU [[tpu]]

[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html) supports FSDP training on TPUs, and it can be enabled by modifying the FSDP configuration file generated by `accelerate config`. In addition to the sharding strategy and wrapping options specified above, you can add the parameters shown below to the file.

```yaml
xla: True # must be set to True to enable PyTorch/XLA
xla_fsdp_settings: # XLA-specific FSDP parameters
xla_fsdp_grad_ckpt: True # use gradient checkpointing
```
[`xla_fsdp_settings`](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) lets you configure additional XLA-specific parameters for FSDP.

## Launch training [[launch-training]]

An example FSDP configuration file may look like this:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

ν›ˆλ ¨μ„ μ‹œμž‘ν•˜λ €λ©΄ [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) λͺ…령을 μ‹€ν–‰ν•˜μ„Έμš”. 이 λ•Œ 전에 `accelerate config`둜 μƒμ„±ν•œ ꡬ성 νŒŒμΌμ„ μžλ™μœΌλ‘œ μ‚¬μš©ν•©λ‹ˆλ‹€.

```bash
accelerate launch my-trainer-script.py
```

You can also specify the FSDP options directly in the launch command:

```bash
accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-trainer-script.py
```

## λ‹€μŒ 단계 [[next-steps]]

FSDP can be a powerful tool for training really large models when you have access to more than one GPU or TPU. By sharding the model parameters, optimizer and gradient states, and even offloading them to the CPU when they are inactive, FSDP can reduce the high cost of large-scale training. If you'd like to learn more, the following resources may be helpful:

* Follow along with the more in-depth Accelerate guide for [FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp).
* Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
* Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.
