๐ŸŒ [i18n-KO] Translated bitsandbytes.md to Korean #32408

Merged (11 commits, Aug 8, 2024)
24 changes: 23 additions & 1 deletion docs/source/ko/_toctree.yml
@@ -138,6 +138,28 @@
    - local: in_translation
      title: (In translation) Interoperability with GGUF files
  title: (In translation) Developer guides
- sections:
  - local: in_translation
    title: (In translation) Getting started
  - local: quantization/bitsandbytes
    title: bitsandbytes
  - local: in_translation
    title: (In translation) GPTQ
  - local: in_translation
    title: (In translation) AWQ
  - local: in_translation
    title: (In translation) AQLM
  - local: in_translation
    title: (In translation) Quanto
  - local: in_translation
    title: (In translation) EETQ
  - local: in_translation
    title: (In translation) HQQ
  - local: in_translation
    title: (In translation) Optimum
  - local: in_translation
    title: (In translation) Contribute new quantization method
  title: (In translation) Quantization methods
- sections:
  - local: in_translation
    title: (In translation) Getting started
@@ -746,4 +768,4 @@
    - local: in_translation
      title: (In translation) Utilities for Time Series
    title: (In translation) Internal Helpers
  title: (In translation) API
314 changes: 314 additions & 0 deletions docs/source/ko/quantization/bitsandbytes.md
@@ -0,0 +1,314 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

โš ๏ธ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# bitsandbytes[[bitsandbytes]]

[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest way to quantize a model to 8-bit and 4-bit. 8-bit quantization converts the non-outlier fp16 values to int8, converts the outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect that outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
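
As a rough sketch of this idea (a toy, hypothetical example with a single per-tensor absmax scale and an injected outlier column, much simpler than the vector-wise kernels bitsandbytes actually uses):

```py
import torch

# Toy illustration only: split outlier columns from the rest, quantize
# the non-outliers to int8, and recombine with the fp16 outliers.
weights = torch.randn(4, 8, dtype=torch.float16)
weights[:, 2] *= 60  # inject a hypothetical outlier column

# Columns with any value above the threshold (6 is the common default) stay in fp16
outlier_cols = (weights.abs() > 6).any(dim=0)
regular, outliers = weights[:, ~outlier_cols], weights[:, outlier_cols]

# Quantize the non-outlier values to int8 with absmax scaling
scale = 127 / regular.abs().max()
regular_int8 = (regular * scale).round().clamp(-128, 127).to(torch.int8)

# Dequantize back to fp16 and recombine with the fp16 outliers
recovered = torch.empty_like(weights)
recovered[:, ~outlier_cols] = regular_int8.to(torch.float16) / scale
recovered[:, outlier_cols] = outliers
print((weights - recovered).abs().max())  # small quantization error on the non-outliers
```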

To use bitsandbytes, the following libraries must be installed:

<hfoptions id="bnb">
<hfoption id="8-bit">

```bash
pip install transformers accelerate bitsandbytes>0.37.0
```

</hfoption>
<hfoption id="4-bit">

```bash
pip install bitsandbytes>=0.39.0
pip install --upgrade accelerate transformers
```

</hfoption>
</hfoptions>

Now you can quantize a model by passing a `BitsAndBytesConfig` to the [`~PreTrainedModel.from_pretrained`] method. This works with any model that can be loaded with Accelerate and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

๋ชจ๋ธ์„ 8๋น„ํŠธ๋กœ ์–‘์žํ™”ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์–ด๋“ค๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ `device_map="auto"`๋ฅผ ์„ค์ •ํ•˜์—ฌ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Review comment from @Jwaminju (Contributor, Aug 5, 2024). Suggested change: translate "large model" as 대규모 모델 rather than 대형 모델, since llm is rendered as 대규모 언어 모델 elsewhere; the spacing was also fixed.


```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    quantization_config=quantization_config
)
```

๊ธฐ๋ณธ์ ์œผ๋กœ `torch.nn.LayerNorm`๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ๋ชจ๋“ˆ์€ `torch.float16`์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค. ์›ํ•˜๋ฉด `torch_dtype` ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์ด ๋ชจ๋“ˆ์˜ ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
```

๋ชจ๋ธ์ด 8๋น„ํŠธ๋กœ ์–‘์žํ™”๋˜๋ฉด ์ตœ์‹  ๋ฒ„์ „์˜ Transformers์™€ bitsandbytes๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ํ•œ ์–‘์žํ™”๋œ ๊ฐ€์ค‘์น˜๋ฅผ Hub์— ํ‘ธ์‹œํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ตœ์‹  ๋ฒ„์ „์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, [`~PreTrainedModel.push_to_hub`] ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ 8๋น„ํŠธ ๋ชจ๋ธ์„ Hub์— ํ‘ธ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–‘์žํ™” ๊ตฌ์„ฑ ํŒŒ์ผ(config.json)์ด ๋จผ์ € ํ‘ธ์‹œ๋˜๊ณ , ๊ทธ ๋‹ค์Œ ์–‘์žํ™”๋œ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๊ฐ€ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.

Review comment (Contributor). Suggested change: use [`~PreTrainedModel.push_to_hub`] 메소드 rather than 방법, since "method" is translated as 메소드 in the glossary.


```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

model.push_to_hub("bloom-560m-8bit")
```

</hfoption>
<hfoption id="4-bit">

๋ชจ๋ธ์„ 4๋น„ํŠธ๋กœ ์–‘์žํ™”ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด 4๋ฐฐ ์ค„์–ด๋“ค๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ `device_map="auto"`๋ฅผ ์„ค์ •ํ•˜์—ฌ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Review comment (Contributor). Suggested change: 대규모 모델 instead of 대형 모델, following the glossary, where llm is translated as 대규모 언어 모델.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    quantization_config=quantization_config
)
```

๊ธฐ๋ณธ์ ์œผ๋กœ torch.nn.LayerNorm๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ๋ชจ๋“ˆ์€ `torch.float16`์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค. ์›ํ•˜๋ฉด `torch_dtype` ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์ด ๋ชจ๋“ˆ์˜ ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Review comment (Contributor), referencing https://github.com/huggingface/transformers/pull/32408/files#r1703434051: FYI, once the wording above is decided, it would be good to unify this part as well.


```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
```

If you're using `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them to the Hugging Face Hub. Simply call `model.push_to_hub()` after loading the model in 4-bit precision. You can also save the serialized 4-bit model locally with `model.save_pretrained()`.
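
As a minimal sketch (the repository name `bloom-560m-4bit` below is illustrative, not an existing repo):

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=quantization_config
)

# Save the serialized 4-bit weights locally...
model_4bit.save_pretrained("bloom-560m-4bit")
# ...or push them to the Hub (requires bitsandbytes>=0.41.3 and a logged-in account)
model_4bit.push_to_hub("bloom-560m-4bit")
```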

</hfoption>
</hfoptions>

<Tip warning={true}>


Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.

</Tip>

Check your memory footprint with the `get_memory_footprint` method:


```py
print(model.get_memory_footprint())
```

์–‘์žํ™”๋œ ๋ชจ๋ธ์€ [`~PreTrainedModel.from_pretrained`]๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ `load_in_8bit` ๋˜๋Š” `load_in_4bit` ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ง€์ •ํ•˜์ง€ ์•Š๊ณ ๋„ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:


```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")
```

## 8-bit (LLM.int8() algorithm)[[8-bit-(llm.int8()-algorithm)]]

<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

</Tip>

This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning.

### Offloading[[offloading]]

8๋น„ํŠธ ๋ชจ๋ธ์€ CPU์™€ GPU ๊ฐ„์— ๊ฐ€์ค‘์น˜๋ฅผ ์˜คํ”„๋กœ๋“œํ•˜์—ฌ ๋งค์šฐ ํฐ ๋ชจ๋ธ์„ ๋ฉ”๋ชจ๋ฆฌ์— ์žฅ์ฐฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. CPU๋กœ ์ „์†ก๋œ ๊ฐ€์ค‘์น˜๋Š” ์‹ค์ œ๋กœ **float32**๋กœ ์ €์žฅ๋˜๋ฉฐ 8๋น„ํŠธ๋กœ ๋ณ€ํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) ๋ชจ๋ธ์˜ ์˜คํ”„๋กœ๋“œ๋ฅผ ํ™œ์„ฑํ™”ํ•˜๋ ค๋ฉด [`BitsAndBytesConfig`]๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”:

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
```

Design a custom device map to fit everything on your GPU except for the `lm_head`, which is dispatched to the CPU:

```py
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
```

Now load your model with the custom `device_map` and `quantization_config`:

```py
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
```

### ์ด์ƒ์น˜ ์ž„๊ณ—๊ฐ’[[outlier-threshold]]

"์ด์ƒ๊ฐ’"์€ ํŠน์ • ์ž„๊ณ—๊ฐ’์„ ์ดˆ๊ณผํ•˜๋Š” ํžˆ๋“  ์ƒํƒœ ๊ฐ’์„ ์˜๋ฏธํ•˜๋ฉฐ, ์ด๋Ÿฌํ•œ ๊ฐ’์€ fp16์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. ๊ฐ’์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์ •๊ทœ ๋ถ„ํฌ([-3.5, 3.5])๋ฅผ ๋”ฐ๋ฅด์ง€๋งŒ, ๋Œ€ํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์ด ๋ถ„ํฌ๋Š” ๋งค์šฐ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค([-60, 6] ๋˜๋Š” [6, 60]). 8๋น„ํŠธ ์–‘์žํ™”๋Š” ~5 ์ •๋„์˜ ๊ฐ’์—์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๊ทธ ์ด์ƒ์—์„œ๋Š” ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ข‹์€ ๊ธฐ๋ณธ ์ž„๊ณ—๊ฐ’ ๊ฐ’์€ 6์ด์ง€๋งŒ, ๋” ๋ถˆ์•ˆ์ •ํ•œ ๋ชจ๋ธ(์†Œํ˜• ๋ชจ๋ธ ๋˜๋Š” ๋ฏธ์„ธ ์กฐ์ •)์—๋Š” ๋” ๋‚ฎ์€ ์ž„๊ณ—๊ฐ’์ด ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ์ž„๊ณ—๊ฐ’์„ ์ฐพ์œผ๋ ค๋ฉด [BitsAndBytesConfig]์—์„œ `llm_int8_threshold` ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‹คํ—˜ํ•ด๋ณด๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค:

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_threshold=10,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=quantization_config,
)
```

### Skip module conversion[[skip-module-conversion]]

For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit, and doing so can actually cause instability. For Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
```

### Finetuning[[finetuning]]

With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because the model is automatically loaded on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference).

Review comment (Contributor). Suggested change based on the glossary: 모델을 자동으로 GPU에 가져옵니다 instead of 모델이 자동으로 GPU에 로드됩니다.
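
As a short sketch of what such a PEFT setup could look like (the LoRA hyperparameters below are illustrative assumptions, not values from this guide):

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

# Freeze the quantized base weights and prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters; these are the *extra* parameters that
# 8-bit/4-bit training supports (hyperparameters here are illustrative)
peft_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```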


## 4-bit (QLoRA algorithm)[[4-bit-(qlora-algorithm)]]

<Tip>

์ด [๋…ธํŠธ๋ถ](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf)์—์„œ 4๋น„ํŠธ ์–‘์žํ™”๋ฅผ ์‹œ๋„ํ•˜๊ณ  ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์ด [๋ธ”๋กœ๊ทธ ๊ฒŒ์‹œ๋ฌผ](https://huggingface.co/blog/4bit-transformers-bitsandbytes)์—์„œ ํ™•์ธํ•˜์„ธ์š”.


</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.


### ๋ฐ์ดํ„ฐ ์œ ํ˜• ๊ณ„์‚ฐ[[compute-data-type]]

To speed up computation, you can change the data type from float32 (the default) to bf16 with the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:


```py
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)[[normal-float-4-(nf4)]]

NF4 is a 4-bit data type introduced in the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 when training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in [`BitsAndBytesConfig`]:

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"  # example checkpoint, reused from the snippets above

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
```

์ถ”๋ก ์˜ ๊ฒฝ์šฐ, `bnb_4bit_quant_type`์€ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์™€ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด `bnb_4bit_compute_dtype` ๋ฐ `torch_dtype` ๊ฐ’์„ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

### Nested quantization[[nested-quantization]]

Nested quantization is a technique that saves additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits per parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and 4 gradient accumulation steps.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
```

## Dequantizing `bitsandbytes` models[[dequantizing-`bitsandbytes`-models]]

Once quantized, you can dequantize a model back to its original precision, but this may result in a small loss of model quality. Make sure you have enough GPU RAM to fit the dequantized model.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "facebook/opt-125m"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.dequantize()

text = tokenizer("Hello my name is", return_tensors="pt").to(0)

out = model.generate(**text)
print(tokenizer.decode(out[0]))
```