---
slug: chonky-models
title: How do large language models get so large?
description: >
  AI models, made up mainly of floating-point numbers, function by processing
  inputs through components like tokenizers and embedding models. They range
  in size from gigabytes to terabytes, with larger parameter counts enhancing
  performance and nuance representation. How do they get so large, though?
image: ./tiger-ship.webp
keywords: [object storage, blob storage, s3, ai, architecture]
authors: [xe]
tags: [object storage, reliability, performance]
---

import InlineCta from "@site/src/components/InlineCta";



<center>
  <small>
    <em>
      A majestic blue tiger riding on a sailing ship. The tiger is very large.
      Image generated using PonyXL.
    </em>
  </small>
</center>
AI models can get pretty darn large. Larger models seem to perform better than
smaller models, but we don’t quite know why. My work MacBook has 64 gigabytes of
RAM, and I’m able to use nearly all of it when I do AI inference. Somehow these
40+ gigabyte blobs of floating-point numbers are able to take a question about
the color of the sky and spit out an answer. At some level this is a miracle of
technology, but how does it work?

Today I’m going to cover what an AI model really is and the parts that make it
up. I’m not going to cover the linear algebra at play or any of the neural
networks involved. Most people want to start with an off-the-shelf model anyway.
{/* truncate */}

## What are AI models made out of?

At its core, an AI model is really just a ball of floating-point numbers that
the input passes through to produce an output. There are two basic kinds of
models: language models and image diffusion models. They’re both very similar,
but they have some different parts.
A text generation model has a few basic parts:

- A tokenizer model that breaks input into pieces of words, grammatical
  separators, and emoji.
- An embedding model that turns the relationships between tokens into a
  “concept”, which is what allows a model to see that “hot” and “warm” are
  similar.
- Token predictor weights, which the embeddings are passed through to determine
  which tokens are most likely to come next.

Note that these are really three individual models
[stacked](https://bojackhorseman.fandom.com/wiki/Vincent_Adultman) on top of
each other, but they only make sense together. You cannot separate them or swap
the parts between models; in practice they run as one pipeline, as in the sketch
below.
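To make that stacking concrete, here’s a minimal sketch of what running the
pipeline looks like with the Hugging Face `transformers` library. The checkpoint
name is a placeholder, not a real model:

```python
# Minimal sketch of the tokenizer -> embeddings -> token predictor pipeline.
# The checkpoint name is hypothetical; substitute whatever model you actually run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "some-org/some-8b-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # text -> token IDs
model = AutoModelForCausalLM.from_pretrained(          # embeddings + predictor weights
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)  # predict the next tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # token IDs -> text
```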
Of all of those, the token predictor weights are the biggest part. The number of
“parameters” a language model has refers to the number of floating-point numbers
in the token predictor weights. An 8 billion parameter language model has 8
billion floating-point parameters.

An image diffusion model has most of the same parts as a language model:

- A tokenizer to take your input and break it into pieces of words, grammatical
  separators, and emoji.
- An embedding model to turn those tokens into a latent space, a kind of
  embedding that works better for latent diffusion.
- A de-noising model (UNet) that gradually removes noise from the latent space
  to make the image reveal itself.
- A Variational AutoEncoder (VAE) that decodes the latent space into the final
  image.

Most of the time, a single model (such as Stable Diffusion XL, PonyXL, or a
finetune) will include all four of these models in a single `.safetensors`
file, which you load as one unit, as in the sketch below.
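Here’s a rough sketch of what loading one of those single-file checkpoints looks
like with the `diffusers` library. The file path is hypothetical; the point is
that the tokenizer, embedding model, de-noiser, and VAE all come out of the one
file:

```python
# Sketch: load a single .safetensors checkpoint that bundles all four parts,
# then generate an image from a text prompt.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "./ponyxl-finetune.safetensors",  # hypothetical local checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(prompt="a majestic blue tiger riding on a sailing ship").images[0]
image.save("tiger-ship.png")
```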
In the earlier days of Stable Diffusion 1.5, you usually got massive quality
gains by swapping out the VAE model for a variant that worked best for you (one
did anime-style images better, one was optimized for making hands look correct,
one was optimized for a specific kind of pastel watercolor style, and so on).
Stable Diffusion XL and later largely made the VAE that’s baked into the model
good enough that you no longer need to care.
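If you are still working with a Stable Diffusion 1.5-era checkpoint, swapping
the VAE is a small change in code. This is a sketch assuming the `diffusers`
library; the checkpoint path and VAE repository name are hypothetical:

```python
# Sketch: replace the VAE baked into an SD 1.5-era checkpoint with a standalone one.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "./sd15-finetune.safetensors", torch_dtype=torch.float16  # hypothetical checkpoint
)
pipe.vae = AutoencoderKL.from_pretrained(
    "some-org/better-vae", torch_dtype=torch.float16  # hypothetical VAE repo
)
pipe = pipe.to("cuda")
```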
Of these stacked models, the de-noising model is the one whose size gets cited.
The tokenizer, embedding model, and variational autoencoder are extras on the
side. Stable Diffusion XL has 6.6 billion parameters and Flux [dev] has 12
billion parameters in their de-noising models. The other models fit into about
5-10% of the model size and are not counted in the number of parameters, but
they do contribute to the final model size.
We currently believe that the more parameters a model has, the more accurately
it can represent nuance. This generally means that a 70 billion parameter
language model is able to handle tasks that an 8 billion parameter language
model can’t, or that it will do the same tasks better than the 8 billion
parameter model would.

Recently, smaller models have been catching up;
[bigger isn't always better](https://www.scientificamerican.com/article/when-it-comes-to-ai-models-bigger-isnt-always-better/).
Bigger models require more compute and introduce performance bottlenecks. The
reality is that people are going to use large models anyway, so we need to
design systems that can handle them.
## Quantization

If you’re lucky enough to have access to high-VRAM GPUs on the cheap, you don’t
need to worry about quantization. Quantization is a form of compression where
you take a model’s floating-point weights and convert them to a smaller number
format, such as converting a 70 billion parameter model with 140 gigabytes of
float16 parameters (16-bit floating-point numbers) into 35 gigabytes of 4-bit
parameters (Q4). This is a lossy operation, but it will save precious gigabytes
in your Docker images and let bigger models fit onto smaller GPUs.
:::note

When you read a model quantization level like `Q4` or `fp16`/`float16`, you can
interpret it like this:

| Initial Letter        | Number | What it means                                                                                     |
| :-------------------- | :----- | :------------------------------------------------------------------------------------------------ |
| `Q` or `I`            | `4`    | Four-bit integers                                                                                   |
| `f`, `fp`, or `float` | `16`   | 16-bit [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating-point numbers (half-precision)   |

:::
Using quantization is a tradeoff between the amount of video memory (GPU RAM)
you have and the quality your task requires. A 70B model at Q4 quantization will
lose some quality compared to running it at the full float16 precision, but you
can run that 70B model on a single GPU instead of needing two to four GPUs to
get it running.
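If you’re loading a quantized language model through `transformers`, one common
route is 4-bit loading with `bitsandbytes`. This sketch assumes you have
`bitsandbytes` installed and a suitable GPU; the checkpoint name is a
placeholder:

```python
# Sketch: load a language model with its weights quantized to 4 bits on the fly.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits at load time
    bnb_4bit_compute_dtype=torch.float16,  # do the math in half precision
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-70b-model",  # hypothetical checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```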
Most of the time you won’t need to quantize image diffusion models to get them
running (with some exceptions for getting Flux \[dev\] running on low-end
consumer GPUs). This is something that is almost exclusively done with language
models.

To figure out how much memory a model needs at float16 precision, follow this
rule of thumb:
```
(Number of parameters * size of each parameter) * 1.25
```
This means that an 8 billion parameter model at 16-bit floating-point precision
will take about 20 gigabytes of video memory, but it can use more depending on
the size of your context window.
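Here’s that rule of thumb as a quick back-of-the-envelope calculation in Python,
using the numbers from this post; the 1.25 factor is the same overhead allowance
as above:

```python
# Rule of thumb: (number of parameters * bytes per parameter) * 1.25 overhead.
def vram_estimate_gb(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total * 1.25 / 1e9  # gigabytes, including ~25% overhead

print(vram_estimate_gb(8, 16))   # 8B at float16  -> ~20 GB
print(vram_estimate_gb(70, 16))  # 70B at float16 -> ~175 GB
print(vram_estimate_gb(70, 4))   # 70B at Q4      -> ~44 GB, fits on one big GPU
```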
## Where to store them

The bigger your AI model, the larger the weights will be.

AI models are big blobs of data (model weights and overhead) that need to be
loaded into GPU memory for use. Most of the time, the runtimes for AI models
want the bytes of the model to be present on disk before they load them. This
raises the question: where do I store these things?
There are several options that people use in production:

- Git LFS, such as with HuggingFace.
- Putting the model weights into object storage (like Tigris) and downloading
  them when the application starts up.
- Putting the model weights into dedicated layers of your Docker images (such as
  with [depot.ai](https://depot.ai/)).
- Mounting a remote filesystem that already has the models in it and using that
  directly.
All of these have their own pros and cons. Git LFS is mature, but if you want to
run it on your own hardware, it requires you to set up a dedicated git forge
such as Gitea. Using a remote filesystem can lock you into the provider’s
implementation of that filesystem (such as with AWS Elastic File System).
Putting model weights into your Docker images can increase extraction time and
can run into the size limits of your Docker registry of choice. When using
Tigris (or another object store), you’ll need to either download the model
weights to disk on startup or set up a performant shared filesystem like
[GeeseFS](https://www.tigrisdata.com/docs/training/geesefs-linux/).
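As one example of the download-on-startup approach, here’s a rough sketch using
`boto3` against an S3-compatible bucket. The bucket name, object key, local
path, and endpoint variable are placeholders; Tigris and other object stores
speak the same S3 API, so the same code works with the right endpoint and
credentials:

```python
# Sketch: pull model weights from an S3-compatible bucket before loading them.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_ENDPOINT_URL_S3"],  # e.g. your Tigris endpoint
)

local_path = "/models/model.safetensors"
if not os.path.exists(local_path):  # only download on a cold start
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(
        Bucket="my-model-bucket",        # hypothetical bucket
        Key="models/model.safetensors",  # hypothetical object key
        Filename=local_path,
    )
```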
Keep all this in mind as you’re comparing options.

## In summary

We’ve spent a lot of time as an industry thinking about the efficiency of Docker
builds and moving code around as immutable artifacts. AI models have the same
classic problems, but with much larger artifacts. Many systems are designed
under the assumption that your images are under an undocumented “reasonable”
size limit, probably less than 140 gigabytes of floating-point numbers.
Don’t feel bad if your system is struggling to keep up with the rapid rate of
image growth. It wasn’t designed to deal with the problems we have today, so we
get to build with new tools. In a pinch, though, shoving your model weights into
a Docker image will work out just fine if you’re dealing with models of 8
billion parameters at Q4 quantization or less.

Above that threshold, you’ll need to chunk your models into smaller pieces
[like upstream models do](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B/tree/main).
If that’s a problem, you can store your models in Tigris. We’ll handle any large
files for you without filetype restrictions or restrictive limits. Our file size
limit is 5 terabytes. If your model is bigger than 5 terabytes, please get in
touch with us. We would love to know how we can help.
<InlineCta
  title={"Want to try it out?"}
  subtitle={
    "Make a global bucket with no egress fees and store all your models all over the world."
  }
  button={"Get Started"}
/>