---
slug: chonky-models
title: How do large language models get so large?
description: >
AI models, comprised mainly of floating-point numbers, function by processing
inputs through various components like tokenizers and embedding models. They
range in size from gigabytes to terabytes, with larger parameter counts
enhancing performance and nuance representation. How do they get so large
though?
image: ./tiger-ship.webp
keywords: [object storage, blob storage, s3, ai, architecture]
authors: [xe]
tags: [object storage, reliability, performance]
---

import InlineCta from "@site/src/components/InlineCta";

![A majestic blue tiger riding on a sailing ship. The tiger is very large.](./tiger-ship.webp)

<center>
<small>
<em>
A majestic blue tiger riding on a sailing ship. The tiger is very large.
Image generated using PonyXL.
</em>
</small>
</center>

AI models can get pretty darn large. Larger models seem to perform better than
smaller models, but we don’t quite know why. My work MacBook has 64 gigabytes of
RAM and I’m able to use nearly all of it when I do AI inference. Somehow these
40+ gigabyte blobs of floating-point numbers are able to take a question about
the color of the sky and spit out an answer. At some level this is a miracle of
technology, but how does it work?

Today I’m going to cover what an AI model really is and the parts that make it
up. I’m not going to cover the linear algebra at play or any of the neural
networks. Most people want to start with an off-the-shelf model, anyway.

{/* truncate */}

## What are AI models made out of?

At its core, an AI model is really just a ball of floating-point numbers that
the input goes through to get an output. There are two basic kinds of models:
language models and image diffusion models. They’re both very similar, but they
have some different parts.

A text generation model has a few basic parts:

- A tokenizer model to break input into pieces of words, grammatical separators,
and emoji.
- An embedding model to take the frequencies of relationships between tokens and
generate the “concept”, which is what allows a model to see that “hot” and
“warm” are similar.
- Token predictor weights, which the embeddings are passed through in order to
determine which tokens are most likely to come next.

Note that these are really three individual models
[stacked](https://bojackhorseman.fandom.com/wiki/Vincent_Adultman) on top of
each other, but they only make sense together. You cannot separate them or
exchange their parts.

Of all of those, the token predictor weights are the biggest part. The number of
“parameters” a language model has refers to the number of floating-point numbers
in the token predictor weights. An 8 billion parameter language model has 8
billion floating-point parameters.
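
To make that concrete, here’s a minimal sketch of the three-part stack using the
Hugging Face `transformers` library. The model name is just a small example; any
causal language model works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; swap in any causal language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. The tokenizer breaks the input into token IDs.
inputs = tokenizer("Why is the sky blue?", return_tensors="pt")

# 2. The embedding layer turns those token IDs into vectors ("concepts").
embeddings = model.get_input_embeddings()(inputs["input_ids"])

# 3. The token predictor weights turn those vectors into next-token scores.
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocabulary size)

# The "parameter count" people quote is roughly this number:
print(sum(p.numel() for p in model.parameters()))
```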

An image diffusion model has most of the same parts as a language model:

- A tokenizer to take your input and break it into pieces of words, grammatical
separators, and emoji.
- An embedding model to turn those tokens into a latent space, a kind of
embedding that works better for latent diffusion.
- A de-noising model (unet) that gradually removes noise from the latent space
to make the image reveal itself.
- A Variational AutoEncoder (VAE) that is used to encode a latent space into an
image.

Most of the time, a single model (such as Stable Diffusion XL, PonyXL, or a
finetune) will include all four of these models in one single `.safetensors`
file.
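
If you want to see those parts for yourself, here’s a minimal sketch using the
`diffusers` library and Stable Diffusion XL (the model name is just an example;
downloading it pulls several gigabytes of weights):

```python
from diffusers import StableDiffusionXLPipeline

# Loading the single checkpoint gives you all of the stacked models at once.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

print(type(pipe.tokenizer))     # breaks the prompt into tokens
print(type(pipe.text_encoder))  # turns the tokens into an embedding
print(type(pipe.unet))          # de-noises the latent space
print(type(pipe.vae))           # decodes the latent space into pixels
```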

In the earlier days of Stable Diffusion 1.5, you usually got massive quality
gains by swapping out the VAE model for a variant that worked best for you (one
did anime-style images better, one was optimized for making hands look correct,
one was optimized for a specific kind of pastel watercolor style, and so on).
Stable Diffusion XL and later models largely made the baked-in VAE good enough
that you no longer need to care.

Of these four stacked models, the de-noising model is the one whose size gets
cited; the tokenizer, embedding model, and variational autoencoder are extras on
the side. Stable Diffusion XL has 6.6 billion parameters and Flux [dev] has 12
billion parameters in their de-noising models. The other models fit into about
5-10% of the model size and are not counted in the parameter number, but they do
contribute to the final model size.

We currently believe that the more parameters a model has, the more accurately
it can represent nuance. This generally means that a 70 billion parameter
language model is able to handle tasks that an 8 billion parameter language
model can’t, or that it will do the same tasks better than the smaller model
would.

Recently, though, smaller models have been catching up;
[bigger isn't always better](https://www.scientificamerican.com/article/when-it-comes-to-ai-models-bigger-isnt-always-better/).
Bigger models require more compute and introduce performance bottlenecks. The
reality is that people are going to use large models anyway, so we need to
design systems that can handle them.

## Quantization

If you’re lucky enough to have cheap access to high-VRAM GPUs, you don’t need to
worry about quantization. Quantization is a form of compression where you take a
model’s floating-point weights and convert them to a smaller numeric format,
such as converting a 70 billion parameter model with 140 gigabytes of float16
parameters (16-bit floating-point numbers) into 35 gigabytes of 4-bit parameters
(Q4). This is a lossy operation, but it will save precious gigabytes from your
Docker images and let bigger models fit onto smaller GPUs.

:::note

When you read a model quantization level like `Q4` or `fp16`/`float16`, you can
interpret it like this:

| Initial Letter | Number | What it means |
| :-------------------- | :----- | :----------------------------------------------------------------------------------------------- |
| `Q` or `I` | `4` | Four bit integers |
| `f`, `fp`, or `float` | `16` | 16 bit [IEEE754](https://en.wikipedia.org/wiki/IEEE_754) floating-point numbers (half-precision) |

:::

Using quantization is a tradeoff between the amount of video memory (GPU RAM)
you have and the desired task. A 70B model at Q4 quantization will lose some
quality compared to running it at full float16 precision, but you can run that
70B model on a single GPU instead of needing two to four GPUs to get it
running.
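
As an example, here’s a minimal sketch of loading a language model and
quantizing it to 4 bits on the fly with `transformers` and `bitsandbytes`,
assuming you have CUDA GPUs available:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the full-size weights down to 4 bits as they're loaded.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Hermes-3-Llama-3.1-70B",  # the 70B model linked later in this post
    quantization_config=quant_config,
    device_map="auto",  # spread the layers across whatever GPUs you have
)
```

Pre-quantized formats such as GGUF skip the on-the-fly step entirely; runtimes
like llama.cpp read the 4-bit weights straight from disk.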

Most of the time you won’t need to quantize image diffusion models to get them
running (with some exceptions for getting Flux \[dev\] running on low-end
consumer GPUs). This is something that is almost exclusively done with language
models.

In order to figure out how much memory a model needs at float16 quantization,
follow this rule of thumb:

```
(Number of parameters * size of each parameter) * 1.25
```

This means that an 8 billion parameter model at 16 bit floating point precision
will take about 20 gigabytes of video memory, but can use more depending on the
size of your context window.
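
Here’s that rule of thumb as a tiny function. This is a rough sketch; the 1.25
multiplier covers runtime overhead like the KV cache, and real usage varies with
your context window:

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough VRAM estimate in gigabytes: weights plus ~25% runtime overhead."""
    weight_gb = params_billions * (bits_per_param / 8)  # billions of params * bytes each ≈ GB
    return weight_gb * 1.25

print(estimate_vram_gb(8, 16))  # ~20 GB: an 8B model at float16
print(estimate_vram_gb(70, 4))  # ~44 GB: a 70B model at Q4
```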

## Where to store them

The bigger your AI model, the larger the weights will be.

AI models are big blobs of data (model weights and overhead) that need to be
loaded into GPU memory for use. Most of the time, the runtimes for AI models
want the bytes for the model to be present on the disk before they load them.
This raises the question of “Where do I store these things?”

There’s several options that people use in production:

- Git LFS, such as with Hugging Face.
- Putting the model weights into object storage (like Tigris) and downloading
them when the application starts up.
- Putting the model weights into dedicated layers of your docker images (such as
with [depot.ai](https://depot.ai/)).
- Mounting a remote filesystem that already has the models in it and using that directly.

All of these have their own pros and cons. Git LFS is mature, but if you want to
run it on your own hardware, it requires you to set up a dedicated git forge
program such as Gitea. Using a remote filesystem can lock you into the
provider’s implementation of that filesystem (such as with AWS Elastic
FileSystem). Putting model weights into your Docker images can increase
extraction time and can exceed the size limits of your Docker registry of
choice.
When using Tigris (or another object store), you'll need to either download the
model weights to disk on startup or set up a performant shared filesystem like
[GeeseFS](https://www.tigrisdata.com/docs/training/geesefs-linux/).
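
For the object storage route, the startup step can be as simple as an S3
download before your inference server boots. Here’s a minimal sketch with
`boto3`; the bucket name and object key are hypothetical, and credentials come
from the usual AWS environment variables:

```python
import boto3

# Tigris speaks the S3 API, so any S3 client works against its endpoint.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

# Pull the weights down to local disk before handing them to your runtime.
s3.download_file(
    Bucket="my-models",                # hypothetical bucket
    Key="hermes-3-70b/model-q4.gguf",  # hypothetical object key
    Filename="/models/model-q4.gguf",
)
```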

Keep all this in mind as you’re comparing options.

## In summary

We’ve spent a lot of time as an industry thinking about the efficiency of Docker
builds and moving code around as immutable artifacts. AI models have the same
classic problems, but with larger artifact size. Many systems are designed under
the assumption that your images are under an undocumented “reasonable” size
limit, probably less than 140 gigabytes of floating-point numbers.

Don’t feel bad if your system is struggling to keep up with the rapid rate of
image growth. It wasn’t designed to deal with the problems we have today, so we
get to build with new tools. However, in a pinch, shoving your model weights
into a Docker image will work out just fine if you’re dealing with 8 billion
parameter models at Q4 quantization or smaller.

Above that threshold, you’ll need to chunk up your models into smaller pieces
[like upstream models do](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B/tree/main).
If that’s a problem, you can store your models in Tigris. We’ll handle any large
files for you without filetype restrictions or restrictive limits. Our filesize
limit is 5 terabytes. If your model is bigger than 5 terabytes, please get in
touch with us. We would love to know how we can help.

<InlineCta
title={"Want to try it out?"}
subtitle={
"Make a global bucket with no egress fees and store all your models all over the world."
}
button={"Get Started"}
/>