---
slug: chonky-models
title: How do large language models get so large?
description: >
  AI models, composed mainly of floating-point numbers, function by processing
  inputs through various components like tokenizers and embedding models. They
  range in size from gigabytes to terabytes, with larger parameter counts
  enhancing performance and nuance representation. How do they get so large
  though?
image: ./tiger-ship.webp
keywords: [object storage, blob storage, s3, ai, architecture]
authors: [xe]
tags: [object storage, reliability, performance]
---

import InlineCta from "@site/src/components/InlineCta";

![A majestic blue tiger riding on a sailing ship. The tiger is very large.](./tiger-ship.webp)

*A majestic blue tiger riding on a sailing ship. The tiger is very large. Image
generated using PonyXL.*

AI models can get pretty darn large. Larger models seem to perform better than
smaller models, but we don’t quite know why. My work MacBook has 64 gigabytes of
RAM and I’m able to use nearly all of it when I do AI inference. Somehow these
40+ gigabyte blobs of floating-point numbers are able to take a question about
the color of the sky and spit out an answer. At some level this is a miracle of
technology, but how does it work?

Today I’m going to cover what an AI model really is and the parts that make it
up. I’m not going to cover the linear algebra at play or the neural networks
involved. Most people want to start with an off-the-shelf model, anyway.

{/* truncate */}

## What are AI models made out of?

At the core, an AI model is really just a ball of floating-point numbers that
the input goes through to get an output. There are two basic kinds of models:
language models and image diffusion models. They’re both very similar, but they
have some different parts.

A text generation model has a few basic parts:

- A tokenizer model to break input into pieces of words, grammatical separators,
  and emoji.
- An embedding model that turns tokens into vectors capturing the relationships
  between them (the “concept”), which is what allows a model to see that “hot”
  and “warm” are similar.
- Token predictor weights, which the embeddings are passed through in order to
  determine which tokens are most likely to come next.

Note that these are really three individual models
[stacked](https://bojackhorseman.fandom.com/wiki/Vincent_Adultman) on top of
each other, but they only make sense together. You cannot separate them nor
exchange the parts.

Of all of those, the token predictor weights are the biggest part. The number of
“parameters” a language model has refers to the number of floating-point numbers
in the token predictor weights. An 8 billion parameter language model has 8
billion floating-point parameters.

An image diffusion model has most of the same parts as a language model:

- A tokenizer to take your input and break it into pieces of words, grammatical
  separators, and emoji.
- An embedding model to turn those tokens into a latent space, a kind of
  embedding that works better for latent diffusion.
- A de-noising model (U-Net) that gradually removes noise from the latent space
  to make the image reveal itself.
- A Variational AutoEncoder (VAE) that is used to decode the latent space into
  an image.

Most of the time, a single model (such as Stable Diffusion XL, PonyXL, or a
finetune) will include all four of these models in one single `.safetensors`
file.

In the earlier days of Stable Diffusion 1.5, you usually got massive quality
gains by swapping out the VAE model for a variant that worked best for you (one
did anime-style images better, one was optimized for making hands look correct,
one was optimized for a specific kind of pastel watercolor style, etc.). Stable
Diffusion XL and later largely made the VAE that’s baked into the model good
enough that you no longer needed to care.

Of the four stacked models, the de-noising model is the one whose size gets
cited. The tokenizer, embedding model, and variational autoencoder are extras on
the side. Stable Diffusion XL has 6.6 billion parameters and Flux \[dev\] has 12
billion parameters in their de-noising models; the other models add roughly
5-10% on top of that. They aren’t counted in the headline parameter count, but
they do contribute to the final model size.
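
Coming back to language models for a second: if you want to see these stacked
parts for yourself, here’s a rough sketch using the Hugging Face `transformers`
library. It assumes you have `transformers` and `torch` installed and enough RAM
for whatever model you point it at; the model name below is only an example, so
swap in one you actually use.

```python
# A rough sketch of poking at the stacked parts of a language model with
# Hugging Face transformers. Assumes `pip install transformers torch`, enough
# RAM for the model you pick, and that the model name is just an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any causal LM works

# Part 1: the tokenizer breaks text into pieces of words, separators, and emoji.
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_ids = tokenizer("Why is the sky blue?")["input_ids"]
print(token_ids)  # a list of integer token IDs

# Parts 2 and 3: the embedding layer and the token predictor weights both live
# inside the model object.
model = AutoModelForCausalLM.from_pretrained(model_name)
embedding_params = model.get_input_embeddings().weight.numel()
total_params = sum(p.numel() for p in model.parameters())
print(f"embedding parameters: {embedding_params:,}")
print(f"total parameters:     {total_params:,}")
```

The total printed at the end is exactly the “parameters” figure that names like
7B and 70B are advertising.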

We currently believe that the more parameters a model has, the more accurately
it can represent nuance. This generally means that a 70 billion parameter
language model is able to handle tasks that an 8 billion parameter language
model can’t, or that a 70 billion parameter language model will be able to do
tasks better than an 8 billion parameter language model.

Recently, smaller models have been catching up;
[bigger isn't always better](https://www.scientificamerican.com/article/when-it-comes-to-ai-models-bigger-isnt-always-better/).
Bigger models require more compute and introduce performance bottlenecks. The
reality is that people are going to use large models, so we need to design
systems that can handle them.

## Quantization

If you’re lucky enough to have access to high-VRAM GPUs on the cheap, you don’t
need to worry about quantization. Quantization is a form of compression where
you take a model’s floating-point weights and convert them to a smaller number
format, such as converting a 70 billion parameter model with 140 gigabytes of
float16 parameters (16-bit floating-point numbers) into 35 gigabytes of 4-bit
parameters (Q4). This is a lossy operation, but it will save precious gigabytes
in your Docker images and let bigger models fit into smaller GPUs.

:::note

When you read a model quantization level like `Q4` or `fp16`/`float16`, you can
interpret it like this:

| Initial Letter | Number | What it means |
| :-------------------- | :----- | :----------------------------------------------------------------------------------------------- |
| `Q` or `I` | `4` | Four-bit integers |
| `f`, `fp`, or `float` | `16` | 16-bit [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating-point numbers (half-precision) |

:::

Using quantization is a tradeoff between the amount of video memory (GPU RAM)
you have and the desired task. A 70B model at Q4 quantization will have a loss
in quality compared to running it at full float16 precision, but you can run
that 70B model on a single GPU instead of needing two to four GPUs to get it
running.

Most of the time you won’t need to quantize image diffusion models to get them
running (with some exceptions for getting Flux \[dev\] running on low-end
consumer GPUs). This is something that is almost exclusively done with language
models.

To figure out how much memory a model needs at float16 precision, follow this
rule of thumb:

```
(Number of parameters * size of each parameter) * 1.25
```

This means that an 8 billion parameter model at 16-bit floating-point precision
will take about 20 gigabytes of video memory (8 billion parameters × 2 bytes ×
1.25), but it can use more depending on the size of your context window.

## Where to store them

The bigger your AI model, the larger the weights will be.

AI models are big blobs of data (model weights and overhead) that need to be
loaded into GPU memory for use. Most of the time, the runtimes for AI models
want the bytes for the model to be present on disk before they load them. This
raises the question of “Where do I store these things?”

There are several options that people use in production:

- Git LFS, such as with Hugging Face.
- Putting the model weights into object storage (like Tigris) and downloading
  them when the application starts up (see the sketch after this list).
- Putting the model weights into dedicated layers of your Docker images (such as
  with [depot.ai](https://depot.ai/)).
- Mounting a remote filesystem that already has the models on it and using that
  directly.
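
To make the object storage option concrete, here’s a minimal sketch of the
“download on startup” approach using `boto3`. The bucket name, prefix, and local
directory are hypothetical placeholders; the endpoint URL assumes Tigris’s
S3-compatible API, and credentials are expected in the standard AWS environment
variables.

```python
# A minimal sketch of the "download the weights from object storage on startup"
# option, using boto3 against an S3-compatible endpoint. The bucket, prefix,
# and local directory are placeholders; credentials come from the standard
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://fly.storage.tigris.dev",  # S3-compatible endpoint
)

bucket = "my-model-weights"          # hypothetical bucket
prefix = "llama-3.1-8b-instruct/"    # hypothetical prefix holding the weight shards
local_dir = "/models"

# Pull every object under the prefix down to local disk before inference starts.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        dest = os.path.join(local_dir, key)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, key, dest)

print(f"model weights downloaded to {local_dir}")
```

From there, your inference runtime can load the weights off local disk like it
would any other file.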

All of these have their own pros and cons. Git LFS is mature, but if you want to
run it on your own hardware, it requires you to set up a dedicated Git forge
such as Gitea. Using a remote filesystem can lock you into the provider’s
implementation of that filesystem (such as with AWS Elastic File System).
Putting model weights into your Docker images can cause extraction time to
increase and can go over the limits of your Docker registry of choice. When
using Tigris (or another object store), you'll need to either download the model
weights to disk on startup or set up a performant shared filesystem like
[GeeseFS](https://www.tigrisdata.com/docs/training/geesefs-linux/).

Keep all this in mind as you’re comparing options.

## In summary

We’ve spent a lot of time as an industry thinking about the efficiency of Docker
builds and moving code around as immutable artifacts. AI models have the same
classic problems, but with larger artifact sizes. Many systems are designed
under the assumption that your images are under an undocumented “reasonable”
size limit, probably less than 140 gigabytes of floating-point numbers.

Don’t feel bad if your system is struggling to keep up with the rapid rate of
image growth. It wasn’t designed to deal with the problems we have today, so we
get to build with new tools. However, in a pinch, shoving your model weights
into a Docker image will work out just fine if you’re dealing with 8 billion
parameter models at Q4 quantization or less.

Above that threshold, you’ll need to chunk up your models into smaller pieces
[like upstream models do](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B/tree/main).
If that’s a problem, you can store your models in Tigris. We’ll handle any large
files for you without filetype restrictions or restrictive limits. Our file size
limit is 5 terabytes. If your model is bigger than 5 terabytes, please get in
touch with us. We would love to know how we can help.