Tokenization is the process of taking a sequence of text and breaking it into units called tokens. You can think of tokens as being words, but in general they can be parts of words.
Tokens are generally then converted to "token IDs" that are integer encodings of the tokens.
Example:
text = "Hello, world! This is tokenization."
tokens = ["<start>", "Hello", ",", " ", "world", "!", " ", "This", " ", "is", " ", "token", "iza", "tion", ".", "<end>"]
token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ... ]
- Tokenization is basically a map from word parts to integers.
- It is important to note that tokenization depends on the vocabulary used to build the map.
- A given tokenizer therefore may not support every language; the language needs to be represented in the vocabulary.
- A typical vocabulary size is something like $\sim$50,000.
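As a rough sketch of the map, here is a toy example with a small hand-built vocabulary (the specific tokens and IDs are made up for illustration; real tokenizers such as BPE learn the subword vocabulary from a large corpus):

```python
# Toy illustration of tokenization: a hand-built vocabulary mapping
# word pieces to integer IDs. Real tokenizers learn this vocabulary from data.
vocab = {"<start>": 1, "<end>": 2, "Hello": 123, ",": 22, " ": 10,
         "world": 2223, "!": 335, "This": 556, "is": 711,
         "token": 901, "iza": 902, "tion": 903, ".": 55}

def tokenize(tokens):
    """Map a list of string tokens to their integer token IDs."""
    return [vocab[t] for t in tokens]

tokens = ["<start>", "Hello", ",", " ", "world", "!", "<end>"]
print(tokenize(tokens))  # [1, 123, 22, 10, 2223, 335, 2]
```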
See also:
- Tutorial video by Andrej Karpathy: Let's build the GPT Tokenizer
Often tokenization is done in the DataLoader, which also forms batches of the data in the form of a tensor for the model.
To give the input tensor a uniform (rectangular) size, one often needs to pad the sequences to a common max sequence length (MSL).
Often the pad token ID is 0, so a padded sequence would look like
token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ..., 0, 0, 0]
The input tensor shape for language models is often:
[batch_size][max_seq_length] = e.g. [8][256]
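A minimal sketch of the padding and batching step (assuming PyTorch; the pad ID of 0 follows the convention above, and the MSL of 256 is the illustrative value used here):

```python
import torch

PAD_ID = 0
MAX_SEQ_LENGTH = 256  # MSL

def collate(batch_of_token_ids):
    """Pad each sequence to MAX_SEQ_LENGTH and stack into one [batch, MSL] tensor."""
    padded = []
    for ids in batch_of_token_ids:
        ids = ids[:MAX_SEQ_LENGTH]                          # truncate if too long
        ids = ids + [PAD_ID] * (MAX_SEQ_LENGTH - len(ids))  # right-pad with 0
        padded.append(ids)
    return torch.tensor(padded, dtype=torch.long)

batch = [[1, 123, 22, 2223], [1, 10, 335, 556, 10, 2]]
x = collate(batch)
print(x.shape)  # torch.Size([2, 256]) -> [batch_size][max_seq_length]
```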
After tokenization, the next step in a language model is to embed the tokens: a map from the token IDs to vectors in some large space, whose dimension is called the embedding_size.
The tensor shape of the output of the embedding is
[batch_size][max_seq_length][embedding_size] = e.g. [8][256][1280]
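A minimal sketch using PyTorch's nn.Embedding (the vocabulary size and embedding_size are the illustrative values used above):

```python
import torch
import torch.nn as nn

vocab_size = 50_000
embedding_size = 1280

# The embedding is a learned lookup table: row i is the vector for token ID i.
embed = nn.Embedding(vocab_size, embedding_size, padding_idx=0)

token_ids = torch.randint(0, vocab_size, (8, 256))  # [batch_size][max_seq_length]
vectors = embed(token_ids)
print(vectors.shape)  # torch.Size([8, 256, 1280])
```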
After the embedding parameters are trained end-to-end with a model, remarkably, you can give semantic interpretations to some directions in the embedding space. Famously, for example, vector arithmetic such as king - man + woman ≈ queen approximately holds for trained word embeddings.
Another example: a direction in the embedding space correlates with the country-capital relation (e.g., Paris - France ≈ Rome - Italy).
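A sketch of the kind of vector arithmetic involved, using made-up vectors that stand in for trained embeddings (word2vec-style models trained on real text exhibit this behavior):

```python
import numpy as np

# Made-up 3-D vectors standing in for trained word embeddings.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```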
See also:
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
- Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality.
- Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations.
- Olah, C. (2014). Deep learning, NLP, and representations.
- RNNs and LSTMs
  - Olah, C. (2015). Understanding LSTM networks.
- seq2seq
  - Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks.
  - Watershed moment in NLP with deep learning
  - First very successful encoder-decoder based model
- Bahdanau "attention"
  - Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate.
- Google Neural Machine Translation (GNMT)
Chain rule of language modeling (chain rule of probability): the model predicts one conditional factor at a time,
$P(x_t \mid x_1, \ldots, x_{t-1})$,
or for the whole sequence:
$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$
A language model (LM) predicts the next token given the previous context. The output of the model is a vector of logits, which is passed through a softmax to convert it into probabilities for the next token.
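A sketch of how logits, softmax, and the chain rule fit together, assuming a hypothetical `model` that maps a 1-D tensor of context token IDs to a vector of next-token logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    """Sum log P(x_t | x_1..x_{t-1}) over the sequence (chain rule).

    `model(context)` is assumed to return a [vocab_size] vector of logits
    for the next token, given `context` as a 1-D LongTensor of token IDs.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        logits = model(token_ids[:t])              # [vocab_size]
        log_probs = F.log_softmax(logits, dim=-1)  # softmax -> (log) probabilities
        total += log_probs[token_ids[t]].item()    # pick the observed next token
    return total
```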
Auto-regressive inference follows this chain rule. If done with greedy search, the most probable next token is chosen at each step: $\hat{x}_t = \arg\max_{x} P(x \mid x_1, \ldots, x_{t-1})$
Beam search: instead of keeping only the single most probable continuation, keep the top-$k$ partial sequences (the beam) at each step and expand each of them.
- Beam search as used in NLP is described in Sutskever (2014).
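A minimal sketch of greedy autoregressive decoding, under the same hypothetical `model` interface as above (the `end_id` of 2 is an assumed `<end>` token ID):

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens=20, end_id=2):
    """Repeatedly pick the argmax next token and append it to the context."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))     # [vocab_size] logits for the next token
        next_id = int(torch.argmax(logits))   # greedy: most probable token
        ids.append(next_id)
        if next_id == end_id:                 # stop at the assumed <end> token
            break
    return ids
```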
- Vaswani, A. et al. (2017). Attention is all you need.
- Describe architecture
- Describe self-attention
- Note the complexity is $T^2$ in the sequence length $T$
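A minimal sketch of single-head scaled dot-product self-attention, showing where the $T^2$ cost comes from and how a causal (decoder-style) mask restricts each position to earlier positions:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v, causal=True):
    """x: [T, d_model]; w_q/w_k/w_v: [d_model, d_head] projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])   # [T, T] matrix -> quadratic cost in T
    if causal:
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)     # attention weights
    return weights @ v                          # [T, d_head]

T, d_model, d_head = 256, 1280, 64
x = torch.randn(T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([256, 64])
```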
Autoregressive decoding: tokens are generated one at a time, with each new token appended to the context for the next step.
KV-cache: cache the attention keys and values of already-processed tokens, so each decoding step only computes the query, key, and value for the new token (a minimal sketch follows the list below).
Quadratic complexity in sequence length:
- Note a lot of research in reducing the quadratic complexity
- Note a lot of research in extending context length (e.g., llama3 has 8k context)
- Note Mamba claims to have complexity linear in $T$
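A minimal sketch of a single-head KV-cache (the projection matrices and interface are illustrative assumptions): per step, only the new token's query, key, and value are computed, and attention is taken against the cached keys and values.

```python
import math
import torch

class KVCache:
    """Append-only cache of keys and values for one attention head."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, w_q, w_k, w_v):
        """x_t: [d_model] embedding of the newest token only."""
        q = x_t @ w_q                              # [d_head]
        self.keys.append(x_t @ w_k)                # cache grows by one entry per step
        self.values.append(x_t @ w_v)
        K = torch.stack(self.keys)                 # [t, d_head]
        V = torch.stack(self.values)               # [t, d_head]
        scores = K @ q / math.sqrt(K.shape[-1])    # [t] (no full T x T matrix needed)
        weights = torch.softmax(scores, dim=-1)
        return weights @ V                         # attention output for the new token

d_model, d_head = 1280, 64
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
cache = KVCache()
for x_t in torch.randn(5, d_model):                # 5 decoding steps
    out = cache.step(x_t, w_q, w_k, w_v)
print(out.shape)  # torch.Size([64])
```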
Note that there are also variants of transformers that move and/or change the normalization layers. Most transformers now use "pre-layer-norm", unlike the original.
Some transformer models (e.g., llama3) use RMSNorm instead of LayerNorm.
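A minimal RMSNorm module in the spirit of Zhang & Sennrich (2019): normalize by the root mean square of the features (no mean subtraction, unlike LayerNorm) and apply a learned scale. This is a sketch, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x):
        # Normalize by the root mean square over the last (feature) dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(8, 256, 1280)
print(RMSNorm(1280)(x).shape)  # torch.Size([8, 256, 1280])
```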
See also:
- Zhang, B. & Sennrich, R. (2019). Root mean square layer normalization.
- Xiong, R. et al. (2020). On layer normalization in the transformer architecture.
- Phuong, M. & Hutter, M. (2022). Formal algorithms for transformers.
- Tutorial video by Andrej Karpathy: Let's reproduce GPT-2 (124M).
- BERT is encoder-only
- BERT has bidirectional attention
- Pretrained with masked language modeling (MLM); see the masking sketch after this list
- For encoding: sequence to vector; for classification tasks: sequence to class
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
- Liu, Y. et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
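A rough sketch of the MLM input corruption. The 15% masking rate and the [MASK] ID of 103 are the standard BERT choices (the 80/10/10 replacement details of the original recipe are omitted here):

```python
import torch

MASK_ID = 103    # [MASK] token ID in the standard BERT vocabulary
MASK_PROB = 0.15

def mask_tokens(token_ids):
    """Randomly replace ~15% of tokens with [MASK]; the model is trained to
    predict the original tokens at the masked positions."""
    token_ids = token_ids.clone()
    masked_positions = torch.rand(token_ids.shape) < MASK_PROB
    labels = torch.where(masked_positions, token_ids,
                         torch.full_like(token_ids, -100))  # -100 = ignored by the loss
    token_ids[masked_positions] = MASK_ID
    return token_ids, labels

inputs, labels = mask_tokens(torch.randint(1000, 2000, (8, 256)))
```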
- T5 is an encoder-decoder
- Encoder-decoder good for sequence-to-sequence modeling: translation, summarization
- T5 also demonstrated that classification tasks can be done as sequence-to-sequence
- Causal attention
- Recap various attention schemes
- Raffel, C. et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer.
- GPT is decoder-only
- Causal attention
- Decoder-only models like GPT:
- Falcon (TII, open)
- Llama (Meta, open)
- Meta. (2024). Introducing Llama 3.1: Our most capable models to date.
- Dubey, A. et al. (2024). The Llama 3 herd of models.
- Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. - GQA
- Zhang, B. & Sennrich, R. (2019). Root mean square layer normalization. - RMSNorm
- Ramachandran et al. (2017). Searching for activation functions. - SiLU/Swish activation function used in FFN
- GPT-4 (OpenAI, closed)
- Chinchilla (DeepMind, closed)
- Claude (Anthropic, closed)
- Instruction finetuning
- Reinforcement Learning from Human Feedback (RLHF)
- Direct Preference Optimization (DPO)
- GPT: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
- GPT-2: Radford, A. et al. (2019). Language models are unsupervised multitask learners.
- GPT-3: Brown, T.B. et al. (2020). Language models are few-shot learners.
- InstructGPT: Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. - chain of thought (CoT) prompting
- GPT-4: OpenAI. (2023). GPT-4 Technical Report.
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your language model is secretly a reward model.
- Blog: RLHF progress: Scaling DPO to 70B.
- Timbers, F. (2023). Five years of GPT progress.
- Chen, C. (2023). Transformer taxonomy.
- Naveed, H. (2023). A comprehensive overview of large language models.
- Nvidia. (2024). Nemotron-4 340B Technical Report.
- The return of recurrence?
- SSMs and Mamba
- Gu, A., Goel, K., & Ré, C. (2021). Efficiently modeling long sequences with structured state spaces.
- Merrill, W. & Sabharwal, A. (2022). The parallelism tradeoff: Limitations of log-precision transformers.
- Bulatov, A., Kuratov, Y., & Burtsev, M.S. (2022). Recurrent memory transformer.
- Raffel, C. (2023). A new alchemy: Language model development as a subfield?.
- Bulatov, A., Kuratov, Y., & Burtsev, M.S. (2023). Scaling transformer to 1M tokens and beyond with RMT.
- Bertsch, A., Alon, U., Neubig, G., & Gormley, M.R. (2023). Unlimiformer: Long-range transformers with unlimited length input.
- Mialon, G. et al. (2023). Augmented Language Models: a Survey.
- Peng, B. et al. (2023). RWKV: Reinventing RNNs for the Transformer Era.
- Sun, Y. et al. (2023). Retentive network: A successor to transformer for large language models.
- Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces.
- Wang, H. et al. (2023). BitNet: Scaling 1-bit transformers for large language models.
- Ma, S. et al. (2024). The era of 1-bit LLMs: All large language models are in 1.58 bits.
- Ma, X. et al. (2024). Megalodon: Efficient LLM pretraining and inference with unlimited context length.
- Bhargava, A., Witkowski, C., Shah, M., & Thomson, M. (2023). What's the magic word? A control theory of LLM prompting.
- Sun, Y. et al. (2024). Learning to (learn at test time): RNNs with expressive hidden states.
- Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality.
- Yang, J. et al. (2023). Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond.
- Raschka, S. (2023). Understanding large language models.
- Mohamadi, S. et al. (2023). ChatGPT in the age of generative AI and large language models: A concise survey.
- Zhao, W.X. et al. (2023). A survey of large language models.
- Banerjee, S., Agarwal, A., & Singla, S. (2024). LLMs will always hallucinate, and we need to live with this.
- Anti-hype LLM reading list
- Bowman, S.R. (2023). Eight things to know about large language models.
- Up next: Parallelism and hardware
- Previous: Computer vision