Tokenization is the process of taking a sequence of text and breaking it into units called tokens. You can think of tokens as being words, but in general they can be parts of words.
Tokens are generally then converted to "token IDs" that are integer encodings of the tokens.
Example:
text = "Hello, world! This is tokenization."
tokens = ["<start>", "Hello", ",", " ", "world", "!", " ", "This", " ", "is", " ", "token", "iza", "tion", ".", "<end>"]
token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ... ]
- Tokenization is basically a map from word parts to integers.
- It is important to note that tokenization depends on the vocabulary used to build the map.
- A given tokenizer therefore may not support every language; the language needs to be represented in the vocabulary.
- A typical vocabulary size is something like $\sim$50,000.
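As a rough sketch of the map, here is a toy example with a small hand-built vocabulary (the specific tokens and IDs are made up for illustration; real tokenizers such as BPE learn the subword vocabulary from a large corpus):

```python
# Toy illustration of tokenization: a hand-built vocabulary mapping
# word pieces to integer IDs. Real tokenizers learn this vocabulary from data.
vocab = {"<start>": 1, "<end>": 2, "Hello": 123, ",": 22, " ": 10,
         "world": 2223, "!": 335, "This": 556, "is": 711,
         "token": 901, "iza": 902, "tion": 903, ".": 55}

def tokenize(tokens):
    """Map a list of string tokens to their integer token IDs."""
    return [vocab[t] for t in tokens]

tokens = ["<start>", "Hello", ",", " ", "world", "!", "<end>"]
print(tokenize(tokens))  # [1, 123, 22, 10, 2223, 335, 2]
```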
See also:
- Tutorial video by Andrej Karpathy: Let's build the GPT Tokenizer
Often tokenization is done in the DataLoader, which also forms batches of the data in the form of a tensor for the model.
To give the input tensor a uniform (rectangular) size, one often needs to pad the sequences to a common max sequence length (MSL).
Often the pad token ID is 0, so a padded sequence would look like
token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ..., 0, 0, 0]
The input tensor shape for language models is often:
[batch_size][max_seq_length] = e.g. [8][256]
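A minimal sketch of the padding and batching step (assuming PyTorch; the pad ID of 0 follows the convention above, and the MSL of 256 is the illustrative value used here):

```python
import torch

PAD_ID = 0
MAX_SEQ_LENGTH = 256  # MSL

def collate(batch_of_token_ids):
    """Pad each sequence to MAX_SEQ_LENGTH and stack into one [batch, MSL] tensor."""
    padded = []
    for ids in batch_of_token_ids:
        ids = ids[:MAX_SEQ_LENGTH]                          # truncate if too long
        ids = ids + [PAD_ID] * (MAX_SEQ_LENGTH - len(ids))  # right-pad with 0
        padded.append(ids)
    return torch.tensor(padded, dtype=torch.long)

batch = [[1, 123, 22, 2223], [1, 10, 335, 556, 10, 2]]
x = collate(batch)
print(x.shape)  # torch.Size([2, 256]) -> [batch_size][max_seq_length]
```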
After tokenization, the next step in a language model is to embed the tokens: a map from the token IDs to vectors in some large space, whose dimension is called the embedding_size.
The tensor shape of the output of the embedding is
[batch_size][max_seq_length][embedding_size] = e.g. [8][256][1280]
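A minimal sketch using PyTorch's nn.Embedding (the vocabulary size and embedding_size are the illustrative values used above):

```python
import torch
import torch.nn as nn

vocab_size = 50_000
embedding_size = 1280

# The embedding is a learned lookup table: row i is the vector for token ID i.
embed = nn.Embedding(vocab_size, embedding_size, padding_idx=0)

token_ids = torch.randint(0, vocab_size, (8, 256))  # [batch_size][max_seq_length]
vectors = embed(token_ids)
print(vectors.shape)  # torch.Size([8, 256, 1280])
```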
After the embedding parameters are trained end-to-end with a model, remarkably, you can give semantic interpretations to some directions in the embedding space. Famously, for example, vector arithmetic such as king - man + woman ≈ queen approximately holds for trained word embeddings.
Another example: a direction in the embedding space correlates with the country-capital relation (e.g., Paris - France ≈ Rome - Italy).
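A sketch of the kind of vector arithmetic involved, using made-up vectors that stand in for trained embeddings (word2vec-style models trained on real text exhibit this behavior):

```python
import numpy as np

# Made-up 3-D vectors standing in for trained word embeddings.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```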
See also:
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
- Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality.
- Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations.
- Olah, C. (2014). Deep learning, NLP, and representations.
- RNNs and LSTMs
  - Olah, C. (2015). Understanding LSTM networks.
- seq2seq
  - Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks.
  - Watershed moment in NLP with deep learning
  - First very successful encoder-decoder based model
- Bahdanau "attention"
  - Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate.
- Google Neural Machine Translation (GNMT)
Chain rule of language modeling (chain rule of probability): the model predicts one conditional factor at a time,
$P(x_t \mid x_1, \ldots, x_{t-1})$,
or for the whole sequence:
$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$
A language model (LM) predicts the next token given the previous context. The output of the model is a vector of logits, which is passed through a softmax to convert it into probabilities for the next token.
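A sketch of how logits, softmax, and the chain rule fit together, assuming a hypothetical `model` that maps a 1-D tensor of context token IDs to a vector of next-token logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    """Sum log P(x_t | x_1..x_{t-1}) over the sequence (chain rule).

    `model(context)` is assumed to return a [vocab_size] vector of logits
    for the next token, given `context` as a 1-D LongTensor of token IDs.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        logits = model(token_ids[:t])              # [vocab_size]
        log_probs = F.log_softmax(logits, dim=-1)  # softmax -> (log) probabilities
        total += log_probs[token_ids[t]].item()    # pick the observed next token
    return total
```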
Auto-regressive inference follows this chain rule. If done with greedy search, the most probable next token is chosen at each step: $\hat{x}_t = \arg\max_{x} P(x \mid x_1, \ldots, x_{t-1})$
Beam search: instead of keeping only the single most probable continuation, keep the top-$k$ partial sequences (the beam) at each step and expand each of them.
- Beam search as used in NLP is described in Sutskever (2014).
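A minimal sketch of greedy autoregressive decoding, under the same hypothetical `model` interface as above (the `end_id` of 2 is an assumed `<end>` token ID):

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens=20, end_id=2):
    """Repeatedly pick the argmax next token and append it to the context."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))     # [vocab_size] logits for the next token
        next_id = int(torch.argmax(logits))   # greedy: most probable token
        ids.append(next_id)
        if next_id == end_id:                 # stop at the assumed <end> token
            break
    return ids
```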
- Vaswani, A. et al. (2017). Attention is all you need.
- Describe architecture
- Describe self-attention
- Note the complexity is $T^2$ in the sequence length $T$
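A minimal sketch of single-head scaled dot-product self-attention, showing where the $T^2$ cost comes from and how a causal (decoder-style) mask restricts each position to earlier positions:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v, causal=True):
    """x: [T, d_model]; w_q/w_k/w_v: [d_model, d_head] projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])   # [T, T] matrix -> quadratic cost in T
    if causal:
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)     # attention weights
    return weights @ v                          # [T, d_head]

T, d_model, d_head = 256, 1280, 64
x = torch.randn(T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([256, 64])
```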
Autoregressive decoding: tokens are generated one at a time, with each new token appended to the context for the next step.
KV-cache: cache the attention keys and values of already-processed tokens, so each decoding step only computes the query, key, and value for the new token (a minimal sketch follows the list below).
Quadratic complexity in sequence length:
- Note a lot of research in reducing the quadratic complexity
- Note a lot of research in extending context length (e.g., llama3 has 8k context)
- Note Mamba claims to have complexity linear in $T$
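A minimal sketch of a single-head KV-cache (the projection matrices and interface are illustrative assumptions): per step, only the new token's query, key, and value are computed, and attention is taken against the cached keys and values.

```python
import math
import torch

class KVCache:
    """Append-only cache of keys and values for one attention head."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, w_q, w_k, w_v):
        """x_t: [d_model] embedding of the newest token only."""
        q = x_t @ w_q                              # [d_head]
        self.keys.append(x_t @ w_k)                # cache grows by one entry per step
        self.values.append(x_t @ w_v)
        K = torch.stack(self.keys)                 # [t, d_head]
        V = torch.stack(self.values)               # [t, d_head]
        scores = K @ q / math.sqrt(K.shape[-1])    # [t] (no full T x T matrix needed)
        weights = torch.softmax(scores, dim=-1)
        return weights @ V                         # attention output for the new token

d_model, d_head = 1280, 64
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
cache = KVCache()
for x_t in torch.randn(5, d_model):                # 5 decoding steps
    out = cache.step(x_t, w_q, w_k, w_v)
print(out.shape)  # torch.Size([64])
```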
Note that there are also variants of transformers that move and/or change the normalization layers. Most transformers now use "pre-layer-norm", unlike the original.
Some transformer models (e.g., llama3) use RMSNorm instead of LayerNorm.
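A minimal RMSNorm module in the spirit of Zhang & Sennrich (2019): normalize by the root mean square of the features (no mean subtraction, unlike LayerNorm) and apply a learned scale. This is a sketch, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x):
        # Normalize by the root mean square over the last (feature) dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(8, 256, 1280)
print(RMSNorm(1280)(x).shape)  # torch.Size([8, 256, 1280])
```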
See also:
- Zhang, B. & Sennrich, R. (2019). Root mean square layer normalization.
- Xiong, R. et al. (2020). On layer normalization in the transformer architecture.
- Phuong, M. & Hutter, M. (2022). Formal algorithms for transformers.
- Tutorial video by Andrej Karpathy: Let's reproduce GPT-2 (124M).
- BERT is encoder-only
- BERT has bidirectional attention
- Pretrained with masked language modeling (MLM); see the masking sketch after this list
- For encoding: sequence to vector; for classification tasks: sequence to class
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
- Liu, Y. et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
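A rough sketch of the MLM input corruption. The 15% masking rate and the [MASK] ID of 103 are the standard BERT choices (the 80/10/10 replacement details of the original recipe are omitted here):

```python
import torch

MASK_ID = 103    # [MASK] token ID in the standard BERT vocabulary
MASK_PROB = 0.15

def mask_tokens(token_ids):
    """Randomly replace ~15% of tokens with [MASK]; the model is trained to
    predict the original tokens at the masked positions."""
    token_ids = token_ids.clone()
    masked_positions = torch.rand(token_ids.shape) < MASK_PROB
    labels = torch.where(masked_positions, token_ids,
                         torch.full_like(token_ids, -100))  # -100 = ignored by the loss
    token_ids[masked_positions] = MASK_ID
    return token_ids, labels

inputs, labels = mask_tokens(torch.randint(1000, 2000, (8, 256)))
```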
- T5 is an encoder-decoder
- Encoder-decoder good for sequence-to-sequence modeling: translation, summarization
- T5 also demonstrated that classification tasks can be done as sequence-to-sequence
- Causal attention
- Recap various attention schemes
- Raffel, C. et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer.
- GPT is decoder-only
- Causal attention
- Decoder-only models like GPT:
- Falcon (TII, open)
- Llama (Meta, open)
- Meta. (2024). Introducing Llama 3.1: Our most capable models to date.
- Dubey, A. et al. (2024). The Llama 3 herd of models.
- Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. - GQA
- Zhang, B. & Sennrich, R. (2019). Root mean square layer normalization. - RMSNorm
- Ramachandran et al. (2017). Searching for activation functions. - SiLU/Swish activation function used in FFN
- GPT-4 (OpenAI, closed)
- Chinchilla (DeepMind, closed)
- Claude (Anthropic, closed)
- Instruction finetuning
- Reinforcement Learning from Human Feedback (RLHF)
- Direct Preference Optimization (DPO)
- GPT: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
- GPT-2: Radford, A. et al. (2019). Language models are unsupervised multitask learners.
- GPT-3: Brown, T.B. et al. (2020). Language models are few-shot learners.
- InstructGPT: Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. - chain of thought (CoT) prompting
- GPT-4: OpenAI. (2023). GPT-4 Technical Report.
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your language model is secretly a reward model.
- Blog: RLHF progress: Scaling DPO to 70B.
- Timbers, F. (2023). Five years of GPT progress.
- Chen, C. (2023). Transformer taxonomy.
- Naveed, H. (2023). A comprehensive overview of large language models.
- Nvidia. (2024). Nemotron-4 340B Technical Report.
- The return of recurrence?
- SSMs and Mamba
- Gu, A., Goel, K., & Ré, C. (2021). Efficiently modeling long sequences with structured state spaces.
- Merrill, W. & Sabharwal, A. (2022). The parallelism tradeoff: Limitations of log-precision transformers.
- Bulatov, A., Kuratov, Y., & Burtsev, M.S. (2022). Recurrent memory transformer.
- Raffel, C. (2023). A new alchemy: Language model development as a subfield?.
- Bulatov, A., Kuratov, Y., & Burtsev, M.S. (2023). Scaling transformer to 1M tokens and beyond with RMT.
- Bertsch, A., Alon, U., Neubig, G., & Gormley, M.R. (2023). Unlimiformer: Long-range transformers with unlimited length input.
- Mialon, G. et al. (2023). Augmented Language Models: a Survey.
- Peng, B. et al. (2023). RWKV: Reinventing RNNs for the Transformer Era.
- Sun, Y. et al. (2023). Retentive network: A successor to transformer for large language models.
- Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces.
- Wang, H. et al. (2023). BitNet: Scaling 1-bit transformers for large language models.
- Ma, S. et al. (2024). The era of 1-bit LLMs: All large language models are in 1.58 bits.
- Ma, X. et al. (2024). Megalodon: Efficient LLM pretraining and inference with unlimited context length.
- Bhargava, A., Witkowski, C., Shah, M., & Thomson, M. (2023). What's the magic word? A control theory of LLM prompting.
- Sun, Y. et al. (2024). Learning to (learn at test time): RNNs with expressive hidden states.
- Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality.
- Yang, J. et al. (2023). Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond.
- Raschka, S. (2023). Understanding large language models.
- Mohamadi, S. et al. (2023). ChatGPT in the age of generative AI and large language models: A concise survey.
- Zhao, W.X. et al. (2023). A survey of large language models.
- Banerjee, S., Agarwal, A., & Singla, S. (2024). LLMs will always hallucinate, and we need to live with this.
- Anti-hype LLM reading list
- Bowman, S.R. (2023). Eight things to know about large language models.
- Up next: Parallelism and hardware
- Previous: Computer vision