
Natural language

Contents

  1. Tokenization
  2. Input tensor shape
  3. Embedding (word2vec)
  4. seq2seq
  5. Transformer
  6. BERT
  7. T5
  8. GPT
  9. Conclusion

Tokenization

Tokenization is the process of taking a sequence of text and breaking it into units called tokens. You can think of tokens as being words, but in general they can be parts of words.

Tokens are then typically converted to "token IDs", which are integer encodings of the tokens.

Example:

text = "Hello, world! This is tokenization."
tokens = ["<start>", "Hello", ",", " ", "world", "!", " ", "This", " ", "is", " ", "token", "iza", "tion", ".", "<end>"]
token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ... ]
  • Tokenization is essentially a map from word parts to integers.
  • Note that tokenization depends on the vocabulary used to build that map.
  • A given tokenizer therefore may not support every language; the language needs to be covered by the vocabulary.
  • A typical vocabulary size is $\sim$ 50,000.
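
As a concrete sketch (assuming the Hugging Face transformers library and the pretrained GPT-2 tokenizer; the exact word pieces and IDs depend on the vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # BPE tokenizer with a ~50k-entry vocabulary

text = "Hello, world! This is tokenization."
tokens = tokenizer.tokenize(text)                    # word pieces, e.g. ['Hello', ',', 'Ġworld', '!', ...]
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # integers indexing into the vocabulary
print(len(tokenizer))                                # vocabulary size (50257 for GPT-2)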

See also:

Input tensor shape

Tokenization is often done in the DataLoader, which also forms batches of the data as a tensor for the model. To make the input tensor rectangular, the sequences typically need to be padded to a common max sequence length (MSL).

Often the pad token ID is 0, so a padded sequence would look like

token_ids = [1, 123, 22, 2223, 10, 335, 556, 10, ..., 0, 0, 0]

The input tensor shape for language models is often:

[batch_size][max_seq_length]  =  e.g. [8][256]
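
A minimal padding sketch in Python (the pad token ID of 0 and the MSL of 8 are just illustrative; real tokenizers expose their own pad token):

import torch

PAD_ID = 0
max_seq_length = 8

sequences = [[1, 123, 22, 2223, 10], [1, 335, 556, 10, 42, 7, 2]]
padded = [seq + [PAD_ID] * (max_seq_length - len(seq)) for seq in sequences]

batch = torch.tensor(padded)   # shape [batch_size][max_seq_length] = [2][8]
print(batch.shape)             # torch.Size([2, 8])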

Embedding (word2vec)

After tokenization, the next step in a language model is to embed the tokens: a map from each token ID to a vector in a large space, whose dimension is called the embedding_size.

The tensor shape of the output of the embedding is

[batch_size][max_seq_length][embedding_size]  =  e.g. [8][256][1280]
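
A sketch with PyTorch's nn.Embedding, using the example sizes above:

import torch
import torch.nn as nn

vocab_size, embedding_size = 50_000, 1280
embedding = nn.Embedding(vocab_size, embedding_size, padding_idx=0)   # one trainable vector per token ID

token_ids = torch.randint(0, vocab_size, (8, 256))    # [batch_size][max_seq_length]
embedded = embedding(token_ids)                       # [batch_size][max_seq_length][embedding_size]
print(embedded.shape)                                 # torch.Size([8, 256, 1280])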

After the embedding parameters are trained end-to-end with a model, remarkably, some directions in the embedding space can be given semantic interpretations. Famously, for example

$$ \vec{E}(\mathrm{king}) - \vec{E}(\mathrm{man}) + \vec{E}(\mathrm{woman}) \approx \vec{E}(\mathrm{queen}) $$

word2vec visualization 1 (source: https://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html/).

Another example, where a direction in the embedding space captures the country-capital relationship:

word2vec visualization 2 (source: https://arxiv.org/abs/1310.4546).
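
A sketch of the analogy arithmetic using gensim's pretrained GloVe vectors (the dataset name and download step are assumptions about the gensim downloader; any trained word vectors would do):

import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-50")   # downloads pretrained word vectors on first use
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))   # top hit is typically "queen"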

See also:

seq2seq

Chain rule of language modeling (chain rule of probability):

$$ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) $$

or for the whole sequence:

$$ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) $$

$$ = P(x_1) P(x_2 | x_1) P(x_3 | x_1 x_2) P(x_4 | x_1 x_2 x_3) \ldots $$

A language model (LM) predicts the next token given the previous context. The output of the model is a vector of logits, which is passed through a softmax to convert it into a probability distribution over the next token.

$$ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{model}(x_1 \ldots x_{t-1}) = \underset{V}{\mathrm{softmax}}\left( \mathrm{logits}(x_1 \ldots x_{t-1}) \right) $$
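
A sketch of scoring a whole sequence via the chain rule, assuming a hypothetical model(token_ids) that returns logits of shape [batch, seq, vocab]:

import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    # token_ids: [1, T]; returns log P(x_2, ..., x_T | x_1) = sum_t log P(x_t | x_1 ... x_{t-1})
    logits = model(token_ids)                  # [1, T, vocab]
    log_probs = F.log_softmax(logits, dim=-1)  # normalize over the vocabulary V
    targets = token_ids[:, 1:]                 # the logits at position t-1 predict the token at position t
    step_lp = log_probs[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return step_lp.sum(dim=-1)                 # total log-probability of the sequence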

Auto-regressive inference follows this chain rule. If done with greedy search:

$$ \hat{x}_{t} = \underset{x_t \in V}{\mathrm{argmax}} \ P(x_t | x_1 \ldots x_{t-1}) $$
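
A minimal greedy decoding loop, under the same assumption of a model(token_ids) that returns logits of shape [batch, seq, vocab]:

import torch

@torch.no_grad()
def greedy_decode(model, token_ids, max_new_tokens=20, eos_id=None):
    # token_ids: [1, T] prompt; append the argmax token one step at a time
    for _ in range(max_new_tokens):
        logits = model(token_ids)                                 # [1, T, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # [1, 1]
        token_ids = torch.cat([token_ids, next_id], dim=-1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return token_ids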

Beam search:

  • Beam search as used in NLP is described in Sutskever et al. (2014).
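
For reference, a simplified beam search sketch (no end-of-sequence or length-normalization handling), again assuming a model(token_ids) that returns logits; it keeps the beam_width highest-scoring partial hypotheses at every step:

import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt_ids, beam_width=4, max_new_tokens=20):
    beams = [(list(prompt_ids), 0.0)]   # each beam is (token ID list, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for ids, score in beams:
            logits = model(torch.tensor([ids]))[0, -1]   # logits for the next token, [vocab]
            log_probs = F.log_softmax(logits, dim=-1)
            top = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((ids + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]   # token IDs of the highest-scoring hypothesis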

Transformer

Diagram of the Transformer model (source: d2l.ai).

  • Describe architecture
  • Describe self-attention
  • Note that the complexity of self-attention is $O(T^2)$ in the sequence length $T$

$$ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\intercal}{\sqrt{d_k}}\right) V $$
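
The formula translates almost directly to code; a single-head, unmasked sketch:

import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: [batch, T, d_k]; the [T, T] score matrix is the source of the T^2 cost
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [batch, T, T]
    weights = F.softmax(scores, dim=-1)                 # each query's weights over the keys sum to 1
    return weights @ V                                  # [batch, T, d_k]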

Autoregressive decoding:

Autoregressive decoding. source: https://hrithickcodes.medium.com/the-math-behind-the-machine-a-deep-dive-into-the-transformer-architecture-a3902333e4a4

KV-cache:

KV-cache explained. source: https://medium.com/@joaolages/kv-caching-explained-276520203249
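
The idea in code: during autoregressive decoding, cache the keys and values of past positions and only compute projections for the newest token (the projection matrices and cache layout here are illustrative):

import math
import torch
import torch.nn.functional as F

def decode_step(x_new, W_q, W_k, W_v, cache=None):
    # x_new: [batch, 1, d_model], hidden state of the newest token only
    q = x_new @ W_q                            # [batch, 1, d_k]
    k = x_new @ W_k
    v = x_new @ W_v
    if cache is not None:
        k = torch.cat([cache["k"], k], dim=1)  # reuse keys/values computed at earlier steps
        v = torch.cat([cache["v"], v], dim=1)
    cache = {"k": k, "v": v}
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # [batch, 1, seq_so_far]
    out = F.softmax(scores, dim=-1) @ v        # attention output for the new token only
    return out, cache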

Quadratic complexity in sequence length:

  • Note that there is a lot of research on reducing the quadratic complexity
  • Note that there is a lot of research on extending context length (e.g., llama3 has 8k context)
  • Note that Mamba claims complexity linear in $T$

Note that there are also variants of the transformer that move and/or change the normalization layers. Most transformers now use "pre-layer-norm", unlike the original, which used post-layer-norm.

Pre-layer-norm transformer (source: 2002.04745).

Some transformer models (e.g., llama3) use RMSNorm instead of LayerNorm.
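
A sketch of RMSNorm: unlike LayerNorm there is no mean-centering and no bias, only division by the root-mean-square of the features and a learned per-feature scale:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learned per-feature scale

    def forward(self, x):
        # normalize by the RMS over the last (feature) dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms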

See also:

BERT

T5

  • T5 is an encoder-decoder
  • Encoder-decoder good for sequence-to-sequence modeling: translation, summarization
  • T5 also demonstrated that classification tasks can be done as sequence-to-sequence
  • Causal attention (a masking sketch is given after the figure below)
  • Recap various attention schemes

T5 description of types of transformer architectures (source: 1910.10683).
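
A sketch of causal (decoder-style) attention, which differs from the unmasked attention above only in masking out future positions before the softmax:

import math
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    # Q, K, V: [batch, T, d_k]
    T, d_k = Q.size(1), Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)                 # [batch, T, T]
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))                  # block attention to future tokens
    return F.softmax(scores, dim=-1) @ V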

GPT

Development of ChatGPT (source: 2302.10724).

What comes after the transformer?

Conclusion

Evolutionary tree of LLMs (source: 2304.13712).