The year column represents the arXiv publishing year, not the year of the latest revision. Supplementary pdfs/docs are mostly not included in the links below. A paper may fall into multiple categories but is organized under a single general one.
Only papers worth implementing are added: important concepts that can be applied in the future, high-quality work worth revisiting, SOTA or close-to-SOTA results, or unique ideas. A paper is not added if the concept is not understood, is too hard, or nothing meaningful was found in it.
- 🚩 Represents overall good understanding: easier to understand thanks to diagrams, examples, or a good explanation in simple language.
- 🎯 Represents a partially reviewed paper.
Topic | Year |
---|---|
SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY | |
Attention Mechanisms in Computer Vision: A Survey | |
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning | 2023 |
Basics
- Targets `multi-task` Natural Language Understanding (NLU) objectives such as question answering, semantic similarity, document classification etc.
- Some of these tasks are part of the General Language Understanding Evaluation (GLUE) multi-task benchmark.
- Supervised learning suffers from the lack of large, high-quality annotated datasets. Learning from raw text removes the dependence on supervised-only methods.
GPT
- GPT (Generative Pre-Training) is a `semi-supervised` approach with `unsupervised pre-training` and `supervised fine-tuning`.
- `Two stages`: generative pre-training on unlabeled data, then discriminative fine-tuning for each specific task with task-aware input transformations and minimal changes to the model architecture.
- GPT acquires useful linguistic knowledge for `downstream tasks` and outperforms specifically crafted task-specific models.
- The goal is to `learn a universal representation` that transfers to a wide range of tasks with little adaptation.
- Input text is processed as a single contiguous sequence of tokens.
- `Training` requires a `large corpus of unlabeled data` and `manually annotated data` for each target task.
Model
- The `Transformer` is used as the model architecture due to its ability to `handle long-term dependencies` in text.
- A multi-layer `transformer decoder` is used for language modeling. The model is a multi-layer `decoder-only transformer` with masked self-attention heads (see the sketch below).
- `Learned positional embeddings` are used instead of the sinusoidal ones of the original transformer.
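A minimal sketch of such a decoder-only LM, assuming illustrative hyperparameters (`vocab_size`, `d_model`, layer count etc. are not the paper's values) and emulating masked self-attention with a causal mask over a standard PyTorch encoder stack:

```python
# Minimal sketch (not the paper's code): decoder-only transformer LM with
# masked self-attention and learned positional embeddings.
import torch
import torch.nn as nn


class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=10000, max_len=512, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned positional embedding instead of sinusoidal encoding.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        # An "encoder" stack plus a causal mask behaves as a decoder-only transformer.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)  # broadcast over batch
        # Causal mask: each position may only attend to previous positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h)                        # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    model = TinyDecoderLM()
    logits = model(torch.randint(0, 10000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 10000])
```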
Unsupervised Pre-training
- `Pre-training` acts as a `regularizer`, providing `better generalization` in deep neural nets.
- The `unsupervised pre-training` goal is to find a `good initialization point` rather than to modify the supervised objective.
- The unsupervised pre-training objective, based on a `context window of tokens`, predicts the likelihood of the next token (sketch below).
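A minimal sketch of that next-token objective as a cross-entropy loss, assuming the hypothetical `TinyDecoderLM` sketched above (not the paper's code):

```python
# Next-token prediction: minimize cross-entropy between the model's prediction
# for position i (computed from tokens < i) and the actual token at position i.
import torch.nn.functional as F

def lm_loss(model, tokens):                 # tokens: (batch, seq_len) of token ids
    logits = model(tokens[:, :-1])          # predictions for positions 1..N-1
    targets = tokens[:, 1:]                 # next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```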
Supervised Fine-tuning
- Uses a `labeled dataset` for the supervised task. An input sequence of tokens `x1, x2, ..., xN` has an output label `y`.
- An `additional linear output layer` is added after the final layer of the transformer to predict `y` for the given task.
- Uses the `label prediction objective` and additionally `language modeling as an auxiliary objective` (loss) from unsupervised pre-training during supervised fine-tuning (see the sketch below).
- The `extra parameters` added to the unsupervised pre-trained model are the final linear layer weights $W_y$ and the embeddings for the delimiter tokens.
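A minimal sketch of this setup, assuming the hypothetical `TinyDecoderLM` above, a binary classification task, and an illustrative `lambda_aux` weight for the auxiliary LM loss:

```python
# Sketch: task head on top of a pre-trained decoder LM, trained with the task
# loss plus the LM loss as an auxiliary objective. Attribute names follow the
# TinyDecoderLM sketch above (an assumption, not a fixed API).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPTClassifier(nn.Module):
    def __init__(self, pretrained_lm, d_model=256, n_classes=2):
        super().__init__()
        self.lm = pretrained_lm
        self.clf_head = nn.Linear(d_model, n_classes)   # the extra W_y parameters

    def forward(self, tokens):
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.lm.tok_emb(tokens) + self.lm.pos_emb(pos)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.lm.blocks(x, mask=mask)
        # Task logits from the final token's state, plus LM logits for the aux loss.
        return self.clf_head(h[:, -1]), self.lm.lm_head(h)


def finetune_loss(model, tokens, labels, lambda_aux=0.5):
    task_logits, lm_logits = model(tokens)
    task_loss = F.cross_entropy(task_logits, labels)
    aux_loss = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
    return task_loss + lambda_aux * aux_loss
```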
Input Transformation
- `Byte Pair Encoding (BPE)` is used for sub-word tokenization.
- Instead of using task-specific architectures, the inputs are `converted to a sequence of tokens` that the pretrained model can process (see the sketch below).
- Each input to the pretrained model includes `start and end tokens`, `<s>` and `<e>`.
- For `text entailment`, the premise `p` and hypothesis `h` token sequences include a delimiter token `$` between them.
- For `document QA` (Question Answering) and commonsense reasoning, the input includes a document `z`, a question `q`, and a set of answers $a_k$. The document context, question and each individual answer are concatenated with a delimiter token in between to produce the inputs to the pretrained model: ($z$, $q$, $\$$, $a_1$), ($z$, $q$, $\$$, $a_2$), ..., ($z$, $q$, $\$$, $a_K$). A softmax layer produces the output distribution over all possible answers.
- For `sentence similarity` with two sentences' token sequences `s1`, `s2` there is no natural ordering. Start and end tokens are added with a delimiter between the sentences to produce one output; the sentences are also swapped to produce a second output, and both outputs are concatenated before feeding the linear layer.
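A minimal sketch of these input transformations on token lists (the special-token strings here are only illustrative placeholders for the learned embeddings):

```python
# Sketch of GPT-style task-aware input transformations.
START, END, DELIM = "<s>", "<e>", "$"

def entailment_input(premise_tokens, hypothesis_tokens):
    # <s> premise $ hypothesis <e>
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [END]

def similarity_inputs(s1_tokens, s2_tokens):
    # No natural ordering: produce both orders; their final hidden states are
    # concatenated before the linear head.
    return (entailment_input(s1_tokens, s2_tokens),
            entailment_input(s2_tokens, s1_tokens))

def qa_inputs(doc_tokens, question_tokens, answers):
    # One sequence per candidate answer: <s> z q $ a_k <e>; a softmax over the
    # per-answer scores gives the distribution over answers.
    return [[START] + doc_tokens + question_tokens + [DELIM] + a + [END]
            for a in answers]
```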
Datasets
- Natural Language Inference (NLI): SNLI, MNLI, QNLI.
- Question Answering: SQuAD (Stanford Question Answering Dataset).
- Semantic Similarity: Quora Question Pairs (QQP), Semantic Textual Similarity benchmark (STS-B), Microsoft Paraphrase corpus (MRPC).
Prerequisite
- GPT-1 paper.
Applying Pre-trained Language Representations to Downstream Tasks
- There are `two strategies` for applying pre-trained language representations to downstream tasks: `fine-tuning` and `feature-based`.
- Feature-based representations are applied as `additional features` to task-specific architectures.
- Fine-tuning approaches like `GPT` introduce `minimal task-specific parameters` and are trained on downstream tasks by `fine-tuning all pre-trained parameters`.
Limitations of existing Unidirectional Language Models
- Standard language models are `unidirectional`. `GPT` uses a `left-to-right` architecture, where each token can `only attend to previous tokens` in the self-attention layers because future tokens are masked.
- This can be `harmful` when fine-tuning on `token-level tasks like Question Answering`, which require context incorporated from both directions.
BERT
- BERT (Bidirectional Encoder Representations from Transformers) is a `language representation model` introduced in the paper.
- Similar to GPT, BERT has a pre-training and a fine-tuning stage. Each downstream task has a separate fine-tuned model, even if they are all initialized with the same pre-trained parameters.
- In contrast to GPT, which is a transformer decoder model due to its left-only context, the bidirectional-context BERT is a `transformer encoder model`.
- Designed to `pretrain` deep bidirectional representations from `unlabeled data` by `jointly conditioning on left and right context` in all layers.
- `An additional output layer` is added to create `SOTA (state-of-the-art)` models that perform well on sentence-level and token-level tasks, including QA, language inference etc.
- `Alleviates the constraints` of unidirectional models (e.g. GPT) by introducing the `Masked Language Model (MLM)` pre-training task.
- Also uses a `Next Sentence Prediction (NSP)` task to jointly pretrain on text-pair representations.
- Similar to GPT, during fine-tuning `all parameters are fine-tuned`.
- For fine-tuning, task-specific inputs and outputs are added. Adding an extra output layer on top of the pre-trained model can be used for classification, sentiment analysis etc.
- BERT is effective in both feature-based (extracting fixed features from the pre-trained network) and fine-tuning (all parameters, including pre-trained ones) approaches.
- `Increasing model size` leads to `continual improvement` on large-scale tasks.
- Uses `WordPiece` tokenization.
Special Tokens
- The `[CLS]` token is added in front of every training example. The `final hidden state` corresponding to this token is used as the aggregate representation for `classification tasks`.
- `[SEP]` is a special separator token for separating sentence pairs, e.g. separating questions and answers.
- The `[MASK]` token is used during pre-training and not used in fine-tuning.
Input and Output Representations
- The BERT input representation can unambiguously handle both a single sentence and a sentence pair such as `<question, answer>` in one token sequence.
- Here, a sentence is an `arbitrary span of contiguous text`, rather than a linguistic sentence.
- A sequence refers to the `input token sequence` to BERT, which can be a `single sentence or a pair of sentences packed together` (see the sketch below).
- A `learned segment embedding` is added to every token to indicate whether it belongs to sentence A or sentence B, in addition to learned positional embeddings.
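A minimal sketch of packing a sentence pair into a single input sequence with segment ids (a toy whitespace split stands in for WordPiece, and strings stand in for integer ids):

```python
# Sketch: pack a <sentence A, sentence B> pair into one BERT-style sequence.
def pack_pair(sent_a, sent_b):
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    # Segment ids: 0 up to and including the first [SEP], 1 afterwards.
    first_sep = tokens.index("[SEP]")
    segment_ids = [0 if i <= first_sep else 1 for i in range(len(tokens))]
    position_ids = list(range(len(tokens)))   # indices for learned positional embeddings
    return tokens, segment_ids, position_ids

print(pack_pair("the man went to the store", "he bought milk"))
```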
Masked Language Model (MLM)
- MLM (the Cloze task), used for pre-training, `randomly masks some input tokens`, and the objective is to `predict the original vocabulary id` of each masked token based only on its context.
- MLM enables the representations to `fuse left and right context`.
- Naive bidirectional conditioning would allow each word to indirectly see itself.
- The `[MASK]` token appears only in pre-training and not in fine-tuning. To mitigate this mismatch, selected words are not always replaced with the mask token (see the sketch below).
- 80% of the time the word is replaced with the `[MASK]` token, 10% of the time with a random token, and 10% of the time it is kept unchanged. Cross-entropy loss is used to predict the original token. Example with the unlabeled sentence `my dog is hairy`: 80% of the time the input becomes `my dog is [MASK]`, 10% of the time a random word, `my dog is apple`, and 10% of the time the same word, `my dog is hairy`.
- Masked LM only makes predictions on 15% of the tokens in each batch. The model does not know which words it needs to predict or which have been replaced with random words.
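A minimal sketch of the 15% selection with 80/10/10 replacement (toy vocabulary, string tokens instead of ids):

```python
# Sketch of BERT-style MLM masking.
import random

MASK, VOCAB = "[MASK]", ["the", "a", "dog", "cat", "apple", "store"]

def mlm_mask(tokens, select_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)   # None = no loss here
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > select_prob:
            continue
        labels[i] = tok                      # loss is computed only at selected positions
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                 # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB) # 10%: replace with a random token
        # else: 10% keep the original token unchanged
    return inputs, labels

print(mlm_mask("my dog is hairy".split(), select_prob=0.5))
```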
Next Sentence Prediction (NSP)
- Used for pre-training; shown to be beneficial for QA and NLI.
- Downstream tasks like `Question Answering (QA)` and `Natural Language Inference (NLI)` are based on the `relationship between two sentences`, which is not directly captured by language modeling.
- For this task, 50% of the time sentence B is the `actual next sentence` following sentence A, labeled `IsNext`. The other 50% of the time sentence B is `randomly chosen` from the corpus and labeled `NotNext` (see the sketch below).
- Input example: `[CLS] the man went to [MASK] store [SEP] he bought a gallon of [MASK] milk [SEP]` with output label `IsNext`.
- Input example: `[CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds [SEP]` with output label `NotNext`.
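A minimal sketch of constructing NSP training pairs from a corpus of documents (the two-document corpus here is only illustrative):

```python
# Sketch: 50% true next sentence (IsNext), 50% random sentence (NotNext).
import random

def make_nsp_example(corpus_docs):
    doc = random.choice(corpus_docs)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"
    random_doc = random.choice(corpus_docs)       # may rarely pick the same doc;
    return sent_a, random.choice(random_doc), "NotNext"   # fine for a sketch

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(corpus))
```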
Foundation Models
- `Foundation models` are models trained on `broad data`, generally using `self-supervised learning at scale`, that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. Examples include BERT, DALL-E, GPT-3, CLIP etc.
- Foundation models provide a `general-purpose engine` for `processing multimodal information`.
- They are based on `deep neural networks (DNN)` and `self-supervised learning (SSL)`.
- They incentivize `homogenization`: the consolidation of methodologies for building ML systems across a wide range of applications.
- Transfer learning makes them possible; scale makes them powerful. Their `scale` results in `emergent capabilities`.
- They have led to an unprecedented level of homogenization: `almost all SOTA NLP models are adapted` from a few `foundation models: BERT, RoBERTa, BART, T5` etc.
- Scale also led to a surprising emergence: `in-context learning` in the 175B-parameter GPT-3. `In-context learning` allows `adaptation to downstream tasks` by simply providing the model with a `prompt`.
- They can `centralize information` from data across `various modalities (e.g., text, image, audio)` and can be `adapted` to a wide range of `downstream tasks (e.g., QA, image captioning, following instructions, object recognition)`.
- A `foundation model itself is incomplete`, but serves as a `common basis` from which many `task-specific models` are built via adaptation.
Basic Knowledge
- In deep learning, `pretraining` is the dominant `transfer learning approach`.
- In `self-supervised learning` the pretraining task is derived from `unlabeled data`.
- The `transformer` model architecture leverages `hardware parallelism` to train more expressive models.
Memory
- It is important to distinguish between `explicit facts` that can be `stored in external memory storage` (e.g., a vector database) and `implicit knowledge` reflected through the `trainable weights of the network`.
- `Decoupling explicit and implicit knowledge` enjoys multiple `benefits` compared to implicitly encoding all information together through the network weights. 🔗 paper link
- This separation `mitigates inflation in model size`, i.e. the `number of parameters` needed to store growing amounts of knowledge. It is also `key to memory update (model patching), manipulation and adaptation`. 🔗 paper link
Adaptations
- `Low storage adaptation` approaches include `fine-tuning final layer weights`, `only bias vectors`, and `low-rank weight tensors`.
- `Low memory adaptations` include `gradient checkpointing` to trade off computation for memory (see the sketch below).
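A minimal sketch of these options on a generic PyTorch model (the toy model and attribute names are illustrative, not tied to any specific foundation model):

```python
# Sketch: low-storage adaptation (final-layer-only, bias-only) and low-memory
# training via gradient checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))

def finetune_final_layer_only(model):
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():         # only the last Linear gets updated
        p.requires_grad = True

def finetune_bias_only(model):               # BitFit-style bias-only adaptation
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")

finetune_bias_only(model)

# Gradient checkpointing: recompute the block's activations in the backward
# pass instead of storing them, trading compute for memory.
x = torch.randn(8, 512, requires_grad=True)
h = checkpoint(model[:2], x, use_reentrant=False)    # first block checkpointed
out = model[2:](h)
out.sum().backward()
```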
Temporal Adaptations
- `Information is constantly changing`, e.g. clothing styles or newly elected heads of state. This `temporal shift` presents a problem due to the computationally demanding nature of training foundation models.
- `Temporally-partitioned` diagnostic datasets help quantify the rate at which language models become outdated.
- An alternative way of addressing `temporal shift` is to design `retrieval-based (semi-parametric) models`, which `augment the model` input with `additional context retrieved from large human-interpretable databases`. In this case, `adaptation` corresponds to `updating` individual units of `information in the database`.
Continual Learning
- A natural extension of adaptation is `continual learning` or `continual adaptation`: to `keep the model's knowledge up-to-date` with world events, `continually add data` from new domains or `modalities`.
- A problem is that `continual learning` induces `catastrophic forgetting` in neural networks, where `old tasks or data are rapidly forgotten` as the training distribution changes.
- `Memory mechanisms` have shown promise for continual learning in foundation models.
- `Techniques for localizing knowledge` in a foundation model in order to `make targeted parameter updates` may help prevent forgetting, but `repeated application` of such updates `induces significant forgetting`.
Efficient Knowledge Representation
- `Retrieval-based` models such as `REALM, RAG, RETRO` take a `different approach` to model design `than simply increasing model parameters`.
- Instead of trying to `accumulate implicit knowledge` from ever larger datasets `directly into a DNN model` with billions of parameters, `retrieval-based` methods `store knowledge outside the model parameters` in the form of text passages, `capturing knowledge` as `dense vector representations`.
- These models use `top-k` search to extract knowledge relevant to each input, `while keeping the DNN model small` (see the sketch below).
- This results in `improved maintainability` of the model: developers can `update knowledge by replacing a text passage`, its dense vector representation and metadata, `without needing to retrain the large DNN`.
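A minimal sketch of the `top-k` dense retrieval step, with a random projection standing in for a trained passage/query encoder and an illustrative passage store:

```python
# Sketch: top-k retrieval over an external store of passage embeddings.
import numpy as np

passages = ["penguins are flightless birds",
            "the eiffel tower is in paris",
            "transformers use self attention"]

rng = np.random.default_rng(0)
def embed(texts, dim=64):                     # stand-in for a trained encoder
    return rng.standard_normal((len(texts), dim)).astype(np.float32)

index = embed(passages)                       # vectors live outside the LM's weights
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                        # cosine similarity against the store
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]

# Updating knowledge = replacing a passage and re-embedding it; no retraining.
print(retrieve(embed(["which birds cannot fly"])[0]))
```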
Drawbacks
- `Defects` of the model are `inherited` by all adapted models `downstream`.
- `Homogenization` creates a single point of failure. Foundation models are susceptible to `adversarial examples` and `data poisoning attacks`, which can be transferred to adapted applications.
- The `emergent` properties of foundation models generate `substantial uncertainty` about their capabilities.
- Existing foundation models are able to `memorize sensitive information` in the training data and `regurgitate` such data `when queried` via a standard API.
Basics
- An `ideal controllable image synthesis` approach should have: 1) `flexibility`: able to `control` different `spatial properties like position, shape, pose and expression` of the generated objects; 2) `precision`: `control` spatial properties `with high precision`; 3) `generality`: be `applicable to different objects` without being limited to certain categories.
- To `track points in videos`, an obvious approach is `optical flow estimation between consecutive frames`.
Previous Approaches
- Previous `GAN`-based methods `gain controllability` by using `manually annotated training data` or a `prior 3D model`, often lacking flexibility, precision and generality. They fail to generalize to new object categories.
- `Text-guided` image synthesis `lacks precision and flexibility` in terms of editing spatial attributes, for example moving an object by a specific number of pixels.
DragGAN
- DragGAN is a `point-based interactive image editing method` that does not require additional tracking models. It `outperforms SOTA` point tracking approaches like `RAFT` and `PIPs`.
- `Leverages a pre-trained GAN` to precisely follow user input while staying on the manifold of realistic images.
- The `latent code is optimized incrementally`, which `moves` multiple handle points to their corresponding target locations.
- It allows `editing` of `pose, shape, expression and layout` across `diverse object` categories.
- The user `sets handle points` interactively by clicking, together with `corresponding target points`, as `(handle point, target point)` pairs. The method then `moves the handle points to the target points` in GAN-generated images.
- Users can also `use binary masks to restrict editing to the masked region` denoting the movable area. The `mask reduces ambiguity` and `keeps certain regions fixed`.
- `Handling more than one point` with precise control `enables more diverse and accurate image manipulation`.
- This approach can `hallucinate occluded content`, like `teeth inside a lion's mouth`, and can `deform following the object's rigidity`, like `bending a horse leg`.
GAN Properties
- DragGAN is built on the `key insight` that the `feature space of a GAN is sufficiently discriminative` to enable both motion supervision and precise point tracking.
- DragGAN `deformation` is performed on the `learned image manifold of a GAN`, which `tends to obey the underlying object structure`.
StyleGAN Terminology
- In `StyleGAN2`, a 512-dimensional latent code `z` is mapped to a 512-dimensional intermediate latent code `w` via a `mapping network`. The space of `w` is referred to as `W`.
- The generator `G` produces the output image `I = G(w)` from the latent code `w`. `w` is copied several times and sent to different layers of the generator to control different attributes.
- Different `w` vectors can be sent to the `l` different layers, which forms the `W+` space. This space is less constrained and `more expressive` (see the sketch below).
- As the generator `G` learns to `map a low-dimensional space to a high-dimensional space`, it can be seen as `modelling the image manifold`.
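A minimal sketch of the `W` vs `W+` distinction, with a toy mapping network and an illustrative layer count:

```python
# Sketch: z -> w via a mapping network (W space); tiling w per generator layer
# gives W+ space, where each layer's w can be optimized independently.
import torch
import torch.nn as nn

mapping = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

z = torch.randn(1, 512)
w = mapping(z)                                       # a single w in W space
num_layers = 14                                      # illustrative layer count
w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)     # (1, num_layers, 512): W+ space
print(w.shape, w_plus.shape)
```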
Method
- `Optimizing the latent code` incrementally moves the handle points to the target locations, and a `point tracking method` faithfully traces the trajectory of the handle points.
- The user inputs `handle points and target points, each with an (x, y) location`. Each `handle point has a corresponding target point`.
- `Motion supervision drives the handle points towards the target points`. `Point tracking updates the handle points` to keep tracking the object in the image. This process `continues` until each `handle point reaches its corresponding target point`.
- Manipulation is performed in an `optimization` manner. Each optimization step has two parts: `motion supervision` and `point tracking`.
- After `each motion supervision` step a point moves a `small step`, but the `amount is unknown`.
- Point tracking is required because if the handle points (e.g., the nose of a lion) are not accurately updated, then the next motion supervision step will move the wrong points (e.g., the face of the lion).
Motion Supervision
- The intermediate features of the generator are discriminative enough that a simple loss suffices to supervise motion.
- The feature map `F` after the `6th block of StyleGAN2` is used, as it gives a good trade-off between discriminativeness and resolution.
- The feature map `F` is `resized` to the `same size as the final image` via `bilinear interpolation`.
- At `each motion supervision step`, the `loss` is used to `optimize` the latent code `w` for one step (see the sketch below).
- The spatial attributes of the image are affected by `w` for the first `6 layers`, while the remaining layers affect appearance. Thus, `w` is updated only for the first 6 layers while the others are fixed to preserve appearance.
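A simplified sketch of a single motion-supervision step in this spirit (not the authors' implementation): the feature sampled one unit step from the handle towards the target is pulled towards the detached feature at the handle, and only the latent code `w` is optimized. The `ToyGenerator`, its shapes, and the single-point/no-mask setup are stand-ins.

```python
# Simplified motion supervision sketch for one handle/target pair.
import torch
import torch.nn.functional as F_nn


class ToyGenerator(torch.nn.Module):
    # Stand-in for a StyleGAN2 generator returning (image, feature map).
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(512, 3 * 16 * 16 + 8 * 16 * 16)

    def forward(self, w):
        out = self.net(w)
        img = out[:, :3 * 16 * 16].view(1, 3, 16, 16)
        feat = out[:, 3 * 16 * 16:].view(1, 8, 16, 16)
        return img, feat


def sample_feature(feat, point):
    # feat: (1, C, H, W); point: (x, y) pixel coords -> bilinearly sampled (1, C).
    H, W = feat.shape[-2:]
    x = 2 * point[0] / (W - 1) - 1
    y = 2 * point[1] / (H - 1) - 1
    grid = torch.tensor([[[[x, y]]]], dtype=feat.dtype, device=feat.device)
    return F_nn.grid_sample(feat, grid, align_corners=True).view(1, -1)


def motion_supervision_step(generator, w, handle, target, lr=2e-3):
    w = w.detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    _, feat = generator(w)
    d = torch.tensor(target) - torch.tensor(handle)
    d = d / (d.norm() + 1e-8)                          # unit direction handle -> target
    shifted = (handle[0] + d[0].item(), handle[1] + d[1].item())
    # L1 loss between the feature slightly towards the target and the detached
    # feature at the current handle position; gradients flow only through w.
    loss = (sample_feature(feat, shifted) - sample_feature(feat, handle).detach()).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()


gen = ToyGenerator()
w = motion_supervision_step(gen, torch.randn(1, 512), handle=(4.0, 5.0), target=(10.0, 5.0))
```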
Point Tracking
- The feature map `F` after the `6th block of StyleGAN2` is also used here, at 256x256 resolution and interpolated to the same size as the image if needed (see the tracking sketch below).
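DragGAN tracks each handle point by a nearest-neighbour search in this feature space: the new handle location is the pixel in a small patch around the previous position whose feature is closest to the feature of the initial handle point. A simplified sketch (patch radius and shapes are illustrative):

```python
# Simplified point tracking sketch: nearest neighbour of the initial handle
# feature f0 within a small patch around the previous handle position.
import torch

def track_point(feat, f0, prev_point, radius=3):
    # feat: (C, H, W); f0: (C,); prev_point: (x, y) integer pixel coords.
    C, H, W = feat.shape
    x0, y0 = prev_point
    best, best_point = None, prev_point
    for y in range(max(0, y0 - radius), min(H, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(W, x0 + radius + 1)):
            dist = (feat[:, y, x] - f0).abs().sum().item()   # L1 in feature space
            if best is None or dist < best:
                best, best_point = dist, (x, y)
    return best_point

feat = torch.randn(8, 16, 16)
print(track_point(feat, f0=feat[:, 5, 4], prev_point=(4, 5)))   # -> (4, 5)
```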
Input Image Editing
- Performed using `GAN inversion` techniques that `embed` the input image into the `StyleGAN latent space`.
Datasets
- FFHQ, AFHQCat, LSUN Car, LSUN Cat, Landscapes HQ.
Limitations
- The `extrapolation capability` of DragGAN is `limited by the diversity of the training data`. `Deviating` from the `training data may lead to artifacts`.
- `Handle points` should be picked in `texture-rich` locations, as texture-less regions suffer from drift in tracking.