The year column represents the arXiv publishing year, not the year of the latest revision. Supplementary pdfs/docs are mostly not included in the links below. A paper may fall into multiple categories but is organized under a single general one.
Only papers worth implementing are added: important concepts that can be applied in the future, high-quality work worth revisiting, SOTA or close-to-SOTA results, or unique ideas. A paper is not added if the concept is not understood, is too hard, or nothing meaningful was found in it.
- 🚩 Represents overall good understanding: easier to understand thanks to diagrams, examples, or a good explanation in simple language.
- 🎯 Represents a partially reviewed paper.
Topic | Year |
---|---|
SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY | |
Attention Mechanisms in Computer Vision: A Survey | |
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning | 2023 |
Basics
- Targets `multi-task` Natural Language Understanding (NLU) objectives such as question answering, semantic similarity, document classification etc.
- Some of these tasks are part of the General Language Understanding Evaluation (GLUE) multi-task benchmark.
- Supervised learning suffers from the lack of large, high-quality annotated datasets. Learning from raw text removes the dependence on supervised-only methods.
GPT
- GPT (Generative Pre-Training) is a `semi-supervised` approach with `unsupervised pre-training` and `supervised fine-tuning`.
- `Two stages`: generative pre-training on unlabeled data, then discriminative fine-tuning for each specific task with task-aware input transformations and minimal changes to the model architecture.
- GPT acquires useful linguistic knowledge for `downstream tasks` and outperforms specifically crafted task-specific models.
- The goal is to `learn a universal representation` that transfers to a wide range of tasks with little adaptation.
- Input text is processed as a single contiguous sequence of tokens.
- `Training` requires a `large corpus of unlabeled data` and `manually annotated data` for each target task.
Model
- The `Transformer` is used as the model architecture due to its ability to `handle long-term dependencies` in text.
- A multi-layer `transformer decoder` is used for language modeling. The model is a multi-layer `decoder-only transformer` with masked self-attention heads (see the sketch below).
- `Learned positional embeddings` are used instead of the sinusoidal ones of the original transformer.
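A minimal sketch of such a decoder-only LM, assuming illustrative hyperparameters (`vocab_size`, `d_model`, layer count etc. are not the paper's values) and emulating masked self-attention with a causal mask over a standard PyTorch encoder stack:

```python
# Minimal sketch (not the paper's code): decoder-only transformer LM with
# masked self-attention and learned positional embeddings.
import torch
import torch.nn as nn


class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=10000, max_len=512, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned positional embedding instead of sinusoidal encoding.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        # An "encoder" stack plus a causal mask behaves as a decoder-only transformer.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)  # broadcast over batch
        # Causal mask: each position may only attend to previous positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h)                        # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    model = TinyDecoderLM()
    logits = model(torch.randint(0, 10000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 10000])
```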
Unsupervised Pre-training
- `Pre-training` acts as a `regularizer`, providing `better generalization` in deep neural nets.
- The `unsupervised pre-training` goal is to find a `good initialization point` rather than to modify the supervised objective.
- The unsupervised pre-training objective, based on a `context window of tokens`, predicts the likelihood of the next token (sketch below).
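A minimal sketch of that next-token objective as a cross-entropy loss, assuming the hypothetical `TinyDecoderLM` sketched above (not the paper's code):

```python
# Next-token prediction: minimize cross-entropy between the model's prediction
# for position i (computed from tokens < i) and the actual token at position i.
import torch.nn.functional as F

def lm_loss(model, tokens):                 # tokens: (batch, seq_len) of token ids
    logits = model(tokens[:, :-1])          # predictions for positions 1..N-1
    targets = tokens[:, 1:]                 # next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```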
Supervised Fine-tuning
- Uses a `labeled dataset` for the supervised task. An input sequence of tokens `x1, x2, ..., xN` has an output label `y`.
- An `additional linear output layer` is added after the final layer of the transformer to predict `y` for the given task.
- Uses the `label prediction objective` and additionally `language modeling as an auxiliary objective` (loss) from unsupervised pre-training during supervised fine-tuning (see the sketch below).
- The `extra parameters` added to the unsupervised pre-trained model are the final linear layer weights $W_y$ and the embeddings for the delimiter tokens.
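A minimal sketch of this setup, assuming the hypothetical `TinyDecoderLM` above, a binary classification task, and an illustrative `lambda_aux` weight for the auxiliary LM loss:

```python
# Sketch: task head on top of a pre-trained decoder LM, trained with the task
# loss plus the LM loss as an auxiliary objective. Attribute names follow the
# TinyDecoderLM sketch above (an assumption, not a fixed API).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPTClassifier(nn.Module):
    def __init__(self, pretrained_lm, d_model=256, n_classes=2):
        super().__init__()
        self.lm = pretrained_lm
        self.clf_head = nn.Linear(d_model, n_classes)   # the extra W_y parameters

    def forward(self, tokens):
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.lm.tok_emb(tokens) + self.lm.pos_emb(pos)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.lm.blocks(x, mask=mask)
        # Task logits from the final token's state, plus LM logits for the aux loss.
        return self.clf_head(h[:, -1]), self.lm.lm_head(h)


def finetune_loss(model, tokens, labels, lambda_aux=0.5):
    task_logits, lm_logits = model(tokens)
    task_loss = F.cross_entropy(task_logits, labels)
    aux_loss = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
    return task_loss + lambda_aux * aux_loss
```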
Input Transformation
- `Byte Pair Encoding (BPE)` is used for sub-word tokenization.
- Instead of using task-specific architectures, the inputs are `converted to a sequence of tokens` that the pretrained model can process (see the sketch below).
- Each input to the pretrained model includes `start and end tokens`, `<s>` and `<e>`.
- For `text entailment`, the premise `p` and hypothesis `h` token sequences include a delimiter token `$` between them.
- For `document QA` (Question Answering) and commonsense reasoning, the input includes a document `z`, a question `q`, and a set of answers $a_k$. The document context, question and each individual answer are concatenated with a delimiter token in between to produce the inputs to the pretrained model: ($z$, $q$, $\$$, $a_1$), ($z$, $q$, $\$$, $a_2$), ..., ($z$, $q$, $\$$, $a_K$). A softmax layer produces the output distribution over all possible answers.
- For `sentence similarity` with two sentences' token sequences `s1`, `s2` there is no natural ordering. Start and end tokens are added with a delimiter between the sentences to produce one output; the sentences are also swapped to produce a second output, and both outputs are concatenated before feeding the linear layer.
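A minimal sketch of these input transformations on token lists (the special-token strings here are only illustrative placeholders for the learned embeddings):

```python
# Sketch of GPT-style task-aware input transformations.
START, END, DELIM = "<s>", "<e>", "$"

def entailment_input(premise_tokens, hypothesis_tokens):
    # <s> premise $ hypothesis <e>
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [END]

def similarity_inputs(s1_tokens, s2_tokens):
    # No natural ordering: produce both orders; their final hidden states are
    # concatenated before the linear head.
    return (entailment_input(s1_tokens, s2_tokens),
            entailment_input(s2_tokens, s1_tokens))

def qa_inputs(doc_tokens, question_tokens, answers):
    # One sequence per candidate answer: <s> z q $ a_k <e>; a softmax over the
    # per-answer scores gives the distribution over answers.
    return [[START] + doc_tokens + question_tokens + [DELIM] + a + [END]
            for a in answers]
```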
Datasets
- Natural Language Inference (NLI): SNLI, MNLI, QNLI.
- Question Answering: SQuAD (Stanford Question Answering Dataset).
- Semantic Similarity: Quora Question Pairs (QQP), Semantic Textual Similarity benchmark (STS-B), Microsoft Paraphrase corpus (MRPC).
Prerequisite
- GPT-1 paper.
Applying Pre-trained Language Representations to Downstream Tasks
- There are `two strategies` for applying pre-trained language representations to downstream tasks: `fine-tuning` and `feature-based`.
- Feature-based representations are applied as `additional features` to task-specific architectures.
- Fine-tuning approaches like `GPT` introduce `minimal task-specific parameters` and are trained on downstream tasks by `fine-tuning all pre-trained parameters`.
Limitations of existing Unidirectional Language Models
- Standard language models are `unidirectional`. `GPT` uses a `left-to-right` architecture, where each token can `only attend to previous tokens` in the self-attention layers because future tokens are masked.
- This can be `harmful` when fine-tuning on `token-level tasks like Question Answering`, which require context incorporated from both directions.
BERT
- BERT (Bidirectional Encoder Representations from Transformers) is a `language representation model` introduced in the paper.
- Similar to GPT, BERT has a pre-training and a fine-tuning stage. Each downstream task has a separate fine-tuned model, even if they are all initialized with the same pre-trained parameters.
- In contrast to GPT, which is a transformer decoder model due to its left-only context, the bidirectional-context BERT is a `transformer encoder model`.
- Designed to `pretrain` deep bidirectional representations from `unlabeled data` by `jointly conditioning on left and right context` in all layers.
- `An additional output layer` is added to create `SOTA (state-of-the-art)` models that perform well on sentence-level and token-level tasks, including QA, language inference etc.
- `Alleviates the constraints` of unidirectional models (e.g. GPT) by introducing the `Masked Language Model (MLM)` pre-training task.
- Also uses a `Next Sentence Prediction (NSP)` task to jointly pretrain on text-pair representations.
- Similar to GPT, during fine-tuning `all parameters are fine-tuned`.
- For fine-tuning, task-specific inputs and outputs are added. Adding an extra output layer on top of the pre-trained model can be used for classification, sentiment analysis etc.
- BERT is effective in both feature-based (extracting fixed features from the pre-trained network) and fine-tuning (all parameters, including pre-trained ones) approaches.
- `Increasing model size` leads to `continual improvement` on large-scale tasks.
- Uses `WordPiece` tokenization.
Special Tokens
- The `[CLS]` token is added in front of every training example. The `final hidden state` corresponding to this token is used as the aggregate representation for `classification tasks`.
- `[SEP]` is a special separator token for separating sentence pairs, e.g. separating questions and answers.
- The `[MASK]` token is used during pre-training and not used in fine-tuning.
Input and Output Representations
- The BERT input representation can unambiguously handle both a single sentence and a sentence pair such as `<question, answer>` in one token sequence.
- Here, a sentence is an `arbitrary span of contiguous text`, rather than a linguistic sentence.
- A sequence refers to the `input token sequence` to BERT, which can be a `single sentence or a pair of sentences packed together` (see the sketch below).
- A `learned segment embedding` is added to every token to indicate whether it belongs to sentence A or sentence B, in addition to learned positional embeddings.
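A minimal sketch of packing a sentence pair into a single input sequence with segment ids (a toy whitespace split stands in for WordPiece, and strings stand in for integer ids):

```python
# Sketch: pack a <sentence A, sentence B> pair into one BERT-style sequence.
def pack_pair(sent_a, sent_b):
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    # Segment ids: 0 up to and including the first [SEP], 1 afterwards.
    first_sep = tokens.index("[SEP]")
    segment_ids = [0 if i <= first_sep else 1 for i in range(len(tokens))]
    position_ids = list(range(len(tokens)))   # indices for learned positional embeddings
    return tokens, segment_ids, position_ids

print(pack_pair("the man went to the store", "he bought milk"))
```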
Masked Language Model (MLM)
- MLM (the Cloze task), used for pre-training, `randomly masks some input tokens`, and the objective is to `predict the original vocabulary id` of each masked token based only on its context.
- MLM enables the representations to `fuse left and right context`.
- Naive bidirectional conditioning would allow each word to indirectly see itself.
- The `[MASK]` token appears only in pre-training and not in fine-tuning. To mitigate this mismatch, selected words are not always replaced with the mask token (see the sketch below).
- 80% of the time the word is replaced with the `[MASK]` token, 10% of the time with a random token, and 10% of the time it is kept unchanged. Cross-entropy loss is used to predict the original token. Example with the unlabeled sentence `my dog is hairy`: 80% of the time the input becomes `my dog is [MASK]`, 10% of the time a random word, `my dog is apple`, and 10% of the time the same word, `my dog is hairy`.
- Masked LM only makes predictions on 15% of the tokens in each batch. The model does not know which words it needs to predict or which have been replaced with random words.
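A minimal sketch of the 15% selection with 80/10/10 replacement (toy vocabulary, string tokens instead of ids):

```python
# Sketch of BERT-style MLM masking.
import random

MASK, VOCAB = "[MASK]", ["the", "a", "dog", "cat", "apple", "store"]

def mlm_mask(tokens, select_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)   # None = no loss here
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > select_prob:
            continue
        labels[i] = tok                      # loss is computed only at selected positions
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                 # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB) # 10%: replace with a random token
        # else: 10% keep the original token unchanged
    return inputs, labels

print(mlm_mask("my dog is hairy".split(), select_prob=0.5))
```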
Next Sentence Prediction (NSP)
- Used for pre-training; shown to be beneficial for QA and NLI.
- Downstream tasks like `Question Answering (QA)` and `Natural Language Inference (NLI)` are based on the `relationship between two sentences`, which is not directly captured by language modeling.
- For this task, 50% of the time sentence B is the `actual next sentence` following sentence A, labeled `IsNext`. The other 50% of the time sentence B is `randomly chosen` from the corpus and labeled `NotNext` (see the sketch below).
- Input example: `[CLS] the man went to [MASK] store [SEP] he bought a gallon of [MASK] milk [SEP]` with output label `IsNext`.
- Input example: `[CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds [SEP]` with output label `NotNext`.
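A minimal sketch of constructing NSP training pairs from a corpus of documents (the two-document corpus here is only illustrative):

```python
# Sketch: 50% true next sentence (IsNext), 50% random sentence (NotNext).
import random

def make_nsp_example(corpus_docs):
    doc = random.choice(corpus_docs)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"
    random_doc = random.choice(corpus_docs)       # may rarely pick the same doc;
    return sent_a, random.choice(random_doc), "NotNext"   # fine for a sketch

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(corpus))
```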
Foundation Models
- `Foundation models` are models trained on `broad data`, generally using `self-supervised learning at scale`, that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. Examples include BERT, DALL-E, GPT-3, CLIP etc.
- Foundation models provide a `general-purpose engine` for `processing multimodal information`.
- They are based on `deep neural networks (DNN)` and `self-supervised learning (SSL)`.
- They incentivize `homogenization`: the consolidation of methodologies for building ML systems across a wide range of applications.
- Transfer learning makes them possible; scale makes them powerful. Their `scale` results in `emergent capabilities`.
- They have led to an unprecedented level of homogenization: `almost all SOTA NLP models are adapted` from a few `foundation models: BERT, RoBERTa, BART, T5` etc.
- Scale also led to a surprising emergence: `in-context learning` in the 175B-parameter GPT-3. `In-context learning` allows `adaptation to downstream tasks` by simply providing the model with a `prompt`.
- They can `centralize information` from data across `various modalities (e.g., text, image, audio)` and can be `adapted` to a wide range of `downstream tasks (e.g., QA, image captioning, following instructions, object recognition)`.
- A `foundation model itself is incomplete`, but serves as a `common basis` from which many `task-specific models` are built via adaptation.
Basic Knowledge
- In deep learning, `pretraining` is the dominant `transfer learning approach`.
- In `self-supervised learning` the pretraining task is derived from `unlabeled data`.
- The `transformer` model architecture leverages `hardware parallelism` to train more expressive models.
Memory
- It is important to distinguish between `explicit facts` that can be `stored in external memory storage` (e.g., a vector database) and `implicit knowledge` reflected through the `trainable weights of the network`.
- `Decoupling explicit and implicit knowledge` enjoys multiple `benefits` compared to implicitly encoding all information together through the network weights. 🔗 paper link
- This separation `mitigates inflation in model size`, i.e. the `number of parameters` needed to store growing amounts of knowledge. It is also `key to memory update (model patching), manipulation and adaptation`. 🔗 paper link
Adaptations
- `Low storage adaptation` approaches include `fine-tuning final layer weights`, `only bias vectors`, and `low-rank weight tensors`.
- `Low memory adaptations` include `gradient checkpointing` to trade off computation for memory (see the sketch below).
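A minimal sketch of these options on a generic PyTorch model (the toy model and attribute names are illustrative, not tied to any specific foundation model):

```python
# Sketch: low-storage adaptation (final-layer-only, bias-only) and low-memory
# training via gradient checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))

def finetune_final_layer_only(model):
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():         # only the last Linear gets updated
        p.requires_grad = True

def finetune_bias_only(model):               # BitFit-style bias-only adaptation
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")

finetune_bias_only(model)

# Gradient checkpointing: recompute the block's activations in the backward
# pass instead of storing them, trading compute for memory.
x = torch.randn(8, 512, requires_grad=True)
h = checkpoint(model[:2], x, use_reentrant=False)    # first block checkpointed
out = model[2:](h)
out.sum().backward()
```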
Temporal Adaptations
- `Information is constantly changing`, e.g. clothing styles or newly elected heads of state. This `temporal shift` presents a problem due to the computationally demanding nature of training foundation models.
- `Temporally-partitioned` diagnostic datasets help quantify the rate at which language models become outdated.
- An alternative way of addressing `temporal shift` is to design `retrieval-based (semi-parametric) models`, which `augment the model` input with `additional context retrieved from large human-interpretable databases`. In this case, `adaptation` corresponds to `updating` individual units of `information in the database`.
Continual Learning
- A natural extension of adaptation is `continual learning` or `continual adaptation`: to `keep the model's knowledge up-to-date` with world events, `continually add data` from new domains or `modalities`.
- A problem is that `continual learning` induces `catastrophic forgetting` in neural networks, where `old tasks or data are rapidly forgotten` as the training distribution changes.
- `Memory mechanisms` have shown promise for continual learning in foundation models.
- `Techniques for localizing knowledge` in a foundation model in order to `make targeted parameter updates` may help prevent forgetting, but `repeated application` of such updates `induces significant forgetting`.
Efficient Knowledge Representation
- `Retrieval-based` models such as `REALM, RAG, RETRO` take a `different approach` to model design `than simply increasing model parameters`.
- Instead of trying to `accumulate implicit knowledge` from ever larger datasets `directly into a DNN model` with billions of parameters, `retrieval-based` methods `store knowledge outside the model parameters` in the form of text passages, `capturing knowledge` as `dense vector representations`.
- These models use `top-k` search to extract knowledge relevant to each input, `while keeping the DNN model small` (see the sketch below).
- This results in `improved maintainability` of the model: developers can `update knowledge by replacing a text passage`, its dense vector representation and metadata, `without needing to retrain the large DNN`.
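A minimal sketch of the `top-k` dense retrieval step, with a random projection standing in for a trained passage/query encoder and an illustrative passage store:

```python
# Sketch: top-k retrieval over an external store of passage embeddings.
import numpy as np

passages = ["penguins are flightless birds",
            "the eiffel tower is in paris",
            "transformers use self attention"]

rng = np.random.default_rng(0)
def embed(texts, dim=64):                     # stand-in for a trained encoder
    return rng.standard_normal((len(texts), dim)).astype(np.float32)

index = embed(passages)                       # vectors live outside the LM's weights
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                        # cosine similarity against the store
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]

# Updating knowledge = replacing a passage and re-embedding it; no retraining.
print(retrieve(embed(["which birds cannot fly"])[0]))
```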
Drawbacks
- `Defects` of the model are `inherited` by all adapted models `downstream`.
- `Homogenization` creates a single point of failure. Foundation models are susceptible to `adversarial examples` and `data poisoning attacks`, which can be transferred to adapted applications.
- The `emergent` properties of foundation models generate `substantial uncertainty` about their capabilities.
- Existing foundation models are able to `memorize sensitive information` in the training data and `regurgitate` such data `when queried` via a standard API.
Basics
- An `ideal controllable image synthesis` approach should have: 1) `flexibility`: able to `control` different `spatial properties like position, shape, pose and expression` of the generated objects; 2) `precision`: `control` spatial properties `with high precision`; 3) `generality`: be `applicable to different objects` without being limited to certain categories.
- To `track points in videos`, an obvious approach is `optical flow estimation between consecutive frames`.
Previous Approaches
- Previous `GAN`-based methods `gain controllability` by using `manually annotated training data` or a `prior 3D model`, often lacking flexibility, precision and generality. They fail to generalize to new object categories.
- `Text-guided` image synthesis `lacks precision and flexibility` in terms of editing spatial attributes, for example moving an object by a specific number of pixels.
DragGAN
- DragGAN is a `point-based interactive image editing method` that does not require additional tracking models. It `outperforms SOTA` point tracking approaches like `RAFT` and `PIPs`.
- `Leverages a pre-trained GAN` to precisely follow user input while staying on the manifold of realistic images.
- The `latent code is optimized incrementally`, which `moves` multiple handle points to their corresponding target locations.
- It allows `editing` of `pose, shape, expression and layout` across `diverse object` categories.
- The user `sets handle points` interactively by clicking, together with `corresponding target points`, as `(handle point, target point)` pairs. The method then `moves the handle points to the target points` in GAN-generated images.
- Users can also `use binary masks to restrict editing to the masked region` denoting the movable area. The `mask reduces ambiguity` and `keeps certain regions fixed`.
- `Handling more than one point` with precise control `enables more diverse and accurate image manipulation`.
- This approach can `hallucinate occluded content`, like `teeth inside a lion's mouth`, and can `deform following the object's rigidity`, like `bending a horse leg`.
GAN Properties
- DragGAN is built on the `key insight` that the `feature space of a GAN is sufficiently discriminative` to enable both motion supervision and precise point tracking.
- DragGAN `deformation` is performed on the `learned image manifold of a GAN`, which `tends to obey the underlying object structure`.
StyleGAN Terminology
- In `StyleGAN2`, a 512-dimensional latent code `z` is mapped to a 512-dimensional intermediate latent code `w` via a `mapping network`. The space of `w` is referred to as `W`.
- The generator `G` produces the output image `I = G(w)` from the latent code `w`. `w` is copied several times and sent to different layers of the generator to control different attributes.
- Different `w` vectors can be sent to the `l` different layers, which forms the `W+` space. This space is less constrained and `more expressive` (see the sketch below).
- As the generator `G` learns to `map a low-dimensional space to a high-dimensional space`, it can be seen as `modelling the image manifold`.
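A minimal sketch of the `W` vs `W+` distinction, with a toy mapping network and an illustrative layer count:

```python
# Sketch: z -> w via a mapping network (W space); tiling w per generator layer
# gives W+ space, where each layer's w can be optimized independently.
import torch
import torch.nn as nn

mapping = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

z = torch.randn(1, 512)
w = mapping(z)                                       # a single w in W space
num_layers = 14                                      # illustrative layer count
w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)     # (1, num_layers, 512): W+ space
print(w.shape, w_plus.shape)
```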
Method
- `Optimizing the latent code` incrementally moves the handle points to the target locations, and a `point tracking method` faithfully traces the trajectory of the handle points.
- The user inputs `handle points and target points, each with an (x, y) location`. Each `handle point has a corresponding target point`.
- `Motion supervision drives the handle points towards the target points`. `Point tracking updates the handle points` to keep tracking the object in the image. This process `continues` until each `handle point reaches its corresponding target point`.
- Manipulation is performed in an `optimization` manner. Each optimization step has two parts: `motion supervision` and `point tracking`.
- After `each motion supervision` step a point moves a `small step`, but the `amount is unknown`.
- Point tracking is required because if the handle points (e.g., the nose of a lion) are not accurately updated, then the next motion supervision step will move the wrong points (e.g., the face of the lion).
Motion Supervision
- The intermediate features of the generator are discriminative enough that a simple loss suffices to supervise motion.
- The feature map `F` after the `6th block of StyleGAN2` is used, as it gives a good trade-off between discriminativeness and resolution.
- The feature map `F` is `resized` to the `same size as the final image` via `bilinear interpolation`.
- At `each motion supervision step`, the `loss` is used to `optimize` the latent code `w` for one step (see the sketch below).
- The spatial attributes of the image are affected by `w` for the first `6 layers`, while the remaining layers affect appearance. Thus, `w` is updated only for the first 6 layers while the others are fixed to preserve appearance.
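A simplified sketch of a single motion-supervision step in this spirit (not the authors' implementation): the feature sampled one unit step from the handle towards the target is pulled towards the detached feature at the handle, and only the latent code `w` is optimized. The `ToyGenerator`, its shapes, and the single-point/no-mask setup are stand-ins.

```python
# Simplified motion supervision sketch for one handle/target pair.
import torch
import torch.nn.functional as F_nn


class ToyGenerator(torch.nn.Module):
    # Stand-in for a StyleGAN2 generator returning (image, feature map).
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(512, 3 * 16 * 16 + 8 * 16 * 16)

    def forward(self, w):
        out = self.net(w)
        img = out[:, :3 * 16 * 16].view(1, 3, 16, 16)
        feat = out[:, 3 * 16 * 16:].view(1, 8, 16, 16)
        return img, feat


def sample_feature(feat, point):
    # feat: (1, C, H, W); point: (x, y) pixel coords -> bilinearly sampled (1, C).
    H, W = feat.shape[-2:]
    x = 2 * point[0] / (W - 1) - 1
    y = 2 * point[1] / (H - 1) - 1
    grid = torch.tensor([[[[x, y]]]], dtype=feat.dtype, device=feat.device)
    return F_nn.grid_sample(feat, grid, align_corners=True).view(1, -1)


def motion_supervision_step(generator, w, handle, target, lr=2e-3):
    w = w.detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    _, feat = generator(w)
    d = torch.tensor(target) - torch.tensor(handle)
    d = d / (d.norm() + 1e-8)                          # unit direction handle -> target
    shifted = (handle[0] + d[0].item(), handle[1] + d[1].item())
    # L1 loss between the feature slightly towards the target and the detached
    # feature at the current handle position; gradients flow only through w.
    loss = (sample_feature(feat, shifted) - sample_feature(feat, handle).detach()).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()


gen = ToyGenerator()
w = motion_supervision_step(gen, torch.randn(1, 512), handle=(4.0, 5.0), target=(10.0, 5.0))
```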
Point Tracking
- The feature map `F` after the `6th block of StyleGAN2` is also used here, at 256x256 resolution and interpolated to the same size as the image if needed (see the tracking sketch below).
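DragGAN tracks each handle point by a nearest-neighbour search in this feature space: the new handle location is the pixel in a small patch around the previous position whose feature is closest to the feature of the initial handle point. A simplified sketch (patch radius and shapes are illustrative):

```python
# Simplified point tracking sketch: nearest neighbour of the initial handle
# feature f0 within a small patch around the previous handle position.
import torch

def track_point(feat, f0, prev_point, radius=3):
    # feat: (C, H, W); f0: (C,); prev_point: (x, y) integer pixel coords.
    C, H, W = feat.shape
    x0, y0 = prev_point
    best, best_point = None, prev_point
    for y in range(max(0, y0 - radius), min(H, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(W, x0 + radius + 1)):
            dist = (feat[:, y, x] - f0).abs().sum().item()   # L1 in feature space
            if best is None or dist < best:
                best, best_point = dist, (x, y)
    return best_point

feat = torch.randn(8, 16, 16)
print(track_point(feat, f0=feat[:, 5, 4], prev_point=(4, 5)))   # -> (4, 5)
```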
Input Image Editing
- Performed using `GAN inversion` techniques that `embed` the input image into the `StyleGAN latent space`.
Datasets
- FFHQ, AFHQCat, LSUN Car, LSUN Cat, Landscapes HQ.
Limitations
- The `extrapolation capability` of DragGAN is `limited by the diversity of the training data`. `Deviating` from the `training data may lead to artifacts`.
- `Handle points` should be picked in `texture-rich` locations, as texture-less regions suffer from drift in tracking.