Implicit Generative Models Evaluation

Qualitative Evaluation

Nearest Neighbors

Real samples from the training set are displayed next to their nearest neighbors among the samples the model can generate.

Cons :

Typically computed with the Euclidean distance, which is very sensitive to perceptually minor perturbations (e.g. a shift of a few pixels)
Overfitting to the training set makes it trivial to pass this test
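The sensitivity of the Euclidean distance noted above can be illustrated with a minimal sketch. The function name `nearest_neighbors` and the toy images are assumptions made for illustration: a checkerboard shifted by one pixel ends up farther from the original than a flat gray image does.

```python
import numpy as np

def nearest_neighbors(queries, candidates):
    """For each query image, return the index of and the Euclidean
    distance to its closest candidate image."""
    q = queries.reshape(len(queries), -1).astype(np.float64)
    c = candidates.reshape(len(candidates), -1).astype(np.float64)
    # Pairwise squared distances: ||q||^2 - 2 q.c + ||c||^2
    d2 = (q ** 2).sum(axis=1)[:, None] - 2.0 * q @ c.T + (c ** 2).sum(axis=1)[None, :]
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    idx = d2.argmin(axis=1)
    return idx, np.sqrt(d2[np.arange(len(q)), idx])

# Toy example: an 8x8 checkerboard, its one-pixel shift, and a flat gray image.
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(np.float64)
shifted = np.roll(checker, shift=1, axis=1)  # perceptually almost identical
gray = np.full((8, 8), 0.5)                  # perceptually very different

idx, dist = nearest_neighbors(checker[None], np.stack([shifted, gray]))
# idx[0] is 1: the gray image (distance 4) beats the shifted copy (distance 8).
```

A perceptual or feature-space distance would be a more robust choice, at the cost of depending on a pretrained feature extractor.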

"Turing-like" tests

Measure ability to fool subjects with generated samples

Cons :

Cumbersome and expensive; experimental hazards cause inconsistent evaluation settings between subjects
Fails to evaluate diversity: overfitting models pass this test too

Visualizing Internals of the Model

Visualize representation disentanglement, latent space continuity, discriminator features and, more generally, any facet of the model's regularity.

Probabilistic Measures

Image Quality Assessment Metrics

Image quality assessment measures the quality of an image, either with or without reference to an original image. We review here some metrics that have been used in works on generative methods for remote sensing (Wang et al. 2019, Grohnfeldt et al. 2018).

PSNR (Peak Signal to Noise Ratio)

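Compares the peak power of a clean image y to the power of the noise that corrupts its degraded version x:

$$\mathrm{PSNR}(x, y) = 10 \log_{10} \left( \frac{\mathrm{MAX}_y^2}{\mathrm{MSE}(x, y)} \right)$$

where MAX_y is the maximum possible pixel value of y and MSE(x, y) is the mean squared error between x and y.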

Pros :

Cons : High sensitivity towards biases in brightness
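A minimal NumPy sketch of PSNR, assuming the standard definition (10 log10 of the squared peak value over the mean squared error) and images scaled to [0, 1]. The function name `psnr` and the `max_value` parameter are illustrative choices. The example shows the brightness sensitivity mentioned above: a uniform offset of 0.1, which leaves all structure intact, already caps the score at about 20 dB.

```python
import numpy as np

def psnr(x, y, max_value=1.0):
    """Peak signal-to-noise ratio in dB between a corrupted image x
    and a clean reference y. max_value is the largest possible pixel value."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 0.9, size=(32, 32))
brighter = clean + 0.1  # same content, slightly brighter

score = psnr(brighter, clean)  # about 20 dB, since MSE = 0.1 ** 2
```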

SAM (Spectral Angle Mapper, Boardman et al. 1993)

Estimates spectral similarity by measuring the angle between the spectra of corresponding pixels, treated as vectors with one component per band.

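Given a pair of NxNxd images x and y, with x_ij and y_ij the d-dimensional spectral vectors at pixel (i, j), we have:

$$\mathrm{SAM}(x, y) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \arccos \left( \frac{\langle x_{ij}, y_{ij} \rangle}{\|x_{ij}\|_2 \, \|y_{ij}\|_2} \right)$$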
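As a rough sketch, SAM can be computed by taking the angle between the two spectral vectors at each pixel and averaging over the image. Averaging over pixels, the function name `spectral_angle_mapper` and the `eps` guard are assumptions of this example rather than part of the original definition.

```python
import numpy as np

def spectral_angle_mapper(x, y, eps=1e-12):
    """Mean spectral angle (in radians) between two images of shape (N, N, d)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dot = (x * y).sum(axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    # Clip to [-1, 1] so rounding errors cannot push arccos out of its domain.
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return np.arccos(cos).mean()

rng = np.random.default_rng(0)
img = rng.uniform(0.1, 1.0, size=(16, 16, 4))

same_spectra = spectral_angle_mapper(img, 2.0 * img)  # close to 0: scaling keeps the angle

# Two images whose spectra are orthogonal at every pixel.
a = np.zeros((16, 16, 2)); a[..., 0] = 1.0
b = np.zeros((16, 16, 2)); b[..., 1] = 1.0
orthogonal = spectral_angle_mapper(a, b)  # pi / 2
```

The first result shows that SAM ignores per-pixel intensity scaling, so it is usually paired with a metric that is sensitive to brightness.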

Variations :

Kernel-SAM : applies the kernel trick to the base SAM expression

Pros :

Cons :

SSIM (Structural Similarity Index, Wang et al. 2004)

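Estimates structural disparities based on luminance, contrast and structure for a pair of image windows x and y as:

$$\mathrm{SSIM}(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma}$$

where l, c and s are the luminance, contrast and structure comparison terms and the exponents weight their relative importance.

See here for the luminance, contrast and structure expressions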
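The three terms can be sketched for a single window as follows, assuming the expressions of Wang et al. 2004 with all exponents set to 1 and stabilizing constants C1 = (0.01 L)^2, C2 = (0.03 L)^2, C3 = C2 / 2, where L is the pixel value range. The function name `ssim_window` is illustrative, and population statistics are used instead of the unbiased estimators as a simplification. Practical implementations slide a window over the image and average the result.

```python
import numpy as np

def ssim_window(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM between two image windows, treated as one global window."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    sig_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)
    structure = (sig_xy + c3) / (sig_x * sig_y + c3)
    return luminance * contrast * structure

rng = np.random.default_rng(0)
window = rng.uniform(0.0, 1.0, size=(8, 8))

identical = ssim_window(window, window)       # 1 up to rounding
inverted = ssim_window(window, 1.0 - window)  # negative: structure is anti-correlated
```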

Pros : Finds large-scale mode collapse reliably

Cons : Fails to diagnose smaller effects such as loss of variation in colors and textures, and does not assess quality in terms of similarity to the dataset

Variations :

ESSIM: adds edge information
MS-SSIM: multi-scale comparison
FSIM: compares phase congruency and gradient magnitude
CW-SSIM: compares complex wavelet transform (deals with issues of image scaling, translation and rotation)

Sharpness Difference (SD)

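Measures the loss of sharpness between a pair of NxN images x and y by comparing their gradients:

$$\mathrm{SD}(x, y) = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\mathrm{GRADS}(x, y)} \right)$$

where

$$\mathrm{GRADS}(x, y) = \frac{1}{N^2} \sum_{i} \sum_{j} \left| (\nabla_i x + \nabla_j x) - (\nabla_i y + \nabla_j y) \right|$$

with $\nabla_i x = |x_{i,j} - x_{i-1,j}|$, $\nabla_j x = |x_{i,j} - x_{i,j-1}|$, and MAX the maximum possible pixel value.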
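A minimal sketch of the sharpness difference, assuming the commonly used gradient-based definition: 10 log10 of the squared peak value over the mean absolute difference between the summed vertical and horizontal gradient magnitudes of the two images. The function name, the `max_value` parameter and the choice to average only over pixels where both finite differences exist are assumptions of this example.

```python
import numpy as np

def sharpness_difference(x, y, max_value=1.0):
    """Sharpness difference in dB between two 2D images x and y."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    def gradient_sum(img):
        # Absolute finite differences along rows (i) and columns (j),
        # restricted to pixels where both differences are defined.
        grad_i = np.abs(img[1:, 1:] - img[:-1, 1:])
        grad_j = np.abs(img[1:, 1:] - img[1:, :-1])
        return grad_i + grad_j

    grads = np.mean(np.abs(gradient_sum(x) - gradient_sum(y)))
    if grads == 0:
        return float("inf")  # identical gradient maps
    return 10.0 * np.log10(max_value ** 2 / grads)

# A sharp vertical edge compared with a completely flat image.
edge = np.zeros((4, 4)); edge[:, 2:] = 1.0
flat = np.zeros((4, 4))

sd = sharpness_difference(flat, edge)  # 10 * log10(3), about 4.77 dB
```

Higher values mean the gradients of the two images agree more closely; an image with exactly the same gradient map as the reference scores infinity.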

Pros :

Cons :

Sanity Check at Mastered Tasks

References

Pros and Cons of GAN Evaluation Measures, Borji 2018 : comprehensive overview of GAN evaluation measures
A note on the evaluation of generative models, Theis et al. 2015 : good explanations of why some measures are inconsistent with each other
More to come
