
Implicit Generative Models Evaluation


Qualitative Evaluation

Nearest Neighbors

Real samples from the training set are displayed next to their nearest neighbors among the model's achievable generations; see the retrieval sketch after the list of cons below.

Cons :

  • Typically computed with Euclidean distance, which is very sensitive to minor perceptual perturbations
  • Overfitting to the training set makes this test trivial to pass
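
A minimal retrieval sketch, assuming `train` and `generated` are arrays of flattened images (names hypothetical):

```python
import numpy as np

def nearest_generated(train, generated):
    """For each training image, return the index of its nearest
    generated image under Euclidean distance.

    train:     (n_train, d) array of flattened real samples
    generated: (n_gen, d) array of flattened generated samples
    """
    # Pairwise squared Euclidean distances, shape (n_train, n_gen)
    dists = (
        np.sum(train**2, axis=1, keepdims=True)
        - 2.0 * train @ generated.T
        + np.sum(generated**2, axis=1)
    )
    return np.argmin(dists, axis=1)
```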

"Turing-like" tests

Measures the ability to fool human subjects with generated samples.

Cons :

  • Cumbersome and expensive; experimental hazards cause inconsistent evaluation settings between subjects
  • Fails to evaluate diversity, so overfitting models pass this test too

Visualizing Internals of the Model

Visualize representation disentanglement, latent space continuity, discriminator features and, more generally, any facet of the model's regularity.
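
As a sketch of the space-continuity check: decode evenly spaced points between two latent codes and inspect the resulting row of images (`generator` is a hypothetical decoding callable):

```python
import numpy as np

def latent_interpolation(generator, z_start, z_end, steps=8):
    """Decode evenly spaced points on the segment between two latent
    codes; abrupt jumps along the row suggest a discontinuous space."""
    alphas = np.linspace(0.0, 1.0, steps)
    zs = np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])
    return generator(zs)  # one decoded image per interpolation step
```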

Image Quality Assessment Metrics

Image quality assessment (IQA) measures the quality of an image, either with reference to an original image or without one. We review here some metrics that have been used in works on generative methods for remote sensing (Wang et al. 2019, Grohnfeldt et al. 2018).

PSNR (Peak Signal to Noise Ratio)

Compares the peak power of a clean image y to the power of the corrupting noise in its corrupted version x:

$$\mathrm{PSNR}(x, y) = 10 \log_{10} \frac{\mathrm{MAX}_y^2}{\mathrm{MSE}(x, y)}, \qquad \mathrm{MSE}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$

where $\mathrm{MAX}_y$ is the maximum attainable pixel value (e.g. 255 for 8-bit images).

Pros : Simple and cheap to compute; widely reported, which eases comparison across works

Cons : High sensitivity to biases in brightness
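
A direct numpy implementation of the expression above (a sketch; `max_value` must match the data's dynamic range):

```python
import numpy as np

def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between a corrupted image x
    and its clean reference y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0.0:
        return np.inf  # identical images
    return 10.0 * np.log10(max_value**2 / mse)
```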

SAM (Spectral Angle Mapper, Boardman et al. 1993)

Estimates spectral similarity by comparing the per-pixel spectra of the two images across bands.

Given a pair of N×N×d images x and y, the angle between corresponding d-dimensional spectra is averaged over pixels:

$$\mathrm{SAM}(x, y) = \frac{1}{N^2} \sum_{i,j} \arccos \frac{\langle x_{ij}, y_{ij} \rangle}{\lVert x_{ij} \rVert \, \lVert y_{ij} \rVert}$$

where $x_{ij}, y_{ij} \in \mathbb{R}^d$ are the spectra at pixel (i, j).

Variations :

  • Kernel-SAM : use kernel trick on base SAM expression

Pros : Invariant to the magnitude of the spectra, hence robust to global illumination or gain differences

Cons : Purely pixelwise; ignores spatial structure
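
A numpy sketch of the pixel-averaged expression above, for (N, N, d) arrays:

```python
import numpy as np

def sam(x, y, eps=1e-8):
    """Mean spectral angle (radians) between two (N, N, d) images."""
    dot = np.sum(x * y, axis=-1)                   # <x_ij, y_ij>
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard rounding errors
    return float(np.mean(np.arccos(cos)))
```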

SSIM (Structural Similarity Index, Wang et al. 2004)

Estimates structural disparities based on luminosity, contrast and structure for a pair of image windows x and y:

$$\mathrm{SSIM}(x, y) = l(x, y)^{\alpha} \, c(x, y)^{\beta} \, s(x, y)^{\gamma}$$

with luminosity, contrast and structure terms

$$l(x, y) = \frac{2 \mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \quad c(x, y) = \frac{2 \sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \quad s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}$$

where $\mu$, $\sigma$ and $\sigma_{xy}$ denote window means, standard deviations and covariance, and the $c_i$ are small stabilizing constants.

Pros : Finds large-scale mode collapse reliably

Cons : Fails to diagnose smaller effects such as loss of variation in colors and textures, and does not assess quality in terms of similarity to the dataset

Variations :

  • ESSIM: adds edge information
  • MS-SSIM: multi-scale comparison
  • FSIM: compares phase congruency and gradient magnitude
  • CW-SSIM: compares complex wavelet transform (deals with issues of image scaling, translation and rotation)
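
For reference, scikit-image ships an implementation with local windowing; a usage sketch (the `channel_axis` argument assumes a recent scikit-image version):

```python
import numpy as np
from skimage.metrics import structural_similarity

x = np.random.rand(128, 128, 3)  # stand-ins for a real/generated pair
y = np.random.rand(128, 128, 3)

score = structural_similarity(x, y, data_range=1.0, channel_axis=-1)
```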

Sharpness Difference (SD)

Pretty self-explanatory ? 😄 It measures how well image gradients, hence sharpness, are preserved between x and its reference y; one common formulation (Mathieu et al. 2016) is

$$\mathrm{SD}(x, y) = 10 \log_{10} \frac{\mathrm{MAX}_y^2}{\frac{1}{N} \sum_{i,j} \left| \nabla x_{ij} - \nabla y_{ij} \right|}$$

where $\nabla z_{ij} = |z_{i,j} - z_{i-1,j}| + |z_{i,j} - z_{i,j-1}|$ sums the absolute horizontal and vertical gradients at pixel (i, j).

Pros : Directly targets blurriness, which pixelwise losses such as MSE tend to encourage

Cons : Captures a single facet of quality; says nothing about content fidelity or sample diversity
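
A numpy sketch of the gradient-difference formulation above, assuming single-channel images:

```python
import numpy as np

def grad_sum(z):
    """Sum of absolute horizontal and vertical gradients per pixel."""
    gi = np.abs(np.diff(z, axis=0))[:, 1:]  # |z[i,j] - z[i-1,j]|
    gj = np.abs(np.diff(z, axis=1))[1:, :]  # |z[i,j] - z[i,j-1]|
    return gi + gj

def sharpness_difference(x, y, max_value=255.0):
    """Sharpness difference (dB); higher means gradients closer to y's."""
    diff = np.mean(np.abs(grad_sum(x) - grad_sum(y)))
    if diff == 0.0:
        return np.inf  # identical gradient fields
    return 10.0 * np.log10(max_value**2 / diff)
```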


Table of IQA metrics

| Category | Metric | Comment | Ref | Implementation |
| --- | --- | --- | --- | --- |
| Full-reference, error/distortion-based | Mean Absolute Error | - | - | `np.abs(x - y).mean()` |
| | Mean Squared Error | - | - | `np.square(x - y).mean()` |
| | PSNR | - | - | |
| | SVD-distortion | averages stretcher deviation by block | | None found |
| | Distortion Measure | didn't understand this one quite well | | None found |
| Full-reference, similarity-based | Structural Content | ratio of sums of squares | - | `np.mean(y**2 / x**2)` |
| | Mutual Information | - | - | |
| | Cross-Correlation | - | - | |
| | Spectral Angle Mapper | easy to implement | | None found |
| | Universal Index | lesser version of SSIM | | |
| | Structural Similarity Index (SSIM) | structure × luminosity × contrast | | |
| | Multiscale-SSIM | same, but over multiple image scales | | |
| | Features-SSIM | phase congruency and gradient magnitude | | None found |
| | Complex-Wavelet-SSIM | handles scaling, translation and rotation | | |
| No-reference | BRISQUE | estimates asymmetric generalized Gaussian params on the MSCN distribution; requires training | | |
| | GMLOGQA | gradient magnitude and Laplacian of Gaussian response; requires training | | |
| | ILNIQE | estimates Weibull params fitting gradient magnitude; requires training | | |
| | SSEQ | spatial and spectral entropy features; requires training | | |
| | ENIQA | improved SSEQ with multiple scales, Log-Gabor filters and a bandwise approach; requires training | | |
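
The full-reference one-liners from the table can be grouped into named callables for batch evaluation (a small usage sketch; x is the generated image, y the reference):

```python
import numpy as np

# Full-reference metrics listed in the table, as named callables
full_reference_metrics = {
    "mae": lambda x, y: np.abs(x - y).mean(),
    "mse": lambda x, y: np.square(x - y).mean(),
    "structural_content": lambda x, y: np.mean(y**2 / x**2),
}

def evaluate(x, y):
    """Compute every tabulated one-liner for a single image pair."""
    return {name: float(fn(x, y)) for name, fn in full_reference_metrics.items()}
```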

Probabilistic Measures

To be completed. As of now this is not a priority in the context of virtual remote sensing product generation: we have access to the generation ground truth and would rather focus on evaluation procedures based on comparison to it.

References