Transform the residuals from RegressOutNBreg()? #539

yueqiw · 2018-06-12T09:53:28Z

Hi,

The internal RegressOutNBreg() function writes the (pearson) residuals to the @scale.data slot. Since it uses raw counts as input, shouldn't the residuals still have very skewed distributions with long tails? Before doing the scaling and PCA step, is it a good idea to transform the residuals to more normally distributed?

In the RegressOutResid() function, there is the log(x-min) transform for poisson and negbinom regression:

if (use.umi) {
    data.resid <- log1p(
      x = sweep(
        x = data.resid,
        MARGIN = 1,
        STATS = apply(X = data.resid, MARGIN = 1, FUN = min),
        FUN = "-"
      )
    )
  }

But this part is missing in RegressOutNBreg(). Is there a reason behind not doing the transform? Thanks!

The text was updated successfully, but these errors were encountered:

ChristophH · 2018-06-12T12:19:27Z

While RegressOutNBreg() takes raw values as input, it outputs Pearson residuals, like you said. By definition this accounts for the expected variance and the result has (approximately) mean 0 and variance 1. The Pearson residuals thus have a meaning: the number of standard deviations that an observed value (UMI count) is above or below the expected value given the model. We only expect to see significant deviation from that model if 1) a gene is highly variable and this variability is driven by a factor that was not regressed out (e.g. a cell population marker), or 2) the NB model does not fit the data at all (not the case for the UMI data we have looked at so far). So, under the NB model there is no need to perform additional transformation.

yueqiw · 2018-06-13T01:50:33Z

Thanks!
I agree that NB model is a good choice to fit the UMI data. But the skewed distribution of pearson residuals for genes with biologically interesting variations might affect downstream analysis, such as PCA.

As far as I understand, there are 3 types of relevant situations:
(1) A gene's expression is very well explained by the unwanted variable (e.g. nUMI) that we want to regress out. In this case, we get only small residuals after RegressOut.
(2) A gene's expression is partially influenced by unwanted variables but also by biological variations (e.g. cell types). Cells will have different expected values predicted by unwanted variables. After RegressOut, unwanted variations are gone but many cells still have significant deviations/residuals due to biological factors.
(3) A gene's expression is mainly driven by biological factors (like many HVGs). We get large residuals after RegressOut. And the distribution of these residuals will have similar shapes as the raw counts.

The Pearson residuals thus have a meaning: the number of standard deviations that an observed value (UMI count) is above or below the expected value given the model.

The distribution of Pearson residuals for GLM is usually not normal but skewed, even though it's zero-mean and unit-variance. This is because the scaling does not change the overall shape of the distribution. In situations (2) and especially (3), for genes with interesting biological variations, we are left with large biological residuals, and their distributions should have similar shapes as raw counts. Using pearson residual for downstream analysis would be similar as using (raw_counts - mean)/std where raw_counts have been previously adjusted for technical variations.

We only expect to see significant deviation from that model if 1) a gene is highly variable and this variability is driven by a factor that was not regressed out (e.g. a cell population marker)

This is true as discussed above. But in for these genes, should we use the long-tailed pearson residuals?

yueqiw · 2018-06-13T03:19:46Z

Deviance residuals of negative binomial are much less skewed, although I'm not sure how to properly interpret it.

satijalab · 2018-06-13T05:49:35Z

Thanks for the interesting and well-thought out questions.

We choose not to scale these residuals because, as you say, this means that genes with 'biological' effects will have higher variance. Therefore they will contribute more to the PCA. We think this is a positive feature, and will drive the dimensional reduction to capture primarily biological effects.

This will down-weight the effect of genes whose variance is primarily explained by sequencing depth. Again, we think this is a positive feature, but there may be cases where that is undesirable. You can certainly transform the pearson residuals on your own, prior to performing PCA. If you do experiment with this, we'd be curious to see the results.

Equality thresholds in FindMarkers

ChristophH closed this as completed Jun 12, 2018

mojaveazure added the Analysis Question label Jun 12, 2018

andrewwbutler added a commit that referenced this issue Mar 12, 2021

Merge pull request #539 from satijalab/fix/diffexp.thresholds

04dabdd

Equality thresholds in FindMarkers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transform the residuals from RegressOutNBreg()? #539

Transform the residuals from RegressOutNBreg()? #539

yueqiw commented Jun 12, 2018 •

edited by mojaveazure

Loading

ChristophH commented Jun 12, 2018

yueqiw commented Jun 13, 2018 •

edited

Loading

yueqiw commented Jun 13, 2018

satijalab commented Jun 13, 2018

Transform the residuals from RegressOutNBreg()? #539

Transform the residuals from RegressOutNBreg()? #539

Comments

yueqiw commented Jun 12, 2018 • edited by mojaveazure Loading

ChristophH commented Jun 12, 2018

yueqiw commented Jun 13, 2018 • edited Loading

yueqiw commented Jun 13, 2018

satijalab commented Jun 13, 2018

yueqiw commented Jun 12, 2018 •

edited by mojaveazure

Loading

yueqiw commented Jun 13, 2018 •

edited

Loading