Better support for DE in scverse #189

Zethson · 2023-01-13T11:55:02Z

Continuation of a discussion in theislab/single-cell-best-practices#141

Goal

Implementation of useful helper methods for common DE tools. We don't plan to reimplement several DE methods. Other people are trying this already.

Tools

We could implement wrapper functions that make it streamlined to run DE tools within the scverse ecosystem. The output of the tools should be in the right slots of the AnnData object.

Most important

wrap basic methods from scanpy (wilcoxon, t-test, ...)
wrap pydesq2, see also scverse integration owkin/PyDESeq2#15

Other options

Disclaimer: while the rpy2 code snippets should work in principle, I haven't used them extensively and am not sure if they follow the best practices of the respective tool.

edgeR (through rpy2) [code snippet]
MAST (through rpy2) [code snippet] -> should additionally support LME models.
Limma (through rpy2)
statsmodels linear model wrapper [code snippet]

Input specification

I envisage something like

def de_analysis(
    adata: AnnData,
    groupby: str,
    method: Literal["t-test", "wilcoxon", "pydeseq2"],
    *,
    formula: Optional[str],
    contrast: Optional[Any],
    inplace: bool = True,
    key_addeed: Optional[str] = None,
) -> pd.DataFrame: 
    """
    Perform differential expression analysis
    
    Parameters
    ----------
    adata
        single-cell or pseudobulk AnnData object
    groupby
        column in adata.obs that contains the factor to test, e.g. `treatment`. 
        For simple statistical tests (t-test, wilcoxon), it is sufficient to specify groupby. Linear models require to specify
        a formula. In that case, the `groupby` column is used to compute the contrast. 
    method
        Which method to use to perform the DE test
    formula
        model specification for linear models. E.g. `~ treatment + sex + age`. 
        MUST contain the factor specified in `groupby`.         
    contrast
        TODO
        See e.g. https://www.statsmodels.org/devel/contrasts.html for more information.
    inplace
        if True, save the result in `adata.varm[key_added]`
    key_added
        Key under which the result is saved in `adata.varm` if inplace is True.
        If set to None this defaults to `de_{method}_{groupby}`. 
    """

Clearly, the challenge here is to find an interface that works with both simple methods like t-test and the more flexible linear models. It will also always be a tradeoff between simplicity and the possibility to use all features of the underlying methods (e.g. do we want to support arbitrary contrasts? Do we want to support multiple contrasts for a single model fit?).
Some of these cases could require splitting the function up into a fit and a compute_contrast step. In that case an OOP interface might be more appropriate.

Output specification

Each DE tool should output a data frame with at least the following columns

gene_id
log2 fold change
mean expression
unadjusted p-value
adjusted p-value

and optionally other columns (may depend on the method).

Plots

Add volcano plots from decoupler/wrap them (plot_volcano)

Note: an (optionally) interactive version based on plotly/altair could be really nice -- hover over the dots to view the gene name.
Add paired or unpaired scatter+boxplots are nice ways of visualizing individual genes (each dot is a pseudobulk-sample)

[code snipped based on seaborn]
Add bar charts. Scatterplot on top for paired observations. Each dot represents the fold change for one paired observation.
[code snipped based on altair]
heatmap of coefficients with FDR correction

[code snippet based on altair]

The text was updated successfully, but these errors were encountered:

Zethson · 2023-01-13T11:55:50Z

@grst what else do you think we should add? Feel free to edit my issue (hope you have the rights now - if not please ping me).

grst · 2023-01-13T12:08:59Z

Tools

Personally, I never made good experiences with rpy2 -- which is one reason why I stopped using these functions in more recent projects. Instead I performed DE in R and connected it through a workflow manager.

Are you sure you want to support that? It would also entail quite complicated CI setups if we were to test it.

Maybe we start with pydeseq2, t-test, wilcoxon test?

Plots

another nice one I forgot, which is especially useful when comparing between multiple conditions:

Visualization frameworks

My plotting functions are a mix of altair and matplotlib. Do you want to stick to one of them?

Zethson · 2023-01-13T12:27:08Z

Personally, I never made good experiences with rpy2 -- which is one reason why I stopped using these functions in more recent projects. Instead I performed DE in R and connected it through a workflow manager.

Are you sure you want to support that? It would also entail quite complicated CI setups if we were to test it.

rpy2 is at the moment an optional dependency for pertpy because milopy can also work with edgeR. I'd keep stick to this for support for the R options.

Maybe we start with pydeseq2, t-test, wilcoxon test?

Yup. We'd have to discuss what the interface should look like. Would you base it on the current scanpy t-test/wilcoxon test or would you redesign it completely? If yes, what should it look like?

My plotting functions are a mix of altair and matplotlib. Do you want to stick to one of them?

Preferably matplotlib. We replaced altair code in other tools with matplotlib equivalents.

grst · 2023-01-13T12:39:06Z

Would you base it on the current scanpy t-test/wilcoxon test or would you redesign it completely? If yes, what should it look like?

I would probably use the scanpy implementation but redesign the output. IMO all tools should return a dataframe with one row per var with p-value and logFC. This dataframe could optionally be saved in adata.varm.

I also have a wrapper for a statsmodels linear model which could be relevant here:
https://github.com/icbi-lab/luca/blob/89c4e6109bc723f6958cae7af791398b28e8e422/lib/scanpy_helper_submodule/scanpy_helpers/compare_groups/lm.py#L24-L151

Preferably matplotlib. We replaced altair code in other tools with matplotlib equivalents.

Fair enough, shouldn't be too hard to rewrite it. Maybe an excuse to play with the new seaborn API.

Another point to consider is how the interface would look like. groupby is nice for a simple pairwise or groupwise comparison, but what about more complicated cases? Do we allow to specify patsy model strings?

Zethson · 2023-01-13T13:18:34Z

I would probably use the scanpy implementation but redesign the output. IMO all tools should return a dataframe with one row per var with p-value and logFC. This dataframe could optionally be saved in adata.varm.

Yes, that makes a lot of sense.

I also have a wrapper for a statsmodels linear model which could be relevant here: https://github.com/icbi-lab/luca/blob/89c4e6109bc723f6958cae7af791398b28e8e422/lib/scanpy_helper_submodule/scanpy_helpers/compare_groups/lm.py#L24-L151

Yeah, why not.

Preferably matplotlib. We replaced altair code in other tools with matplotlib equivalents.

Fair enough, shouldn't be too hard to rewrite it. Maybe an excuse to play with the new seaborn API.

Yup, but we also have manpower for this if you don't want to do this yourself.

Another point to consider is how the interface would look like. groupby is nice for a simple pairwise or groupwise comparison, but what about more complicated cases? Do we allow to specify patsy model strings?

I wouldn't really need them for a quick t-test or Wilcoxon test but when doing serious work with edgeR/deseq2 for complex setups - yes I think that we need to support them.

grst · 2023-01-13T13:24:12Z

Yup, but we also have manpower for this if you don't want to do this yourself.

ok, even better. So implementation-wise, what would you need from my side?

Zethson · 2023-01-13T14:44:59Z

Yup, but we also have manpower for this if you don't want to do this yourself.

ok, even better. So implementation-wise, what would you need from my side?

If you explicitly designed an interface (parameters that you want and output DF) it would certainly help and guide one of my students. I am confident though that I could also do that if you're busy.

Could you please edit my post and add links to the current implementations to all of your plots?

You are of course welcome to contribute to pertpy, but if you're busy one of my students will help out.

grst · 2023-01-13T19:45:05Z

Updated the comment and added some draft specs for further discussion.

Zethson · 2023-01-13T21:33:34Z

Excellent, thank you very much.
So would the de_analysis have an additional parameter called "method" to which this is passed to? Or would you have separate public functions per test?

grst · 2023-01-16T06:50:20Z

Good point. I would have said the former. Updated the specs accordingly.

Zethson · 2023-01-16T09:17:36Z

What do we do with https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.rank_genes_groups.html#scanpy.pl.rank_genes_groups and all of its siblings? Think it expects the old format?

grst · 2023-01-16T09:37:32Z

tbh, I'm not a big fan of how scanpy currently stores the DE results. So I wouldn't mind if the scanpy functions were rewritten. But that would be either a breaking change or require a bunch of code handling legacy formats, so not sure if this is going to happen in a reasonable timeframe.

Alternatively, these functions could be mirrored in pertpy, together with the other plotting functions planned. I have a code snippet for adding a data frame to anndata such that it works with the scanpy plotting functions here.

Zethson · 2023-01-16T09:42:22Z

tbh, I'm not a big fan of how scanpy currently stores the DE results. So I wouldn't mind if the scanpy functions were rewritten. But that would be either a breaking change or require a bunch of code handling legacy formats, so not sure if this is going to happen in a reasonable timeframe.

Unlikely.

Alternatively, these functions could be mirrored in pertpy, together with the other plotting functions planned. I have a code snippet for adding a data frame to anndata such that it works with the scanpy plotting functions here.

So the mirroring would work as follows: users calls

pt.pl.rank_genes_groups(adata, key="MYDERESULTSDF")

Under the hood we call your function and then call the scanpy plotting function?

Or how would you design it?

grst · 2023-01-16T09:50:50Z

Maybe by using a context manager that temporarily adds the converted data frame to adata.uns?

def rank_genes_groups(adata, key, *args, **kwargs):
   with _add_de_res_for_scanpy(adata, key) as adata:
       sc.pl.rank_genes_groups(adata, *args, **kwargs)

Zethson · 2023-01-16T09:54:11Z

Yeah, that's a lovely idea. Do you think that we should expose another "add_de_res_for_pertpy" function that does the same thing but this time load the DE results into adata.varm? We know where to store such dataframes, but newbies are always confused with the slots and varm is rarely used.

But I think that we've got it figured out! Do you want to tackle any of this? If not, I'll find someone as mentioned.

grst · 2023-01-16T10:05:51Z

Yeah, that's a lovely idea. Do you think that we should expose another "add_de_res_for_pertpy" function that does the same thing but this time load the DE results into adata.varm

Either that, or allow to pass a data frame directly to the plotting function:

def plot_something(
    adata: AnnData,
    de_res: pd.DataFrame | str,
    p_col: str = "FDR",
    fc_col: str = "log2FC",
):
    """
    de_res:
         This may be either
            * the key under which the results are stored in `adata.varm`, or
            * a pandas data frame with DE results. The data frame must use gene 
               identifiers from `adata.var_names` as index. 
    """

But I think that we've got it figured out! Do you want to tackle any of this? If not, I'll find someone as mentioned.

If you had someone it would be awesome! I'm happy to review if required.

Zethson · 2024-09-16T10:44:15Z

Closing this now in favor of more specific issues.

Zethson added the enhancement New feature or request label Jan 13, 2023

Zethson assigned Zethson and grst Jan 13, 2023

Zethson assigned timtreis Jan 16, 2023

ivirshup mentioned this issue Jan 18, 2023

Call numba.set_num_threads when n_jobs is specified (e.g. in scanpy.tl.rank_genes_group) scverse/scanpy#2390

Open

1 task

grst mentioned this issue Dec 1, 2023

Interaction with plotting functions scverse/multi-condition-comparisions#24

Closed

Zethson mentioned this issue Mar 22, 2024

0.7.0 release #563

Closed

6 tasks

Zethson unassigned timtreis Apr 28, 2024

Lilly-May self-assigned this Aug 30, 2024

Lilly-May mentioned this issue Sep 5, 2024

Add plots for DE analysis and unify plotting API #654

Merged

2 tasks

Zethson closed this as completed Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better support for DE in scverse #189

Better support for DE in scverse #189

Zethson commented Jan 13, 2023 •

edited by grst

Loading

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 16, 2023

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023 •

edited

Loading

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023 •

edited

Loading

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023

Zethson commented Sep 16, 2024

Better support for DE in scverse #189

Better support for DE in scverse #189

Comments

Zethson commented Jan 13, 2023 • edited by grst Loading

Goal

Tools

Input specification

Output specification

Plots

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Tools

Plots

Visualization frameworks

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 13, 2023

Zethson commented Jan 13, 2023

grst commented Jan 16, 2023

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023 • edited Loading

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023 • edited Loading

Zethson commented Jan 16, 2023

grst commented Jan 16, 2023

Zethson commented Sep 16, 2024

Zethson commented Jan 13, 2023 •

edited by grst

Loading

grst commented Jan 16, 2023 •

edited

Loading

grst commented Jan 16, 2023 •

edited

Loading