Effect of a very small number of trees on AUC, and comparing AUC across experiments #14

Open
vincentrose88 opened this issue Nov 4, 2021 · 6 comments

@vincentrose88

Hi

Great work on the R-package Augur!

I'm using it to prioritise cell-type responses to treatment of a disease in two setups, 1x treatment and 5x treatment, and I have a couple of questions on how to interpret and use the AUC results:

AUC comparison across experiments?

My question is: Can I compare the AUC across these experiments directly, or can I only use the rank?

For example, in the results below: does Cell-type_A have a comparable response in G2 and G1, while Cell-type_I has a significantly bigger response in G2 than in G1?

Results

The experimental groups and results are (anonymised because this is a client's data):

G1: 1x treatment + disease (case) VS 1x placebo + disease (control)

  cell_type     auc
  <chr>         <dbl>
1 Cell-type_B   0.952
2 Cell-type_A   0.944
3 Cell-type_C   0.838
4 Cell-type_E   0.719
5 Cell-type_D   0.707
6 Cell-type_F   0.668
7 Cell-type_H   0.666
8 Cell-type_G   0.666
9 Cell-type_I   0.640

G2: 5x treatment + disease (case) VS 5x placebo + disease (control)

  cell_type    auc
  <chr>        <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931
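
To make the question concrete, the two tables above can be put side by side. The following is only a descriptive sketch in base R: the AUC values are copied from this post, while the column names (`delta`, `rank_g1`, `rank_g2`) are illustrative choices. The raw difference is not a significance test.

```r
# AUCs as reported above for G1 and G2.
g1 <- data.frame(
  cell_type = paste0("Cell-type_", c("B", "A", "C", "E", "D", "F", "H", "G", "I")),
  auc = c(0.952, 0.944, 0.838, 0.719, 0.707, 0.668, 0.666, 0.666, 0.640)
)
g2 <- data.frame(
  cell_type = paste0("Cell-type_", c("A", "B", "C", "D", "E", "F", "G", "H", "I")),
  auc = c(0.991, 0.976, 0.974, 0.957, 0.957, 0.953, 0.946, 0.935, 0.931)
)

# Join on cell type and compare both the AUC values and the ranks.
cmp <- merge(g1, g2, by = "cell_type", suffixes = c("_g1", "_g2"))
cmp$delta   <- cmp$auc_g2 - cmp$auc_g1
cmp$rank_g1 <- rank(-cmp$auc_g1, ties.method = "min")
cmp$rank_g2 <- rank(-cmp$auc_g2, ties.method = "min")

# Largest AUC increase from G1 to G2 first.
cmp <- cmp[order(-cmp$delta), ]
print(cmp, row.names = FALSE)

# Agreement between the two rankings.
rho <- cor(cmp$auc_g1, cmp$auc_g2, method = "spearman")
```

Sorted this way, Cell-type_I shows the largest raw increase and Cell-type_A and Cell-type_B the smallest, which is the pattern the question asks about.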

Effect of the number of trees on AUC

For the experiment group G2 (5x treatment vs 5x placebo), I only get useful results if I use a very low number of trees, as you suggest in your paper (Methods: Hyperparameter analysis):

[…] Empirically, we suggest decreasing the number of trees in the random forest classifier in scenarios where perfect classification can be achieved for many cell types (Supplementary Fig. 10g).

My question is simply: Does it make sense to have so few trees?

Results

(Only number of trees changes, all other options are default)

Num_tree = 50

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_E  1   
2 Cell-type_A  1   
3 Cell-type_I  1
4 Cell-type_D  1
5 Cell-type_H  1
6 Cell-type_F  1
7 Cell-type_C  1
8 Cell-type_B  1
9 Cell-type_G  1

Num_tree = 10

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_E  1   
2 Cell-type_A  1   
3 Cell-type_I  1.00
4 Cell-type_D  1.00
5 Cell-type_H  1.00
6 Cell-type_F  1.00
7 Cell-type_C  1.00
8 Cell-type_B  1.00
9 Cell-type_G  1.00

Num_tree = 5

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  1.00 
2 Cell-type_B  0.999
3 Cell-type_E  0.998
4 Cell-type_C  0.996
5 Cell-type_D  0.996
6 Cell-type_F  0.995
7 Cell-type_H  0.993
8 Cell-type_G  0.993
9 Cell-type_I  0.990

Num_tree = 3

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.996
2 Cell-type_B  0.995
3 Cell-type_C  0.989
4 Cell-type_F  0.984
5 Cell-type_D  0.983
6 Cell-type_E  0.982
7 Cell-type_G  0.979
8 Cell-type_H  0.975
9 Cell-type_I  0.965

Num_tree = 2

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931

Num_tree = 1

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.942
2 Cell-type_B  0.933
3 Cell-type_C  0.893
4 Cell-type_G  0.873
5 Cell-type_E  0.870
6 Cell-type_D  0.870
7 Cell-type_F  0.857
8 Cell-type_H  0.839
9 Cell-type_I  0.812
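
One way to summarise the tables above is the spread (maximum minus minimum AUC) that remains between cell types at each setting. This is a small base-R sketch using the values as printed in this post; the runs with 10 and 50 trees are left out because every cell type there rounds to 1.

```r
# AUCs per number of trees, copied from the tables above.
auc_by_trees <- list(
  "1" = c(0.942, 0.933, 0.893, 0.873, 0.870, 0.870, 0.857, 0.839, 0.812),
  "2" = c(0.991, 0.976, 0.974, 0.957, 0.957, 0.953, 0.946, 0.935, 0.931),
  "3" = c(0.996, 0.995, 0.989, 0.984, 0.983, 0.982, 0.979, 0.975, 0.965),
  "5" = c(1.000, 0.999, 0.998, 0.996, 0.996, 0.995, 0.993, 0.993, 0.990)
)

# Spread between the best and worst separated cell type at each setting.
spread <- sapply(auc_by_trees, function(x) max(x) - min(x))
print(round(spread, 3))
```

The spread shrinks steadily as trees are added, which is the saturation effect described above.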

Looking forward to your feedback and thanks in advance!

Kind regards

@jordansquair
Collaborator

Are you using a Seurat object as input, or directly a count/normalized matrix? If a Seurat object, can you check the default assay?

@vincentrose88
Author

vincentrose88 commented Nov 4, 2021

Yes, I'm using a Seurat object and the default assay is "integrated":

> DefaultAssay(seurat_obj)
[1] "integrated"

@jordansquair
Collaborator

You will want to switch that back to "RNA" or directly input the count matrix.

DefaultAssay(obj) = "RNA"

Then run Augur.
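
Put together, those two steps might look like the sketch below. The wrapper `run_augur_on_rna` and its argument defaults are hypothetical and not part of Augur or Seurat; the `rf_params` list is written out from my reading of the `calculate_auc` documentation, so please check `?calculate_auc` for the version you have installed.

```r
# Hypothetical helper (not part of Augur or Seurat): switch the default assay
# to "RNA" and then run Augur with a chosen number of trees.
run_augur_on_rna <- function(obj, trees = 50,
                             cell_type_col = "cell_type",
                             label_col = "label") {
  stopifnot(
    requireNamespace("Seurat", quietly = TRUE),
    requireNamespace("Augur", quietly = TRUE)
  )

  # Augur reads expression from the default assay of a Seurat object.
  Seurat::DefaultAssay(obj) <- "RNA"

  Augur::calculate_auc(
    obj,
    cell_type_col = cell_type_col,
    label_col = label_col,
    # Assumed layout of the random forest parameters; only `trees` is varied.
    rf_params = list(trees = trees, mtry = 2, min_n = NULL,
                     importance = "accuracy")
  )
}

# Usage (not run here, needs a Seurat object):
# res <- run_augur_on_rna(seurat_obj)
# res$AUC
```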

To answer your question about the experimental design: yes, you can compare the AUCs themselves.

You may also want to consider differential prioritization for this case. See our protocol (https://www.nature.com/articles/s41596-021-00561-x) for more details, specifically Case Study #4.
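
As a rough outline of what that could involve for G1 and G2, here is a sketch based on my recollection of the Augur README rather than on the protocol itself. The wrapper name and the inputs `obj_g1` and `obj_g2` (one Seurat object or matrix per experiment) are hypothetical, and the call pattern should be checked against `?calculate_differential_prioritization` before use.

```r
# Hypothetical wrapper: compare cell type prioritization between two
# experiments (for example G1 and G2), each with its own case/control labels.
differential_prioritization_sketch <- function(obj_g1, obj_g2, ...) {
  stopifnot(requireNamespace("Augur", quietly = TRUE))

  # Observed AUCs for each experiment.
  augur1 <- Augur::calculate_auc(obj_g1, ...)
  augur2 <- Augur::calculate_auc(obj_g2, ...)

  # Null AUCs from permuted labels, used to calibrate the difference.
  permuted1 <- Augur::calculate_auc(obj_g1, augur_mode = "permute", ...)
  permuted2 <- Augur::calculate_auc(obj_g2, augur_mode = "permute", ...)

  # Arguments passed by position: observed runs first, then permuted runs.
  Augur::calculate_differential_prioritization(
    augur1, augur2, permuted1, permuted2
  )
}

# Usage (not run here):
# pvals <- differential_prioritization_sketch(obj_g1, obj_g2)
```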

@vincentrose88
Author

Thanks!

I'll give that a try!

@vincentrose88
Author

Using RNA as the default assay, I get more sensible results (with num tree = 50):

  annotation  auc
1 Cell-type_A 0.6052060
2 Cell-type_B 0.5276417
3 Cell-type_C 0.5242139
4 Cell-type_D 0.5189135
5 Cell-type_E 0.5170862
6 Cell-type_F 0.5112566
7 Cell-type_G 0.5066270
8 Cell-type_H 0.4989002

Thanks for the help! You can consider this issue closed 👍

@vincentrose88
Author

Thinking more about these results, I'm surprised that the AUC is so much higher when running on the Seurat integrated assay than on the RNA assay:

RNA (num_tree = 50)

  annotation  auc
1 Cell-type_A 0.6052060
2 Cell-type_B 0.5276417
3 Cell-type_C 0.5242139
4 Cell-type_D 0.5189135
5 Cell-type_E 0.5170862
6 Cell-type_F 0.5112566
7 Cell-type_G 0.5066270
8 Cell-type_H 0.4989002

Integrated (num_tree = 2)

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931

Do you have any explanation for this?
