Question regarding the r2g/tf2g importance scores #72
Replies: 4 comments 1 reply
-
Hi @schae211 I moved your question to the discussions. The reason for this discrepancy is that slightly different parameters are used to calculate TF-to-gene and region-to-gene importances.

For TF-to-gene:

```python
'learning_rate': 0.01,
'n_estimators': 5000,  # can be arbitrarily large
'max_features': 0.1,
'subsample': 0.9
```

see: scenicplus/src/scenicplus/TF_to_gene.py Line 270 in 9320635

And for region-to-gene:

```python
'learning_rate': 0.01,
'n_estimators': 500,
'max_features': 0.1
```

see: scenicplus/src/scenicplus/enhancer_to_gene.py Line 497 in 9320635

This causes a different type of normalisation to occur in the internals of the arboreto package. For the TF-to-gene:

Hope this helps.

Best,
Seppe
-
Thanks for your reply. So the SGBM uses a subset of the data for each base learner, such that the test error can be estimated using OOB samples (similar to a random forest). However, I still do not understand whether it would make sense to normalize the TF2G importance scores per gene?
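For reference, when `subsample < 1` sklearn records the per-stage out-of-bag loss improvement, which is what makes early stopping possible. Below is a sketch of that idea with a simple windowed stopping rule; this is an illustration, not arboreto's exact `EarlyStopMonitor`, and the window size of 25 is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                       # toy predictor matrix
y = X[:, 0] + 0.5 * rng.normal(size=300)

gbm = GradientBoostingRegressor(
    learning_rate=0.01, n_estimators=500, max_features=0.5, subsample=0.9
)
gbm.fit(X, y)

# gbm.oob_improvement_[i] is the OOB loss improvement contributed by stage i.
# A simple rule: stop once the rolling mean improvement drops to <= 0.
window = 25
rolling = np.convolve(gbm.oob_improvement_, np.ones(window) / window, mode="valid")
stop = int(np.argmax(rolling <= 0)) if (rolling <= 0).any() else len(gbm.oob_improvement_)
print(stop)
```

The point is that each gene's model can stop at a different number of trees, which is what the histogram discussed below is about.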
-
I thought a little bit more about the problem, and I don't quite understand why the importance measures are multiplied by the number of trees here: https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L168

Given that the number of trees trained for each gene varies substantially (see my histogram above; according to https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L51, we can have up to 5000 trees per gene), it seems like some kind of bias is introduced when building the GRN using the GSEA approach: for a given TF, all its "linked" genes are ranked according to their importance for the GSEA (see: , ). However, for genes where the SGBM comprises more trees, the importance measures are biased towards larger values, and I don't quite see how this would make sense.
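The concern can be illustrated with a toy calculation. The numbers below are made up: two genes whose models produced the identical normalised importance pattern, but whose early stopping kept different numbers of trees.

```python
import numpy as np

# Same sklearn-normalised importances (sum to 1) for two hypothetical genes
importances_gene_a = np.array([0.6, 0.3, 0.1])   # model kept 500 trees
importances_gene_b = np.array([0.6, 0.3, 0.1])   # model stopped early at 100 trees

# Denormalisation as in arboreto/core.py#L168: multiply by the tree count
denorm_a = importances_gene_a * 500
denorm_b = importances_gene_b * 100

# The identical importance pattern now yields 5x larger scores for the
# gene whose model happened to keep more trees.
print(denorm_a / denorm_b)  # → [5. 5. 5.]
```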
-
Hi @schae211 Sorry for the late reply; nice that you're taking a close look at the method. The reason is as follows: we use the GBM to predict the gene expression vector based on the expression vectors of all TFs, and we extract the feature importances from the GBM. The feature importance represents how important each TF is for predicting the gene's expression, and this value is internally normalised by sklearn. Afterwards we want to get the opposite ranking (how "important" each gene is for each TF), and for that reason the values are denormalised in arboreto. That way we can compare genes per TF.

Feel free to share/try suggestions for improvements.

Best,
Seppe
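The "opposite ranking" step can be sketched as follows. The long-format table and its column names (`TF`, `target`, `importance`) are assumptions for illustration, not SCENIC+'s exact output schema.

```python
import pandas as pd

# Hypothetical adjacency table after denormalisation: one row per (TF, target)
adj = pd.DataFrame({
    "TF":         ["TF1", "TF1", "TF1", "TF2", "TF2"],
    "target":     ["g1",  "g2",  "g3",  "g1",  "g3"],
    "importance": [120.0, 45.0, 300.0, 15.0,  80.0],
})

# Rank targets within each TF by importance, descending; this per-TF
# ranking is what a GSEA-style step would consume.
adj["rank"] = adj.groupby("TF")["importance"].rank(ascending=False)
print(adj.sort_values(["TF", "rank"]))
```

Note that this ranking compares genes against each other within one TF, which is exactly where the per-gene denormalisation discussed above enters.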
-
If I understand the GENIE3/GRNBoost algorithms correctly, we first normalize the variance per target to 1, such that the importance scores per target sum up to 1 (given that the prediction is almost perfect), which facilitates the comparison of importance scores across different target genes.

Why do the importance scores per gene sum up to 1 in the region-gene data frame, but not in the TF-gene data frame?

This is, for example, how the sum of importance scores per gene looks:

Short update here: the variances per target are not normalized to 1 as in GENIE3; instead the arboreto package (https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L150) uses the importances returned by scikit-learn, which in turn are computed by averaging the reduction of the Friedman MSE (https://github.com/scikit-learn/scikit-learn/blob/14130f44eb6cba8a2fb2eff8383be8909783cad0/sklearn/ensemble/_gb.py#L725, https://github.com/scikit-learn/scikit-learn/blob/9268eea91f143f4f5619f7671fdabf3ecb9adf1a/sklearn/tree/_tree.pyx#L1060) attributed to a given variable over all trees. The normalization is performed in the last step, when each importance value is divided by the sum of all importance values.
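That last-step normalisation can be reproduced by hand: average each tree's unnormalised impurity reductions, then divide by their total. This is a sketch matching sklearn's implementation in recent versions (it uses the semi-public `tree_.compute_feature_importances`); the toy data is made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=200)

gbm = GradientBoostingRegressor(n_estimators=50, subsample=0.9).fit(X, y)

# Unnormalised impurity reduction per feature for every non-trivial tree
per_tree = np.array([
    t.tree_.compute_feature_importances(normalize=False)
    for stage in gbm.estimators_ for t in stage
    if t.tree_.node_count > 1
])

# Average over trees, then divide by the sum -> sklearn's feature_importances_
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, gbm.feature_importances_))  # → True
```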