Question regarding the r2g/tf2g importance scores #72
Replies: 4 comments 1 reply
-
Hi @schae211 I moved your question to the discussions. The reason for this discrepancy is that slightly different parameters are used to calculate TF-to-gene and region-to-gene importances.

For TF-to-gene:

```python
'learning_rate': 0.01,
'n_estimators': 5000,  # can be arbitrarily large
'max_features': 0.1,
'subsample': 0.9
```

see: scenicplus/src/scenicplus/TF_to_gene.py Line 270 in 9320635

And for region-to-gene:

```python
'learning_rate': 0.01,
'n_estimators': 500,
'max_features': 0.1
```

see: scenicplus/src/scenicplus/enhancer_to_gene.py Line 497 in 9320635

This causes a different type of normalisation to occur in the internals of the arboreto package. For the TF-to-gene:

Hope this helps.

Best,
Seppe
-
Thanks for your reply. So the SGBM uses a subset of the data for each base learner, such that the test error can be estimated using OOB samples (similar to a random forest). However, I still do not understand whether it would make sense to normalize the TF2G importance scores per gene?
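For reference, when `subsample < 1` sklearn records the per-stage out-of-bag loss improvement, which is what makes early stopping possible. Below is a sketch of that idea with a simple windowed stopping rule; this is an illustration, not arboreto's exact `EarlyStopMonitor`, and the window size of 25 is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                       # toy predictor matrix
y = X[:, 0] + 0.5 * rng.normal(size=300)

gbm = GradientBoostingRegressor(
    learning_rate=0.01, n_estimators=500, max_features=0.5, subsample=0.9
)
gbm.fit(X, y)

# gbm.oob_improvement_[i] is the OOB loss improvement contributed by stage i.
# A simple rule: stop once the rolling mean improvement drops to <= 0.
window = 25
rolling = np.convolve(gbm.oob_improvement_, np.ones(window) / window, mode="valid")
stop = int(np.argmax(rolling <= 0)) if (rolling <= 0).any() else len(gbm.oob_improvement_)
print(stop)
```

The point is that each gene's model can stop at a different number of trees, which is what the histogram discussed below is about.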
-
I thought a little bit more about the problem, and I don't quite understand why the importance measures are multiplied by the number of trees here: https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L168

Given that the number of trees trained for each gene varies substantially (see my histogram above; according to https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L51, we can have up to 5000 trees per gene), it seems like some kind of bias is introduced when building the GRN using the GSEA approach: for a given TF, all its "linked" genes are ranked according to their importance for the GSEA (see: , ). However, for genes where the SGBM comprises more trees, the importance measures are biased towards larger values, and I don't quite see how this would make sense.
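The concern can be illustrated with a toy calculation. The numbers below are made up: two genes whose models produced the identical normalised importance pattern, but whose early stopping kept different numbers of trees.

```python
import numpy as np

# Same sklearn-normalised importances (sum to 1) for two hypothetical genes
importances_gene_a = np.array([0.6, 0.3, 0.1])   # model kept 500 trees
importances_gene_b = np.array([0.6, 0.3, 0.1])   # model stopped early at 100 trees

# Denormalisation as in arboreto/core.py#L168: multiply by the tree count
denorm_a = importances_gene_a * 500
denorm_b = importances_gene_b * 100

# The identical importance pattern now yields 5x larger scores for the
# gene whose model happened to keep more trees.
print(denorm_a / denorm_b)  # → [5. 5. 5.]
```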
-
Hi @schae211 Sorry for the late reply; nice that you're taking a close look at the method. The reason is as follows: we use the GBM to predict the gene expression vector based on the expression vectors of all TFs, and we extract the feature importances from the GBM. The feature importance represents how important each TF is for predicting the gene's expression, and this value is internally normalised by sklearn. Afterwards we want to get the opposite ranking (how "important" each gene is for each TF), and for that reason the values are denormalised in arboreto. That way we can compare genes per TF.

Feel free to share/try suggestions for improvements.

Best,
Seppe
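The "opposite ranking" step can be sketched as follows. The long-format table and its column names (`TF`, `target`, `importance`) are assumptions for illustration, not SCENIC+'s exact output schema.

```python
import pandas as pd

# Hypothetical adjacency table after denormalisation: one row per (TF, target)
adj = pd.DataFrame({
    "TF":         ["TF1", "TF1", "TF1", "TF2", "TF2"],
    "target":     ["g1",  "g2",  "g3",  "g1",  "g3"],
    "importance": [120.0, 45.0, 300.0, 15.0,  80.0],
})

# Rank targets within each TF by importance, descending; this per-TF
# ranking is what a GSEA-style step would consume.
adj["rank"] = adj.groupby("TF")["importance"].rank(ascending=False)
print(adj.sort_values(["TF", "rank"]))
```

Note that this ranking compares genes against each other within one TF, which is exactly where the per-gene denormalisation discussed above enters.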
-
If I understand the GENIE3/GRNBoost algorithms correctly, we first normalize the variance per target to 1, such that the importance scores per target sum up to 1 (given that the prediction is almost perfect), which facilitates the comparison of importance scores across different target genes.

Why do the importance scores per gene sum up to 1 in the region-gene data frame, but not in the TF-gene data frame?

This is, for example, how the sum of importance scores per gene looks:

Short update here: the variances per target are not normalized to 1 as in GENIE3; instead the arboreto package (https://github.com/aertslab/arboreto/blob/2f475dca08f47a60acc2beb8dd897e77b7495ca4/arboreto/core.py#L150) uses the importances returned by scikit-learn, which in turn are computed by averaging the reduction of the Friedman MSE (https://github.com/scikit-learn/scikit-learn/blob/14130f44eb6cba8a2fb2eff8383be8909783cad0/sklearn/ensemble/_gb.py#L725, https://github.com/scikit-learn/scikit-learn/blob/9268eea91f143f4f5619f7671fdabf3ecb9adf1a/sklearn/tree/_tree.pyx#L1060) attributed to a given variable over all trees. The normalization is performed in the last step, when each importance value is divided by the sum of all importance values.
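That last-step normalisation can be reproduced by hand: average each tree's unnormalised impurity reductions, then divide by their total. This is a sketch matching sklearn's implementation in recent versions (it uses the semi-public `tree_.compute_feature_importances`); the toy data is made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=200)

gbm = GradientBoostingRegressor(n_estimators=50, subsample=0.9).fit(X, y)

# Unnormalised impurity reduction per feature for every non-trivial tree
per_tree = np.array([
    t.tree_.compute_feature_importances(normalize=False)
    for stage in gbm.estimators_ for t in stage
    if t.tree_.node_count > 1
])

# Average over trees, then divide by the sum -> sklearn's feature_importances_
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, gbm.feature_importances_))  # → True
```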