Add `rest` and `intercept` column to glexobj$m #25

jyliuu · 2024-11-18T15:08:57Z

To be merged after #24.
This PR adds an `intercept` column to `glex_obj$m` for all supported tree methods. For regression tasks, it also adds a `rest` column equal to `predictions - rowSums(glex_obj$m)`.
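
As a minimal sketch of that identity, the snippet below uses a made-up toy matrix in place of `glex_obj$m` and made-up predictions. None of the values or object names come from glex itself; they are assumptions for illustration only.

```r
# Toy stand-in for glex_obj$m: an intercept column plus two main-effect terms.
# All values are invented for illustration.
m <- cbind(
  intercept = rep(2, 3),
  x1 = c(0.5, -1.0, 0.25),
  x2 = c(0.1, 0.2, -0.3)
)

# Hypothetical model predictions for the same three rows
predictions <- c(2.7, 1.1, 2.0)

# rest is whatever the stored terms do not account for
rest <- predictions - rowSums(m)
m_with_rest <- cbind(m, rest = rest)

# With rest included, each row should sum back to its prediction
all.equal(unname(rowSums(m_with_rest)), predictions)
```

With `rest` in place, `rowSums()` over the full matrix reproduces the predictions without having to add a separate scalar.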

jemus42 · 2024-11-19T13:30:31Z

I think we need to be careful here, as changing the structure of the glex object has ripple effects across every plotting and reshaping function and needs to be consistent across the 2 (or 3, counting rpf) supported learners.

Some thoughts:

- `$intercept` was stored as a scalar specifically to avoid having to store a constant column in `$m`. I understand that it's more convenient to have everything in `$m` so that things like `rowSums` "just work", but it also feels kind of inefficient (probably wouldn't matter too much, I admit). We can change that structure if it ends up being easier down the road, but there are a few considerations to make, even minor things like "what happens if the dataset happens to contain a feature named `intercept` for some reason?"
- For the randomPlantedForest output, there already is a `$remainder` term, see the example below. I don't care whether we call it `$rest` or `$remainder`, but naturally we should be consistent 😅 It also doesn't help that rpf basically has its own machinery in a different package, which adds to the complexity. That's one of the things that makes the remainder complicated. The other is that the remainder needs to make sense for regression, binary classification, and multiclass classification, which requires extra handling. See e.g. here in rpf https://github.com/PlantedML/randomPlantedForest/blob/41fe7eef99cfc60fbc04f6a08d30a932ffc097e0/R/predict_components.R#L124-L143 and see also #11 (Add remainder vector to glex output). Happy to make progress here though, it's been on my list for some time, but I wanted to avoid a "this just works for regression now sry" type situation, which I always find frustrating as a user.
- On that note, we might also address the part where `$shap` is stored even if `max_interaction` is set in `glex`, which makes `$shap` effectively meaningless (see also #18, Calculate shap values only for the selected features).
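
To make the naming concern concrete, here is a rough sketch of a guard that could run before reserved columns are added to `$m`. The function name `check_reserved_terms()` and the set of reserved names are hypothetical and not part of glex; this only illustrates one possible way to handle the clash.

```r
# Hypothetical helper, not part of glex: stop early if a feature name
# would clash with a column that is about to be added to $m.
check_reserved_terms <- function(feature_names,
                                 reserved = c("intercept", "rest")) {
  clash <- intersect(feature_names, reserved)
  if (length(clash) > 0) {
    stop(
      "Feature name(s) clash with reserved term(s): ",
      paste(clash, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(feature_names)
}

# No clash: the feature names are returned invisibly
check_reserved_terms(c("x1", "x2", "x3"))

# A clash raises an error; try() keeps the script running
res <- try(check_reserved_terms(c("x1", "intercept")), silent = TRUE)
inherits(res, "try-error")
```

Keeping `$intercept` as a scalar sidesteps the clash entirely, at the cost of having to add it back by hand, e.g. `rowSums(m) + intercept`.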

Example for remainder term

```r
library(glex)
library(xgboost)
library(randomPlantedForest)
set.seed(234)
options(max.print = 10)

# this is completely arbitrary nonsense
xdat <- data.frame(
  x1 = rnorm(100),
  x2 = rpois(100, 2),
  x3 = runif(100)
)
xdat <- within(xdat, y <- 3 * x1 + 0.5 * (x2 + x3) + 3 * abs(x1 * x3))

# rpf has remainder term
rpf_fit <- rpf(y ~ ., data = xdat, num.trees = 50, max_interaction = 3)
rpf_glex <- glex(rpf_fit, xdat, max_interaction = 2)
rpf_glex$remainder
#>  [1] -0.0153642314 -0.0166246943 -0.0396781809  0.0453551935 -0.0003334053
#>  [6]  0.0038383014  0.0034622827  0.0306458754 -0.0007525907 -0.0162234933
#>  [ reached getOption("max.print") -- omitted 90 entries ]

# also, intercept is stored as scalar
rpf_glex$intercept
#> [1] 2.053044

# xgb not yet
xgb_fit <- xgboost(data = as.matrix(xdat[, 1:3]), label = xdat$y, max_depth = 3,
                   early_stopping_rounds = 50, nrounds = 1000, verbose = FALSE)
xgb_glex <- glex(xgb_fit, as.matrix(xdat[, 1:3]), max_interaction = 2)
xgb_glex$remainder
#> NULL

xgb_glex$intercept
#> [1] 2.141527

# Also, shap is stored but known to be wrong due to max_interaction limit
xgb_glex$shap
#>               x1           x2          x3
#>            <num>        <num>       <num>
#>   1:  1.98061587  0.056079863  0.12250051
#>   2: -3.80587144 -0.281592150 -0.33564143
#>  [ reached getOption("max.print") -- omitted 99 rows ]
```

<sup>Created on 2024-11-19 with reprex v2.1.1</sup>

jyliuu and others added 12 commits November 15, 2024 11:30

Split FastPD expectation into new file

c677249

Split rest into multiple files

2b9387e

Add helper.cpp

b4a5ca8

Augment up to max_interaction

b079e3b

Marginalize by looking at S - features to explain

6d577ed

Add get_all_subsets up to max size

24d8d49

Declare functions

0f6a7fd

Marginalize up to max_interaction

f578b95

Add test to ensure correctness of interactions

0889f95

Fix empirical leaf-weighting for ranger only

1ab2074

Fix test

2fead1d

Merge branch 'master' into optimization-fastpd

d690e79

jyliuu marked this pull request as ready for review November 18, 2024 15:09

jyliuu requested review from jemus42, mnwright and MHiabu November 18, 2024 15:09

jyliuu added 6 commits November 18, 2024 16:30

Use more generalizeable naming

62261a6

Don't remove intercept column

508a56d

Add rest column for regression tasks

47ecc7b

Convert to matrix if not so

b8f2f90

Update tests to accomodate for intercept

30470b7

Update README

3c596c7

jyliuu force-pushed the add-rest branch from 7fe2598 to 3c596c7 November 18, 2024 15:35

Merge branch 'master' into add-rest

a9028b4
