r - Two-level stacked learner (ensemble model) combining elastic net and logistic regression using mlr3


asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)


I am trying to solve a common problem in medicine: combining a prediction model with other sources of information, e.g., an expert opinion (sometimes heavily emphasized in medicine), called the superdoc predictor in this post.

This could be solved by stacking a model with a logistic regression (which takes the expert opinion as an additional input), as described on page 26 of this paper:

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities. IEEE Signal Process Mag 2019; 36: 132–60. Available here

I have tried this below without guarding against overfitting (I did not use out-of-fold predictions from the lower-level learner):

Example data

# libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(broom)  # provides tidy(), used below
library(pROC)   # provides roc(), used below

# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctor's opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) %>% 
  ungroup() -> data  # drop the rowwise grouping again

# split the data into a training set and a test set
train.data <- data[1:550,]
test.data <- data[551:768,]

Stacked models without considering out of fold predictions:

# elastic net regression (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats=10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial, data=train.data)


# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE) 
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions,     ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

Now I would like to use out-of-fold predictions with the mlr3 package family, following this very helpful post: Tuning a stacked learner

#library
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)
library(mlr3tuning)
library(paradox)
library(glmnet)

# create elastic net regression
glmnet_lrn =  lrn("classif.cv_glmnet", predict_type = "prob")

# create the learner that produces out-of-fold predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet") #I could not find a setting to filter the predictors (ie, not send the superdoc predictor here)

# summarize steps 
level0 = gunion(list(
  glmnet_cv1,
  po("nop", id = "only_superdoc_predictor")))  %>>% #I could not find a setting to send only the superdoc predictor to "union1"
  po("featureunion", id = "union1")


# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")

# combine ensemble model
ensemble = level0 %>>% log_reg_lrn
ensemble$plot(html = FALSE)

Created on 2021-03-15 by the reprex package (v1.0.0)

My questions (I am rather new to the `mlr3` package family):

Is the mlr3 package family well suited for the ensemble model I am trying to build?
If yes, how could I finalize the ensemble model and make predictions on the test.data?


1 Answer

深蓝 · Answer 1 · 2021-10-23T21:39:47+0000

I think mlr3 / mlr3pipelines is well suited for your task. It appears that what you are missing is mainly the PipeOpSelect / po("select"), which lets you extract features based on their name or other properties and makes use of Selector objects. Your code should probably look something like

library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")

# create the learner that produces out-of-fold predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")

# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
  selector = selector_invert(selector_name("superdoc")))

# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
  selector = selector_name("superdoc"))

# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
  select_superdoc,
  drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")

# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")

# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn

This is what it looks like:

ensemble$plot(html = FALSE)
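For intuition about what po("learner_cv") contributes during training, the out-of-fold idea can be sketched in base R. This is only an illustration under assumptions, not the mlr3pipelines implementation: the data are simulated, all column names (x1, x2, expert, y) are hypothetical, and a plain glm stands in for the elastic net at level 0.

```r
# Sketch of out-of-fold predictions for stacking, base R only.
# Simulated data; all names are hypothetical.
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
expert <- rbinom(n, 1, 0.3)
y <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2 + 1.2 * expert))
d <- data.frame(y, x1, x2, expert)

# Assign every row to one of k folds
k <- 3
fold <- sample(rep(seq_len(k), length.out = n))

# Level 0: each row is predicted by a model that was fitted without that row
oof <- numeric(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  fit <- glm(y ~ x1 + x2, family = binomial, data = d[!held_out, ])
  oof[held_out] <- predict(fit, newdata = d[held_out, ], type = "response")
}
d$level0 <- oof

# Level 1: logistic regression on the out-of-fold prediction plus the expert column
level1 <- glm(y ~ level0 + expert, family = binomial, data = d)
coef(level1)
```

Each value in level0 comes from a model that never saw that row, so the level-1 regression cannot reward a lower learner that merely memorised its training data.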

To train and evaluate the model, we need to create Task objects:

train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")
test.task <- TaskClassif$new("test.data", test.data, target = "diabetes")
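One detail that may matter for class-specific measures such as sensitivity: to my understanding, a binary classification task treats the first factor level (here "neg") as the positive class unless told otherwise. The AUC is symmetric in the two classes, so it is not affected. A minimal sketch on made-up toy data (the data frame and the task ID are hypothetical) of how the positive class could be set explicitly with as_task_classif():

```r
library(mlr3)

# Made-up toy data, only to show the positive-class setting
toy <- data.frame(
  diabetes = factor(c("neg", "pos", "neg", "pos", "neg", "pos")),
  glucose  = c(85, 168, 99, 150, 110, 181)
)

# Declare "pos" as the positive class when creating the task
toy_task <- as_task_classif(toy, target = "diabetes", positive = "pos", id = "toy")
toy_task$positive
toy_task$class_names
```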

The model can now be trained, can then be used for prediction, and the quality of the prediction can be evaluated. This works best if we turn the ensemble into a Learner:

elearner = as_learner(ensemble)
# Train the Learner:
elearner$train(train.task)
# (The training may give a warning because the glm gets collinear features:
# the positive and the negative probabilities)
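The collinearity mentioned in the comment can be reproduced in base R. For two classes the level-0 learner passes on both the negative and the positive class probability, and one is simply 1 minus the other. A small sketch on simulated data (the variable names are hypothetical) of how glm reacts to such a redundant column:

```r
# Simulated class probabilities; prob_neg carries no extra information
set.seed(1)
prob_pos <- runif(50)
prob_neg <- 1 - prob_pos
y <- rbinom(50, 1, prob_pos)

# With an intercept in the model, prob_neg is an exact linear combination
# of the other columns, so glm cannot estimate a coefficient for it
fit <- glm(y ~ prob_pos + prob_neg, family = binomial)
coef(fit)  # the coefficient of prob_neg is NA
```

The fit is still usable; the redundant column is just ignored, which is why the warning during training is harmless here.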

Get the prediction on the test set:

prediction = elearner$predict(test.task)
print(prediction)
#> <PredictionClassif> for 218 observations:
#>     row_ids truth response  prob.neg   prob.pos
#>           1   neg      neg 0.9417067 0.05829330
#>           2   neg      neg 0.9546343 0.04536566
#>           3   neg      neg 0.9152019 0.08479810
#> ---                                            
#>         216   neg      neg 0.9147406 0.08525943
#>         217   pos      neg 0.9078216 0.09217836
#>         218   neg      neg 0.9578515 0.04214854

The prediction was made on a Task, so it can be used directly to measure performance against the ground truth, e.g. using the "classif.auc" Measure:

msr("classif.auc")$score(prediction)
#> [1] 0.9308455

Two notes here:

You have split your data into a training and a test set manually. mlr3 lets you do the resampling automatically, based on a single Task object, and this can go beyond simple train-test splits. Using the data from the question, a 10-fold cross-validation would look like this:
```
all.task <- TaskClassif$new("all.data", data, target = "diabetes")
rr = resample(all.task, elearner, rsmp("cv"))  # will take some time
rr$aggregate(msr("classif.auc"))
#> classif.auc 
#>   0.9366438
```
I have shown how to construct the graph using the po("select") PipeOps because this approach is fully general: by adjusting the selector values you can feed a feature to the glmnet_lrn Learner, to log_reg_lrn directly, or to both. If all you want is to "divert" a feature past a single operation, you can instead set that PipeOp's affect_columns parameter to a Selector that selects the columns it should operate on. The following creates a (linear) graph that does exactly the same, but is less flexible:
```
glmnet_cv1_nosuperdoc = po("learner_cv", glmnet_lrn, id = "glmnet",
  affect_columns = selector_invert(selector_name("superdoc")))
ensemble2 = glmnet_cv1_nosuperdoc %>>% log_reg_lrn
e2learner = as_learner(ensemble2)
# etc.
```
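To check which columns a given Selector lets through, a po("select") PipeOp can be trained on its own and the resulting task inspected. A minimal sketch on a made-up toy task (the data values and the task ID are hypothetical; the selectors are the ones used above):

```r
library(mlr3)
library(mlr3pipelines)

# Toy task; the values are made up for illustration
toy <- data.frame(
  diabetes = factor(c("neg", "pos", "neg", "pos")),
  glucose  = c(85, 168, 99, 150),
  age      = c(31, 50, 27, 44),
  superdoc = c(0, 2, 0, 1)
)
toy_task <- as_task_classif(toy, target = "diabetes", id = "toy")

select_superdoc <- po("select", id = "select.superdoc",
  selector = selector_name("superdoc"))
drop_superdoc <- po("select", id = "drop.superdoc",
  selector = selector_invert(selector_name("superdoc")))

# Train each PipeOp alone and list the features that remain
kept    <- select_superdoc$train(list(toy_task))[[1]]$feature_names
dropped <- drop_superdoc$train(list(toy_task))[[1]]$feature_names
kept
dropped
```

The same kind of check works for any Selector before it is wired into a larger graph.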
