Predictive Modelling Lab 1

1. Hands-on exercise: building a simple predictive logistic regression model

1.1 Descriptive analysis

  • For this tutorial, we will be using the OVA data. This data set contains information on the follow-up of patients with severe ovarian cancer.

  • We use SURV, survival status at 4 years, as the binary outcome

  • Predictors considered are: KARN, BRODERS, FIGO, ASCITES, DIAM, and AGE

  • The data are as follows:

    • T survival time in days
    • D 1 = dead at time T; 0 = alive at time T
    • SURV 1 = T > 1460 (4 years); 0 = T <= 1460
    • KARN Karnofsky index, coded as 6 (<= 60), 7, 8, 9, 10 (= 100)
    • BRODERS Broders grading
    • FIGO 3 or 4
    • ASCITES 1 = absent; 2 = present; 0 = unknown
    • DIAM diameter of residual tumor: 3 = >= 5 cm; 4 = 2 - 5 cm; 5 = 1 - 2 cm; 6 = < 1 cm; 7 = only microscopic residuals
    • AGE age at diagnosis (years)
library(tidyverse)
library(gtsummary)
library(DT)

ova <- read.csv("OVA.csv", header = TRUE)

# look at the data;
datatable(ova,
          rownames = FALSE,
          options = list(dom = 't')) 
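A quick sanity check, assuming the SURV coding given in the data dictionary above: cross-tabulating surv against the 4-year cutoff should put all observations on the diagonal. (This check is not part of the original lab.)

# surv should equal 1 exactly when t > 1460 days; off-diagonal cells should be zero;
with(ova, table(surv, t > 1460))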
# display summary statistics
ova %>% 
  tbl_summary(
  missing_text = "(Missing)")
Characteristic   N = 358 [1]
t                735 (369, 1,619)
d                266 (74%)
surv             112 (31%)
karn
    6            20 (5.6%)
    7            46 (13%)
    8            47 (13%)
    9            108 (30%)
    10           137 (38%)
broders
    1            42 (14%)
    2            89 (29%)
    3            127 (41%)
    4            49 (16%)
    (Missing)    51
figo
    3            262 (73%)
    4            96 (27%)
ascites
    0            52 (15%)
    1            94 (26%)
    2            212 (59%)
diam
    3            145 (41%)
    4            68 (19%)
    5            49 (14%)
    6            67 (19%)
    7            29 (8.1%)
age              55 (45, 65)
[1] Median (IQR); n (%)
# display summary statistics by survival status;
ova %>% 
  tbl_summary(
  by = surv,
  missing_text = "(Missing)") %>%
  add_overall() %>%
  add_p()
Characteristic   Overall, N = 358 [1]   surv = 0, N = 246 [1]   surv = 1, N = 112 [1]   p-value [2]
t                735 (369, 1,619)       478 (237, 778)          1,891 (1,667, 2,156)    <0.001
d                266 (74%)              246 (100%)              20 (18%)                <0.001
karn                                                                                    0.003
    6            20 (5.6%)              19 (7.7%)               1 (0.9%)
    7            46 (13%)               39 (16%)                7 (6.3%)
    8            47 (13%)               34 (14%)                13 (12%)
    9            108 (30%)              67 (27%)                41 (37%)
    10           137 (38%)              87 (35%)                50 (45%)
broders                                                                                 0.004
    1            42 (14%)               19 (9.2%)               23 (23%)
    2            89 (29%)               63 (31%)                26 (26%)
    3            127 (41%)              94 (46%)                33 (33%)
    4            49 (16%)               30 (15%)                19 (19%)
    (Missing)    51                     40                      11
figo                                                                                    <0.001
    3            262 (73%)              164 (67%)               98 (88%)
    4            96 (27%)               82 (33%)                14 (13%)
ascites                                                                                 0.005
    0            52 (15%)               38 (15%)                14 (13%)
    1            94 (26%)               52 (21%)                42 (38%)
    2            212 (59%)              156 (63%)               56 (50%)
diam                                                                                    <0.001
    3            145 (41%)              119 (48%)               26 (23%)
    4            68 (19%)               49 (20%)                19 (17%)
    5            49 (14%)               33 (13%)                16 (14%)
    6            67 (19%)               35 (14%)                32 (29%)
    7            29 (8.1%)              10 (4.1%)               19 (17%)
age              55 (45, 65)            58 (50, 66)             49 (40, 56)             <0.001
[1] Median (IQR); n (%)
[2] Wilcoxon rank sum test; Pearson’s Chi-squared test

1.2 Data manipulation

  • We will have a guest lecture on missing data by Prof. Aya Mitani, see course schedule.

  • We can use multiple imputation to handle missing data; this will be covered in the guest lecture.

  • In this example, we show how to summarize missingness by the other covariates, and we create a new category for unknown BRODERS scores (0 = unknown)

#step 1: summarize data by missing status;

ova %>%
  mutate(missing = ifelse(is.na(broders), 1, 0)) %>% 
  select(-broders) %>%
  tbl_summary(
  by = missing,
  missing_text = "(Missing)") %>%
  add_p()
Characteristic   missing = 0, N = 307 [1]   missing = 1, N = 51 [1]   p-value [2]
t                779 (386, 1,667)           649 (235, 1,407)          0.043
d                225 (73%)                  41 (80%)                  0.3
surv             101 (33%)                  11 (22%)                  0.11
karn                                                                  0.7
    6            17 (5.5%)                  3 (5.9%)
    7            37 (12%)                   9 (18%)
    8            40 (13%)                   7 (14%)
    9            96 (31%)                   12 (24%)
    10           117 (38%)                  20 (39%)
figo                                                                  0.069
    3            230 (75%)                  32 (63%)
    4            77 (25%)                   19 (37%)
ascites                                                               0.4
    0            42 (14%)                   10 (20%)
    1            84 (27%)                   10 (20%)
    2            181 (59%)                  31 (61%)
diam                                                                  0.3
    3            123 (40%)                  22 (43%)
    4            60 (20%)                   8 (16%)
    5            39 (13%)                   10 (20%)
    6            57 (19%)                   10 (20%)
    7            28 (9.1%)                  1 (2.0%)
age              56 (46, 65)                54 (45, 66)               >0.9
[1] Median (IQR); n (%)
[2] Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test
#Step 2: instead of imputation, we will create a new category to label missing broders scores

ova <- ova %>%
  mutate(broders2 = ifelse(is.na(broders), 0, broders)) # if missing, assign value 0; otherwise keep the original value;
Tips on handling missing data
  • When you have multiple variables with missing information, you can use data visualization to display the missingness pattern; for a reference tutorial in R, see https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html (a minimal sketch follows this list).

  • Under a missing-at-random assumption, a rule of thumb is that with <= 25% missingness you can consider multiple imputation, with the number of imputed copies typically between 5 and 20.

  • For a categorical variable, adding a new category for missing values can be preferred if you do not want to make assumptions about the missingness mechanism.

  • At this step, you can also evaluate your data and see whether recoding is needed to collapse categories (categories with rare outcomes or small cell counts).
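Following the naniar tutorial linked above, a minimal sketch of missingness visualization (assuming the naniar package is installed):

library(naniar)
vis_miss(ova)    # matrix view of observed vs. missing cells, by variable
gg_miss_var(ova) # number of missing values per variable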

1.3 Model building strategies

  • Predictive models should include variables that reflect the study population of interest.

  • More is not always better: when predictors are highly correlated, removing one of them does not compromise model performance, and the reduced model is computationally more efficient.

  • It is good practice to assess pairwise correlations before building your predictive model.

  • You can also use model selection approaches such as LASSO or bootstrap + stepwise (not stepwise by itself). However, since we are building a predictive model, selection should ideally be performed to maximize predictive power and accuracy, not fit: compare candidate models on the c-statistic and ROC curve, not AIC or BIC. (A minimal LASSO sketch follows the correlation plot below.)

#create a dataset to host all variables that will be used to build the prediction model;
ova2 <- ova %>% select(-t, -d, -broders)

#create pair-wise correlation plot;
library(corrplot)
corrplot(cor(ova2))
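As mentioned in the list above, LASSO is one possible selection approach. A minimal sketch using glmnet (assuming the glmnet package is installed), with the penalty tuned by cross-validated AUC to match the predictive goal rather than fit:

library(glmnet)
# design matrix with the same factor codings used in the model below;
xmat <- model.matrix(surv ~ as.factor(karn) + as.factor(figo) + as.factor(ascites) +
                       as.factor(diam) + age + as.factor(broders2), data = ova2)[, -1]
set.seed(123)
cvfit <- cv.glmnet(xmat, ova2$surv, family = "binomial", type.measure = "auc")
coef(cvfit, s = "lambda.1se") # coefficients at the 1-SE lambda; zeroed terms are dropped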

  • now let’s build a model with no interactions and no non-linear terms: a simple additive logistic regression model
m0 <- glm(surv ~ as.factor(karn)+as.factor(figo)+as.factor(ascites)+as.factor(diam)+age+as.factor(broders2), family="binomial", data = ova2)
summary(m0)

Call:
glm(formula = surv ~ as.factor(karn) + as.factor(figo) + as.factor(ascites) + 
    as.factor(diam) + age + as.factor(broders2), family = "binomial", 
    data = ova2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0132  -0.7626  -0.4787   0.7940   2.5970  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -0.666311   1.264863  -0.527 0.598343    
as.factor(karn)7      1.081171   1.150849   0.939 0.347497    
as.factor(karn)8      1.901206   1.124047   1.691 0.090762 .  
as.factor(karn)9      1.825588   1.092649   1.671 0.094763 .  
as.factor(karn)10     1.708743   1.090427   1.567 0.117105    
as.factor(figo)4     -0.919926   0.355006  -2.591 0.009562 ** 
as.factor(ascites)1  -0.006856   0.438393  -0.016 0.987523    
as.factor(ascites)2  -0.416012   0.398599  -1.044 0.296631    
as.factor(diam)4      0.554593   0.380419   1.458 0.144883    
as.factor(diam)5      0.682469   0.421686   1.618 0.105570    
as.factor(diam)6      1.205796   0.372404   3.238 0.001204 ** 
as.factor(diam)7      1.704534   0.495323   3.441 0.000579 ***
age                  -0.043949   0.009831  -4.470 7.81e-06 ***
as.factor(broders2)1  1.272036   0.526319   2.417 0.015655 *  
as.factor(broders2)2  0.252897   0.462477   0.547 0.584494    
as.factor(broders2)3  0.063403   0.443116   0.143 0.886224    
as.factor(broders2)4  0.895321   0.520517   1.720 0.085421 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 444.89  on 357  degrees of freedom
Residual deviance: 351.60  on 341  degrees of freedom
AIC: 385.6

Number of Fisher Scoring iterations: 5

1.4 Model performance

  • Two key elements in assessing the performance of a fitted logistic regression model are the assessment of model calibration and model discrimination.

  • Calibration refers to the agreement between observed outcomes and predictions (Brier score, calibration plot)

  • Discrimination refers to the ability of model predictions to discriminate between those with and those without the outcome (c-statistic, ROC curve)

  • First, let’s estimate the c-statistic (AUC) and plot the ROC curve

    • “The c-statistic is defined as the proportion of all possible pairs of subjects, one of whom experienced the outcome of interest and one of whom did not, in which the subject who experienced the outcome of interest had a higher predicted probability of the outcome occurring than does the subject who did not experience the outcome.” - From Austin, Harrell, and Steyerberg, 2021
# install.packages('ROCR')
library(ROCR)

# in-sample: the same training data set on which the model was fit;
prob <- predict.glm(m0, type='response', newdata =  ova2)
pred <- prediction(prob, ova2$surv)

# AUC
auc <- performance(pred,"auc")@y.values[[1]][1]
auc # c-statistic for model m0 is 0.80; pretty good!
[1] 0.8003412
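To connect this number back to the pairwise definition quoted above, the c-statistic can also be computed directly from all (outcome, no-outcome) pairs, counting ties as one half; a small illustration (not in the original lab):

p1 <- prob[ova2$surv == 1] # predicted probabilities for those with the outcome;
p0 <- prob[ova2$surv == 0] # predicted probabilities for those without;
mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "==")) # matches the ROCR AUC above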
# plot the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf, col="navyblue", cex.main=1,
     main= paste("Logistic Regression ROC Curve \n AUC =", round(auc,3)))
abline(a=0, b = 1, col='red')

# we can obtain a confidence interval for the AUC via bootstrap;
# the pROC package has a function that lets you do this more easily (see the sketch after the quantile output below);
# sample code to run the bootstrap manually is provided below;

n<-dim(ova2)[1]
auc.boot <- rep(NA, 1000) # initiate an empty vector of size 1000 to save the bootstrap AUCs;

set.seed(123)
for(i in 1:1000){
  # random draw with replacement to form a bootstrap sample with same size as the original data;
    obs.boot <- sample(x = 1:n, size = n, replace = T)
    data.boot <- ova2[obs.boot, ]
    
    # fit the model on bootstrap sample
    m0.boot <- glm(m0$formula , #extracting model formula from m0;
                      data=data.boot,
                      family = binomial(link = "logit"))
    
    # calculate and save auc for each bootstrap iteration;
    prob1 <- predict(m0.boot, type = 'response', newdata = data.boot)
    pred1 <- prediction(prob1, data.boot$surv)
    auc.boot[i] <- performance(pred1,"auc")@y.values[[1]][1]
}

# 95% CI for auc using empirical distribution;
quantile(auc.boot, c(.025, .975)) 
     2.5%     97.5% 
0.7767338 0.8635063 
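As noted above, the pROC package can produce a bootstrap CI more directly. A minimal sketch (assuming pROC is installed); note that ci.auc resamples the fixed predictions rather than refitting the model, so it ignores model-fitting variability:

library(pROC)
roc0 <- roc(ova2$surv, prob)
ci.auc(roc0, method = "bootstrap") # stratified bootstrap; 2000 replicates by default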
  • The Brier score is simply the mean of squared differences between the predicted probability of outcome = 1 and the observed event status (0 or 1); it quantifies predictive accuracy.

  • The Brier score lies between 0 and 1: 0 is perfect prediction; 0.25 is predicting by chance, like flipping a fair coin; 1 is the worst possible prediction, worse than flipping a coin.

\[ \text{Brier score} = \frac{1}{N} \sum_{i=1}^N (P(y_i=1 \mid x_{i1}, \ldots, x_{ip}) - y_i)^2 \]

BrierScore <- mean((prob-ova2$surv)^2)
BrierScore
[1] 0.1618604
# try this yourself: can you use the bootstrap code above to get the empirical distribution of the Brier score?
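One possible solution, mirroring the AUC bootstrap above (a sketch; it reuses n and refits m0 on each bootstrap sample):

brier.boot <- rep(NA, 1000)
set.seed(123)
for(i in 1:1000){
    obs.boot <- sample(x = 1:n, size = n, replace = T)
    data.boot <- ova2[obs.boot, ]
    m0.boot <- glm(m0$formula, data = data.boot, family = binomial(link = "logit"))
    prob1 <- predict(m0.boot, type = 'response', newdata = data.boot)
    brier.boot[i] <- mean((prob1 - data.boot$surv)^2)
}
quantile(brier.boot, c(.025, .975)) # 95% CI from the empirical distribution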
  • A regression calibration plot compares the probability of the outcome from the data with the estimated probability of the outcome from the model.
    • Calibration is perfect if the empirical probability matches the model-estimated probability along the diagonal line.
    • We can fit a linear regression of the data probability (as y) on the model-estimated probability (as x) and report the intercept and slope of this simple linear regression as the calibration intercept and slope.
pdat <- with(ova2, data.frame(surv,prob = predict(m0, type = "response")))

lm(surv ~ prob, data = pdat)

Call:
lm(formula = surv ~ prob, data = pdat)

Coefficients:
(Intercept)         prob  
   0.001157     0.996300  
# intercept estimated at 0.001157 and slope estimated at 0.9963;

ggplot(pdat, aes(prob,surv))+
  geom_abline(slope=1, intercept = 0, color = "red")+
  geom_smooth(method = stats::loess, se = FALSE, color="black")+
  xlab("Esimated Probability")+
  ylab("Data Probability")+
  ggtitle("Logistic Regression Calibration Plot")

# getting calibration plot using rms package;
library(rms)
ova2 <- ova2 %>%
  mutate(karn = as.factor(karn),
         figo = as.factor(figo),
         ascites = as.factor(ascites),
         diam = as.factor(diam),
         broders2 = as.factor(broders2))

m0_rms <- lrm( surv ~ karn + figo + ascites + diam + age + broders2, data = ova2, x = TRUE, y = TRUE)
plot(calibrate(m0_rms, B=200))


n=358   Mean absolute error=0.031   Mean squared error=0.00115
0.9 Quantile of absolute error=0.055
# Resampling validation of a fitted model using the rms package (you might have seen this in Biostat 2 with Kevin);
validate(m0_rms, B=200)
          index.orig training    test optimism index.corrected   n
Dxy           0.6007   0.6418  0.5608   0.0811          0.5196 200
R2            0.3225   0.3710  0.2747   0.0963          0.2262 200
Intercept     0.0000   0.0000 -0.1236   0.1236         -0.1236 200
Slope         1.0000   1.0000  0.7745   0.2255          0.7745 200
Emax          0.0000   0.0000  0.0791   0.0791          0.0791 200
D             0.2578   0.3039  0.2147   0.0892          0.1686 200
U            -0.0056  -0.0056  0.0152  -0.0208          0.0152 200
Q             0.2634   0.3095  0.1995   0.1100          0.1534 200
B             0.1619   0.1523  0.1707  -0.0184          0.1803 200
g             1.5101   1.9302  1.4687   0.4615          1.0486 200
gp            0.2571   0.2735  0.2353   0.0382          0.2190 200
  • Somers’ Dxy, \(= 2 \times (C-0.5)\); in this case the corrected Dxy = 0.5196, thus the corrected c-statistic/AUC is 0.5196/2 + 0.5 = 0.7598.
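The same conversion can be done programmatically from the validate() output; a short sketch (the corrected values vary slightly across runs unless a seed is set before resampling):

v <- validate(m0_rms, B = 200)
v["Dxy", "index.corrected"] / 2 + 0.5 # corrected c-statistic/AUC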

1.5 Updating the model by adding a non-linear term for age and comparing it to the simple additive model

  • fit a new model with a spline term for the variable age

  • compare this model to the previous model on predictive performance

library(splines)

m1 <- glm(surv ~ karn + figo + ascites + diam + splines::ns(age,3) + broders2, family="binomial", data = ova2)
summary(m1)

Call:
glm(formula = surv ~ karn + figo + ascites + diam + splines::ns(age, 
    3) + broders2, family = "binomial", data = ova2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9711  -0.7586  -0.4199   0.7981   2.6286  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -3.03792    1.55831  -1.949 0.051236 .  
karn7                 1.16521    1.16492   1.000 0.317188    
karn8                 1.96890    1.14024   1.727 0.084213 .  
karn9                 2.08343    1.10957   1.878 0.060423 .  
karn10                1.84071    1.10573   1.665 0.095973 .  
figo4                -0.94158    0.36213  -2.600 0.009319 ** 
ascites1             -0.06259    0.44853  -0.140 0.889014    
ascites2             -0.51534    0.41020  -1.256 0.208993    
diam4                 0.45989    0.39067   1.177 0.239126    
diam5                 0.64230    0.42853   1.499 0.133915    
diam6                 1.06054    0.37732   2.811 0.004943 ** 
diam7                 1.65249    0.50099   3.298 0.000972 ***
splines::ns(age, 3)1 -2.77535    0.66573  -4.169 3.06e-05 ***
splines::ns(age, 3)2  1.14961    2.28102   0.504 0.614268    
splines::ns(age, 3)3 -0.31205    1.30546  -0.239 0.811079    
broders21             1.39227    0.54262   2.566 0.010293 *  
broders22             0.36669    0.47535   0.771 0.440457    
broders23             0.09572    0.45323   0.211 0.832731    
broders24             0.95063    0.53452   1.778 0.075327 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 444.89  on 357  degrees of freedom
Residual deviance: 342.87  on 339  degrees of freedom
AIC: 380.87

Number of Fisher Scoring iterations: 5
prob <- predict.glm(m1, type='response', newdata =  ova2)
pred <- prediction(prob, ova2$surv)

# AUC
auc <- performance(pred,"auc")@y.values[[1]][1]
auc #new auc is 0.812, improved!
[1] 0.8120645
BrierScore <- mean((prob-ova2$surv)^2)
BrierScore #also improved;
[1] 0.1576356
#comparing two calibration plots;
# pdat <- with(ova2, data.frame(surv,prob = predict(m0, type = "response")))
pdat_nonlinear <- with(ova2, data.frame(surv,prob_linear = predict(m0, type = "response"), prob_nonlinear = predict(m1, type = "response")))

ggplot(pdat_nonlinear, aes(prob_linear, surv))+
  geom_abline(slope = 1, intercept = 0, colour = "red")+ # reference line for perfect calibration;
  geom_smooth(aes(x = prob_linear, y = surv, colour = "Linear"), method = stats::loess)+
  geom_smooth(aes(x = prob_nonlinear, y = surv, colour = "Spline"), method = stats::loess)+
  labs(x = "Estimated Probability",
       y = "Data Probability")+
  scale_color_manual(name = "Regression model",
                     breaks = c("Linear", "Spline"),
                     values = c("Linear" = "black", "Spline" = "blue")) +
  ggtitle("Logistic Regression Calibration Plot")

  • The spline model has better predictive performance compared to the linear model.

2. Machine learning methods

2.1 Best online introductory tutorial on machine learning methods (personal opinion)

  • Machine Learning for Everyone at https://vas3k.com/blog/machine_learning/.
  • A few figures from this online tutorial give an overview of ML and classic methods.

  • For binary prediction models, we will be looking at supervised learning and binary classification methods

  • Popular algorithms:

    • Logistic Regression, Support Vector Machine, discriminant analysis, Neural Networks
    • Decision Tree, Random Forest, AdaBoost, XGBoost
    • K-Nearest Neighbours, Naive Bayes

2.2 Super (Machine) Learning

  • Super learning can be used to obtain a robust estimator. In a nutshell, it uses loss-function-based machine learning tools and cross-validation to obtain the best prediction of the model parameter of interest, based on a weighted average of a library of machine learning algorithms.

  • Guide to SuperLearner by Chris Kennedy at https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html

  • New visual guide created by Katherine Hoffman

  • List of machine learning algorithms under SuperLearner R package
library(SuperLearner)
listWrappers()
 [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
 [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
 [7] "SL.earth"            "SL.extraTrees"       "SL.gam"             
[10] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
[13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"       
[16] "SL.knn"              "SL.ksvm"             "SL.lda"             
[19] "SL.leekasso"         "SL.lm"               "SL.loess"           
[22] "SL.logreg"           "SL.mean"             "SL.nnet"            
[25] "SL.nnls"             "SL.polymars"         "SL.qda"             
[28] "SL.randomForest"     "SL.ranger"           "SL.ridge"           
[31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"        
[34] "SL.speedlm"          "SL.step"             "SL.step.forward"    
[37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"             
[40] "SL.template"         "SL.xgboost"         
[1] "All"
[1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
[4] "screen.randomForest"   "screen.SIS"            "screen.template"      
[7] "screen.ttest"          "write.screen.template"
  • We can use ML to predict the binary outcome.
y <- ova2$surv
x <- ova2[,2:7] # predictors: karn, figo, ascites, diam, age, broders2;
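Before the cross-validated comparison below, it can help to see the weighted-average idea from Section 2.2 directly. A minimal sketch of a single SuperLearner fit (the learner library here is illustrative only; assumes the xgboost package is installed):

set.seed(123)
sl.fit <- SuperLearner(Y = y, X = x, family = binomial(),
                       SL.library = c("SL.glm", "SL.xgboost", "SL.mean"))
sl.fit # prints each learner's cross-validated risk and its weight (Coef) in the ensemble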

# parallel computing to speed up;
library(parallel)
# Create a cluster with 8 workers;
cl <- parallel::makeCluster(8)

# Load libraries on workers
parallel::clusterEvalQ(cl, { 
  library(SuperLearner); library(xgboost)
 })
[[1]]
 [1] "xgboost"      "SuperLearner" "gam"          "foreach"      "splines"     
 [6] "nnls"         "stats"        "graphics"     "grDevices"    "utils"       
[11] "datasets"     "methods"      "base"        

(identical output repeated for workers [[2]] through [[8]])
# Export all references to cluster nodes;
parallel::clusterExport(cl, c('x','y'))

# Set a common seed for the cluster;
parallel::clusterSetRNGStream(cl, iseed = 123)

# Do cross validation in parallel
 system.time({
   ensem.AUC.cv <- CV.SuperLearner(Y=y ,X=x,
                                   cvControl=list(V=10),
                                   parallel=cl,
                                   family=binomial(),
                                   method="method.AUC",
                                   SL.library=list("SL.glm",
                                            "SL.ksvm",
                                            "SL.xgboost",
                                            "SL.nnet"))
   })
   user  system elapsed 
   0.01    0.01   17.82 
summary(ensem.AUC.cv)

Call:  
CV.SuperLearner(Y = y, X = x, family = binomial(), SL.library = list("SL.glm",  
    "SL.ksvm", "SL.xgboost", "SL.nnet"), method = "method.AUC", cvControl = list(V = 10),  
    parallel = cl) 

Risk is based on: Area under ROC curve (AUC)

All risk estimates are based on V =  10 

      Algorithm     Ave se     Min     Max
  Super Learner 0.74128 NA 0.61154 0.86622
    Discrete SL 0.75173 NA 0.63077 0.86622
     SL.glm_All 0.75173 NA 0.63077 0.86622
    SL.ksvm_All 0.70975 NA 0.60800 0.86288
 SL.xgboost_All 0.68759 NA 0.62393 0.84444
    SL.nnet_All 0.66410 NA 0.50000 0.80400
parallel::stopCluster(cl)

# set.seed(123)
# Get V-fold cross-validated risk estimate without parallel;
cv.model <- CV.SuperLearner(y,
                            x,
                            V=5,
                            family = binomial(),
                            method = "method.AUC",
                            SL.library=list("SL.glm",
                                            "SL.ksvm",
                                            "SL.xgboost",
                                            "SL.nnet"))

# Print out the summary statistics
# summary(cv.model)
# plot(cv.model)

Logistic regression works pretty well!