2 Wilcoxon rank sum test; Pearson’s Chi-squared test
1.2 Data manipulation
We will have a guest lecture on missing data by Prof. Aya Mitani, see course schedule.
We can use multiple imputation to handle missing data, this will be covered in the guest lecture.
In this example, we show how you can summarize missingness by other covariates and we create a new category for unkown BRODERS scores (0=unknown)
#step 1: summarize data by missing status;ova %>%mutate(missing =ifelse(is.na(broders)==T,1,0)) %>%select(-broders) %>%tbl_summary(by = missing,missing_text ="(Missing)") %>%add_p()
Characteristic
0, N = 3071
1, N = 511
p-value2
t
779 (386, 1,667)
649 (235, 1,407)
0.043
d
225 (73%)
41 (80%)
0.3
surv
101 (33%)
11 (22%)
0.11
karn
0.7
6
17 (5.5%)
3 (5.9%)
7
37 (12%)
9 (18%)
8
40 (13%)
7 (14%)
9
96 (31%)
12 (24%)
10
117 (38%)
20 (39%)
figo
0.069
3
230 (75%)
32 (63%)
4
77 (25%)
19 (37%)
ascites
0.4
0
42 (14%)
10 (20%)
1
84 (27%)
10 (20%)
2
181 (59%)
31 (61%)
diam
0.3
3
123 (40%)
22 (43%)
4
60 (20%)
8 (16%)
5
39 (13%)
10 (20%)
6
57 (19%)
10 (20%)
7
28 (9.1%)
1 (2.0%)
age
56 (46, 65)
54 (45, 66)
>0.9
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test
#Step 2: instead of imputation, we will create a new category to label missing broders scoresova <- ova %>%mutate(broders2 =ifelse(is.na(broders)==T,0,broders)) #if missing assign value 0, otherwise assign the original value;
Tips on handling missing data
when you have multiple variables with missing information, you can use data vistualization to display missing pattern, reference tutorial R see, https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html
Under missing at random, Rule of thumb, if you have <= 25% missing, you can consider multiple imputation. Number of imputed copies to be considered between 5 to 20.
For categorical variable, adding a new category for missing can be preferred if you don’t want to make any missing assumptions.
At this step, you can also evaluate your data and see if recoding is needed to collapse categories (categories with rare outcome or categories with small cell).
1.3 Model building strategies
Predictive models should include those variables that reflect the study population of interest
When more is not better: variables can be highly correlated, removing one of the highly correlated predictor does not compromise the model performance and the reduced model is computationally more efficient.
Good practice to assess pair-wise correlation before building your predictive model
You can also use model selection approaches for example, LASSO or bootstrap + stepwise (not stepwise by itself), however, since we are building a predictive model. Model selection should ideally be performed to maximize predictive power and accuracy not fit! That is comparing predictive model based on C-statistics, ROC not AIC or BIC.
#create a dataset to host all variables that will be used to build the prediction model;ova2 <- ova %>%select(-t, -d, -broders)#create pair-wise correlation plot;library(corrplot)corrplot(cor(ova2))
now let’s build a model no interaction and no non-linear terms, just a simple additive logistic regression model
m0 <-glm(surv ~as.factor(karn)+as.factor(figo)+as.factor(ascites)+as.factor(diam)+age+as.factor(broders2), family="binomial", data = ova2)summary(m0)
Call:
glm(formula = surv ~ as.factor(karn) + as.factor(figo) + as.factor(ascites) +
as.factor(diam) + age + as.factor(broders2), family = "binomial",
data = ova2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0132 -0.7626 -0.4787 0.7940 2.5970
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.666311 1.264863 -0.527 0.598343
as.factor(karn)7 1.081171 1.150849 0.939 0.347497
as.factor(karn)8 1.901206 1.124047 1.691 0.090762 .
as.factor(karn)9 1.825588 1.092649 1.671 0.094763 .
as.factor(karn)10 1.708743 1.090427 1.567 0.117105
as.factor(figo)4 -0.919926 0.355006 -2.591 0.009562 **
as.factor(ascites)1 -0.006856 0.438393 -0.016 0.987523
as.factor(ascites)2 -0.416012 0.398599 -1.044 0.296631
as.factor(diam)4 0.554593 0.380419 1.458 0.144883
as.factor(diam)5 0.682469 0.421686 1.618 0.105570
as.factor(diam)6 1.205796 0.372404 3.238 0.001204 **
as.factor(diam)7 1.704534 0.495323 3.441 0.000579 ***
age -0.043949 0.009831 -4.470 7.81e-06 ***
as.factor(broders2)1 1.272036 0.526319 2.417 0.015655 *
as.factor(broders2)2 0.252897 0.462477 0.547 0.584494
as.factor(broders2)3 0.063403 0.443116 0.143 0.886224
as.factor(broders2)4 0.895321 0.520517 1.720 0.085421 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 444.89 on 357 degrees of freedom
Residual deviance: 351.60 on 341 degrees of freedom
AIC: 385.6
Number of Fisher Scoring iterations: 5
1.4 Model performance
Two key elements in assessing the performance of a fitted logistic regression model are the assessment of model calibration and model discrimination.
Calibration refers to the agreement between observed outcomes and predictions (brier score, calibration plot)
Discrimination refers to the ability of model predictions to discriminate between those with and those without the outcome (C-statistics, ROC)
First, lets estimate c-statistics (AUC) and plot ROC curve
“The c-statistic is defined as the proportion of all possible pairs of subjects, one of whom experienced the outcome of interest and one of whom did not, in which the subject who experienced the outcome of interest had a higher predicted probability of the outcome occurring than does the subject who did not experience the outcome.” - From Austin, Harrell, and Steyerberg, 2021
# install.packages('ROCR')library(ROCR)# in-sample: the same training data set on which the model was fit;prob <-predict.glm(m0, type='response', newdata = ova2)pred <-prediction(prob, ova2$surv)# AUCauc <-performance(pred,"auc")@y.values[[1]][1]auc #c-statistics for model m0 is 0.8 pretty good!
[1] 0.8003412
# plot the ROC curveperf <-performance(pred,"tpr","fpr")plot(perf, col="navyblue", cex.main=1,main=paste("Logistic Regression ROC Curve \n AUC =", round(auc,3)))abline(a=0, b =1, col='red')
# we can obtain a confidence interval for AUC via bootstrap;# package pROC has a function that allows you do do this more easily;# I provided below sample codes to do bootstrap manually;n<-dim(ova2)[1]auc.boot <-rep(NA, 1000) #initiating a empty vector of size 1000 to save the bootstrap auc;set.seed(123)for(i in1:1000){# random draw with replacement to form a bootstrap sample with same size as the original data; obs.boot <-sample(x =1:n, size = n, replace = T) data.boot <- ova2[obs.boot, ]# fit the model on bootstrap sample m0.boot <-glm(m0$formula , #extracting model formula from m0;data=data.boot,family =binomial(link ="logit"))# calculate and save auc for each bootstrap iteration; prob1 =predict(m0.boot, type='response', data.boot) pred1 =prediction(prob1, data.boot$surv) auc.boot[i] <-performance(pred1,"auc")@y.values[[1]][1]}# 95% CI for auc using empirical distribution;quantile(auc.boot, c(.025, .975))
2.5% 97.5%
0.7767338 0.8635063
Brier score is simply the mean of squared differences between the predicted probability of outcome = 1 and the observed event status (0 or 1). A measurement that quantifies the predictive accuracy.
Brier score between 0 and 1, 0 is perfect prediction; 0.25 is predicting by chance - as flipping a coin; 1 is worst prediction, worst than flipping a coin.
#try this yourself, can you use the bootstrap code to get the empirical distirbution of BrierScore?
Regression calibration plot to compare the probability of outcome from the data with the estimated probability of outcome from the model.
perfect calibration if the empirical probability matches with the model estimated probability in a diagonal line,
we can fit a linear regression model for the data probability (as y) and model estimated probability (as x) and report the intercept and slope of this simple linear regression as the calibration intercept and slope
pdat <-with(ova2, data.frame(surv,prob =predict(m0, type ="response")))lm(surv ~ prob, data = pdat)
# intercept estimated at 0.001157 and slope estimated at 0.9963;ggplot(pdat, aes(prob,surv))+geom_abline(slope=1, intercept =0, color ="red")+geom_smooth(method = stats::loess, se =FALSE, color="black")+xlab("Esimated Probability")+ylab("Data Probability")+ggtitle("Logistic Regression Calibration Plot")
# getting calibration plot using rms package;library(rms)ova2 <- ova2 %>%mutate(karn =as.factor(karn),figo =as.factor(figo),ascites =as.factor(ascites),diam =as.factor(diam),broders2 =as.factor(broders2))m0_rms <-lrm( surv ~ karn + figo + ascites + diam + age + broders2, data = ova2, x =TRUE, y =TRUE)plot(calibrate(m0_rms, B=200))
n=358 Mean absolute error=0.031 Mean squared error=0.00115
0.9 Quantile of absolute error=0.055
#Resampling Validation of a Fitted Model using rms package (might have seen this from Biostat 2 with Kevin);validate(m0_rms, B=200)
The spline model has better predictive performance comparing to the linear model.
2. Machine learning methods
2.1 Best online introductory tutorial on machine learning methods (personal opinion)
Machine Learning for Everyone at https://vas3k.com/blog/machine_learning/.
Taking a few figures from this online tutorial to give an overview of ML and classic methods
For binary prediction model, we will be looking at supervised learning and binary classification methods
Popular algorithms:
Logistic Regression, Support Vector Machine, discriminant analysis, Neural Networks
Decision Tree, Random Forest, AdeBoost, XGBoost
K-Nearest Neighbours, Naive Bayes
2.2 Super (Machine) Learning
Super learning can be used to obtain robust estimator. In a nut-shell it uses loss-function based ML tool and cross-validation to obtain the best prediction of model parameter of interest, based on a weighted average of a library of machine learning algorithms.
Guide to SuperLearner by Chris Kennedy at https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html
New visual guide created by Katherine Hoffman
List of machine learning algorithms under SuperLearner R package
We can use ML to model predict the binary outcome.
y <- ova2$survx <- ova2[,2:7]# parallel computing to speed up;library(parallel)# Create a cluster using most CPUs;cl <- parallel::makeCluster(8)# Load libraries on workersparallel::clusterEvalQ(cl, { library(SuperLearner); library(xgboost) })
# Export all references to cluster nodes;parallel::clusterExport(cl, c('x','y'))# Set a common seed for the cluster;parallel::clusterSetRNGStream(cl, iseed =123)# Do cross validation in parallelsystem.time({ ensem.AUC.cv <-CV.SuperLearner(Y=y ,X=x,cvControl=list(V=10),parallel=cl,family=binomial(),method="method.AUC",SL.library=list("SL.glm","SL.ksvm","SL.xgboost","SL.nnet")) })
user system elapsed
0.01 0.01 17.82
summary(ensem.AUC.cv)
Call:
CV.SuperLearner(Y = y, X = x, family = binomial(), SL.library = list("SL.glm",
"SL.ksvm", "SL.xgboost", "SL.nnet"), method = "method.AUC", cvControl = list(V = 10),
parallel = cl)
Risk is based on: Area under ROC curve (AUC)
All risk estimates are based on V = 10
Algorithm Ave se Min Max
Super Learner 0.74128 NA 0.61154 0.86622
Discrete SL 0.75173 NA 0.63077 0.86622
SL.glm_All 0.75173 NA 0.63077 0.86622
SL.ksvm_All 0.70975 NA 0.60800 0.86288
SL.xgboost_All 0.68759 NA 0.62393 0.84444
SL.nnet_All 0.66410 NA 0.50000 0.80400
parallel::stopCluster(cl)# set.seed(123)# Get V-fold cross-validated risk estimate without parallel;cv.model <-CV.SuperLearner(y, x,V=5,family =binomial(),method ="method.AUC",SL.library=list("SL.glm","SL.ksvm","SL.xgboost","SL.nnet"))# Print out the summary statistics# summary(cv.model)# plot(cv.model)