1.2 Data manipulation
We will have a guest lecture on missing data by Prof. Aya Mitani; see the course schedule.
We can use multiple imputation to handle missing data; this will be covered in the guest lecture.
In this example, we show how you can summarize missingness by other covariates, and we create a new category for unknown Broders scores (0 = unknown).
# Step 1: summarize data by missing status;
# do any variables have a differential distribution by missing Broders score?
ova %>%
  mutate(missing = ifelse(is.na(broders), 1, 0)) %>%
  select(-broders) %>%
  tbl_summary(by = missing, missing_text = "(Missing)") %>%
  add_p()
| Characteristic | 0, N = 307¹ | 1, N = 51¹ | p-value² |
|---|---|---|---|
| t | 779 (386, 1,667) | 649 (235, 1,407) | 0.043 |
| d | 225 (73%) | 41 (80%) | 0.3 |
| surv | 101 (33%) | 11 (22%) | 0.11 |
| karn | | | 0.7 |
| 6 | 17 (5.5%) | 3 (5.9%) | |
| 7 | 37 (12%) | 9 (18%) | |
| 8 | 40 (13%) | 7 (14%) | |
| 9 | 96 (31%) | 12 (24%) | |
| 10 | 117 (38%) | 20 (39%) | |
| figo | | | 0.069 |
| 3 | 230 (75%) | 32 (63%) | |
| 4 | 77 (25%) | 19 (37%) | |
| ascites | | | 0.4 |
| 0 | 42 (14%) | 10 (20%) | |
| 1 | 84 (27%) | 10 (20%) | |
| 2 | 181 (59%) | 31 (61%) | |
| diam | | | 0.3 |
| 3 | 123 (40%) | 22 (43%) | |
| 4 | 60 (20%) | 8 (16%) | |
| 5 | 39 (13%) | 10 (20%) | |
| 6 | 57 (19%) | 10 (20%) | |
| 7 | 28 (9.1%) | 1 (2.0%) | |
| age | 56 (46, 65) | 54 (45, 66) | >0.9 |
| race | | | 0.4 |
| non-white | 114 (37%) | 22 (43%) | |
| white | 193 (63%) | 29 (57%) | |

¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test
# Step 2: try simple imputation; we replace missing values with the mean
ova <- ova %>%
  mutate(broders2 = ifelse(is.na(broders), mean(broders, na.rm = TRUE), broders))
# if missing, assign the mean value; otherwise keep the original value
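As an alternative to mean imputation, the new category for unknown Broders scores mentioned above (0 = unknown) can be coded directly; a minimal sketch, using the illustrative names ova_alt and broders3 so that broders2 (used in the models below) is left untouched:

# a minimal sketch of the "0 = unknown" category approach; ova_alt and broders3 are illustrative names
ova_alt <- ova %>%
  mutate(broders3 = ifelse(is.na(broders), 0, broders))  # 0 flags an unknown Broders score
table(ova_alt$broders3, useNA = "ifany")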
Tips on handling missing data
When you have multiple variables with missing information, you can use data visualization to display the missingness pattern; for an R tutorial, see https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html (a short sketch follows these tips).
Under missing at random, a rule of thumb is that if you have <= 25% missingness you can consider multiple imputation, with the number of imputed copies typically between 5 and 20.
For a categorical variable, adding a new category for missing values can be preferred if you don’t want to make any missingness assumptions.
At this step, you can also evaluate your data and see if recoding is needed to collapse categories (categories with rare outcomes or small cell counts).
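For the visualization and multiple-imputation tips above, a minimal sketch, assuming the naniar and mice packages are installed (imp and ova_imp1 are illustrative names):

# a minimal sketch, assuming naniar and mice are installed
library(naniar)
library(mice)

# display the missingness pattern (here only broders has missing values)
vis_miss(ova)
gg_miss_var(ova)

# multiple imputation with m = 5 imputed copies (rule of thumb above: 5 to 20);
# drop the simple-imputation column broders2 so that broders itself is imputed
imp <- mice(ova %>% select(-broders2) %>% mutate(race = as.factor(race)),
            m = 5, seed = 123, printFlag = FALSE)
ova_imp1 <- complete(imp, 1)  # first completed copy, for illustration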
1.3 Model building strategies
Predictive models should include variables that reflect the study population of interest.
When more is not better: variables can be highly correlated; removing one of a pair of highly correlated predictors does not compromise model performance, and the reduced model is computationally more efficient.
It is good practice to assess pairwise correlations before building your predictive model (see the sketch after the data-set code below).
You can also use model selection approaches, for example LASSO or bootstrap + stepwise (not stepwise by itself). However, since we are building a predictive model, model selection should ideally be performed to maximize predictive power and accuracy, not fit! That is, compare predictive models based on C-statistics and ROC, not AIC or BIC.
# create a dataset to host all variables that will be used to build the prediction model
ova2 <- ova %>% select(-t, -d, -broders)
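Before fitting, a minimal sketch of the pairwise-correlation check and a LASSO alternative mentioned in the points above, assuming the glmnet package is installed and treating the ordinal scores as numeric for simplicity (the |r| > 0.8 cutoff and the lambda.1se choice are illustrative):

# pairwise correlations among the numeric predictors; flag pairs with |r| > 0.8 (illustrative cutoff)
round(cor(ova2 %>% select(where(is.numeric)) %>% select(-surv)), 2)

# LASSO with lambda chosen by cross-validated AUC (predictive power, not fit)
library(glmnet)
x <- model.matrix(surv ~ ., data = ova2)[, -1]   # drop the intercept column
set.seed(123)
cv_fit <- cv.glmnet(x, ova2$surv, family = "binomial", type.measure = "auc")
coef(cv_fit, s = "lambda.1se")  # predictors retained at the more parsimonious lambda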
Now let’s build a model with no interactions and no non-linear terms, just a simple additive logistic regression model.
m0 <- glm(surv ~ as.factor(karn) + as.factor(figo) + as.factor(ascites) + as.factor(diam) +
            age + as.factor(broders2) + as.factor(race),
          family = "binomial", data = ova2)
summary(m0)
Two key elements in assessing the performance of a fitted logistic regression model are model calibration and model discrimination.
Calibration refers to the agreement between observed outcomes and predictions (Brier score, calibration plot).
Discrimination refers to the ability of model predictions to discriminate between those with and those without the outcome (C-statistic, ROC).
First, let’s estimate the c-statistic (AUC) and plot the ROC curve.
“The c-statistic is defined as the proportion of all possible pairs of subjects, one of whom experienced the outcome of interest and one of whom did not, in which the subject who experienced the outcome of interest had a higher predicted probability of the outcome occurring than does the subject who did not experience the outcome.” - From Austin, Harrell, and Steyerberg, 2021
# install.packages('ROCR')
library(ROCR)
# in-sample: the same training data set on which the model was fit
prob <- predict.glm(m0, type = 'response', newdata = ova2)
pred <- prediction(prob, ova2$surv)
# AUC
auc <- performance(pred, "auc")@y.values[[1]][1]
auc  # the c-statistic for model m0 is 0.80, pretty good!
[1] 0.8007767
# plot the ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "navyblue", cex.main = 1,
     main = paste("Logistic Regression ROC Curve \n AUC =", round(auc, 3)))
abline(a = 0, b = 1, col = 'red')
# we can obtain a confidence interval for the AUC via bootstrap;
# the pROC package has a function that allows you to do this more easily;
# below is sample code to do the bootstrap manually
n <- dim(ova2)[1]
auc.boot <- rep(NA, 1000)  # initiate an empty vector of size 1000 to save the bootstrap AUCs
set.seed(123)
for (i in 1:1000) {
  # random draw with replacement to form a bootstrap sample with the same size as the original data
  obs.boot <- sample(x = 1:n, size = n, replace = TRUE)
  data.boot <- ova2[obs.boot, ]
  # fit the model on the bootstrap sample
  m0.boot <- glm(m0$formula,  # extracting the model formula from m0
                 data = data.boot,
                 family = binomial(link = "logit"))
  # calculate and save the AUC for each bootstrap iteration
  prob1 <- predict(m0.boot, type = 'response', data.boot)
  pred1 <- prediction(prob1, data.boot$surv)
  auc.boot[i] <- performance(pred1, "auc")@y.values[[1]][1]
}
# 95% CI for the AUC using the empirical distribution
quantile(auc.boot, c(.025, .975))
2.5% 97.5%
0.7777468 0.8647307
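As the comment in the bootstrap code notes, the pROC package can do this more directly; a minimal sketch, reusing prob from the in-sample predictions above (roc_m0 is an illustrative name):

# a minimal sketch of the pROC shortcut mentioned above
library(pROC)
roc_m0 <- roc(ova2$surv, prob, quiet = TRUE)
ci.auc(roc_m0, method = "bootstrap")  # bootstrap 95% CI for the AUC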
The Brier score is simply the mean of the squared differences between the predicted probability of outcome = 1 and the observed event status (0 or 1). It is a measure that quantifies predictive accuracy.
The Brier score ranges between 0 and 1: 0 is perfect prediction; 0.25 is predicting by chance, like flipping a coin; 1 is the worst prediction, worse than flipping a coin.
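As a starting point for the exercise below, a minimal sketch of the in-sample Brier score for m0 (prob_m0 and brier_m0 are illustrative names):

# a minimal sketch: in-sample Brier score for model m0
prob_m0 <- predict(m0, type = "response", newdata = ova2)
brier_m0 <- mean((prob_m0 - ova2$surv)^2)  # mean squared difference between predicted probability and 0/1 outcome
brier_m0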
# try this yourself: can you use the bootstrap code to get the empirical distribution of the Brier score?
A regression calibration plot compares the probability of the outcome observed in the data with the probability of the outcome estimated by the model.
Calibration is perfect if the empirical probability matches the model-estimated probability along the diagonal line.
We can fit a linear regression model of the data probability (as y) on the model-estimated probability (as x) and report the intercept and slope of this simple linear regression as the calibration intercept and slope.
pdat <- with(ova2, data.frame(surv, prob = predict(m0, type = "response")))
lm(surv ~ prob, data = pdat)
# intercept estimated at 0.001157 and slope estimated at 0.9963
ggplot(pdat, aes(prob, surv)) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  geom_smooth(method = stats::loess, se = FALSE, color = "black") +
  xlab("Estimated Probability") +
  ylab("Data Probability") +
  ggtitle("Logistic Regression Calibration Plot")
# getting the calibration plot using the rms package
library(rms)
ova2 <- ova2 %>%
  mutate(karn = as.factor(karn),
         figo = as.factor(figo),
         ascites = as.factor(ascites),
         diam = as.factor(diam),
         broders2 = as.factor(broders2))
m0_rms <- lrm(surv ~ karn + figo + ascites + diam + age + broders2,
              data = ova2, x = TRUE, y = TRUE)
plot(calibrate(m0_rms, B = 200))
n=358 Mean absolute error=0.031 Mean squared error=0.00115
0.9 Quantile of absolute error=0.055
# Resampling validation of a fitted model using the rms package (you might have seen this in Biostat 2 with Kevin)
validate(m0_rms, B = 200)
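Note that for a logistic model, validate() reports Somers' Dxy rather than the c-statistic directly; a minimal sketch of recovering the optimism-corrected c-statistic via the relation c = Dxy/2 + 0.5 (val is an illustrative name):

# a minimal sketch: optimism-corrected c-statistic from the validate() output
val <- validate(m0_rms, B = 200)
dxy_corrected <- val["Dxy", "index.corrected"]
c_corrected <- dxy_corrected / 2 + 0.5
c_corrected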
Demographic Parity Difference:
Difference in positive prediction rates between Non-White and White groups.
A value close to 0 indicates parity in positive predictions across races. Negative values suggest Non-White individuals are less likely to be predicted as having an event.
Equal Opportunity Difference:
Difference in sensitivity (true positive rates) between Non-White and White groups.
A value close to 0 indicates equal ability to correctly identify events across races. Positive values indicate Non-White individuals have higher sensitivity.
Predictive Parity Difference:
Difference in positive predictive values (PPV) between Non-White and White groups.
A value close to 0 indicates that the likelihood of correctly predicting an event is similar across races. Negative values suggest Non-White predictions are less reliable.
Disparate Impact:
Ratio of positive prediction rates between Non-White and White groups.
Values close to 1 indicate parity. Values below 0.8 (80%) are often considered indicative of potential bias.
library(caret)
library(pROC)
library(kableExtra)
ova2$age_spline <- splines::ns(ova2$age, 3)
ova2s <- select(ova2, -c(age))
# Prepare the outcome variable as a factor with meaningful labels
ova2s <- ova2s %>%
  mutate(surv = factor(surv, levels = c(0, 1), labels = c("No_Event", "Event")))

# ----------------------------
# Cross-Validation Setup
# ----------------------------
# Define the cross-validation strategy: 5-fold CV
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = "final",          # Save the final predictions for each fold
  classProbs = TRUE,
  summaryFunction = twoClassSummary,  # To calculate ROC
  allowParallel = TRUE                # Enable parallel processing if available
)

# ----------------------------
# Model Training with Cross-Validation
# ----------------------------
# Train the logistic regression model with 5-fold cross-validation
set.seed(123)
model_cv <- train(
  surv ~ .,
  data = ova2s,           # All predictors including spline terms
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "ROC"
)
print(model_cv)
Generalized Linear Model
358 samples
7 predictor
2 classes: 'No_Event', 'Event'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 287, 287, 285, 287, 286
Resampling results:
ROC Sens Spec
0.7540495 0.8539592 0.4648221
# ----------------------------
# Predictive Performance Assessment
# ----------------------------
# Extract cross-validated predictions
cv_predictions <- model_cv$pred
# For logistic regression, there is typically one set of predictions per fold
# Remove any unnecessary columns (like mtry) that are not applicable to glm
cv_predictions <- cv_predictions %>%
  select(obs, Event)

# Calculate ROC and AUC
roc_obj <- roc(cv_predictions$obs, cv_predictions$Event,
               levels = c("No_Event", "Event"), direction = "<")

# Plot ROC Curve
plot(roc_obj, main = "ROC Curve with 5-Fold Cross-Validation")
# ----------------------------
# Fairness Assessment
# ----------------------------
# Define a prediction threshold, e.g., 0.5
threshold <- 0.5

# Assign predicted classes based on the threshold
cv_predictions <- cv_predictions %>%
  mutate(
    pred_class = ifelse(Event >= threshold, 1, 0),
    actual_event = ifelse(obs == "Event", 1, 0),
    id = row_number()
  )

# Merge cross-validated predictions with the original data to get race information
ova2s$id <- seq_len(nrow(ova2s))
cv_predictions_full <- merge(ova2s, cv_predictions)

# Split predictions by race
white_ova2s <- cv_predictions_full %>% filter(race == "white")
nonwhite_ova2s <- cv_predictions_full %>% filter(race == "non-white")

# Function to calculate confusion matrix metrics
calculate_metrics <- function(df) {
  TP <- sum(df$pred_class == 1 & df$actual_event == 1)
  TN <- sum(df$pred_class == 0 & df$actual_event == 0)
  FP <- sum(df$pred_class == 1 & df$actual_event == 0)
  FN <- sum(df$pred_class == 0 & df$actual_event == 1)
  sensitivity <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))   # True Positive Rate
  specificity <- ifelse((TN + FP) == 0, NA, TN / (TN + FP))   # True Negative Rate
  ppv <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))           # Positive Predictive Value
  npv <- ifelse((TN + FN) == 0, NA, TN / (TN + FN))           # Negative Predictive Value
  accuracy <- ifelse((TP + TN + FP + FN) == 0, NA, (TP + TN) / (TP + TN + FP + FN))
  positive_rate <- ifelse(nrow(df) == 0, NA, sum(df$pred_class == 1) / nrow(df))  # P(Y_hat = 1 | A = a)
  return(list(
    sensitivity = sensitivity,
    specificity = specificity,
    ppv = ppv,
    npv = npv,
    accuracy = accuracy,
    positive_rate = positive_rate
  ))
}

# Calculate metrics overall and for the White and Non-White groups
metrics_all <- calculate_metrics(cv_predictions_full)
metrics_white <- calculate_metrics(white_ova2s)
metrics_nonwhite <- calculate_metrics(nonwhite_ova2s)

# Create a data frame for fairness metrics
fairness_table <- data.frame(
  Metric = c(
    "Sensitivity (True Positive Rate)",
    "Specificity (True Negative Rate)",
    "PPV (Positive Predictive Value)",
    "NPV (Negative Predictive Value)",
    "Accuracy",
    "Positive Rate"
  ),
  All = c(metrics_all$sensitivity, metrics_all$specificity, metrics_all$ppv,
          metrics_all$npv, metrics_all$accuracy, metrics_all$positive_rate),
  White_Group = c(metrics_white$sensitivity, metrics_white$specificity, metrics_white$ppv,
                  metrics_white$npv, metrics_white$accuracy, metrics_white$positive_rate),
  Non_White_Group = c(metrics_nonwhite$sensitivity, metrics_nonwhite$specificity, metrics_nonwhite$ppv,
                      metrics_nonwhite$npv, metrics_nonwhite$accuracy, metrics_nonwhite$positive_rate)
)

# Round the numeric values for better readability
fairness_table <- fairness_table %>%
  mutate(
    All = round(All, 3),
    White_Group = round(White_Group, 3),
    Non_White_Group = round(Non_White_Group, 3)
  )

fairness_table %>%
  kable(
    format = "html",
    caption = "Fairness Metrics by Race",
    align = c("l", "c", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "left"
  )
Fairness Metrics by Race

| Metric | All | White_Group | Non_White_Group |
|---|---|---|---|
| Sensitivity (True Positive Rate) | 0.464 | 0.452 | 0.487 |
| Specificity (True Negative Rate) | 0.854 | 0.839 | 0.876 |
| PPV (Positive Predictive Value) | 0.591 | 0.579 | 0.613 |
| NPV (Negative Predictive Value) | 0.778 | 0.758 | 0.810 |
| Accuracy | 0.732 | 0.712 | 0.765 |
| Positive Rate | 0.246 | 0.257 | 0.228 |
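Using the group-level metrics from the code above, the four fairness measures defined earlier can be computed directly; a minimal sketch, with Non-White as the comparison group and White as the reference:

# a minimal sketch using metrics_white and metrics_nonwhite from above
demographic_parity_diff <- metrics_nonwhite$positive_rate - metrics_white$positive_rate
equal_opportunity_diff  <- metrics_nonwhite$sensitivity  - metrics_white$sensitivity
predictive_parity_diff  <- metrics_nonwhite$ppv          - metrics_white$ppv
disparate_impact        <- metrics_nonwhite$positive_rate / metrics_white$positive_rate
round(c(demographic_parity_difference = demographic_parity_diff,
        equal_opportunity_difference  = equal_opportunity_diff,
        predictive_parity_difference  = predictive_parity_diff,
        disparate_impact              = disparate_impact), 3)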
# This is the correct sensitivity; caret's resampling summary above is flipped because it treats the first factor level ("No_Event") as the positive class
caret::sensitivity(table(cv_predictions$pred_class, cv_predictions$actual_event), positive = "1")  # correct one