Bayesian logistic regression

Learning Objectives
  • Learn how to fit logistic regression model with brms
  • Understand how to interpret and report regression posterior results
  • Learn Bayesian variable selection for logistic regression using shrinkage priors and projection predictive methods
  • Evaluate predictive performance of Bayesian logistic regression models (AUC, Brier score, calibration)
  • Estimate and interpret algorithmic fairness metrics from a Bayesian logistic regression model
  • Conduct Bayesian meta-regression for binary outcomes

1. Logistic regression

1.1 Models for Binary Data

  • For binary outcome Y, we are interested to model the probability of \(\pi_i = P(Y_i=1)\), \(i=1, \ldots, n\)

\[Y_i \mid \pi_i, n_i \sim Bern(\pi_i)\]

\[ E(Y_i \mid \pi_i) = \pi\]

\[logit( \pi_i) = \beta_0 + \beta_1 X_{i1}\]

  • logistic regression model featuring logit link that connects binary outcome to a conventional linear regression format, is part of the broader class of generalized linear models

  • Logit link convert probabilities to log odds.

\[\log(\frac{\pi_i}{1-\pi_i}) = \log(\frac{P(Y_i=1)}{1- P(Y_i=1)})\]

\[\log(\frac{\pi_i}{1-\pi_i}) = \beta_0 + \beta_1 X_{i1}\]

p1<-ggplot(data.frame(x = c(0, 1)), aes(x)) + 
  stat_function(fun = qlogis, n = 501) + 
  xlab(expression(pi)) + 
  ylab(expression(logit(pi))) + 
  ggtitle("logit function")+
  theme_bw()
p2<- ggplot(data.frame(x = c(-5, 5)), aes(x)) + 
  stat_function(fun = plogis, n = 501) + 
  ylab(expression(pi)) +
  xlab(expression(logit(pi)))+
  ggtitle("logit function")+
  theme_bw()
ggarrange(p1, p2, nrow = 1)

1.2 Logistic regression model and assumptions
  • Consider the logistic regression model \(Y\) with covariates \(X_i, \ldots, X_p\):

\[ \log(\frac{\pi_i}{1-\pi_i}) = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_p X_{ip} \]

  • logistic regression coefficients is interpreted as log odds ratio!

  • Furthermore, we can recalculate probability using the inverse logit function (aka, expit function)

\[\pi_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \ldots + \beta_1 X_{ip})}{1+\exp(\beta_0 + \beta_1 X_{i1} + \ldots + \beta_p X_{ip})}\]

  • Assumptions of logistic regression models
    1. Independent observations. This assumption can be modified if working with clustered data.
    2. Linear relationship between continuous covariates and log-odds We can check this assumption by examining marginal plots comparing the model predicted relationship between outcome (log-odds or logit-probability) and each continuous covariates.
    3. Multicollinearity Multicollinearity is the phenomenon when a number of the explanatory variables are strongly correlated.
    4. Correctly specified regression model This means that all relevant predictors for the response variable have been included in the model. This is often difficult to verify, but we can use posterior predictive distribution to check for regression fit. We can also use WAIC and LOO to compare different models.
  • Comparing to linear regression model we no-longer require the residuals to be normally distributed and homoscedasticity.

1.3 Working example: diagnostic Sensitivity analysis

dat <- read.table("data/COVIDtest.txt", 
                  sep = "",
                  header=T)
dat <- dat %>% 
  mutate(r = TP,
         n = TP + FN,
         age = round(rnorm(10, 70*sqrt(TP/(TP + FN)), 8)))


dat %>% 
  select(Study, r, n, age) %>%
  datatable(
  rownames = T,
  class = "compact",
  options = list(
    dom = 't',
    ordering = FALSE,
    paging = FALSE,
    searching = FALSE,
    columnDefs = list(list(className = 'dt-center', 
                      targets = 0:3))))

Unpooled model

  • recall from last week, We can assume a binomial likelihood for each study (unpooled),

\[r_i \sim Bin(p_i, n_i) \]

  • we can assume that the 10 true sensitivities \(p_i\) are independent by fitting different prior to each one: \[p_i \sim Beta(1,1)\]

  • we will obtain 10 separate posteriors \(P(p_i \mid r_i, n_i)\)

    • For example, we do not assume that \(p_1\) tells us anything about the value of \(p_2\)

Pooled

  • Suppose we wanted to estimate the overall sensitivity of the CLIA test
    • We do not want 10 separate estimates but a single estimate
  • If we thought that the studies all had the same true sensitivity p0, we could fit a model like this:
  1. Assume a binomial likelihood for each study

\[r_i \sim Bin(p_i, n_i) \]

  1. Assume that all studies have the same true sensitivity, \(p_0\): \[ p_i = p_0\]

  2. Put a prior on \(p_0\): \(p_0 \sim Beta(1,1)\)

Different prior on \(p_0\)

  • The beta prior is useful for models with little additional structure

    • e.g., estimating a single proportion; estimating a collection of proportions
  • It is difficult to use it to a model, where we want to include predictors of \(p_i\)

  • For most modelling approaches to binary data, we will use

    • the logit transformation of \(p_i\)
    • normal priors for parameters on this logit scale (e.g., normal prior for log-odds)

Logistic models

  • We still have the same likelihood \(r_i \sim Bin(p_i, n_i)\)

  • We define a new parameter using the logit transformation function:

\[logit(p_i) = log(\frac{p_i}{1-p_i}) = \alpha_i\] \[p_i = \frac{\exp(\alpha_i)}{1+\exp(\alpha_i)}\]

  • \(0<p_i<1 \Rightarrow -\infty < \alpha_i < \infty\)

  • Thus, it’s quite reasonable to use normal prior to characterize \(\alpha_i\)

Model 1. Pooled model with non-informative prior

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \alpha\] \[\alpha \sim N(0, \sigma = 5)\]

fit1 <- brm(r | trials(n) ~ 1,
  data = dat,
  prior = prior(normal(0,5), class=Intercept),
  family = binomial,
  sample_prior = T, #asking brms to generate prior sample!
  seed = 123)
saveRDS(fit1, "data/chp5_binary1")
fit1 <- readRDS("data/chp5_binary1")
print(fit1)
 Family: binomial 
  Links: mu = logit 
Formula: r | trials(n) ~ 1 
   Data: dat (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.33      0.06     1.21     1.46 1.01     1425     1875

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Model 2. Unpooled model with non-informative prior

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \beta_i \ \text{Study}_i\]

\[\beta_i \sim N(0, \sigma = 5)\]

fit2 <- brm(r | trials(n) ~ 0+Study,
  data = dat,
  prior =  c(prior(normal(0, 5), class = b)),
  family = binomial,
  seed = 123)
saveRDS(fit2, "data/chp5_binary2")
fit2 <- readRDS("data/chp5_binary2")
print(fit2)
 Family: binomial 
  Links: mu = logit 
Formula: r | trials(n) ~ 0 + Study 
   Data: dat (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
StudyCai           0.29      0.12     0.06     0.53 1.00     7694     2738
StudyInfantino     0.97      0.29     0.41     1.54 1.00     7996     2987
StudyJin          -0.07      0.38    -0.84     0.68 1.00     7548     2814
StudyLin           1.56      0.29     1.02     2.13 1.01     7693     3507
StudyLou           1.86      0.32     1.26     2.53 1.00     6928     2783
StudyMa            3.45      0.39     2.76     4.27 1.00     7110     3015
StudyQian          1.81      0.13     1.56     2.07 1.00     7295     2846
StudyXie           2.99      1.12     1.20     5.54 1.00     6428     2723
StudyYangchun      0.86      0.15     0.57     1.16 1.00     8293     2770
StudyZhong         4.13      1.13     2.37     6.87 1.00     5127     1952

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Model 3. Partial pooling model with non-informative prior

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \alpha + b_i\] \[\alpha \sim N(0, \sigma = 5)\] \[b_i \sim N(0, \sigma )\] \[\sigma \sim half-t(0,10, df=3)\]

fit3 <- brm(r | trials(n) ~ 1+(1|Study),
  data = dat,
  prior =  c(prior(normal(0, 5), class = Intercept),
             prior(student_t(3, 0, 10), class = sd)),
  family = binomial,
  seed = 123)
saveRDS(fit3, "data/chp5_binary3")
fit3 <- readRDS("data/chp5_binary3")
print(fit3)
 Family: binomial 
  Links: mu = logit 
Formula: r | trials(n) ~ 1 + (1 | Study) 
   Data: dat (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~Study (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.38      0.45     0.77     2.47 1.00      725     1311

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.62      0.45     0.74     2.59 1.01      698     1030

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# comparing models;
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo3 <- loo(fit3)
loo_compare(loo1, loo2, loo3)
     elpd_diff se_diff
fit2   0.0       0.0  
fit3  -1.1       0.6  
fit1 -96.7      45.0  
# define expit function; 
# transform model estimated log odds to probability scale;
expit <- function(x){exp(x)/(1+exp(x))}

## 1.  Create one row per Study with n = 1 so that
##     the expected number of successes equals the success-probability
newdat <- dat %>% 
  distinct(Study) %>% 
  mutate(n = 1)      # one Bernoulli trial

# 2. Draw posterior expectations that INCLUDE the random intercept
draws <- posterior_epred(
  fit3,
  newdata    = newdat,
  re_formula = NULL   # keep group-level effects
)                   

# summarize posterior probability;
prob_summaries <- t(apply(draws, 2, quantile, probs = c(.025, .5, .975))) %>% 
  as.data.frame() %>% 
  setNames(c("median", "lower_95",  "upper_95")) %>% 
  mutate(Study = newdat$Study)

prob_summaries
      median  lower_95  upper_95     Study
1  0.7372276 0.8262236 0.8970372       Lin
2  0.9338387 0.9640941 0.9834530        Ma
3  0.5183541 0.5753933 0.6332119       Cai
4  0.6106512 0.7307572 0.8314371 Infantino
5  0.8869331 0.9625789 0.9934809     Zhong
6  0.3354143 0.5192255 0.6983388       Jin
7  0.7578684 0.9154367 0.9854626       Xie
8  0.6438882 0.7064076 0.7646523  Yangchun
9  0.8266485 0.8589106 0.8881886      Qian
10 0.7807128 0.8639924 0.9254228       Lou

both model 2 and model 3 fit the data well. There is no substantial difference in model fit using LOO between model 2 and 3. If we are to select the best model, we will proceed with model 3.

Generalized linear models

We can specify the type of model by two features - Family: type of outcome (binomial, normal, Poisson, gamma) - Link: relationship between mean parameter and predictor (identity, logic, log)

  • we have models in the form

\[ Y_i \sim Dist(\mu_i, \tau)\] \[g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_1 X_{ip}\]

  • \(g()\) is called the link function

  • Distributions family (exponential family) can be specified as Normal, Poisson, Binomial, Bernoulli, etc.

  • Usually, we use the logit for binary data and gets odds ratios (logistic model)

    • If we use the log link, we can estimate relative risks
Sample from priors in brms
  • We can specify in brms to generate samples from our specified prior as part of the output

  • we can use these prior samples

    1. to compare with the posteriors to see how much the prior is updated by the data
    2. to check what information is in the prior
  • using the pooled model as an example

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \alpha\] \[\alpha \sim N(0, \sigma = 5)\]

fit1 <- brm(r | trials(n) ~ 1,
  data = dat,
  prior = prior(normal(0,5), class=Intercept),
  family = binomial,
  sample_prior = T, #asking brms to generate prior sample!
  seed = 123)
  • checking posterior results
    • computing posterior predicted probability of event (in this case the sensitivity of the CLIA test)
#posterior draws of log-odds estimate - the intercept;
posterior <- posterior_samples(fit1)

#posterior predicted probability of event;
prob.pooled <- expit(posterior$b_Intercept)

mean(prob.pooled)
[1] 0.7914033
quantile(prob.pooled, c(0.025, 0.975))
     2.5%     97.5% 
0.7711078 0.8119101 
  • comparing prior and posterior distribution visually
data.frame(Probability = c(inv_logit_scaled(posterior[,"b_Intercept"]),
                inv_logit_scaled(posterior[,"prior_Intercept"])),
           Type=rep(c("Posterior","Prior"),each=nrow(posterior))) %>%
  ggplot(aes(x=Probability,fill=Type))+
  geom_density(size=1,alpha=0.5)+
  labs(title="Prior: alpha ~ N(0,5); probability = exp(alpha)/(1+exp(alpha))")+
  theme_bw()+
  scale_fill_manual(values=c("goldenrod","steelblue"))

2. Logistic regression with predictors

  • Adding one predictor, age, for each study in modelling CLIA sensitivity

  • We have the following Bayesian model

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \alpha + \beta \ age_i\] \[\alpha \sim N(0, \sigma = 5)\] \[\beta \sim N(0, \sigma = 5)\]

  • \(\beta\) is interpreted as the changes on the log-OR associate 1 unit increase in age.
fit4 <- brm(r | trials(n) ~ age,
  data = dat,
  prior = c(prior(normal(0, 5), class = Intercept),
             prior(normal(0, 5), class = b)),
  family = binomial,
  cores = 4,
  seed = 123)
saveRDS(fit4, "data/chp5_binary4")
  • checking posterior results
    • computing posterior predicted probability of event (in this case the sensitivity of the CLIA test)
fit4 <- readRDS("data/chp5_binary4")

#posterior draws of log-odds estimate - the intercept;
posterior <- posterior_samples(fit4)

# beta posterior summary
exp(mean(posterior$b_age))
[1] 1.070023
exp(quantile(posterior$b_age, c(0.025, 0.975)))
    2.5%    97.5% 
1.057597 1.083063 
#posterior predicted probability testing positive for each study;
pp<-posterior_predict(fit4)
mean.pp<-colMeans(pp)/dat$n
lci.pp<- apply(pp, 2, function(x) quantile(x, 0.025))/dat$n
uci.pp<- apply(pp, 2, function(x) quantile(x, 0.975))/dat$n

pp.table <- data.frame(Study = dat$Study, 
                       r = dat$r, 
                       n = dat$n,
                       Prob_obs = round(dat$r/dat$n,2),
                       Prob_post = round(mean.pp,2),
                       CI_post = paste0("(",round(lci.pp,2),",",round(uci.pp,2),")"))

pp.table %>% 
  datatable(
  rownames = F,
  class = "compact",
  options = list(
    dom = 't',
    ordering = FALSE,
    paging = FALSE,
    searching = FALSE,
    columnDefs = list(list(className = 'dt-center', 
                      targets = 0:5))))

2.1 Centring continuous variable

  • our model says that sensitivity is different across studies according to the average age

\[P(\text{test positive}) = \frac{\exp(\alpha + \beta \ age)}{1+ \exp(\alpha + \beta \ age)}\]

  • We could obtain a prediction at the average of the ages mean(dat$age) across the studies

\[P(\text{test positive}) = \frac{\exp(\alpha + \beta \times mean( age))}{1+ \exp(\alpha + \beta \times mean( age))}\]

  • We could “centre” age at its average so that the intercept refers to the probability at the average age

\[r_i \mid p_i, n_i \sim Bin(p_i, n_i)\] \[log(\frac{p_i}{1-p_i}) = \alpha + \beta (\ age_i - mean(age))\] \[\alpha \sim N(0, \sigma = 5)\] \[\beta \sim N(0, \sigma = 5)\]

  • Using a centred predictor also reduces correlation between intercept and slope. This can speed up MCMC convergence in some models (computationally more efficient).
#centring variable age;
dat$age.c <- dat$age - mean(dat$age)

fit4c <- brm(r | trials(n) ~ age.c,
            data=dat,
            prior = c(prior(normal(0, 5), class = Intercept),
                      prior(normal(0, 5), class = b)),
            family = binomial,
            cores = 4,
            seed = 123)

saveRDS(fit4c, "data/chp5_binary4c")
  • comparing posterior distribution between centred and uncentred models
fit4c <- readRDS("data/chp5_binary4c")

#posterior draws of log-odds estimate - the intercept;
posteriorc <- posterior_samples(fit4c)

# beta posterior summary
exp(mean(posteriorc$b_age))
[1] 1.069927
exp(quantile(posteriorc$b_age, c(0.025, 0.975)))
    2.5%    97.5% 
1.057407 1.082997 
  • posterior predicted probability testing positive for each study comparing between the centred and uncentred models

  • we can see the results are almost identical and we achieved this with only 5000 iterations for the centred model.

#posterior predicted probability testing positive for each study;
ppc<-posterior_predict(fit4c)

pp.table <- data.frame(pp.table[,c(1,4:6)],
                       c_Prob_post = round(colMeans(ppc)/dat$n,2),
                       c_CI_post = paste0("(",round(apply(ppc, 2, function(x) quantile(x, 0.025))/dat$n,2),",",round(apply(ppc, 2, function(x) quantile(x, 0.975))/dat$n,2),")"))

pp.table %>% 
  datatable(
  rownames = F,
  class = "compact",
  options = list(
    dom = 't',
    ordering = FALSE,
    paging = FALSE,
    searching = FALSE,
    columnDefs = list(list(className = 'dt-center', 
                      targets = 0:5))))
  • We can investigate the correlation between two posterior regression parameters, \(\alpha\) (intercept) and \(\beta\) (log-OR for age).

  • The two parameters should be uncorrelated by model assumption! If you see visible correlation, you can consider thinning your MCMC or run more iterations.

p1 <- ggplot(posterior, 
             aes(b_Intercept, b_age))+
  geom_point(alpha=.1)+
  theme_bw()+
  ggtitle("Uncentred age")

p2 <- ggplot(posteriorc, aes(b_Intercept, b_age.c))+
  geom_point(alpha=0.25)+
  theme_bw()+
  ggtitle("Centred age")

ggarrange(p1,p2, nrow=1)

2.2 Model diagnostics

  1. Posterior predictive graphic check for model fit
p1<-pp_check(fit4c, ndraws = 50)
p2<-pp_check(fit4c, type = "stat_2d", stat = c("max", "min"))
ggarrange(p1,p2, nrow = 1)

3 Additional topics on Bayesian logistic regression models

In this example, we will conduct an analysis using a Bayesian logistic regression model.

  • Data come from a cross-sectional study of 1,225 smokers over the age of 40.

  • Each participant was assessed for chronic obstructive pulmonary disease (COPD), and characteristics of the type of cigarette they most frequently smoke were recorded.

  • The objective of the study was to identify associations between COPD diagnosis and cigarette characteristics

We will use the following variables from this data set:

  • TYPE: Type of cigarette, 1 = Menthol, 0 = Regular

  • NIC: Nicotine content, in mg

  • TAR: Tar content, in mg

  • LEN: Length of cigarette, in mm

  • FLTR: 1 = Filter, 0 = No Vlter

  • copd: 1 = COPD diagnosis, 0 = no COPD diagnosis

  • We fit a simple logistic regression, predicting the risk of Copd by Type of cigarette, Nicotine content, and Filter status.

  • We specify uninformative prior on log.odds, \(N(0,10)\)

m1_bern <- brm(copd ~ TYPE + NIC + FLTR, 
               data = copd, 
               family = bernoulli(link = "logit"), 
               prior = c(prior(normal(0,10), class = "b"), 
                         prior(normal(0,10), class = "Intercept")),
               seed = 123, silent = 2, refresh = 0)

saveRDS(m1_bern, file="data/chp6_logistic")
m1_bern <- readRDS("data/chp6_logistic")
summary(m1_bern)
 Family: bernoulli 
  Links: mu = logit 
Formula: copd ~ TYPE + NIC + FLTR 
   Data: copd (Number of observations: 1225) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -2.94      0.64    -4.20    -1.71 1.00     3108     2483
TYPE          0.25      0.21    -0.19     0.66 1.00     3480     2501
NIC           1.30      0.35     0.61     1.99 1.00     3306     2723
FLTR         -0.60      0.44    -1.42     0.29 1.00     3344     2878

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Directly from the posterior of exp(beta):
draws_beta_TYPE <- as.matrix(m1_bern, pars = "b_TYPE")
exp_beta_TYPE <- exp(draws_beta_TYPE)
quantile(exp_beta_TYPE, probs = c(.025, 0.5, .975))
     2.5%       50%     97.5% 
0.8286483 1.2832394 1.9293951 
plot(m1_bern)

Posterior Distribution + Trace plot combining all chains
  • Good mixing from trace plot
  • Rhats’ are all less then 1.01, convergence!
  • Other diagnostic plots below
mcmc_dens_overlay(m1_bern, pars = c("b_Intercept", "b_TYPE", "b_NIC", "b_FLTR"))

Posterior density plot by chain
mcmc_acf_bar(m1_bern, pars = c("b_Intercept", "b_TYPE", "b_NIC", "b_FLTR"))

Posterior density plot by chain

3.1 Regression regularization and shrinkage priors

  • When the number of parameters to be estimated is large relative to sample size, estimation using non-informative or weakly informative priors tend to overfit.

  • One way to avoid overfitting is to perform regularization, that is, to shrink some of the parameters to closer to zero. This makes the model fit less well to the existing data, but will be much more generalizable to an independent data set.

  • Sparsity-Inducing Priors

  • The horseshoe prior is a type of hierarchical prior for regression models by introducing a global scale and local scale parameters on the priors for the regression coefficients.

  • Specifically, with \(p\) predictors

\[Y_i \sim N(\mu_i, \sigma^2)\] \[\mu_i = \beta_0 + \sum_{m=1}^p\beta_m X_m\] \[\beta_0 \sim N(0,10)\] \[\beta_m \sim N(0,\tau\lambda_m)\] \[\lambda_m \sim Cauchy(10)\] \[\tau \sim Cauchy(\tau_0)\]

  • local scale \(\lambda_m\) can flexibly shrink the coefficient to close to zero.
dhs <- Vectorize(
  function(y, df = 1) {
  ff <- function(lam) dnorm(y, 0, sd = lam) * dt(lam, df) * 2
  if (y != 0) integrate(ff, lower = 0, upper = Inf)$value
  else Inf
  }
)
ggplot(data.frame(x = c(-6, 6)), aes(x = x)) +
  stat_function(fun = dhs, args = list(df = 1), n = 501,
                aes(col = "HS"), linetype = 1) +
  stat_function(fun = dnorm, n = 501, 
                aes(col = "norm"), linetype = 2) +
  scale_color_manual("", values = c("steelblue", "black"),
                     labels = c("horseshoe(1)", "N(0, 1)")) +
    xlab("y") + ylab("density") + ylim(0, 0.75)+
  theme_bw()

3.2 COPE example revisit

  • We fit a simple logistic regression, predicting the risk of copd by NIC, TYPE, FLTR, as well as all interaction terms between the three.
copd2 <- copd %>% select("NIC","TYPE","FLTR","copd")
fit4.1 <- brm(copd ~ (.)^2, 
               data = copd2, 
               family = bernoulli(link = "logit"), 
               prior = c(
                 prior(normal(0,10), class = "Intercept"),
                 # Prior guess of 25% of the terms are non-zero
                 prior(horseshoe(par_ratio = 2 / 8), class = "b")),
            iter = 7500,
            warmup = 5000,
            chains = 4,
            cores = getOption("mc.cores", 5), 
            seed = 123, 
            silent = 2, 
            refresh = 0)
fit4.2 <- brm(copd ~ (.)^2, 
               data = copd2, 
               family = bernoulli(link = "logit"), 
               prior = c(
                 prior(normal(0,10), class = "Intercept"),
                 prior(normal(0,10), class = "b")),
            iter = 7500,
            warmup = 5000,
            chains = 4,
            cores = getOption("mc.cores", 5), 
            seed = 123, 
            silent = 2, 
            refresh = 0)
saveRDS(fit4.1, file="data/chap7_h_example4.1")
saveRDS(fit4.2, file="data/chap7_h_example4.2")
fit4.1 <- readRDS("data/chap7_h_example4.1")
fit4.2 <- readRDS("data/chap7_h_example4.2")

summary(fit4.2)
 Family: bernoulli 
  Links: mu = logit 
Formula: copd ~ (NIC + TYPE + FLTR)^2 
   Data: copd2 (Number of observations: 1225) 
  Draws: 4 chains, each with iter = 7500; warmup = 5000; thin = 1;
         total post-warmup draws = 10000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -4.14      1.64    -7.70    -1.24 1.00     4232     4591
NIC           2.10      1.06     0.24     4.40 1.00     4173     4468
TYPE          0.36      7.00   -13.20    14.35 1.00     5180     4838
FLTR          0.62      1.71    -2.44     4.27 1.00     4082     4299
NIC:TYPE     -0.18      0.78    -1.72     1.39 1.00     6997     6004
NIC:FLTR     -0.82      1.17    -3.30     1.32 1.00     4112     4156
TYPE:FLTR     0.04      6.99   -13.85    13.62 1.00     5163     4770

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
summary(fit4.1)
 Family: bernoulli 
  Links: mu = logit 
Formula: copd ~ (NIC + TYPE + FLTR)^2 
   Data: copd2 (Number of observations: 1225) 
  Draws: 4 chains, each with iter = 7500; warmup = 5000; thin = 1;
         total post-warmup draws = 10000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -3.35      0.48    -4.13    -2.19 1.00     3095     4336
NIC           1.35      0.37     0.58     2.02 1.00     3972     4408
TYPE         -0.02      0.39    -1.34     0.46 1.03      174       28
FLTR         -0.10      0.29    -0.97     0.18 1.00     2445     1961
NIC:TYPE      0.01      0.15    -0.30     0.35 1.01      988      399
NIC:FLTR     -0.04      0.17    -0.50     0.18 1.01      711     5295
TYPE:FLTR     0.09      0.43    -0.22     1.55 1.03      175       29

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
  • Using the informative data-driven (variable selection) shrinkage priors, we can see that TYPE and NIC:TYPE as minimal effect on the risk of COPD.
# comparing models;
loo1 <- loo(fit4.1)
loo2 <- loo(fit4.2)
loo_compare(loo1, loo2)
       elpd_diff se_diff
fit4.1  0.0       0.0   
fit4.2 -1.9       1.3   

4. Bayesian Variable Selection for Logistic Regression

4.1 Why variable selection matters in logistic regression

In clinical and epidemiological research, we often have many candidate predictors, but only a subset are meaningfully associated with the outcome. The challenges for logistic regression in particular are:

  • Binary outcomes carry less information per observation than continuous ones, so overfitting is more likely with many predictors.
  • Standard non-informative priors do not prevent the model from assigning non-negligible probability mass to implausibly large log-odds ratios for noise predictors.
  • We want a model that is both interpretable and predictively sharp.

Two complementary Bayesian strategies are:

  1. Shrinkage priors (horseshoe) - fit all variables simultaneously and let the prior pull small effects toward zero.
  2. Projection predictive variable selection (projpred) - start from a well-regularized reference model and find the minimal subset of predictors that preserves its predictive performance.

4.2 Prior visualization for shrinkage on the log-odds scale

Before fitting, it is good practice to visualize what your shrinkage prior implies for the regression coefficients. A \(N(0, 10)\) prior places substantial mass far beyond that range and provides little regularization. The horseshoe prior, by contrast, keeps most mass near zero while retaining heavy tails.

x_grid <- seq(-6, 6, by = 0.01)

# Horseshoe marginal density (numerically integrated)
dhs <- Vectorize(function(y, df = 1) {
  ff <- function(lam) dnorm(y, 0, sd = lam) * dt(lam, df) * 2
  if (y != 0) integrate(ff, lower = 0, upper = Inf)$value else Inf
})

tibble(
  x       = rep(x_grid, 3),
  density = c(dnorm(x_grid, 0, 10),
              dnorm(x_grid, 0, 1),
              dhs(x_grid, df = 1)),
  Prior   = rep(c("N(0, 10) - diffuse",
                  "N(0, 1) - weakly informative",
                  "Horseshoe(1) - shrinkage"), each = length(x_grid))
) %>%
  ggplot(aes(x = x, y = density, colour = Prior)) +
  geom_line(linewidth = 1) +
  coord_cartesian(ylim = c(0, 0.6)) +
  labs(x = expression(beta ~ "(log-odds)"), y = "Density",
       title = "Prior comparison on the log-odds scale",
       colour = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

Interpreting prior choices on the log-odds scale
Prior Implied OR at \(\pm 2\) Character
\(N(0, 10)\) OR \(\approx\) 7.4 at 95th pct Very diffuse; little regularization
\(N(0, 1)\) 95% mass between \(\approx\) 0.14 and 7.4 Weakly informative
Horseshoe Heavy spike at 0, heavy tails Aggressive shrinkage of small effects

For most applied logistic regression analyses, \(N(0, 2.5)\) for slopes (Gelman recommendation) or the horseshoe are sensible defaults when variable selection is the goal.

4.3 Simulate a logistic data set with many predictors

We use the mlbench package’s Pima Indians diabetes data, augmented with noise predictors, to illustrate variable selection.

library(mlbench)
data(PimaIndiansDiabetes2)

# Use complete cases; recode outcome as 0/1
pima <- PimaIndiansDiabetes2 %>%
  drop_na() %>%
  mutate(
    diabetes_bin = as.integer(diabetes == "pos"),
    # Standardise continuous predictors for prior interpretability
    across(c(glucose, pressure, triceps, insulin, mass, pedigree, age),
           ~ scale(.)[, 1],
           .names = "{.col}_s")
  )

# Add 5 pure noise predictors
set.seed(42)
noise_mat <- matrix(rnorm(nrow(pima) * 5), ncol = 5,
                    dimnames = list(NULL, paste0("noise", 1:5)))
pima <- bind_cols(pima, as_tibble(noise_mat))

glimpse(pima %>% select(diabetes_bin, ends_with("_s"), starts_with("noise")))
Rows: 392
Columns: 13
$ diabetes_bin <int> 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, …
$ glucose_s    <dbl> -1.08965329, 0.46571891, -1.44609275, 2.40993414, 2.15070…
$ pressure_s   <dbl> -0.37317791, -2.45382847, -1.65357826, -0.05307782, -0.85…
$ triceps_s    <dbl> -0.58436292, 0.55670938, 0.27144131, 1.50760297, -0.58436…
$ insulin_s    <dbl> -0.5221747, 0.1005024, -0.5726620, 3.2559608, 5.8055711, …
$ mass_s       <dbl> -0.70951427, 1.42490909, -0.29685909, -0.36800653, -0.424…
$ pedigree_s   <dbl> -1.03055931, 5.10858225, -0.79610836, -1.05660941, -0.361…
$ age_s        <dbl> -0.96706323, 0.20931780, -0.47690447, 2.16995285, 2.75814…
$ noise1       <dbl> 1.37095845, -0.56469817, 0.36312841, 0.63286260, 0.404268…
$ noise2       <dbl> 0.49834869, -0.54953692, -0.27925650, 1.09651344, 0.44201…
$ noise3       <dbl> 0.76586658, -0.58809786, -0.66029583, 0.11301352, -0.3203…
$ noise4       <dbl> 0.44915843, 1.82762766, -1.11202371, 1.47480743, 0.671512…
$ noise5       <dbl> -0.07568379, 0.91774846, -0.34477871, -1.07700266, 1.0016…

4.4 Fit reference model with horseshoe prior

preds <- c("glucose_s", "pressure_s", "triceps_s", "insulin_s",
           "mass_s", "pedigree_s", "age_s",
           "noise1", "noise2", "noise3", "noise4", "noise5")

formula_vs <- as.formula(
  paste("diabetes_bin ~", paste(preds, collapse = " + "))
)

# Horseshoe: par_ratio = expected non-zero / total predictors
# We guess 5-7 of 12 are truly non-zero
fit_hs_logistic <- brm(
  formula_vs,
  data    = pima,
  family  = bernoulli(link = "logit"),
  prior   = c(
    prior(normal(0, 2),               class = "Intercept"),
    prior(horseshoe(par_ratio = 6/12), class = "b")
  ),
  iter    = 6000,
  warmup  = 3000,
  chains  = 4,
  cores   = 4,
  seed    = 123,
  silent  = 2,
  refresh = 0
)

fit_wide_logistic <- brm(
  formula_vs,
  data    = pima,
  family  = bernoulli(link = "logit"),
  prior   = c(
    prior(normal(0, 2),  class = "Intercept"),
    prior(normal(0, 10), class = "b")
  ),
  iter    = 6000,
  warmup  = 3000,
  chains  = 4,
  cores   = 4,
  seed    = 123,
  silent  = 2,
  refresh = 0
)

saveRDS(fit_hs_logistic,   "data/chp8_vs_hs_logistic")
saveRDS(fit_wide_logistic, "data/chp8_vs_wide_logistic")

4.5 Compare shrinkage vs wide prior: posterior coefficient plot

fit_hs_logistic   <- readRDS("data/chp8_vs_hs_logistic")
fit_wide_logistic <- readRDS("data/chp8_vs_wide_logistic")

coef_hs <- as.data.frame(fixef(fit_hs_logistic)) %>%
  rownames_to_column("term") %>%
  filter(term != "Intercept") %>%
  transmute(term, estimate = Estimate, lower = Q2.5, upper = Q97.5,
            model = "Horseshoe prior")

coef_wide <- as.data.frame(fixef(fit_wide_logistic)) %>%
  rownames_to_column("term") %>%
  filter(term != "Intercept") %>%
  transmute(term, estimate = Estimate, lower = Q2.5, upper = Q97.5,
            model = "Wide N(0,10) prior")

bind_rows(coef_hs, coef_wide) %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = estimate, y = term, xmin = lower, xmax = upper,
             colour = model)) +
  geom_point(position = position_dodge(width = 0.6), size = 2) +
  geom_errorbarh(position = position_dodge(width = 0.6), height = 0) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  labs(x = "Posterior log-OR estimate", y = NULL, colour = NULL,
       title = "Horseshoe vs wide prior - coefficient comparison") +
  theme_bw() +
  theme(legend.position = "bottom")

  • Notice how the horseshoe shrinks the noise predictors aggressively toward zero, while the wide prior leaves them spread out.

4.6 Projection predictive variable selection

The projpred package (Piironen et al. 2025) takes the full horseshoe posterior as a reference model and finds the minimal variable subset that preserves predictive performance. This avoids re-fitting many sub-models and propagates posterior uncertainty correctly.

library(projpred)

# Step 1: build the reference model object
refm_logistic <- get_refmodel(fit_hs_logistic)

# Step 2: cross-validated variable selection
# (set eval=FALSE in class; run once and save)
# cv_fits_logistic <- run_cvfun(refm_logistic, k = 5)
# cvvs_logistic <- cv_varsel(
#   refm_logistic,
#   cv_method    = "kfold",
#   cvfits       = cv_fits_logistic,
#   method       = "L1",
#   nterms_max   = 8,
#   verbose      = 0
# )
# saveRDS(cvvs_logistic, "data/chp8_cvvs_logistic")

# cvvs_logistic <- readRDS("data/chp8_cvvs_logistic")

# Step 3: plot ELPD by model size
# png("data/chp8_cvvs_mlpd.png", width = 800, height = 500, res = 120)
# options(projpred.plot_vsel_show_cv_proportions = TRUE)
# plot(cvvs_logistic, stats = "mlpd", deltas = TRUE)
# dev.off()

# Save suggested size and ranking as RDS (tiny objects)
# saveRDS(suggest_size(cvvs_logistic, stat = "mlpd"), "data/chp8_suggest_size.rds")
# saveRDS(ranking(cvvs_logistic), "data/chp8_ranking.rds")

# Load pre-saved outputs - no need to load cvvs_logistic
knitr::include_graphics("data/chp8_cvvs_mlpd.png")

suggest_size <- readRDS("data/chp8_suggest_size.rds")
ranking_out  <- readRDS("data/chp8_ranking.rds")

suggest_size
[1] 5
ranking_out
$fulldata
[1] "glucose_s"  "age_s"      "mass_s"     "triceps_s"  "pedigree_s"
[6] "noise5"     "noise1"     "pressure_s"

$foldwise
     [,1]        [,2]     [,3]        [,4]         [,5]         [,6]        
[1,] "glucose_s" "age_s"  "mass_s"    "triceps_s"  "pedigree_s" "noise1"    
[2,] "glucose_s" "age_s"  "mass_s"    "pedigree_s" "triceps_s"  "pressure_s"
[3,] "glucose_s" "age_s"  "mass_s"    "pedigree_s" "triceps_s"  "noise5"    
[4,] "glucose_s" "age_s"  "triceps_s" "mass_s"     "pedigree_s" "insulin_s" 
[5,] "glucose_s" "mass_s" "age_s"     "triceps_s"  "pedigree_s" "noise5"    
     [,7]         [,8]        
[1,] "noise5"     "pressure_s"
[2,] "noise1"     "noise5"    
[3,] "noise1"     "noise3"    
[4,] "pressure_s" "noise1"    
[5,] "noise1"     "noise2"    

attr(,"class")
[1] "ranking"
Interpreting projpred output

The plot shows how much predictive performance (in mean log predictive density, MLPD) is lost as we remove predictors one by one. A model is considered adequate when the MLPD is within one standard error of the full reference model. suggest_size() returns the smallest model satisfying this criterion.

4.7 Posterior inclusion probabilities via hypothesis()

As a complementary summary, we can use hypothesis() to compute the posterior probability that each log-OR exceeds zero (i.e., a positive effect):

# Get all b_ parameters except intercept
param_names <- variables(fit_hs_logistic)
param_names <- param_names[grepl("^b_", param_names)]
param_names <- param_names[!grepl("Intercept", param_names)]

# Strip the b_ prefix for hypothesis()
param_names_clean <- gsub("^b_", "", param_names)

# Wrap in backticks to handle any special characters
h_pos <- paste0("`", param_names_clean, "` > 0")

hyp_out <- hypothesis(fit_hs_logistic, h_pos)
hyp_out$hypothesis %>%
  arrange(desc(Post.Prob)) %>%
  mutate(across(where(is.numeric), \(x) round(x, 3))) %>%
  datatable(rownames = FALSE,
            options = list(dom = 't', pageLength = 15,
                           columnDefs = list(
                             list(className = 'dt-center', targets = 0:5))))

5. Bayesian Logistic Regression - Predictive Modelling

5.1 Motivation

Fitting a logistic regression model to estimate associations is not the same as building a predictive model. When the goal is prediction, we care about:

  • Discrimination: can the model rank high-risk individuals above low-risk ones? (AUC / c-statistic)
  • Calibration: do predicted probabilities agree with observed event rates?
  • Overall accuracy: what is the average prediction error? (Brier score)

In the Bayesian framework, each of these metrics is computed from the posterior predictive distribution, so we obtain a full uncertainty distribution over each metric - not just a point estimate.

5.2 Working data: Pima Indians diabetes (complete cases)

We continue with the pima dataset from Section 4. We split into training and test sets to evaluate out-of-sample performance.

set.seed(2024)
n_total  <- nrow(pima)
train_id <- sample(seq_len(n_total), size = floor(0.75 * n_total))

pima_train <- pima[train_id, ]
pima_test  <- pima[-train_id, ]

cat("Training n =", nrow(pima_train), "| Test n =", nrow(pima_test), "\n")
Training n = 294 | Test n = 98 
cat("Outcome prevalence (train):", round(mean(pima_train$diabetes_bin), 3), "\n")
Outcome prevalence (train): 0.316 
cat("Outcome prevalence (test) :", round(mean(pima_test$diabetes_bin), 3), "\n")
Outcome prevalence (test) : 0.378 

5.3 Prior specification and visualization

We use weakly informative priors. For standardized predictors, \(N(0, 2.5)\) on slopes means that a one-SD change in a predictor shifts log-odds by at most \(\pm 5\) units in 95% of the prior - already quite permissive.

x_grid <- seq(-10, 10, by = 0.01)

p_int <- ggplot(tibble(x = x_grid, y = dnorm(x_grid, 0, 2)), aes(x, y)) +
  geom_line(colour = "steelblue", linewidth = 1) +
  geom_vline(xintercept = c(qnorm(0.025, 0, 2), qnorm(0.975, 0, 2)),
             linetype = "dashed", colour = "grey50") +
  labs(title = "Intercept prior: N(0, 2)",
       x = "log-odds (intercept)", y = "Density") +
  theme_bw()

p_slope <- ggplot(tibble(x = x_grid, y = dnorm(x_grid, 0, 2.5)), aes(x, y)) +
  geom_line(colour = "tomato", linewidth = 1) +
  geom_vline(xintercept = c(qnorm(0.025, 0, 2.5), qnorm(0.975, 0, 2.5)),
             linetype = "dashed", colour = "grey50") +
  annotate("text", x = -8, y = 0.12,
           label = paste0("OR range: [",
                          round(exp(qnorm(0.025, 0, 2.5)), 2), ", ",
                          round(exp(qnorm(0.975, 0, 2.5)), 2), "]"),
           hjust = 0, size = 3) +
  labs(title = "Slope prior: N(0, 2.5)",
       x = expression(beta ~ "(log-OR)"), y = "Density") +
  theme_bw()

ggarrange(p_int, p_slope, nrow = 1)

5.4 Fit the predictive model

preds_pred <- c("glucose_s", "pressure_s", "triceps_s",
                "insulin_s", "mass_s", "pedigree_s", "age_s")

fit_pred <- brm(
  as.formula(paste("diabetes_bin ~", paste(preds_pred, collapse = " + "))),
  data    = pima_train,
  family  = bernoulli(link = "logit"),
  prior   = c(
    prior(normal(0, 2),   class = "Intercept"),
    prior(normal(0, 2.5), class = "b")
  ),
  iter    = 6000,
  warmup  = 3000,
  chains  = 4,
  cores   = 4,
  seed    = 123,
  silent  = 2,
  refresh = 0
)

saveRDS(fit_pred, "data/chp8_fit_pred")
fit_pred <- readRDS("data/chp8_fit_pred")
summary(fit_pred)
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_bin ~ glucose_s + pressure_s + triceps_s + insulin_s + mass_s + pedigree_s + age_s 
   Data: pima_train (Number of observations: 294) 
  Draws: 4 chains, each with iter = 6000; warmup = 3000; thin = 1;
         total post-warmup draws = 12000

Population-Level Effects: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     -1.02      0.17    -1.36    -0.70 1.00    16727     9033
glucose_s      1.17      0.21     0.77     1.59 1.00    13667     9304
pressure_s     0.13      0.18    -0.22     0.47 1.00    15827    10330
triceps_s      0.14      0.21    -0.27     0.56 1.00    12276     9365
insulin_s     -0.04      0.18    -0.40     0.32 1.00    14506     9991
mass_s         0.43      0.23    -0.01     0.90 1.00    10696    10023
pedigree_s     0.42      0.16     0.11     0.75 1.00    16745     8846
age_s          0.54      0.17     0.20     0.88 1.00    14334     9935

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

5.5 Posterior predictive check

Before evaluating external performance, we verify in-sample fit.

pp_check(fit_pred, ndraws = 100, type = "bars") +
  labs(title = "Posterior predictive check - binary outcome") +
  theme_bw()

5.6 Extract posterior predicted probabilities on the test set

# epred gives E[Y | X, theta] integrated over posterior = P(Y=1 | X)
pred_draws <- posterior_epred(
  fit_pred,
  newdata   = pima_test,
  allow_new_levels = TRUE
)
# dim: (n_draws x n_test)

# Posterior mean predicted probability for each test observation
p_hat <- colMeans(pred_draws)   # posterior mean

pima_test <- pima_test %>%
  mutate(p_hat = p_hat)

5.7 Discrimination: AUC / c-statistic

The AUC summarizes the probability that the model assigns a higher predicted risk to a randomly chosen case than to a randomly chosen non-case.

library(pROC)

roc_obj <- roc(pima_test$diabetes_bin, pima_test$p_hat,
               quiet = TRUE)

# Point estimate AUC using posterior mean probabilities
auc_point <- auc(roc_obj)
cat("AUC (posterior mean predictions):", round(auc_point, 3), "\n")
AUC (posterior mean predictions): 0.881 
# Bayesian uncertainty in AUC: compute AUC for each posterior draw
auc_draws <- apply(pred_draws, 1, function(p) {
  tryCatch(
    as.numeric(auc(roc(pima_test$diabetes_bin, p, quiet = TRUE))),
    error = function(e) NA_real_
  )
})

auc_df <- tibble(auc = auc_draws[!is.na(auc_draws)])

ggplot(auc_df, aes(x = auc)) +
  geom_histogram(bins = 60, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = quantile(auc_df$auc, c(0.025, 0.975)),
             linetype = "dashed", colour = "tomato") +
  labs(title = "Posterior distribution of AUC",
       x = "AUC", y = "Count") +
  theme_bw()

quantile(auc_df$auc, c(0.025, 0.5, 0.975))
     2.5%       50%     97.5% 
0.8524590 0.8750554 0.8887904 

5.8 Calibration

Calibration assesses whether predicted probabilities match observed event rates. We use a calibration plot (decile bins) and the Expected Calibration Error (ECE).

# Calibration plot: decile-binned observed vs predicted
n_bins <- 10
cal_df <- pima_test %>%
  mutate(bin = cut(p_hat, breaks = seq(0, 1, by = 1/n_bins),
                   include.lowest = TRUE, labels = FALSE)) %>%
  group_by(bin) %>%
  summarise(
    mean_pred = mean(p_hat),
    obs_rate  = mean(diabetes_bin),
    n         = n(),
    se        = sqrt(obs_rate * (1 - obs_rate) / n)
  )

ggplot(cal_df, aes(x = mean_pred, y = obs_rate)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey50") +
  geom_point(colour = "steelblue") +
  geom_errorbar(aes(ymin = obs_rate - 1.96 * se,
                    ymax = obs_rate + 1.96 * se),
                width = 0.02, colour = "steelblue") +
  # scale_size_continuous(name = "n in bin", range = c(2, 8)) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Calibration plot (decile bins)",
       x = "Mean predicted probability",
       y = "Observed event rate") +
  theme_bw()

5.9 Brier score

The Brier score is the mean squared error between predicted probabilities and observed binary outcomes. Lower is better; a model predicting the prevalence everywhere has a Brier score equal to \(p(1-p)\).

\[\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2\]

# Brier score for each posterior draw
brier_draws <- apply(pred_draws, 1, function(p) {
  mean((p - pima_test$diabetes_bin)^2)
})

# Null (intercept-only) Brier score for reference
prev <- mean(pima_test$diabetes_bin)
brier_null <- prev * (1 - prev)

brier_df <- tibble(brier = brier_draws)

ggplot(brier_df, aes(x = brier)) +
  geom_histogram(bins = 60, fill = "goldenrod", colour = "white") +
  geom_vline(xintercept = quantile(brier_df$brier, c(0.025, 0.975)),
             linetype = "dashed", colour = "tomato") +
  geom_vline(xintercept = brier_null,
             linetype = "dotted", colour = "black", linewidth = 1) +
  annotate("text", x = brier_null + 0.003, y = 80,
           label = paste0("Null Brier = ", round(brier_null, 3)),
           hjust = 0, size = 3) +
  labs(title = "Posterior distribution of Brier score",
       x = "Brier score", y = "Count") +
  theme_bw()

cat("Brier score (posterior mean):", round(mean(brier_draws), 3), "\n")
Brier score (posterior mean): 0.137 
cat("95% CrI: [", round(quantile(brier_draws, 0.025), 3), ",",
    round(quantile(brier_draws, 0.975), 3), "]\n")
95% CrI: [ 0.126 , 0.152 ]

6. Bayesian Fairness Analysis for Logistic Regression

6.1 What is algorithmic fairness?

When logistic regression models inform decisions (e.g., screening, treatment allocation, risk stratification), we should check whether the model performs equally well across demographic subgroups. Algorithmic fairness asks: does the model disadvantage any group relative to another?

There is no single universally agreed definition of fairness. Key metrics capture different normative perspectives and - importantly - they cannot all be satisfied simultaneously when base rates differ across groups (the “impossibility theorems”).

6.2 Fairness metrics: definitions

Let \(A \in \{0, 1\}\) be a protected attribute (e.g., Sex: 0 = female, 1 = male), \(\hat{Y}\) be the model’s prediction (binary, based on a threshold \(\tau\)), and \(Y\) the true outcome.

Metric Definition Interpretation
Demographic parity \(P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)\) Equal positive prediction rates
Equal opportunity \(P(\hat{Y}=1 \mid Y=1, A=0) = P(\hat{Y}=1 \mid Y=1, A=1)\) Equal true positive rates (sensitivity)
Equalized odds Equal TPR and FPR across groups Equal error rates in both directions
Predictive parity \(P(Y=1 \mid \hat{Y}=1, A=0) = P(Y=1 \mid \hat{Y}=1, A=1)\) Equal positive predictive value
Calibration by group \(P(Y=1 \mid \hat{p}=p, A=a) = p\) for all \(a\) Predicted probabilities are accurate within each group
AUC by group \(\text{AUC}_{A=0} = \text{AUC}_{A=1}\) Equal discriminative ability
Why these metrics conflict

If the prevalence of the outcome differs between groups, it is mathematically impossible to achieve demographic parity, equalized odds, and predictive parity simultaneously. The appropriate metric depends on the application and ethical framework.

6.3 Computing fairness metrics from posterior draws

We continue with the Pima dataset and treat Sex (which we simulate here as it is not in the original dataset) as the protected attribute. In your own data, replace sex with the appropriate variable.

# The Pima dataset does not include sex; we simulate a binary sex variable
# for illustration. In practice, use the actual subgroup variable in your data.
set.seed(99)
pima_test <- pima_test %>%
  mutate(sex = rbinom(n(), 1, 0.48))  # 0 = female, 1 = male

cat("Subgroup sizes in test set:\n")
Subgroup sizes in test set:
table(pima_test$sex)

 0  1 
49 49 
cat("\nOutcome prevalence by sex:\n")

Outcome prevalence by sex:
tapply(pima_test$diabetes_bin, pima_test$sex, mean)
        0         1 
0.3673469 0.3877551 
# Decision threshold
tau <- 0.5

# For each posterior draw, compute fairness metrics
fairness_draws <- apply(pred_draws, 1, function(p) {
  df <- pima_test %>%
    mutate(p_pred = p,
           y_hat  = as.integer(p >= tau))

  compute_metrics <- function(d) {
    TP  <- sum(d$y_hat == 1 & d$diabetes_bin == 1)
    FP  <- sum(d$y_hat == 1 & d$diabetes_bin == 0)
    TN  <- sum(d$y_hat == 0 & d$diabetes_bin == 0)
    FN  <- sum(d$y_hat == 0 & d$diabetes_bin == 1)
    TPR <- TP / max(TP + FN, 1)   # sensitivity / recall
    FPR <- FP / max(FP + TN, 1)   # 1 - specificity
    PPV <- TP / max(TP + FP, 1)   # precision
    PR  <- mean(d$y_hat)           # positive rate
    c(TPR = TPR, FPR = FPR, PPV = PPV, PR = PR)
  }

  m0 <- compute_metrics(filter(df, sex == 0))
  m1 <- compute_metrics(filter(df, sex == 1))

  c(
    TPR_diff = m1["TPR"] - m0["TPR"],   # equal opportunity gap
    FPR_diff = m1["FPR"] - m0["FPR"],   # equalized odds (FPR component)
    PPV_diff = m1["PPV"] - m0["PPV"],   # predictive parity gap
    PR_diff  = m1["PR"]  - m0["PR"],    # demographic parity gap
    TPR_f = m0["TPR"], TPR_m = m1["TPR"],
    FPR_f = m0["FPR"], FPR_m = m1["FPR"],
    PPV_f = m0["PPV"], PPV_m = m1["PPV"],
    PR_f  = m0["PR"],  PR_m  = m1["PR"]
  )
})

fairness_df <- as.data.frame(t(fairness_draws))
colnames(fairness_df) <- c("TPR_diff", "FPR_diff", "PPV_diff", "PR_diff",
                            "TPR_f", "TPR_m", "FPR_f", "FPR_m",
                            "PPV_f", "PPV_m", "PR_f",  "PR_m")

6.4 Visualize posterior fairness gaps

gap_vars <- c("TPR_diff", "FPR_diff", "PPV_diff", "PR_diff")
gap_labels <- c("Equal opportunity\n(TPR gap: male ??? female)",
                "Equalized odds\n(FPR gap: male ??? female)",
                "Predictive parity\n(PPV gap: male ??? female)",
                "Demographic parity\n(Positive rate gap: male ??? female)")

gap_long <- fairness_df %>%
  select(all_of(gap_vars)) %>%
  pivot_longer(everything(), names_to = "metric", values_to = "gap") %>%
  mutate(metric = factor(metric, levels = gap_vars, labels = gap_labels))

ggplot(gap_long, aes(x = gap)) +
  geom_histogram(bins = 60, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "black") +
  facet_wrap(~ metric, scales = "free_x", ncol = 2) +
  labs(title = "Posterior distribution of fairness gaps",
       subtitle = paste0("Decision threshold = ", tau),
       x = "Gap", y = "Count") +
  theme_bw()

The discrete/gapped pattern is expected. The fairness metrics (TPR, FPR, PPV, positive rate) are computed from counts within subgroups. With a fixed test set and fixed subgroup sizes, the possible values of these metrics are discrete fractions with a fixed denominator.

6.5 Summary table of fairness metrics by subgroup

# Posterior mean and 95% CrI for each metric, by group
fairness_summary <- bind_rows(
  fairness_df %>%
    summarise(across(c(TPR_f, FPR_f, PPV_f, PR_f,
                       TPR_m, FPR_m, PPV_m, PR_m),
                     list(mean = mean,
                          lo   = ~ quantile(.x, 0.025),
                          hi   = ~ quantile(.x, 0.975))))
)

# Format as readable table
tibble(
  Metric = c("Sensitivity (TPR)", "1-Specificity (FPR)",
             "Precision (PPV)", "Positive rate"),
  Female_mean = c(fairness_df$TPR_f.mean %||% mean(fairness_df$TPR_f),
                  mean(fairness_df$FPR_f),
                  mean(fairness_df$PPV_f),
                  mean(fairness_df$PR_f)),
  Male_mean   = c(mean(fairness_df$TPR_m), mean(fairness_df$FPR_m),
                  mean(fairness_df$PPV_m), mean(fairness_df$PR_m)),
  Gap_mean    = c(mean(fairness_df$TPR_diff), mean(fairness_df$FPR_diff),
                  mean(fairness_df$PPV_diff), mean(fairness_df$PR_diff)),
  Gap_95CrI   = c(
    paste0("[", round(quantile(fairness_df$TPR_diff, 0.025), 3), ", ",
                round(quantile(fairness_df$TPR_diff, 0.975), 3), "]"),
    paste0("[", round(quantile(fairness_df$FPR_diff, 0.025), 3), ", ",
                round(quantile(fairness_df$FPR_diff, 0.975), 3), "]"),
    paste0("[", round(quantile(fairness_df$PPV_diff, 0.025), 3), ", ",
                round(quantile(fairness_df$PPV_diff, 0.975), 3), "]"),
    paste0("[", round(quantile(fairness_df$PR_diff,  0.025), 3), ", ",
                round(quantile(fairness_df$PR_diff,  0.975), 3), "]")
  )
) %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  datatable(rownames = FALSE,
            options = list(dom = 't',
                           columnDefs = list(
                             list(className = 'dt-center', targets = 0:4))))

6.6 AUC by subgroup

idx_f <- which(pima_test$sex == 0)
idx_m <- which(pima_test$sex == 1)

auc_by_sex <- apply(pred_draws, 1, function(p) {
  auc_f <- tryCatch(
    as.numeric(auc(roc(pima_test$diabetes_bin[idx_f], p[idx_f], quiet = TRUE))),
    error = function(e) NA_real_
  )
  auc_m <- tryCatch(
    as.numeric(auc(roc(pima_test$diabetes_bin[idx_m], p[idx_m], quiet = TRUE))),
    error = function(e) NA_real_
  )
  c(AUC_female = auc_f, AUC_male = auc_m, AUC_diff = auc_m - auc_f)
}) %>%
  t() %>%
  as.data.frame()

auc_by_sex %>%
  select(AUC_female, AUC_male) %>%
  pivot_longer(everything(), names_to = "Group", values_to = "AUC") %>%
  mutate(Group = recode(Group, AUC_female = "Female", AUC_male = "Male")) %>%
  ggplot(aes(x = AUC, fill = Group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Posterior AUC by sex",
       x = "AUC", y = "Density", fill = NULL) +
  theme_bw()

cat("AUC difference (male - female):\n")
AUC difference (male - female):
cat("Mean:", round(mean(auc_by_sex$AUC_diff, na.rm=TRUE), 3), "\n")
Mean: 0.116 
cat("95% CrI: [", round(quantile(auc_by_sex$AUC_diff, 0.025, na.rm=TRUE), 3),
    ",", round(quantile(auc_by_sex$AUC_diff, 0.975, na.rm=TRUE), 3), "]\n")
95% CrI: [ 0.065 , 0.164 ]
Key takeaway from fairness analysis

The Bayesian approach provides full posterior distributions over each fairness metric. Rather than a single point estimate of, say, the TPR gap, we can ask: “What is the probability that the true TPR gap exceeds 0.10?” This is directly answerable from the posterior and is far more informative for decision-making than a binary hypothesis test.


References

Bastos, Mayara Lisboa, Gamuchirai Tavaziva, Syed Kunal Abidi, Jonathon R Campbell, Louis-Patrick Haraoui, James C Johnston, Zhiyi Lan, et al. 2020. “Diagnostic Accuracy of Serological Tests for Covid-19: Systematic Review and Meta-Analysis.” Bmj 370.
Piironen, Juho, Markus Paasiniemi, Alejandro Catalina, Frank Weber, Osvaldo Martin, and Aki Vehtari. 2025. projpred: Projection Predictive Feature Selection.” https://mc-stan.org/projpred/.